Test job: submit 24 setiathome jobs through LSF to the compute nodes.
Compute nodes hang at random (different nodes each time) and need a reset/reboot.
I submitted the jobs last night, and by this morning six nodes had hung, with no clue in the log files.
The hangs hit random nodes. I cannot get any temperature info from /proc/acpi, and the BIOS shows no temperature info either. Our datacenter cooling system is OK and it is very cold inside; I checked the datacenter temperature records, and all readings are about 20 °C.
If no jobs are running on the cluster nodes, all nodes are fine (so far, at least).
Has anyone had this problem before? Any suggestions?