AMD Processors
Topic Title: Floating point calculation problems
Topic Summary: Strange numbers returning on simple Float point calculations
Created On: 12/02/2010 01:27 PM
Status: Read Only
 12/02/2010 01:27 PM
tm2010
Lurker

Posts: 1
Joined: 12/02/2010

Hi all,

We have an AMD Opteron Dual Core machine running RHEL 5.5. We run lots of floating point calculations in R (CRAN), so it was weird when normally robust code which has run for months started coming up with NA results.

To isolate the issue I created a short C++ program which repeatedly counts to 49, as shown below. Occasionally it returns numbers other than 49, and sometimes nan values.

I am trying to get my hosting company to take this seriously, as it is also causing checksum errors in OpenSSH, which crash SSH sessions. My hosting company says they replaced the chip with another, and it worked fine for about a week before NA results started to creep back in, and sure enough the short test started to show nan results again. We run the same code on a Phenom II and it all works fine and dandy, which is why I doubt very much it's anything to do with the software itself.

Please find below the OS, compiler and short test program, a short bit of output showing the results, and the CPU specs from /proc/cpuinfo.

What I want to know is: why might this be happening? Could it be heat? What are the chances it's a hardware issue rather than a software issue?

Thanks

Tom

--


Linux 2.6.18-194.26.1.el5 #1 SMP Fri Oct 29 14:21:22 EDT 2010 i686 athlon i386 GNU/Linux

g++ (GCC) 4.1.2 20080704 (Red Hat 4.1.2-48)


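#include <cstdio>

int main() {
    // Pass counter. Unsigned and 64-bit so it cannot overflow on a long run.
    unsigned long long j = 0;
    float x = 0.0f;

    while (1) {
        j++;
        x = 0.0f;

        // Count to 49 by adding 1.0 forty-nine times.
        // Every intermediate value is exactly representable in a float.
        for (int i = 0; i < 49; i++) {
            x = x + 1.0f;
        }

        // On healthy hardware this branch should never be taken.
        if (x != 49.0f) {
            printf("[%llu] Should be 49 but is %f\n", j, x);
        }
    }
}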

[139929310] Should be 49 but is nan
[682853096] Should be 49 but is nan
[683049450] Should be 49 but is nan
[698344528] Should be 49 but is nan
[724900149] Should be 49 but is nan
[729791489] Should be 49 but is nan
[735051402] Should be 49 but is nan
[736124388] Should be 49 but is nan
[736503962] Should be 49 but is nan
[736671760] Should be 49 but is nan
[738606950] Should be 49 but is nan
[753355909] Should be 49 but is nan
[755946353] Should be 49 but is nan
[773043084] Should be 49 but is nan
[773043307] Should be 49 but is 38.000000
[863112398] Should be 49 but is nan
[863279735] Should be 49 but is nan
[1225527218] Should be 49 but is nan
[1586412167] Should be 49 but is nan
[1586490428] Should be 49 but is nan
[1586564685] Should be 49 but is nan
[1586625039] Should be 49 but is nan
[1586873145] Should be 49 but is nan


processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 67
model name : Dual-Core AMD Opteron(tm) Processor 1212
stepping : 3
cpu MHz : 1000.000
cache size : 1024 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
apicid : 0
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow nonstop_tsc pni cx16 lahf_lm cmp_legacy svm extapic cr8legacy ts fid vid ttp tm stc
bogomips : 2010.33

processor : 1
vendor_id : AuthenticAMD
cpu family : 15
model : 67
model name : Dual-Core AMD Opteron(tm) Processor 1212
stepping : 3
cpu MHz : 1000.000
cache size : 1024 KB
physical id : 0
siblings : 2
core id : 1
cpu cores : 2
apicid : 1
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow nonstop_tsc pni cx16 lahf_lm cmp_legacy svm extapic cr8legacy ts fid vid ttp tm stc
bogomips : 2010.33
 12/09/2010 07:33 PM
MU_Engineer
Dr. Mu

Posts: 1837
Joined: 08/26/2006

CPUs rarely cause problems in calculations all on their own, and since the problem persists despite replacing the CPU, I'd strongly suspect the problem lies elsewhere. Memory, motherboards, and power supplies are all common culprits. I'd check the RAM by running Memtest86+ for a full day. If the memory comes up all clean, then swapping out the PSU may help.

I've had a PSU problem that manifested itself as the CPU only rarely giving an error when doing some calculations with SSE FP math. Using standard x87 led to no errors. The PSU died all of a sudden, and when I replaced it, the exact same CPU could successfully do days and days' worth of calculations using SSE FP math with nary an error. I guess the old PSU had just enough of a voltage droop or ripple to cause the CPU to have very subtle errors that were not severe enough to hang the system, but enough to cause the calculations to fail.
