mehmetb Junior Member
Posts: 2
Joined: 11/28/2006

Hi everyone,
I have a quick question... I read in the technical report that the AMD Athlon XP processor has a 36-entry floating point unit. Does that mean the number of floating point registers is 36?
To be more specific, I implemented two levels of loop unrolling:
(A) Depth = 4, which includes 20 floats (4 scalars + 16 array components)
(B) Depth = 8, which includes 36 floats (4 scalars + 32 array components)
Note: I am quite sure that there is no difference between scalars and array components for the FPU...
However, (A) works faster than (B) in most cases although there are 36 FP registers available, and I don't understand why...
===========================
Array Sizes are = 16777216
Not Unrolled, total FP number = 8, sum = 136612348428288.000000, Time = 0.850756000000
Depth = 2, total FP number = 12, sum = 141854481842176.000000, Time= 0.786128000000
(A) Depth = 4, total FP number = 20, sum = 140531497697280.000000, Time= 0.732348000000
(B) Depth = 8, total FP number = 36, sum = 140673852375040.000000, Time= 0.749005000000
Depth = 12, total FP number = 42, sum = 140719226355712.000000, Time= 0.983244000000
Depth = 16, total FP number = 68, sum = 140741732990976.000000, Time= 0.955522000000
===========================
Example code for Depth = 4:
===========================
sum = 0.; suma = 0.; sumb = 0.; sumc = 0.; sumd = 0.;
then = rtc();
for (i = 0; i < n - 3; i += 4) {
    suma += a[i+0] + a[i+1] + a[i+2] + a[i+3];
    sumb += b[i+0] + b[i+1] + b[i+2] + b[i+3];
    sumc += c[i+0] + c[i+1] + c[i+2] + c[i+3];
    sumd += d[i+0] + d[i+1] + d[i+2] + d[i+3];
}
sum = suma + sumb + sumc + sumd;
now = rtc() - then;
printf ("Depth = 4, total FP number = 20, sum = %f, Time= %14.12f\n", sum, now);
===========================
Thanks a lot in advance!!
Regards, Memo

Xtreeme Senior Member
Posts: 2705
Joined: 05/04/2006

I'm confused now too, haha. I googled it, and some cluster sites, for example, say the XP only handles 3 FP instructions per cycle. If that's relevant to your problem, I can see where the slowdown comes from (not certain, this is outside my area of expertise, so to speak).
http://72.14.205.104/search?q=...ZL1t...t=clnk&cd=1
"The issue rate of your CPUs describes how many FP instructions it can compute per cycle. A rough guide is: (CPU type: issue rate) Pentium 2: 1, Pentium 4: 2, Athlon XP: 3, Itanium 2: 4."
So 20 floats at 3 per cycle means just over 6 cycles to complete, and 36 floats at 3 per cycle means 12 cycles to complete. If I'm understanding that right, it means yes, it can handle 36, but of course it will take longer to run. (Or am I looking at this wrong?)

EduardoS Member
Posts: 133
Joined: 03/01/2006

There is some confusion here... In the x86 world we have 8 FP registers; the 36 entries refer to the scheduler, which is used to reorder instructions to increase performance. With double-precision floating point, K8 can issue one add and one multiply per clock, but each needs 4 clocks to complete, so the maximum throughput is obtained with 4 adds and 4 multiplies in flight, if all of them are independent. But with code in C/C++ you will need a decent compiler (nonexistent at the moment) to arrange the instructions in such a way; a bit of assembly here may help. Also, using all the SSE registers you can store 16 double-precision or 32 single-precision values, and twice that in 64-bit mode.
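
To illustrate what "independent" means for the adds, here is a minimal C sketch, assuming a single float array. The function name sum4 is made up for this example and is not from the code posted above. Each of the four accumulators forms its own dependency chain, so one add does not have to wait for the previous one to finish. Whether this actually runs faster depends on the compiler and the CPU, and for arrays this large memory bandwidth may dominate.

```c
#include <stddef.h>

/* Hypothetical helper: sums n floats with four independent accumulators.
   s0..s3 never depend on each other inside the loop, so the scheduler
   is free to overlap the adds instead of serializing them. */
float sum4(const float *a, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i;

    /* Main loop: four elements per iteration, one per accumulator. */
    for (i = 0; i + 3 < n; i += 4) {
        s0 += a[i + 0];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }

    /* Tail: up to three leftover elements when n is not a multiple of 4. */
    for (; i < n; i++)
        s0 += a[i];

    /* Combine the partial sums only once, after the loop. */
    return (s0 + s1) + (s2 + s3);
}
```

Note that splitting the sum this way changes the order of the additions, so the result can differ in the last digits from a plain left-to-right loop. That is normal floating point rounding, not a bug.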
