AMD Processors
Topic Title: Number of Floating point registers & performance
Created On: 12/01/2006 05:21 PM
Status: Read Only
 12/01/2006 05:21 PM
mehmetb
Junior Member

Posts: 2
Joined: 11/28/2006

Hi everyone,

I have a quick question. I read in a technical report that the AMD Athlon XP processor has a 36-entry floating-point unit. Does that mean that the number of floating-point registers is 36?

To be more specific, I implemented two unrolled versions of a loop:

(A) Depth = 4, which includes 20 floats (4 scalars + 16 array components)
(B) Depth = 8, which includes 36 floats (4 scalars + 32 array components)

Note: I am quite sure that the FPU makes no distinction between scalars and array components...

However, (A) runs faster than (B) in most cases, although there are supposedly 36 FP registers available, and I don't understand why...

===========================
Array Sizes are = 16777216
Not Unrolled, total FP number = 8, sum = 136612348428288.000000, Time = 0.850756000000
Depth = 2, total FP number = 12, sum = 141854481842176.000000, Time= 0.786128000000
(A) Depth = 4, total FP number = 20, sum = 140531497697280.000000, Time= 0.732348000000
(B) Depth = 8, total FP number = 36, sum = 140673852375040.000000, Time= 0.749005000000
Depth = 12, total FP number = 42, sum = 140719226355712.000000, Time= 0.983244000000
Depth = 16, total FP number = 68, sum = 140741732990976.000000, Time= 0.955522000000
===========================

Example code for Depth = 4:

===========================
sum = 0.; suma = 0.; sumb = 0.; sumc = 0.; sumd = 0.;
then = rtc();                       /* start the timer */
for (i = 0; i < n - 3; i += 4) {    /* 4 elements of each array per iteration */
    suma += a[i+0] + a[i+1] + a[i+2] + a[i+3];
    sumb += b[i+0] + b[i+1] + b[i+2] + b[i+3];
    sumc += c[i+0] + c[i+1] + c[i+2] + c[i+3];
    sumd += d[i+0] + d[i+1] + d[i+2] + d[i+3];
}
sum = suma + sumb + sumc + sumd;    /* combine the four partial sums */
now = rtc() - then;                 /* elapsed time */
printf("Depth = 4, total FP number = 20, sum = %f, Time= %14.12f\n", sum, now);
===========================
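
For anyone who wants to try this, here is a minimal self-contained sketch of the Depth = 4 case. Several things in it are assumptions, not part of my actual program: the names `rtc_seconds`, `sum_depth4` and `time_depth4` are made up for illustration, the timer is a stand-in built on the standard `clock()` call (which measures CPU time, not wall-clock time), the data type is taken to be `float`, and a cleanup loop is added for lengths that are not a multiple of 4.

```c
#include <stddef.h>
#include <time.h>

/* Assumed stand-in for rtc(): CPU time in seconds from standard clock(). */
double rtc_seconds(void) {
    return (double)clock() / CLOCKS_PER_SEC;
}

/* Depth = 4 unrolled sum over four arrays of length n.
   4 scalars + 16 array components are live per iteration. */
float sum_depth4(const float *a, const float *b,
                 const float *c, const float *d, size_t n) {
    float suma = 0.0f, sumb = 0.0f, sumc = 0.0f, sumd = 0.0f;
    size_t i = 0;

    /* Main unrolled loop: 4 elements of each array per iteration. */
    for (; i + 3 < n; i += 4) {
        suma += a[i+0] + a[i+1] + a[i+2] + a[i+3];
        sumb += b[i+0] + b[i+1] + b[i+2] + b[i+3];
        sumc += c[i+0] + c[i+1] + c[i+2] + c[i+3];
        sumd += d[i+0] + d[i+1] + d[i+2] + d[i+3];
    }

    /* Cleanup loop for the 0 to 3 leftover elements. */
    for (; i < n; i++) {
        suma += a[i];
        sumb += b[i];
        sumc += c[i];
        sumd += d[i];
    }

    return suma + sumb + sumc + sumd;
}

/* Runs the sum once, stores the result in *sum_out,
   and returns the elapsed CPU time in seconds. */
double time_depth4(const float *a, const float *b,
                   const float *c, const float *d, size_t n,
                   float *sum_out) {
    double then = rtc_seconds();
    *sum_out = sum_depth4(a, b, c, d, n);
    return rtc_seconds() - then;
}
```

Calling `time_depth4` on four arrays of 16777216 floats would correspond to the Depth = 4 line in the results above, though the timings will of course depend on the machine and compiler flags.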

Thanks a lot in advance!!

Regards,
Memo
 12/01/2006 05:35 PM
Xtreeme
Senior Member

Posts: 2705
Joined: 05/04/2006

I'm confused now too, haha. I googled it, and some cluster sites, for example, say the XP only handles 3 FP instructions per cycle. If that's relevant to your problem, I can see where the slowdown comes from (not certain; this is outside my area of expertise, so to speak).

http://72.14.205.104/search?q=...ZL1t...t=clnk&cd=1
"The issue rate of your CPUs describes how many FP instructions it can compute per cycle.
A rough guide is: (CPU type: issue rate) Pentium 2: 1, Pentium 4: 2, Athlon XP: 3, Itanium 2: 4."

So 20 floats at 3 per cycle means just over 6 cycles to complete.
36 floats at 3 per cycle means 12 cycles to complete.

If I'm understanding that right, it means yes, it can handle 36, but of course it will take longer to run. (Or am I looking at this wrong?)
 12/01/2006 07:30 PM
EduardoS
Member

Posts: 133
Joined: 03/01/2006

There is some confusion here...
In the x86 world we have 8 FP registers; the 36 entries refer to the scheduler, which is used to reorder instructions and so increase performance.
With double-precision floating point, the K8 can issue one add and one multiply per clock, but each needs 4 clocks to complete, so maximum throughput requires 4 adds and 4 multiplies in flight at once, all of them independent of one another.
But with code in C/C++ you will need a decent (nonexistent at the moment) compiler to organize the instructions that way; a bit of assembly here may help.
Also, using all 8 SSE registers (128 bits each) you can hold 16 double-precision or 32 single-precision values, and twice that in 64-bit mode, which has 16 of them.
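
To illustrate the point about independent operations, here is a rough C sketch (not tuned code, and the function names are made up for the example). With a single accumulator, every add has to wait for the previous one; with four accumulators there are four independent chains, so in principle a new add can start each clock while the earlier ones are still completing. The factor of four assumes the 4-clock latency mentioned above, and whether the compiler actually keeps the accumulators in registers and schedules them this way is not guaranteed.

```c
#include <stddef.h>

/* One accumulator: each add depends on the result of the previous add,
   so the loop is limited by the add latency. */
double sum_one_chain(const double *x, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        s += x[i];
    }
    return s;
}

/* Four accumulators: four independent dependency chains, matching
   an assumed 4-clock add latency. */
double sum_four_chains(const double *x, size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;

    for (; i + 3 < n; i += 4) {
        s0 += x[i + 0];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }

    /* Leftover elements when n is not a multiple of 4. */
    for (; i < n; i++) {
        s0 += x[i];
    }

    return (s0 + s1) + (s2 + s3);
}
```

Note that the two versions add the elements in a different order, so with real data the results can differ slightly in the last digits. That is also why the sums printed for the different unrolling depths in the first post are not identical.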