AMD Developer Forums
Topic Title: How to optimize the kernel with Brook+
Created On: 11/05/2009 04:03 AM
I have optimized this kernel, but the performance is still not very good. Are there any special tricks in Brook+ that I have not used for this kernel?
kernel void
|
First, use a float4 scatter instead of a float2; this reduces the number of reads that you need by a factor of two.
Second, use vector math and swizzles when possible instead of a bunch of scalar math.
    w1 = WsI[cntP][Widx]; Widx += 1;
    w2 = WsI[cntP][Widx]; Widx += 1;
    w3 = WsI[cntP][Widx]; Widx += 1;
    w4 = WsI[cntP][Widx]; Widx += 1;
should be:
    w1 = WsI[cntP][Widx];
    w2 = WsI[cntP][Widx + 1];
    w3 = WsI[cntP][Widx + 2];
    w4 = WsI[cntP][Widx + 3];
    Widx += 4;
Likewise,
    //compute start index in input matrix
    if(cntG >= firstToSkip)cntG = cntG + SkipLines;
can be written as:
    cntG = cntG + (SkipLines * (int)(cntG >= firstToSkip));
Finally, don't use division/modulus unless you absolutely have to.
-------------------------
Micah Villmow
Advanced Micro Devices Inc.
--------------------------------
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.
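The branch-free rewrite of the `cntG` update relies on a comparison evaluating to 0 or 1. The following is a minimal plain-C sketch (not Brook+) that checks the two forms against each other; the function names and the test range are illustrative assumptions, and whether the rewrite is actually faster on a given GPU is not something this sketch measures.

```c
#include <assert.h>

/* Branching form, as in the original kernel line. */
static int skip_with_branch(int cntG, int firstToSkip, int skipLines) {
    if (cntG >= firstToSkip) cntG = cntG + skipLines;
    return cntG;
}

/* Branch-free form: the comparison yields 0 or 1, so skipLines
 * is added only when cntG >= firstToSkip. */
static int skip_branch_free(int cntG, int firstToSkip, int skipLines) {
    return cntG + skipLines * (int)(cntG >= firstToSkip);
}

/* Hypothetical helper: asserts that both forms agree for cntG in [0, limit). */
static void check_forms_agree(int limit, int firstToSkip, int skipLines) {
    for (int cntG = 0; cntG < limit; ++cntG) {
        assert(skip_with_branch(cntG, firstToSkip, skipLines) ==
               skip_branch_free(cntG, firstToSkip, skipLines));
    }
}
```

Calling `check_forms_agree(64, 10, 5)` walks values on both sides of the threshold; `skip_branch_free(12, 10, 5)` gives 17 and `skip_branch_free(3, 10, 5)` gives 3.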
|
Hi licoah, I played a little with your code, focusing mostly on the main loop. Besides the tips Micah already gave, I have a few more:
1) When using Brook+ in PS mode (which is always the case when you don't put an [Attribute(GroupSize())] on the kernel), it does its fetches by sampling textures. Sampling expects floats as parameters (in fact, a float2); if you pass an int, it has to be converted from int to float, and only the T unit does that. CS expects int.
2) Because the data is float2, MOVs will be generated if the values are not in the same register, but that is easy to solve.
3) Is it possible for you to change the data layout? To me, using a pair {X, Y} of float4 instead of 4 float2 seems OK.
One last thing: how slow is it? What does the input data look like? How large are the streams?
Here is a piece of my code; if everything works, it should perform twice as fast:
Code:
kernel void kernel_brook1(int nCha, int pSize, int gSize, int AF, int nPatLin, int nCol, int nLines,
                          int halbpSize, int firstToSkip, int oWidth, int iWidth, int wWidth,
                          int nChapSize, int SkipLines,
                          float4 dataInX[][], float4 dataInY[][], float4 WsIX[][], float4 WsIY[][],
                          out float2 dataOut<>)
{
    float2 res = float2(0.0f,0.0f);
    int2 pos = instance().xy;
    float4 xX,xY,wX,wY;
    int Y = pos.y / 4;
    int X = pos.y%4*oWidth + pos.x; //(pos.y - Y * 4)*oWidth + pos.x;
    int cntG = Y / gSize;
    int cntAF = Y - gSize * cntG;
    int cntCha = X / nCol;
    float cntP = X%nCol; //X - cntCha*nCol;
    float dataN = nChapSize/4; // number of source samples
    float4 k = 0;
    //compute start index in weights matrix
    k.y = nChapSize *gSize * cntCha + nChapSize * cntAF;//vvvvv*******
    //compute start index in input matrix
    if(cntG >= firstToSkip)cntG = cntG + SkipLines;
    k.z = nCha * (cntG - halbpSize + 1);
    k.w = cntP;
    // as posted, the loop tests k.w, which the body never changes;
    // k.x (starts at 0, incremented below) may be the intended counter
    while(k.w < dataN){
        wX = WsIX[k.wy];
        wY = WsIY[k.wy];
        xX = dataInX[k.wz];
        xY = dataInY[k.wz];
        // complex multiply-accumulate over four samples at once
        res.y += dot(wY, xX) + dot(wX, xY);
        res.x += dot(wX, xX) - dot(wY, xY);
        k.xyz += 1;
    }
    dataOut = res;
}
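The loop body above depends on the split {X, Y} layout: with the real parts of four complex samples in one float4 and the imaginary parts in another, a complex multiply-accumulate over four samples reduces to four dot products. The following plain-C sketch (not Brook+) checks that identity against the usual interleaved (re, im) form. The `float2`/`float4` structs, `dot4`, and the function names are stand-ins introduced here for illustration.

```c
#include <assert.h>

/* Stand-ins for the Brook+ vector types (assumed layout, for illustration). */
typedef struct { float x, y; } float2;
typedef struct { float x, y, z, w; } float4;

/* Four-component dot product, mirroring dot() in the kernel. */
static float dot4(float4 a, float4 b) {
    return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}

/* Reference: sum of w[i] * x[i] for four complex numbers stored as (re, im). */
static float2 cmac_interleaved(const float2 w[4], const float2 x[4]) {
    float2 r = {0.0f, 0.0f};
    for (int i = 0; i < 4; ++i) {
        r.x += w[i].x * x[i].x - w[i].y * x[i].y;  /* real part */
        r.y += w[i].y * x[i].x + w[i].x * x[i].y;  /* imaginary part */
    }
    return r;
}

/* Split layout: wX/xX hold the four real parts, wY/xY the four imaginary parts. */
static float2 cmac_split(float4 wX, float4 wY, float4 xX, float4 xY) {
    float2 r;
    r.x = dot4(wX, xX) - dot4(wY, xY);
    r.y = dot4(wY, xX) + dot4(wX, xY);
    return r;
}
```

For small integer inputs both forms give bit-identical results; for general floating-point data the summation order differs, so the two may disagree in the last bits.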
|
Thank you very much for your help. I get only 17 GFLOPS. The card is an HD4870. The streams are: float2 dataIn {1664,256}, float2 WxI {6144,256}, float2 dataOut {2048,440}. I use float2 because the data are complex numbers.
|
I have tried your code. It works nicely.
But when k.y = (nChapSize *gSize * cntCha + nChapSize * cntAF)/4, the execution time increases again. Why?
|
Try: k.y = (nChapSize *gSize * cntCha + nChapSize * cntAF)/4.0f;
As a general note, there are too many integer divisions, and they are the slowest type of operation. Here the code improved after replacing all integers with floats and using floor for mod.
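The "floor for mod" idea can be sketched as follows: for non-negative a and positive n, a mod n equals a - floor(a / n) * n, and the integer quotient is floor(a / n). This is a plain-C model (not Brook+) with hypothetical function names. Floor is modelled with a truncating cast, which matches floor for the non-negative values used here. The exact-match check assumes correctly rounded IEEE float division and operands below 2^24; a GPU that divides via an approximate reciprocal may not give the same results, which this sketch does not verify.

```c
#include <assert.h>

/* Stand-in for floor(), valid only for non-negative v (assumption of this sketch). */
static float floor_nonneg(float v) {
    return (float)(int)v;
}

/* Integer quotient a / n computed with float arithmetic. */
static float div_via_floor(float a, float n) {
    return floor_nonneg(a / n);
}

/* a mod n computed with float arithmetic: a - floor(a / n) * n. */
static float mod_via_floor(float a, float n) {
    return a - floor_nonneg(a / n) * n;
}

/* Hypothetical helper: asserts the float forms match integer / and %
 * for every a in [0, limit). */
static void check_float_index_math(int limit, int n) {
    for (int a = 0; a < limit; ++a) {
        assert((int)div_via_floor((float)a, (float)n) == a / n);
        assert((int)mod_via_floor((float)a, (float)n) == a % n);
    }
}
```

For example, `mod_via_floor(10.0f, 4.0f)` and `div_via_floor(10.0f, 4.0f)` both give 2.0f, matching `10 % 4` and `10 / 4`.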