AMD Developer Central
AMD Developer Forums
Topic Title: How to optimize the kernel with Brook+
Created On: 11/05/2009 04:03 AM
 11/05/2009 04:03 AM
licoah

Posts: 22
Joined: 01/13/2009

I have optimized this kernel, but the performance is still not very good.

Are there any special tricks in Brook+ that I have not used in this kernel?

kernel void
kernel_brook1(int nCha, int pSize, int gSize, int AF, int nPatLin, int nCol, int nLines, int halbpSize,  int firstToSkip, int oWidth, int iWidth, int wWidth, int nChapSize, int SkipLines, float2 dataIn[][], float2 WsI[][], out float2 dataOut<>) {

    float2 res = float2(0.0f,0.0f);
    int2 pos = instance().xy;
    float2 w1,w2,w3,w4,x1,x2,x3,x4;
    int Y = pos.y / 4;
    int X = pos.y%4*oWidth + pos.x;//(pos.y - Y * 4)*oWidth + pos.x;
    int cntG = Y / gSize;
    int cntAF = Y - gSize * cntG;
    int cntCha = X / nCol;
    int cntP = X%nCol; //X - cntCha*nCol;
    int dataN = nChapSize; // number of source samples
    int Widx, Inputidx;
    int k = 0;

    //compute start index in weights matrix
    Widx = nChapSize *gSize * cntCha + nChapSize * cntAF;//vvvvv*******

    //compute start index in input matrix
    if(cntG >= firstToSkip)cntG = cntG + SkipLines;
    Inputidx = nCha * (cntG - halbpSize + 1);


    //scalar product
    while(k < dataN){
        w1 = WsI[cntP][Widx];
        Widx += 1;
        w2 = WsI[cntP][Widx];
        Widx += 1;
        w3 = WsI[cntP][Widx];
        Widx += 1;
        w4 = WsI[cntP][Widx];
        Widx += 1;
        x1 = dataIn[cntP][Inputidx];
        Inputidx += 1;
        x2 = dataIn[cntP][Inputidx];
        Inputidx += 1;
        x3 = dataIn[cntP][Inputidx];
        Inputidx += 1;
        x4 = dataIn[cntP][Inputidx];
        Inputidx += 1;
        res.y += w1.y * x1.x + w1.x * x1.y + w2.y * x2.x + w2.x * x2.y + w3.y * x3.x + w3.x * x3.y + w4.y * x4.x + w4.x * x4.y;
        res.x += w1.x * x1.x - w1.y * x1.y + w2.x * x2.x - w2.y * x2.y + w3.x * x3.x - w3.y * x3.y + w4.x * x4.x - w4.y * x4.y;
        k += 4;
    }

    dataOut =  res;

}

 11/05/2009 01:57 PM
MicahVillmow

Posts: 525
Joined: 02/05/2008

First, use a float4 gather instead of a float2 gather; this reduces the number of reads you need by a factor of two.
Second, use vector math and swizzles where possible instead of a bunch of scalar math.
w1 = WsI[cntP][Widx];
Widx += 1;
w2 = WsI[cntP][Widx];
Widx += 1;
w3 = WsI[cntP][Widx];
Widx += 1;
w4 = WsI[cntP][Widx];
Widx += 1;
should be:
w1 = WsI[cntP][Widx];
w2 = WsI[cntP][Widx + 1];
w3 = WsI[cntP][Widx + 2];
w4 = WsI[cntP][Widx + 3];
Widx += 4;
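
As an editorial sketch of the first two suggestions in this post (it is not code from the thread), the plain C below emulates a float4 that packs two complex values as (re0, im0, re1, im1) and accumulates with two dot products per fetch. The vec2 and vec4 types and the names dot4, cmac_scalar and cmac_packed are hypothetical stand-ins; in a Brook+ kernel the reordered operand would presumably be written as a swizzle such as w.yxwz.

```c
#include <stddef.h>

/* Stand-ins for Brook+ float2 (re, im) and float4; names are hypothetical. */
typedef struct { float x, y; } vec2;
typedef struct { float x, y, z, w; } vec4;

static float dot4(vec4 a, vec4 b) {
    return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}

/* Reference: one complex multiply-accumulate per float2 pair,
   the same arithmetic as the original loop body. */
static vec2 cmac_scalar(const vec2 *w, const vec2 *x, size_t n) {
    vec2 res = {0.0f, 0.0f};
    for (size_t i = 0; i < n; ++i) {
        res.x += w[i].x * x[i].x - w[i].y * x[i].y;
        res.y += w[i].y * x[i].x + w[i].x * x[i].y;
    }
    return res;
}

/* Packed: each vec4 carries two complex values (re0, im0, re1, im1),
   so one fetch of w and one of x feed two multiply-accumulates. */
static vec2 cmac_packed(const vec4 *w, const vec4 *x, size_t n_pairs) {
    vec2 res = {0.0f, 0.0f};
    for (size_t i = 0; i < n_pairs; ++i) {
        vec4 wv = w[i];
        vec4 xv = x[i];
        vec4 x_neg  = { xv.x, -xv.y, xv.z, -xv.w };  /* x * (1,-1,1,-1) */
        vec4 w_swap = { wv.y,  wv.x, wv.w,  wv.z };  /* swizzle w.yxwz  */
        res.x += dot4(wv, x_neg);    /* real part of both products      */
        res.y += dot4(w_swap, xv);   /* imaginary part of both products */
    }
    return res;
}
```

Both functions return the same sum when the packed arrays hold the same complex values in the same order; whether the packed form is actually faster on a given GPU has to be measured.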

//compute start index in input matrix
if(cntG >= firstToSkip)cntG = cntG + SkipLines;

Can be written without a branch as:
cntG = cntG + (SkipLines * (int)(cntG >= firstToSkip));
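
A small C check of the branch-free rewrite above, added as an editorial sketch. It relies on the C rule that a comparison evaluates to 0 or 1; that the same cast behaves identically in Brook+ is taken from the post, not verified here. The function names skip_branch and skip_select are hypothetical.

```c
/* Branching form, as in the original kernel. */
static int skip_branch(int cntG, int firstToSkip, int SkipLines) {
    if (cntG >= firstToSkip) cntG = cntG + SkipLines;
    return cntG;
}

/* Branch-free form: the comparison yields 0 or 1 in C,
   so SkipLines is added only when cntG >= firstToSkip. */
static int skip_select(int cntG, int firstToSkip, int SkipLines) {
    return cntG + SkipLines * (int)(cntG >= firstToSkip);
}
```

For example, with firstToSkip = 5 and SkipLines = 3, both forms leave 4 unchanged and map 5 to 8.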

Finally, don't use division/modulus unless you absolutely have to.
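
One way to follow this advice for the index setup, sketched in plain C as an editorial addition. It assumes non-negative indices, and that integer shift and bitwise AND are available in your Brook+ version (not confirmed in this thread). The idx2 type and the function names are hypothetical; the subtraction form of the remainder is the one already hinted at in the comments of the original kernel.

```c
/* Pair of derived indices; the type name is hypothetical. */
typedef struct { int Y, X; } idx2;

/* Original form: one division and one modulus by the constant 4. */
static idx2 decompose_divmod(int pos_x, int pos_y, int oWidth) {
    idx2 r;
    r.Y = pos_y / 4;
    r.X = pos_y % 4 * oWidth + pos_x;
    return r;
}

/* Shift/mask form: equivalent for pos_y >= 0 because 4 is a power of two. */
static idx2 decompose_shift(int pos_x, int pos_y, int oWidth) {
    idx2 r;
    r.Y = pos_y >> 2;                    /* pos_y / 4 */
    r.X = (pos_y & 3) * oWidth + pos_x;  /* pos_y % 4 */
    return r;
}

/* Remainder from an already computed quotient (cntCha = X / nCol):
   a multiply and a subtract replace the second division. */
static int mod_from_quotient(int X, int cntCha, int nCol) {
    return X - cntCha * nCol;
}
```

The divisions by gSize and nCol remain because those values are only known at run time.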

-------------------------
Micah Villmow
Advanced Micro Devices Inc.
--------------------------------
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.

 11/05/2009 10:13 PM
eduardoschardong

Posts: 59
Joined: 07/22/2007

Hi licoah,

I played a little with your code, focusing mostly on the main loop. Besides the tips Micah already gave, I have a few more:

1) When using Brook+ in PS mode (which is whenever the kernel has no GroupSize attribute), fetches are done by sampling textures. Sampling expects floats as coordinates (in fact, a float2); if you pass an int, it has to be converted from int to float, and only the T unit does that. CS mode expects int.

2) With float2 values the compiler generates extra MOVs when the operands are not in the same register, but that is easy to solve.

3) Is it possible for you to change the data layout? Using a pair {X, Y} of float4 instead of four float2 seems fine to me.

One last thing: how slow is it? What does the input data look like? How large are the streams?

Here is a piece of my code; if everything works, it should run about twice as fast:

 

Code:
kernel void kernel_brook1(int nCha, int pSize, int gSize, int AF, int nPatLin, int nCol, int nLines, int halbpSize,  int firstToSkip, int oWidth, int iWidth, int wWidth, int nChapSize, int SkipLines,
float4 dataInX[][], float4 dataInY[][], float4 WsIX[][], float4 WsIY[][], out float2 dataOut<>)
{

    float2 res = float2(0.0f,0.0f);
    int2 pos = instance().xy;
    float4 xX,xY,wX,wY;
    int Y = pos.y / 4;
    int X = pos.y%4*oWidth + pos.x;//(pos.y - Y * 4)*oWidth + pos.x;
    int cntG = Y / gSize;
    int cntAF = Y - gSize * cntG;
    int cntCha = X / nCol;
    float cntP = X%nCol; //X - cntCha*nCol;
    float dataN = nChapSize/4; // number of source samples
    float4 k = 0;

    //compute start index in weights matrix
    k.y = nChapSize *gSize * cntCha + nChapSize * cntAF;//vvvvv*******

    //compute start index in input matrix
    if(cntG >= firstToSkip)cntG = cntG + SkipLines;
    k.z = nCha * (cntG - halbpSize + 1);
    k.w = cntP;

    while(k.x < dataN){ // k.x is the loop counter; k.w only holds the fixed index cntP
        wX = WsIX[k.wy];
        wY = WsIY[k.wy];
        xX = dataInX[k.wz];
        xY = dataInY[k.wz];

        res.y += dot(wY, xX) + dot(wX, xY);
        res.x += dot(wX, xX) - dot(wY, xY);

        k.xyz += 1;
    }

    dataOut =  res;

}
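
The kernel above assumes the host has already split each complex stream into a real-part stream (X) and an imaginary-part stream (Y). The plain C below is an editorial sketch of that repacking and of the dot-product accumulation, so the layout can be checked on the CPU first. The vec2/vec4 types and the names split_xy and cmac_split are hypothetical, and the element count is assumed to be a multiple of 4.

```c
#include <stddef.h>

/* Stand-ins for Brook+ float2 (re, im) and float4; names are hypothetical. */
typedef struct { float x, y; } vec2;
typedef struct { float x, y, z, w; } vec4;

static float dot4(vec4 a, vec4 b) {
    return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}

/* Split n interleaved complex values (n a multiple of 4) into
   n/4 vec4 of real parts and n/4 vec4 of imaginary parts. */
static void split_xy(const vec2 *in, size_t n, vec4 *outX, vec4 *outY) {
    for (size_t i = 0; i < n / 4; ++i) {
        const vec2 *p = in + 4 * i;
        outX[i].x = p[0].x; outX[i].y = p[1].x;
        outX[i].z = p[2].x; outX[i].w = p[3].x;
        outY[i].x = p[0].y; outY[i].y = p[1].y;
        outY[i].z = p[2].y; outY[i].w = p[3].y;
    }
}

/* Complex dot product over the split layout; four complex
   multiply-accumulates per iteration, as in the kernel's loop body. */
static vec2 cmac_split(const vec4 *wX, const vec4 *wY,
                       const vec4 *xX, const vec4 *xY, size_t n_quads) {
    vec2 res = {0.0f, 0.0f};
    for (size_t i = 0; i < n_quads; ++i) {
        res.y += dot4(wY[i], xX[i]) + dot4(wX[i], xY[i]);
        res.x += dot4(wX[i], xX[i]) - dot4(wY[i], xY[i]);
    }
    return res;
}
```

Because each float4 now covers four complex values, any start index computed in units of float2 elements has to be divided by 4 before it is used with the split streams.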

 11/07/2009 11:53 AM
licoah

Posts: 22
Joined: 01/13/2009

Thank you very much for your help.

I get only 17 GFLOPS. The card is an HD 4870.

float2 dataIn {1664, 256}

float2 WsI {6144, 256}

float2 dataOut {2048, 440}

I use float2 because the data are complex numbers.

 

 11/07/2009 04:27 PM
licoah

Posts: 22
Joined: 01/13/2009

I have tried your code. It works nicely.

But when I set k.y = (nChapSize * gSize * cntCha + nChapSize * cntAF)/4, the execution time increases again.

Why?

 11/08/2009 05:07 PM
eduardoschardong

Posts: 59
Joined: 07/22/2007

Try:

k.y = (nChapSize *gSize * cntCha + nChapSize * cntAF)/4.0f;

 

As a general note, there are too many integer divisions, and they are the slowest kind of operation. Here the code improved when I replaced all the integers with floats and used floor for the modulus.
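
A plain C sketch of the float-based quotient and modulus described above, added editorially. It assumes non-negative integer values below 2^24, so that every value is exactly representable in a float. The names floor_nonneg, div_floor_f and mod_floor_f are hypothetical; floor_nonneg only stands in for the kernel's floor and uses truncation, which equals floor for non-negative input.

```c
/* Stand-in for floor on non-negative input: truncation toward zero
   equals floor when v >= 0. In the kernel this would be floor(v). */
static float floor_nonneg(float v) {
    return (float)(int)v;
}

/* Integer-style quotient computed entirely in floats. */
static float div_floor_f(float a, float n) {
    return floor_nonneg(a / n);
}

/* Integer-style remainder: a - floor(a / n) * n. */
static float mod_floor_f(float a, float n) {
    return a - floor_nonneg(a / n) * n;
}
```

For example, X % nCol becomes mod_floor_f(X, nCol) with X and nCol held as floats, and the result can be used directly as a sampling coordinate.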
