AMD Logo AMD Developer Central
AMD Developer Forums
Decrease font size
Increase font size
Topic Title: accessing LDS memory problem
Topic Summary:
Created On: 05/17/2009 03:40 PM
Status: Post and Reply
Linear : Threading : Single : Branch
Search Topic Search Topic
Topic Tools Topic Tools
View similar topics View similar topics
View topic in raw text format. Print this topic.
 05/17/2009 03:40 PM
User is offline View Users Profile Print this message

Author Icon
chrisanonym

Posts: 6
Joined: 05/17/2009

Hello,

I 'm trying to run the following simple kernel code:

Attribute[GroupSize(64)]
kernel void sum(int4 a<>, int4 b<>, out int4 c[]){
    shared int4 lds[64];

    int index = instance().x;
    lds[index] = instance();
    c[index] = lds[index];
}

 

while I 'm using the following code for the main function:

 

#include <iostream>
#include "brookgenfiles/kernels.h"

using namespace std;

int main(void){
    int4 cpu_a[64];
    int4 cpu_b[64];
    int4 cpu_c[64];
    unsigned int streamSize[] = {64};
    unsigned int rank = 1;

    brook::Stream<int4> gpu_a(rank, streamSize);
    brook::Stream<int4> gpu_b(rank, streamSize);
    brook::Stream<int4> gpu_c(rank, streamSize);

    gpu_a.read(cpu_a);
    gpu_b.read(cpu_b);

    sum(gpu_a, gpu_b, gpu_c);
    if(gpu_c.error()){
        cout << gpu_c.errorLog();
        return 1;
    }
    gpu_c.write(cpu_c);
    for(int i=0;i<64;i++)
        cout << "(" << cpu_c.x << ", " << cpu_c.y << ", " << cpu_c.z << ", " << cpu_c.w << ")\n";
    cout << "\n\n";
    return 0;
}

 

The results tha I 'm getting is the tupple (0, 0, 0, 0) for every element of stream C, which are obviously unexpected (see bold lines of the kernel code).

Am I doing something wrong?

 

 05/17/2009 04:02 PM
User is offline View Users Profile Print this message

Author Icon
gaurav.garg

Posts: 519
Joined: 06/26/2008

The existing hardware exposes LDS as a memory per thread (Brook+ has exposed it as per group for future purpose). Other threads within a group can read the LDS memory assigned to other threads but they cannot write to it.

If you convert your kernel to the following, it should work-

Attribute[GroupSize(64)]
kernel void sum(int4 a<>, int4 b<>, out int4 c[])
{
    shared int4 lds[64];

    int index = instance().x;
    lds[1 * instanceInGroup().x + 0] = instance();
    c[index] = lds[index];
}

The write instruction makes sure that you are writing to its own memory and not to others.



-------------------------
Gaurav Garg, VizExperts
 05/17/2009 04:12 PM
User is offline View Users Profile Print this message

Author Icon
chrisanonym

Posts: 6
Joined: 05/17/2009

I 'm still getting the same results!

 05/17/2009 09:24 PM
User is offline View Users Profile Print this message

Author Icon
Jawed

Posts: 31
Joined: 08/31/2008

Try

   syncGroup();

after setting LDS?

Jawed

 05/18/2009 02:51 AM
User is offline View Users Profile Print this message

Author Icon
chrisanonym

Posts: 6
Joined: 05/17/2009

Originally posted by: Jawed Try

 

   syncGroup();

 

after setting LDS?

 

Jawed

 

I think, there is no necessity to call syncGroup since each thread reads the same location that writes. Thus, there is no need for a thread to wait all other threads to execute their write instructions.

Nevertheless, I tried it but it leads to the same results.

 05/18/2009 03:01 AM
User is offline View Users Profile Print this message

Author Icon
gaurav.garg

Posts: 519
Joined: 06/26/2008

What is your system configuration and driver version?



-------------------------
Gaurav Garg, VizExperts
 05/18/2009 03:13 AM
User is offline View Users Profile Print this message

Author Icon
chrisanonym

Posts: 6
Joined: 05/17/2009

My system specs are the following:

GPU: HD4830

CPU: core i7 920

driver version: 8.6

catalyst version: 9.4

os: windows vista 32 bit



Edited: 05/18/2009 at 03:47 AM by chrisanonym
 05/21/2009 06:41 AM
User is offline View Users Profile Print this message

Author Icon
chrisanonym

Posts: 6
Joined: 05/17/2009

Thanks, for your responses!

After extended research, I can conclude that there is a bug. Specifically, it seems that brook+ is not able to handle int4 shared memory consistently.

I convert my kernel to the following:

Attribute[GroupSize(64)]
kernel void sum(float4 a<>, float4 b<>, out float4 c[]){
    shared float4 lds[64];
    int index = instance().x;
    float4 tmp;

    tmp.x = (float)instance().x;
    tmp.y = (float)instance().y;
    tmp.z = (float)instance().z;
    tmp.w = (float)instance().w;

    lds[1 * instanceInGroup().x + 0] = tmp;

    c[index] = lds[index];
}

 

and now it works!

 11/02/2009 12:00 PM
User is offline View Users Profile Print this message

Author Icon
emuller

Posts: 72
Joined: 04/21/2009

I confirm this bug... lds of uint4 type does not work.  Please see attached pybrook code. 

Output:

In [3]: execfile('lds_test.py')
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8]

A pretty fundamental bug.  Could this be addressed please?

Casting from uint to float is not a viable work-around when I need all the bits of the uint ... for example, for the random number generator I'm working on ... or is there another work-around I'm not seeing?

 

 

Code:
import os
os.environ['BRT_ADAPTER']='1'

import kernels
import stream
import math

import numpy

def fX(dims,x):
    b = numpy.array(dims)[::-1]
    b[-1]*=x
    return tuple(b)

def f4(dims):
    return fX(dims,4)

def f2(dims):
    return fX(dims,2)


dims =(64,)

h_out = numpy.zeros(f4(dims),dtype=numpy.uint32)
h_in = numpy.ones(f4(dims),dtype=numpy.uint32)*8

d_in  = stream.Stream_uint4(dims)
d_out = stream.Stream_uint4(dims)

d_out.read(h_out)
d_in.read(h_in)


k = kernels.__lds_test_uint()

k.run(d_in,d_out)

if d_out.error()!=stream.BRerror.BR_NO_ERROR:
    print "Error:", s1.errorLog()

d_out.write(h_out)

print h_out[0::4]


k = kernels.__lds_test_float()

k.run(d_in,d_out)

if d_out.error()!=stream.BRerror.BR_NO_ERROR:
    print "Error:", s1.errorLog()

d_out.write(h_out)

print h_out[0::4]


*************** end *****************


*********** kernels ***************

Attribute[GroupSize (64)]
kernel void lds_test_uint(uint4 in_s[], out uint4 out_s[]) {

shared uint4 lds[64];
int local_id = instanceInGroup().x;


lds[1 * instanceInGroup().x + 0] = in_s[instance().x];
syncGroup();
out_s[instance().x] = lds[1 * instanceInGroup().x + 0];

}

Attribute[GroupSize (64)]
kernel void lds_test_float(uint4 in_s[], out uint4 out_s[]) {

shared float4 lds[64];
int local_id = instanceInGroup().x;


lds[1 * instanceInGroup().x + 0] = (float4)in_s[instance().x];
syncGroup();
out_s[instance().x] = (uint4)lds[1 * instanceInGroup().x + 0];

}







Edited: 11/02/2009 at 12:58 PM by emuller
 11/02/2009 05:01 PM
User is offline View Users Profile Print this message

Author Icon
emuller

Posts: 72
Joined: 04/21/2009

Note, the CPU backend, i.e.

os.environ['BRT_RUNTIME']='cpu'

appears to be working:

In [3]: execfile('lds_test.py')
[8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8]
[8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8]

 11/03/2009 03:42 AM
User is offline View Users Profile Print this message

Author Icon
emuller

Posts: 72
Joined: 04/21/2009

Looking at the IL code in the kernels_gpu.h file, one can detect and fix the problem in lds_test_uint.  There is an unecessary ftou after the lds_read_vec.  Removing the ftou results in a working lds_test_uint.

Specifically, something like these two lines (r#s likely differ):

   "lds_read_vec  r370.xyzw,r426.xy00\n"
   "ftou r427.xyzw,r370.xyzw\n"

should acctually be:

     "lds_read_vec  r427.xyzw,r426.xy00\n"

Manually making this change in the generated IL results in correct behavior for lds_test_uint. 

So if brcc could be made to not generate this extraneous ftou after the lds_read_vec, shared uint4 support would be working.

It would be much appreciated if this could be fixed in the brook+ sourceforge svn repo.

 

 11/03/2009 07:56 AM
User is offline View Users Profile Print this message

Author Icon
emuller

Posts: 72
Joined: 04/21/2009

Indeed looking in the astbuiltinfunctions.cpp, one finds

BEGIN_BUILTIN("ReadLds", ...)

...

The EffectiveDataType is hardcoded to return float4.

Changing to uint4 produces the desired result ... but then works only for uint4.

There should be a global variable, or something, which contains the type of the user defined lds type for the kernel, and EffectiveDataType should return that type.

@brook developers: Shall I make this change and submit a patch, or would you prefer to take care of it in the next few days?  I am presently blocked by the lack of this feature.

 

 

 11/03/2009 01:18 PM
User is offline View Users Profile Print this message

Author Icon
MicahVillmow

Posts: 525
Joined: 02/05/2008

emeuller,
A better approach is to generate a new intrinsic that is ReadLdsUint instead of ReadLds which is for float4. That way the change doesn't break other people's code.

-------------------------
Micah Villmow
Advanced Micro Devices Inc.
--------------------------------
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.

 11/03/2009 02:41 PM
User is offline View Users Profile Print this message

Author Icon
emuller

Posts: 72
Joined: 04/21/2009

@Micah

Presently, in the brook decl "shared Vector4Type lds[x]" the Vector4Type has no effect.  Shared mem is always treated as having type float4, but as far as I can tell, only on the lds_read_vec side.  What I propose is that ReadLds should acctually take the user specified Vector4Type into account.  This should not break any brook+ code, as anything other than a float4 for Vector4Type is producing garbage,results.

Are you concerned about it breaking hand written hlsl code?

You propose adding a ReadLdsX instead of fixing ReadLds for presently broken cases.  And then I should change the brook->hlsl code generator to use ReadLdsX, WiteLdsX based on the user defined Vector4Type?

Is hlsl really used outside of brook+?

If not, I think a type dispatch in ReadLds based on Vector4Type is cleaner than many ReadLdsX ... and all present code should continue to work, as its all float4.

If yes, is there a hlsl testsuite?  Do you guess a dispatched ReadLds would not pass all the tests?

 

 

 11/03/2009 02:47 PM
User is offline View Users Profile Print this message

Author Icon
MicahVillmow

Posts: 525
Joined: 02/05/2008

emuller,
I see. I did not know that the type was based on the type defined in the kernel. Your solution is actually better that it should be based on the Vector4Type and not hard coded to float4 or having different function names for different types. Go ahead and submit a patch if you have one.

-------------------------
Micah Villmow
Advanced Micro Devices Inc.
--------------------------------
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.

 11/03/2009 02:55 PM
User is offline View Users Profile Print this message

Author Icon
emuller

Posts: 72
Joined: 04/21/2009

@ Micah

An afterthought...

I guess your point is that in hlsl there is not a global way (like there is in brook) to declare the type of the shared memory.  So you're saying the user should choose the representation of what's written in and read from shmem by chosing the read and write routine appropriately.  Then brook should choose the correct read/write pair based on the user declared Vector4Type for sh mem.  Is that correct?

 

 11/03/2009 03:14 PM
User is offline View Users Profile Print this message

Author Icon
MicahVillmow

Posts: 525
Joined: 02/05/2008

emuller,
I'm not too familiar with brook+ or hlsl as I didn't work on those, but what should happen is that below the brook+ language level, it should specify different read/store routines that are not visible to the Brook+ developer. The brook+ compiler should generate different builtins based on the datatype of the function, i.e. ReadLds_float4, ReadLds_uint4, ReadLds_int4, etc... so that there is no conflicts in datatypes or unwanted conversions as is this case. The ReadLds function should then be mapped to one of these underlying functions transparently, and this would fix the problem. Now if doing it at the HLSL level or at some level below or above that is the correct solution, i'm sure the Brook+ developers know the location where it should occur better than myself. But there definately at some point in the compiler stack should be divergent code flows for each data type.

Another solution at the user space is to do a bitcast between float4 to uint4 instead of a conversion.

-------------------------
Micah Villmow
Advanced Micro Devices Inc.
--------------------------------
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.

 11/03/2009 04:09 PM
User is offline View Users Profile Print this message

Author Icon
emuller

Posts: 72
Joined: 04/21/2009

Originally posted by: MicahVillmow  Another solution at the user space is to do a bitcast between float4 to uint4 instead of a conversion.

How does one do a bitcast in a brook+ kernel ?

Say:

uint x = 0xffffffff;

float y;

How do I get all the bits of x into y?  0xffffffff is not properly represented in a float such that when it comes back to uint via ftou utof it is corrupted.

 

 

 

 11/04/2009 09:00 AM
User is offline View Users Profile Print this message

Author Icon
emuller

Posts: 72
Joined: 04/21/2009

OK, I'm working on a patch.  I'm trying the ReadLds Vector4Type dispatch option ... for this to work, the shared mem type needs to be defined in the hlsl file.  Presently the brook shmem type is not forwarded to hlsl.  To remedy this, I've added a LocalDataShareType(Vector4Type) attribute to the hlsl [] declaration ... and modified parser.y accordingly.  If this option is absent, the type will default to float4, so backwards compatability will be maintained.  I'm having the problem that my version of flex (even the flex_old package) (Ubuntu 9.10) seems to be much newer than those origianlly used, and the resulting code does not compile ... the shader_on shader_off defines end up too far down in the file.  How can I get them to occur higher up?

 

 11/05/2009 09:17 AM
User is offline View Users Profile Print this message

Author Icon
emuller

Posts: 72
Joined: 04/21/2009

OK, I submitted a patch to the sourceforge site.  Until it is incorporated into the trunk, the patch is available for download with instructions on how to apply it.

 

 

 

 

Statistics
6123 users are registered to the AMD Developer Forums forum.
There are currently 1 users logged in.

FuseTalk Hosting Executive Plan v3.2 - © 1999-2009 FuseTalk Inc. All rights reserved.

Contact AMD | Terms and Conditions | Forum Rules | ©2009 Advanced Micro Devices, Inc. | Privacy | Trademark information