AMD Developer Forums


Topic Title: Bad performance on moving data between private memory and local memory
Created On: 11/03/2009 04:25 AM
Moving data from private memory to local memory is a very time-consuming job, isn't it? When I use local memory in the kernel, my program runs much slower than before.

Code:

    __private float4 block[4];
    __local float4 local_block[16];

    // very slow here. Why?
    local_block[local_id] = block[0];
    local_block[local_id + 1] = block[1];
    local_block[local_id + 2] = block[2];
    local_block[local_id + 3] = block[3];

    barrier(CLK_LOCAL_MEM_FENCE);
Local Data Share (LDS) supports only owner writes on R7xx-series GPUs. Local memory is emulated in global memory internally on these GPUs, so you will not get the expected performance. See this slide (note the asterisk on LDS): http://img17.imageshack.us/img17/1153/openclarchitecture.jpg
Please forgive my temporary inability to check for myself, but these older cards do report CL_GLOBAL for local memory type, right?
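For anyone who wants to check this on their own hardware, the local memory type can be queried from the host with `clGetDeviceInfo` and `CL_DEVICE_LOCAL_MEM_TYPE`. The following is a minimal sketch, assuming an OpenCL SDK is installed and that the first GPU on the first platform is the device of interest (that selection is an assumption made only for brevity). It makes no claim about what any particular card reports.

```c
#include <stdio.h>
#include <CL/cl.h>   /* on macOS the header is <OpenCL/opencl.h> */

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    cl_device_local_mem_type mem_type;
    char name[256];

    /* Assumption: take the first platform and its first GPU device. */
    if (clGetPlatformIDs(1, &platform, NULL) != CL_SUCCESS ||
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL) != CL_SUCCESS) {
        fprintf(stderr, "No OpenCL GPU device found\n");
        return 1;
    }

    if (clGetDeviceInfo(device, CL_DEVICE_NAME,
                        sizeof(name), name, NULL) != CL_SUCCESS ||
        clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_TYPE,
                        sizeof(mem_type), &mem_type, NULL) != CL_SUCCESS) {
        fprintf(stderr, "clGetDeviceInfo failed\n");
        return 1;
    }

    /* CL_LOCAL means dedicated local memory storage;
       CL_GLOBAL means local memory is backed by global memory. */
    printf("%s: local memory type is %s\n", name,
           mem_type == CL_LOCAL  ? "CL_LOCAL"  :
           mem_type == CL_GLOBAL ? "CL_GLOBAL" : "other");
    return 0;
}
```

Link against the vendor's OpenCL library (typically `-lOpenCL`) when building.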
rexiaoyu,
One thing you can try that might help with performance is to use the async_copy instead of copying manually. This performs the copy using the whole work-group in parallel.

Micah Villmow
Advanced Micro Devices Inc.

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.
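As a rough illustration of the async copy suggestion, here is a sketch using the OpenCL C built-ins `async_work_group_copy` and `wait_group_events`. Note that these built-ins copy between global and local memory only, so this applies when the data being staged originates in a global buffer rather than in a `__private` array as in the original question. The kernel name `stage_tile`, the arguments `src` and `dst`, and the `TILE` constant are hypothetical, and whether this is actually faster on hardware that emulates local memory would need to be measured.

```c
// OpenCL C kernel sketch. TILE, stage_tile, src and dst are
// hypothetical names chosen for illustration only.
#define TILE 16

__kernel void stage_tile(__global const float4 *src,
                         __global float4 *dst)
{
    __local float4 local_block[TILE];

    // Each work-group stages its own TILE-element slice of src.
    const size_t base = get_group_id(0) * TILE;

    // Every work-item in the group must reach this call with the
    // same arguments; the copy is performed by the group as a whole.
    event_t e = async_work_group_copy(local_block, src + base, TILE, 0);

    // Block until the copy identified by e has completed.
    wait_group_events(1, &e);

    // Example use of the staged data: each work-item processes a
    // strided subset of the tile and writes the result back.
    for (size_t i = get_local_id(0); i < TILE; i += get_local_size(0)) {
        dst[base + i] = local_block[i] * 2.0f;
    }
}
```

The sketch assumes `src` and `dst` each hold at least `TILE` elements per work-group launched.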