<?xml version="1.0" ?> 
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:wfw="http://wellformedweb.org/CommentAPI/">
<channel>
  <title>AMD Developer Forums - ATI Stream</title> 
  <description></description> 
  <link>http://forums.amd.com/forum/index.cfm?forumid=9</link> 
  <generator>FuseTalk Hosting Executive Plan</generator> 

	<item>
		<title>output stream offset + preprocessor directives</title>
		<link>http://forums.amd.com/forum/messageview.cfm?catid=328&amp;threadid=122513</link> 
		<pubDate>2009-11-22T07:16:51 -05.00</pubDate> 
		<dc:creator>mpwm</dc:creator>
   	    <slash:comments>2</slash:comments> 
		<description><![CDATA[ <p>Hi everybody,</p>
<p>&nbsp;</p>
<p>I have two questions, about the preprocessor and about output streams. I have a program in C that I want to rewrite in Brook+, and I'm not sure I am doing it properly. This is the relevant part of the C code:</p>
<p>void fun (int width, int height, short* in, short* o) {<br />&nbsp; for(int y = 1; y &lt; height-1; y ++) {<br />&nbsp;&nbsp;&nbsp; for(int x = 1; x &lt; width-1; x ++) {<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; int pi = (x + y * width) * 3;<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; o[pi + 0] = 10;<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; o[pi + 1] = 20;<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; o[pi + 2] = 30;</p>
<p>... }</p>
<p>My first question: how can I express an access like this (e.g. o[pi + 2] = 10;) in Brook+? I've tried something like the following kernel:</p>
<p>kernel <br />void fund( &nbsp;int width, &nbsp;int height, float input[], out float output&lt;&gt;) {&nbsp;<br />&nbsp;int2 index = instance().xy;<br /><br /> int pi = (index.y + index.x * width) * 3;<br />&nbsp;   &nbsp;<br />&nbsp;      output+=( ( float )pi + 0)+10;<br />&nbsp;      output+=( ( float )pi + 1)+20;<br />&nbsp;      output+=( ( float )pi + 2)+30;</p>
<p>...}</p>
<p>I'm not sure whether this is correct.</p>
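<p>One way to reason about the first question, sketched in plain C rather than Brook+: with an out float output&lt;&gt; stream, each kernel instance typically produces only the element at its own position, so the scatter-style writes of the C loop can be restated as a gather in which every output element derives its value from its own flat index. The helper name value_at is hypothetical, and returning 0 for border elements assumes the output buffer starts zeroed; neither comes from the original program.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: the value the original C loop stores at flat
 * output index i, computed from i alone (gather form). */
short value_at(size_t i, int width, int height)
{
    assert(width > 0 && height > 0);

    size_t pixel = i / 3;                    /* which (x, y) pixel */
    int channel  = (int)(i % 3);             /* 0, 1 or 2 within the pixel */
    int x = (int)(pixel % (size_t)width);
    int y = (int)(pixel / (size_t)width);

    /* The loop only visits 1 <= x < width-1 and 1 <= y < height-1.
     * Assumption: untouched border elements hold 0. */
    if (x < 1 || x >= width - 1 || y < 1 || y >= height - 1)
        return 0;

    /* o[pi + 0] = 10, o[pi + 1] = 20, o[pi + 2] = 30 */
    return (short)(10 * (channel + 1));
}
```

<p>For width = 4 and height = 4, pixel (1, 1) has pi = 15, so value_at(15, 4, 4), value_at(16, 4, 4) and value_at(17, 4, 4) give 10, 20 and 30.</p>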
<p>The second question is about the preprocessor. I would like to use, for example,</p>
<p>#define FX(xo, yo) in[(y + yo)*width + (x + xo)]</p>
<p>void fun (int width, int height, short* in, short* o) {..}</p>
<p>in my code, but I keep getting the error: "Problem with call expression in kernel: callee unknown"</p>
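<p>The error text suggests that the kernel compiler saw FX(...) as a call to an unknown function, which would mean the macro was not expanded inside the kernel; that reading is an assumption. For comparison, here is how the same neighbour access looks in ordinary C with a macro that takes everything it needs as arguments and parenthesizes each one. FX_AT and neighbour_sum are hypothetical names used only for this sketch.</p>

```c
#include <assert.h>

/* Every argument is parenthesized so that expressions such as
 * FX_AT(in, w, x + 1, y, 0, 0) expand with the intended precedence. */
#define FX_AT(in, width, x, y, xo, yo) \
    ((in)[((y) + (yo)) * (width) + ((x) + (xo))])

/* Hypothetical example: sum of the four direct neighbours of (x, y). */
int neighbour_sum(const short *in, int width, int height, int x, int y)
{
    /* Same interior range as the loops in fun(). */
    assert(x >= 1 && x < width - 1 && y >= 1 && y < height - 1);

    return FX_AT(in, width, x, y, -1,  0)
         + FX_AT(in, width, x, y,  1,  0)
         + FX_AT(in, width, x, y,  0, -1)
         + FX_AT(in, width, x, y,  0,  1);
}
```

<p>On a 3x3 image holding 1..9 in row-major order, neighbour_sum at (1, 1) adds 4 + 6 + 2 + 8.</p>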
<p>Thank you in advance for your answers,</p>
<p>mpwm</p>
<p>&nbsp;</p>]]></description>
	</item>

	<item>
		<title>Brook+ on 5000</title>
		<link>http://forums.amd.com/forum/messageview.cfm?catid=328&amp;threadid=122508</link> 
		<pubDate>2009-11-22T04:47:00 -05.00</pubDate> 
		<dc:creator>riza.guntur</dc:creator>
   	    <slash:comments>2</slash:comments> 
		<description><![CDATA[ <p>Can it run on the 5000 series?</p>]]></description>
	</item>

	<item>
		<title>glut64.dll ?</title>
		<link>http://forums.amd.com/forum/messageview.cfm?catid=328&amp;threadid=122314</link> 
		<pubDate>2009-11-19T06:05:51 -05.00</pubDate> 
		<dc:creator>kryman</dc:creator>
   	    <slash:comments>5</slash:comments> 
		<description><![CDATA[ <p>I could not find the glut64.dll library in the expected location</p>
<p>\\Users\admin\Documents\ATI Stream\lib\x86_64</p>
<p>(ati-stream-sdk-v2.0-beta4-vista-win7-64.exe)</p>
<p>but some samples need it (e.g. Mandelbrot).</p>]]></description>
	</item>

	<item>
		<title>CAL 1.4 samples. bug in format?</title>
		<link>http://forums.amd.com/forum/messageview.cfm?catid=328&amp;threadid=122228</link> 
		<pubDate>2009-11-18T01:02:42 -05.00</pubDate> 
		<dc:creator>CaptainN</dc:creator>
   	    <slash:comments>1</slash:comments> 
		<description><![CDATA[ <p>calResAlloc{Remote|Local}2D, as well as the 1D call, requires height, width and data format among other parameters, which is in line with expectations.</p>
<p>However, in all the SDK samples, when CAL_FORMAT_FLOAT4 is set as the data type, the width parameter is not divided by 4. Is this just a relaxed restriction in the samples (since the data type is controlled via command-line arguments), and in practice, if CAL_FORMAT_XXX4 is set, must the width be divided by 4 to save memory?</p>]]></description>
	</item>

	<item>
		<title>Vector Dot Product</title>
		<link>http://forums.amd.com/forum/messageview.cfm?catid=328&amp;threadid=122148</link> 
		<pubDate>2009-11-16T23:06:11 -05.00</pubDate> 
		<dc:creator>dinaharchery</dc:creator>
   	    <slash:comments>6</slash:comments> 
		<description><![CDATA[ <p>I have been trying to create a vector dot product using the GPU with Brook+ and have been getting strange results (should be 4 but I get 7.012).&nbsp; This dot-product is being performed on a vector with itself. Can anyone tell me why this is?</p>
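<p>A plain C reference on the host can help separate a kernel bug from a data problem when checking the GPU result. This is a minimal sketch, not the poster's code; the function name dot is an assumption. For a vector of four ones dotted with itself it returns 4.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Host-side reference: sum of element-wise products of a and b. */
float dot(const float *a, const float *b, size_t n)
{
    assert(a != NULL && b != NULL);

    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}
```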
<p>The y[] consists of {1.0f, 1.0f, 1.0f, 1.0f}</p>]]></description>
	</item>

	<item>
		<title>About HD5970 and 4 boards on one motherboard</title>
		<link>http://forums.amd.com/forum/messageview.cfm?catid=328&amp;threadid=122135</link> 
		<pubDate>2009-11-16T17:44:16 -05.00</pubDate> 
		<dc:creator>riza.guntur</dc:creator>
   	    <slash:comments>10</slash:comments> 
		<description><![CDATA[ <p>I've read a bit about the 5970. Will it support 4 boards on one motherboard? I mean 8 GPUs on one motherboard: will they be detected properly by the CAL runtime this time?</p>]]></description>
	</item>

	<item>
		<title>Deadlock? (hang) when reading from pinned memory</title>
		<link>http://forums.amd.com/forum/messageview.cfm?catid=328&amp;threadid=122097</link> 
		<pubDate>2009-11-16T02:01:27 -05.00</pubDate> 
		<dc:creator>frankas</dc:creator>
   	    <slash:comments>19</slash:comments> 
		<description><![CDATA[ <p>I am trying to improve performance on a currently working stream application, by moving to pinned memory streams. But after a short while my thread that handles Brook calls hangs forever in a mutex lock like this:</p>
<p>Thread 2 (Thread 0xb7ab6b90 (LWP 21941)):<br />#0&nbsp; 0xb7f4642e in __kernel_vsyscall ()<br />#1&nbsp; 0xb7f22cf9 in __lll_lock_wait () from /lib/tls/i686/cmov/libpthread.so.0<br />#2&nbsp; 0xb7f1e129 in _L_lock_89 () from /lib/tls/i686/cmov/libpthread.so.0<br />#3&nbsp; 0xb7f1da32 in pthread_mutex_lock () from /lib/tls/i686/cmov/libpthread.so.0<br />#4&nbsp; 0xb7268d2b in brook::ThreadLock::lock () from /usr/lib/libbrook.so<br />#5&nbsp; 0xb72a80c6 in CALBuffer::initializePinnedBuffer () from /usr/lib/libbrook_cal.so<br />#6&nbsp; 0xb729ac64 in CALBufferMgr::_createPinnedBuffer () from /usr/lib/libbrook_cal.so<br />#7&nbsp; 0xb729bf07 in CALBufferMgr::setBufferData () from /usr/lib/libbrook_cal.so<br />#8&nbsp; 0xb725a093 in StreamImpl::read () from /usr/lib/libbrook.so<br />#9&nbsp; 0xb7c0b20c in brook::StreamData::read () from /usr/lib/libbrook_d.so<br />#10 0xb7c5dce9 in brook::Stream&lt;uint4&gt;::read (this=0x9e43960, ptr=0x9e54900, flags=0xb7c71c99 "nocopy")<br />&nbsp;&nbsp;&nbsp; at /usr/local/atibrook/sdk/include/brook/StreamDef.h:160<br />#11 0xb7c5b49c in A5Slice::tick (this=0x9b223c8) at A5Slice.cpp:366<br />#12 0xb7c4b5c2 in BrookA5::process (this=0x9b25870) at A5Brook.cpp:139<br />#13 0xb7c4b637 in BrookA5::thread_stub (arg=0x9b25870) at A5Brook.cpp:52<br />#14 0xb7f1c4ff in start_thread () from /lib/tls/i686/cmov/libpthread.so.0<br />#15 0xb7e5249e in clone () from /lib/tls/i686/cmov/libc.so.6</p>
<p>When this first happened I had issued 18 async read calls. I tried serializing the read operations with isSync calls, but the result is the same. It also does not appear to be a general race condition, as the hang occurs after exactly the same number of kernel invocations.</p>
<p>Since this behaviour is highly reproducible, I managed to set a breakpoint in pthread_mutex_lock just prior to the read call that I know will fail (to see who else takes the lock). However, what I observe is a large number of buffer destructors being called, like this:</p>
<p>#11 0xb7303da7 in calResFree () from /usr/lib/libaticalrt.so<br />#12 0xb7344c01 in CALBuffer::~CALBuffer () from /usr/lib/libbrook_cal.so<br />#13 0xb7337c21 in CALBufferMgr::_createPinnedBuffer () from /usr/lib/libbrook_cal.so<br />#14 0xb7338f07 in CALBufferMgr::setBufferData () from /usr/lib/libbrook_cal.so<br />#15 0xb7c97093 in StreamImpl::read () from /usr/lib/libbrook.so<br />#16 0xb7ca820c in brook::StreamData::read () from /usr/lib/libbrook.so<br />#17 0xb7cfa929 in brook::Stream&lt;uint4&gt;::read (this=0x9349368, ptr=0x935a300, flags=0xb7d0e8d8 "nocopy")<br />&nbsp;&nbsp;&nbsp; at /usr/local/atibrook/sdk/include/brook/StreamDef.h:160<br />#18 0xb7cf819e in A5Slice::tick (this=0x90283c8) at A5Slice.cpp:369</p>
<p>This seems to indicate that the pinned buffers accumulate in GFX memory and are only occasionally flushed. When this flushing occurs, someone forgets to release the mutex, and the next create call hangs indefinitely.</p>
<p>Where can I find the libbrook sources? I tried installing 1.4.1, but it fails on Ubuntu (one of the legacy samples has a dependency on an old libpthread), and the shared library is the same as the one in 1.4.0 (checked md5 sum).</p>
<p>Frank</p>
<p>&nbsp;</p>
<p>&nbsp;</p>]]></description>
	</item>

	<item>
		<title>CAL_RESALLOC_GLOBAL_BUFFER and CAL_RESALLOC_CACHABLE</title>
		<link>http://forums.amd.com/forum/messageview.cfm?catid=328&amp;threadid=121986</link> 
		<pubDate>2009-11-13T10:41:45 -05.00</pubDate> 
		<dc:creator>CaptainN</dc:creator>
   	    <slash:comments>1</slash:comments> 
		<description><![CDATA[ <p>1. If these two flags are ORed together for calResAllocRemote2D, it means use a non-tiled surface&nbsp;and allocate CACHABLE system memory. Does using the CACHABLE flag make sense when calResAllocLocal2D is called?</p>
<p>2. When does CAL_RESALLOC_GLOBAL_BUFFER make sense with calResAllocRemote2D, given that remote (system) memory is normally not tiled?</p>
<p>3. Is using CAL_RESALLOC_GLOBAL_BUFFER a requirement for CS output buffers?</p>]]></description>
	</item>

	<item>
		<title>Error with some GPUs</title>
		<link>http://forums.amd.com/forum/messageview.cfm?catid=328&amp;threadid=121977</link> 
		<pubDate>2009-11-13T04:12:16 -05.00</pubDate> 
		<dc:creator>franz_r</dc:creator>
   	    <slash:comments>2</slash:comments> 
		<description><![CDATA[ <p>Hi all,</p>
<p>We've got a problem with some GPU types (e.g. ATI HD 4800). We use the Stream Kernel Analyzer to translate C code to GPU assembler code. To track down the error we created a small kernel which reproduces it:</p>
<p>"il_ps_2_0\n"<br />"dcl_output_generic o0\n"<br />"dcl_resource_id(0)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)\n"<br />"dcl_input_position_interp(linear_noperspective) v0.xy__\n"<br />"sample_resource(0)_sampler(0) o0, v0.x0\n"<br />"endmain\n"</p>
<p>We found out that sample_resource(0)_sampler(0) returns wrong data for the ranges 1...0x7F, 0x7F80...0x807F and 0xFF7F...0xFFFF. For example, we get back 0 for 1, 0 for 0x7F, etc.</p>
<p>On some GPUs it works correctly (e.g. HD3870).</p>
<p>It seems that something is wrong on newer GPUs.</p>
<p>Any thoughts?</p>
<p>&nbsp;</p>
<p>Regards,</p>
<p>Franz</p>]]></description>
	</item>

	<item>
		<title>Can the device code access the host memory space?</title>
		<link>http://forums.amd.com/forum/messageview.cfm?catid=328&amp;threadid=121962</link> 
		<pubDate>2009-11-12T18:52:12 -05.00</pubDate> 
		<dc:creator>boricorld</dc:creator>
   	    <slash:comments>3</slash:comments> 
		<description><![CDATA[ <p>I am new to AMD Stream. Please pardon my ignorance.</p>
<p>I know that certain CUDA cards allow the device code to access the host memory, given the memory segment is pinned and mapped to the device memory space. I was wondering if AMD Stream also has a similar feature.</p>
<p>Thanks,</p>
<p>B</p>]]></description>
	</item>

	<item>
		<title>Compute Shader scheduling</title>
		<link>http://forums.amd.com/forum/messageview.cfm?catid=328&amp;threadid=121826</link> 
		<pubDate>2009-11-10T16:46:19 -05.00</pubDate> 
		<dc:creator>DTop</dc:creator>
   	    <slash:comments>10</slash:comments> 
		<description><![CDATA[ <p>I have read many threads about this topic here; probably the best one is this:</p>
<p><a href="http://forums.amd.com/devforum/messageview.cfm?catid=328&threadid=99919">http://forums.amd.com/devforum/messageview.cfm?catid=328&threadid=99919</a></p>
<p>However, a number of simple questions remain outstanding:</p>
<p>a) It seems conclusive that wavefronts execute within a Thread Group. The Thread Group size is defined by the gridBlock.width parameter of the CALprogramGrid structure, and the number of Thread Groups is the domain execution size (in pixels) divided by the Thread Group size.</p>
<p>b) If the Thread Group size is twice the number of execution units (for 7xx the number of execution units seems to be 64), and is set both in the kernel and in gridBlock.width, will the Thread Group queue 2 wavefronts on the same SIMD, staying within the same Thread Group without interruption?</p>
<p>c) If fence_ works per Group, the Group size is larger than the available execution units, and execution is split into 2 wavefronts (case b above), will the first wavefront be deferred until the second wavefront reaches the barrier, and only then continue? Or is it simply an incorrect setting to have a Group size greater than the number of execution units per SIMD?</p>
<p>d) If the wavefront size is set to &frac12; of the execution units of a SIMD, will half of the SIMD be wasted, or will another Group be started on the other half?</p>
<p>e) If more Groups are set than there are SIMDs available, will the groups be scheduled for execution one after another in some unpredictable order until all have finished?</p>
<p>f) Once a wavefront has finished executing, does the LDS content persist between wavefront runs, so that the next Thread Group finds the LDS content left by the previous wavefront and can reuse it?</p>]]></description>
	</item>

	<item>
		<title>lds_read_vec_neighborExch</title>
		<link>http://forums.amd.com/forum/messageview.cfm?catid=328&amp;threadid=121818</link> 
		<pubDate>2009-11-10T12:58:00 -05.00</pubDate> 
		<dc:creator>ionel</dc:creator>
   	    <slash:comments>1</slash:comments> 
		<description><![CDATA[ <p>For a thread with tID % 4 == 0 to get the x values from the 4 threads, the neighbor threads must also execute lds_read_vec_neighborExch.</p>
<p>&nbsp;</p>
<p>Is this true?</p>]]></description>
	</item>

	<item>
		<title>What is UAV?</title>
		<link>http://forums.amd.com/forum/messageview.cfm?catid=328&amp;threadid=121817</link> 
		<pubDate>2009-11-10T12:53:56 -05.00</pubDate> 
		<dc:creator>DTop</dc:creator>
   	    <slash:comments>3</slash:comments> 
		<description><![CDATA[ <p>Where can I find a sane description of what a UAV is and when to use UAV commands?</p>]]></description>
	</item>

	<item>
		<title>compiling brook+ with &gt;= gcc 4.4</title>
		<link>http://forums.amd.com/forum/messageview.cfm?catid=328&amp;threadid=121816</link> 
		<pubDate>2009-11-10T11:30:03 -05.00</pubDate> 
		<dc:creator>dukeleto</dc:creator>
   	    <slash:comments>6</slash:comments> 
		<description><![CDATA[ <p>Hello,</p>
<p>I'm trying to install brookplus from svn on a recent Fedora which uses gcc 4.4, and am getting loads of errors. I've sorted out the header issues (i.e. updated the #includes as described in http://gcc.gnu.org/gcc-4.3/porting_to.html), but there is a "multiple definitions of 'getvalueof" error which is bugging me.</p>
<p>Has anyone managed to compile with a very recent gcc?</p>
<p>Thanks</p>
<p>&nbsp;</p>]]></description>
	</item>

	<item>
		<title>Why don&apos;t I see any mad operations in Brook+ IL at all?</title>
		<link>http://forums.amd.com/forum/messageview.cfm?catid=328&amp;threadid=121811</link> 
		<pubDate>2009-11-10T07:43:43 -05.00</pubDate> 
		<dc:creator>riza.guntur</dc:creator>
   	    <slash:comments>14</slash:comments> 
		<description><![CDATA[ <p>The subject speaks for itself</p>
<p>Haven't seen one till now</p>]]></description>
	</item>

	<item>
		<title>Multiple IL (compute) kernels to execute.</title>
		<link>http://forums.amd.com/forum/messageview.cfm?catid=328&amp;threadid=121785</link> 
		<pubDate>2009-11-09T14:37:39 -05.00</pubDate> 
		<dc:creator>DTop</dc:creator>
   	    <slash:comments>5</slash:comments> 
		<description><![CDATA[ <p>Reposting from "Context, Compiler, Linker etc. " topic, to attract attention.</p>
<p>1. To execute multiple images without needing to reload them into CAL: if multiple contexts are created, every context will have its own image to execute. Will the context switch be noticeable if one context is started after another? Is that a better solution than loading the module every time before starting execution? (This assumes only one image can be loaded per context, so the second IL image must be reloaded before it can run.)</p>
<p>2. Can the same resource be attached to different contexts for I/O (context1 for input, context2 for output)?</p>
<p>3. What is an example of running multiple functions with calCtxRunProgramGridArray if only one image can be loaded? If multiple ILs are linked into the same image, what function names must be specified to run them one after another, given that the sample only mentions "main" as the function to execute?</p>
<p>4. Micah, in one of your posts you mentioned using an input parameter in the kernel to switch between different kernel function calls (code paths) from "main", so that all functionality is linked together. In this case, will the non-executing paths (code branches that will not be executed) still be scheduled, degrading performance?</p>
<p>&nbsp;</p>]]></description>
	</item>

	<item>
		<title>Installing the new experimental ATI Stream driver...openSUSE</title>
		<link>http://forums.amd.com/forum/messageview.cfm?catid=328&amp;threadid=121784</link> 
		<pubDate>2009-11-09T14:15:07 -05.00</pubDate> 
		<dc:creator>gsteri1</dc:creator>
   	    <slash:comments>5</slash:comments> 
		<description><![CDATA[ <p>Hi,</p>
<p>Apologies if this should have been posted under another topic. I attempted to install the experimental driver in order to run the new version of the SDK. I am running an opensuse 11.1 AMD64 box. The driver installed and I got the warning images (eg No 3D, experimental driver..). Aside from the loss of the 3D acceleration, scrolling did not work. Using the scroll bar on, say firefox, caused both cores of my cpu to approach 100%. Furthermore, none of the examples would run. Has anyone encountered anything similar?</p>
<p>Thank you,</p>
<p>-Greg</p>]]></description>
	</item>

	<item>
		<title>Can&apos;t open CAL project in VS2008</title>
		<link>http://forums.amd.com/forum/messageview.cfm?catid=328&amp;threadid=121782</link> 
		<pubDate>2009-11-09T13:19:16 -05.00</pubDate> 
		<dc:creator>riza.guntur</dc:creator>
   	    <slash:comments>4</slash:comments> 
		<description><![CDATA[ <p>When using beta4 I can't open the CAL solution file.</p>
<p>Something like an XML error occurs.</p>]]></description>
	</item>

	<item>
		<title>Fetch in compute shader</title>
		<link>http://forums.amd.com/forum/messageview.cfm?catid=328&amp;threadid=121771</link> 
		<pubDate>2009-11-09T09:14:56 -05.00</pubDate> 
		<dc:creator>ryta1203</dc:creator>
   	    <slash:comments>7</slash:comments> 
		<description><![CDATA[ <p>Ok, sorry about the "Horrible 5870 performance" but this goes to the same topic...</p>
<p>... why is the 64x1 block size performance so horrid?</p>
<p>Compute shader might be faster but you really need to know how to get perfect texture fetch to make it so.</p>
<p>Accessing naively (64x1) gives HORRIBLE performance... WAY worse than pixel shader mode. And if LDS isn't any faster... I mean how many applications out there really need LDS?</p>]]></description>
	</item>

	<item>
		<title>scatter/gather array:  how to copy to/from graphical memor</title>
		<link>http://forums.amd.com/forum/messageview.cfm?catid=328&amp;threadid=121765</link> 
		<pubDate>2009-11-09T04:26:28 -05.00</pubDate> 
		<dc:creator>snef</dc:creator>
   	    <slash:comments>2</slash:comments> 
		<description><![CDATA[ <p>If you create a stream (for input), you have to perform a <strong>read</strong> to fill it. The read copies from main (= host) memory over PCIe to graphics memory.</p>
<p>However, suppose you create a scatter/gather array, as in float4 ar[1024], and fill it in main memory. How do you create that array in graphics memory, and how do you copy it?</p>
<p>I think you can just use that array in the kernel call and Brook+ will do the copy under the hood.</p>
<p>What happens if you first use a gather/scatter array as output of kernel1 and then use that array as input for kernel2? Is a copy to main memory performed?</p>
<p>Where/how do you declare that array? I don't need it in main memory.</p>
<p>&nbsp;</p>
<p>Any help appreciated.</p>
<p>&nbsp;</p>
<p>Sven</p>
<p>&nbsp;</p>]]></description>
	</item>

	<item>
		<title>Get started with IL.</title>
		<link>http://forums.amd.com/forum/messageview.cfm?catid=328&amp;threadid=121761</link> 
		<pubDate>2009-11-09T01:46:16 -05.00</pubDate> 
		<dc:creator>codeboycjy</dc:creator>
   	    <slash:comments>5</slash:comments> 
		<description><![CDATA[ <p>Hi:<br />&nbsp;&nbsp;&nbsp;I've been using Brook+ and OpenCL for quite a while. For various reasons, I'm going to switch my work to CAL.</p>
<p>&nbsp;&nbsp; After looking at the IL code in the CAL samples, I think it is rather hard to learn compared to the high-level interfaces.</p>
<p>&nbsp;&nbsp;How should I get started? Do I have to learn the ISA too? I think it would be hard for me to understand everything about the ISA.</p>]]></description>
	</item>

	<item>
		<title>Horrible 5870 performance</title>
		<link>http://forums.amd.com/forum/messageview.cfm?catid=328&amp;threadid=121750</link> 
		<pubDate>2009-11-08T18:16:27 -05.00</pubDate> 
		<dc:creator>ryta1203</dc:creator>
   	    <slash:comments>4</slash:comments> 
		<description><![CDATA[ <p>&nbsp;</p>]]></description>
	</item>

	<item>
		<title>Stream and HD 4*** series AGP/PCI graphics.</title>
		<link>http://forums.amd.com/forum/messageview.cfm?catid=328&amp;threadid=121749</link> 
		<pubDate>2009-11-08T18:06:50 -05.00</pubDate> 
		<dc:creator>klunssmurf</dc:creator>
   	    <slash:comments>2</slash:comments> 
		<description><![CDATA[ <p>Several cards in the HD 4*** series are available in AGP and PCI (not PCI-E) versions:</p>
<p><a href="http://www.hisdigital.com/un/product2-448.shtml">http://www.hisdigital.com/un/product2-448.shtml</a></p>
<p><a href="http://www.hisdigital.com/un/product2-444.shtml">http://www.hisdigital.com/un/product2-444.shtml</a></p>
<p>For both types Stream support is claimed.</p>
<p>Is it possible to use both cards in one system and make use of the Stream Technology?</p>
<p>&nbsp;</p>
<p>&nbsp;</p>]]></description>
	</item>

	<item>
		<title>Is there a bug?</title>
		<link>http://forums.amd.com/forum/messageview.cfm?catid=328&amp;threadid=121689</link> 
		<pubDate>2009-11-07T17:06:43 -05.00</pubDate> 
		<dc:creator>ryta1203</dc:creator>
   	    <slash:comments>2</slash:comments> 
		<description><![CDATA[ <p>Running a 5870 and I can't allocate more than 20 inputs of float4 data with 1 float4 output.</p>
<p>The domain size is 1024x1024.</p>
<p><br />So that should be 4 bytes/float and 4 floats per input or output element.</p>
<p>So 24 inputs + 1 output = 25 streams, and 25*4*4 = 400 bytes in total per domain element.</p>
<p>Now, the domain size is 1024*1024, so 400*1024*1024 = 419,430,400 bytes.</p>
<p>There is 1GB on the card so what's the problem? Am I missing something here?</p>
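<p>The arithmetic above can be checked with a few lines of C. This sketch only counts raw payload bytes for float4 streams; it assumes no padding, tiling or driver overhead, which a real allocation may add. The function name footprint_bytes is hypothetical, and the stream count is a parameter because the post mentions both 20 and 24 inputs.</p>

```c
#include <assert.h>
#include <stdint.h>

/* Raw payload size of num_streams float4 streams over a width x height
 * domain. Assumption: 16 bytes per element and nothing else. */
uint64_t footprint_bytes(uint32_t width, uint32_t height, uint32_t num_streams)
{
    assert(sizeof(float) == 4);

    const uint64_t bytes_per_element = 4u * sizeof(float);  /* float4 */
    return (uint64_t)width * height * num_streams * bytes_per_element;
}
```

<p>With 24 inputs plus 1 output, footprint_bytes(1024, 1024, 25) is 419,430,400; with 20 inputs plus 1 output, footprint_bytes(1024, 1024, 21) is 352,321,536. Both are well under 1 GB.</p>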
<p>The same kernel runs fine on the 4870 with Catalyst 9.4.</p>
<p>Note that I am currently running Catalyst 9.10.</p>
<p>I will try the 4870 too and edit this if needed.</p>]]></description>
	</item>

	<item>
		<title>How to optimize the kernel with Brook+</title>
		<link>http://forums.amd.com/forum/messageview.cfm?catid=328&amp;threadid=121579</link> 
		<pubDate>2009-11-05T04:03:06 -05.00</pubDate> 
		<dc:creator>licoah</dc:creator>
   	    <slash:comments>6</slash:comments> 
		<description><![CDATA[ <p>I have optimized this kernel, but the performance is still not very good.</p>
<p>Are there any special tricks in Brook+ that I have not used for this kernel?</p>
<p>kernel void<br />kernel_brook1(int nCha, int pSize, int gSize, int AF, int nPatLin, int nCol, int nLines, int halbpSize,&nbsp; int firstToSkip, int oWidth, int iWidth, int wWidth, int nChapSize, int SkipLines,float2 dataIn[][], float2 WsI[][], out float2 dataOut&lt;&gt;){<br /><br />&nbsp;&nbsp;&nbsp; float2 res = float2(0.0f,0.0f);<br />&nbsp;&nbsp;&nbsp; int2 pos = instance().xy;<br />&nbsp;&nbsp;&nbsp; float2 w1,w2,w3,w4,x1,x2,x3,x4;<br />&nbsp;&nbsp;&nbsp; int Y = pos.y / 4;<br />&nbsp;&nbsp;&nbsp; int X = pos.y%4*oWidth + pos.x;//(pos.y - Y * 4)*oWidth + pos.x;<br />&nbsp;&nbsp;&nbsp; int cntG = Y / gSize;<br />&nbsp;&nbsp;&nbsp; int cntAF = Y - gSize * cntG;<br />&nbsp;&nbsp;&nbsp; int cntCha = X / nCol;<br />&nbsp;&nbsp;&nbsp; int cntP = X%nCol; //X - cntCha*nCol;<br />&nbsp;&nbsp;&nbsp; int dataN = nChapSize; // number of source samples<br />&nbsp;&nbsp;&nbsp; int Widx, Inputidx;<br />&nbsp;&nbsp;&nbsp; int k = 0;<br /><br />&nbsp;&nbsp;&nbsp; //compute start index in weights matrix<br />&nbsp;&nbsp;&nbsp; Widx = nChapSize *gSize * cntCha + nChapSize * cntAF;//vvvvv*******<br /><br />&nbsp;&nbsp;&nbsp; //compute start index in input matrix<br />&nbsp;&nbsp;&nbsp; if(cntG &gt;= firstToSkip)cntG = cntG + SkipLines;<br />&nbsp;&nbsp;&nbsp; Inputidx = nCha * (cntG - halbpSize + 1);<br /><br /><br />&nbsp;&nbsp;&nbsp; //scalar product<br />&nbsp;&nbsp;&nbsp; while(k &lt; dataN){<br />&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; w1 = WsI[cntP][Widx];<br />&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; Widx += 1;<br />&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; w2 = WsI[cntP][Widx];<br />&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; Widx += 1;<br />&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; w3 = WsI[cntP][Widx];<br />&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; Widx += 1;<br />&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; w4 = WsI[cntP][Widx];<br />&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; Widx += 1;<br />&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; x1 = 
dataIn[cntP][Inputidx];<br />&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; Inputidx += 1;<br />&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; x2 = dataIn[cntP][Inputidx];<br />&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; Inputidx += 1;<br />&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; x3 = dataIn[cntP][Inputidx];<br />&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; Inputidx += 1;<br />&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; x4 = dataIn[cntP][Inputidx];<br />&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; Inputidx += 1;<br />&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; res.y += w1.y * x1.x + w1.x * x1.y + w2.y * x2.x + w2.x * x2.y + w3.y * x3.x + w3.x * x3.y + w4.y * x4.x + w4.x * x4.y;<br />&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; res.x += w1.x * x1.x - w1.y * x1.y + w2.x * x2.x - w2.y * x2.y + w3.x * x3.x - w3.y * x3.y + w4.x * x4.x - w4.y * x4.y;<br />&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; k += 4;<br />&nbsp;&nbsp;&nbsp; }<br /><br />&nbsp;&nbsp;&nbsp; dataOut =&nbsp; res;<br /><br />}</p>]]></description>
	</item>

	<item>
		<title>Stream 1.4 and kernels &gt;= 2.6.31</title>
		<link>http://forums.amd.com/forum/messageview.cfm?catid=328&amp;threadid=121482</link> 
		<pubDate>2009-11-03T10:55:29 -05.00</pubDate> 
		<dc:creator>c360</dc:creator>
   	    <slash:comments>8</slash:comments> 
		<description><![CDATA[ <p>I have been running Stream 1.4 on an openSuSE 11.1 system with a 2.6.27.x kernel for some time.</p>
<p>In an attempt to test a newer kernel for hardware updates I compiled the 2.6.31.4 and 2.6.32-rc4 kernels.&nbsp; After a successful boot I tried executing the stream app I am using, and it seg faults on me.</p>
<p>I have also tested on openSuSE 11.2 with kernel 2.6.31.3-5 and get the seg fault there too.</p>
<p>Are there kernel features I am missing during the compile that Stream 1.4 needs to operate correctly?</p>
<p>&nbsp;</p>
<p>Thanks!</p>]]></description>
	</item>

	<item>
		<title>calInit(); causes a segmentation fault</title>
		<link>http://forums.amd.com/forum/messageview.cfm?catid=328&amp;threadid=121445</link> 
		<pubDate>2009-11-02T15:54:49 -05.00</pubDate> 
		<dc:creator>Sceptic</dc:creator>
   	    <slash:comments>7</slash:comments> 
		<description><![CDATA[ <p>Hi...</p>
<p>&nbsp;</p>
<p>To the devs:</p>
<p>I'm participating in a distributed computing project hosted by www.distributed.net and running the ATI stream version of the client.</p>
<p>I ran this project successfully under Ubuntu 9.04 with driver 9.10 installed.</p>
<p>I then upgraded to Ubuntu 9.10 and installed the same driver version (9.10) but the client dies horribly with a segfault.</p>
<p>I filed a bug report at distributed.net (http://bugs.distributed.net/show_bug.cgi?id=4260) about this.</p>
<p>My question to you is: What does calInit() do internally? Can you show some pseudo code about what calInit() does?</p>
<p>The best thing would be to have the source code for calInit() to look at, but the source is closed, I guess.</p>
<p>My current theory is that calInit() fails when trying to connect to a display (e.g. :0.0) because of newer libraries in Ubuntu 9.10 versus Ubuntu 9.04, but I don't know which ones, so I'm hoping you can help me with this issue.</p>
<p>&nbsp;</p>
<p>Thank you in advance.</p>
<p>Sceptic</p>]]></description>
	</item>

	<item>
		<title>CAL MultiGPU under Linux</title>
		<link>http://forums.amd.com/forum/messageview.cfm?catid=328&amp;threadid=121437</link> 
		<pubDate>2009-11-02T12:59:16 -05.00</pubDate> 
		<dc:creator>JeremyL</dc:creator>
   	    <slash:comments>3</slash:comments> 
		<description><![CDATA[ <p>Hello</p>
<p>I would like to use several HD5870 cards under RedHat Linux 5.3 for GPGPU computing.</p>
<p>I have installed ati-opencl-beta-driver-v2.0-beta4-lnx and ati-stream-sdk-v2.0-beta4-lnx64.</p>
<p>I have 4 HD5870 cards, with 1 monitor attached to each.</p>
<p>I ran aticonfig --adapter=all --initial to configure them.<br />My Xorg.conf is attached.</p>
<p>When I start my X server, the badge "AMD Testing Use only" appears in the bottom-right corner of each monitor.</p>
<p>In amdcccle, I see 4 screens.</p>
<p>The adapter list shows all four cards:</p>
<p>aticonfig --list-adapters<br />* 0. 13:00.0 ATI Radeon HD 5800 Series<br />&nbsp; 1. 0c:00.0 ATI Radeon HD 5800 Series<br />&nbsp; 2. 8d:00.0 ATI Radeon HD 5800 Series<br />&nbsp; 3. 86:00.0 ATI Radeon HD 5800 Series<br /><br />* - Default adapter</p>
<p>&nbsp;</p>
<p>But when I run FindNumDevices, only one device is found:</p>
<p>$./FindNumDevices<br />Supported CAL Runtime Version: 1.3.185<br />Found CAL Runtime Version: 1.4.467<br />Use -? for help<br />CAL initialized.<br />Finding out number of devices :-<br />Device Count = 1<br />CAL shutdown successful.<br /><br />Press enter to exit...<br /></p>
<p>lspci shows:</p>
<p>$ /sbin/lspci | grep VGA<br />0c:00.0 VGA compatible controller: ATI Technologies Inc Unknown device 6898<br />13:00.0 VGA compatible controller: ATI Technologies Inc Unknown device 6898<br />86:00.0 VGA compatible controller: ATI Technologies Inc Unknown device 6898<br />8d:00.0 VGA compatible controller: ATI Technologies Inc Unknown device 6898<br /></p>
<p>...?</p>
<p>QUESTION: how can I use several HD5870 cards for compute with CAL?</p>
<p>Thanks</p>
<p>&nbsp;</p>]]></description>
	</item>

	<item>
		<title>Radeon 4200 : Compute shaders and global buffer</title>
		<link>http://forums.amd.com/forum/messageview.cfm?catid=328&amp;threadid=121418</link> 
		<pubDate>2009-11-02T03:34:01 -05.00</pubDate> 
		<dc:creator>rahulgarg</dc:creator>
   	    <slash:comments>3</slash:comments> 
		<description><![CDATA[ <p>Are compute shaders and the global buffer supported on the integrated Radeon 4200?</p>]]></description>
	</item>

	<item>
		<title>domainSize and domainOffset for brook compute shaders</title>
		<link>http://forums.amd.com/forum/messageview.cfm?catid=328&amp;threadid=121266</link> 
		<pubDate>2009-10-30T13:33:21 -05.00</pubDate> 
		<dc:creator>emuller</dc:creator>
   	    <slash:comments>3</slash:comments> 
		<description><![CDATA[ <p>Consider the attached compute shader kernel.</p>
<p>I had to enable address virtualization for it to work at all, even for an in&lt;&gt; stream.&nbsp; Why?&nbsp; It is 1D.&nbsp; Do all compute shaders need the -r option?</p>
<p>But now the puzzling part: domainSize has an effect on the 1st dimension, but domainOffset has no effect.</p>
<p>For the 2D version of the kernel, that is, "out_s[global_id.y][global_id.x]=" with appropriate stream definitions, domainSize has no effect on the 2nd dimension, only on the first.&nbsp; Again, domainOffset has no effect.</p>
<p>The brookplus svn has no tests for domainOffset or domainSize, only an exec_domain example, and that is not for compute shaders. The feature is mentioned only once in the docs. So is it an "experimental feature", or user error?</p>
<p>&nbsp;</p>]]></description>
	</item>

	<item>
		<title>Execution time dispersion</title>
		<link>http://forums.amd.com/forum/messageview.cfm?catid=328&amp;threadid=121151</link> 
		<pubDate>2009-10-28T17:58:59 -05.00</pubDate> 
		<dc:creator>Raistmer</dc:creator>
   	    <slash:comments>1</slash:comments> 
		<description><![CDATA[ My app prints the time it spends in different kernels and GPU I/O operations.<br />Sometimes these times are >3 times bigger for one run than for another.<br />This is reflected in the total runtime too, of course.<br />All times differ, the mean and min times too.<br />A host reboot usually restores the smaller execution times, but not always.<br />Now I see the smaller execution times restored "by themselves", even without a host reboot.<br /><br />These changes do not seem to be connected with the GPU load itself. Sometimes I see small kernel execution times on a fully loaded GPU running other GPU/CPU-intensive apps, but sometimes the app shows large execution times on a completely idle host.<br /><br />What could be the reason for such dispersion?<br /><br />For example, one kernel's execution time varies from around 1e6 ticks to 3.5e6 ticks (and there is usually a big gap between these values; that is, either all kernels have a small execution time, ~the same value each time, or they all have a big execution time, again ~the same value between runs).<br /><br />Is it a memory alignment issue? What do you think?<br /><br />P.S. Another observation:<br />When the runtimes are big, disabling the explicit execution domain setting speeds up the kernel a lot (~2 times), while with small times I see no difference between the versions with and without explicit execution domain control...<br />]]></description>
	</item>

	<item>
		<title>scatter + domainSize problems</title>
		<link>http://forums.amd.com/forum/messageview.cfm?catid=328&amp;threadid=121128</link> 
		<pubDate>2009-10-28T07:22:04 -05.00</pubDate> 
		<dc:creator>hennequi</dc:creator>
   	    <slash:comments>4</slash:comments> 
		<description><![CDATA[ <p>Good morning,</p>
<p>I'm having problems with a simple scatter kernel for which I want to restrict the domainSize.</p>
<p>Here is a simple kernel which does what I want :</p>
<p>kernel void kernel_one(unsigned int m, float4 a&lt;&gt;, out float b[][]){<br />&nbsp; int k = instance().x;<br />&nbsp; unsigned int t;<br />&nbsp; for(t=0; t&lt;m; t++){<br />&nbsp;&nbsp;&nbsp; b[4*k+0][t] = a.x + (float)t;<br />&nbsp;&nbsp;&nbsp; b[4*k+1][t] = a.y + (float)t;<br /> &nbsp;&nbsp;&nbsp; b[4*k+2][t] = a.z + (float)t;<br />&nbsp;&nbsp;&nbsp; b[4*k+3][t] = a.w + (float)t;<br />&nbsp; };<br />}</p>
<p>It takes an input stream a of n/4 float4s and puts it into a big matrix of size n by m, such that column t contains ((vector a) + t).</p>
<p>Prior to calling the kernel, I always set the domainSize to n/4 (ie size of the input stream).</p>
<p>This works fine.</p>
<p>Now, if I add another dummy normal output stream float4 c&lt;&gt;, things get bad:</p>
<p>kernel void kernel_two(unsigned int m, float4 a&lt;&gt;, out float4 c&lt;&gt;, out float b[][]){<br /> &nbsp; int k = instance().x;<br /> &nbsp; unsigned int t;<br /> &nbsp; for(t=0; t&lt;m; t++){<br /> &nbsp;&nbsp;&nbsp; b[4*k+0][t] = a.x + (float)t;<br /> &nbsp;&nbsp;&nbsp; b[4*k+1][t] = a.y + (float)t;<br /> &nbsp;&nbsp;&nbsp; b[4*k+2][t] = a.z + (float)t;<br /> &nbsp;&nbsp;&nbsp; b[4*k+3][t] = a.w + (float)t;<br /> &nbsp; };<br /> &nbsp; c = a;<br /> }</p>
<p>(c does nothing but copy a)</p>
<p>now, for n = 256, m=5, the result is correct :</p>
<p>0&nbsp;&nbsp; 1&nbsp;&nbsp; 2&nbsp;&nbsp; 3&nbsp;&nbsp; 4</p>
<p>1&nbsp;&nbsp; 2&nbsp;&nbsp; 3&nbsp;&nbsp; 4&nbsp;&nbsp; 5</p>
<p>2&nbsp;&nbsp; 3&nbsp;&nbsp; 4&nbsp;&nbsp; 5&nbsp;&nbsp; 6</p>
<p>......</p>
<p>255 256 257 258 259</p>
<p>but for n = 260, m=5, it breaks down :</p>
<p>0&nbsp;&nbsp; 1&nbsp;&nbsp; 2&nbsp;&nbsp; 3&nbsp;&nbsp; 4</p>
<p>0&nbsp;&nbsp; 0 &nbsp; 0 &nbsp; 0 &nbsp; 0</p>
<p>1&nbsp;&nbsp; 2&nbsp;&nbsp; 3&nbsp;&nbsp; 4&nbsp;&nbsp; 5</p>
<p>0&nbsp;&nbsp; 0&nbsp;&nbsp; 0 &nbsp; 0 &nbsp; 0</p>
<p>.......</p>
<p>129 130 131 132 133</p>
<p>0&nbsp;&nbsp; 0 &nbsp; 0 &nbsp; 0 &nbsp; 0</p>
<p>&nbsp;</p>
<p>??</p>
<p>In fact, the first kernel works fine on the GPU, but when I switch to the CPU backend, I get a segfault, which gdb traces back to:</p>
<p>#0&nbsp; 0x00007f111651f372 in brt::CPUKernel::Map () from /opt/atibrook/sdk/lib/libbrook.so</p>
<p>So, shall I conclude that having one scatter output + one normal output stream is implicitly not allowed?</p>
<p>Thanks for the help,</p>
<p>Guillaume</p>
<p>&nbsp;</p>]]></description>
	</item>

	<item>
		<title>FIX for: brtvector.hpp compile error in gcc 4.3.3</title>
		<link>http://forums.amd.com/forum/messageview.cfm?catid=328&amp;threadid=121120</link> 
		<pubDate>2009-10-28T02:53:42 -05.00</pubDate> 
		<dc:creator>frankas</dc:creator>
   	    <slash:comments>2</slash:comments> 
		<description><![CDATA[ <pre>In file included from /usr/local/atibrook/sdk/include/brook/CPU/brt.hpp:51,<br />                 from /usr/local/atibrook/sdk/include/brook/brook.h:54,<br />                 from a5br.cpp:23:<br />/usr/local/atibrook/sdk/include/brook/CPU/brtvector.hpp:322: error: explicit template specialization cannot have a storage class<br /><br />repeated multiple times.<br /><br />http://gcc.gnu.org/gcc-4.3/porting_to.html explains:<br /><br /></pre>
<h4>Explicit template specialization cannot have a storage class</h4>
<p>Specializations of templates cannot explicitly specify a storage class, and have the same storage as the primary template. This is a change from previous behavior, based on the feedback and commentary as part of the ISO C++ Core Defect Report <a href="http://www.open-std.org/jtc1/sc22/wg21/docs/cwg_active.html#605">605</a>.</p>
<p>&nbsp;</p>
<p>Simply removing static as shown in the attached patch works great for me:</p>
<pre><br /><br /></pre>]]></description>
	</item>

	<item>
		<title>SGEMM variations</title>
		<link>http://forums.amd.com/forum/messageview.cfm?catid=328&amp;threadid=121030</link> 
		<pubDate>2009-10-26T17:47:57 -05.00</pubDate> 
		<dc:creator>claudio_albanese</dc:creator>
   	    <slash:comments>3</slash:comments> 
		<description><![CDATA[ <p>I wonder if anyone on this forum would like to help with a port project.&nbsp;</p>
<p>I recently released an open source pricing library based on GPU computing. You may find it on my homepage at www.albanese.co.uk by following the link to OPLib. The library includes a set of low-level routines written in CUDA and in C to which one can reduce most valuation and risk management tasks. In OPLib I also give an orchestration example for Monte Carlo pricing.</p>
<p>With CUDA and a 4-GPU system with Teslas 1060 I achieve a sustained performance of 340 GF/sec per card, i.e. about 1.36 TF/sec of sustained performance on a calibration task. Calibration is a very flop consuming operation as it takes about 5 petaflops per risk factor, give or take a factor two. 340 GF/sec is excellent if one considers that peak performance for matrix multiplication of large matrices on Teslas 1060 is 370 GF/sec while I have rather small matrices of size 512 and in the sustained performance benchmark I mentioned I am counting all the high level orchestration stuff and lots of glue code that are needed for a real life implementation. This makes me hope that once the crucial routines are optimized, sustained performance on one of the latest ATI cards can reach 2 TF/sec per card.&nbsp;</p>
<p>Achieving this depends on the ability to port a few routines which I released in the public domain in OPLib, namely:</p>
<p>(i) SGEMM4, a routine which operates on an array of pairs of matrices and multiplies them concurrently.</p>
<p>(ii) SGEMV3, a routine that takes as an argument a matrix and an array of vectors stored non contiguously in memory and applies the matrix to those vectors.</p>
<p>(iv) SGEMV4, a routine that batches a number of SGEMV3 calls.</p>
<p>(v) SDOT2, a routine that batches a number of calls to SDOT while storing the dot products in an array in global GPU memory.</p>
<p>(vi) SCOPY2, a routine that batches a number of calls to SCOPY.&nbsp;</p>
<p>The single precision variants of these routines are my first priority. I would also be interested in double precision variations of course, but that's of secondary importance, as this sort of algorithm is also quite robust in single precision, with errors typically well below the tolerance level.&nbsp;</p>
<p>If anyone in this forum is interested in finance applications and can optimize handwritten IL code, I would be very grateful if he would contact me with advice or ideally consider contributing to OPLib. This could be a good topic for graduate students or anyone who would like exposure to the finance sector by writing a paper that I can assure would find a broad readership.</p>
<p>Regards, Claudio</p>
<p>email: claudio@albanese.co.uk&nbsp;</p>]]></description>
	</item>

	<item>
		<title>delete please</title>
		<link>http://forums.amd.com/forum/messageview.cfm?catid=328&amp;threadid=121028</link> 
		<pubDate>2009-10-26T17:38:38 -05.00</pubDate> 
		<dc:creator>Raistmer</dc:creator>
   	    <slash:comments>1</slash:comments> 
		<description><![CDATA[ solved]]></description>
	</item>

	<item>
		<title>Install atistream under debian</title>
		<link>http://forums.amd.com/forum/messageview.cfm?catid=328&amp;threadid=121013</link> 
		<pubDate>2009-10-26T16:03:41 -05.00</pubDate> 
		<dc:creator>kama777</dc:creator>
   	    <slash:comments>3</slash:comments> 
		<description><![CDATA[ <p>I encountered some problems installing atistream-cal under debian.</p>
<p>I receive this error:</p>
<p>"rpm: please use alien to install rpm packages on Debian, if you are really sure use --force-debian switch. See README.Debian for more details."</p>
<p>But I can't find the *.rpm to install with alien...</p>
<p>Thanks</p>]]></description>
	</item>

	<item>
		<title>CAL improvements in drivers</title>
		<link>http://forums.amd.com/forum/messageview.cfm?catid=328&amp;threadid=121012</link> 
		<pubDate>2009-10-26T15:04:58 -05.00</pubDate> 
		<dc:creator>ryta1203</dc:creator>
   	    <slash:comments>4</slash:comments> 
		<description><![CDATA[ <p>I went from 9.4 to 9.10 and noticed a significant increase in my cache hits with fetches...</p>
<p>I am curious whether cache improvements for CAL have been made in the drivers, and if so, which ones. Can we start getting a list of the CAL improvements in the driver release notes?</p>]]></description>
	</item>

	<item>
		<title>Some difficulties on implementing non-matrix algorithm with parallel computing</title>
		<link>http://forums.amd.com/forum/messageview.cfm?catid=328&amp;threadid=120990</link> 
		<pubDate>2009-10-26T06:19:56 -05.00</pubDate> 
		<dc:creator>andrewchao</dc:creator>
   	    <slash:comments>1</slash:comments> 
		<description><![CDATA[ <p>Recently I have been working on some statistical problems from my mentor, and I want to try an algorithm on the GPU that counts how many times each value appears in a sample set.</p>
<p>However, the implementation seems hard to realize, because what brook+ does well is matrix-to-matrix calculation. When that is not the case, as in my situation, which involves random access to different elements of an array (used to store the number of times each value appears in the set), I don't know how to handle it. But I believe this is still within the field of parallel processing.</p>
<p>Calling for help</p>]]></description>
	</item>

	<item>
		<title>DX11 RawAndStructuredBuffers</title>
		<link>http://forums.amd.com/forum/messageview.cfm?catid=328&amp;threadid=120986</link> 
		<pubDate>2009-10-26T04:58:06 -05.00</pubDate> 
		<dc:creator>frankmaier</dc:creator>
   	    <slash:comments>1</slash:comments> 
		<description><![CDATA[ <p>Hello,</p>
<p>&nbsp;</p>
<p>I'm trying to use my Radeon HD 4670 with the DX11 SDK (Aug.). In an AMD presentation I found on the web, I read that DX10.1 cards support Raw and Structured Buffers. But when I call the CheckFeatureSupport() function, the result is ComputeShaders_Plus_RawAndStructuredBuffers_Via_Shader_4_x = false.</p>
<p>Does anybody have the same problem? I don't know what to do to get it to work.</p>
<p>I've tried the newest driver (9.10) and searched the web, without any success.</p>
<p>&nbsp;</p>
<p>Please help! Frank</p>]]></description>
	</item>

	<item>
		<title>Why so different GPR requirements ?</title>
		<link>http://forums.amd.com/forum/messageview.cfm?catid=328&amp;threadid=120959</link> 
		<pubDate>2009-10-25T16:21:10 -05.00</pubDate> 
		<dc:creator>Raistmer</dc:creator>
   	    <slash:comments>5</slash:comments> 
		<description><![CDATA[ The first of the attached kernels requires (according to SKA) only 7 registers, while the second one requires 18!<br /><br />But there is no change in the number of directly declared registers.<br />Why does the second one require so many registers?<br /><br />]]></description>
	</item>

</channel>
</rss>
