AMD Developer Blogs
October 3, 2008
Benchmarking and collecting performance data under Microsoft Windows Vista and Windows Server 2008
Are you getting odd, unexpectedly low, and/or inconsistent results when running your internal performance validation benchmarks or when collecting other performance data? If you are running Microsoft Windows Vista or Windows Server 2008, power plans other than “High performance” can often produce wrong and inconsistent data, which misrepresents the bottlenecks or hot spots in your code. In response to a number of inquiries from our software developer community, we have added some guidance on AMD Developer Central. For more information on this topic and a scriptable way to manage power settings, please see the recommendations in the Windows Zone.
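As a rough illustration of what scripting the power plan can look like (this is not the script from the Windows Zone), the sketch below builds the `powercfg` command that activates the built-in High performance plan before a benchmark run. It assumes the `SCHEME_MIN` alias, which `powercfg -aliases` lists for that plan; the helper names are made up for this example, and the command is only executed on Windows.

```python
import platform
import subprocess

# Assumption: SCHEME_MIN is the powercfg alias for the built-in
# "High performance" plan (minimum power savings).
HIGH_PERFORMANCE_ALIAS = "SCHEME_MIN"


def high_performance_command():
    """Return the powercfg invocation that activates High performance."""
    return ["powercfg", "-setactive", HIGH_PERFORMANCE_ALIAS]


def activate_high_performance():
    """Switch the active power plan; returns False on non-Windows systems."""
    if platform.system() != "Windows":
        return False
    subprocess.run(high_performance_command(), check=True)
    return True
```

Calling `activate_high_performance()` at the start of a benchmark harness is one way to keep runs comparable; restoring the previous plan afterwards is left out of this sketch.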
-------------------------
This response is provided for informational purposes only, is provided “AS IS” and does not obligate AMD to provide any of the services, technology, or programs described.
October 1, 2008
Help us choose a new design and layout for AMD Developer Central
Note: The design survey will be unavailable on Thursday, October 9, from 6:00pm to 4:00am PDT (Friday, October 10, 2008) for scheduled maintenance.
The AMD Developer Central staff is considering a facelift for the developer.amd.com Web site, and we want to make sure it meets your needs. That's why we're previewing some concepts, with an opportunity for you to vote on them. Your feedback will help us to decide on a final design and layout scheme that will be easier to use and more pleasing to the eye.
Check out the concepts below, and then take five minutes to vote on your favorites.
» Click here to take the survey
Thanks for participating!
Edited: 10/09/2008 at 02:34 PM by AMD Developer Blogs Moderator
September 17, 2008
ACML Example Programs and the Bonus RNG
The ACML examples directory contains a wealth of programs that demonstrate how to use many of the routines found in AMD’s Core Math Library (ACML). Two of these programs in particular provide a primary source of documentation on how to take advantage of the User Supplied Random Number Generator (RNG) feature found in ACML. These two programs are drandinitializeuser_c_example.c and drandinitializeuser_example.f. They are simply C and Fortran versions of the same basic program.
First, a little background on the RNG routines. ACML includes 5 base generators: the NAG Basic Generator (aka MCG59), L’Ecuyer Combined Recursive Generator, Blum-Blum-Shub, Wichmann-Hill (based on the 1982 paper), and Mersenne Twister (MT19937). There are also 26 distribution generators, each of which starts with a base sequence from any of the 5 base generators or from a user supplied generator. It should be noted that all of these generators produce pseudo-random numbers. This means that the sequences generated are designed to pass statistical tests for randomness, yet they are repeatable. This has advantages for many programs where results from one run to the next must be directly compared, and having different numbers would cause difficulties in the comparison. Using pseudo-random numbers, the program acts like it has truly random numbers, but subsequent runs will behave the same for debugging and performance comparisons, as long as the same seeds – or initial values – are used.
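To make the repeatability point concrete, here is a small sketch using Python's standard library rather than ACML. Python's `random.Random` happens to be built on the Mersenne Twister (MT19937), one of the base generators listed above; the `draw` helper and the seed values are made up for this example.

```python
import random


def draw(seed, count=5):
    """Return `count` pseudo-random floats from a generator seeded with `seed`."""
    rng = random.Random(seed)  # Mersenne Twister (MT19937) under the hood
    return [rng.random() for _ in range(count)]


run_a = draw(seed=12345)
run_b = draw(seed=12345)  # same seed: the sequence repeats exactly
run_c = draw(seed=54321)  # different seed: a different sequence
```

Here `run_a` and `run_b` are identical lists, which is what makes run-to-run comparisons possible, while `run_c` differs.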
The interface to the user supplied generators is described in the ACML User Guide. In addition to DRANDINITIALIZEUSER, two user supplied routines are described. These user supplied interfaces are documented as UINI and UGEN. These two subroutine names are provided to the DRANDINITIALIZEUSER routine. Their names are supplied by you, and can be anything you want them to be.
The example programs for DRANDINITIALIZEUSER implement an updated version of the Wichmann-Hill generator. This RNG is described in the 2005 paper by Wichmann and Hill (http://www.eurometros.org/file_download.php?file_key=247).
Looking at the source code for the example, you will find three parts. The first is the example code that calls drandinitializeuser. This is very similar to the other RNG example calls. The other two routines, WHINI and WHGEN, are the interesting parts. These are the initializer and generator for the new Wichmann-Hill generator. You can easily compare the statements in the code to the algorithm defined by the 2005 paper. Note that this generator has a period of 2^121. That is, it can generate up to 2.65 x 10^36 pseudo-random numbers before the sequence starts repeating.
State
WHINI contains the initialization code. It accepts the seed values provided to DRANDINITIALIZEUSER and fills them in as initial values of a STATE array. This array has 28 elements. The 4 seed values are filled into STATE (5:8), and correspond to the ix, iy, iz, and it variables from the paper. STATE (9:24) correspond to the 4 large primes and their dividers as specified in the paper.
Generation
When a sequence of numbers is requested through a call to one of the distributions that uses the state array supplied by DRANDINITIALIZEUSER, the WHGEN routine is called to supply that sequence. It will perform the Wichmann-Hill algorithm for each value required using the parameters stored in state. When done it will update STATE (5:8) with the last values of ix, iy, iz, and it.
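The initialize-then-generate split described above can be sketched in a few lines of Python. This is not the ACML example source: `wh_init` and `wh_generate` are hypothetical stand-ins for WHINI and WHGEN, the state is a plain 4-element list rather than the 28-element STATE array, and the multipliers and moduli are the values as I read them from the 2005 paper, so check them against the paper before relying on them.

```python
import math

# Assumption: moduli (large primes) and multipliers as read from the
# 2005 Wichmann-Hill paper; verify against the paper or the ACML example.
MODULI = (2147483579, 2147483543, 2147483423, 2147483123)
MULTIPLIERS = (11600, 47003, 23000, 33000)


def wh_init(seeds):
    """Hypothetical stand-in for WHINI: turn 4 seeds into the state (ix, iy, iz, it)."""
    if len(seeds) != 4:
        raise ValueError("expected 4 seeds (ix, iy, iz, it)")
    state = []
    for seed, modulus in zip(seeds, MODULI):
        seed %= modulus
        state.append(seed if seed != 0 else 1)  # a zero component would stay zero
    return state


def wh_generate(state, n):
    """Hypothetical stand-in for WHGEN: return n values in [0, 1), updating state in place."""
    values = []
    for _ in range(n):
        total = 0.0
        for i in range(4):
            state[i] = (MULTIPLIERS[i] * state[i]) % MODULI[i]
            total += state[i] / MODULI[i]
        values.append(total - math.floor(total))  # keep only the fractional part
    return values
```

Because `wh_generate` leaves the last ix, iy, iz, and it values in the state, a second call continues the sequence where the first one stopped, which mirrors the state update described above.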
Multiple Sequences
The WH 2005 paper spells out a method of producing an independent set of initial values for the sequence.
Starting with the value for ix (or SEED(1), which is stored in STATE(5) by WHINI), new values for ix are computed by subsequent computations of the equation:
ix := 46340 x ix mod 2147483579.
New values for iy are computed using:
iy := 22000 x iy mod 2147483543.
These new ix and iy values are then used – along with the already supplied iz and it – to form a new unique sequence of numbers.
Using this method, you can generate up to 2.3E18 unique sequences of random numbers, each with 2.6E36 numbers. That’s a lot of numbers!
An alternate method of producing more sequences would be to choose 4 new large primes and derive the corresponding multipliers.
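The two seed-update equations above translate directly into code. The sketch below is a hypothetical helper, not part of ACML: it advances ix and iy by a chosen number of steps and leaves iz and it alone. Stepping ix and iy independently is my reading of the description; the paper is the authority on the exact procedure.

```python
IX_MODULUS, IY_MODULUS = 2147483579, 2147483543
IX_STEP, IY_STEP = 46340, 22000


def new_sequence_seeds(seeds, ix_steps=1, iy_steps=0):
    """Return seeds for another sequence by stepping ix and iy.

    Applying ix := 46340 x ix mod 2147483579 k times is the same as
    multiplying by 46340^k mod 2147483579, which pow() computes directly.
    """
    ix, iy, iz, it = seeds
    ix = (ix * pow(IX_STEP, ix_steps, IX_MODULUS)) % IX_MODULUS
    iy = (iy * pow(IY_STEP, iy_steps, IY_MODULUS)) % IY_MODULUS
    return (ix, iy, iz, it)
```

For example, `new_sequence_seeds((1, 1, 3, 4), ix_steps=1, iy_steps=1)` gives `(46340, 22000, 3, 4)`, and each worker in a parallel run could be handed a different step count.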
Summary
This write-up introduces the ACML user supplied random number generator feature, and the example programs that demonstrate its use. As an extra bonus, these examples implement an upgrade of the Wichmann-Hill RNG.
September 1, 2008
IOMMU
This article is about AMD's IOMMU, coming up in future server chipsets, what it does, how it works and why it is important.
What it does
IOMMU stands for I/O Memory Management Unit, and it works very similarly to a processor's memory management unit (MMU). The main difference is that it translates memory accesses performed by devices rather than by the processor, which the MMU already handles. This address translation is based on a paging scheme. As with the MMU, it is designed to allow not only translation but protection functionality as well. Another key feature is interrupt remapping.
How it works
Device pass-through
This is the ability to directly assign a physical device to a particular guest OS. The required address space translation is handled transparently. Ideally a device's address space is the same as a guest's physical address space; however, in the virtualized case this is hard to achieve without an IOMMU.
Without an IOMMU, our experience has been that device pass-through is very fragile, slow, and works for paravirtualized OSs only. An IOMMU is designed to allow device pass-through functionality to work even with an unmodified OS. Device isolation is a key feature for increased virtualization performance, with network adapters and GPUs being the devices that benefit most, as they usually have high bandwidth requirements. As a side effect, devices with 32-bit addressing only can be passed to guests that are physically mapped above 4 GB, which allows DMA transfers for them as well.
Device isolation
An IOMMU is designed to be able to safely map a device to a particular guest without risking the integrity of other guests. A guest should not break out of its address space with rogue DMA traffic. Additionally it is designed to provide an increased amount of security in scenarios without virtualization. Particularly the OS should be able to protect itself from buggy device drivers by limiting a device's memory accesses.
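The translation and isolation ideas above can be pictured with a toy model. This is a conceptual sketch only: `ToyIommu`, its methods, the device names, and the 4 KB page size are all made up for illustration and do not reflect the actual table formats or programming interface of AMD's IOMMU.

```python
PAGE_SIZE = 4096  # assumed page size for this toy model


class ToyIommu:
    """Per-device page tables mapping device addresses to physical addresses."""

    def __init__(self):
        # device id -> {device page number: (physical page number, writable)}
        self._tables = {}

    def map_page(self, device, dev_page, phys_page, writable=False):
        self._tables.setdefault(device, {})[dev_page] = (phys_page, writable)

    def translate(self, device, dev_addr, write=False):
        """Translate a DMA address, rejecting anything the device may not touch."""
        page, offset = divmod(dev_addr, PAGE_SIZE)
        entry = self._tables.get(device, {}).get(page)
        if entry is None:
            raise PermissionError(f"{device}: no mapping for page {page:#x}")
        phys_page, writable = entry
        if write and not writable:
            raise PermissionError(f"{device}: write to read-only page {page:#x}")
        return phys_page * PAGE_SIZE + offset


iommu = ToyIommu()
# A device address below 4 GB is mapped to guest memory located above 4 GB.
iommu.map_page("nic0", dev_page=0x10, phys_page=0x140000, writable=True)
dma_target = iommu.translate("nic0", 0x10 * PAGE_SIZE + 0x80, write=True)
```

In this model `dma_target` lands at 0x140000080, above the 4 GB line, even though the device only issued a 32-bit address. A DMA request from any device without a mapping, or a write to a read-only page, raises an error instead of reaching memory, which is the isolation property in miniature.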
Remapping of interrupts
Usually, sharing device interrupts among several guests is complicated to handle. An IOMMU provides a basis to separate device interrupts that are already shared by different devices. It remaps a shared interrupt to an exclusive vector to ease its delivery to a particular guest OS.
Why is it important?
In virtualization, there are lots of tricks done to abstract the underlying hardware, but also to minimize virtualization overhead. Using Rapid Virtualization Indexing(tm) instead of shadow page tables for memory management is only one example. The biggest remaining performance gap in today's virtualization scenario is I/O. An IOMMU helps to bridge this gap and also improves the situation from a security point of view. Last, but not least, it allows hypervisors to be simpler and more robust.
Jörg Rödel & Peter Oruba
August 19, 2008
AMD China University Accelerated Computing Application Contest
AMD is currently hosting the first ever AMD China University contest in accelerated computing. Ten teams from distinguished universities will be competing to see who can code the fastest application using AMD Stream technology. The teams have been selected already and are in the midst of coding their applications.
Stay tuned to see what exciting applications they come up with and who ultimately wins! 
August 13, 2008
How to Make Sure That Benchmarks Aren't a Horror Story For You
When one of your main jobs is to optimize Java runtimes, it’s easy to lose perspective and let benchmark results take over your life. For example, the other day I had a nightmare where I was being chased by the demonic incarnations of thousands of SPECjbb2005 bops. Thankfully, due to a mix of eastern and western medicines this hasn’t happened again, although I’m thinking of selling the rights to Stephen King. This does bring up an important point: Whatever your involvement with Java benchmarks (any benchmarks, really), you do need to keep them in perspective.
AMD is a strong supporter of (and contributor to) benchmarking standards via organizations like SPEC. The goal is to come up with workloads that are representative of applications, hopefully your applications, so that you can use the results as data points in software and hardware purchasing decisions. Having said that, we also realize that benchmark results taken out of context won’t help you make those decisions.
Pat Moorhead, our V.P. of Advanced Marketing, said it well in his blog. Benchmarks like SPECjbb2005 can be configured to run in a number of different ways. Some configurations have the isolated goal of yielding the best possible performance, while others look to simulate the realities of the data center where power consumption, performance, price, and stability are all important. You’ve got to look at the context that makes the most sense for your situation, to help you interpret the results.
Essentially, benchmark results tell a story. You have to make sure that they fit your story, or there be nightmares in your future.
Ben
August 8, 2008
Sun / AMD SPECjbb2005 World Record!
Well, I know that it's been a while. The last thing you heard from me was about JavaOne. Then the team and I just disappeared into the Texas heat.
Or so it seemed.
Just in time for the Olympics, Sun and AMD are happy to announce an x86 world record for SPECjbb2005. For the details, take a look at http://www.sun.com/aboutsun/pr/2008-08/sunflash.20080807.1.xml
It's good to see the results from our on-going collaboration with Sun. The optimizations keep coming, and you reap the benefits.
August 6, 2008
Telanetix Announces Stream Computing Collaboration With AMD
Telanetix, Inc. revealed today that it is working with AMD to jointly develop telepresence-focused Stream Computing technology and announced the first results of this ongoing effort. By utilizing this new technology in its Digital Presence product line, Telanetix claims that it enables higher quality, lower cost High Definition (HD) telepresence, and that this technology is now available and shipping in every Digital Presence system, as part of the recent Telanetix 3.4.3 Digital Presence technology platform release.
For more information, please see: http://phx.corporate-ir.net/phoenix.zhtml?c=188430&p=irol-newsArticle&ID=1165202
Running AMD Stream Applications Remotely Under Linux
A common question the AMD Stream Team gets over the forums and which gets sent to streamdeveloper@amd.com is: “How can I run my Brook+/CAL program under Linux® without having to sit at the X console or log in first?”
While there are many different ways to try to do this, here is one method that one of our AEs, Marc (you know him as marcr on the forums), has found to help with this problem:
You can edit /etc/gdm/custom.conf so that the last few lines look like this:
# Note that to disable servers defined in the defaults.conf file (such as
# 0=Standard, you must put a line in this file that says 0=inactive, as
# described in the Configuration section of the GDM documentation.
#
[servers]
0=Rendering

# Also note, that if you redefine a [server-foo] section, then GDM will
# use the definition in this file, not the defaults.conf file. It is
# currently not possible to disable a [server-foo] section defined
# in the defaults.conf file.
#
[server-Rendering]
name=Rendering server
#-ac disable access control restrictions
command=/usr/bin/Xorg -br -ac -audit 0
flexible=true
Then run gdm-restart, or reboot the system. This allows running Brook+/CAL programs remotely without manually logging into the system. Since this does disable X Windows security controls, you will want to make sure you are in a secure environment. There are various ways to tweak this to suit specific needs, but that is left as an exercise for the reader… 
July 9, 2008
Mandelbrot and 16-bit fixed point multiplies ( Part II )
The original SSE2 implementation of the Mandelbrot leveraged the PMADDWD instruction to do the workload inside of the inner loop. Unfortunately, in order to get this to work, lots of code had to be inserted to shuffle data around, and this instruction leaves the packed results in a 32-bit format. This requires PACKSSDW instructions to get the data back to 16 bits before it can be used for further computation, which adds significant overhead to the inner loop calculation. The advantages that PMULHRSW provides are that there are no dependencies on how the data is ordered within the register and that it produces its results in a packed 16-bit format. After gaining an understanding of the code differences between the SSE2 and SSSE3 implementations, I believed that it was possible to gain this same advantage using SSE2, but I needed to substitute PMULHW for PMULHRSW. PMULHW writes the upper 16 bits of the 32-bit temporary result to the destination, as illustrated in Figure 3 below.
Figure 3: Bit selection of PMULHW
Using 4.12 fixed point arithmetic, PMULHW produces an 8.8 fixed point number as a result, as illustrated in Figure 4.
Figure 4: 4.12 PMULHW multiply; W=Whole, F=Fractional bit
This leaves the least significant bit of the result to represent 2^-8. Unfortunately, after modifying the SSE2 code to use this technique, the fidelity of the Mandelbrot picture started to degrade. Inside the edges of the Mandelbrot pattern, strings of long black pixels began to show, and the rest of the intricate pattern began to look noisy and dirty, with random pixel popping. It became evident that the precision of the Mandelbrot needed to retain the 2^-9 bit. I needed to add more fractional bits (precision) to the upper 16 bits of the multiply to get this technique to work.
The only way to add more fractional bits is to take away whole bits. Stepping back a bit, I took a hard look at the data. In this particular Mandelbrot benchmark, the left and right edges of the window are represented by -2.25 and .75 respectively, and -1.25 and 1.25 for the bottom and top edges respectively. If I took away a whole bit from the fixed point data, changing the 4.12 input data to 3.13, I still have enough range to represent the default Mandelbrot zoom. For signed data, 3 whole bits can represent a range from -4 to +3. If you factor in the value of fractional bits, the upper range is actually very close to positive 4. Figure 5 below illustrates how PMULHW treats 3.13 source data.
Figure 5: 3.13 PMULHW multiply; W=Whole, F=Fractional bit
As can be seen, the precision of PMULHW now goes to 2^-10. Because a whole bit was removed from each 16-bit source, the multiplied result loses two whole bits. This reduces the whole part of the signed result to 6 bits (-32 to +31), but this still goes beyond our modest needs. In addition, with 10 fractional bits, this exceeds the precision that PMULHRSW gave with 4.12 fixed point data. A testing pass verified that the fidelity of the rendered fractal with this new data format and algorithm was indeed tight and well formed. Overall, with the reduction of pack and data swizzling instructions, this resulted in about a 2.7x speedup over the original SSE2 implementation that used PMADDWD. Through my own internal measurements, this is actually slightly faster than the PMULHRSW optimized versions as well.
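The effect of moving from 4.12 to 3.13 can be checked with a scalar model of one PMULHW lane. This is a sketch in Python, not the benchmark's SSE2 code; the helper names are made up, and each function handles a single 16-bit lane rather than a packed register.

```python
def to_fixed(value, frac_bits):
    """Convert a float to a signed 16-bit fixed-point integer."""
    raw = int(round(value * (1 << frac_bits)))
    if not -32768 <= raw <= 32767:
        raise OverflowError("value does not fit in 16 bits")
    return raw


def pmulhw(a, b):
    """One lane of PMULHW: the high 16 bits of the signed 32-bit product."""
    return (a * b) >> 16  # Python's >> on ints is an arithmetic shift


def multiply(x, y, frac_bits):
    """Multiply two floats through PMULHW in an N.frac_bits format."""
    result_frac_bits = 2 * frac_bits - 16  # 8 for 4.12 input, 10 for 3.13
    product = pmulhw(to_fixed(x, frac_bits), to_fixed(y, frac_bits))
    return product / (1 << result_frac_bits)


small = 2.0 ** -5
lost = multiply(small, small, frac_bits=12)  # 4.12: 2^-10 falls below 2^-8
kept = multiply(small, small, frac_bits=13)  # 3.13: 2^-10 is representable
```

With 4.12 input the product 2^-10 is truncated to zero (`lost == 0.0`), while with 3.13 input it survives (`kept == 2.0 ** -10`), which is the extra precision the picture needed.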
I think the biggest point that I want to get across with my writing is, "Think about your data". How can you optimize its use? Not only in terms of how much you have, but as illustrated in this fixed point example, the data format as well. While it's true that the PMULHW instruction enables us to do a fast 16-bit fixed point multiply, we had to change the format of the data to make the optimization possible. If you have control over your data (in this benchmark I had, but this is not always the case), the time spent optimizing your data up front can pay back huge dividends later on with simpler, smaller, and faster code.
Kent Knox
Member of AMD Technical Staff
Kent Knox is a Member of Technical Staff in Solutions Enablement Engineering at AMD. His postings are his own opinions and may not represent AMD's positions, strategies or opinions. Links to third party sites are provided for convenience and unless explicitly stated, AMD is not responsible for the contents of such linked sites and no endorsement is implied.
Edited: 07/09/2008 at 02:54 PM by kknox
July 1, 2008
Mandelbrot and 16-bit fixed point multiplies ( Part I )
I recently had the opportunity to work on and help optimize a benchmark that uses fixed-point math to carry out an iterative calculation in a loop. The function of the benchmark is to calculate a Mandelbrot fractal in memory and report the time it takes to 'draw' the fractal as a rate. There are several different codepaths inside this benchmark, each implemented to take maximum advantage of different SIMD instruction sets, such as SSE2, SSSE3 and SSE4.1. The SSSE3 and SSE4.1 versions of this routine were approximately 2.7x faster than the SSE2 version. AMD processors support SSE2, SSE3 and SSE4a, so I wanted to investigate what could be done to optimize the SSE2 version of the function.
After I had a chance to visually inspect the various codepaths, it became obvious that the reason the SSSE3 and SSE4.1 routines had such a significant performance lead was the PMULHRSW instruction. There is not much literature available on this SSSE3 opcode, but it is defined as 'Packed Multiply High with Round and Scale' and is an instruction designed for fixed-point math. It operates on packed integer data, multiplying two packed 16-bit source words and producing a packed 16-bit destination word. The choice of which 16 bits this instruction places in the destination register is a little unusual, as illustrated by Figure 1 below. When two 16-bit values are multiplied together, the result is a 32-bit value. However, in order to get 32 bits to fit in a 16-bit result, some bits have got to go, and the bits that PMULHRSW chooses to keep are significantly different from those kept by PMULHW or PMULLW. The red squares in the figure below represent bits PMULHRSW truncates, and the green bits are written as the result of the multiply.
Figure 1: Bit selection of PMULHRSW
The 31st bit is a redundant sign bit, so it gets truncated; this is an effect of the two 16-bit sources being signed inputs. Bits 30-15 are the next 16 most significant consecutive bits, and the rest of the least significant bits are truncated. For good measure, the result is rounded by adding a one at bit 14, the most significant of the truncated bits, before the truncation takes place; this is where the 'round' comes from in the definition of the instruction. It makes sense only for fixed point numbers, as it increases the accuracy of the fractional bits. Since the most significant sign bit is truncated, the answer written to the destination register is logically left shifted by 1 (the 30th bit is now the most significant bit of the 16-bit result), which in effect multiplies the result by two; this is where the 'scale' comes from in the definition of the instruction.
This particular Mandelbrot benchmark was originally written to operate on data in a 4.12 fixed point format. For those who feel a little rusty, this Book of Hook page provides a simple review of fixed point math. The zoom of the Mandelbrot includes the real x-axis number range from -2.25 to +.75, and the imaginary y-axis number range from -1.25 to +1.25, which with signed 4.12 numbers leaves plenty of slack. The inner loop of the Mandelbrot algorithm is a sequence of mul's and add's of complex numbers. A Mandelbrot white paper describing how to calculate the Mandelbrot algorithm can be found following the link. Also, Mike Wall has an article on performance optimization in Windows in which he uses a Mandelbrot sample for his explanation; full source code is available. For the SSSE3 and SSE4.1 implementation, PMULHRSW was used to multiply these 4.12 fixed point numbers; two 4.12 numbers multiplied together create an 8.24 32-bit number, and using the bit selection technique of PMULHRSW as illustrated in Figure 2, a rounded 7.9 fixed point number is written out as the packed result. The least significant fractional bit represents 2^-9, which provides enough precision to render a faithful representation of the Mandelbrot set. Eventually, this product has to be left shifted by 3 bits to get back to the original 4.12 format to continue the iterations of packed mul's and add's.
Figure 2: 4.12 PMULHRSW multiply; W=Whole, F=Fractional bit
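For readers who prefer code to bit diagrams, the behavior described above can be modeled for a single 16-bit lane in Python. This is a scalar sketch of the instruction's arithmetic, not the benchmark's SIMD code; `mul_4_12` is a made-up helper that applies the 3-bit left shift back to 4.12 and ignores overflow.

```python
def pmulhrsw(a, b):
    """One lane of PMULHRSW on signed 16-bit inputs a and b."""
    temp = (a * b) >> 14           # drop the 14 least significant bits
    temp += 1                      # round: add one at what was bit 14
    result = (temp >> 1) & 0xFFFF  # keep bits 30..15 of the product
    # Reinterpret the 16-bit pattern as a signed value.
    return result - 0x10000 if result & 0x8000 else result


def mul_4_12(a, b):
    """Multiply two 4.12 numbers: 7.9 result shifted left by 3, back to 4.12."""
    return pmulhrsw(a, b) << 3


one_and_a_half = int(1.5 * 4096)   # 6144 in 4.12
three_quarters = int(0.75 * 4096)  # 3072 in 4.12
product_7_9 = pmulhrsw(one_and_a_half, three_quarters)
product_4_12 = mul_4_12(one_and_a_half, three_quarters)
```

Here `product_7_9` is 576, which is 1.125 with 9 fractional bits, and `product_4_12` is 4608, which is 1.125 with 12 fractional bits.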
This post gave the background of the optimization problem and described the operation of the PMULHRSW opcode. In Part II of my discussion, I will describe the technique I used to optimize the Mandelbrot fixed-point multiply for SSE2.
Kent Knox
Member of AMD Technical Staff
Edited: 07/09/2008 at 02:54 PM by kknox
June 4, 2008
June 3, 2008
AMD Stream SDK v1.1-beta Release
The AMD Stream Team is pleased to announce the availability of AMD Stream SDK v1.1-beta!
The installation files are available for immediate download from: FTP Download Site For AMD Stream SDK v1.1-beta
The AMD Stream Computing website will be updated in the next few days to reflect this new release.
With v1.1-beta comes:
- AMD FireStream 9170 support
- Linux support (RHEL 5.1 and SLES 10 SP1)
- Brook+ integer support
- Brook+ #line number support for easier .br file debugging
- Various bug fixes and runtime enhancements
- Preliminary Microsoft Visual Studio 2008 support
If you have any questions, please do not hesitate to post your question to the forum.
Sincerely,
AMD Stream Team
Edited: 06/03/2008 at 04:35 AM by michael.chu@amd.com
May 22, 2008
Our oxygen bar was a hit!
Did you miss checking out our oxygen bar in the AMD booth at JavaOne? Well, 1700 of your fellow developers couldn't pass up the chance to try fragrances that had different effects like calming, energizing, and -- ahem -- aphrodisiac. Fortunately, we've got pictures...but we're not telling which vial is which!

Edited: 05/22/2008 at 05:42 PM by devcentral