AMD Developer Central
AMD Developer Blogs
July 1, 2008
  Mandelbrot and 16-bit fixed point multiplies ( Part I )
I recently had the opportunity to work on and help optimize a benchmark that uses fixed-point math to carry out an iterative calculation in a loop. The function of the benchmark is to calculate a Mandelbrot fractal in memory and report the time it takes to 'draw' the fractal as a rate. There are several different codepaths inside this benchmark, each implemented to take maximum advantage of different SIMD instruction sets, such as SSE2, SSSE3 and SSE4.1. The SSSE3 and SSE4.1 versions of this routine were approximately 2.7x faster than the SSE2 version. AMD processors support SSE2, SSE3 and SSE4a, so I wanted to investigate what could be done to optimize the SSE2 version of the function.
After I had a chance to inspect the various codepaths, it became obvious that the SSSE3 and SSE4.1 routines owed their significant performance lead to the PMULHRSW instruction. There is not much literature available on this SSSE3 opcode, but it is defined as 'Packed Multiply High with Round and Scale' and is an instruction designed for fixed-point math. It operates on packed integer data, multiplying two packed 16-bit source words and producing a packed 16-bit destination word. The 16 bits that this instruction chooses to place in the destination register are a little unusual, as illustrated by Figure 1 below. When two 16-bit values are multiplied together, the result is a 32-bit value. To fit 32 bits into a 16-bit result, some bits have to go, and the bits that PMULHRSW keeps differ significantly from those kept by PMULHW (the high 16 bits) or PMULLW (the low 16 bits). The red squares in the figure below represent the bits PMULHRSW truncates, and the green squares are the bits written as the result of the multiply.


Figure 1: Bit selection of PMULHRSW

Bit 31 is a redundant sign bit, so it is dropped; this is a consequence of both 16-bit sources being signed inputs. Bits 30-15 are the next 16 most significant consecutive bits and form the result, and the remaining least significant bits are truncated. Before truncation, a one is added at bit 14, the most significant of the truncated bits, so the result is rounded rather than simply cut off; this is where the 'round' comes from in the definition of the instruction. Rounding makes sense only for fixed-point numbers, where it increases the accuracy of the fractional bits. Since the most significant sign bit is dropped, the value written to the destination register is in effect shifted left by 1 (bit 30 is now the most significant bit of the 16-bit result), which amounts to multiplying the result by two; this is where the 'scale' comes from in the definition of the instruction.
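The bit selection described above can be modelled for a single 16-bit lane in plain C. This is only a scalar sketch for illustration, not the benchmark's code: the function names `mulhrs16` and `mulhi16` are hypothetical, and the code assumes that `>>` on a negative `int32_t` is an arithmetic shift, which is implementation-defined in C but holds on mainstream compilers.

```c
#include <stdint.h>

/* Scalar model of one PMULHRSW lane: keep bits 30..15 of the 32-bit
 * product after adding a rounding one at bit 14.
 * Assumes >> on a negative value is an arithmetic shift. */
static int16_t mulhrs16(int16_t a, int16_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;      /* full 32-bit product    */
    return (int16_t)(((product >> 14) + 1) >> 1);   /* round, keep bits 30..15 */
}

/* Scalar model of one PMULHW lane for comparison: keep bits 31..16. */
static int16_t mulhi16(int16_t a, int16_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;
    return (int16_t)(product >> 16);
}
```

With both inputs set to 16384, `mulhrs16` returns 8192 while `mulhi16` returns 4096, which shows the factor of two that the 'scale' in the instruction name refers to.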
This particular Mandelbrot benchmark was originally written to operate on data in a 4.12 fixed-point format. For those who feel a little rusty, this webpage provides a simple review of fixed-point math. The zoom of the Mandelbrot includes the real x-axis range from -2.25 to +0.75 and the imaginary y-axis range from -1.25 to +1.25, which leaves plenty of slack with signed 4.12 numbers. The inner loop of the Mandelbrot algorithm is a sequence of multiplies and adds of complex numbers. A white paper describing how to calculate the Mandelbrot algorithm can be found here. For the SSSE3 and SSE4.1 implementations, PMULHRSW was used to multiply these 4.12 fixed-point numbers; two 4.12 numbers multiplied together create an 8.24 32-bit number, and with the bit selection of PMULHRSW illustrated in Figure 2, a rounded 7.9 fixed-point number is written out as the packed result. The least significant fractional bit represents 2^-9, which provides enough precision to render a faithful representation of the Mandelbrot set. Eventually, this product has to be shifted left by 3 bits to get back to the original 4.12 format before the iterations of packed multiplies and adds continue.


Figure 2: 4.12 PMULHRSW multiply; W=Whole, F=Fractional bit
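The 4.12 multiply can be sketched in scalar C as follows. This is a minimal illustration under stated assumptions, not the benchmark's implementation: `fix4_12`, `mulhrs16`, `fix_mul` and `mandel_step` are hypothetical names, `mulhrs16` is a scalar stand-in for one PMULHRSW lane, arithmetic right shift of negative values is assumed, and no overflow handling is attempted (values outside the 4.12 range are not representable).

```c
#include <stdint.h>

typedef int16_t fix4_12;   /* signed 4.12 fixed point: 1.0 is 4096 */

/* Scalar stand-in for one PMULHRSW lane (bits 30..15, rounded at bit 14).
 * Assumes >> on a negative value is an arithmetic shift. */
static int16_t mulhrs16(int16_t a, int16_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;   /* 8.24 product */
    return (int16_t)(((product >> 14) + 1) >> 1);
}

/* 4.12 x 4.12 multiply: the rounded result is 7.9, so scale by 2^3
 * (the "left shift by 3") to return to 4.12. */
static fix4_12 fix_mul(fix4_12 a, fix4_12 b)
{
    int16_t q7_9 = mulhrs16(a, b);
    return (fix4_12)(q7_9 * 8);
}

/* One Mandelbrot iteration z = z*z + c on 4.12 values:
 * re' = re*re - im*im + c_re, im' = 2*re*im + c_im. */
static void mandel_step(fix4_12 *re, fix4_12 *im, fix4_12 c_re, fix4_12 c_im)
{
    fix4_12 rr = fix_mul(*re, *re);
    fix4_12 ii = fix_mul(*im, *im);
    fix4_12 ri = fix_mul(*re, *im);
    *re = (fix4_12)(rr - ii + c_re);
    *im = (fix4_12)(2 * ri + c_im);
}
```

For example, 1.5 (6144) times 0.5 (2048) yields 3072, which is 0.75 in 4.12. The three low fractional bits of every product are zero after the scaling step, which reflects the 2^-9 precision mentioned above.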

This post gave the background of the optimization problem and described the operation of the PMULHRSW opcode. In Part II of my discussion, I will describe the technique I used to optimize the Mandelbrot fixed-point multiply for SSE2.

Kent Knox
Member of AMD Technical Staff

-------------------------
This response is provided for informational purposes only, is provided "AS IS" and does not obligate AMD to provide any of the services, technology, or programs described.

Edited: 07/01/2008 at 07:21 PM by kknox


    Posted By: Kent Knox @ 07/01/2008 06:52 PM     Hard-Core Software Optimization     Comments (0)  

June 4, 2008
  RapidMind and AMD Team to Demonstrate 27x Performance Improvement of Financial Algorithm at SIFMA'08

RapidMind will be showcasing live demonstrations of a 27x performance improvement of a Binomial option pricing calculator at SIFMA’08. RapidMind will also be demonstrating the same accelerated option pricing tool running on the AMD FireStream 9170. The demonstration will occur in the AMD booth #2000.

For more information, please see: http://www.rapidmind.com/News-June4-08-SIFMA.php



Edited: 06/04/2008 at 10:51 PM by michael.chu@amd.com


    Posted By: Michael Chu @ 06/04/2008 03:15 PM     AMD Stream™     Comments (1)  

June 3, 2008
  See AMD Stream Team at ISC'08!

Members of the AMD Stream Team will be at International Supercomputing Conference ’08, June 17-20, 2008 in Dresden, Germany.

Visit us at booth #B19-B22 and see what exciting new things the AMD Stream Team and its partners are working on.

For more information about ISC’08, please visit: http://www.supercomp.de/isc08/content/index_eng.html



    Posted By: Michael Chu @ 06/03/2008 09:12 PM     AMD Stream™     Comments (2)  

  See AMD Stream Team at SIFMA'08!

Members of the AMD Stream Team will be at the SIFMA Technology Management Conference & Exhibit, June 10-12, 2008 in New York City, New York.

Visit us at booth #2000 and see what exciting new things the AMD Stream Team and its partners are working on.

For more information about SIFMA’08, please visit: http://events.sifma.org/2008/107/event.aspx?id=526



    Posted By: Michael Chu @ 06/03/2008 09:11 PM     AMD Stream™     Comments (0)  

  AMD Stream SDK v1.1-beta Release

The AMD Stream Team is pleased to announce the availability of AMD Stream SDK v1.1-beta!

The installation files are available for immediate download from:
FTP Download Site For AMD Stream SDK v1.1-beta

The AMD Stream Computing website will be updated in the next few days to reflect this new release.

With v1.1-beta comes:

- AMD FireStream 9170 support
- Linux support (RHEL 5.1 and SLES 10 SP1)
- Brook+ integer support
- Brook+ #line number support for easier .br file debugging
- Various bug fixes and runtime enhancements
- Preliminary Microsoft Visual Studio 2008 support



If you have any questions, please do not hesitate to post your question to the forum.

Sincerely,
AMD Stream Team



Edited: 06/03/2008 at 04:35 AM by michael.chu@amd.com


    Posted By: Michael Chu @ 06/03/2008 03:54 AM     AMD Stream™     Comments (0)  

May 22, 2008
  Our oxygen bar was a hit!

Did you miss checking out our oxygen bar in the AMD booth at JavaOne? Well, 1700 of your fellow developers couldn't pass up the chance to try fragrances that had different effects like calming, energizing, and -- ahem -- aphrodisiac. Fortunately, we've got pictures...but we're not telling which vial is which!

oxygen bar

 




Edited: 05/22/2008 at 05:42 PM by devcentral


    Posted By: AMD DeveloperCentral @ 05/22/2008 05:02 PM     Inside Dev Central     Comments (0)  

May 20, 2008
  Wrapping up JavaOne 2008

It was great being part of AMD's big presence at JavaOne this year. We had good turnout at the booth, keynote, and technical session, talking with folks who wanted to learn about our community involvement, most notably:

·     Our CodeSleuth product

·     Our plans for enhancing Java through our processor and platform technologies

·     The partners that we are working with to have Java take advantage of those technologies

 

AMD Keynote with Dr. Leendert van Doorn and Dr. James Gosling

If you couldn't attend the conference, be sure to check out the keynote from Dr. Leendert van Doorn, AMD Senior Fellow, on the Role of the Microprocessor in the Evolution of Java Technology. Dr. James Gosling was up on stage with Leendert to discuss the future of Java. The webcast can be viewed in chapters at the following three links:

http://java.sun.com/javaone/sf/media_shell.jsp?id=FRdamp267618

http://java.sun.com/javaone/sf/media_shell.jsp?id=FRdamp267620

http://java.sun.com/javaone/sf/media_shell.jsp?id=FRdamp267622

 

A good article that provides an overview of the keynote and our plans can be found here:

http://java.sun.com/javaone/sf/2008/articles/gen_amd.jsp

 

Many thanks to Jim Falgout and the team from Pervasive Software for their booth presence and keynote participation. We look forward to working with them on their products that tap the power of multi-core processors.

 

Good turnout and probing questions at Azeem and Shrinivas' virtualization tech session. Many people are just getting their feet wet with virtualization. We expect it to be a major topic at the conference next year, which leads me to the question: What details would be of interest to you?

 

Interest in the various Real Time Java technologies was also up. In the booth, we had a presentation from IBM on its offering in that space, and many folks came by with questions and feedback about real-time JVMs in general. They were mainly from the financial and manufacturing sectors. I expect we will see much more discussion of this space at next year's conference, and you will certainly see us in the thick of it.

 

AMD booth at JavaOne 2008 

 

 

The team attended some good tech sessions (and saw much to like) in a number of areas including, but not limited to:

·     Garbage collection: There was some good discussion around the G1 collector. We are certainly interested in what this does for JVM performance.

·     Concurrency: Upcoming ParallelArray features look like a great way to have Java developers create code that can scale well to multi-core architectures. There were other talks about the challenges we face as developers in the multi-core era.

·     Other JDK7 features: A good talk about Closures, which will help concurrency and stream handling.

·     JavaFX: Lots of buzz around JavaFX this year, with some great demos in the initial general session.

 

We are looking forward to participating at the conference next year in June. Hope you have plans to be there as well.

 

Ben




    Posted By: Ben Pollan @ 05/20/2008 11:15 PM     AMD Java Labs     Comments (0)  

May 16, 2008
  Welcome to the AMD Stream Computing Blog

Welcome to the AMD Stream Computing blog, where the AMD Stream Team will publish posts and mini-articles about all things Stream! My name is Michael Chu and I am the product manager for AMD Stream software. From time to time, other members of the team will post articles and introduce themselves. We plan to bring you interesting news as we find out about it, along with any relevant releases and products.

 

We invite you to help guide the direction of this site by leaving comments that let us know if these are the types of content and topics you would like to see published.

 

If you are interested in developing with AMD Stream, please visit us on the developer forums (go to AMD Stream). We have a growing community of developers who are constantly sharing what they have learned as they developed their applications on AMD Stream.

 

Stay tuned for more exciting news!



    Posted By: Michael Chu @ 05/16/2008 07:31 PM     AMD Stream™     Comments (2)  

May 12, 2008
  Play the Second Annual AMD Treasure Hunt Game!
Have you been paying attention to the latest trends in multi-core processors and multi-threaded programming? Are you curious about how parallel programming can dramatically improve application (and of course, PC gaming) performance? Are you a code warrior interested in a cool challenge?

If so, you might have what it takes to conquer the AMD Treasure Hunt Game!

» Learn how to play


Edited: 05/12/2008 at 02:56 PM by devcentral


    Posted By: AMD DeveloperCentral @ 05/12/2008 12:16 PM     Inside Dev Central     Comments (0)  

April 28, 2008
  Live Webcast of AMD's JavaOne Keynote

Well, it’s going to be a pretty busy week here, as AMD puts the finishing touches on our JavaOne activities. It’s going to be a great conference. Be sure to stop by our booth and don’t forget that AMD’s keynote will be on Wednesday, May 7 @ 5:30pm, presented by Leendert van Doorn, Senior Fellow in our Software Technology Office.

 

If you aren’t attending the conference, but still want to see the keynote, no worries! There will be a live webcast of the session, accessible from the JavaOne home page <http://java.sun.com/javaone>.

 

Ben




    Posted By: Ben Pollan @ 04/28/2008 12:17 PM     AMD Java Labs     Comments (0)  

April 14, 2008
  "Barcelona" Processor Feature: SSE Misaligned Access
We all crave high performing code and in the process we try hard to optimize the algorithms, reorder instructions, unroll loops, avoid branches, reduce pointer usage to allow compilers to optimize, replace dynamic allocation with static allocation where the size is known and so on. One such optimization is with respect to data loads and stores from memory which consume a majority of processing cycles in data-intensive applications. Here, I'll take you through one such optimization with respect to data alignment while using SSE (Streaming SIMD Extension) instructions.

Why use SSE instructions?

SSE instructions operate on 16 bytes of data in parallel. We can load 16 bytes of data at a time and compute those 16 bytes of packed data using a single SSE instruction.
Example: ADDPS xmm1, xmm2 - Add 4 single precision floating point elements packed in xmm1 register with corresponding elements packed in xmm2 and store the result back in xmm1.

SSE instructions are widely used in developing computation-intensive multimedia applications. Typically, these applications process large amounts of sequential data through the following steps:

1. Load data from memory
2. Perform computation on the data
3. Store data back to memory

First we will discuss the intricacies involved in optimizing memory operations using SSE instructions on the AMD-K8™ family of processors (first and second generation AMD Opteron™ processors) and then we will discuss the architectural enhancements provided by the "Barcelona" or Family 10h processors (including Quad-Core AMD Opteron and AMD Phenom™ X4 Quad-Core and X3 Triple-Core processors).

SSE provides two types of load and store instructions. The first type is aligned loads and stores (ex: MOVDQA, MOVAPD, MOVAPS), which operate on 16-byte-aligned memory addresses. The second type is unaligned loads and stores (ex: MOVDQU, MOVUPD, MOVUPS), which operate on both aligned and unaligned memory addresses. On the AMD-K8 family of processors, the aligned versions of the load and store operations are faster than the unaligned versions even if the memory is 16-byte aligned. For details on the latencies of the various types of load and store instructions, refer to the AMD Software Optimization Guide for AMD Family 10h Processors.

If we use the aligned version of the memory operations without verifying the alignment of the memory address, there are two possible outcomes. If the memory is aligned, the memory operations are fast. If the memory is unaligned, the processor raises a general-protection exception and the application crashes (Bang!!!). One solution is to align the input data, which both gains performance and eliminates the exceptions and crashes. This does not always work, since the user of the application may not align the data, or because enforcing such a rule may be inappropriate at times. The easy solution is to use the safer unaligned loads and stores everywhere, sacrificing performance regardless of the data alignment.

If you are a programmer looking for the best possible performance, saving every single processing cycle, then the solution here is to handle both aligned and unaligned data by checking for alignment of the data at runtime and call the appropriate function that handles either aligned data or unaligned data.

The code to handle aligned and unaligned data is as follows:



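// Dispatch at runtime to the routine that matches the alignment of the data.
if( isAligned(data) )
{
    process_aligned(data);
}
else
{
    process_unaligned(data);
}

// The 16 byte alignment check code is as follows.
// The pointer is cast to an integer type (uintptr_t, from <stdint.h>)
// because the % operator cannot be applied to a pointer.
bool isAligned(const void* data)
{
    return (((uintptr_t)data % 16) == 0);
}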




Typically, the process_aligned and process_unaligned routines have identical code except for the type of load and store instructions.
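As a concrete illustration of such a pair of routines, here is a minimal sketch in C using the SSE intrinsics from `<xmmintrin.h>`. The routine names (`add_aligned`, `add_unaligned`, `add_floats`) and the choice of an element-wise float add are assumptions made for this example; they are not taken from any AMD library. The intrinsics `_mm_load_ps`/`_mm_store_ps` correspond to aligned MOVAPS accesses and `_mm_loadu_ps`/`_mm_storeu_ps` to unaligned MOVUPS accesses, although a compiler is free to pick the exact encoding.

```c
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Returns nonzero when p sits on a 16-byte boundary. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p % 16) == 0;
}

/* Aligned variant: every pointer must be 16-byte aligned. */
static void add_aligned(float *dst, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i + 4 <= n; i += 4) {
        __m128 sum = _mm_add_ps(_mm_load_ps(a + i), _mm_load_ps(b + i));
        _mm_store_ps(dst + i, sum);
    }
}

/* Unaligned variant: identical except for the load and store intrinsics. */
static void add_unaligned(float *dst, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i + 4 <= n; i += 4) {
        __m128 sum = _mm_add_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i));
        _mm_storeu_ps(dst + i, sum);
    }
}

/* Hypothetical entry point: checks alignment once, then dispatches. */
static void add_floats(float *dst, const float *a, const float *b, size_t n)
{
    if (is_aligned16(dst) && is_aligned16(a) && is_aligned16(b))
        add_aligned(dst, a, b, n);
    else
        add_unaligned(dst, a, b, n);

    /* Scalar tail for the last n % 4 elements. */
    for (size_t i = n - n % 4; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

The alignment check runs once per call rather than once per element, so its cost is amortized over the whole buffer.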

Architectural enhancements in AMD Family 10h processors ("Barcelona" processors)

"Barcelona" comes with load instructions that are twice as fast as the previous generation processors. For example, the aligned loads take 2 processor cycles in "Barcelona," compared to 4 processor cycles in the AMD-K8 architecture. This is only the latency of the instruction execution; there could be additional latency depending on the locality of the actual data being present in cache or main memory.

The unaligned loads in "Barcelona" run at the speed of aligned loads if the data is aligned. Thus, it is safer to use unaligned loads whenever the alignment of the data is not guaranteed, which eliminates the runtime check for 16-byte alignment. If the data is unaligned, the instruction is slightly slower than an aligned load, but still faster than the unaligned loads on AMD-K8 processors. The FPU in "Barcelona" has been widened from 64 bits to 128 bits, and the load instructions are fast path instructions. (Note: In AMD-K8 processors, SSE loads are vector path instructions, which block the execution units from executing any other instruction in parallel.)

The above optimizations are not applicable for SSE stores. The unaligned stores are slower than aligned stores even when the data is aligned.
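One way to apply this guidance is sketched below in C with SSE intrinsics: the loads are always the unaligned form, and only the destination alignment is checked so that aligned stores can still be used when possible. The function name `add_floats_fam10h` and the float-add workload are hypothetical, chosen only for illustration, and whether this pays off on a given processor should be measured.

```c
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical routine: unaligned loads everywhere, with the store form
 * chosen once from the alignment of dst. */
static void add_floats_fam10h(float *dst, const float *a, const float *b,
                              size_t n)
{
    size_t i = 0;

    if (((uintptr_t)dst % 16) == 0) {
        /* dst is 16-byte aligned: keep the faster aligned store. */
        for (; i + 4 <= n; i += 4)
            _mm_store_ps(dst + i,
                         _mm_add_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));
    } else {
        /* dst is not aligned: fall back to the unaligned store. */
        for (; i + 4 <= n; i += 4)
            _mm_storeu_ps(dst + i,
                          _mm_add_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));
    }

    /* Scalar tail for the last n % 4 elements. */
    for (; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

Compared with checking every pointer, this leaves a single alignment test on the store side, which is the side where the text above says the aligned form still matters.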

Ravindra Babu
Software Engineer, AMD


Edited: 04/14/2008 at 12:55 PM by devcentral


    Posted By: AMD DeveloperCentral @ 04/14/2008 11:14 AM     AMD “Barcelona” (Family 10h) Processor Software Visible Features     Comments (0)  

April 11, 2008
  The Software Optimization Guide Comes to Life!
I'm pleased to announce that we have just published a series of six videos that brings to life some of the key concepts outlined in the Software Optimization Guide for Family 10h Processors. This video series is a companion to the optimization guide, and provides a quick look at some highly useful tips in addition to some examples to illustrate coding best practices.

We hope you find this series valuable, and welcome your feedback. Let us know what you think by commenting on this post. If you have questions about the information contained in the videos, feel free to post a question in our forums.

Happy viewing!
Software Optimization Video Series


    Posted By: AMD DeveloperCentral @ 04/11/2008 02:57 PM     Inside Dev Central     Comments (2)  

April 7, 2008
  Come join us at JavaOne 2008!

It's that time of year again.  Here in Austin, the bluebonnets are in full bloom, and that can only mean one thing.  It's time for the Java Labs to pack our bags and head over to the Moscone Convention Center in San Francisco, so that we can rub elbows with 15,000 of our closest friends at 2008 JavaOne!  As platinum sponsors, AMD is playing a much bigger role at the conference this year.  That's because as a platform company, we have a lot to say about Java.

 

There are a few things you won't want to miss.  On  Wednesday, May 7  5:30pm, Leendert van Doorn, AMD Senior Fellow will present a keynote on processor companies' roles in the Java world.  On  Tuesday, May 6  6:00pm , the Java Labs' own Azeem Jiva and Shrinivas Joshi will present a technical session titled "Virtualizing a Virtual Machine," where the duo will discuss best practices when deploying Java applications in virtualized environments.  And of course in the Pavilion there's the booth, the big beautiful booth (so beautiful that I'm thinking of having those designers give my house a makeover), where we will highlight AMD's role in the Java community today and share our vision of the future. 

 

Details can be found on our AMD at 2008 JavaOne page:  http://developer.amd.com/EVENTS/JAVAONE/Pages/default.asp. 

 

See you there!

Ben




Edited: 04/07/2008 at 02:44 PM by bpollan


    Posted By: Ben Pollan @ 04/07/2008 02:14 PM     AMD Java Labs     Comments (0)  

March 26, 2008
  Myths and facts about 64-bit Linux(®)

Myths

  • "You don't need 64-bit software with less than 3 GB RAM"
  • "There are less drivers for 64-bit OS"
  • "You will need all new software, all 64-bit"
  • "64-bit software is twice as fast"


AMD's 64-bit architecture extension
AMD64 introduces one new mode to the processor, the Long mode, consisting of the 64-bit mode and the Compatibility mode. The former is the new 64-bit environment, the latter a compatibility implementation to run 32-bit code in that 64-bit environment. The current operating mode is connected to the code segment. By using instructions that change code segments (like syscall) one can switch between both submodes.
Currently the paging algorithm (derived from PAE) limits the physical address to 52 bits, while the virtual address space is 48-bits wide. Even with far less physical RAM this proves to be helpful for memory mapping.

Another important part of the new architecture is the extended number of registers: both the general-purpose registers and the SSE registers have been doubled, so 16 of each are available in the new 64-bit mode.


Software support
Support for 64-bit in software required a new ABI to be introduced as well as extending GCC with the new architecture. As an essential part in that, up to six function parameters are now passed in registers. Linux kernel support for 64-bit started by splitting off a new architecture tree, x86-64. Since 2.6.24 this was merged with the i386 tree to one common code base (x86).

In addition a compatibility layer (aka compat layer) provides support for execution of 32-bit binaries on 64-bit processors.


32-bit compatibility
Compatibility with 32-bit code has been a crucial goal in the design. First of all, it allows unmodified 32-bit code to run inside a 64-bit environment. Syscalls are used to switch between both worlds.

The processor zero-extends all 32-bit addresses to 64-bit. Applications use the lower 4 GB of the address space. However, physically that can be mapped anywhere. The kernel manages the compat layer for all applications, meaning it resolves bitness differences, structure layouts and invokes the right library version (/lib{32,64}/ld.so).

Speaking of libraries - each library has to be present twice, once for 32-bit and once for 64-bit, which also includes all dependent ones down to the lowest.

With the Linux compat layer it is even possible to run an entire and unmodified 32-bit Linux installation with a 64-bit kernel.


Benchmarks
First we picked some real-world benchmarks for our 32-bit vs. 64-bit comparisons: Oggenc, Mencoder and Povray, as well as some compilation tests. In addition, micro benchmarks were used to show specific performance differences for syscalls and 64-bit arithmetic.

We set up three system configurations - a 32-bit installation, a 64-bit installation and a combination of 32-bit installation with 64-bit kernel to challenge the compat layer. All tests were performed on a dual-core AMD-K8(tm) processor with 1 GB RAM.

The tests showed that the penalty of using the compat layer instead of running your 32-bit application on a native 32-bit kernel is about 1-2 percent. So it is almost negligible.

64-bit took the lead in the media encoding tests: our Povray and Mencoder benchmarks took about 5% less time in the 64-bit case, and Oggenc as much as 25% less. Only the C compilation tests showed a performance advantage for 32-bit, of 5% to 8% over 64-bit.

Native arithmetic performance (64-bit data types used in 64-bit software vs. 32-bit data types used in 32-bit software) showed a gain of 10% for the 64-bit case. When 64-bit data types were used in both the 32-bit and the 64-bit builds of the arithmetic test, 64-bit was more than twice as fast as 32-bit.


Downsides of 64-bit
A 64-bit execution environment and 64-bit software have their downsides, too. The main one is the larger memory footprint: binaries get larger because of the increased pointer size and 64-bit operands. This leads to a higher memory transfer load and puts more pressure on the caches.


Myths revisited

  • "You don't need 64-bit software with less than 3 GB RAM"
    • Performance improvement even on systems with less than 3 GB RAM
  • "There are less drivers for 64-bit OS"
    • Irrelevant to Linux, hail open source 
  • "You will need all new software, all 64 bit"
    • 32-bit compat layer performs very well and is transparent
  • "64-bit software is twice as fast"
    • Rarely the case; software is usually optimized for 32-bit


Conclusion
Use a 64-bit system, and rely on the compat layer if you need to run certain 32-bit applications.


Andre Przywara, Andreas Herrmann, Peter Oruba




Edited: 03/26/2008 at 05:07 AM by peteroruba


    Posted By: Peter Oruba @ 03/26/2008 04:54 AM     AMD Operating System Research Center (OSRC)     Comments (0)  

March 18, 2008
  Live from EclipseCon 2008
I have just a short break here, but wanted to give you all a quick update on how things are going here at EclipseCon 2008.

The booth has been quite busy, with attendees coming by to fill out our survey and get their 1GB USB drive. We've had a number of people wanting to learn what AMD's relationship is with Eclipse, and then are very interested once they find out what the CodeSleuth plugin can do for their Java development process.

Gary Frost from the AMD Java Labs team delivered his technical session this morning to a full room. After his session, I was flooded at the booth! I'll try to post some pictures when I get a moment.

Gotta go, the hall is opening up again and people are coming by! Be sure to check out CodeSleuth yourself if you're not able to join us at the show.




Edited: 03/24/2008 at 05:04 PM by devcentral


    Posted By: AMD DeveloperCentral @ 03/18/2008 05:41 PM     Inside Dev Central     Comments (0)  

March 13, 2008
  Join us at EclipseCon 2008
If you're coming to EclipseCon 2008 next week (March 17-20 in Santa Clara, CA), be sure to come visit the AMD Hardware Lounge or our booths (410/411) in the Exhibit Hall. We'll be showcasing one of the servers we've donated to the Eclipse Foundation to run their backend infrastructure, along with some AMD SPIDER systems.

Gary Frost, one of our AMD Java Labs engineers, will also be delivering a technical session and will be on hand to answer questions about the new plugin for Eclipse we're demonstrating.

Plus, fill out our survey for a chance to win a fun prize!

Get all the details at our AMD@EclipseCon page.


    Posted By: AMD DeveloperCentral @ 03/13/2008 03:55 PM     Inside Dev Central     Comments (0)  

February 21, 2008
  Optimizing Inter-Core Data Transfer on AMD Phenom™ processors
The AMD Phenom™ family of microprocessors (Family 0x10) is AMD's first generation to incorporate 4 distinct cores on a single die, and the first to have a cache that all the cores share. There is a small subset of compute problems that can be categorized as belonging in a Producer and Consumer paradigm; a thread of a program running on a single core produces data, which is meant to be consumed by a thread that is running on a separate core. With such programs, it is desirable to get the two distinct cores to communicate through the shared cache, to avoid round trips to/from main memory.

With a naïve implementation of a producer/consumer program on the AMD Phenom processor, measured bandwidth results will appear to be throttled by main memory speeds. Main memory speeds can vary, but with DDR2 533 memory (average grade), this is around 4 GB/s. Why is this?

There are several architectural details on the AMD Phenom processor that can limit inter-core bandwidth if not properly understood. The type and size of the cache on the AMD Phenom core has a direct effect on bandwidth; it includes a "mostly exclusive victim" cache. The MOESI protocol that the AMD Phenom cache uses for cache coherency can also limit bandwidth; it is important to keep a cache line in the 'M' state for optimal producer/consumer performance. A detailed explanation of the AMD Phenom cache architecture and how this relates to producer/consumer performance can be found in the Software Optimization Guide for AMD Family 10h Processors ( section 11.5 ).

Assuming a single buffer has been defined for the producer and consumer threads to walk and communicate, the following bulleted list is a checklist of the constraints to follow to achieve maximum bandwidth:

  • The consumer thread needs to 'lag' the producer thread by at least L1 & L2 cache size (modulo arithmetic)

  • The producer thread needs to 'lag' the consumer thread by at least L1 & L2 cache size (modulo arithmetic)

  • The buffer should be at least 2*(L1 & L2)

  • The producer thread should not get so far ahead of the consumer as to flood the L3, if a large buffer is used

  • Use prefetchw on the consumer side, even if the consumer does not want to modify the data

  • Add a small fudge factor to the calculated sizes to give the threads some 'slack' when communicating through the caches


In general, the AMD Phenom cache is optimized for widely shared data, i.e. one core produces data that many other cores may be interested in. In the producer/consumer program however, it is known ahead of time that the data the producer creates is only interesting to the matching consumer thread, and not to any other thread. Following the constraints listed above, it is possible to achieve an aggregate ~12 GB/s bandwidth for two producer/consumer pairs (to maximize 4 cores) on the AMD Phenom processor.
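One possible reading of the checklist above, expressed as C helper functions, is sketched below. Everything in it is an assumption made for illustration: the cache sizes (64 KB L1 data cache and 512 KB L2 per core), the slack and chunk sizes, and all names are hypothetical, and the threads are assumed to track running byte totals (`produced`, `consumed`) rather than wrapped ring offsets. The ring is sized at twice the lag plus one chunk so that the two distance constraints can be met at the same time. The `prefetchw` hint and the actual thread synchronization are not shown.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed per-core sizes and tuning values; verify for the target CPU. */
#define L1D_BYTES   (64u * 1024u)
#define L2_BYTES    (512u * 1024u)
#define SLACK_BYTES (4u * 1024u)   /* hypothetical fudge factor          */
#define CHUNK_BYTES (4u * 1024u)   /* hypothetical transfer granularity  */

/* Minimum distance to keep between the two threads, in bytes. */
static size_t min_lag(void)
{
    return (size_t)L1D_BYTES + L2_BYTES + SLACK_BYTES;
}

/* Smallest ring in this sketch: 2 * lag plus one chunk in flight. */
static size_t min_ring_bytes(void)
{
    return 2 * min_lag() + CHUNK_BYTES;
}

/* Consumer may read a chunk only if it still trails the producer by at
 * least min_lag() afterwards. */
static int consumer_may_read(uint64_t produced, uint64_t consumed)
{
    return produced - consumed >= min_lag() + CHUNK_BYTES;
}

/* Producer may write a chunk only if, after wrapping around the ring, it
 * still stays at least min_lag() behind the consumer. */
static int producer_may_write(uint64_t produced, uint64_t consumed,
                              size_t ring_bytes)
{
    return produced + CHUNK_BYTES - consumed <= ring_bytes - min_lag();
}
```

With these values the lag is 593920 bytes and the minimum ring is 1191936 bytes. A real implementation would also need a way to drain the final lag's worth of data at the end of the stream, which this sketch leaves out.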

Kent Knox
Member of AMD Technical Staff


Edited: 07/01/2008 at 06:04 PM by kknox


    Posted By: Kent Knox @ 02/21/2008 05:13 PM     Hard-Core Software Optimization     Comments (0)  

February 20, 2008
  Boosting KVM's performance with nested paging

Unlike Xen, KVM is an in-kernel hypervisor for Linux™ that lets you run unmodified guests such as Linux (both 32-bit and 64-bit) and Windows™ in every flavor. KVM requires hardware support such as AMD's SVM (Secure Virtual Machine) to accelerate full virtualization. In practice it is provided as a kernel module that comes in two pieces: a generic one and an AMD-specific one. Our most recent development for KVM is support for the new K10 hardware feature called "nested paging".
Virtualization performance depends heavily on how efficiently the hypervisor manages virtual memory. This is where nested paging comes into play. The typically time-consuming mapping of guest physical addresses to host physical addresses no longer has to be computed in software; it can be done in hardware instead. In software, this memory management was done using shadow paging, which proved to be a major performance bottleneck. Nested paging offloads this address mapping to hardware.
Our KVM patch also includes live migration to/from either paging method. As far as we can tell, it will be included in KVM version 61, which will probably come with kernel version 2.6.26. So what does nested paging buy you in terms of numbers? We set up a KVM host system with a Linux guest and ran kernbench, a kernel compilation benchmark that does several compile runs and gives a good indication of overall system performance. Our guest system gains about 30% in performance with nested paging enabled compared to shadow paging. This brings a key goal of virtualization, native performance, much closer: the guest improved from 60% to 90% of native performance when compiling a kernel from memory, which makes KVM a reasonable alternative to other virtualization solutions.

Jörg Rödel & Peter Oruba




    Posted By: Peter Oruba @ 02/20/2008 03:04 AM     AMD Operating System Research Center (OSRC)     Comments (0)  

February 19, 2008