The AMD Phenom™ family of microprocessors (Family 0x10) is AMD's first generation to incorporate four distinct cores on a single die, and the first to have a cache that all the cores share. A small subset of compute problems fits a producer/consumer paradigm: a thread running on one core produces data that is meant to be consumed by a thread running on a separate core. With such programs, it is desirable to have the two cores communicate through the shared cache and avoid round trips to and from main memory.
With a naïve implementation of a producer/consumer program on the AMD Phenom processor, measured bandwidth will appear to be throttled by main memory speed. Main memory speed varies, but with DDR2 533 memory (average grade) it is roughly 4 GB/s. Why is this?
Several architectural details of the AMD Phenom processor can limit inter-core bandwidth if they are not properly understood. The type and size of the caches on the AMD Phenom processor have a direct effect on bandwidth; the hierarchy includes a "mostly exclusive victim" cache. The MOESI protocol that the AMD Phenom cache uses for cache coherency can also limit bandwidth; it is important to keep a cache line in the 'M' state for optimal producer/consumer performance. A detailed explanation of the AMD Phenom cache architecture and how it relates to producer/consumer performance can be found in the Software Optimization Guide for AMD Family 10h Processors (section 11.5).
Assuming a single buffer has been defined for the producer and consumer threads to walk and communicate, the following bulleted list is a checklist of the constraints to follow to achieve maximum bandwidth:
- The consumer thread needs to 'lag' the producer thread by at least the combined L1 and L2 cache size (modulo arithmetic)
- The producer thread needs to 'lag' the consumer thread by at least the combined L1 and L2 cache size (modulo arithmetic)
- The buffer should be at least 2*(L1 + L2) in size
- If a large buffer is used, the producer thread should not get so far ahead of the consumer that it floods the L3
- Use prefetchw on the consumer side, even if the consumer does not want to modify the data
- Add a small fudge factor to the calculated sizes to give the threads some 'slack' when communicating through the caches
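As a rough illustration of the checklist, the sketch below works out the lag distance and minimum buffer size and expresses the two 'lag' rules as modulo arithmetic on positions in a circular buffer. It is a minimal sketch under stated assumptions, not the implementation used for the measurements: the cache sizes (64 KB L1 data cache and 512 KB L2 per core), the 16 KB slack value, and all function names are assumptions chosen for the example. The prefetch helper relies on the GCC/Clang builtin `__builtin_prefetch` with write intent; whether that compiles to `prefetchw` depends on the compiler and target flags.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed per-core cache sizes, for illustration only; check your own part. */
#define L1D_BYTES   (64u * 1024u)
#define L2_BYTES    (512u * 1024u)
#define SLACK_BYTES (16u * 1024u)   /* hypothetical fudge factor */

/* Minimum distance the two threads keep from each other: L1 + L2 + slack. */
static inline size_t min_lag_bytes(void)
{
    return (size_t)L1D_BYTES + L2_BYTES + SLACK_BYTES;
}

/* Smallest buffer in which both lag rules can hold at once: 2 * lag. */
static inline size_t min_buffer_bytes(void)
{
    return 2 * min_lag_bytes();
}

/* Bytes by which the producer leads the consumer in a circular buffer. */
static inline size_t lead_bytes(size_t prod_pos, size_t cons_pos,
                                size_t buf_bytes)
{
    assert(buf_bytes > 0 && prod_pos < buf_bytes && cons_pos < buf_bytes);
    return (prod_pos + buf_bytes - cons_pos) % buf_bytes;
}

/* The consumer may read once the producer is at least `lag` bytes ahead. */
static inline int consumer_may_advance(size_t prod_pos, size_t cons_pos,
                                       size_t buf_bytes, size_t lag)
{
    return lead_bytes(prod_pos, cons_pos, buf_bytes) >= lag;
}

/*
 * The producer may write while it stays at least `lag` bytes behind the
 * consumer (measured around the ring) and no more than `max_lead` bytes
 * ahead of it, so that a large buffer does not flood the L3.
 */
static inline int producer_may_advance(size_t prod_pos, size_t cons_pos,
                                       size_t buf_bytes, size_t lag,
                                       size_t max_lead)
{
    size_t lead = lead_bytes(prod_pos, cons_pos, buf_bytes);
    return (buf_bytes - lead) >= lag && lead <= max_lead;
}

/* Write-intent prefetch for the consumer side (a no-op on other compilers). */
static inline void prefetch_for_write(const void *p)
{
#if defined(__GNUC__) || defined(__clang__)
    __builtin_prefetch(p, 1, 3);
#else
    (void)p;
#endif
}
```

In a real program each thread would re-check its predicate before moving on to the next chunk of the buffer, and the position variables would need appropriate synchronization, which this sketch leaves out.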
In general, the AMD Phenom cache is optimized for widely shared data, i.e. one core produces data that many other cores may be interested in. In a producer/consumer program, however, it is known ahead of time that the data the producer creates is of interest only to the matching consumer thread, and not to any other thread. By following the constraints listed above, it is possible to achieve an aggregate bandwidth of about 12 GB/s for two producer/consumer pairs (enough to occupy all four cores) on the AMD Phenom processor.
Kent Knox
Member of AMD Technical Staff
-------------------------
This response is provided for informational purposes only, is provided "AS IS" and does not obligate AMD to provide any of the services, technology, or programs described.
Unlike Xen, KVM is an in-kernel hypervisor for Linux™ that lets you run unmodified guests such as Linux (both 32-bit and 64-bit) as well as Windows™ in all its flavors. KVM requires hardware support such as AMD's SVM (Secure Virtual Machine) to accelerate full virtualization. In practice it is provided as a kernel module that comes in two pieces: a generic one and an AMD-specific one. Our most recent development for KVM is support for the new K10 hardware feature called "nested paging".
Virtualization performance depends heavily on the efficiency of the hypervisor's virtual memory management. This is where nested paging comes into play. The typically time-consuming mapping of guest physical addresses to host physical addresses no longer has to be computed in software; it can be done in hardware instead. Previously, memory management was handled in software using shadow paging, which proved to be a major performance bottleneck. Nested paging offloads this address mapping to hardware.
Our KVM patch also includes live migration to/from either paging method. As far as we can tell, it will be included in KVM version 61, which will probably come with kernel version 2.6.26. So what does nested paging buy you in terms of numbers? We set up a KVM host system with a Linux guest and ran kernbench. Kernbench is a kernel compilation benchmark that performs several compile runs and gives a good indication of overall system performance. Our guest system gains about 30% in performance with nested paging enabled compared to shadow paging. A key goal in virtualization, native performance, is now much closer: performance relative to native improved from 60% to 90% when compiling a kernel from memory, showing KVM to be a reasonable alternative to other virtualization solutions.
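To make the difference between the two methods concrete, here is a toy model of the address mapping, assuming flat single-level tables over an eight-page address space. It is only a sketch: real page tables are multi-level, the guest's own table walk also goes through the nested tables, and all type and function names here are hypothetical. With nested paging the two lookups are chained on each translation (by hardware in a real system), while with shadow paging the hypervisor composes the two tables in software and must rebuild the result whenever the guest edits its table.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12u                        /* 4 KiB pages */
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1u)
#define NPAGES     8u                         /* toy address space: 8 pages */

/* One flat table: index = page number, value = frame number. */
typedef struct {
    uint32_t frame[NPAGES];
} toy_table;

/* Single lookup: swap the page number for a frame number, keep the offset. */
static inline uint32_t toy_translate(const toy_table *t, uint32_t addr)
{
    uint32_t page = addr >> PAGE_SHIFT;
    assert(page < NPAGES);
    return (t->frame[page] << PAGE_SHIFT) | (addr & PAGE_MASK);
}

/*
 * Nested paging model: guest virtual -> guest physical via the guest's
 * table, then guest physical -> host physical via the nested table.
 */
static inline uint32_t nested_translate(const toy_table *guest,
                                        const toy_table *nested,
                                        uint32_t guest_virtual)
{
    uint32_t guest_physical = toy_translate(guest, guest_virtual);
    return toy_translate(nested, guest_physical);
}

/*
 * Shadow paging model: the hypervisor precomputes a combined table that
 * maps guest virtual pages straight to host frames. This software step
 * has to be redone every time the guest changes its own table.
 */
static inline void build_shadow(const toy_table *guest,
                                const toy_table *nested,
                                toy_table *shadow)
{
    for (uint32_t page = 0; page < NPAGES; page++) {
        assert(guest->frame[page] < NPAGES);
        shadow->frame[page] = nested->frame[guest->frame[page]];
    }
}
```

Both paths yield the same host physical address; the difference lies in who does the work and when.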
Jörg Rödel & Peter Oruba
-------------------------
I'm happy to announce that AMD has joined the community of contributors to the OpenJDK project. Our participation will focus on Java performance improvements as well as enhancements that will make performance analysis easier.
Our work towards this project is an extension of AMD's on-going relationship with Sun. Even before the HotSpot JVM became an open-source project, we worked closely with the Java teams at Sun towards a common goal of optimizing the JVM, and providing tools to enable Java developers to focus on their own applications' performance enhancements. The next logical step was to join the community of developers who focus on building an even better Java for the increasing breadth of applications that are developed for it.
You can expect to see contributions from us soon.
Ben Pollan
Manager, AMD Java Labs
-------------------------
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.
In AMD JavaLabs one of our activities is to analyze the performance of Java applications or benchmarks and feed performance recommendations to our JVM vendor partners like Sun, IBM, and BEA. While doing this we often write our own small benchmarks to stress performance issues and, like other developers, we sometimes get an unexpected performance decrease from a source change that we expected would either increase performance or at least be performance neutral. Of course we then have to understand the performance decrease. I've added an article with the details of one such investigation where a JVM feature called Escape Analysis can play an important role in an application's performance.
Tom Deneau
-------------------------
Mike and Robin gave a great talk on optimizations and multithreading last week. If you missed it, you can view the webcasts on demand.
-------------------------