AMD Developer Blogs
May 12, 2008
Play the Second Annual AMD Treasure Hunt Game!
Have you been paying attention to the latest trends in multi-core processors and multi-threaded programming? Are you curious about how parallel programming can dramatically improve application (and of course, PC gaming) performance? Are you a code warrior interested in a cool challenge?
If so, you might have what it takes to conquer the AMD Treasure Hunt Game!
» Learn how to play
-------------------------
This response is provided for informational purposes only, is provided “AS IS” and does not obligate AMD to provide any of the services, technology, or programs described.
Edited: 05/12/2008 at 02:56 PM by devcentral
April 28, 2008
Live Webcast of AMD's JavaOne Keynote
Well, it’s going to be a pretty busy week here, as AMD puts the finishing touches on our JavaOne activities. It’s going to be a great conference. Be sure to stop by our booth and don’t forget that AMD’s keynote will be on Wednesday, May 7 @ 5:30pm, presented by Leendert van Doorn, Senior Fellow in our Software Technology Office.
If you aren’t attending the conference, but still want to see the keynote, no worries! There will be a live webcast of the session, accessible from the JavaOne home page <http://java.sun.com/javaone>.
Ben
-------------------------
April 14, 2008
"Barcelona" Processor Feature: SSE Misaligned Access
We all crave high-performing code, and in pursuit of it we optimize algorithms, reorder instructions, unroll loops, avoid branches, reduce pointer usage so that compilers can optimize, replace dynamic allocation with static allocation where the size is known, and so on. Loads and stores deserve particular attention, because memory accesses consume a majority of processing cycles in data-intensive applications. Here, I'll take you through one such optimization: data alignment when using SSE (Streaming SIMD Extensions) instructions.
Why use SSE instructions?
SSE instructions operate on 16 bytes of data in parallel. We can load 16 bytes of data at a time and process all 16 bytes of packed data with a single SSE instruction.
Example: ADDPS xmm1, xmm2 adds the 4 single-precision floating-point elements packed in the xmm1 register to the corresponding elements packed in xmm2 and stores the result back in xmm1.
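As a minimal sketch, the same operation can be reached from C through the SSE intrinsic _mm_add_ps from <xmmintrin.h>. The helper name add4 is made up for illustration, and unaligned loads and stores are used so that the caller's arrays need no particular alignment.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: out[i] = a[i] + b[i] for i = 0..3. */
static void add4(const float *a, const float *b, float *out)
{
    __m128 va  = _mm_loadu_ps(a);      /* load 4 floats, any alignment */
    __m128 vb  = _mm_loadu_ps(b);
    __m128 sum = _mm_add_ps(va, vb);   /* packed single-precision add (ADDPS) */
    _mm_storeu_ps(out, sum);           /* store 4 floats, any alignment */
}
```

One call adds four pairs of floats at once, which is the whole appeal of the packed instructions.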
SSE instructions are widely used in developing computation-intensive multimedia applications. Typically, these applications process large amounts of sequential data through the following steps:
1. Load data from memory
2. Perform computation on the data
3. Store data back to memory
First we will discuss the intricacies involved in optimizing memory operations using SSE instructions on the AMD-K8™ family of processors (first and second generation AMD Opteron™ processors) and then we will discuss the architectural enhancements provided by the "Barcelona" or Family 10h processors (including Quad-Core AMD Opteron and AMD Phenom™ X4 Quad-Core and X3 Triple-Core processors).
SSE provides two types of load and store instructions. The first type is aligned loads and stores (e.g., MOVDQA, MOVAPD, MOVAPS), which require 16-byte-aligned memory addresses. The second type is unaligned loads and stores (e.g., MOVDQU, MOVUPD, MOVUPS), which work on both aligned and unaligned memory addresses. On the AMD-K8 family of processors, the aligned versions of the load and store operations are faster than the unaligned versions even if the memory is 16-byte aligned. For details on the latencies of the various types of load and store instructions, refer to the AMD Software Optimization Guide for AMD Family 10h Processors.
If we use the aligned memory operations without verifying the alignment of the memory address, there are two possible outcomes. If the memory is aligned, the memory operations are fast. If the memory is unaligned, the processor raises an exception and the application crashes (Bang!!!). One solution is to align the input data, which both gains performance and eliminates the exceptions and crashes. This does not always work, since users of the application may not align their data, and enforcing such a rule may be inappropriate at times. The easy solution is to use the safer unaligned loads and stores everywhere, sacrificing performance irrespective of the data alignment.
If you are a programmer looking for the best possible performance, saving every single processing cycle, the solution is to handle both cases: check the alignment of the data at runtime and call the appropriate function for aligned or unaligned data.
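The code to handle aligned and unaligned data is as follows:
if( isAligned(data) )
{
    process_aligned(data);
}
else
{
    process_unaligned(data);
}
// The 16-byte alignment check code is as follows.
#include <stdbool.h>
#include <stdint.h>
bool isAligned(const void* data)
{
    // The % operator is not defined on pointers, so convert to an integer first.
    return (((uintptr_t)data % 16) == 0);
}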
Typically, the process_aligned and process_unaligned routines have identical code except for the type of load and store instructions.
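To make that concrete, here is a sketch of what the two routines might look like for a simple job: scaling an array of floats in place. The routine names come from the text above, but the signature (data, n, k), the dispatcher name process, and the requirement that n be a multiple of 4 are assumptions made for this example only.

```c
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>

/* Assumed signature: multiply n floats by k in place; n is a multiple of 4. */
static void process_aligned(float *data, size_t n, float k)
{
    __m128 vk = _mm_set1_ps(k);
    for (size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_load_ps(data + i);             /* aligned load: faults if misaligned */
        _mm_store_ps(data + i, _mm_mul_ps(v, vk));    /* aligned store */
    }
}

/* Identical except for the load and store intrinsics. */
static void process_unaligned(float *data, size_t n, float k)
{
    __m128 vk = _mm_set1_ps(k);
    for (size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_loadu_ps(data + i);            /* unaligned load: any address */
        _mm_storeu_ps(data + i, _mm_mul_ps(v, vk));   /* unaligned store */
    }
}

/* Runtime dispatch on 16-byte alignment. */
static void process(float *data, size_t n, float k)
{
    if (((uintptr_t)data % 16) == 0)
        process_aligned(data, n, k);
    else
        process_unaligned(data, n, k);
}
```

The alignment check costs one test per call, so it pays off when each call processes a reasonably large buffer.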
Architectural enhancements in AMD Family 10h processors ("Barcelona" processors)
"Barcelona" comes with load instructions that are twice as fast as the previous generation processors. For example, the aligned loads take 2 processor cycles in "Barcelona," compared to 4 processor cycles in the AMD-K8 architecture. This is only the latency of the instruction execution; there could be additional latency depending on the locality of the actual data being present in cache or main memory.
The unaligned loads in "Barcelona" run at the speed of aligned loads if the data is aligned. Thus, it is safe to use unaligned loads whenever the alignment of the data is not guaranteed, which eliminates the runtime check for 16-byte alignment. If the data is unaligned, the instruction is slightly slower than an aligned load, but still faster than the unaligned loads on AMD-K8 processors. The FPU in "Barcelona" has been widened from 64 bits to 128 bits, and the load instructions are fast path instructions. (Note: In AMD-K8 processors, SSE loads are vector path instructions, which block the execution units from executing any other instruction in parallel.)
The above optimizations are not applicable for SSE stores. The unaligned stores are slower than aligned stores even when the data is aligned.
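One way to act on both observations, sketched here under assumptions rather than taken from the optimization guide, is to load with the unaligned instruction but arrange for the stores to be aligned. The function name scale_copy and its signature are hypothetical; the idea is to handle a few elements one at a time until the destination reaches a 16-byte boundary, then run the SSE loop.

```c
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>

/* Hypothetical routine: dst[i] = src[i] * k for i = 0..n-1. */
static void scale_copy(float *dst, const float *src, size_t n, float k)
{
    size_t i = 0;

    /* Scalar prologue: advance until dst + i sits on a 16-byte boundary. */
    while (i < n && ((uintptr_t)(dst + i) % 16) != 0) {
        dst[i] = src[i] * k;
        i++;
    }

    /* Main loop: unaligned loads (cheap on "Barcelona"), aligned stores. */
    __m128 vk = _mm_set1_ps(k);
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(src + i);
        _mm_store_ps(dst + i, _mm_mul_ps(v, vk));
    }

    /* Scalar epilogue for the remaining 0..3 elements. */
    for (; i < n; i++)
        dst[i] = src[i] * k;
}
```

The source pointer is never checked, since an unaligned load works at any address; only the destination alignment is established.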
Ravindra Babu
Software Engineer, AMD
-------------------------
Edited: 04/14/2008 at 12:55 PM by devcentral
April 11, 2008
The Software Optimization Guide Comes to Life!
I'm pleased to announce that we have just published a series of six videos that brings to life some of the key concepts outlined in the Software Optimization Guide for Family 10h Processors. This video series is a companion to the optimization guide, and provides a quick look at some highly useful tips in addition to some examples to illustrate coding best practices.
We hope you find this series valuable, and welcome your feedback. Let us know what you think by commenting on this post. If you have questions about the information contained in the videos, feel free to post a question in our forums.
Happy viewing!
Software Optimization Video Series
-------------------------
April 7, 2008
Come join us at JavaOne 2008!
It's that time of year again. Here in Austin, the bluebonnets are in full bloom, and that can only mean one thing. It's time for the Java Labs to pack our bags and head over to the Moscone Convention Center in San Francisco, so that we can rub elbows with 15,000 of our closest friends at 2008 JavaOne! As a platinum sponsor, AMD is playing a much bigger role at the conference this year. That's because as a platform company, we have a lot to say about Java.
There are a few things you won't want to miss. On Wednesday, May 7 @ 5:30pm, Leendert van Doorn, AMD Senior Fellow, will present a keynote on processor companies' roles in the Java world. On Tuesday, May 6 @ 6:00pm, the Java Labs' own Azeem Jiva and Shrinivas Joshi will present a technical session titled "Virtualizing a Virtual Machine," where the duo will discuss best practices when deploying Java applications in virtualized environments. And of course in the Pavilion there's the booth, the big beautiful booth (so beautiful that I'm thinking of having those designers give my house a makeover), where we will highlight AMD's role in the Java community today and share our vision of the future.
Details can be found on our AMD at 2008 JavaOne page: http://developer.amd.com/EVENTS/JAVAONE/Pages/default.asp.
See you there!
Ben
-------------------------
Edited: 04/07/2008 at 02:44 PM by bpollan
March 26, 2008
Myths and facts about 64-bit Linux®
Myths
- "You don't need 64-bit software with less than 3 GB RAM"
- "There are less drivers for 64-bit OS"
- "You will need all new software, all 64-bit"
- "64-bit software is twice as fast"
AMD's 64-bit architecture extension
AMD64 introduces one new mode to the processor, the Long mode, consisting of the 64-bit mode and the Compatibility mode. The former is the new 64-bit environment, the latter a compatibility implementation to run 32-bit code in that 64-bit environment. The current operating mode is connected to the code segment. By using instructions that change code segments (like syscall) one can switch between both submodes.
Currently the paging algorithm (derived from PAE) limits the physical address to 52 bits, while the virtual address space is 48 bits wide. Even with far less physical RAM, the large virtual address space proves to be helpful for memory mapping.
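To put those widths in perspective, the arithmetic is simply two raised to the number of address bits. The helper name addressable_bytes below is ours, chosen only for this illustration.

```c
#include <stdint.h>

/* Bytes addressable with the given number of address bits (bits < 64). */
static uint64_t addressable_bytes(unsigned bits)
{
    return (uint64_t)1 << bits;
}
```

With this, 32 bits give 4 GiB, 48 bits give 256 TiB of virtual address space, and 52 bits give 4 PiB of physical address space.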
Another important part of the new architecture is the extended number of registers. Both the general purpose registers and the SSE registers have been doubled, so you can now use 16 of each in the new 64-bit mode.
Software support
Support for 64-bit in software required a new ABI to be introduced as well as extending GCC with the new architecture. As an essential part of that, up to six function parameters are now passed in registers. Linux kernel support for 64-bit started by splitting off a new architecture tree, x86-64. As of kernel 2.6.24, it has been merged with the i386 tree into one common code base (x86).
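As an illustration of the register-passing rule, consider a function with six integer arguments. Under the System V AMD64 ABI used on 64-bit Linux, the integer and pointer arguments are assigned to RDI, RSI, RDX, RCX, R8 and R9 in that order. The function sum6 is a made-up example, not something from the kernel or GCC.

```c
/* On x86-64 Linux, a..f arrive in RDI, RSI, RDX, RCX, R8 and R9.
   A seventh integer argument would be passed on the stack. */
long sum6(long a, long b, long c, long d, long e, long f)
{
    return a + b + c + d + e + f;
}
```

Compiling this with gcc -O2 -S and reading the assembly is a quick way to see that no argument is fetched from the stack.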
In addition a compatibility layer (aka compat layer) provides support for execution of 32-bit binaries on 64-bit processors.
32-bit compatibility
Compatibility with 32-bit code has been a crucial goal in the design. First of all, it allows unmodified 32-bit code to run inside a 64-bit environment. Syscalls are used to switch between both worlds.
The processor zero-extends all 32-bit addresses to 64-bit. Applications use the lower 4 GB of the address space. However, physically that can be mapped anywhere. The kernel manages the compat layer for all applications, meaning it resolves bitness differences, structure layouts and invokes the right library version (/lib{32,64}/ld.so).
Speaking of libraries - each library has to be present twice, once for 32-bit and once for 64-bit, which also includes all dependent ones down to the lowest.
With the Linux compat layer it is even possible to run an entire and unmodified 32-bit Linux installation with a 64-bit kernel.
Benchmarks
First we picked some real-world benchmarks for our 32-bit vs. 64-bit comparisons: Oggenc, Mencoder and Povray, as well as some compilation tests. Furthermore, micro benchmarks were used to show specific performance differences for syscalls and 64-bit arithmetic.
We set up three system configurations - a 32-bit installation, a 64-bit installation and a combination of 32-bit installation with 64-bit kernel to challenge the compat layer. All tests were performed on a dual-core AMD-K8(tm) processor with 1 GB RAM.
The tests showed that the penalty of using the compat layer instead of running your 32-bit application on a native 32-bit kernel is about 1-2 percent. So it is almost negligible.
64-bit took the lead in the media encoding tests. Our Povray and Mencoder benchmarks took about 5% less time in the 64-bit case, and Oggenc even 25% less. Only the C compilation tests favored 32-bit, which showed a performance advantage of 5% to 8% over 64-bit.
Native arithmetic performance (64-bit data types used in 64-bit software vs. 32-bit data types used in 32-bit software) showed a gain of 10% for the 64-bit case. When 64-bit data types were used on both 32-bit and 64-bit in the arithmetic performance test, 64-bit was more than twice as fast as 32-bit.
Downsides of 64-bit
A 64-bit execution environment and 64-bit software surely have their downsides, too. First there is the larger memory footprint. Binaries and data structures get larger because of the increased pointer size and 64-bit operands. This leads to a higher memory transfer load and therefore increases cache pressure.
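A small, hypothetical example shows where the extra footprint comes from. The struct below is not taken from our benchmarks; it is just a typical pointer-carrying data structure.

```c
#include <stddef.h>

struct list_node {
    struct list_node *next;  /* 4 bytes on i386, 8 bytes on x86-64 */
    int value;               /* 4 bytes on both */
};

/* Memory needed for count nodes on the platform this is compiled for. */
static size_t list_bytes(size_t count)
{
    return count * sizeof(struct list_node);
}
```

On a typical 32-bit Linux build, sizeof(struct list_node) is 8 bytes. On a typical 64-bit build it is 16 bytes, because the pointer doubles in size and the compiler pads the struct to keep the pointer aligned. A list of a million nodes thus grows from roughly 8 MB to roughly 16 MB.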
Myths revisited
- "You don't need 64-bit software with less than 3 GB RAM"
  - Performance improvement even on systems with less than 3 GB RAM
- "There are less drivers for 64-bit OS"
  - Irrelevant to Linux, hail open source
- "You will need all new software, all 64 bit"
  - 32-bit compat layer performs very well and is transparent
- "64-bit software is twice as fast"
  - Rarely the fact, software is usually optimized for 32-bit
Conclusion
Use a 64-bit system and stick to the compat layer if you need to run certain 32-bit applications.
Andre Przywara, Andreas Herrmann, Peter Oruba
-------------------------
Edited: 03/26/2008 at 05:07 AM by peteroruba
March 18, 2008
Live from EclipseCon 2008
I have just a short break here, but wanted to give you all a quick update on how things are going here at EclipseCon 2008.
The booth has been quite busy, with attendees coming by to fill out our survey and get their 1GB USB drive. We've had a number of people wanting to learn what AMD's relationship is with Eclipse, and they become very interested once they find out what the CodeSleuth plugin can do for their Java development process.
Gary Frost from the AMD Java Labs team delivered his technical session this morning to a full room. After his session, I was flooded at the booth! I'll try to post some pictures when I get a moment.
Gotta go, the hall is opening up again and people are coming by! Be sure to check out CodeSleuth yourself if you're not able to join us at the show.
-------------------------
Edited: 03/24/2008 at 05:04 PM by devcentral
March 13, 2008
Join us at EclipseCon 2008
If you're coming to EclipseCon 2008 next week (March 17-20 in Santa Clara, CA), be sure to come visit the AMD Hardware Lounge or our booths (410/411) in the Exhibit Hall. We'll be showcasing one of the servers we've donated to the Eclipse Foundation to run their backend infrastructure, along with some AMD SPIDER systems.
Gary Frost, one of our AMD Java Labs engineers, will also be delivering a technical session and will be on hand to answer questions about the new plugin for Eclipse we're demonstrating.
Plus, fill out our survey for a chance to win a fun prize!
Get all the details at our AMD@EclipseCon page.
-------------------------
February 21, 2008
Optimizing Inter-Core Data Transfer on AMD Phenom processors
The AMD Phenom family of microprocessors (Family 0x10) is AMD's first generation to incorporate 4 distinct cores on a single die, and the first to have a cache that all the cores share. There is a small subset of compute problems that can be categorized as belonging in a Producer and Consumer paradigm; a thread of a program running on a single core produces data, which is meant to be consumed by a thread that is running on a separate core. With such programs, it is desirable to get the two distinct cores to communicate through the shared cache, to avoid round trips to/from main memory.
With a naïve implementation of a producer/consumer program on the AMD Phenom processor, measured bandwidth will appear to be throttled by main memory speeds. Main memory speeds can vary, but with DDR2 533 memory (average grade), this is around 4 GB/s. Why is this?
There are several architectural details on the AMD Phenom processor that can limit inter-core bandwidth if not properly understood. The type and size of the cache on the AMD Phenom core has a direct effect on bandwidth; it includes a "mostly exclusive victim" cache. The MOESI protocol that the AMD Phenom cache uses for cache coherency can also limit bandwidth; it is important to keep a cache line in the 'M' state for optimal producer/consumer performance. A detailed explanation of the AMD Phenom cache architecture and how this relates to producer/consumer performance can be found in the accompanying paper (coming soon).
Assuming a single buffer has been defined for the producer and consumer threads to walk and communicate, the following bulleted list is a checklist of the constraints to follow to achieve maximum bandwidth:
- The consumer thread needs to 'lag' the producer thread by at least the combined L1 + L2 cache size (modulo arithmetic)
- The producer thread needs to 'lag' the consumer thread by at least the combined L1 + L2 cache size (modulo arithmetic)
- The buffer should be at least 2*(L1 + L2)
- The producer thread should not get so far ahead of the consumer as to flood the L3, if a large buffer is used
- Use prefetchw on the consumer side, even if the consumer does not want to modify the data
- Add a small fudge factor to the calculated sizes to give the threads some 'slack' when communicating through the caches
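The size and lag rules in the checklist can be sketched as a few helper functions. Everything here is an assumption for illustration: the function names, treating the buffer as a ring indexed in bytes, and using the GCC/Clang builtin __builtin_prefetch with its write hint as a stand-in for prefetchw (whether the compiler actually emits PREFETCHW depends on the target options).

```c
#include <stddef.h>

/* Minimum distance in bytes to keep between the threads: L1 + L2 + slack. */
static size_t min_lag(size_t l1, size_t l2, size_t slack)
{
    return l1 + l2 + slack;
}

/* Smallest ring buffer that lets both lag rules hold at the same time. */
static size_t min_buffer(size_t l1, size_t l2, size_t slack)
{
    return 2 * min_lag(l1, l2, slack);
}

/* Bytes by which position `ahead` leads `behind` in a ring of `size` bytes. */
static size_t ring_distance(size_t ahead, size_t behind, size_t size)
{
    return (ahead + size - behind) % size;
}

/* Nonzero if the consumer trails the producer by at least `lag` and the
   producer, wrapping around the ring, trails the consumer by at least `lag`. */
static int lag_ok(size_t producer, size_t consumer, size_t size, size_t lag)
{
    size_t d = ring_distance(producer, consumer, size);
    return d >= lag && (size - d) >= lag;
}

/* Consumer-side prefetch with write intent (second argument 1). */
static void consumer_prefetch(const void *p)
{
    __builtin_prefetch(p, 1, 3);
}
```

For example, with a 64 KB L1 data cache, a 512 KB L2 and 4 KB of slack (values to be checked against the actual part), the lag works out to 593,920 bytes and the minimum buffer to 1,187,840 bytes.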
In general, the AMD Phenom cache is optimized for widely shared data, i.e. one core produces data that many other cores may be interested in. In the producer/consumer program, however, it is known ahead of time that the data the producer creates is only interesting to the matching consumer thread, and not to any other thread. Following the constraints listed above, it is possible to achieve an aggregate bandwidth of about 12 GB/s for two producer/consumer pairs (to use all 4 cores) on the AMD Phenom processor.
Kent Knox
Member of AMD Technical Staff
-------------------------
Edited: 02/22/2008 at 11:33 AM by kknox
February 20, 2008
Boosting KVM's performance with nested paging
As opposed to XEN, KVM is an in-kernel hypervisor for Linux™ that lets you run unmodified guests like Linux (both 32-bit and 64-bit) as well as Windows™ in every kind of flavor. KVM requires hardware support like AMD's SVM (Secure Virtual Machine) to accelerate full virtualization. Practically speaking, it is provided as a kernel module that comes in two pieces, a generic one and a second AMD-specific one. Our most recent development for KVM is support for the new K10 hardware feature called "nested paging".
Virtualization performance depends heavily on the efficiency of the hypervisor's virtual memory management. Here is where nested paging comes into play. The typically time-consuming process of mapping guest physical addresses to host physical addresses no longer has to be done in software, but can be done in hardware instead. Until now, memory management in software was achieved using shadow paging, which proved to be a major performance bottleneck. Nested paging is a feature that offloads this address mapping to hardware.
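The difference between the two schemes can be shown with a deliberately simplified toy model. It assumes single-level tables of eight 4 KB pages with made-up contents; real AMD64 page tables are multi-level, and none of the names below come from KVM.

```c
#include <stdint.h>

#define PAGE_SHIFT 12   /* 4 KB pages */
#define NPAGES     8    /* toy address space: 8 pages */

/* index = page number, value = frame number (contents are arbitrary) */
static const uint32_t guest_pt[NPAGES]  = {3, 1, 4, 0, 5, 2, 7, 6}; /* guest virtual  -> guest physical */
static const uint32_t nested_pt[NPAGES] = {6, 2, 7, 0, 1, 3, 5, 4}; /* guest physical -> host physical  */

/* One table lookup; addr must lie within the NPAGES toy pages. */
static uint32_t translate(const uint32_t *table, uint32_t addr)
{
    uint32_t page   = addr >> PAGE_SHIFT;
    uint32_t offset = addr & ((1u << PAGE_SHIFT) - 1);
    return (table[page] << PAGE_SHIFT) | offset;
}

/* Nested paging: the hardware walks both tables on a TLB miss. */
static uint32_t nested_walk(uint32_t guest_virtual)
{
    uint32_t guest_physical = translate(guest_pt, guest_virtual);
    return translate(nested_pt, guest_physical);
}

/* Shadow paging: the hypervisor keeps a combined table in software and
   must rebuild or patch it whenever the guest edits its page table. */
static void build_shadow(uint32_t shadow[NPAGES])
{
    for (int i = 0; i < NPAGES; i++)
        shadow[i] = nested_pt[guest_pt[i]];
}
```

In this model, build_shadow stands for the software work that shadow paging repeats on every guest page-table change, while nested_walk stands for the work that nested paging leaves to the hardware.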
Our KVM patch also includes live migration to/from either paging method. As far as we can tell, it will be included in KVM version 61, which will probably come with kernel version 2.6.26. So what does nested paging buy you in terms of numbers? We set up a KVM host system with a Linux guest and ran kernbench, a kernel compilation benchmark that performs several compile runs and gives a good indication of overall system performance. Our guest system gains about 30% in performance with nested paging enabled as compared to shadow paging. This brings a key goal of virtualization, native performance, much closer: guest performance improved from 60% to 90% of native when compiling a kernel from memory, which shows KVM to be a reasonable alternative to other virtualization solutions.
Jörg Rödel & Peter Oruba
-------------------------
February 19, 2008
AMD Joins the OpenJDK Project
I'm happy to announce that AMD has joined the community of contributors to the OpenJDK project. Our participation will focus on Java performance improvements as well as enhancements that will make performance analysis easier.
Our work towards this project is an extension of AMD's on-going relationship with Sun. Even before the HotSpot JVM became an open-source project, we worked closely with the Java teams at Sun towards a common goal of optimizing the JVM, and providing tools to enable Java developers to focus on their own applications' performance enhancements. The next logical step was to join the community of developers who focus on building an even better Java for the increasing breadth of applications that are developed for it.
You can expect to see contributions from us soon.
Ben Pollan
Manager, AMD Java Labs
-------------------------
February 15, 2008
New Developer Central article on Escape Analysis in Java
In AMD JavaLabs one of our activities is to analyze the performance of Java applications or benchmarks and feed performance recommendations to our JVM vendor partners like Sun, IBM, and BEA. While doing this we often write our own small benchmarks to stress performance issues and, like other developers, we sometimes get an unexpected performance decrease from a source change that we expected would either increase performance or at least be performance neutral. Of course we then have to understand the performance decrease. I've added an article with the details of one such investigation where a JVM feature called Escape Analysis can play an important role in an application's performance.
Tom Deneau
-------------------------
Edited: 04/30/2008 at 01:02 PM by tdeneau
February 5, 2008
Did you miss the webcast last week? It's on-demand now!
Mike and Robin gave a great talk about optimizations and multithreading last week. If you missed it, you can view the webcasts on-demand. View the on-demand webcasts.
-------------------------
January 29, 2008
Optimizations, Parallel Programming Techniques, New CPU Features, oh My!
There are two webcasts coming up this Thursday and Friday through MSDN India that feature AMD's very own software optimization gurus, Michael Wall and Robin Maffeo. If you write native code targeted at Windows apps, you can't afford to miss these webcasts.
Multicore is here! But how do you resolve data bottlenecks in Native Code?
AMD's new "Barcelona" processor family is at the center of the new wave of CPU architectures. These new architectures mean new opportunities for both native and managed code. So, what opportunities do processor enhancements such as native quad-core processors, improved IPC, and L3 caches mean for your solution? This session will describe enhancements made to the Microsoft .NET Framework regarding code generation improvements and cache system pressure as a result of more cores per single chip. Developers will learn how to choose between server and workstation garbage collectors for performance, object creation tips, about lock optimizations in the CLR, and how (not) to deal with threads when developing managed code solutions. NUMA (non-uniform memory access) considerations that benefit managed code execution will also be explored. For native code development, specific Microsoft Visual C++ 2008 compiler intrinsics, compiler switches (i.e. the "/favor" flag), and other compiler and linker options will be demonstrated to enable a more informed evaluation of these options during software solution performance evaluation.
Speaker: Robin Maffeo
January 31, 2008 | 9:30PM - 11:00PM (PST)
Click here to Register
Empowering Developers: AMD x86 and x64 Performance Considerations when using Microsoft Visual Studio 2008
The industry is rushing to multi-core processors. The Quad-core AMD "Barcelona" family processors, including Third Generation AMD Opteron processors and coming client offerings, will integrate four complete processor cores on a single chip. In this session we will show you how to optimize your code to take advantage of these cores by feeding them with the data they require. The AMD "Barcelona" processor family implements new cache and memory features to address the data bottleneck, and we will show you the basic architecture so you can successfully optimize your applications. This session will explain details of the "Barcelona" processors' innovative three-level cache system and the improved integrated memory controller. Topics include the automatic data prefetcher, L1/L2/L3 cache behavior, and explicit cache management instructions. You will learn from clear examples how to use non-temporal data, and best practices for memory and thread affinity on multi-socket "Barcelona" platforms. Takeaways include multi-threaded code and data-parallel optimization techniques, demonstrated in C/C++ using Microsoft Visual Studio 2008.
Speaker: Michael Wall
February 01, 2008 | 9:30PM - 11:00PM (PST)
Click Here to Register
-------------------------
Edited: 01/30/2008 at 01:21 AM by devcentral
December 10, 2007
Introducing the AMD Operating System Research Center Blog
Hello from AMD's OSRC. Let me give a brief introduction of who we are and what our job at AMD is. Founded one and a half years ago, we have grown to a team size of roughly two dozen people, spread over two sites, namely Dresden and Austin.
We are AMD's competence center for operating system related topics like OS research, Linux kernel development, virtualization technology and system testing. In a CPU's early design phase, we provide feedback to the architecture engineers when new features are discussed. Later we prepare and enable the use of new CPU features by the Linux® kernel or hypervisors like XEN.
Most of our Linux-related work can be found on the Linux Kernel Mailing List, to which we submit patches for all kinds of AMD-specific code.
Furthermore we run amd64.org which also provides a couple of AMD related mailing lists.
In this blog, we'll be sharing updates on our work here in the OSRC to help you stay informed. Stay tuned.
-------------------------
Edited: 12/10/2007 at 05:09 AM by peteroruba
November 29, 2007
We Left Our Hearts in Barcelona
AMD Developer Outreach had a big, shiny presence at the recent Microsoft® TechEd Developers 2007 event, which took place in Barcelona, Spain during the week of November 5th (see photo montage, below). We enjoyed meeting all of the delegates who stopped by our booth and/or who attended our two breakout sessions. Many of the delegates we met hailed from the farthest reaches of the EMEA region. By the end of our week in Spain's capital of cool, we had come to admire its people, its culture and especially its cuisine!
One of AMD's primary goals for sponsoring TechEd Developers 2007, which drew close to 4,000 delegates, was to educate managed code developers about the various tools, libraries and optimizations that AMD makes freely available through AMD Developer Central (developer.amd.com). We also sought to communicate the steps that managed code developers can take to optimize their code for multi-core processor platforms. In other words, this stuff ain't just for native-code junkies.
AMD's Mike Wall gave a breakout session on the topic of, "Multi Core is Here! But How Do You Resolve Data Bottlenecks in Native Code?" This well-attended session provided delegates with techniques for multi-threaded code and data-parallel optimizations as demonstrated in C/C++ using Microsoft Visual Studio 2008.
In addition, AMD's Robin Maffeo gave a breakout session on, "Empowering Developers: x86 and x64 Performance Considerations when Using Microsoft Visual Studio 2008." Robin's session also drew a sizable crowd, and Robin detailed many of the multi-threading and CLR enhancements made to the Microsoft .NET Framework in collaboration with AMD. Robin further described several compiler and linker switches and options that can help developers quickly achieve highly optimized and multi-core aware code.
AMD's display booth, located front and center within the CCIB exhibition hall, also offered a number of hands-on demos, as well as an Xbox 360 lounge for Halo 3 enthusiasts. AMD's Brent Hollingsworth extolled the virtues of the AMD Performance Library, a series of low-level math routines and higher-level functions to help optimize signal and image processing. AMD's John McCrae provided real-time demos of Visual Studio 2008 performance comparisons with Visual Studio 2005, particularly focusing on using the right compiler options at the right time. Mike Wall demonstrated how developers can use AMD CodeAnalyst™ to identify bottlenecks and hot spots in managed code applications within Visual Studio 2008. Finally, Robin Maffeo demonstrated key techniques on how to get the most out of managed code for both client and server applications.
We really enjoyed our time in Barcelona and we'd love to hear your feedback about how AMD Developer Outreach can make future events even more useful and relevant to our growing developer community!
AMD's Mike Wall demonstrates AMD CodeAnalyst™ performance analyzer.
AMD's Robin Maffeo demonstrates Microsoft® Visual Studio 2008 compiler optimizations to delegates.
Conference delegates take a break at AMD's Xbox 360 lounge.
Edited: 01/28/2008 at 12:12 PM by AMD Developer Blogs Moderator
|
|
|
November 13, 2007
| |
Intro to the AMD Java Labs
Welcome to the new blog series from the AMD Java Labs team. You may be wondering why a processor company is doing anything with Java at all. It may make more sense when instead of thinking of us as a processor company, you think of us as a company concerned with processing in general, and how to optimize your processing power.
Java application performance optimization requires analysis of different tiers of the application stack. AMD's Java Labs focuses on the interaction of those tiers with the JVM. The team analyzes the performance of Java workloads and uses hardware-assisted profiling tools to determine areas of the JVM that can be optimized. Java Labs collaborates with the major JVM vendors so that these performance enhancements can be rolled into their products. This takes the form of feedback to the individual vendors or submissions to open source JVM projects.
To ensure that performance requirements are met for the variety of Java applications that exist, it's important that the workloads that are analyzed represent a cross-section of existing application architectures. To meet this need, the test suite that the team uses is a combination of industry standard benchmarks and internally developed applications. As part of AMD's commitment to the standardization of workloads, the team contributes to the Java benchmarks of the Standard Performance Evaluation Corporation (SPEC). These benchmarks are widely used by the JVM vendors to gauge performance.
In addition to our guidance on JVM performance for processors that exist or are about to enter the market, the team works with silicon designers to enhance support of Java on future AMD platforms. This will help Java developers continue to meet their customers' increasing performance expectations over time.
We look forward to an interactive dialogue with you, to help us better meet your needs. Please take some time to leave us comments, or post questions to the developer forums.
You'll be hearing from us soon.
Ben Pollan, Tom Deneau, Gary Frost, Azeem Jiva, Shrinivas Joshi, Adam Preble, Vasanth Venkatachalam
|
|
|
November 6, 2007
| |
¡Live from Barcelona!
While you are working hard on your next project, the Developer Outreach team is also working hard in Barcelona talking about all the great tools and techniques we have for optimizing your code for quad-core processors (and of course we're taking the occasional break to sample the local cuisine).
Yesterday, Soma Somasegar, Corporate VP for Microsoft's Developer Division, delivered the keynote here at TechEd-Developers. He officially announced the impending release of Visual Studio 2008 ("by the end of November") and let us know that the rest of the Team System products will be coming shortly.
Dan Fernandez, lead Product Manager for Visual Studio, delivered a compelling demonstration of how a partner used the VS Industry Partner Program to create a customization shell for World of Warcraft. This was made possible by a couple of changes in VSIP licensing: free redistribution of the shell, and changes that make it easier to use VS output on non-Windows devices. Dan also discussed Microsoft Popfly, an online tool for easily creating and sharing mashups. Very cool.
Microsoft Sync is a cool new product that was announced this week. It consists of a framework and tool set to make it much easier to synchronize your data across multiple sources in a couple of key scenarios: first, there are times when you need to capture live data but are disconnected. With Sync you can capture this data and store it locally, then sync it with backend systems when connected again - think of roving sales staff who are only sometimes connected but who need to capture data constantly, for example. The second scenario is around peer-to-peer data sharing. In a demo in the Microsoft booth, they show how a customer contact is stored in SQL Server Express by a desktop app then sync'd to Outlook by a sync service and further synchronized to a hand-held. You can see that there are huge implications for moving your important data around in a timely fashion. For more information, go to http://msdn.microsoft.com/sync
We're off to a great start here so if you're in town, come over and say hello. We'd love to chat with you.
|
|
|
October 2, 2007
| |
"Barcelona" Processor Feature: SSE4a, part 2
This is a follow-up to the first post on the SSE4a instruction set. While shuffling data around in registers is extremely important (as we mentioned in the last entry on SSE4a), one of the primary performance bottlenecks comes from loading and storing data. Even if a processor executes instructions really fast when working with just registers, a single memory access that goes out to DRAM can cost close to 50 nanoseconds, which at clock speeds of 2 GHz and up means 100 cycles or more. The SSE extensions already have instructions that help reduce this bottleneck. SSE4a complements these instructions with two of its own:
- MOVNTSS
- MOVNTSD
Before I move on to the applications of these instructions, let me provide some context for what the MOVNT* instructions are really useful for, and why.
Almost all user data lives in what is called "Write Back" memory. This means multiple things, but the short version is that Write Back is meant to be the most heavily cached memory type there is. The following description outlines what happens on reads or writes to Write Back memory. (For the sake of simplicity, I am not going to delve into the different combinations of the data being in the L1 or L2 cache. Assume that a cache hit means that data is either in L1 or L2.)
- Read
  - Cache hit: Data is read from the cache line to the target register
  - Cache miss: Data is moved from memory to the cache, and read into the target register
- Write
  - Cache hit: Data is moved from the register to the cache line*
  - Cache miss: The cache line is fetched into the cache, and the data from the register is moved to the cache line*
*As per MOESI cache coherency protocol rules, in both write cases the cache line is marked as modified.
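The read and write rules above can be modelled with a toy cache. This is only a sketch for building intuition, not a description of any real AMD cache: it assumes a direct-mapped layout, ignores the L1/L2 split and the write-back of evicted lines, and every name in it (`toy_cache`, `toy_cache_access`, `NUM_LINES`) is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 64u    /* cache line size */
#define NUM_LINES  512u   /* hypothetical capacity: 512 lines * 64 B = 32 KB */

/* Toy direct-mapped cache: one slot per index, no associativity. */
typedef struct {
    uint64_t      tag[NUM_LINES];       /* which memory line occupies each slot */
    unsigned char valid[NUM_LINES];
    unsigned char modified[NUM_LINES];  /* set once a write has touched the line */
    unsigned long hits, misses;
} toy_cache;

void toy_cache_init(toy_cache *c)
{
    assert(c != NULL);
    memset(c, 0, sizeof *c);
}

/* Simulates one access to Write Back memory. Returns 1 on a hit, 0 on a miss. */
int toy_cache_access(toy_cache *c, uint64_t addr, int is_write)
{
    assert(c != NULL);
    uint64_t line = addr / LINE_BYTES;
    unsigned idx  = (unsigned)(line % NUM_LINES);
    int hit = c->valid[idx] && c->tag[idx] == line;

    if (hit) {
        c->hits++;
    } else {
        /* Miss: the line is fetched into the cache for reads AND for writes. */
        c->misses++;
        c->valid[idx]    = 1;
        c->tag[idx]      = line;
        c->modified[idx] = 0;
    }
    if (is_write)
        c->modified[idx] = 1;  /* hit or miss, a written line ends up modified */
    return hit;
}
```

Calling `toy_cache_access` with `is_write` set on an address that was never read shows the point of the write-miss rule: the store still claims a cache slot, even if nothing ever reads that line back.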
Thanks to locality of reference, writes typically go to a memory location that has recently been read, and with this architecture such writes become extremely fast. Unfortunately, the same procedure is followed even when you know that the data is not being written to a recently read location. On every such write the cache line is still fetched into the cache, causing what is called cache pollution.
Cache pollution is bad, bad, bad. Considering that we have only 32k for our L1 data cache, if we're reading data from one location and writing it to another, we are literally losing half our cache lines, which makes the hardware prefetcher rather ineffective. Plus, other memory accesses between this read and write end up running out of cache lines as well. Remember, cache lines are 64 bytes, so even if only one byte of data in a megabyte needs to be cached, an entire 64 bytes are used up.
This is where the non-temporal store comes to our rescue. These instructions, first of all, do NOT update the cache line, but instead write directly to memory. Along with that, they "write combine" memory, meaning they do not write data to memory immediately but instead wait for 64 bytes to accumulate. Once that threshold is reached (or one of several other triggers fires), the buffer is written to DRAM in one shot. Of course, this also means that the data may not reach memory in program order, or at the moment the write was executed. To flush out the write combine buffer, the SFENCE (store fence) instruction needs to be used.
I've noticed gains of 2x or more on simple operation loops (something like a load, add, store) working on large pieces of data when I switched the stores to non-temporal. This is a HUGE gain, and most of it comes from the store time, which is a major bottleneck in operations like this. I've found this ideal typically for large buffers (~1MB+).
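A load, add, store loop of the kind described above might look like the following minimal sketch. It uses the standard SSE intrinsics `_mm_stream_ps` (MOVNTPS) and `_mm_sfence`; the function name `add_constant_stream` and its parameters are hypothetical, and the code assumes an x86 compiler with SSE enabled, a 16-byte aligned destination, and an element count that is a multiple of four.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE: _mm_loadu_ps, _mm_add_ps, _mm_stream_ps, _mm_sfence */

/* Adds `delta` to each of the n floats in src and writes the results to dst
 * with non-temporal (streaming) stores, so dst does not displace cache lines. */
void add_constant_stream(float *dst, const float *src, size_t n, float delta)
{
    assert(((uintptr_t)dst % 16) == 0);  /* MOVNTPS needs a 16-byte aligned target */
    assert(n % 4 == 0);                  /* sketch only: no scalar tail handling */

    const __m128 d = _mm_set1_ps(delta);
    for (size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_loadu_ps(src + i);          /* ordinary cached load */
        _mm_stream_ps(dst + i, _mm_add_ps(v, d));  /* store with the NT hint */
    }
    _mm_sfence();  /* flush the write-combining buffers before returning */
}
```

Whether this beats ordinary stores depends on buffer size and what else is competing for the cache, so it is worth measuring on your own data rather than assuming the gain.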
If I wanted to write register after register to memory, this would work fine. However, if you're working on only part of the register (e.g. with scalar SSE instructions) and you want to write just that part, things get complicated. Until now there has not been an instruction that would store only the SS or SD part of the register, so any NT* memory write from an SSE register would span a full 16 bytes.
*I often refer to these stores as either NT stores/writes, or stores/writes with the NT hint. Keep in mind, though, that these stores are often referred to as "streaming stores." All compiler intrinsics that map to these instructions are named _mm_stream*.
Of course, with the AMD "Barcelona" processors, we now have these two new instructions:
- MOVNTSS: This instruction writes the least significant 32 bits of a register to memory using the non-temporal hint. For example, a loop that performs scalar single-precision floating point math on a large array can use the SSE registers and MOVNTSS to store results to memory.
- MOVNTSD: This instruction writes the lower 64 bits of a register to memory using the non-temporal hint. It can be used for similar purposes as the MOVNTSS instruction, but typically for double-precision floating point data.
Before these two instructions were available, there really was no way to do either of these stores with the NT hint. With these two new instructions, SSE4a rounds out the NT instruction set so that it more completely matches our set of normal stores.
Support for these two new instructions, and for all the SSE4a instructions, is detected with the CPUID instruction. Specifically, ECX bit 6 will be set for CPUID function 8000_0001h.
- Rahul Chaturvedi
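As a rough sketch of the CPUID check and the scalar NT store described in this post, the code below tests ECX bit 6 of CPUID function 8000_0001h and uses MOVNTSS only when it is reported. It assumes GCC or Clang on x86 (`__get_cpuid` from `<cpuid.h>`, `_mm_stream_ss` from `<ammintrin.h>`, and the `target("sse4a")` function attribute); the function names `has_sse4a`, `scale_stream_sse4a` and `scale_store` are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <cpuid.h>      /* GCC/Clang helper: __get_cpuid */
#include <xmmintrin.h>  /* _mm_load_ss, _mm_set_ss, _mm_mul_ss, _mm_sfence */
#include <ammintrin.h>  /* SSE4a intrinsics: _mm_stream_ss */

/* Returns 1 if CPUID function 8000_0001h reports SSE4a (ECX bit 6), else 0. */
int has_sse4a(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx))
        return 0;  /* extended leaf not available on this CPU */
    return (int)((ecx >> 6) & 1u);
}

/* Scalar loop whose results are stored with MOVNTSS. The target attribute
 * lets this one function use SSE4a without building the whole file for it. */
__attribute__((target("sse4a")))
static void scale_stream_sse4a(float *dst, const float *src, size_t n, float k)
{
    for (size_t i = 0; i < n; ++i) {
        __m128 v = _mm_mul_ss(_mm_load_ss(src + i), _mm_set_ss(k));
        _mm_stream_ss(dst + i, v);  /* writes only the low 32 bits, NT hint */
    }
    _mm_sfence();  /* flush the write-combining buffers */
}

/* Multiplies each element by k, using NT scalar stores only where supported. */
void scale_store(float *dst, const float *src, size_t n, float k)
{
    assert(dst != NULL && src != NULL);
    if (has_sse4a()) {
        scale_stream_sse4a(dst, src, n, k);
    } else {
        for (size_t i = 0; i < n; ++i)  /* portable fallback: normal stores */
            dst[i] = src[i] * k;
    }
}
```

The runtime check matters because executing MOVNTSS on a processor without SSE4a raises an invalid-opcode fault; MOVNTSD can be handled the same way through `_mm_stream_sd`.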
Edited: 11/01/2007 at 07:01 PM by devcentral
|
|
|
September 27, 2007
| |
Coding Tips for Sun Studio on AMD64
A new quick reference sheet is available that outlines some coding tips for Sun Studio on AMD64. It includes tips on compiler optimization flags t | |