I was fortunate to host a panel discussion on application development and multi-core at CommunityOne West this year. It was a fantastic opportunity to meet and work with software experts who are in the trenches, working on parallel programming solutions every day. The basic question was: "How do I get started in taking advantage of multi-core processors?" Everybody involved brought unique experiences and perspectives to the table in answering it. The link above shows a view of AMD's roadmap; the takeaway, from our perspective, is that from the desktop to the server, multi-core will be king.
Check out the AMD Developer Inside Track video for a snapshot of three of our partners from this panel and myself answering the question of how to start taking advantage of multi-core processors.
After these events I often get asked the same how-to-get-started question, but with more detail. Someone will say, "Okay, but let me tell you about this..." - so we talk it over. The questions I ask usually include at least some of the following:
Who do you work for?
What field are you in?
What are you trying to do?
Where is your code spending the most time now?
What are your primary bottlenecks (CPU, I/O, Memory)?
Do you need to scale up, or scale out?
Are you trying to reduce response time?
Are you trying to increase throughput?
Where and how big is your data?
What are your data dependencies?
Are you using a managed runtime environment?
What tools are you using?
Are you open to using other tools?
Will you be able to rewrite code?
Who have you talked to in researching your problem?
Do you have an n-tier infrastructure?
What hardware are you using right now?
What are your hardware upgrade plans?
These questions help decompose the problem and also provide a high-level view. I find these discussions often touch on a mix of abstract principles combined with some specific practical advice. Below, I have some basic getting-started suggestions which I've mapped to the above questions, along with my perspectives on how they bear on the problem. For simplicity's sake, I've decided to map a question once to a single suggestion, though it may really have multiple applications.
Suggestion: Identify your problem domain.
Relevant questions: Who do you work for? What field are you in?
Perspectives: Telecommunications, financial services, manufacturing, scientific programming & HPC, web services, database, ERP/CRM, BI: for these and many other segments there is typically an ecosystem of software tools for building products and solutions, in many cases with significant experience in parallelism.
Suggestion: Don't be afraid to ask for advice -- talk to your community of experts.
Relevant questions: Who have you talked to in researching your problem?
Perspectives: Your community of experts can be found at conferences, in online forums, and at your tools vendors.
Suggestion: Clearly define your performance problem and the associated metrics.
Relevant questions: What are you trying to do? Do you need to scale up, or scale out? Are you trying to reduce response time? Are you trying to increase throughput?
Perspectives: This is critical in explaining the problem to yourself and others. It should be a simple, easy-to-understand statement that includes a baseline.
Suggestion: Analyze and identify primary bottlenecks.
Relevant questions: Where is your code spending the most time now? What are your primary bottlenecks (CPU, I/O, Memory)? Where and how big is your data? What are your data dependencies? Do you have an n-tier infrastructure?
Perspectives: If you don't know the answers to these questions then you need to do some analysis. Diagram your infrastructure. Use performance analysis tools found in your OS and from your tools vendors. There are usually a few places in your code where most of the time is spent. As with any optimization effort, you'll analyze first, then re-measure and re-analyze throughout your parallelization effort.
Suggestion: Review alternate algorithms.
Relevant questions: Will you be able to rewrite code?
Perspectives: After some initial analysis you should take a high-level look at your overall algorithm. It may not be the best choice. It may also place constraints on how easily you can parallelize.
Suggestion: Review current tools and look for acceptable alternates.
Relevant questions: Are you using a managed runtime environment? What tools are you using? Are you open to using other tools? Will you be able to rewrite code?
Perspectives: This is often closely related to the problem domain and associated business requirements. Maybe you can take a new Fortran compiler that supports parallelization with OpenMP, or maybe you need to focus on a new math library.
Suggestion: Review current hardware and evaluate new hardware.
Relevant questions: What hardware are you using right now? What are your hardware upgrade plans?
Perspectives: Along with looking at the architectural and tools aspects of your software, think about how much you could improve your basic situation with new hardware, be it more RAM, more or faster CPUs, or bigger or faster disks.
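To make the "analyze and identify primary bottlenecks" suggestion concrete, here is a minimal sketch that uses Python's built-in cProfile module to see where a program spends its time. The workload functions (parse_input, crunch_numbers, run_pipeline) are hypothetical stand-ins invented for this illustration; for native code, a tool such as CodeAnalyst or your OS profiler would play the same role.

```python
import cProfile
import pstats


def parse_input(n):
    # Hypothetical cheap phase: build the input data.
    return list(range(n))


def crunch_numbers(values):
    # Hypothetical expensive phase: most of the run time should land here.
    total = 0
    for v in values:
        for k in range(50):
            total += (v * k) % 7
    return total


def run_pipeline(n=20_000):
    return crunch_numbers(parse_input(n))


def profile_pipeline():
    """Profile run_pipeline and return {function name: cumulative seconds}."""
    profiler = cProfile.Profile()
    profiler.enable()
    run_pipeline()
    profiler.disable()
    stats = pstats.Stats(profiler)
    # stats.stats maps (file, line, name) -> (cc, nc, tottime, cumtime, callers)
    return {key[2]: entry[3] for key, entry in stats.stats.items()}


if __name__ == "__main__":
    times = profile_pipeline()
    for name, seconds in sorted(times.items(), key=lambda kv: -kv[1])[:5]:
        print(f"{name:20s} {seconds:.4f}s")
```

In a run like this, the expensive phase should dominate the listing, which tells you where a parallelization effort is likely to pay off first.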
In conclusion, I want to emphasize that, after carefully stating your problem and doing some initial analysis, you should try new implementations with caution. Measure with appropriate precision, and make sure your measurements are repeatable. Only then can you be sure that your work is worthwhile. Finally, take a look at AMD Developer Central for parallelization articles, our CPU analysis tool CodeAnalyst, and our performance libraries.
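One way to check that measurements are repeatable is to time the same work several times and look at the spread before comparing a new implementation against the baseline. The sketch below is only an illustration; the function names, the default of seven repeats, and the choice of the median as the summary are assumptions of this example, not a prescribed method.

```python
import statistics
import time


def measure(func, repeats=7):
    """Time func() several times and summarize the runs in seconds."""
    runs = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        runs.append(time.perf_counter() - start)
    median = statistics.median(runs)
    return {
        "runs": runs,
        "min": min(runs),
        "median": median,
        # Relative spread: a large value suggests the measurement is noisy.
        "spread": (max(runs) - min(runs)) / median,
    }


def speedup(baseline, candidate):
    """Ratio of median times; above 1.0 means the candidate is faster."""
    return baseline["median"] / candidate["median"]
```

If the spread is large compared with the improvement you are claiming, the comparison probably needs more runs or a quieter machine before it can be trusted.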
Be sure to check out the first AMD Developer Inside Track video featuring three of AMD's software tools partners giving their perspectives on taking advantage of multi-cores.
-Tracy Carver, Software Developer and Evangelist, AMD
-------------------------
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.
I'm a member of AMD's software division (and yes, you read it correctly - I said software). It turns out that a lot of people are surprised to hear that AMD has a software division. I can't count the number of times that we've been at tradeshows showing off the AMD CodeAnalyst Performance Analyzer or our Performance Libraries and people have wondered why the heck AMD was at a software developer conference. The answer is simple: you can't run the hardware without software. We have a significant investment in software within AMD and with our software partners. I've vowed to do my part to get you behind-the-scenes, one-on-one time with AMD software developers and our software partners to get the scoop on what AMD is doing that matters to software developers.
Next month we will be talking with Michael Houston about OpenCL. And we have a multitude of topics planned for the rest of the year. If you have a topic in mind, let us know by making a comment on this blog post, or on our forums.
-Sharon Troia, Sr. Developer Relations Engineer
P.S. If you experience any viewing problems, please let me know. We will be adding some different formats, lower-resolution versions to download, and transcripts over the next two weeks.
-------------------------
AMD is a close collaborator with Microsoft. We work together to help ensure the operating system runs smoothly and efficiently on AMD platforms. Here are some of the key technology collaborations for Windows® 7:
Power Management: AMD worked closely with Microsoft to support a new AMD product-specific power management driver in Windows 7. This in-box driver supports older processors as well as the latest generation AMD Opteron™ processor and AMD Phenom™ II processor. In addition to the power management driver, AMD collaborated with Microsoft to fine-tune default power policy parameters that control power state transitions to help optimize for power and performance. And since this driver is "in-box", there's no need to download it.
Virtualization: AMD provided code to Microsoft and worked with the Hypervisor teams to help ensure that Hyper-V R2 and Windows Virtual PC in Windows 7 utilize Rapid Virtualization Indexing (aka nested page tables) for improved performance of VM guests. All of the third-generation AMD Opteron processors, AMD Phenom processors, and AMD Phenom II processors support Rapid Virtualization Indexing. In addition, most of AMD's shipping processors (other than AMD Sempron™ processors) include AMD-V™ technology and thus support Windows XP Mode for Windows 7.
Stability & Performance: Current and upcoming reference platforms containing multi-core processors from AMD were loaned to Microsoft's labs to vet out potential incompatibilities with Windows 7 and Windows Server 2008 R2.
Graphics: AMD has been working hard to support DirectX® 11, and plans to have native DirectX 11 hardware in its ATI Radeon™ GPUs available when Windows 7 is released.
GPU Compute: DirectX 11 Compute Shader (CS) is a new API in Windows 7 that helps enable rich applications through the use of compute on the GPU (General Purpose GPU or GPGPU). Rich experiences such as drag-and-drop media transcoding, physics, and AI are a few areas that DirectX 11 CS can help enable.
For more information on AMD and Microsoft technical collaboration visit the Windows Zone on developer.amd.com. For more information on what AMD is doing overall with Microsoft for end users, check out the Microsoft & AMD corporate site, or see the AMD video on Microsoft's Ready. Set. 7 site.
-Robin Maffeo, Microsoft Alliance Manager
-------------------------
Here at AMD, we know that in order to improve program performance, you have to be able to measure it. AMD's Lightweight Profiling feature (LWP) is designed to make performance measurement even easier and with negligible overhead. In this post, I'll give you an overview of LWP and tell you why we think it's an exciting next step in the area of performance tuning.
First, a little history. Late in 2007, AMD announced Lightweight Profiling as a proposed extension to the AMD64 architecture that would allow an application to gather performance statistics about itself with low overhead. We posted the preliminary specification and asked for feedback from the developer community. Much to our delight, many of you responded with comments, criticisms, and suggestions on the proposal. We've read all of your feedback, and last week we posted the current version of the LWP specification. The announcement and the link to the spec are here. Thanks to all of you who helped us out.
What came before...
It's important to be able to measure the details of a program's performance in order to find ways to speed it up. Until now, there have been just two ways to do this. The first is via instrumentation, i.e., adding code to the program to watch the clock, or the cycle counter, or just to count the number of times an instruction or loop is executed. Instrumentation can be added by the programmer or by a compiler. Unfortunately, it seriously perturbs the application, and the instrumented code usually doesn't have the same characteristics as the original code, especially when dealing with the data and instruction caches. Also, instrumentation can't observe the hardware caches, so it can't gather data about cache behavior.
The second traditional method of monitoring performance is to use the hardware performance counters. These count hardware events and generate an interrupt after a programmed number of events have happened. The counters can report on events that are too hard to instrument (like counting each x86 instruction) or are not visible to software (like cache misses). These counters are used by the AMD CodeAnalyst Performance Analyzer and provide deep insight into application and system performance. However, each time a data sample is gathered, the processor must take an interrupt to a kernel-mode driver, and that takes hundreds or thousands of cycles. The driver, by simply executing, changes the contents of the data cache and the instruction cache and may perturb the application's performance. The counters can only be configured, started, and stopped from kernel mode, so an application must call a driver or the operating system to control them. Finally, some systems do not context-switch the performance counters when changing threads or processes, and on those systems, performance monitoring can only be done globally by a single user at a time.
Introducing LWP
After reading about current technology, you might think that an ideal performance monitor should:
Operate entirely in user mode
Cause little or no perturbation of the application
Be controlled separately for each thread
Have low overhead to allow for higher sampling rates
And that describes LWP!
Lightweight Profiling adds a set of user-controlled counters to the AMD64 architecture. They can monitor multiple events simultaneously. An application thread starts profiling by providing the address of an LWP control block (LWPCB) as the operand to the new LLWPCB instruction. The contents of the LWPCB specify which events to count and how often to count them. The LWPCB also points to a ring buffer in the application's memory into which the hardware will store event records. That's it.
Once started, LWP counts the specified events. When an event counter underflows, LWP stores an event record at the head of the ring buffer and resets the counter. (If requested, LWP randomizes the bottom bits of the new counter value to prevent "beating" against constant length loops.) LWP stores the record without interrupting the flow of the program, so the only perturbation to the program's performance is writing the record (usually affecting only a single data cache line) and a few cycles to perform the write. The record contains the event type, the address of the instruction that caused the underflow, and other information about the event. All event types share one ring buffer and can be sorted out by the event type field in the record.
Of course, eventually the buffer will fill up. What then? Well, a program has two options for emptying the ring buffer. First, it can simply poll the buffer and remove event records from the tail of the ring. When software rewrites the tail pointer, the LWP hardware knows it can reuse the newly emptied region of the ring buffer. Since the buffer is in user memory, the program can even share the memory with another process, and that second process can be responsible for draining the buffer. Second, the application can specify that it wants LWP to generate an interrupt when the ring buffer is filled past a certain threshold. For instance, it can configure a buffer to hold 10,000 event records and tell LWP to interrupt whenever there are more than 9,000 records in the buffer. The interrupt does indeed perturb the program, but it does so 1/9000th as often as the traditional performance counters would. Better still, since the buffer is in user memory, the application can catch the interrupt and do whatever it wants with the data. It can store it to disk for later analysis, or it can process it immediately and even try to fix performance problems as they are happening.
In addition, LWP is a per-thread feature. Each thread on the system can be monitoring different events at different rates without interference. If a thread is not using LWP, there is no impact on its performance even if other threads have LWP active.
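The head/tail behavior of the ring buffer can be pictured with a small software model. This is purely a toy written for illustration: real LWP is implemented in hardware and configured through the LWPCB, and every name below (RingBufferModel, interval, threshold, on_threshold, dropped) is hypothetical. What the hardware does when the ring is completely full is not modeled; this sketch simply counts the lost record.

```python
class RingBufferModel:
    """Toy software model of an event ring buffer with head and tail."""

    def __init__(self, capacity, interval, threshold=None, on_threshold=None):
        self.capacity = capacity          # number of record slots in the ring
        self.slots = [None] * capacity
        self.head = 0                     # next slot the "hardware" writes
        self.tail = 0                     # next slot the software reads
        self.count = 0                    # records currently in the ring
        self.interval = interval          # events between stored records
        self.counter = interval
        self.threshold = threshold        # fill level that triggers a callback
        self.on_threshold = on_threshold  # stands in for the interrupt
        self.dropped = 0                  # assumption: count records lost when full

    def event(self, event_type, address):
        """Count one event; store a record when the counter runs out."""
        self.counter -= 1
        if self.counter > 0:
            return
        self.counter = self.interval      # reset the counter
        if self.count == self.capacity:
            self.dropped += 1
            return
        self.slots[self.head] = (event_type, address)
        self.head = (self.head + 1) % self.capacity
        self.count += 1
        if self.threshold is not None and self.count > self.threshold:
            if self.on_threshold is not None:
                self.on_threshold(self)

    def drain(self):
        """Remove all pending records from the tail, freeing their slots."""
        records = []
        while self.count:
            records.append(self.slots[self.tail])
            self.tail = (self.tail + 1) % self.capacity
            self.count -= 1
        return records


if __name__ == "__main__":
    collected = []
    ring = RingBufferModel(capacity=10, interval=100, threshold=8,
                           on_threshold=lambda r: collected.extend(r.drain()))
    for i in range(2000):
        ring.event("instructions_retired", 0x1000 + i)
    print(len(collected), "records drained,", ring.count, "still pending")
```

With these made-up sizes the callback fires once per nine records rather than once per record, which is the same effect the 10,000/9,000 example describes at a larger scale.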
Some LWP Details
The LWP events are a small subset of the events available in the traditional performance counters. They include Instructions Retired, Branches Retired, and DCache Misses. The Branches Retired event can be filtered by whether the branch is direct or indirect, conditional or unconditional, or other criteria. It captures the target address of the branch, a useful value when looking at indirect branches. The DCache miss event can be filtered by cache level to capture only "expensive" cache misses.
One exciting feature of LWP is the ability to insert events into the ring buffer under program control. There are two new instructions to do this:
LWPINS inserts a record into the ring buffer containing data taken from the arguments to the instruction. A program can use LWPINS to insert a marker to indicate an important event, such as loading or unloading a shared library, that influences the way addresses should be interpreted in subsequent event records.
LWPVAL uses an event counter and decrements the counter each time it is executed, much the way the hardware event counters work. When the counter underflows, it inserts a record into the ring buffer containing data from its arguments. A program uses LWPVAL to implement a technique called value profiling. For instance, it can profile the divisor of a commonly executed DIV instruction and if the data show that the divisor is frequently the same number, it can rewrite the instruction to test for that value and execute an optimized code sequence. Similarly, it can profile the target of a hot indirect branch and generate better code if one way of the branch is dominant.
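The value-profiling idea can be sketched in software. The code below does not use LWPVAL or any real LWP interface; it only imitates the "count down, then record a sample" pattern in plain Python and then specializes a divide routine for a dominant divisor. The class and function names, the 80% dominance cutoff, and the power-of-two fast path are all assumptions chosen for this illustration.

```python
from collections import Counter


class ValueProfiler:
    """Software sketch of value profiling: sample every `interval`-th value."""

    def __init__(self, interval):
        self.interval = interval
        self.counter = interval
        self.samples = Counter()

    def observe(self, value):
        # Count down on every call and record a sample when the counter
        # runs out, imitating the behavior described for LWPVAL.
        self.counter -= 1
        if self.counter == 0:
            self.counter = self.interval
            self.samples[value] += 1

    def dominant(self, min_share=0.8):
        """Return the most common sampled value if it reaches min_share."""
        total = sum(self.samples.values())
        if total == 0:
            return None
        value, hits = self.samples.most_common(1)[0]
        return value if hits / total >= min_share else None


def make_divider(hot_divisor):
    """Build a divide routine with a fast path for one known divisor."""
    is_power_of_two = (
        hot_divisor is not None
        and hot_divisor > 0
        and (hot_divisor & (hot_divisor - 1)) == 0
    )
    if is_power_of_two:
        shift = hot_divisor.bit_length() - 1

        def divide(n, d):
            if d == hot_divisor:          # test for the common value first
                return n >> shift         # shift instead of a general divide
            return n // d
        return divide
    return lambda n, d: n // d


if __name__ == "__main__":
    profiler = ValueProfiler(interval=10)
    for i in range(1000):
        profiler.observe(3 if i % 7 == 0 else 8)
    divide = make_divider(profiler.dominant())
    print(profiler.samples, divide(100, 8), divide(100, 3))
```

A runtime or dynamic optimizer would do the equivalent at the machine-code level; the point here is only the shape of the decision: sample cheaply, check for a dominant value, then generate a specialized path.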
Who will use LWP?
LWP can be used in many different application environments. These include:
Managed Runtime Environment: Managed Runtimes (MRTEs) are programming environments such as Java and the Microsoft® .NET Framework. These environments have the ability to generate AMD x86 or x64 code for routines coded in a high level managed language (such as Java or C#), and they can do that on the fly as a program is running. The MRTE can enable LWP and periodically look for performance problems. If (when!) it finds them, it can generate better code for the hot spots and improve the program's overall performance. LWP is lightweight enough that it can run continuously.
Dynamic Optimizer: A Dynamic Optimizer is a program that monitors an application and attempts to improve its performance by modifying it as it runs. In this case, the target application is compiled to native code from a traditional language like C or C++. The Dynamic Optimizer can gather performance data without affecting the flow of control in the application.
Compiler Feedback: Most modern compilers have an option to build an instrumented program which the developer runs to gather information on the program's performance. Unfortunately, the added instrumentation (and the fact that optimization levels are often cranked down in a feedback compilation) perturbs the program so much that what's being measured is substantially different from the "real" program. With LWP, the compiler can gather statistics on the program execution without changes, and it can insert LWPVAL instructions to profile interesting areas without adding a large block of instrumentation code and without clobbering any registers. If the application runs without turning on LWP, the LWPVAL instructions act as NOPs and only take a few cycles.
Conclusion
We're very excited about Lightweight Profiling, and I hope this note has piqued your interest. You can read the full specification at the LWP page on Developer Central. There's also an email link you can use to send us your comments and suggestions.
P.S.
My colleagues suggested that I make this more "bloggy" by adding references to "traditional performance values" and "herbal performance enhancers". This postscript is dedicated to them.
Anton Chernoff is a Senior Fellow and architect at AMD. His postings are his own opinions and may not represent AMD's positions, strategies or opinions.
-------------------------
Featuring guest blogger from Sun Microsystems, Darryl Gove.
The release of a new version of Sun Studio is always an exciting moment for Sun Studio enthusiasts. Sun Studio 12 came out pretty much two years ago, and a lot has changed in that time.
One particular trend has been that multicore processors have become mainstream. One way of illustrating that is to look at the number of threads per chip for all the submitted SPEC® CPU2006 integer speed results*. The following chart shows the cumulative number of submitted results since the benchmark came out in 2006 until the middle of June 2009 broken down by the number of threads that the chip could support.
Two years ago, when Sun Studio 12 came out, chips that could support two threads were starting to become common. Now we're looking at that being a minimal thread count, and we're starting to see the ramp up of chips that can support more than 4 threads - the latest AMD processors support six threads per chip. In tandem with the growth in thread count, we're seeing much more interest in developing applications that can use all of these cores. Sun is fortunate that with Solaris and Sun Studio, we have a very comprehensive, and long-standing, investment in multiprocessor technology: from virtualisation, through Zones, to scalability to huge core counts.
Sun Studio has always been on the leading edge of developing parallel applications. There are two ways of leveraging multiple cores, either through libraries provided with the compiler or through the parallelisation of your application. For those people using the Performance Library, this is now optimised to take advantage of the latest AMD Quad-core and Six-core processors.
The easiest way of producing parallel code is using automatic parallelisation. Sun was the first company to submit automatically parallelised results for SPEC® CPU2000. Automatic parallelisation is a great technology. It takes some of the work of making parallel codes away from the developer, and places it firmly into the category of "just another compiler flag".
However, the compiler can't do this for all codes, which is why Sun was also one of the first companies to support the OpenMP 3.0 specification.
The OpenMP 3.0 specification is a very important step in making parallel programming easier. The 2.5 specification that was supported by Sun Studio 12 allows developers to identify loops that can be performed in parallel, and different sections of code that can be run simultaneously. The big improvement in the 3.0 specification is the support for Tasks. A task is a unit of work that one thread can request another thread to do. The developer defines the tasks in the source code, but the executed tasks and their order are dynamically determined at runtime. This massively increases the range of applications that can be parallelised using OpenMP.
Of course, writing parallel applications becomes much harder without the tools to support this. Sun Studio 12 Update 1 includes these tools: the Debugger for diagnosing bugs in parallel applications, the Performance Analyzer for determining the activity of all the threads in an application, and the Thread Analyzer for identifying data races in parallel applications. The Performance Analyzer has been enhanced to support hardware counters in the latest AMD processors. The hardware performance counters are an optimal way of determining exactly what the processor is doing during the run of your application.
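To give a feel for the task model, here is an analogy written with Python's concurrent.futures thread pool. It is not OpenMP and does not use any Sun Studio feature; it only illustrates the same shape of program, where one thread walks an irregular structure (a linked list, which a parallel loop cannot easily index) and hands each node to a pool as a separate unit of work. The Node class and the process function are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


class Node:
    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node


def process(value):
    # Hypothetical unit of work; stands in for the body of a task.
    return value * value


def process_list(head, workers=4):
    """Walk a linked list and submit one task per node to a thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = []
        node = head
        while node is not None:           # the walk itself is sequential
            futures.append(pool.submit(process, node.value))
            node = node.next
        # Tasks may run in any order; results are collected in list order.
        return [f.result() for f in futures]


if __name__ == "__main__":
    print(process_list(Node(1, Node(2, Node(3)))))
```

The code defines what a task is, while the pool decides which worker runs each one and when. Note that CPU-bound Python threads do not run in parallel in the standard interpreter, so this sketch shows the structure of the approach rather than a speedup.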
Performance is often one of the motivating factors for any compiler upgrade. In a compiler suite performance comes from two sources: enabling the developer to identify opportunities to improve performance, and the ability of the compiler to produce good code for the processor. The performance analyzer is able to profile all kinds of parallel applications including those parallelised with OpenMP directives as well as distributed MPI applications. This enables you to quickly determine where, at a source code level, the application is spending its time, and to drill down into that source to understand the performance at the level of hardware events.
The goal for the Sun Studio compiler has always been to produce code that runs as fast as possible on all SPARC and x86 processors. Sun has worked closely with AMD to ensure that the compiler is aware of the best practices for producing code for the latest AMD processors. Sun Studio 12 Update 1 includes this support and continues the long track record of delivering superior performance on AMD processors.
As well as providing support for all processors, Sun Studio is also supported on a number of platforms: Solaris, OpenSolaris, and Linux (for x86). Perhaps most importantly Sun Studio 12 Update 1 is free of charge to download and use.
* SPEC and the benchmark names SPECfp and SPECint are registered trademarks of the Standard Performance Evaluation Corporation. Benchmark results stated above reflect results posted on www.spec.org as of 15 June 2009.
Darryl Gove is a Senior Staff Engineer in the compiler team at Sun Microsystems. He works on the optimisation and tuning of applications and benchmarks. He is the author of the books "Solaris Application Programming" and "The Developer's Edge", and maintains a blog at http://blogs.sun.com/d/. His postings are his own opinions and may not represent AMD's positions, strategies or opinions. Links to third party sites are provided for convenience and unless explicitly stated, AMD is not responsible for the contents of such linked sites and no endorsement is implied.
-------------------------
You may have seen the recent blog post from our CMO Nigel Dessau about the release of the x86 Open64 Compiler Suite. Nigel makes some great points about why AMD feels this open source project is important, so I won’t go into that here. Instead, I’ll provide an overview of the latest release and what the features can mean for your development work.
Like other compilers, Open64 optimizes applications aggressively in many dimensions, but what is different is that Open64 employs innovative techniques that stem from an understanding of the underlying hardware architecture, such as laying out data structures in space and cache efficient manners and deploying aggressive forms of loop-nest optimizations to promote locality. The biggest area this helps is with multi-core scalability, a measure of throughput performance of running multiple applications simultaneously on multiple cores, where a memory sub-system is often stressed.
While the Open64 compiler suite was created to optimize software development for all x86-based architectures, it utilizes many features that take particular advantage of AMD’s technology. One such example is enabling the use of 2 MB huge pages for programs built with Open64 to help reduce TLB misses. Another important feature is enhanced code generation and instruction scheduling to take advantage of core pipeline hardware features. Also, software data prefetching is better tuned to work with the hardware prefetcher and DRAM prefetcher to effectively hide memory latencies. This latest release also offers preview features of OpenMP and automatic parallelization to map program parallelism to multiple cores.
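A quick back-of-the-envelope calculation shows why larger pages help with TLB misses: each TLB entry maps one page, so the amount of memory the TLB can cover grows with the page size. The sketch below is only arithmetic; the 48-entry TLB used in the example is a hypothetical figure chosen for illustration, not a specification of any AMD processor.

```python
KIB = 1024
MIB = 1024 * KIB


def tlb_reach(entries, page_size):
    """Bytes of memory that can be mapped by the TLB at one time."""
    return entries * page_size


def pages_needed(working_set, page_size):
    """Number of pages (and so translations) a working set spans."""
    return -(-working_set // page_size)   # ceiling division


if __name__ == "__main__":
    entries = 48                          # hypothetical TLB size
    print(tlb_reach(entries, 4 * KIB) // KIB, "KiB covered with 4 KiB pages")
    print(tlb_reach(entries, 2 * MIB) // MIB, "MiB covered with 2 MiB pages")
    print(pages_needed(1024 * MIB, 4 * KIB), "small pages for a 1 GiB working set")
    print(pages_needed(1024 * MIB, 2 * MIB), "huge pages for a 1 GiB working set")
```

Under that assumption, the same number of entries covers 192 KiB with 4 KiB pages and 96 MiB with 2 MiB pages, and a 1 GiB working set needs 512 translations instead of 262,144.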
Here’s the full list of new features in x86 Open64 4.2.2 that AMD added (also detailed in the release notes):
· Support for 2 MB huge pages.
· Improved loop fusion and loop unrolling.
· Improved head/tail duplication, if-merging, scalar replacement and constant folding optimizations.
· Improved interprocedural alias analysis.
· Improved partial inlining and inlining of virtual functions.
· More aggressive re-layout optimization for structure members.
· Improved instruction selection and instruction scheduling.
· Improved tuning of library functions.
What this compiler suite really enables is highly optimized performance when running multiple applications at the same time, which is pretty much the norm for real-world workloads. In the spirit of open source projects, we’d like your feedback on how to improve this compiler suite. If you would like to suggest features for future releases, leave us a comment. While we can’t promise that the features will be added, we certainly take your feedback under serious consideration.
Roy Ju
AMD Fellow
-------------------------
Looking for our "Shanghai" Zone? All the content and resources you expected to find are still there, but we've added some new information about AMD's follow-up to the Quad-Core AMD Opteron™ processor (codenamed "Barcelona" and "Shanghai") and have renamed the content section to "Istanbul" Zone. The new Six-Core AMD Opteron processors (codenamed "Istanbul") retain all the features of the "Barcelona" and "Shanghai" processors and add further advancements in the technologies for even better performance. Find out what's new with this six-core processor in the "Istanbul" Zone.
We'd appreciate hearing what you think about the new "Istanbul" processors, so leave us a comment!
-------------------------
This week, AMD is making a couple of very important announcements for developers: support of Intel's Advanced Vector Extensions (AVX) instruction set in future AMD processors, and the adaptation to the AVX framework of AMD's previous SSE5 instruction set proposal. The latter step has resulted in three new extensions: XOP (for eXtended Operations), CVT16 (half-precision floating point converts), and FMA4 (four-operand Fused Multiply/Add). In this posting I'll give an overview of the capabilities that these extensions provide, and also some insight into why we're taking this step.
First, the why. When we proposed the SSE5 extensions back in mid-2007, they brought some important innovations to the SIMD side of the x86 architecture:
a non-destructive three-operand capability, and a four-operand capability to support some very powerful new operations;
a set of powerful permute and conditional move instructions for data movement, plus Fused Multiply/Add (FMA) instructions for high-performance floating point;
a variety of other new operations to address various holes in the SSE instruction set: shift/rotate, integer compares, integer multiply/accumulate, and half-precision floating point support.
In April of 2008, Intel published its AVX/FMA proposal, which incorporated several of SSE5's innovations - in particular the three- and four-operand capabilities, the Fused Multiply/Add instructions, and some of the permute instructions - albeit in a somewhat different form. This proposal also added some new capabilities with a new instruction format: doubling the width of SIMD FP operations, applying the non-destructive three-operand capability to most legacy SSE instructions, and greatly expanding the potential opcode space for future extensions.
With this duplication of functionality between SSE5 and AVX/FMA, and AVX's additional features, we felt the right thing to do was to support AVX. In our minds, a more unified instruction set is clearly what's best for developers and the x86 software industry. With our acceptance of AVX, a key aspect of this instruction set unification is the stability of the specification. Since we don't control the definition of AVX, all we can say for sure is that we expect our initial products to be compatible with version 5 of the specification (the most recent one, as of this writing, published in January of 2009), except for the FMA instructions, which we expect will be compatible with version 3 (published in August of 2008).
Why the FMA difference? This was not something we did lightly. In December of 2008, Intel made significant changes to the FMA definition, which we found we could not accommodate without unacceptable risk to our product schedules. Yet we did not want to deprive customers of the significant performance benefits of FMA. So we decided to stick with the earlier definition, renaming it FMA4 (for four-operand FMA - Intel's newer definition uses what we believe to be a less capable three-operand, destructive-destination format). It will have a different CPUID feature flag from Intel's FMA extension. At some future point, we will likely adopt Intel's newer FMA definition as well, coexisting with FMA4. But as you might imagine, we may wait until we're sure the specification is stable.
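To make the operand-format distinction a little more concrete, here is a minimal Python sketch. It is only a scalar model, not the instructions themselves: the function names and the register dictionary are hypothetical, and the choice of which source the three-operand form overwrites is an assumption made for illustration (the real encodings define this per instruction form). It also shows why a fused operation matters numerically: the product is not rounded before the add.

```python
from fractions import Fraction

def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Model a fused multiply-add: form a*b + c exactly, then round once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)  # one rounding, from the exact rational to a double

def unfused_multiply_add(a: float, b: float, c: float) -> float:
    """Separate multiply and add: the product is rounded before the add."""
    return a * b + c

def fma4_style(regs, dst, src1, src2, src3):
    """Four-operand, non-destructive model: all three sources survive."""
    regs[dst] = fused_multiply_add(regs[src1], regs[src2], regs[src3])

def fma3_style(regs, dst_src1, src2, src3):
    """Three-operand, destructive model: the destination is also a source,
    so its old value is lost (overwriting the first source is an assumption)."""
    regs[dst_src1] = fused_multiply_add(regs[dst_src1], regs[src2], regs[src3])

# Values chosen so that the exact product, 1 - 2**-60, rounds to 1.0.
a, b, c = 1.0 + 2.0**-30, 1.0 - 2.0**-30, -1.0
fused = fused_multiply_add(a, b, c)      # keeps the tiny residual
unfused = unfused_multiply_add(a, b, c)  # residual lost when a*b is rounded
```

With the four-operand model, a compiler that still needs all three inputs afterwards has nothing to copy; with the destructive model it would first have to save whichever input is about to be overwritten.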
The fact remains that AVX does not incorporate all of SSE5's features. Since SSE5 was based on months of discussions with ISVs on what sort of capabilities they felt were needed, and had been positively reviewed by the industry when we first put out the specification, we decided to follow through with development of these additional features. To do so most effectively, we redefined them in the AVX framework, resulting in the XOP extension.
So what has changed from the original SSE5 proposal? Well, quite a lot, really. First of all, the instruction formatting was changed to leverage the capabilities that the AVX VEX prefix brings, using a new VEX-like three-byte prefix sequence called (interestingly enough) the XOP prefix. This provides three- and four-operand non-destructive destination encoding, an expansive new opcode space, and extension of SIMD floating point operations to 256 bits. The SSE5 operations that are retained by the XOP extension are:
Horizontal integer add/subtract: Signed or unsigned add, or signed subtract, of adjacent byte, word, or dword elements in the source vector to word, dword or qword elements of the destination vector. 128-bit.
Integer multiply/accumulate: Multiplies elements of two input vectors, adding the results to a third input vector. 128-bit.
Shift/rotate with per-element counts: These use a vector of shift counts, allowing each element of the source vector to be shifted or rotated by a different amount. There is also a rotate instruction with an immediate-byte single count applied to all elements. 128-bit.
Integer compare: Signed and unsigned comparison of byte, word, dword and qword elements, with predicate (mask) generation as in the various SSE compare instructions. The particular comparison to perform is specified in an immediate byte. 128-bit.
Byte permute: A powerful operation which copies bytes from two 16-byte input vectors to a 16-byte destination vector, optionally performing a selected transformation on each, under the control of a third input vector. 128-bit.
Bit-wise conditional move: Selects each bit of the destination vector from either of two input vectors, per a third input vector. 128- and 256-bit.
Fraction extract: Extracts the fractional part of floating point operands. Scalar and 128- or 256-bit vector, single and double precision.
Half-precision convert: These convert between half-precision and single-precision formats while loading or storing a four- or eight-element vector. They provide dynamic control of rounding and denormalized operand handling. These particular instructions form a separate extension called CVT16, with a distinct CPUID feature flag.
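For readers who prefer code to prose, a few of the operations in the list above can be modeled in plain Python. This is a scalar sketch of the described behavior only, not the instruction definitions: the function names are hypothetical, vectors are modeled as Python lists or integers, and the selector polarity in the conditional move (a 1 bit picks the first source) is an assumption to check against the specification.

```python
def rotate_left_elements(values, counts, width=8):
    """Rotate each element left by its own count (per-element counts)."""
    mask = (1 << width) - 1
    result = []
    for value, count in zip(values, counts):
        count %= width
        value &= mask
        rotated = ((value << count) | (value >> (width - count))) & mask
        result.append(rotated)
    return result

def bitwise_conditional_move(src1, src2, selector, width=128):
    """Pick each result bit from src1 where selector is 1, else from src2."""
    mask = (1 << width) - 1
    return ((src1 & selector) | (src2 & ~selector)) & mask

def horizontal_add_pairs(elements):
    """Add adjacent elements; each sum lands in a wider destination element,
    modeled here simply by Python's unbounded integers."""
    return [elements[i] + elements[i + 1] for i in range(0, len(elements), 2)]

rotated = rotate_left_elements([0b10000001, 0x0F, 0xFF], [1, 4, 3])
merged = bitwise_conditional_move(0xAAAA, 0x5555, 0xFF00, width=16)
sums = horizontal_add_pairs([1, 2, 3, 4, 100, 100, -5, 5])
```

The third pair in `sums` adds up to 200, which no longer fits in a signed byte; that is the reason the horizontal adds write to wider destination elements.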
Along with the FMA4 instructions, these support a wide variety of numeric-intensive, multimedia, and cryptographic applications, and allow some new cases of automatic vectorization by compilers. Speaking of compilers, plans are afoot to support these in intrinsic form in various compilers, and they may be used automatically in code generation in some cases.
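The half-precision conversions mentioned in the list above trade precision for storage, which is easy to see with a small Python sketch. It uses the standard `struct` module's `e` format (IEEE 754 binary16) as a stand-in; the helper names are hypothetical, Python floats are doubles rather than singles, and the dynamic rounding and denormal controls of the actual instructions are not modeled.

```python
import struct

def to_half_bits(values):
    """Convert floats to IEEE 754 half-precision (binary16) bit patterns."""
    return [struct.unpack("<H", struct.pack("<e", v))[0] for v in values]

def from_half_bits(bits):
    """Convert half-precision bit patterns back to Python floats."""
    return [struct.unpack("<e", struct.pack("<H", b))[0] for b in bits]

# A four-element vector, as in the narrower form of the convert.
vector = [1.0, -2.0, 0.5, 0.1]
stored = to_half_bits(vector)      # 16 bits per element instead of 32
restored = from_half_bits(stored)  # 0.1 comes back only approximately
```

Values such as 1.0, -2.0 and 0.5 survive the round trip exactly, while 0.1 does not, so the format suits data where halving the memory footprint matters more than the last few digits.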
A version of the AMD64 SimNow! simulator with support for these extensions is planned for availability in very short order.
I hope I've given you a good taste of these new features. For all the details on the XOP and FMA4 extensions, you can find the specification here. And, I encourage you to read the blog of our CMO, Nigel Dessau, for an executive perspective on driving innovation into the x86 instruction set. We believe we've struck the right balance between innovation and standardization. Feel free to comment or ask questions - we're always happy to hear from you. As you can see below, we've already heard from ten of our technology partners on the subject.
Dave Christie is a Fellow and senior architect at AMD. His postings are his own opinions and may not represent AMD's positions, strategies or opinions. Links to third party sites are provided for convenience and unless explicitly stated, AMD is not responsible for the contents of such linked sites and no endorsement is implied.
"The addition of AVX support by AMD is a great move as it enables superior performance potential across AMD's x86 family of processors," said Wood Lotz, Absoft CEO. "AMD's use of AVX can also simplify development of high performance compilers and tools for companies like Absoft, and enable customers across a wide variety of industries to build faster applications."
"Acumem fully supports AMD's adoption and enhancement of the AVX instructions and will follow this standard as it becomes available in the market. As an ISV for performance tools we clearly see potential for performance improvements with these new additions," said Mats Nilsson, VP Software Engineering at Acumem.
"Axceleon applauds AMD's efforts to support both specifications, AVX and SSE5, in their XOP specification proposal. The further enhancements in FMA4 which accelerate floating point algorithms are very important to Axceleon's HPC customers and will be welcomed across the board," said Mike Duffy, CEO of Axceleon.
"We at Bibble Labs are constantly looking for performance improvements, and as such we are investigating AVX because of the possible performance advantage it might bring. We also appreciate that AMD is taking an active role to ensure the instruction sets converge and not create separate, conflicting instruction sets," said Jeff Stephens, Vice President of Product Development, Bibble Labs.
"We commend AMD for taking an active role in open standards, by unifying the x86 instruction set and merging SSE5 into the AVX specification. This can help improve compatibility and simplify the work for developers implementing this. We look forward to investigating AVX for potential advantages it may bring to our real-time applications and plug-ins," said Noel Borthwick, Chief Technology Officer, Cakewalk.
"We are pleased that AMD has decided to adopt the AVX instruction set extension instead of offering a variant," said Simone Hoefer, General Manager, Technology at Nero AG. "This will help reduce implementation complexity and multiple code-paths. We are confident that the SIMD (SSE/SSE2) optimizations already implemented will scale nicely to 256-bit/AVX, allowing us to truly embrace this new development."
"Having to choose acceleration solutions that work well on both AMD and Intel CPU platforms, Smith Micro welcomes convergence of the x86 instruction set. AMD supporting AVX is desirable from Smith Micro's point of view," said Uli Klumpp, director of engineering, Smith Micro Software, Inc. "The AVX instruction set extensions are looking promising for further optimizing our computationally most demanding software, DCC and data compression products such as Poser and StuffIt."
"AMD's adoption of AVX will help Sonic unify some of its engineering efforts and reduce development costs," said Jim Roth, Chief Technical Officer, Sonic Solutions. "We welcome this initiative and the proposed enhancements to the x86 processor architecture, which we will leverage to increase the responsiveness and performance of our digital media applications."
"We are pleased that AMD has decided to adopt the AVX instruction set extension instead of offering a variant," said John Freeborg, Vice President of Engineering for Sony Creative Software. "We also appreciate that AMD is taking an active role to ensure these converge and do not create separate, conflicting instruction sets."
-------------------------
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.
Here inside AMD Developer Central, we continuously work to improve your experience on our site. As part of these efforts, we periodically check the search logs to learn what our users are interested in and to ensure that we have useful content around those topics. One search term that seems to come up often is "overclocking."
Overclocking is the action of running your computer components (like CPU, GPU) at a higher speed than specified (and designed) by the manufacturers. Some reasons why users choose to overclock are to save money and increase performance. However, running the system at a higher speed increases power consumption, which leads to more heat, noise, and potential stability issues.
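As a rough illustration of why higher speed means more power and heat, here is a sketch based on the common first-order approximation that dynamic power scales with capacitance times voltage squared times frequency. This is a textbook rule of thumb, not an AMD-published figure; it ignores leakage and other effects, and the function name is hypothetical.

```python
def relative_dynamic_power(freq_ratio: float, voltage_ratio: float = 1.0) -> float:
    """Estimate dynamic power relative to stock settings, using P ~ C * V^2 * f.

    freq_ratio and voltage_ratio are overclocked / stock values.
    """
    return voltage_ratio ** 2 * freq_ratio

# A 10% clock increase that also needs 10% more voltage to stay stable.
estimate = relative_dynamic_power(freq_ratio=1.1, voltage_ratio=1.1)
```

Under this approximation, a 10% overclock that also requires 10% more voltage costs roughly a third more dynamic power, which is why cooling and stability become concerns so quickly.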
In a professional setting, overclocking can be risky. (That's why we issue warnings, like the one below.) A minor error could seriously affect system performance and delay project schedules. But, if you are a hobbyist / tweaker / gamer willing to take the risk and want to extract every last drop from the CPU/GPU, then overclocking is probably a topic you are very interested in. To achieve overclocking on AMD systems, we recommend AMD OverDrive™, a state-of-the-art system management tool that includes overclocking capabilities.
Patrick Moorhead, AMD's VP of Advanced Marketing, has also written several blogs on this topic:
We hope this information helps. So which pill would you take: the red or the blue? Do you still intend to overclock your system and, if so, is it your work system or your home system?
***WARNING*** AMD and ATI processors are intended to be operated only within their associated specifications and factory settings. Operating your AMD or ATI processor outside of specification or in excess of factory settings including, but not limited to, overclocking may damage your processor and/or lead to other problems including, but not limited to, damage to your system components (including your motherboard and components thereon (e.g. memory)), system instabilities (e.g. data loss and corrupted images), shortened processor, system component and/or system life and in extreme cases, total system failure. AMD does not provide support or service for issues or damages related to use of an AMD or ATI processor outside of processor specifications or in excess of factory settings. You also may not receive support or service from your system manufacturer. DAMAGES CAUSED BY USE OF YOUR AMD OR ATI PROCESSOR OUTSIDE OF SPECIFICATION OR IN EXCESS OF FACTORY SETTINGS ARE NOT COVERED UNDER YOUR AMD PRODUCT WARRANTY AND MAY NOT BE COVERED BY YOUR SYSTEM MANUFACTURER'S WARRANTY.
-------------------------
Velu, Jayaprakash
It was an awesome event for us, at the AMD booth as well as at our technical session. It was a great experience for us to meet developers who are keen to know about AMD's efforts within the software community. We appreciate the support from those of you who attended AMD's technical session, particularly when you had multiple options.
We hope you now know:
Why AMD cares about Java
What contributions we've made to the Java community
Some useful tips for improving the performance of your Java application
How AMD works with many software partners to optimize their applications
As promised during my talk, here are the links to the Framewave and SSEPlus open source library projects. Check them out, and contribute your own enhancements to the libraries or let us know what enhancements you'd like to see.
If you missed our session, here are some useful resources
We hope our recommendations for coding best practices were useful. We also hope that you have upgraded to JDK 1.6 to get the latest enhancements we've contributed.
We sincerely look forward to meeting you all again next year (and to the spicy Hyderabadi biryani!!).
Were you there? Drop us a comment to let us know!
-JK
-------------------------
Velu, Jayaprakash
Edited: 03/20/2009 at 01:37 PM by AMD Developer Blogs Moderator
The Common Weakness Enumeration (CWE) list at http://cwe.mitre.org/ is a community-created living dictionary of software weaknesses -- that is, *types* of weaknesses. This dictionary and Top 25 list involved experts from the computer industry, academia, the SANS Institute, the non-profit MITRE Corporation, the NSA, and the Department of Homeland Security. The goal of their initiative is to establish standards for assessment and verification of security vulnerabilities in software and the security tools that mitigate those weaknesses. Simply put, they just want to make computer security measurable.
Currently the CWE lists 755 total weaknesses (note a few are deprecated duplicates). They leveraged previous work from the deep CVE (Common Vulnerabilities and Exposures) list at http://cve.mitre.org/cve/index.html , which is a large dictionary of specific publicly known vulnerabilities (these generally look like "Buffer overflow in Bar 5.0 in Foo OS 3"). The CVE lists over 3000 confirmed specific vulnerabilities and tens of thousands more candidate vulnerabilities. So, the CWE lists types of vulnerabilities, and you could say these are abstractions of specific entries from the CVE (as well as a variety of other sources).
So, bubbling to the top of the CWE is their Top 25 Most Dangerous Programming Errors. These are noteworthy errors that can lead to hacked systems, data theft, and system inoperability. Importantly, security exploits rooted in these errors can be easy to code. Each error is in one of three categories, and these examples will be familiar to some:
The SANS Institute believes this Top 25 list will enable buyers (for example, state and US government agencies) to buy safer software, with vendors needing to certify that they have checked for these errors. Universities can also teach secure coding to a set standard, and software developers and their employers will have greater assurance regarding security vulnerabilities in their software projects. This gets to the root of cyber-security problems at the source, in software development.
What should you take away from this list? Here are some higher level takeaways and implications:
1. If you develop software, educate yourself on these vulnerabilities and the programming techniques to help mitigate them. Developers should make a reasonable effort to design security in from the beginning of their projects. A large number of programmers today work on web serving applications, like e-commerce, which requires identity management, secure transactions, and secure record-keeping. For consumer confidence, security’s clearly going to be a significant concern -- more so than performance or scalability. And remember that secure programming skills are a desired requirement in many job postings.
2. If you don’t develop software directly, educate yourself at least at a high level on these vulnerabilities anyway. You don’t need to be paranoid, but some realistic awareness is healthy. Think about how you manage your confidential data on your computers, whether the data is personal or professional. A good way to get started here is to take a look at the SANS Top-20 Security Risks at http://www.sans.org/top20/. This covers diverse risks applicable to everyone, like unsecured flash drives, web browser vulnerabilities, denial-of-service attacks due to instant messaging, and so forth.
3. For the IT manager crafting an IT infrastructure, the challenge is even broader, with considerations such as risk assessment, vulnerability assessments, and regulatory compliance. All these can be looked at through the lens of the SANS Top 25 Most Dangerous Programming Errors. Remember the goal behind the CWE effort is to make software security measurable, and the Top 25 List is a focal point for that goal.
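As one concrete instance of the first takeaway, here is a short sketch of SQL injection (CWE-89, one of the weakness types on the Top 25 list) and a standard mitigation, the parameterized query. It uses Python's built-in `sqlite3` module with an in-memory database; the table, column, and function names are hypothetical and chosen only for the example.

```python
import sqlite3

def make_demo_db():
    """Create a throwaway in-memory database with two users."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "a-secret"), ("bob", "b-secret")],
    )
    return conn

def lookup_unsafe(conn, name):
    """VULNERABLE: user input is pasted directly into the SQL text."""
    query = "SELECT secret FROM users WHERE name = '" + name + "'"
    return [row[0] for row in conn.execute(query)]

def lookup_safe(conn, name):
    """Mitigated: the input is passed as a bound parameter, never as SQL."""
    query = "SELECT secret FROM users WHERE name = ?"
    return [row[0] for row in conn.execute(query, (name,))]

conn = make_demo_db()
attack = "nobody' OR '1'='1"
leaked = lookup_unsafe(conn, attack)   # the injected OR clause matches every row
blocked = lookup_safe(conn, attack)    # treated as a literal (nonexistent) name
```

The point is less about this particular database than about the habit: keep data and code in separate channels, and the attack string is just an odd-looking name.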
-------------------------
Tracy Carver, MTS Software Engineer
We had a great time at PDC 2008 this year. I was really glad to see that the in-booth theater presentations were so well received. During the expo reception, we had over 50 people gathered to watch a presentation given by Mike Wall, Principal Member of Technical Staff, on Performance Optimization of Windows® Applications on AMD Processors. As promised, this presentation is posted on the Windows Zone on developer.amd.com along with the other presentations on Virtualization, AMD Roadmaps, and an Intro to Developer Central.
If you liked Mike’s presentation you should definitely check out the paper that Mike wrote that comes with the Mandelbrot code set and optimization steps to follow along with:
The AMD CodeAnalyst Performance Analyzer also got a lot of attention in our booth. We handed out nearly 1,000 USB sticks loaded with AMD CodeAnalyst Performance Analyzer along with supporting technical docs. If you missed out on that, make sure to download your copy of the AMD CodeAnalyst tool that integrates into Visual Studio 2008. As Mike points out in his article, if you’ve never profiled your code before, you’ll be shocked at where most of the time is spent.
Our 32-core HP DL785 server was great eye candy and attracted many people's attention. Our Mandelbrot threading demo was able to show the new Microsoft® Concurrency Runtime scheduler algorithm and task queue in a very visual manner, so it was even interesting for advanced developers. Many thanks to Michael J. Miller from PC Mag for blogging about it: http://blogs.pcmag.com/miller/2008/10/more_from_pdc_sensors_surface.php
The team reported on a lot of cool things that Microsoft is doing as well, including:
·Deploying lots of small battery powered environment sensors, which feed data to a receiver. For example, they instrumented the huge LACC lecture hall with a sensor grid and showed the temperature measurements over a 24 hour period, and you could see when the people walked in (warm spots) and when the air conditioning kicked on (cool spots). The goal is to enable more efficient buildings by accurately tracking the temperature. They also used the same method to map the heat coming from a large server farm. With virtualization and live migration, you could presumably migrate jobs from a hot rack to a cold rack, to help save energy.
·A user-friendly game construction environment called Boku. Users configure and actually program a mini world populated with interactive characters and other objects. The behaviors of the objects are determined by effectively writing short programs using an intuitive icon-based programming language. It's all done using an Xbox controller - no keyboard required. This is the most accessible and friendly programming system I've ever seen, and it also looks like a lot of fun. Kids will probably love it when it's launched, currently planned for early next year. http://research.microsoft.com/projects/boku/
·Microsoft is making an investment in Cloud Computing and even making a special version of Windows® for it, called Windows Azure.
·Microsoft Research is making a CHESS tool that allows developers to find race conditions and deadlocks in their code, in a repeatable manner, using Visual Studio: http://research.microsoft.com/Chess.
We are looking forward to the next PDC! It’s always a great group of people and Microsoft really knows how to throw a great show!
-------------------------
Looking for our "Barcelona" Zone? All the content and resources you expected to find are still there, but we've added some new information about AMD's follow-up to the Quad-Core AMD Opteron™ processor (codenamed "Barcelona") and have renamed the content section to "Shanghai" Zone. The new Quad-Core AMD Opteron processors (codenamed "Shanghai") retain all the features of the "Barcelona" processors and add further advancements in the technologies for even better performance. Find out what's new with the L3 cache originally introduced in the "Barcelona" processor and learn about other memory bandwidth enhancements in the "Shanghai" Zone.
We'd appreciate hearing what you think about the new "Shanghai" processors, so leave us a comment!
-------------------------
The Microsoft Professional Developers Conference (PDC) is back this year after a three-year hiatus, and AMD is putting the final touches on our Gold Sponsorship participation. Our main focus is to provide you with the tools and resources you need to “Unleash Your Code’s Potential.” If you are planning to attend PDC 2008, make sure to stop by the AMD booth #301.
Here’s a sneak peek of some cool things we are planning to demo:
How to simplify testing environments by using virtualization to do processor scaling tests
How to find hotspots and do thread analysis on your code using the AMD CodeAnalyst™ performance analyzer from within the Visual Studio IDE
How to help get performance improvement through program parallelization using the Microsoft® Concurrency Runtime
How AMD and Microsoft have worked together to provide developers with a comprehensive cloud computing platform and development environment
We also plan to have sessions in our booth’s Speed Zone Theater including:
Software Optimization, Part I: Memory
Making the best use of system memory. Arrays vs. linked lists, NUMA, and using the AMD CodeAnalyst performance analyzer.
Software Optimization, Part II: Cache
Making the best use of data cache. Data packing, prefetching, non-temporal data, and using the AMD CodeAnalyst performance analyzer.
Virtualization
An overview of how developers can use virtualization in their development environments.
Intro to AMD Developer Central
A look at the tools and other resources that we offer on developer.amd.com.
Oh, and in the spirit of this being a show focusing on the future, I should mention that we will be running some of these demos on our soon-to-be released processors!
Finally, we have some awesome gifts lined up for you (while supplies last). People who attend our Speed Zone Theater sessions will get a t-shirt and if you come see each of our demos you’ll get a 2GB USB drive loaded with developer tools and optimization guidance. Hope to see you there!
What other topics do you want us to talk about? Software visible features of our processors? Roadmaps? Parallel Programming? Just take a moment to let me know by leaving a comment on this blog, and I’ll do my best to get it added to our list of sessions.
Sharon Troia, AMD Developer Central
-------------------------
Edited: 10/14/2008 at 07:11 PM by AMD Developer Blogs Moderator
Are you getting odd, unexpectedly low, or inconsistent results when running your internal performance validation benchmarks or when collecting other performance data? If you are running Microsoft Windows Vista or Windows Server 2008, power plans other than “High performance” can often produce wrong and inconsistent data, which in turn misrepresents the bottlenecks or hot spots in your code. In response to a number of inquiries from our software developer community, we have added some guidance on AMD Developer Central. For more information on this topic and a scriptable way to manage power settings, please see the recommendations in the Windows Zone.
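As a rough idea of what scripting the power plan can look like, here is a minimal Python sketch that wraps the Windows `powercfg` command-line tool. This is an assumption about one possible approach, not a reproduction of the Windows Zone guidance: the helper names are hypothetical, `SCHEME_MIN` is the built-in `powercfg` alias for the High performance plan, and the commands only work on Windows, typically from an elevated prompt.

```python
import subprocess

# powercfg's built-in alias for the "High performance" power plan.
HIGH_PERFORMANCE_ALIAS = "SCHEME_MIN"

def powercfg_command(action, *args):
    """Build the argument list for a powercfg invocation."""
    return ["powercfg", action, *args]

def set_high_performance(run=subprocess.run):
    """Switch the active power plan to High performance before benchmarking."""
    return run(powercfg_command("/setactive", HIGH_PERFORMANCE_ALIAS), check=True)

def get_active_scheme(run=subprocess.run):
    """Return powercfg's description of the currently active power plan."""
    result = run(
        powercfg_command("/getactivescheme"),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()
```

A benchmark harness could record `get_active_scheme()` alongside its results, so that runs taken under different power plans are never compared by accident. The `run` parameter exists only so the helpers can be exercised without actually invoking `powercfg`.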
-------------------------
Note: The design survey will be unavailable from 6:00pm PDT on Thursday, October 9 until 4:00am PDT on Friday, October 10, 2008, for scheduled maintenance.
The AMD Developer Central staff is considering a facelift for the developer.amd.com Web site, and we want to make sure it meets your needs. That's why we're previewing some concepts, with an opportunity for you to vote on them. Your feedback will help us to decide on a final design and layout scheme that will be easier to use and more pleasing to the eye.
Check out the concepts below, and then take five minutes to vote on your favorites.
-------------------------
Edited: 10/09/2008 at 02:34 PM by AMD Developer Blogs Moderator
Did you miss checking out our oxygen bar in the AMD booth at JavaOne? Well, 1700 of your fellow developers couldn't pass up the chance to try fragrances that had different effects like calming, energizing, and -- ahem -- aphrodisiac. Fortunately, we've got pictures...but we're not telling which vial is which!
-------------------------
Have you been paying attention to the latest trends in multi-core processors and multi-threaded programming? Are you curious about how parallel programming can dramatically improve application (and of course, PC gaming) performance? Are you a code warrior interested in a cool challenge?
If so, you might have what it takes to conquer the AMD Treasure Hunt Game!
-------------------------
I'm pleased to announce that we have just published a series of six videos that brings to life some of the key concepts outlined in the Software Optimization Guide for Family 10h Processors. This video series is a companion to the optimization guide, and provides a quick look at some highly useful tips in addition to some examples to illustrate coding best practices.
We hope you find this series valuable, and welcome your feedback. Let us know what you think by commenting on this post. If you have questions about the information contained in the videos, feel free to post a question in our forums.
-------------------------
I have just a short break here, but wanted to give you all a quick update on how things are going here at EclipseCon 2008.
The booth has been quite busy, with attendees coming by to fill out our survey and get their 1GB USB drive. A number of people have wanted to learn what AMD's relationship with Eclipse is, and they become very interested once they find out what the CodeSleuth plugin can do for their Java development process.
Gary Frost from the AMD Java Labs team delivered his technical session this morning to a full room. After his session, I was flooded at the booth! I'll try to post some pictures when I get a moment.
Gotta go, the hall is opening up again and people are coming by! Be sure to check out CodeSleuth yourself if you're not able to join us at the show.
-------------------------