AMD Developer Central
AMD Developer Blogs
May 6, 2009
  Striking a Balance

This week, AMD is making a couple of very important announcements for developers: support of Intel's Advanced Vector Extensions (AVX) instruction set in future AMD processors, and the adaptation to the AVX framework of AMD's previous SSE5 instruction set proposal.  The latter step has resulted in three new extensions: XOP (for eXtended Operations), CVT16 (half-precision floating point converts), and FMA4 (four-operand Fused Multiply/Add). In this posting I'll give an overview of the capabilities that these extensions provide, and also some insight into why we're taking this step.

First, the why. When we proposed the SSE5 extensions back in mid-2007, they brought some important innovations to the SIMD side of the x86 architecture:

  • a non-destructive three-operand capability, and a four-operand capability to support some very powerful new operations;
  • a set of powerful permute and conditional move instructions for data movement, plus Fused Multiply/Add (FMA) instructions for high-performance floating point;
  • a variety of other new operations to address various holes in the SSE instruction set: shift/rotate, integer compares, integer multiply/accumulate, and half-precision floating point support.
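
To make the operand-count distinction in the first bullet concrete, here is a minimal scalar sketch in C. It is only a model: the type `vec4` and the functions `add_2op`, `add_3op`, and `madd_4op` are hypothetical names chosen for illustration, not instructions or intrinsics, and the multiply-add below rounds twice where a true fused multiply/add rounds once.

```c
#include <stddef.h>

/* Hypothetical stand-in for a 128-bit SIMD register holding four floats. */
typedef struct { float lane[4]; } vec4;

/* Legacy two-operand (destructive) form: dst = dst + src.
   The first source is overwritten, so it must be copied beforehand
   if its old value is still needed. */
static void add_2op(vec4 *dst, const vec4 *src) {
    for (size_t i = 0; i < 4; i++)
        dst->lane[i] += src->lane[i];
}

/* Three-operand (non-destructive) form: dst = a + b; both sources survive. */
static void add_3op(vec4 *dst, const vec4 *a, const vec4 *b) {
    for (size_t i = 0; i < 4; i++)
        dst->lane[i] = a->lane[i] + b->lane[i];
}

/* Four-operand form: dst = a * b + c; all three sources survive.
   Not fused here: the product is rounded before the add. */
static void madd_4op(vec4 *dst, const vec4 *a, const vec4 *b, const vec4 *c) {
    for (size_t i = 0; i < 4; i++)
        dst->lane[i] = a->lane[i] * b->lane[i] + c->lane[i];
}
```

With the two-operand form, a caller that still needs the original first source has to spend an extra register-to-register copy; the three- and four-operand forms make that copy unnecessary.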

In April of 2008, Intel published its AVX/FMA proposal, which incorporated several of SSE5's innovations - in particular the three- and four-operand capabilities, the Fused Multiply/Add instructions, and some of the permute instructions - though in a somewhat different form. This proposal also added some new capabilities with a new instruction format: doubling the width of SIMD FP operations, applying the non-destructive three-operand capability to most legacy SSE instructions, and greatly expanding the potential opcode space for future extensions.

With this duplication of functionality between SSE5 and AVX/FMA, and AVX's additional features, we felt the right thing to do was to support AVX. In our minds, a more unified instruction set is clearly what's best for developers and the x86 software industry. With our acceptance of AVX, a key aspect of this instruction set unification is the stability of the specification. Since we don't control the definition of AVX, all we can say for sure is that we expect our initial products to be compatible with version 5 of the specification (the most recent one, as of this writing, published in January of 2009), except for the FMA instructions, which we expect will be compatible with version 3 (published in August of 2008).

Why the FMA difference?  This was not something we did lightly.  In December of 2008, Intel made significant changes to the FMA definition, which we found we could not accommodate without unacceptable risk to our product schedules.  Yet we did not want to deprive customers of the significant performance benefits of FMA. So we decided to stick with the earlier definition, renaming it FMA4 (for four-operand FMA - Intel's newer definition uses what we believe to be a less capable three-operand, destructive-destination format).  It will have a different CPUID feature flag from Intel's FMA extension.  At some future point, we will likely adopt Intel's newer FMA definition as well, coexisting with FMA4.  But as you might imagine, we may wait until we're sure the specification is stable.
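
Because FMA4 and Intel's FMA are reported through different CPUID feature flags, software has to test for each one separately. The following is a minimal sketch in C of decoding those flags from raw CPUID register values. The bit positions are taken from later AMD and Intel CPUID documentation, not from this post, and the `simd_features` type and `decode_features` function are hypothetical helpers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as documented in later AMD and Intel CPUID references;
   they are an assumption relative to this post, which does not list them. */
#define CPUID_1_ECX_FMA          (UINT32_C(1) << 12) /* Intel three-operand FMA */
#define CPUID_1_ECX_AVX          (UINT32_C(1) << 28)
#define CPUID_80000001_ECX_XOP   (UINT32_C(1) << 11)
#define CPUID_80000001_ECX_FMA4  (UINT32_C(1) << 16) /* AMD four-operand FMA */

/* Hypothetical helper type collecting the decoded flags. */
typedef struct {
    bool avx, fma3, fma4, xop;
} simd_features;

/* Decode feature bits from ECX of CPUID leaf 1 and of leaf 0x80000001. */
static simd_features decode_features(uint32_t leaf1_ecx, uint32_t ext1_ecx) {
    simd_features f;
    f.avx  = (leaf1_ecx & CPUID_1_ECX_AVX) != 0;
    f.fma3 = (leaf1_ecx & CPUID_1_ECX_FMA) != 0;
    f.xop  = (ext1_ecx & CPUID_80000001_ECX_XOP) != 0;
    f.fma4 = (ext1_ecx & CPUID_80000001_ECX_FMA4) != 0;
    return f;
}
```

On GCC or Clang targeting x86, the two register values can be read with `__get_cpuid` from `<cpuid.h>`. A complete check would also confirm that the operating system has enabled YMM state saving, which this sketch leaves out.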

The fact remains that AVX does not incorporate all of SSE5's features.  Since SSE5 was based on months of discussions with ISVs on what sort of capabilities they felt were needed, and had been positively reviewed by the industry when we first put out the specification, we decided to follow through with development of these additional features.  To do so most effectively, we redefined them in the AVX framework, resulting in the XOP extension.

So, what's in XOP?

Well, quite a lot, really.  First of all, the instruction formatting was changed to leverage the capabilities that the AVX VEX prefix brings, using a new VEX-like three-byte prefix sequence called (interestingly enough) the XOP prefix.  This provides three- and four-operand non-destructive destination encoding, an expansive new opcode space, and extension of SIMD floating point operations to 256 bits.  The SSE5 operations that are retained by the XOP extension are:

  • Horizontal integer add/subtract: Signed or unsigned add, or signed subtract, of adjacent byte, word, or dword elements in the source vector to word, dword or qword elements of the destination vector. 128-bit.
  • Integer multiply/accumulate: Multiplies elements of two input vectors, adding the results to a third input vector. 128-bit.
  • Shift/rotate with per-element counts: These use a vector of shift counts, allowing each element of the source vector to be shifted or rotated by a different amount. There is also a rotate instruction with an immediate-byte single count applied to all elements. 128-bit.
  • Integer compare: Signed and unsigned comparison of byte, word, dword and qword elements, with predicate (mask) generation as in the various SSE compare instructions. The particular comparison to perform is specified in an immediate byte. 128-bit.
  • Byte permute: A powerful operation which copies bytes from two 16-byte input vectors to a 16-byte destination vector, optionally performing a selected transformation on each, under the control of a third input vector. 128-bit.
  • Bit-wise conditional move: Selects each bit of the destination vector from either of two input vectors, per a third input vector. 128- and 256-bit.
  • Fraction extract: Extracts the fractional part of floating point operands (for example, 0.75 from 3.75). Scalar and 128- or 256-bit vector, single and double precision.
  • Half-precision convert: These convert between half-precision and single-precision formats while loading or storing a four- or eight-element vector. They provide dynamic control of rounding and denormalized operand handling.  These particular instructions form a separate extension called CVT16, with a distinct CPUID feature flag.
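
Several of the operations in the list above can be modelled in plain scalar C, which may help when reading the specification. This is a rough sketch under stated assumptions rather than a definition of the instructions: the function names are hypothetical, only one element width is shown per operation, the multiply/accumulate is the wrapping (non-saturating) case, the rotate handles left rotation only, and the selector polarity of the conditional move is assumed.

```c
#include <stddef.h>
#include <stdint.h>

/* Horizontal add: sums adjacent pairs of signed bytes into wider words,
   so the result cannot overflow. */
static void hadd_bytes_to_words(int16_t dst[8], const int8_t src[16]) {
    for (size_t i = 0; i < 8; i++)
        dst[i] = (int16_t)(src[2 * i] + src[2 * i + 1]);
}

/* Integer multiply/accumulate: dst = a * b + c per 32-bit element.
   Unsigned arithmetic is used so that overflow wraps around. */
static void mul_acc_dwords(uint32_t dst[4], const uint32_t a[4],
                           const uint32_t b[4], const uint32_t c[4]) {
    for (size_t i = 0; i < 4; i++)
        dst[i] = a[i] * b[i] + c[i];
}

/* Rotate left with a separate count for each 32-bit element.
   Counts are reduced modulo 32; right rotation is not modelled. */
static void rotl_per_element(uint32_t dst[4], const uint32_t src[4],
                             const uint8_t count[4]) {
    for (size_t i = 0; i < 4; i++) {
        unsigned n = count[i] & 31u;
        dst[i] = n ? (src[i] << n) | (src[i] >> (32u - n)) : src[i];
    }
}

/* Bit-wise conditional move: each result bit comes from a where the
   selector bit is 1 and from b where it is 0 (assumed convention). */
static void bitwise_cmov(uint64_t dst[2], const uint64_t a[2],
                         const uint64_t b[2], const uint64_t sel[2]) {
    for (size_t i = 0; i < 2; i++)
        dst[i] = (a[i] & sel[i]) | (b[i] & ~sel[i]);
}
```

Each loop body is independent across elements, which is what lets a single vector instruction replace the whole loop.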

Along with the FMA4 instructions, these support a wide variety of numeric-intensive, multimedia, and cryptographic applications, and allow some new cases of automatic vectorization by compilers.  Speaking of compilers, plans are afoot to support these in intrinsic form in various compilers, and they may be used automatically in code generation in some cases.

A version of the AMD64 SimNow! simulator with support for these extensions is planned for availability in very short order.

I hope I've given you a good taste of these new features. For all the details on the XOP and FMA4 extensions, you can find the specification here. And, I encourage you to read the blog of our CMO, Nigel Dessau, for an executive perspective on driving innovation into the x86 instruction set. We believe we've struck the right balance between innovation and standardization. Feel free to comment or ask questions - we're always happy to hear from you. As you can see below, we've already heard from ten of our technology partners on the subject.

Dave Christie is a Fellow and senior architect at AMD. His postings are his own opinions and may not represent AMD's positions, strategies or opinions. Links to third party sites are provided for convenience and unless explicitly stated, AMD is not responsible for the contents of such linked sites and no endorsement is implied.

Partner Support Quotes

Absoft

"The addition of AVX support by AMD is a great move as it enables superior performance potential across AMD's x86 family of processors," said Wood Lotz, Absoft CEO. "AMD's use of AVX can also simplify development of high performance compilers and tools for companies like Absoft, and enable customers across a wide variety of industries to build faster applications."

Acumem

"Acumem fully supports AMD's adoption and enhancement of the AVX instructions and will follow this standard as it becomes available in the market. As an ISV for performance tools we clearly see potential for performance improvements with these new additions" said Mats Nilsson, VP Software Engineering at Acumem.

Axceleon

"Axceleon applauds AMD's efforts to support both specifications, AVX and SSE5, in their XOP specification proposal. The further enhancements in FMA4 which accelerate floating point algorithms are very important to Axceleon's HPC customers and will be welcomed across the board," said Mike Duffy, CEO of Axceleon.

Bibble Labs

"We at Bibble Labs are constantly looking for performance improvements, and as such we are investigating AVX because of the possible performance advantage it might bring. We also appreciate that AMD is taking an active role to ensure the instruction sets converge and not create separate, conflicting instruction sets," said Jeff Stephens, Vice President of Product Development, Bibble Labs.

Cakewalk

"We commend AMD for taking an active role in open standards, by unifying the x86 instruction set and merging SSE5 into the AVX specification. This can help improve compatibility and simplify the work for developers implementing this. We look forward to investigating AVX for potential advantages it may bring to our real-time applications and plug-ins," said Noel Borthwick, Chief Technology Officer, Cakewalk.

Nero

"We are pleased that AMD has decided to adopt the AVX instruction set extension instead of offering a variant," said Simone Hoefer, General Manager, Technology at Nero AG. "This will help reduce implementation complexity and multiple code-paths. We are confident that the SIMD (SSE/SSE2) optimizations already implemented will scale nicely to 256-bit/AVX, allowing us to truly embrace this new development."

Smith Micro Software

"Having to choose acceleration solutions that work well on both AMD and Intel CPU platforms, Smith Micro welcomes convergence of the x86 instruction set. AMD supporting AVX is desirable from Smith Micro's point of view," said Uli Klumpp, director of engineering, Smith Micro Software, Inc. "The AVX instruction set extensions are looking promising for further optimizing our computationally most demanding software, DCC and data compression products such as Poser and StuffIt."

Sonic Solutions

"AMD's adoption of AVX will help Sonic unify some of its engineering efforts and reduce development costs," said Jim Roth, Chief Technical Officer, Sonic Solutions. "We welcome this initiative and the proposed enhancements to the x86 processor architecture, which we will leverage to increase the responsiveness and performance of our digital media applications."

Sony Creative Software

"We are pleased that AMD has decided to adopt the AVX instruction set extension instead of offering a variant," said John Freeborg, Vice President of Engineering for Sony Creative Software. "We also appreciate that AMD is taking an active role to ensure these converge and do not create separate, conflicting instruction sets."



-------------------------
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.

Edited: 05/06/2009 at 10:38 AM by rex8664


    Posted By: Dave Christie @ 05/06/2009 09:07 AM     Inside Dev Central  

May 7, 2009

Comments


 

AMD deserves praise for this important move towards convergence and compatibility. The changes in the SSE5 spec have very much followed the lines I have argued heavily for in the AMD developer forum and elsewhere. It is impossible to maintain compatibility when the development cycle is several years and AMD and Intel both keep secrets from each other. It was therefore a very good move when AMD published their proposed SSE5 spec years in advance, a move that Intel responded to by also publishing their AVX proposal in advance. Let's hope this new tradition continues. The whole industry is crying out for compatibility. I can certainly imagine your reaction when Intel announced their last-minute change from FMA4 to FMA3!

Our worst nightmare would be Intel copying the XOP instructions but making them incompatible with AMD by replacing the XOP prefix by a VEX prefix. I hope you have legal remedies to prevent such a move.

Could you perhaps outline a roadmap for if or when the various instruction sets will be supported by AMD CPUs:

  • XOP
  • FMA4
  • CVT16
  • SSSE3
  • SSE4.1
  • SSE4.2
  • AVX non-destructive instructions
  • AVX 256-bit registers

How much of this will be supported in Bulldozer?


 Posted By: Agner Fog @ 05/07/2009 12:23 PM

 

I should have mentioned something about these.  We intend to support all of it, with the possible exception of CVT16, which might not appear in the initial AVX-compatible products (hence the separate feature flag).  We also intend to support Intel's AES and PCLMULQDQ extensions.  And of course XSAVE/XRSTOR to handle YMM context switching.




 Posted By: Dave Christie @ 05/07/2009 01:50 PM

May 11, 2009
 

That's good news. Many thanks to you for trying to maintain ISA compatibility between CPU vendors despite the often difficult circumstances, especially if one considers the problem of ever-changing, secret-until-publication specs for new instruction sets.

So, we may look forward to Bulldozer (more or less) unifying the x86 ISA again, while adding new ISA functionality in the form of the XOP instruction set.

Once again, many thanks for keeping those SSE5 instructions that solve simple problems for which the standard MMX/SSE instructions just did not offer any solution, like shifting different vector elements by different amounts.

However, while reading the new FMA4/CVT16/XOP specification, I noticed that some instructions have disappeared compared to the earlier SSE5 spec.

In the case of the COMPS/COMPD/COMSS/COMSD instructions, it's clear that they have become (mostly) redundant given the availability of AVX three-operand compares (VCMPPS/VCMPPD etc.).

But what happened to the PERMPS/PERMPD vector permutation instructions? There is still a slight reference to them (the "Related Instructions" entry for the VPPERM instruction lists VPERMPS and VPERMPD; perhaps someone forgot to delete it), but there is nothing to be found about these instructions in the rest of the text.

Given the clear superiority of these instructions over Intel's VPERMILxx AVX instructions or the earlier (V)SHUF[PS/PD], it would be very sad if they had to be sacrificed in order to have die space to implement AVX.

Being able to permute vector elements AND to apply basic operations like neg or abs or neg(abs) to them is sort of a "killer" feature, allowing for much simpler vectorization of many algorithms.

So, where did they go?

 


 Posted By: Johannes F. @ 05/11/2009 11:30 AM

May 12, 2009
 

The original SSE5 FP permute instructions are being replaced by the AVX VPERMIL2 instructions, which are part of XOP but were inadvertently left out of the document.  These were dropped by Intel along with their original FMA definition, so the functionality (and encoding) is as described in version 3 of Intel's AVX specification.  We expect to post an updated spec which will include these (and fix a few more typos) around the end of the month.




Edited: 05/12/2009 at 01:50 PM by rex8664

 Posted By: Dave Christie @ 05/12/2009 11:33 AM
