A new quick reference sheet is available that outlines some coding tips for Sun Studio on AMD64. It includes tips on compiler optimization flags to take advantage of the AMD "Barcelona" processor features.
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.
One of the new instruction sets introduced in the Third Generation AMD OpteronTM processor is Advanced Bit Manipulation (ABM), comprising two instructions that operate on general purpose registers: LZCNT and POPCNT. We'll first explore what POPCNT can do for you.
In almost every interview I have given to date, I have been asked the question, "How would you calculate the number of bits set in a given 32-bit word?" Of course, by the thirtieth time I was asked that question, I was finally able to figure out what answer to give, which wasn't very efficient. If you've tried to calculate this number yourself, or have tried to answer this question for others, I hope this discussion will be helpful because there are many ways to do it in software. One way is using lookup tables, which access memory, but multiple lookups are needed (unless you have a 4 GB table for all 32 bits!). Alternatively, you can use another common algorithm. Subtract one from the number, then perform the AND operation with the original number. Do this until the number is 0. The number of iterations it takes for the number to become zero is the number of bits set. A typical pop count function using this method would look like this:
int popcount(int x) { int popcount; for (popcount = 0; x; x = x & (x-1), popcount++); return popcount; }
This function is generic and can be applied to multiple integer types. If your integer size is limited, there are a few more techniques that are floating around (easily Googled) but none of them are as efficient as one instruction.
Before I describe POPCNT ("pop count" or population count), the first of two advanced bit manipulation instructions that are provided in the new AMD Family 10h processors, you might have the exact same question that I had the first several times I was asked this in an interview:
Why on earth would anyone need this?
As it turns out, counting the number of bits set in a word (a machine word, that is), can be quite useful. I started realizing this when I moved to using bit arrays for computations.
Let me give you a quick scenario. I have implemented an array which stores the results of a network transmit operation. Each element represents a true or a false, depending on whether that particular block transmitted correctly or not. I need to use this data to calculate how much packet loss I have experienced.
Let's say that block numbers 7, 32, and 62 were not transmitted. The values at array index 7, 32, and 62 would be set to 1 and the rest to 0. If I am transmitting megabytes of information, this array could grow very large and it would be using a minimum of 8 bits of storage for each 1 or 0 it needs to store (if I am using the smallest data type provided to me) unless I use a bit array.
If I use a bit array, my array becomes much smaller, which means that I need to do fewer memory accesses to traverse the entire array, less memory is being used, etc. The only problem is with accessing an element in this array. To see if a bit is set in the bit array, I need to read one chunk of the array into a word and then shift bit by bit to see if anything has failed.
Enter, pop count! Pop count would simply tell me how many bits were set in the word I've just accessed, with just one instruction! Let's take a look at the gain I realize by using POPCNT.
For 1MB of data with a 1k block size, I have 1,000 elements. Therefore, the number of instructions taken by each approach would have been:
Original [byte array based]: Execution: For each element, I need to read the byte value and check if it is 1. If it is, I need to increment my counter. Cost: 3 instructions [read, compare, increment] x 1000 = 3,000 Results: 3k instructions. Not a very good idea.
Bit array [without pop count]: Execution: For every 32 elements, I need to read one word, shift the bits out, check if the left-most bit is 1 or not (check the sign of the resultant number), and then increment my counter if the bit is 1. Cost: (1 read + 32 shifts + 32 compares + 32 increments) x (1000/32) = 3032 Results: Considering there are much fewer reads here, this approach would still be a lot faster because of a lot fewer memory accesses.
Bit array [with pop count]: Execution: For every 32 elements, I need to read one word, do one pop count, and increment my counter by the return value from the pop count. Cost: (1 read + 1 POPCNT + 1 add to the counter) x (1000/32) = 94 Results: Using the POPCNT instruction here gives me a whopping 32x reduction of instructions, representing a significant performance gain! This is with using 32-bit words. For 64 bits, there is even greater performance gain.
NOTE: There are other algorithms that could result in fewer instructions without using pop count, but we have chosen this x and x-1 approach because it is easily portable. Other algorithms that could perform this function faster often require a fixed number of bits, and hence are not suitable for all purposes. Even so, pop count is faster than the most optimal approach without pop count.
In addition to this specific scenario, there are several applications where pop count can substantially increase performance. Pop counts are used in cryptography (in fact, this instruction is also commonly called the 'canonical NSA instruction' because of the fact that the NSA refused to buy processors which didn't support this instruction), encoding/decoding, databases (for quickly assessing information about data), and many others. One application that I find POPCNT most useful for is to quickly calculate Hamming distances. A Hamming distance is essentially a measure of how different one word is from another. Remember, this is not how different the values held by the words are (we could just use a subtract instruction to find that out!) but how the words themselves differ. For machine words, it is defined as the number of bits that are different between the two words.
For example, take the following 8-bit words:
00110001 11010001 ^^^^^
The lower 5 bits, denoted by the carats, are the same; hence only three bits are different. Therefore, the Hamming distance between these two words is 3.
A POPCNT instruction can give us the Hamming weight of a word, which is the difference between a word and the base word in its class. Because the difference between any particular word and a word with all 0s is the number of bits which are set, that is exactly what POPCNT gives us!
Of course, this doesn't give us the Hamming distance directly, but that's easily fixed. All we have to do is zero out the common bits between the two words and the result holds only the bits that are different. Our friendly neighborhood XOR instruction can do that for us, leaving us with the following sequence of instructions for calculating the Hamming distance between two words:
; RAX and RBX contain the two words
mov rcx, rax xor rcx, rbx popcnt rcx, rcx
; RCX now contains the Hamming distance between RAX and RBX
Hamming distances can be used to calculate things like error in data, as in how much error exists. They can be used as thresholds in encoding or decoding of audio/video. In fact, any place where you need any sort of fuzzy logic, Hamming distances could be useful. There are many other potential applications, but too many to be covered here.
This covers the first of two new advanced bit manipulation instructions that are introduced with the new Family 10h architecture. This leaves us with another interesting instruction, LZCOUNT, which counts the number of leading zeros in a given word. But, I'll leave that for next time.
-Rahul Chaturvedi
-------------------------
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.
As we mentioned, last week, we started to talk about options for misaligned access features and option available pre-Orcas. In this week's post, we explore those options and continue our discussion about the movemask instruction.
Any SIMD instruction set has one inherent deficiency. Since all the operations are done in parallel, it is not possible to write code that checks individual values and branches to specific code. Simply put, we cannot have 'if' statements.
The movemask instruction is a particularly valuable instruction for this reason. It creates a bit mask based on the result of a comparison of the high bit of each value in the xmm register. By using a comparison operation, you can look at the data in an xmm register and then use movemask to bring the results into a regular x86 register. This can be used for branching and you essentially now have an 'if' statement, even with SSE code.
On K8 processors, this instruction didn't have 'significant' latency, but to make an effective 'if' statement, we needed to use the instruction in every loop iteration we had. Even despite the fact that it takes only 6 cycles on a K8, it is six extra cycles per loop. Additionally, it is a vector path instruction, which means that it blocks any other instructions to execute in parallel, when it itself is executing. On a super scalar architecture, this can mean a fairly big performance loss because, in those six cycles, , another two instructions could have potentially finished!
On f10h, this instruction has its latency reduced (though this reduction is simply a result of going to the 128-bit FPU) and more importantly, it is no longer a vector path! Hence, this now becomes a very workable instruction, even if used in every loop iteration. The new latency is one cycle more than a simple integer add instruction.
This leads to some additional code changes, particularly in division operations, which are extremely expensive and can be totally avoided if, let's say, all the values in the register holding the divisors are 1. Of course, our changes again went in with the namespaces, to ensure we still keep maximal optimization for both K8 and f10h processors.
These were the optimizations we could do on our end from within our C++ code. The next step is to see how we can leverage the compilers to give us better performance for f10h. That should be another interesting exercise. In the end, we will make sure that barring toolset and logistical restrictions, APL runs the fastest code possible on whichever processor it is running.
--
One of the new instruction sets introduced in the Third Generation AMD OpteronTM processor is Advanced Bit Manipulation (ABM), comprising two instructions that operate on general purpose registers: LZCNT and POPCNT. We'll first explore what POPCNT can do for you.
In almost every interview I have given to date, I have been asked the question, "How would you calculate the number of bits set in a given 32-bit word?" Of course, by the thirtieth time I was asked that question, I was finally able to figure out what answer to give, which wasn't very efficient. If you've tried to calculate this number yourself, or have tried to answer this question for others, I hope this discussion will be helpful because there are many ways to do it in software. One way is using lookup tables, which access memory, but multiple lookups are needed (unless you have a 4 GB table for all 32 bits!). Alternatively, you can use another common algorithm. Subtract one from the number, then perform the AND operation with the original number. Do this until the number is 0. The number of iterations it takes for the number to become zero is the number of bits set. A typical pop count function using this method would look like this:
int popcount(int x) { int popcount; for (popcount = 0; x; x = x & (x-1), popcount++); return popcount; }
This function is generic and can be applied to multiple integer types. If your integer size is limited, there are a few more techniques that are floating around (easily Googled) but none of them are as efficient as one instruction.
Before I describe POPCNT ("pop count" or population count), the first of two advanced bit manipulation instructions that are provided in the new AMD Family 10h processors, you might have the exact same question that I had the first several times I was asked this in an interview:
Why on earth would anyone need this?
As it turns out, counting the number of bits set in a word (a machine word, that is), can be quite useful. I started realizing this when I moved to using bit arrays for computations.
Let me give you a quick scenario. I have implemented an array which stores the results of a network transmit operation. Each element represents a true or a false, depending on whether that particular block transmitted correctly or not. I need to use this data to calculate how much packet loss I have experienced.
Let's say that block numbers 7, 32, and 62 were not transmitted. The values at array index 7, 32, and 62 would be set to 1 and the rest to 0. If I am transmitting megabytes of information, this array could grow very large and it would be using a minimum of 8 bits of storage for each 1 or 0 it needs to store (if I am using the smallest data type provided to me) unless I use a bit array.
If I use a bit array, my array becomes much smaller, which means that I need to do fewer memory accesses to traverse the entire array, less memory is being used, etc. The only problem is with accessing an element in this array. To see if a bit is set in the bit array, I need to read one chunk of the array into a word and then shift bit by bit to see if anything has failed.
Enter, pop count! Pop count would simply tell me how many bits were set in the word I've just accessed, with just one instruction! Let's take a look at the gain I realize by using POPCNT.
For 1MB of data with a 1k block size, I have 1,000 elements. Therefore, the number of instructions taken by each approach would have been:
Original [byte array based]: Execution: For each element, I need to read the byte value and check if it is 1. If it is, I need to increment my counter. Cost: 3 instructions [read, compare, increment] x 1000 = 3,000 Results: 3k instructions. Not a very good idea.
Bit array [without pop count]: Execution: For every 32 elements, I need to read one word, shift the bits out, check if the left-most bit is 1 or not (check the sign of the resultant number), and then increment my counter if the bit is 1. Cost: (1 read + 32 shifts + 32 compares + 32 increments) x (1000/32) = 3032 Results: Considering there are much fewer reads here, this approach would still be a lot faster because of a lot fewer memory accesses.
Bit array [with pop count]: Execution: For every 32 elements, I need to read one word, do one pop count, and increment my counter by the return value from the pop count. Cost: (1 read + 1 POPCNT + 1 add to the counter) x (1000/32) = 94 Results: Using the POPCNT instruction here gives me a whopping 32x reduction of instructions, representing a significant performance gain! This is with using 32-bit words. For 64 bits, there is even greater performance gain.
NOTE: There are other algorithms that could result in fewer instructions without using pop count, but we have chosen this x and x-1 approach because it is easily portable. Other algorithms that could perform this function faster often require a fixed number of bits, and hence are not suitable for all purposes. Even so, pop count is faster than the most optimal approach without pop count.
In addition to this specific scenario, there are several applications where pop count can substantially increase performance. Pop counts are used in cryptography (in fact, this instruction is also commonly called the 'canonical NSA instruction' because of the fact that the NSA refused to buy processors which didn't support this instruction), encoding/decoding, databases (for quickly assessing information about data), and many others. One application that I find POPCNT most useful for is to quickly calculate Hamming distances. A Hamming distance is essentially a measure of how different one word is from another. Remember, this is not how different the values held by the words are (we could just use a subtract instruction to find that out!) but how the words themselves differ. For machine words, it is defined as the number of bits that are different between the two words.
For example, take the following 8-bit words:
00110001 11010001 ^^^^^
The lower 5 bits, denoted by the carats, are the same; hence only three bits are different. Therefore, the Hamming distance between these two words is 3.
A POPCNT instruction can give us the Hamming weight of a word, which is the difference between a word and the base word in its class. Because the difference between any particular word and a word with all 0s is the number of bits which are set, that is exactly what POPCNT gives us!
Of course, this doesn't give us the Hamming distance directly, but that's easily fixed. All we have to do is zero out the common bits between the two words and the result holds only the bits that are different. Our friendly neighborhood XOR instruction can do that for us, leaving us with the following sequence of instructions for calculating the Hamming distance between two words:
; RAX and RBX contain the two words
mov rcx, rax xor rcx, rbx popcnt rcx, rcx
; RCX now contains the Hamming distance between RAX and RBX
Hamming distances can be used to calculate things like error in data, as in how much error exists. They can be used as thresholds in encoding or decoding of audio/video. In fact, any place where you need any sort of fuzzy logic, Hamming distances could be useful. There are many other potential applications, but too many to be covered here.
This covers the first of two new advanced bit manipulation instructions that are introduced with the new Family 10h architecture. This leaves us with another interesting instruction, LZCOUNT, which counts the number of leading zeros in a given word. But, I'll leave that for next time.
-Rahul Chaturvedi
-------------------------
This response is provided for informational purposes only, is provided “AS IS” and does not obligate AMD to provide any of the services, technology, or programs described.
Edited: 11/16/2007 at 12:40 PM by AMD Developer Blogs Moderator
The new AMD "Barcelona" processors introduce dramatically improved numerical performance when using the standard SSE, SSE2 and SSE3 instruction extensions. Previous AMD processors typically could execute a vectorized SSE instruction (for example, MULPS to perform four multiplies) every two clock cycles. In the AMD "Barcelona" processor, this performance is doubled so a new vectorized SSE instruction like MULPS can typically be issued every cycle. This feature is called SSE128 because an entire 128-bit SSE register is processed on each clock tick. A detailed discussion of SSE128 can be found in the article "SSE128: AMD's New Floating-Point Enhancements."
Furthermore, with separate pipelines for add-class and multiply-class instructions, the new processor has a theoretical peak throughput of 8 single-precision floating point calculations per clock cycle. Integer SSE instructions get a similar boost. For complete timing details on all the instructions, see the Software Optimization Guide for AMD Family 10h Processors appendix C.
The easiest way to realize the benefits of SSE128 in real applications is to leverage existing library code which has been optimized using vectorized SSE instructions. The AMD Performance Library (APL) is one such library, providing a collection of popular software routines designed to accelerate application development, debugging, and optimization on x86 class processors to provide a quick path to high performance development. Also, the new release of the AMD Core Math Library (ACML), version 4.0, features new kernels tuned for great performance on the new processors. Specifically, DGEMM, SGEMM and CFFT have all been optimized to take advantage of the improved floating point performance. Another new feature of ACML 4.0 is the upgrade of the LAPACK routines to the new LAPACK 3.1. Many of these LAPACK routines have been optimized with OpenMP to take advantage of AMD's new quad-core processors. ACML will continue to improve, with more optimized functions in future releases.
Intermediate or advanced programmers can write custom vectorized SSE code to improve performance. Using Microsoft's Compiler Intrinsic functions for SSE, developers can write one version of SSE code that compiles for both 32-bit and 64-bit native platforms, something which is not possible using pure assembly code. See the article "Performance Optimization of 64-bit Windows Applications for AMD Athlontm 64 and AMD Opterontm Processors using Microsoft Visual Studio 2005" for an easy-to-follow tutorial with demo code showing some examples using Microsoft Visual Studio 2005.
-------------------------
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.
Recently, AnandTech published an article in response to AMD's unveiling of the SSE5 instruction set specifications. Read the full article here. AMD Developer Central checked in with Frank Gorishek, AMD's Director of Software Development Engineering, for his reaction. Frank offered a couple of points as a follow-up to the article:
AMD is publishing the specification for SSE5 well in advance of the supporting hardware release so that we can get feedback from the software community. We feel that it is critical to get real-world input from our constituents in order to ensure that the instruction set is useful and valuable. This is especially important given the new capabilities and possibilities that SSE5 presents. This is the same approach AMD took with the AMD64 instruction set architecture and the industry is better off for it. It does not help us, our customers, or the industry to release an instruction set that is developed in an environment devoid of community feedback and revealed only when hardware support is already imminent.
This disclosure also brings the discussions about new instruction architecture out into the open where it belongs. The x86 instruction set is an industry standard and is bigger than any one company. AMD seeks a role in fostering collaborative innovation that benefits the whole software community, and welcomes an open conversation with anyone who would like to contribute their perspectives.
What do you think?
-------------------------
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.
Sideband Stack Optimizer is one of many of the AMD "Barcelona" processor's evolutionary "CPU Core IPC improvement" features. The Sideband Stack Optimizer is special circuitry in the core that tracks the value that the stack-pointer (RSP) assumes, allowing parallel execution of more than one PUSH or POP instruction. This is typically implemented by modifying epilogue and prolog code to utilize PUSH/POP instead of explicit references via RSP.
Motivations:
Chains of pushes and pops are dependent through RSP (i.e. breaks serial dependence chains for consecutive PUSH/POPs)
Can remove dependency by tracking RSP changes in a sideband register a.k.a. Stack-Pointer Delta or "SPd"
RSP adjustments then don't require functional unit bandwidth (no uops)
Basic operation:
Converts PUSH ops into pure stores (i.e. Save a pass through the functional unit)
Converts POP ops into pure loads (i.e. Save an op)
Also optimizes performance of CALL and RET instructions
For the software developer, this invokes preference for small push/pop over larger explicit store/load instructions to promote code density optimization.
Examples:
Replace this : MOV reg, [RSP+disp8] (4 bytes) or this : MOV reg, [RSP+disp32] (8 bytes) With this : POP reg (1 byte)
Replace this : MOV [RSP+disp8], reg (4 bytes) or this : MOV [RSP+disp32], reg (8 bytes) With this : PUSH reg (1 byte)
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.
Just wanted to let you all know that we have launched a new content section on Developer Central where you can find code samples and sample applications for optimizing code. You'll find it under the Technical Articles section. Hope you find it useful!
-------------------------
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.
Writing SIMD code poses several complications. Doing 2 to 16 operations with one instruction is a powerful feature, but unless you have enough support instructions to get your data back and forth between the registers and memory, you may not always be utilizing the full potential that SIMD offers.
The SSE2 and SSE3 instruction sets include many instructions to help with this. They include packs and unpacks, shuffles, partial register move instructions, and more. These are pretty big sets, so for SSE4a to actually provide an improvement seemed like a difficult proposition. So I looked at the instructions a bit closer, and learned about the new bit field insertion and extraction instructions.
Before I start, let me mention that both of the following instructions work only on the lower 64 bits of the registers they deal with, and the upper 64 bits are undefined. When using these instructions, keep in mind that to access the upper half of any register, you'd have to shift the bits down by 64 and then do the required processing.
EXTRQ: Extract Field from Register
This instruction basically extracts a particular set of bits from one register and moves them to the register's least significant position. For example, if you want to have only the third 16 bit value in the xmm register, xmm0 (bits [47:32]) extracted and left at bits [15:0], you would use the EXTRQ instruction, and in this way,
EXTRQ xmm0, 16, 32
The first thought that came to my mind when I saw this instruction was, "Can't I do the same thing just using a shift instruction? Okay, that wouldn't clear out the rest of the bits in this 64 bit half, but I could do a mask, and then a shift...but then I'd have to use an extra register for the mask. Well, I could do two shifts, one left and one right, but then that would be two instructions."
Anyway, you get the idea. EXTRQ can be fairly useful, but not essential. Now INSERTQ, that comes close.
INSERTQ: Inserts Field from a source Register to a destination Register
This instruction takes a set of bits from one register and places those bits at ANY offset you specify (within 64 bits of course) within the destination register. For example, if you want to take a 16 bit value from xmm0 and move it to the third 16 bit value of xmm1, you would do,
INSERTQ xmm1, xmm0, 16, 32
But if you didn't have this instruction and wanted to accomplish the same thing, what would you do?
The quickest way would be to have a mask at bits [31:16] in one register, and use that mask to zero out those bits in xmm1. Then you'd have to shift the data in xmm0 to the correct location, and then merge those bits into xmm1.
So essentially INSERTQ is doing the job of three instructions in one!
If you want to do this entire process for arbitrary bit positions (in case you want to insert or extract different bits, based on other computation), you would add one more instruction here, because now the mask will also have to be shifted in place before you do the PAND. Further, if the 'source' register has more data than just the value you want to insert, then that would involve one more PAND to zero out the rest of the unwanted bits. If you put both together, you'd need to add ONE more shift for moving the mask which will zero out the bits in the 'source' register.
This means that, in order to do what INSERTQ provides - inserting a value in a register at any location, based on value stored in a register - you could potentially need to use a grand total of six SSE instructions.
If you think about it, you'll probably find a lot of places in your code where this INSERTQ instruction could save you significant time and complexity.
There are two more instructions in the SSE4a instruction set that add some more convenience - the partial stream (non-temporal hint store) instructions for floating point values. Look for future posts covering these topics.
-Rahul C.
-------------------------
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.
Most people who have done SIMD programming would tell you how invaluable the shuffle instruction is. It is used extensively to get data in the correct slots in the 128-bit XMM register. This is especially true for complex number operations. Unfortunately, this instruction also has a very high cost on the K8 processors. On f10h though, this instruction is now blazing fast. However, at the same time, the instructions we were using in lieu of using shuffles (movhlps, movlhps in particular), are now going to be much slower on f10h!
Our dilemma is how to ensure that we still maintain optimal speed on both the K8 processors and the new f10h processors.
In the end, we decided to abstract everything down to a 'conceptual' shuffle. The data movement within a register achieved by the movhlps and movlhps instructions is also possible by using a shuffle instruction This is true with almost any data reorganization instruction that works on data sizes greater than 16 bits. We scoured through our code, looking for movhlps, movlhps, and any other instructions that we had been using instead of the shuffle instructions. We replaced the intrinsic calls to all of these instructions with inline function calls to our own set of abstracted 'shuffle' functions. We scoped these calls with a preprocessor macro that was defined to point to the namespace that the shuffle calls for that particular processor.
The code looks something like:
OPTIMIZATION::Shuffle_a0b0a1b1(xmmreg1, xmmreg2)
The shuffle function call is defined in two different namespaces. One namespace holds the K8 optimized shuffle code and the other holds f10h optimized code.
Now, to compile the source file for K8 optimizations, we'd set the OPTIMIZATION pre-processor macro to the name of the namespace which holds the K8 optimized code. If we want to compile the same source file for f10h optimizations, we set the processor macro accordingly to point to the namespace containing the f10h optimized shuffles.
This solved our shuffle problem. Next, we needed to make sure we were taking advantage of the misaligned access and changed instruction scheduling.
Unfortunately, Microsoft hasn't yet released its new compiler, code named, 'Orcas'. This compiler contains support for features in the f10h processors. The only way to leverage the misaligned access feature was if we wrote code in assembly, which APL does only in cases where we get a significant advantage. Even when going to assembly can give us even a 10-15% advantage, something that would be rather good for an optimization, because of portability and maintenance issues, we generally avoid it. Since this wasn't a case that would justify going to assembly, we moved onto the last, but not least, optimization on our list for f10h. More on that next time.
--
One of the new instruction sets introduced in the Third Generation AMD OpteronTM processor is Advanced Bit Manipulation (ABM), comprising two instructions that operate on general purpose registers: LZCNT and POPCNT. We'll first explore what POPCNT can do for you.
In almost every interview I have given to date, I have been asked the question, "How would you calculate the number of bits set in a given 32-bit word?" Of course, by the thirtieth time I was asked that question, I was finally able to figure out what answer to give, which wasn't very efficient. If you've tried to calculate this number yourself, or have tried to answer this question for others, I hope this discussion will be helpful because there are many ways to do it in software. One way is using lookup tables, which access memory, but multiple lookups are needed (unless you have a 4 GB table for all 32 bits!). Alternatively, you can use another common algorithm. Subtract one from the number, then perform the AND operation with the original number. Do this until the number is 0. The number of iterations it takes for the number to become zero is the number of bits set. A typical pop count function using this method would look like this:
int popcount(int x) { int popcount; for (popcount = 0; x; x = x & (x-1), popcount++); return popcount; }
This function is generic and can be applied to multiple integer types. If your integer size is limited, there are a few more techniques that are floating around (easily Googled) but none of them are as efficient as one instruction.
Before I describe POPCNT ("pop count" or population count), the first of two advanced bit manipulation instructions that are provided in the new AMD Family 10h processors, you might have the exact same question that I had the first several times I was asked this in an interview:
Why on earth would anyone need this?
As it turns out, counting the number of bits set in a word (a machine word, that is), can be quite useful. I started realizing this when I moved to using bit arrays for computations.
Let me give you a quick scenario. I have implemented an array which stores the results of a network transmit operation. Each element represents a true or a false, depending on whether that particular block transmitted correctly or not. I need to use this data to calculate how much packet loss I have experienced.
Let's say that block numbers 7, 32, and 62 were not transmitted. The values at array index 7, 32, and 62 would be set to 1 and the rest to 0. If I am transmitting megabytes of information, this array could grow very large and it would be using a minimum of 8 bits of storage for each 1 or 0 it needs to store (if I am using the smallest data type provided to me) unless I use a bit array.
If I use a bit array, my array becomes much smaller, which means that I need to do fewer memory accesses to traverse the entire array, less memory is being used, etc. The only problem is with accessing an element in this array. To see if a bit is set in the bit array, I need to read one chunk of the array into a word and then shift bit by bit to see if anything has failed.
Enter, pop count! Pop count would simply tell me how many bits were set in the word I've just accessed, with just one instruction! Let's take a look at the gain I realize by using POPCNT.
For 1MB of data with a 1k block size, I have 1,000 elements. Therefore, the number of instructions taken by each approach would have been:
Original [byte array based]: Execution: For each element, I need to read the byte value and check if it is 1. If it is, I need to increment my counter. Cost: 3 instructions [read, compare, increment] x 1000 = 3,000 Results: 3k instructions. Not a very good idea.
Bit array [without pop count]: Execution: For every 32 elements, I need to read one word, shift the bits out, check if the left-most bit is 1 or not (check the sign of the resultant number), and then increment my counter if the bit is 1. Cost: (1 read + 32 shifts + 32 compares + 32 increments) x (1000/32) = 3032 Results: Considering there are much fewer reads here, this approach would still be a lot faster because of a lot fewer memory accesses.
Bit array [with pop count]: Execution: For every 32 elements, I need to read one word, do one pop count, and increment my counter by the return value from the pop count. Cost: (1 read + 1 POPCNT + 1 add to the counter) x (1000/32) = 94 Results: Using the POPCNT instruction here gives me a whopping 32x reduction of instructions, representing a significant performance gain! This is with using 32-bit words. For 64 bits, there is even greater performance gain.
NOTE: There are other algorithms that could result in fewer instructions without using pop count, but we have chosen this x and x-1 approach because it is easily portable. Other algorithms that could perform this function faster often require a fixed number of bits, and hence are not suitable for all purposes. Even so, pop count is faster than the most optimal approach without pop count.
In addition to this specific scenario, there are several applications where pop count can substantially increase performance. Pop counts are used in cryptography (in fact, this instruction is also commonly called the 'canonical NSA instruction' because of the fact that the NSA refused to buy processors which didn't support this instruction), encoding/decoding, databases (for quickly assessing information about data), and many others. One application that I find POPCNT most useful for is to quickly calculate Hamming distances. A Hamming distance is essentially a measure of how different one word is from another. Remember, this is not how different the values held by the words are (we could just use a subtract instruction to find that out!) but how the words themselves differ. For machine words, it is defined as the number of bits that are different between the two words.
For example, take the following 8-bit words:
00110001 11010001 ^^^^^
The lower 5 bits, denoted by the carats, are the same; hence only three bits are different. Therefore, the Hamming distance between these two words is 3.
A POPCNT instruction can give us the Hamming weight of a word, which is the difference between a word and the base word in its class. Because the difference between any particular word and a word with all 0s is the number of bits which are set, that is exactly what POPCNT gives us!
Of course, this doesn't give us the Hamming distance directly, but that's easily fixed. All we have to do is zero out the common bits between the two words and the result holds only the bits that are different. Our friendly neighborhood XOR instruction can do that for us, leaving us with the following sequence of instructions for calculating the Hamming distance between two words:
; RAX and RBX contain the two words
mov rcx, rax xor rcx, rbx popcnt rcx, rcx
; RCX now contains the Hamming distance between RAX and RBX
Hamming distances can be used to calculate things like error in data, as in how much error exists. They can be used as thresholds in encoding or decoding of audio/video. In fact, any place where you need any sort of fuzzy logic, Hamming distances could be useful. There are many other potential applications, but too many to be covered here.
This covers the first of two new advanced bit manipulation instructions that are introduced with the new Family 10h architecture. This leaves us with another interesting instruction, LZCOUNT, which counts the number of leading zeros in a given word. But, I'll leave that for next time.
-Rahul Chaturvedi
-------------------------
This response is provided for informational purposes only, is provided “AS IS” and does not obligate AMD to provide any of the services, technology, or programs described.
Edited: 11/16/2007 at 12:41 PM by AMD Developer Blogs Moderator
MONITOR and MWAIT are two separate instructions that are used together to monitor a range of linear memory. MONITOR tells the processor what address range to watch for a STORE instruction. MWAIT hints to the processor that it may enter an implementation dependent power state while waiting for that cache line within the address range to be written to, halting most activity in the core while doing so. These instructions allow an operating system or application to realize incremental power saving features beyond what the built-in Fire-and-Forget AMD PowerNow!TM feature provides, and can also be used to eliminate polling (the situation where an address is regularly checked for changes to data). Since MWAIT is triggered by a STORE instruction, it can also be used in creative ways with devices that handle IO through memory addresses, i.e. data sent to a device or data written to memory from a device would cause the processor to return to a higher power state.
The following sample code shows typical usage of a MONITOR/MWAIT pair:
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.
Instruction-Based Sampling (IBS) is a performance monitoring technique that provides precise information about AMD64 instruction fetch behavior and about the execution of operations that are issued from AMD64 instructions. This information can be used to analyze and improve the performance of programs executing on AMD "Barcelona" processors.
IBS provides four important advantages over conventional performance counter sampling:
Hardware events are attributed precisely to the instructions that cause the events. Conventional performance counter sampling is not precise making it difficult, if not impossible, to attribute events to specific instructions. This limits the ability to pin-point performance issues at the instruction and source code levels.
A wide range of events are monitored and collected with each IBS sample. Either multiple sampling runs or counter multiplexing must be used to collect the same range of information with conventional performance counter sampling.
The virtual and physical addresses of load/store operands are collected. Profiling tools can use this information to associate specific data structures with x86 instructions performing load/store operations.
Latency is measured for key performance parameters such as data cache miss latency.
The precision afforded by IBS also enables automated optimization techniques (e.g., profile-directed optimization) which require detailed, precise information about instruction-level program behavior.
More information will be forthcoming in an appendix to be added to the Software Optimization Guide for AMD Family 10h Processors.
IBS support is included in the latest beta release of AMD CodeAnalyst, version 2.7.4.
-------------------------
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.
I decided I needed to rebuild my Windows cluster to test out some automation work I was doing. In rebuilding the nodes and configuring MOSS (Microsoft Office SharePoint Server 2007) I noticed that I was repeatedly getting the following error messages in the Event Log. The error would happen immediately after 'starting' the "Windows SharePoint Services Help Search" service under Central Admin -> Operations.
Event Type: Error Event Source: Windows SharePoint Services 3 Search Event Category: Gatherer Event ID: 2424 Date: 9/7/2007 Time: 11:52:50 AM User: N/A Computer: HPC-Headnode Description: The update cannot be started because the content sources cannot be accessed. Fix the errors and try the update again.
Context: Application 'Search', Catalog 'index file on the search server Search'
Oddly enough, while researching this further I discovered that I was also getting the following error:
Event Type: Error Event Source: Office SharePoint Server Event Category: Launcher Service Event ID: 6062 Date: 9/7/2007 Time: 11:43:10 AM User: N/A Computer: HPC-Headnode Description: Found 2 valid IP addresses for this machine. Choosing this one: 172.204.98.255
After a lot of trial and error (and hair pulling) I finally realized that what was happening was that the my network configuration was busted.
My Headnode, which is also the server hosting MOSS Central Admin, and like all my HPC nodes, is dual-homed. Under the Advanced Settings section in network connections, I had this network configured as my default. However, the Public network has no knowledge of nor support for my cluster network. This means that there is no DNS or AD integration of my cluster network into the larger corporate network. Thus any attempt to connect to my SharePoint site on this network wtihout specifically knowing the IP of the public-facing NIC would fail.
So what was happening then was that MOSS was attempting to connect back to itself across the Public network interface to start the search process but was failing because a DNS lookup was failing.
To correct this I had to, first, completely uninstall MOSS (I didn't discover a way to get past the 6062 error apart from reinstalling. In my case, once I got that election error where MOSS chose the Public interface I didn't find a way to get it to re-choose the Private interface instead).
Secondly. the key to my sucess though was to go to Advanced Settings for my network connections and under the connection area of the Adapters and Bindings tab, I changed the connection order so that the private network was the first one on the list.
Now, once I reinstalled MOSS I was getting the 6062 error again, but this time it was showing me the private interface and the Gatherer service was starting correctly and Search was finally working.
-------------------------
This response is provided for informational purposes only, is provided “AS IS” and does not obligate AMD to provide any of the services, technology, or programs described.
Edited: 09/10/2007 at 02:12 PM by john.mccrae-at-amddotcom
To use or not to use CPUID, that is the question. CPUID is an instruction that tells you what features of a processor are supported. This instruction definitely has a time and place to be used and not to be used. For general optimizations, CPUID can be a crucial step that ensures compatibility in your code, but for multithreaded code, you're much better off using OS methods to detect the processor topology.
As newer processors enter the market, they come loaded with new software visible features. Take SSE4a, for example, a new set of SIMD instructions included in AMD "Barcelona" processors, but not included in any previous AMD processors. If you are doing optimizations and want to use those instructions, it's highly recommended to use CPUID to first determine if the processor supports SSE4a, and then take the appropriate code path depending on the result of the operation. Adding this crucial step ensures that your code doesn't break if it is run on an older processor without SSE4a instructions. Check out the CPUID Specification for details on how to use CPUID properly to detect various features.
When it comes to parallel programming, processor topology is key to writing optimized multithreaded code. Things like cache architecture (size, levels, shared or not shared, etc.), NUMA rules, and number of cores are some of the main things that could affect your code. The best method for determining processor topology is to use the methods provided by the operating system. For example, in Windows, the API call GetLogicalProcessorInformation allows an application to learn about the machine's configuration, including multi-core and NUMA. Relying on operating system APIs to enumerate topology information is not only easier from a coding standpoint, but in a virtualized environment not all processors or nodes may be available for use - something that CPUID may not properly reflect.
-------------------------
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.
The new shared L3 cache featured in AMD "Barcelona" processors is considered by many platform level software developers to be one of the two most important features of the platform. These top two features, quad-core and L3 cache, together mean that the performance characteristics of applications that you previously profiled (on the original x86-64 family of AMD processors) will be different due to the enhanced microarchitecture and additional processor resources available. This exposes new opportunities for performance improvement if a new performance study is done. For such a performance study, consider the AMD CodeAnalyst Performance Analyzer for Windows® and Linux® that has been specifically tuned for AMD "Barcelona" (Family 10h) processors, including Third-Generation AMD OpteronTM processors (i.e. reads and interprets AMD "Barcelona" processor hardware event counters). Check out the article Barcelona's Innovative Architecture Is Driven by a New Shared Cache and Chapter 5 of the Software Optimization Guide for AMD Family 10h Processors for more details on what this new L3 cache means for software developers.
-------------------------
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.
Working on the AMD Performance Library, our life is SIMD (Single Instruction, Multiple Data). An 8-bit add operation takes one cycle; doing sixteen 8-bit add operations using an integer add, SSE instruction (which works on 128 bits at a time), takes four cycles. The advantage of an 8-bit add operation is enormous, so if an operation can be done in parallel, we make sure that it utilizes an instruction from one of our SIMD instruction sets: SSE/SSE2/SS3.
The family 10h processors (f10H), initially unveiled as "Barcelona", has several new and exciting changes in its FPU, which results in vast improvements for SIMD instructions. In APL, our aim was to leverage these changes and other new features in order to give the best performance possible to our users. Here's how we did it.
The first major improvement on the f10h processors is that the FPU has been widened from 64 bits to 128 bits. Let's look at what this means.
Previously, a SSE 'add' instruction (let's take PADDW), would have taken 4 cycles to execute. According to the Software Optimization Guide for the K8 family of processors, the latency for this operation is only 2 cycles, but because the FPU used to be 64 bits, each 128-bit SSE instruction was broken into two, and each part was executed separately. So straight out, most of our code runs close to twice as fast on a Barcelona machine without us touching a thing!
Life can't be that easy though, so we started going through the instruction set with a fine-toothed comb. Now most instructions in the SSE instruction set have similar latencies but several instructions have drastically different latencies (when optimizing, 4 cycles can be significant). There are always instructions that a developer writing SSE optimized code needs to avoid simply because there are sometimes two (or more) other instructions that can do the job in less time. One particular example of this is the shuffle instructions on the K8 processor, which we will discuss in our next post.
--Rahul Chaturvedi
-------------------------
This response is provided for informational purposes only, is provided “AS IS” and does not obligate AMD to provide any of the services, technology, or programs described.
Edited: 11/16/2007 at 04:50 PM by AMD Developer Blogs Moderator
Back in June, AMD Developer Central published a feature article detailing, for the first time, AMD's SSE128 extensions (aka 'FP128') for the Barcelona processor platform. With the impending release of Third-Generation AMD Opteron(TM) processors, we bring you some errata and updates.
This paragraph just below Table 1, on page 2 of the article, is slightly incorrect. AMD Opteron processors, new or old, do not perform 128-bit floating point operations. It is important to understand what the contents of a 128-bit SSE register represent. For the case of double precision data, a register, say XMM0, holds one or two 64-bit double precision values. For single precision, XMM0 holds one or four 32-bit single precision values.
This is true for both K8 and for the new Third-Generation Opteron.
The paragraph is correct in saying that K8 operates on 64-bit chunks. K8 has one multiply unit and one add unit. Each of these could retire one double precision operation per clock, for a maximum throughput of 2 FLOPS per cycle. This was the maximum throughput regardless of whether the code was scalar (e.g. MULSD), or vectorized (MULPD).
For single precision on K8, the story is a little different. The K8 single precision scalar instructions can retire one add operation and one multiply operation (see the optimization guide to determine which execution unit is used for a particular instruction) per cycle providing 2 single precision FLOPS per cycle. But for the vector instructions, the K8 single precision units can process two of the 4 values at a time. This means that vector single precision ops on K8 can run at 4 FLOPS per cycle.
Think of an SSE register of being in two halves, upper and lower. K8 works on these two halves independently, and only one can be retired at a time (per execution unit).
The Third-Generation Opteron floating point unit can now operate on both halves at the same time. The new add and multiply units can retire both 64-bit halves of the 128-bit register each clock. (This is not the same as doing 128-bit calculations aka quad precision). The net result is that you can now process 2 DP adds and 2 DP multiplies per clock, or 4 FLOPS per cycle. And for single precision math, you now can run 8 FLOPS per cycle!
--Chip F.
-------------------------
This response is provided for informational purposes only, is provided “AS IS” and does not obligate AMD to provide any of the services, technology, or programs described.
The new AMD "Barcelona" processors, also known as Family 10h, and including Third-Generation AMD OpteronTM processors, contain an array of innovations in processor design and features, including a number of software visible features that can be leveraged to make your applications perform better and be ready to scale across multiple cores. This "AMD "Barcelona" (Family 10h) / Third-Generation AMD Opteron Processor Software Visible Features" blog series summarizes a few key features and provides some insight beyond the information found in datasheets and manuals.
-------------------------
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.
Welcome to the AMD Libraries blog, which comes direct to you from the AMD Performance Libraries team. Here, we will discuss emerging trends, best practices and insider tips for optimizing code for stellar performance using AMD's performance and optimization routines and libraries. Follow along as the APL (AMD Performance Library) team overcomes challenges with Streaming SIMD Extensions (SSE) programming. Listen in as ACML (AMD Core Math Library) experts provide guidance on AMD's new 128-bit floating point enhancements, as well as other features that take advantage of enhanced features of the AMD "Barcelona" processor platform. Receive early guidance about new features, functions and development tools - well before they hit the street.
More than anything, we are here to engage in a dialog with our user communities. We look forward to your comments, opinions and interactions.
-------------------------
This response is provided for informational purposes only, is provided “AS IS” and does not obligate AMD to provide any of the services, technology, or programs described.
While installing and testing a couple of variations of SharePoint 2007 I doscovered that my services were not starting and running correctly. Upon further inspection of the event logs, I discovered that I was getting a DCOM error event isd 10016. The end result was that the service account I was using for my SharePoint installation lacked sufficient privileges to access the IIS WAMREG Admin Service.
To correct this I went to COM Services -> COM Config -> IIS WAMREG admin Service. Right click on it -> Properties -> Security -> Launch and Activation Permissions -> Edit. Then add your service account and give it permissions for both local and remote activation and launch.
-------------------------
This response is provided for informational purposes only, is provided “AS IS” and does not obligate AMD to provide any of the services, technology, or programs described.