AMD Developer Central
AMD Developer Blogs - AMD “Istanbul” (Family 10h) Processor Software Visible Features
July 21, 2009
  HT Assist - what is it?

Scalable Performance with HyperTransport™ Technology HT Assist:

With the release of the Six-Core AMD Opteron™ processor, formerly code-named "Istanbul", AMD has included an important new hardware feature called HT Assist that helps increase performance on 4-socket and 8-socket AMD Opteron™ 8400 Series processor-based systems.

As you scale the number of sockets and, thus, processors in a system, maintaining data coherency becomes a more complex and important issue.  On a single-socket system with a multi-core processor, the processor only has to maintain cache coherency among its own cores; there are no other sockets or processors to communicate with or stay coherent with.

In a multi-socket system, each processor has to communicate with every other processor to make sure it is working on the latest data, or cache line, to maintain coherency (and thus program correctness).  In systems based on HyperTransport™ technology, this communication is done over the HyperTransport links between the processor sockets.  With a broadcast coherence protocol, the latency of a memory access is always the longer of two paths: the time it takes to return data from DRAM and the time it takes to probe all the caches in the system.  Only when the processor has received the data and all probe responses can it actually process the required transaction.  With a 4-socket or 8-socket system (24 or 48 total processor cores with Six-Core AMD Opteron processor-based systems), the HyperTransport technology links between processors can become increasingly loaded with latency-sensitive cache probe requests checking for data coherency.

In a 4-socket system, one cache line coherency check can generate 10 or more messages over the 4 HyperTransport links connecting the 4 processors together.  These transactions include all the probe requests, probe responses, data requests, and data responses. With HT Assist, though, this same check may generate only 2-3 messages.  This significantly reduces the latency of the coherency check and the number of transactions over the HyperTransport links.

HT Assist, or the Probe Filter as it is sometimes called, works by using part of the processor's L3 cache as a directory cache.  This directory cache tracks all cache lines cached in the system.  Instead of generating numerous cache probes when checking a cache line, the processor does a Probe Filter Lookup.  This helps lower latency for accesses to local DRAM because there is no need to wait for probe responses when accessing local data.  It also means there is less queuing delay due to the lower HyperTransport technology traffic.  With significantly reduced probe traffic, effective system bandwidth also increases.  It should also be noted that the directory cache uses 1MB of the 6MB L3 cache in the case of the Six-Core AMD Opteron processor.  For that reason, HT Assist is only enabled on 4-socket and 8-socket systems, where the performance benefits largely outweigh the small decrease in available L3 data cache.  HT Assist is not enabled on 2-socket systems, where there is much less cache probe traffic and the full L3 cache is used for data.

We've measured the difference HT Assist makes on Six-Core AMD Opteron processors and the results are nothing short of stunning.  On the same 4-socket system, we measured 42GB/s of memory bandwidth with the STREAM benchmark with HT Assist enabled, compared with only 25.5GB/s with HT Assist disabled (about 65% more bandwidth).* For 4-socket and 8-socket Six-Core AMD Opteron processor-based systems, this can translate into a significant performance uplift for applications that depend on cache performance, memory bandwidth, and system scalability.

Applications that will naturally benefit from HT Assist include Database, Virtualization, and High Performance Computing (HPC).  And there is no need for software developers to change their code; just enjoy the extra performance from AMD!

-Justin Boggs

ISV Developer Relations

 

* 42GB/s using 4 x Six-Core AMD Opteron™ processors ("Istanbul") Model 8435 in Tyan Thunder n4250QE (S4985-E) motherboard, 32GB (16x2GB DDR2-800) memory, SuSE Linux® Enterprise Server 10 SP1 64-bit with HT Assist enabled vs. 25.5GB/s with HT Assist disabled.

 



-------------------------

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.



    Posted By: AMD DeveloperCentral @ 07/21/2009 05:01 PM     AMD “Istanbul” (Family 10h) Processor Software Visible Features     Comments (2)  

June 1, 2009
  "Istanbul" overview

Today, AMD is launching the "Istanbul" processor. Since our first dual-core processor, those of us in AMD's CPU ISV team have been evangelizing that more cores are coming. This processor contains 6 cores on one die. Just to be clear, these are 6 distinct physical cores, just as the Shanghai processors contained 4 distinct physical cores. Each core has 512KB of L2 cache and 128KB of L1 cache. The L3 is a 6MB cache shared by the six cores. The Istanbul processors are MP capable, supporting up to 8 processors (48 cores). There have been numerous refinements made to this processor. One notable change is the addition of a Probe Filter, which you may see referred to as HyperTransport™ technology HT Assist. Simply put, this filter can greatly reduce HT traffic between multiple sockets, which in turn can improve memory bandwidth, especially on 4-socket platforms. For those with an interest in silicon, the "Istanbul" processors are fabricated with the 45nm SOI process. And did I mention that these processors use AMD's existing Socket F (1207) infrastructure? That means that on many platforms all that is needed is a simple BIOS upgrade.  Some of the other features are HT3 capability and numerous power saving features.  More blogs to come on the cool new features of Istanbul, otherwise known as the new Six-Core AMD Opteron™ processors.







Edited: 06/01/2009 at 12:32 PM by devcentral


    Posted By: AMD DeveloperCentral @ 06/01/2009 12:20 PM     AMD “Istanbul” (Family 10h) Processor Software Visible Features     Comments (0)  

  "Shanghai" blog category is now "Istanbul" blog category

With the launch of the new Six-Core AMD Opteron™ processors (codenamed "Istanbul"), the powerful follow-up to the "Shanghai" processors, we're updating the title of this blog category to reflect the information you will now find here.  Don't worry, the previous content isn't going away. It's still very valid, since the "Istanbul" processors build on foundations that were laid by the "Barcelona" and "Shanghai" processors, and add advancements in many features.  Check back often for new write-ups on these features, and visit our "Istanbul" Zone for a round-up of everything you need to know about this enhanced generation.

We'd appreciate hearing what you think about the new "Istanbul" processors, so leave us a comment!




    Posted By: AMD DeveloperCentral @ 06/01/2009 12:15 PM     AMD “Istanbul” (Family 10h) Processor Software Visible Features     Comments (0)  

November 13, 2008
  "Barcelona" blog category is now "Shanghai" blog category

With the launch of the new Quad-Core AMD Opteron processors (codenamed "Shanghai"), the powerful follow-up to the "Barcelona" processors, we're updating the title of this blog category to reflect the information you will now find here.  Don't worry, the previous content isn't going away - it's still very valid, since the "Shanghai" processors build on foundations that were laid by the "Barcelona" processors, and add advancements in many features.  Check back often for new write-ups on these features, and visit our "Shanghai" Zone for a round-up of everything you need to know about this enhanced generation.

We'd appreciate hearing what you think about the new "Shanghai" processors, so leave us a comment!




    Posted By: AMD DeveloperCentral @ 11/13/2008 01:46 AM     AMD “Istanbul” (Family 10h) Processor Software Visible Features     Comments (0)  

  Larger L3 cache in Shanghai, Part I

Contents:

  1. Introduction
  2. The new larger L3 cache in the Quad-Core AMD Opteron™ processor codenamed "Shanghai"
  3. Latency reduction in the new L3 cache

Introduction:

The Quad-Core AMD Opteron™ processor codenamed "Barcelona" introduced an on-die L3 cache for the first time in the AMD Opteron line. Alongside being the first native quad-core x86 design and bringing a slew of other innovations, the L3 cache was an important feature of the "Barcelona" processor design.  The new, enhanced Quad-Core AMD Opteron processor codenamed "Shanghai" adds to this sequence of innovations by tripling the size of the L3 cache. This article discusses the newly increased L3 cache of the "Shanghai" processor and its impact on performance.

The new larger L3 cache in Shanghai

The "Barcelona" processor has 2MB of L3 cache. This L3 cache is shared between "Barcelona's" four cores, L1 and L2 caches being local to each core. There are two L1 caches of 64KB each (one for data and the other for instructions) and one L2 cache of 512KB per core.

The "Shanghai" processor on the other hand has a much bigger L3 cache of 6MB, enabled in large part due to the shrinkage of the process technology to 45nm used in the "Shanghai" processor. The "Barcelona" processor was built using 65nm process technology. The "Shanghai" processor maintains the same size for the L1 and the L2 caches.

A brief look at the AMD quad-core processor cache architecture (victim caches)

The first access to an address (which would lead to an L3 miss) brings the line straight into the L1 data cache. Thus the L3 is non-inclusive.  Only after the line is evicted from L1 and then from L2 does it come into L3.  Once a line is in L3, there are various scenarios where the L3 returns data and retains the line. The L3 behaves as an inclusive cache by keeping a copy if it is likely the data is being accessed by multiple cores. It behaves as an exclusive cache by removing the data from the L3 cache (and placing it solely in the L1 cache, creating space for other L2 victim/copy-backs) if it is likely the data is only being accessed by a single core. So duplication of data in this "mostly exclusive" design happens only when it is possible for the data to be shared, emphasizing the role of the L3 in enabling sharing of data between the cores. The same preference is seen when the L3 decides which line to evict: it prefers to evict unshared lines over shared lines.

Thus, the caches act as victim buffers for the caches higher up in hierarchy.

Furthermore, the cache features bandwidth-adaptive policies that optimize latency when requested bandwidth is low, but allows scaling to higher aggregate L3 bandwidth when required (such as in a multi-core environment).

Also the L3 is dynamically shared between the cores so that each core gets a fair share of the cache, and if one core needs more of the cache when other cores are idle it can make use of most of the cache.

Some additional points:

  • As far as cache coherency concepts are concerned, L3 is just another independent caching entity in the system.
  • The "Shanghai" L3 is a 48-way associative cache, whereas the "Barcelona" L3 was a 32-way associative cache.

The image below shows the complete cache hierarchy of the "Shanghai" processor. "Barcelona" also has a similar hierarchy except that it only has 2MB of L3 cache.

Latency reduction in the new L3 cache:

Within the processor the L3 cache is part of the north bridge subsystem and runs at the North Bridge (NB) frequency. Hence the L3 hit latency is also dependent on the NB frequency.

"Shanghai" has a best-case L3 latency of 29 CPU clocks, whereas "Barcelona" had a best-case latency of 34 CPU clocks, a reduction of about 15%. So the lower latency to data stored in the L3 cache should also help boost performance.

Conclusion

Looking at the above information it's clear that developers don't have to start recoding to take advantage of the new larger L3 cache in the upcoming "Shanghai" processor. This enhancement is expected to benefit many existing programs because the processor has access to a larger chunk of data sitting in the L3 now than before.

Also, because the L3 is a shared cache, if multiple cores are going to work on the same copy of the data, it makes sense to do that work in parallel, i.e. at the same time, so that the data can be used by all the cores while it is in the L3 and does not have to be loaded again.

In the second part of this article I will demonstrate the benefit of the larger L3 cache for a memory intensive program and use AMD CodeAnalyst to correlate the benefit to the Performance Monitoring Counter (PMC) events of the L3 cache which the "Shanghai" processor supports (also found on "Barcelona," since they share very similar cache architectures).

Other relevant articles on developer.amd.com:

  1. L3 cache in "Barcelona": http://developer.amd.com/documentation/articles/pages/8142007173.aspx
  2. Processor cache 101 : http://developer.amd.com/documentation/articles/pages/1128200684.aspx
  3. Processor cache 102 : http://developer.amd.com/documentation/articles/pages/1128200685.aspx
  4. Cache friendly programming techniques: http://developer.amd.com/documentation/articles/pages/1128200684.aspx
  5. Using AMD CodeAnalyst : http://developer.amd.com/assets/Linux_Summit_PJD_2007_v2.pdf



- Vikrant Kumar






Edited: 01/12/2009 at 02:02 PM by AMD Developer Blogs Moderator


    Posted By: AMD DeveloperCentral @ 11/13/2008 01:43 AM     AMD “Istanbul” (Family 10h) Processor Software Visible Features     Comments (3)  

  Improved reliability, availability, scalability

All the new features of the new Quad-Core AMD Opteron™ processors (codenamed "Shanghai") combine to provide a platform with superior reliability, availability, and scalability. More processing power means better handling of large data sets and intensive tasks. Now based on 45nm, the processors feature better frequency scaling, plus faster frequencies within power bands for increased performance-per-watt. HyperTransport™ 3.0 technology means that the processor retries on incomplete HT3 packets.

Improved memory bandwidth and scaling comes with the addition of DDR2-800 memory support, 2x core probe bandwidth (particularly in 4P systems), enhanced DRAM prefetching, and advancements in the L3 cache that is shared across all cores.

The L3 cache is now 6MB, which means enhanced application performance through improved memory latency and memory access. Faster access to a larger memory cache supports more efficient handling of larger amounts of data and more complex software stacks, and a cache that is shared by all cores can enhance application performance by enabling a core that needs more cache to get access to it. A significant new feature is that the L3 can now disable "bad" sections of cache to protect data against L3 cache errors and prevent performance issues. (Up to 2 of 16 sections can be disabled; requires OS/hypervisor support.)

When combined with strong ecosystem support across a wide range of OS, tools, and other technology partners, the advanced features of the new AMD "Shanghai" processors form a proven platform on which businesses can deploy their mission critical applications with confidence.




    Posted By: AMD DeveloperCentral @ 11/13/2008 12:54 AM     AMD “Istanbul” (Family 10h) Processor Software Visible Features     Comments (0)  

April 14, 2008
  "Barcelona" Processor Feature: SSE Misaligned Access
We all crave high performing code and in the process we try hard to optimize the algorithms, reorder instructions, unroll loops, avoid branches, reduce pointer usage to allow compilers to optimize, replace dynamic allocation with static allocation where the size is known and so on. One such optimization is with respect to data loads and stores from memory which consume a majority of processing cycles in data-intensive applications. Here, I'll take you through one such optimization with respect to data alignment while using SSE (Streaming SIMD Extension) instructions.

Why use SSE instructions?

SSE instructions operate on 16 bytes of data in parallel. We can load 16 bytes of data at a time and compute those 16 bytes of packed data using a single SSE instruction.
Example: ADDPS xmm1, xmm2 - Add 4 single precision floating point elements packed in xmm1 register with corresponding elements packed in xmm2 and store the result back in xmm1.
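
As a rough illustration of the ADDPS example above, the following C sketch adds two float arrays four elements at a time using compiler intrinsics. The function name add_arrays_sse is made up for this illustration, and the sketch assumes a compiler that provides <xmmintrin.h> (GCC, Clang, and MSVC do). Each _mm_add_ps call is expected to compile to a single ADDPS instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */

/* Hypothetical helper: dst[i] = a[i] + b[i] for n floats.
   n must be a multiple of 4 to keep the sketch free of a scalar tail loop. */
void add_arrays_sse(float *dst, const float *a, const float *b, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);            /* load 16 bytes (4 floats) */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb)); /* 4 additions in one instruction */
    }
}
```

The unaligned load and store intrinsics are used here so the sketch works for any pointer; the alignment trade-offs are the subject of the rest of this post.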

SSE instructions are widely used in developing computation-intensive multimedia applications. Typically, these applications process large amounts of sequential data through the following steps:

1. Load data from memory
2. Perform computation on the data
3. Store data back to memory

First we will discuss the intricacies involved in optimizing memory operations using SSE instructions on the AMD-K8™ family of processors (first and second generation AMD Opteron™ processors) and then we will discuss the architectural enhancements provided by the "Barcelona" or Family 10h processors (including Quad-Core AMD Opteron and AMD Phenom™ X4 Quad-Core and X3 Triple-Core processors).

SSE provides two types of load and store instructions. The first type is aligned loads and stores (e.g. MOVDQA, MOVAPD, MOVAPS) that operate only on 16-byte aligned memory addresses. The second type is unaligned loads and stores (e.g. MOVDQU, MOVUPD, MOVUPS) that operate on both aligned and unaligned memory addresses. On the AMD-K8 family of processors, the aligned versions of load and store operations are faster than the unaligned versions even if the memory is 16-byte aligned. For details on the latencies of the various types of load and store instructions, refer to the AMD Software Optimization Guide for AMD Family 10h Processors.

If we use the aligned version of memory operations without verifying the memory address alignment, then there are two possible outcomes. First, if the memory is aligned, then the memory operations are fast. Second, if the memory is unaligned, then the processor raises an exception and the application crashes (Bang!!!). One solution to this problem is to align the input data, both to gain performance and to eliminate exceptions and crashes. This solution may not always work, since the user of the application may not align the data, or because enforcing such a rule may be inappropriate at times. The easy solution here is to use the safer unaligned loads and stores, sacrificing performance irrespective of the data alignment.

If you are a programmer looking for the best possible performance, saving every single processing cycle, then the solution here is to handle both aligned and unaligned data by checking for alignment of the data at runtime and call the appropriate function that handles either aligned data or unaligned data.

The code to handle aligned and unaligned data is as follows:



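if( isAligned(data) )
{
    process_aligned(data);
}
else
{
    process_unaligned(data);
}

// The 16-byte alignment check is as follows (requires <stdint.h> and <stdbool.h> in C).
// A pointer must be converted to an integer type before taking the remainder.
bool isAligned(const void* data)
{
    return (((uintptr_t)data % 16) == 0);
}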




Typically, the process_aligned and process_unaligned routines have identical code except for the type of load and store instructions.
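
To make that concrete, here is a minimal sketch of such a pair of routines written with SSE intrinsics. The names scale_aligned, scale_unaligned, scale, and is_aligned16 are illustrative stand-ins for process_aligned, process_unaligned, and isAligned, and the work done (doubling every float) is just a placeholder. The two routines differ only in the load and store intrinsics they use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* _mm_load_ps, _mm_loadu_ps, _mm_store_ps, _mm_storeu_ps */

bool is_aligned16(const void *p)
{
    return ((uintptr_t)p % 16) == 0;
}

/* Doubles n floats in place. data MUST be 16-byte aligned:
   _mm_load_ps/_mm_store_ps map to aligned (MOVAPS-style) accesses. */
void scale_aligned(float *data, size_t n)
{
    const __m128 two = _mm_set1_ps(2.0f);
    for (size_t i = 0; i < n; i += 4)
        _mm_store_ps(data + i, _mm_mul_ps(_mm_load_ps(data + i), two));
}

/* Same work for any alignment: _mm_loadu_ps/_mm_storeu_ps map to
   unaligned (MOVUPS-style) accesses. */
void scale_unaligned(float *data, size_t n)
{
    const __m128 two = _mm_set1_ps(2.0f);
    for (size_t i = 0; i < n; i += 4)
        _mm_storeu_ps(data + i, _mm_mul_ps(_mm_loadu_ps(data + i), two));
}

/* Runtime dispatch on the alignment of the buffer. n must be a multiple of 4. */
void scale(float *data, size_t n)
{
    assert(n % 4 == 0);
    if (is_aligned16(data))
        scale_aligned(data, n);
    else
        scale_unaligned(data, n);
}
```

Callers only see scale(); the alignment check costs a few instructions per call, which is negligible when the buffers are large.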

Architectural enhancements in AMD Family 10h processors ("Barcelona" processors)

"Barcelona" comes with load instructions that are twice as fast as the previous generation processors. For example, the aligned loads take 2 processor cycles in "Barcelona," compared to 4 processor cycles in the AMD-K8 architecture. This is only the latency of the instruction execution; there could be additional latency depending on the locality of the actual data being present in cache or main memory.

The unaligned loads in "Barcelona" run at the speed of aligned loads if the data is aligned. Thus, it is safe to use unaligned loads whenever the alignment of the data is not guaranteed, which eliminates the check for 16-byte alignment at runtime. If the data is unaligned, then the instruction is slightly slower than an aligned load but still faster than the unaligned loads on AMD-K8 processors. The FPU in "Barcelona" has been widened to 128 bits from 64 bits and the load instructions are fast path instructions. (Note: In AMD-K8 processors, SSE loads are vector path instructions which block the execution units from executing any other instruction in parallel.)

The above optimizations are not applicable for SSE stores. The unaligned stores are slower than aligned stores even when the data is aligned.

Ravindra Babu
Software Engineer, AMD




Edited: 04/14/2008 at 12:55 PM by devcentral


    Posted By: AMD DeveloperCentral @ 04/14/2008 11:14 AM     AMD “Istanbul” (Family 10h) Processor Software Visible Features     Comments (0)  

October 2, 2007
  "Barcelona" Processor Feature: SSE4a, part 2

This is a follow-up to the first post on the SSE4a instruction set.

While shuffling data around in registers is extremely important (as we mentioned in the last entry on SSE4a), one of the primary bottlenecks in performance comes from the loading and storing of data. Even if a processor executes instructions really fast working with just registers, one memory access from DRAM can lead to close to a 50 nanosecond hit, which would mean hundreds of cycles on most processors.

The SSE extensions already have instructions that help in reducing this bottleneck. SSE4a complements these instructions with two of its own.

MOVNTSS
MOVNTSD

Before I move on to the applications of these instructions, let me provide some information to help set the context for what the MOVNT* instructions are really useful for, and why.

Almost all user data usually exists in what is called "Write Back" memory. This means multiple things, but ideally is supposed to be the most cached mode that memory can be. The following description outlines what happens on reads or writes to Write Back memory. (For the sake of simplicity, I am not going to delve into the different combinations of the data being in the L1 or L2 cache. Assume that a cache hit means that data is either in L1 or L2.)

  • Read
    • Cache hit: Data is read from the cache line to the target register
    • Cache miss: Data is moved from memory to the cache, and read into the target register
  • Write
    • Cache hit: Data is moved from the register to the cache line*
    • Cache miss: The cache line is fetched into the cache, and the data from the register is moved to the cache line*
    • *As per MOESI cache coherency protocol rules, in both cases, the cache line is marked as modified

Because of locality of reference, writes are typically done to a memory location that has been recently read from. With this architecture, writes to memory that has been recently read become extremely fast, because the cache line is already present.

Unfortunately, in cases where you know that the data is not being written to a location recently read from, this procedure is still followed. So, on every write, that cache line is fetched into the cache, causing what is called cache pollution.

Cache pollution is bad, bad, bad. Considering that we have only 64KB for our L1 data cache, if we're reading data from one location and writing it to another, we are losing half our cache lines to the written data, making the hardware prefetcher rather ineffective. Plus, other memory accesses between this read and write end up running out of cache lines, too. Remember, cache lines are 64 bytes, so even if only one byte of data in a megabyte needs to be cached, an entire 64 bytes are used up.

This is where the non-temporal store comes to our rescue.

These instructions, first of all, do NOT update the cache line, but instead directly write to memory. Along with that, they "write combine" memory, meaning they do not write data immediately to memory but instead wait for 64 bytes to accumulate at a time. Once that threshold is reached (or one of the many other triggers), this memory is written in one shot to DRAM.

Of course, this also means that the data may not necessarily be written in order to memory, and/or not quite when the write was executed. To flush out the write combine buffer, the SFENCE (store fence) instruction needs to be used.
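
A minimal sketch of a load, add, store loop that uses non-temporal stores might look like the following. The function name add_one_streaming is invented for this illustration. It uses the standard SSE intrinsics _mm_stream_ps (which maps to MOVNTPS and requires a 16-byte aligned destination) and _mm_sfence, and assumes the source and destination buffers do not overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* _mm_load_ps, _mm_add_ps, _mm_stream_ps, _mm_sfence */

/* Hypothetical helper: dst[i] = src[i] + 1.0f, written with non-temporal stores.
   Both pointers must be 16-byte aligned and n must be a multiple of 4. */
void add_one_streaming(float *dst, const float *src, size_t n)
{
    assert(((uintptr_t)dst % 16) == 0 && ((uintptr_t)src % 16) == 0);
    assert(n % 4 == 0);

    const __m128 one = _mm_set1_ps(1.0f);
    for (size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_add_ps(_mm_load_ps(src + i), one);
        _mm_stream_ps(dst + i, v);  /* NT store: does not pull dst into the cache */
    }
    _mm_sfence();  /* flush the write-combining buffers so the stores become visible */
}
```

The single SFENCE after the loop is enough here; fencing inside the loop would defeat the purpose of write combining.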

I've noticed gains of up to 2x or more on simple operation loops (something like a load, add, store) working on large pieces of data, when I switched the stores to non-temporal. This is a HUGE gain considering that most of this comes from the store time, which in operations like this, is a major bottleneck. I've found this ideal typically for large buffers (~1MB+).

If I wanted to write register after register to memory, this would work fine. However, in case you're working on part of the register (e.g. scalar SSE instructions) and you only want to write that part, things get complicated. Until now there has not been an instruction that would use the SS or SD parts of the register, hence any NT* memory write would span a full 16 bytes.

*I often refer to these stores as either NT stores/writes, or stores/writes with the NT hint. Keep in mind, though, that these stores are often referred to as "streaming stores." All compiler intrinsics that map to these instructions are named _mm_stream*.

Of course, with the AMD "Barcelona" processors, we now have these two new instructions:

MOVNTSS : This instruction will write the least significant 32 bits of a register to memory using the non-temporal hint. For example, a loop that performs scalar single-precision floating point math on a large array can use the SSE registers and MOVNTSS to store results to memory.

MOVNTSD : This instruction will write the lower 64 bits of a register to memory using the non-temporal hint. This instruction can be used for similar purposes as the MOVNTSS instruction, but typically for double-precision floating point data.

Before these two instructions were available, there really was no way to do either of these stores with the NT hint. With these two new instructions, SSE4a completes the NT instruction set to more completely match our set of normal stores.

Support for these two new instructions and all the SSE4a instructions is detected with the CPUID instruction. Specifically, ECX bit 6 will be set for CPUID function 8000_0001h.
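
As a sketch of that check, the following C code queries CPUID function 8000_0001h and tests ECX bit 6. It assumes GCC or Clang, whose <cpuid.h> header provides __get_cpuid; other compilers expose CPUID differently (MSVC, for example, has __cpuid in <intrin.h>). The function and macro names are made up for this illustration.

```c
#include <cpuid.h>  /* GCC/Clang only (assumption): provides __get_cpuid */

#define SSE4A_ECX_BIT (1u << 6)  /* CPUID Fn8000_0001h, ECX bit 6 */

/* Pure helper: does this ECX value report SSE4a? */
int ecx_has_sse4a(unsigned int ecx)
{
    return (ecx & SSE4A_ECX_BIT) != 0;
}

/* Returns 1 if the running processor reports SSE4a support, 0 otherwise. */
int cpu_has_sse4a(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* __get_cpuid returns 0 if the requested leaf is not supported */
    if (!__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx))
        return 0;
    return ecx_has_sse4a(ecx);
}
```

An application would typically call cpu_has_sse4a() once at startup and select a code path that uses MOVNTSS/MOVNTSD only when it returns 1.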

- Rahul Chaturvedi






Edited: 11/01/2007 at 07:01 PM by devcentral


    Posted By: AMD DeveloperCentral @ 10/02/2007 05:07 PM     AMD “Istanbul” (Family 10h) Processor Software Visible Features     Comments (0)  

September 26, 2007
  "Barcelona" Processor Feature: Advanced Bit Manipulation (ABM)

One of the new instruction sets introduced in the Third Generation AMD Opteron™ processor is Advanced Bit Manipulation (ABM), comprising two instructions that operate on general purpose registers: LZCNT and POPCNT. We'll first explore what POPCNT can do for you.


In almost every interview I have given to date, I have been asked the question, "How would you calculate the number of bits set in a given 32-bit word?" Of course, by the thirtieth time I was asked that question, I was finally able to figure out what answer to give, which wasn't very efficient. If you've tried to calculate this number yourself, or have tried to answer this question for others, I hope this discussion will be helpful because there are many ways to do it in software. One way is using lookup tables, which access memory, but multiple lookups are needed (unless you have a 4 GB table for all 32 bits!). Alternatively, you can use another common algorithm. Subtract one from the number, then perform the AND operation with the original number. Do this until the number is 0. The number of iterations it takes for the number to become zero is the number of bits set. A typical pop count function using this method would look like this:


int popcount(unsigned int x)
{
    int count;
    /* x & (x-1) clears the lowest set bit, so the loop runs once per set bit */
    for (count = 0; x; x = x & (x-1), count++);
    return count;
}


This function is generic and can be applied to multiple integer types. If your integer size is limited, there are a few more techniques that are floating around (easily Googled) but none of them are as efficient as one instruction.


Before I describe POPCNT ("pop count" or population count), the first of two advanced bit manipulation instructions that are provided in the new AMD Family 10h processors, you might have the exact same question that I had the first several times I was asked this in an interview:


Why on earth would anyone need this?


As it turns out, counting the number of bits set in a word (a machine word, that is), can be quite useful. I started realizing this when I moved to using bit arrays for computations.


Let me give you a quick scenario. I have implemented an array which stores the results of a network transmit operation. Each element represents a true or a false, depending on whether that particular block transmitted correctly or not. I need to use this data to calculate how much packet loss I have experienced.


Let's say that block numbers 7, 32, and 62 were not transmitted. The values at array index 7, 32, and 62 would be set to 1 and the rest to 0. If I am transmitting megabytes of information, this array could grow very large and it would be using a minimum of 8 bits of storage for each 1 or 0 it needs to store (if I am using the smallest data type provided to me) unless I use a bit array.


If I use a bit array, my array becomes much smaller, which means that I need to do fewer memory accesses to traverse the entire array, less memory is being used, etc. The only problem is with accessing an element in this array. To see if a bit is set in the bit array, I need to read one chunk of the array into a word and then shift bit by bit to see if anything has failed.


Enter, pop count! Pop count would simply tell me how many bits were set in the word I've just accessed, with just one instruction! Let's take a look at the gain I realize by using POPCNT.


For 1MB of data with a 1k block size, I have 1,000 elements. Therefore, the number of instructions taken by each approach would have been:


Original [byte array based]:
Execution: For each element, I need to read the byte value and check if it is 1. If it is, I need to increment my counter.
Cost: 3 instructions [read, compare, increment] x 1000 = 3,000
Results: 3k instructions. Not a very good idea.


Bit array [without pop count]:
Execution: For every 32 elements, I need to read one word, shift the bits out, check if the left-most bit is 1 or not (check the sign of the resultant number), and then increment my counter if the bit is 1.
Cost: (1 read + 32 shifts + 32 compares + 32 increments) x (1000/32) = 3032
Results: The instruction count is about the same, but there are far fewer reads here, so this approach would still be a lot faster because of the reduced memory accesses.


Bit array [with pop count]:
Execution: For every 32 elements, I need to read one word, do one pop count, and increment my counter by the return value from the pop count.
Cost: (1 read + 1 POPCNT + 1 add to the counter) x (1000/32) = 94
Results: Using the POPCNT instruction here gives me a whopping 32x reduction of instructions, representing a significant performance gain! This is with using 32-bit words. For 64 bits, there is even greater performance gain.


NOTE: There are other algorithms that could result in fewer instructions without using pop count, but we have chosen this x and x-1 approach because it is easily portable. Other algorithms that could perform this function faster often require a fixed number of bits, and hence are not suitable for all purposes. Even so, pop count is faster than the most optimal approach without pop count.
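To make the comparison above concrete, here is a minimal C sketch of the two bit-array counts. The helper names are made up for illustration. `__builtin_popcount` is a GCC/Clang builtin, and this sketch assumes that compiler family; it is typically compiled to a POPCNT instruction only when the target is told to support it (for example with `-mpopcnt`), otherwise the compiler substitutes a software routine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: mark block n as failed in the bit array. */
static void mark_failed(uint32_t *bits, size_t n)
{
    assert(bits != NULL);
    bits[n / 32] |= (uint32_t)1 << (n % 32);
}

/* Bit array without pop count: shift the bits out one at a time. */
static unsigned count_failed_shift(const uint32_t *bits, size_t nwords)
{
    unsigned failed = 0;
    for (size_t i = 0; i < nwords; i++) {
        for (uint32_t w = bits[i]; w != 0; w >>= 1)
            failed += w & 1u;
    }
    return failed;
}

/* Bit array with pop count: one read, one count, one add per word. */
static unsigned count_failed_popcnt(const uint32_t *bits, size_t nwords)
{
    unsigned failed = 0;
    for (size_t i = 0; i < nwords; i++)
        failed += (unsigned)__builtin_popcount(bits[i]);
    return failed;
}
```

With 1,000 blocks packed into 32 words, both functions return the same count; only the work per word differs.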


In addition to this specific scenario, there are several applications where pop count can substantially increase performance. Pop counts are used in cryptography (in fact, this instruction is also commonly called the 'canonical NSA instruction' because of the fact that the NSA refused to buy processors which didn't support this instruction), encoding/decoding, databases (for quickly assessing information about data), and many others. One application that I find POPCNT most useful for is to quickly calculate Hamming distances. A Hamming distance is essentially a measure of how different one word is from another. Remember, this is not how different the values held by the words are (we could just use a subtract instruction to find that out!) but how the words themselves differ. For machine words, it is defined as the number of bits that are different between the two words.


For example, take the following 8-bit words:


00110001
11010001
   ^^^^^


The lower 5 bits, denoted by the carets, are the same; hence only three bits are different. Therefore, the Hamming distance between these two words is 3.


A POPCNT instruction gives us the Hamming weight of a word, which is the Hamming distance between that word and the all-zeros word of the same length. Because the bits that differ from an all-zeros word are exactly the bits that are set, counting the set bits is exactly what POPCNT does!


Of course, this doesn't give us the Hamming distance directly, but that's easily fixed. All we have to do is zero out the common bits between the two words and the result holds only the bits that are different. Our friendly neighborhood XOR instruction can do that for us, leaving us with the following sequence of instructions for calculating the Hamming distance between two words:


; RAX and RBX contain the two words


mov rcx, rax
xor rcx, rbx
popcnt rcx, rcx


; RCX now contains the Hamming distance between RAX and RBX
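The same XOR-then-count sequence can be sketched in portable C. The function names here are illustrative, and the software `popcount64` stands in for the POPCNT instruction so the example runs on any machine.

```c
#include <assert.h>
#include <stdint.h>

/* Software population count using the x & (x - 1) technique. */
static unsigned popcount64(uint64_t x)
{
    unsigned count = 0;
    while (x != 0) {
        x &= x - 1;   /* clear the lowest set bit */
        count++;
    }
    return count;
}

/* Hamming distance: XOR keeps only the differing bits, then count them. */
static unsigned hamming_distance(uint64_t a, uint64_t b)
{
    unsigned d = popcount64(a ^ b);
    assert(d <= 64);
    return d;
}
```

For the 8-bit example above, `hamming_distance(0x31, 0xD1)` XORs the words to 0xE0 (11100000) and counts 3 set bits.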


Hamming distances can be used to calculate things like error in data, as in how much error exists. They can be used as thresholds in encoding or decoding of audio/video. In fact, any place where you need any sort of fuzzy logic, Hamming distances could be useful. There are many other potential applications, but too many to be covered here.


This covers the first of two new advanced bit manipulation instructions that are introduced with the new Family 10h architecture. This leaves us with another interesting instruction, LZCNT, which counts the number of leading zeros in a given word. But, I'll leave that for next time.


-Rahul Chaturvedi



-------------------------

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.



Edited: 11/16/2007 at 12:39 PM by devcentral

Posted By: AMD DeveloperCentral @ 09/26/2007 04:18 PM

September 20, 2007
  "Barcelona" Processor Feature: 128-bit FPU

The new AMD "Barcelona" processors introduce dramatically improved numerical performance when using the standard SSE, SSE2 and SSE3 instruction extensions. Previous AMD processors typically could execute a vectorized SSE instruction (for example, MULPS to perform four multiplies) every two clock cycles. In the AMD "Barcelona" processor, this performance is doubled so a new vectorized SSE instruction like MULPS can typically be issued every cycle. This feature is called SSE128 because an entire 128-bit SSE register is processed on each clock tick. A detailed discussion of SSE128 can be found in the article "SSE128: AMD's New Floating-Point Enhancements."

Furthermore, with separate pipelines for add-class and multiply-class instructions, the new processor has a theoretical peak throughput of 8 single-precision floating point calculations per clock cycle. Integer SSE instructions get a similar boost. For complete timing details on all the instructions, see the Software Optimization Guide for AMD Family 10h Processors appendix C.

The easiest way to realize the benefits of SSE128 in real applications is to leverage existing library code which has been optimized using vectorized SSE instructions. The AMD Performance Library (APL) is one such library, providing a collection of popular software routines designed to accelerate application development, debugging, and optimization on x86 class processors to provide a quick path to high performance development. Also, the new release of the AMD Core Math Library (ACML), version 4.0, features new kernels tuned for great performance on the new processors. Specifically, DGEMM, SGEMM and CFFT have all been optimized to take advantage of the improved floating point performance. Another new feature of ACML 4.0 is the upgrade of the LAPACK routines to the new LAPACK 3.1. Many of these LAPACK routines have been optimized with OpenMP to take advantage of AMD's new quad-core processors. ACML will continue to improve, with more optimized functions in future releases.

Intermediate or advanced programmers can write custom vectorized SSE code to improve performance. Using Microsoft's Compiler Intrinsic functions for SSE, developers can write one version of SSE code that compiles for both 32-bit and 64-bit native platforms, something which is not possible using pure assembly code. See the article "Performance Optimization of 64-bit Windows Applications for AMD Athlon™ 64 and AMD Opteron™ Processors using Microsoft Visual Studio 2005" for an easy-to-follow tutorial with demo code showing some examples using Microsoft Visual Studio 2005.
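As a rough illustration of the intrinsics approach (not taken from the article cited above), the sketch below multiplies two float arrays four elements at a time with `_mm_mul_ps`, which compilers typically emit as MULPS. The function name `mul_arrays` and the `USE_SSE` macro are assumptions made for this example; the preprocessor guard simply falls back to scalar code where SSE is not available.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__) || defined(_M_X64) || defined(_M_IX86)
#include <xmmintrin.h>   /* SSE intrinsics */
#define USE_SSE 1
#endif

/* Hypothetical helper: out[i] = a[i] * b[i] for i in [0, n). */
static void mul_arrays(float *out, const float *a, const float *b, size_t n)
{
    size_t i = 0;
    assert(out != NULL && a != NULL && b != NULL);
#ifdef USE_SSE
    /* Process four floats per iteration; unaligned loads keep it simple. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_mul_ps(va, vb));
    }
#endif
    /* Scalar loop handles the tail (or everything when SSE is unavailable). */
    for (; i < n; i++)
        out[i] = a[i] * b[i];
}
```

The same source builds as 32-bit or 64-bit code; aligned loads (`_mm_load_ps`) would be the usual next step when the data is known to be 16-byte aligned.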



Posted By: AMD DeveloperCentral @ 09/20/2007 08:35 PM

September 18, 2007
  "Barcelona" Processor Feature: Sideband Stack Optimizer

Sideband Stack Optimizer is one of many of the AMD "Barcelona" processor's evolutionary "CPU Core IPC improvement" features. The Sideband Stack Optimizer is special circuitry in the core that tracks the value of the stack pointer (RSP), allowing more than one PUSH or POP instruction to execute in parallel. Software typically takes advantage of it by writing function prologue and epilogue code with PUSH/POP instead of explicit references via RSP.

Motivations:

  • Chains of pushes and pops are serially dependent through RSP; the optimizer breaks these dependence chains for consecutive PUSH/POPs
  • Can remove dependency by tracking RSP changes in a sideband register a.k.a. Stack-Pointer Delta or "SPd"
  • RSP adjustments then don't require functional unit bandwidth (no uops)

Basic operation:

  • Converts PUSH ops into pure stores (i.e. Save a pass through the functional unit)
  • Converts POP ops into pure loads (i.e. Save an op)
  • Also optimizes performance of CALL and RET instructions

For the software developer, this means preferring small PUSH/POP instructions over larger explicit store/load instructions, which also improves code density.

Examples:

Replace this : MOV reg, [RSP+disp8] (4 bytes)
         or this : MOV reg, [RSP+disp32] (8 bytes)
     With this : POP reg (1 byte)

Replace this : MOV [RSP+disp8], reg (4 bytes)
         or this : MOV [RSP+disp32], reg (8 bytes)
     With this : PUSH reg (1 byte)

For more details and examples of this feature, please refer to Chapter 4, section 4.7 of the Software Optimization Guide for AMD Family 10h Processors.



Posted By: AMD DeveloperCentral @ 09/18/2007 11:26 AM

September 17, 2007
  "Barcelona" Processor Feature: SSE4a Instruction Set

Writing SIMD code poses several complications. Doing 2 to 16 operations with one instruction is a powerful feature, but unless you have enough support instructions to get your data back and forth between the registers and memory, you may not always be utilizing the full potential that SIMD offers.

The SSE2 and SSE3 instruction sets include many instructions to help with this. They include packs and unpacks, shuffles, partial register move instructions, and more. These are pretty big sets, so for SSE4a to actually provide an improvement seemed like a difficult proposition. So I looked at the instructions a bit closer, and learned about the new bit field insertion and extraction instructions.

Before I start, let me mention that both of the following instructions work only on the lower 64 bits of the registers they deal with, and the upper 64 bits are undefined. When using these instructions, keep in mind that to access the upper half of any register, you'd have to shift the bits down by 64 and then do the required processing.

EXTRQ: Extract Field from Register

This instruction extracts a particular set of bits from a register and moves them to that register's least significant position. For example, if you want only the third 16-bit value in the register xmm0 (bits [47:32]) extracted and left at bits [15:0], you would use the EXTRQ instruction like this:

EXTRQ xmm0, 16, 32

The first thought that came to my mind when I saw this instruction was, "Can't I do the same thing just using a shift instruction? Okay, that wouldn't clear out the rest of the bits in this 64 bit half, but I could do a mask, and then a shift...but then I'd have to use an extra register for the mask. Well, I could do two shifts, one left and one right, but then that would be two instructions."

Anyway, you get the idea. EXTRQ can be fairly useful, but not essential. Now INSERTQ, that comes close.
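To make the field semantics concrete, here is a scalar C model of the extract operation on the low 64 bits. It is only a sketch of the behavior described above, not the instruction itself: `extract_field` is a made-up name, and the model ignores the instruction's encoding details and the undefined upper 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Take `len` bits starting at bit `idx` of a 64-bit value and return them
   right-justified, with all higher bits cleared. */
static uint64_t extract_field(uint64_t value, unsigned len, unsigned idx)
{
    uint64_t mask;
    assert(len >= 1 && len <= 64 && idx < 64 && idx + len <= 64);
    mask = (len == 64) ? ~(uint64_t)0 : (((uint64_t)1 << len) - 1);
    return (value >> idx) & mask;   /* the shift-then-mask alternative */
}
```

Calling `extract_field(x, 16, 32)` mirrors `EXTRQ xmm0, 16, 32`: the third 16-bit value ends up in bits [15:0].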

INSERTQ: Inserts Field from a source Register to a destination Register

This instruction takes a set of bits from one register and places those bits at ANY offset you specify (within 64 bits of course) within the destination register. For example, if you want to take a 16-bit value from xmm0 and move it to the third 16-bit value of xmm1, you would write:

INSERTQ xmm1, xmm0, 16, 32

But if you didn't have this instruction and wanted to accomplish the same thing, what would you do?

The quickest way would be to have a mask at bits [47:32] in one register, and use that mask to zero out those bits in xmm1. Then you'd have to shift the data in xmm0 to the correct location, and then merge those bits into xmm1.

So essentially INSERTQ is doing the job of three instructions in one!

If you want to do this entire process for arbitrary bit positions (in case you want to insert or extract different bits, based on other computation), you would add one more instruction here, because now the mask will also have to be shifted in place before you do the PAND. Further, if the 'source' register has more data than just the value you want to insert, then that would involve one more PAND to zero out the rest of the unwanted bits. If you put both together, you'd need to add ONE more shift for moving the mask which will zero out the bits in the 'source' register.

This means that, in order to do what INSERTQ provides - inserting a value in a register at any location, based on value stored in a register - you could potentially need to use a grand total of six SSE instructions.
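The mask, shift, and merge steps described above can be modeled in scalar C on a 64-bit value. This is a sketch of the logic only: `insert_field` is a hypothetical name, and real SSE code would use packed AND/OR and shift instructions on XMM registers instead.

```c
#include <assert.h>
#include <stdint.h>

/* Insert the low `len` bits of src into dest at bit offset `idx`. */
static uint64_t insert_field(uint64_t dest, uint64_t src,
                             unsigned len, unsigned idx)
{
    uint64_t mask, cleared, field;
    assert(len >= 1 && len <= 64 && idx < 64 && idx + len <= 64);
    mask    = (len == 64) ? ~(uint64_t)0 : (((uint64_t)1 << len) - 1);
    cleared = dest & ~(mask << idx);   /* zero the target field in dest   */
    field   = (src & mask) << idx;     /* drop extra source bits, position */
    return cleared | field;            /* merge                            */
}
```

`insert_field(d, s, 16, 32)` corresponds to `INSERTQ xmm1, xmm0, 16, 32`; each line of the body is at least one instruction that INSERTQ folds away.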

If you think about it, you'll probably find a lot of places in your code where this INSERTQ instruction could save you significant time and complexity.

There are two more instructions in the SSE4a instruction set that add some more convenience: the scalar streaming (non-temporal hint) store instructions for floating point values, MOVNTSS and MOVNTSD. Look for future posts covering these topics.

-Rahul C.

Posted By: AMD DeveloperCentral @ 09/17/2007 12:10 PM

September 12, 2007
  "Barcelona" Processor Feature: MONITOR/MWAIT

MONITOR and MWAIT are two separate instructions that are used together to monitor a range of linear memory. MONITOR tells the processor which address range to watch for a store. MWAIT hints to the processor that it may enter an implementation-dependent power state while waiting for a cache line within that address range to be written, halting most activity in the core while it waits. These instructions allow an operating system or application to realize incremental power savings beyond what the built-in Fire-and-Forget AMD PowerNow!™ feature provides, and can also be used to eliminate polling (the situation where an address is regularly checked for changes to data). Since a store to the monitored range ends the MWAIT, the pair can also be used in creative ways with devices that handle I/O through memory addresses: for example, data sent to a device or data written to memory from a device would cause the processor to return to a higher power state.

The following sample code shows typical usage of a MONITOR/MWAIT pair:

EAX = Linear_Address_to_Monitor;
ECX = 0;  // Extensions
EDX = 0;  // Hints

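while (!matching_store_done) {
     MONITOR EAX, ECX, EDX          // arm the monitor on the address range
     IF (!matching_store_done) {    // re-check: the store may have landed before MONITOR
          MWAIT EAX, ECX            // wait until the monitored range is written
     }
}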

Additional information can be found in the AMD64 Architecture Programmer's Manual Volume 3.



Posted By: AMD DeveloperCentral @ 09/12/2007 05:59 PM

September 11, 2007
  "Barcelona" Processor Feature: Instruction-Based Sampling (IBS)

Instruction-Based Sampling (IBS) is a performance monitoring technique that provides precise information about AMD64 instruction fetch behavior and about the execution of operations that are issued from AMD64 instructions. This information can be used to analyze and improve the performance of programs executing on AMD "Barcelona" processors.

IBS provides four important advantages over conventional performance counter sampling:

  1. Hardware events are attributed precisely to the instructions that cause the events. Conventional performance counter sampling is not precise, making it difficult, if not impossible, to attribute events to specific instructions. This limits the ability to pinpoint performance issues at the instruction and source code levels.
  2. A wide range of events are monitored and collected with each IBS sample. Either multiple sampling runs or counter multiplexing must be used to collect the same range of information with conventional performance counter sampling.
  3. The virtual and physical addresses of load/store operands are collected. Profiling tools can use this information to associate specific data structures with x86 instructions performing load/store operations.
  4. Latency is measured for key performance parameters such as data cache miss latency.

The precision afforded by IBS also enables automated optimization techniques (e.g., profile-directed optimization) which require detailed, precise information about instruction-level program behavior.

More information will be forthcoming in an appendix to be added to the Software Optimization Guide for AMD Family 10h Processors.

IBS support is included in the latest beta release of AMD CodeAnalyst, version 2.7.4.



Posted By: AMD DeveloperCentral @ 09/11/2007 01:53 PM

September 10, 2007
  "Barcelona" Processor Feature: CPUID
To use or not to use CPUID, that is the question. CPUID is an instruction that tells you which features a processor supports. There are times to use it and times not to. For general optimizations, CPUID can be a crucial step that ensures compatibility in your code, but for multithreaded code, you're much better off using OS methods to detect the processor topology.

As newer processors enter the market, they come loaded with new software visible features. Take SSE4a, for example, a new set of SIMD instructions included in AMD "Barcelona" processors, but not included in any previous AMD processors. If you are doing optimizations and want to use those instructions, it's highly recommended to use CPUID to first determine if the processor supports SSE4a, and then take the appropriate code path depending on the result of the operation. Adding this crucial step ensures that your code doesn't break if it is run on an older processor without SSE4a instructions. Check out the CPUID Specification for details on how to use CPUID properly to detect various features.
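A minimal sketch of such a feature check is shown below. It assumes a GCC or Clang toolchain, whose `<cpuid.h>` header provides `__get_cpuid`; the function names `cpu_has_sse4a`, `cpu_has_popcnt`, and `pick_code_path` are made up for this example. The bit positions used (SSE4A in ECX bit 6 of function 8000_0001h, POPCNT in ECX bit 23 of function 0000_0001h) should be confirmed against the CPUID Specification before relying on them.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>   /* GCC/Clang helper header; an assumption of this sketch */
#define HAVE_CPUID_H 1
#endif

/* Returns 1 if CPUID reports SSE4a, else 0 (also 0 on non-x86 builds). */
static int cpu_has_sse4a(void)
{
#ifdef HAVE_CPUID_H
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx))
        return 0;                    /* extended function not available */
    return (int)((ecx >> 6) & 1u);
#else
    return 0;
#endif
}

/* Returns 1 if CPUID reports POPCNT, else 0. */
static int cpu_has_popcnt(void)
{
#ifdef HAVE_CPUID_H
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1u, &eax, &ebx, &ecx, &edx))
        return 0;
    return (int)((ecx >> 23) & 1u);
#else
    return 0;
#endif
}

/* Choose a code path once at startup; the labels are placeholders. */
static const char *pick_code_path(void)
{
    const char *path = cpu_has_sse4a() ? "sse4a" : "generic";
    assert(path != NULL);
    return path;
}
```

Checking once at startup and storing the result (or a function pointer) keeps the CPUID cost out of hot loops.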

When it comes to parallel programming, processor topology is key to writing optimized multithreaded code. Things like cache architecture (size, levels, shared or not shared, etc.), NUMA rules, and number of cores are some of the main things that could affect your code. The best method for determining processor topology is to use the methods provided by the operating system. For example, in Windows, the API call GetLogicalProcessorInformation allows an application to learn about the machine's configuration, including multi-core and NUMA. Relying on operating system APIs to enumerate topology information is not only easier from a coding standpoint, but in a virtualized environment not all processors or nodes may be available for use - something that CPUID may not properly reflect.

Edited: 09/10/2007 at 11:08 AM by devcentral

Posted By: AMD DeveloperCentral @ 09/10/2007 11:04 AM

  "Barcelona" Processor Feature: Shared L3 Cache
The new shared L3 cache featured in AMD "Barcelona" processors is considered by many platform level software developers to be one of the two most important features of the platform.  These top two features, quad-core and L3 cache, together mean that the performance characteristics of applications that you previously profiled (on the original x86-64 family of AMD processors) will be different due to the enhanced microarchitecture and additional processor resources available.  This exposes new opportunities for performance improvement if a new performance study is done.  For such a performance study, consider the AMD CodeAnalyst Performance Analyzer for Windows® and Linux® that has been specifically tuned for AMD "Barcelona" (Family 10h) processors, including Third-Generation AMD Opteron™ processors (i.e. reads and interprets AMD "Barcelona" processor hardware event counters).  Check out the article "Barcelona's Innovative Architecture Is Driven by a New Shared Cache" and Chapter 5 of the Software Optimization Guide for AMD Family 10h Processors for more details on what this new L3 cache means for software developers.

Posted By: AMD DeveloperCentral @ 09/10/2007 10:57 AM

September 7, 2007
  Welcome to the AMD "Barcelona" (Family 10h) Processor Software Visible Features Blog
The new AMD "Barcelona" processors, also known as Family 10h, and including Third-Generation AMD Opteron™ processors, contain an array of innovations in processor design and features, including a number of software visible features that can be leveraged to make your applications perform better and be ready to scale across multiple cores.  This AMD "Barcelona" (Family 10h) / Third-Generation AMD Opteron Processor Software Visible Features blog series summarizes a few key features and provides some insight beyond the information found in datasheets and manuals.

Edited: 09/07/2007 at 06:57 PM by devcentral

Posted By: AMD DeveloperCentral @ 09/07/2007 06:49 PM
