Contents:
- Introduction
- The new larger L3 cache in the Quad-Core AMD Opteron™ processor codenamed "Shanghai"
- Latency reduction in the new L3 cache
Introduction:
The Quad-Core AMD Opteron™ processor codenamed "Barcelona" introduced an on-die L3 cache for the first time. "Barcelona" was the first native quad-core x86 design and brought a slew of other innovations, but the L3 cache was one of the most important features of the processor design. The enhanced Quad-Core AMD Opteron processor codenamed "Shanghai" adds to this sequence of innovations by tripling the size of the L3 cache. This article discusses the larger L3 cache of the "Shanghai" processor and its impact on performance.
The new larger L3 cache in Shanghai
The "Barcelona" processor has 2MB of L3 cache. This L3 cache is shared among the four cores of "Barcelona," while the L1 and L2 caches are local to each core. Each core has two 64KB L1 caches (one for data and one for instructions) and one 512KB L2 cache.
The "Shanghai" processor, on the other hand, has a much larger L3 cache of 6MB, made possible in large part by the move from the 65nm process technology used for "Barcelona" to 45nm. The "Shanghai" processor keeps the same sizes for the L1 and L2 caches.
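To put the sizes above side by side, here is a minimal sketch that adds up the cache capacity of one quad-core die using only the figures quoted in this article. The function and constant names are made up for illustration.

```python
KB = 1024
MB = 1024 * KB

CORES = 4
L1_PER_CORE = 2 * 64 * KB   # one 64KB data cache + one 64KB instruction cache
L2_PER_CORE = 512 * KB      # one 512KB L2 per core

# L3 sizes as quoted in the article
L3_SIZE = {"Barcelona": 2 * MB, "Shanghai": 6 * MB}


def total_cache_bytes(l3_bytes):
    """Sum of all L1, L2 and L3 capacity on one quad-core die."""
    return CORES * (L1_PER_CORE + L2_PER_CORE) + l3_bytes


for name, l3 in L3_SIZE.items():
    total = total_cache_bytes(l3)
    print(f"{name}: L3 = {l3 // MB}MB, total on-die cache = {total / MB:.1f}MB")
```

With these numbers the per-core caches add up to 2.5MB on both processors, so the total goes from 4.5MB on "Barcelona" to 8.5MB on "Shanghai", and all of the growth comes from the L3.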
A brief look at the AMD quad-core processor cache architecture (victim caches)
The first access to an address (which would lead to an L3 miss) brings the line straight into the L1 data cache, bypassing the L3. Thus the L3 is non-inclusive. The line enters the L3 only after it has been evicted from L1 and then from L2. Once a line is in the L3, a hit can be handled in one of two ways. If it is likely that the data is being accessed by multiple cores, the L3 behaves as an inclusive cache: it returns the data and keeps a copy. If it is likely that the data is being accessed by a single core only, the L3 behaves as an exclusive cache: it removes the line from the L3 (placing it solely in the L1 cache), which creates space for other L2 victims and copy-backs. Data is therefore duplicated in this "mostly exclusive" design only when it may be shared, which emphasizes the role of the L3 in sharing data between the cores. The same preference shows up in replacement: when the L3 has to evict a line, it prefers to evict unshared lines over shared lines.
Thus, each cache acts as a victim buffer for the cache above it in the hierarchy.
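The fill and eviction flow described above can be sketched as a small toy model. This is only an illustration under loud assumptions, not the hardware's actual algorithm: it models a single core, every level is fully associative with LRU replacement, an L2 hit moves the line back to L1, and the caller passes a hypothetical `shared` flag in place of the processor's own guess about whether a line is shared. The class and method names are invented for this sketch.

```python
from collections import OrderedDict


class VictimHierarchy:
    """Toy model: one core's L1 and L2 plus an L3, each a fully associative LRU."""

    def __init__(self, l1_lines, l2_lines, l3_lines):
        self.l1 = OrderedDict()
        self.l2 = OrderedDict()
        self.l3 = OrderedDict()
        self.capacity = {"l1": l1_lines, "l2": l2_lines, "l3": l3_lines}

    def _insert(self, level, addr):
        """Put addr into a level; return the evicted LRU address, or None."""
        cache = getattr(self, level)
        cache[addr] = True
        cache.move_to_end(addr)
        if len(cache) > self.capacity[level]:
            victim, _ = cache.popitem(last=False)
            return victim
        return None

    def _fill_l1(self, addr):
        """Bring addr into L1 and let victims cascade L1 -> L2 -> L3."""
        victim = self._insert("l1", addr)
        if victim is not None:
            victim = self._insert("l2", victim)
        if victim is not None:
            self._insert("l3", victim)  # lines evicted from L3 are simply dropped

    def access(self, addr, shared=False):
        """Return the level that served the access: 'L1', 'L2', 'L3' or 'memory'."""
        if addr in self.l1:
            self.l1.move_to_end(addr)
            return "L1"
        if addr in self.l2:
            del self.l2[addr]          # assumption: an L2 hit moves the line to L1
            self._fill_l1(addr)
            return "L2"
        if addr in self.l3:
            if not shared:
                del self.l3[addr]      # exclusive behavior: L3 gives the line up
            self._fill_l1(addr)        # shared lines stay in L3 as well
            return "L3"
        self._fill_l1(addr)            # a miss fills L1 directly, bypassing L3
        return "memory"


h = VictimHierarchy(l1_lines=1, l2_lines=1, l3_lines=2)
first_pass = [h.access(a) for a in "ABC"]   # A is pushed down to L3 by B and C
print(first_pass, h.access("A"), "A" in h.l3)
```

In this tiny configuration, line A only reaches the L3 after B and C push it out of L1 and then L2. The second access to A hits in the L3 and, because it is treated as unshared, removes A from the L3 again.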
Furthermore, the cache features bandwidth-adaptive policies that optimize latency when requested bandwidth is low, but allow scaling to higher aggregate L3 bandwidth when required (such as in a multi-core environment).
The L3 is also dynamically shared between the cores: each core gets a fair share of the cache, but a core that needs more can make use of most of the cache while the other cores are idle.
Some additional points:
- As far as cache coherency concepts are concerned, L3 is just another independent caching entity in the system.
- The "Shanghai" L3 is 48-way associative, whereas the "Barcelona" L3 was 32-way associative.
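As a rough illustration of what the associativity figures mean, the sketch below computes how many sets each L3 would have if it were organized as a single conventional set-associative array. The 64-byte line size is an assumption (the article does not state it), and the real L3 may be organized differently internally.

```python
MB = 1024 * 1024
LINE_BYTES = 64  # assumption: 64-byte cache lines; not stated in the article


def num_sets(cache_bytes, ways, line_bytes=LINE_BYTES):
    """Number of sets = capacity / (associativity * line size)."""
    return cache_bytes // (ways * line_bytes)


print("Barcelona L3 sets:", num_sets(2 * MB, 32))
print("Shanghai L3 sets:", num_sets(6 * MB, 48))
```

Under these assumptions the tripled capacity would come from 1.5 times as many ways (32 to 48) and twice as many sets.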
The image below shows the complete cache hierarchy of the "Shanghai" processor. "Barcelona" also has a similar hierarchy except that it only has 2MB of L3 cache.

Latency reduction in the new L3 cache:
Within the processor, the L3 cache is part of the North Bridge (NB) subsystem and runs at the NB frequency. Hence the L3 hit latency also depends on the NB frequency.
"Shanghai" has a best case latency of 29 CPU clocks, whereas "Barcelona" had a best case latency of 34 CPU clocks. So the lower latency to data stored in L3 cache should also help to significantly boost performance.
Conclusion
From the information above, it is clear that developers don't have to start recoding to take advantage of the larger L3 cache in the upcoming "Shanghai" processor. This enhancement is expected to benefit many existing programs, because the processor can now keep a larger amount of data in the L3 than before.
Also, because the L3 is a shared cache, if multiple cores are going to work on the same copy of the data, it makes sense to do that work in parallel (that is, at the same time), so that all the cores can use the data while it is in the cache and it does not have to be loaded again.
In the second part of this article, I will demonstrate the benefit of the larger L3 cache for a memory-intensive program. I will also use AMD CodeAnalyst to correlate that benefit with the L3 cache Performance Monitoring Counter (PMC) events that the "Shanghai" processor supports (these are also found on "Barcelona," since the two share very similar cache architectures).
Other relevant articles on developer.amd.com:
- L3 cache in "Barcelona": http://developer.amd.com/documentation/articles/pages/8142007173.aspx
- Processor cache 101: http://developer.amd.com/documentation/articles/pages/1128200684.aspx
- Processor cache 102: http://developer.amd.com/documentation/articles/pages/1128200685.aspx
- Cache friendly programming techniques: http://developer.amd.com/documentation/articles/pages/1128200684.aspx
- Using AMD CodeAnalyst: http://developer.amd.com/assets/Linux_Summit_PJD_2007_v2.pdf
- Vikrant Kumar
-------------------------
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.
Edited: 01/12/2009 at 02:02 PM by AMD Developer Blogs Moderator