November 13, 2008
  Larger L3 cache in Shanghai, Part I

Contents:

  1. Introduction
  2. The new larger L3 cache in the Quad-Core AMD Opteron™ processor codenamed "Shanghai"
  3. Latency reduction in the new L3 cache

Introduction:

The Quad-Core AMD Opteron™ processor codenamed "Barcelona" introduced an on-die L3 cache for the first time. Besides being the first native quad-core x86 design and bringing a slew of other innovations, "Barcelona" made the L3 cache an important feature of the processor design. The new, enhanced Quad-Core AMD Opteron processor codenamed "Shanghai" adds to that sequence of innovations by tripling the size of the L3 cache. This article discusses the larger L3 cache of the "Shanghai" processor and its impact on performance.

The new larger L3 cache in Shanghai

The "Barcelona" processor has 2MB of L3 cache. This L3 cache is shared between the four cores of "Barcelona", while the L1 and L2 caches are local to each core. Each core has two L1 caches of 64KB each (one for data and one for instructions) and one L2 cache of 512KB.

The "Shanghai" processor, on the other hand, has a much bigger L3 cache of 6MB, made possible in large part by the move from the 65nm process technology used for "Barcelona" to 45nm. The "Shanghai" processor keeps the same L1 and L2 cache sizes.
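To put the sizes above side by side, here is a minimal sketch that adds up the on-die cache of both processors from the figures quoted in this article. The function name and its parameters are made up for illustration. Since the hierarchy is mostly exclusive, the sum is only a rough upper bound on the distinct data that can sit on the die at once.

```python
KB = 1024
MB = 1024 * KB


def total_cache_bytes(l3_bytes, cores=4, l1d=64 * KB, l1i=64 * KB, l2=512 * KB):
    """Per-core L1 data + L1 instruction + L2, times the core count, plus the shared L3."""
    return cores * (l1d + l1i + l2) + l3_bytes


barcelona = total_cache_bytes(2 * MB)  # 2MB shared L3
shanghai = total_cache_bytes(6 * MB)   # 6MB shared L3

print("Barcelona:", barcelona // KB, "KB")  # 4608 KB
print("Shanghai: ", shanghai // KB, "KB")   # 8704 KB
```

The per-core caches contribute 2560KB in both cases, so the whole difference comes from the L3.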

A brief look at the AMD quad-core processor cache architecture (victim caches)

The first access to an address (which would miss in L3) brings the line straight into the L1 data cache, so the L3 is non-inclusive. Only after the line is evicted from L1 and then from L2 does it enter L3. Once a line is in L3, there are various scenarios in which the L3 returns the data and retains the line. If the data is likely being accessed by multiple cores, the L3 behaves as an inclusive cache and keeps a copy. If the data is likely being accessed by only a single core, the L3 behaves as an exclusive cache: it removes the line from L3 and places it solely in the L1 cache, creating space for other L2 victims/copy-backs. Duplication of data in this "mostly exclusive" design therefore happens only when the data may be shared, which emphasizes the role of the L3 in enabling data sharing between the cores. The same preference shows up in eviction: when choosing a line to evict, the L3 prefers unshared lines over shared lines.

Thus, each cache acts as a victim buffer for the caches above it in the hierarchy.
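The victim-cache flow described above can be sketched as a toy model. This is not how the hardware is implemented: the class name, the tiny cache sizes, the LRU replacement, and the sharing heuristic (a line counts as "shared" once a second core has touched it in L3) are all assumptions chosen for illustration, and coherency is ignored entirely.

```python
from collections import OrderedDict


class ToyHierarchy:
    """Toy model: per-core L1 and L2, one shared L3, all LRU, sizes in lines."""

    def __init__(self, cores=2, l1_lines=1, l2_lines=1, l3_lines=4):
        self.l1 = [OrderedDict() for _ in range(cores)]
        self.l2 = [OrderedDict() for _ in range(cores)]
        self.l3 = OrderedDict()  # addr -> set of cores seen using the line
        self.l1_lines, self.l2_lines, self.l3_lines = l1_lines, l2_lines, l3_lines

    def _fill_l1(self, core, addr):
        l1 = self.l1[core]
        l1[addr] = True
        if len(l1) > self.l1_lines:
            victim, _ = l1.popitem(last=False)  # oldest line leaves L1 ...
            self._fill_l2(core, victim)         # ... and lands in L2

    def _fill_l2(self, core, addr):
        l2 = self.l2[core]
        l2[addr] = True
        if len(l2) > self.l2_lines:
            victim, _ = l2.popitem(last=False)  # L2 victim spills into L3
            self._fill_l3(core, victim)

    def _fill_l3(self, core, addr):
        users = self.l3.pop(addr, set())
        users.add(core)
        self.l3[addr] = users
        if len(self.l3) > self.l3_lines:
            # Prefer evicting an unshared line; fall back to the oldest line.
            victim = next((a for a, u in self.l3.items() if len(u) < 2),
                          next(iter(self.l3)))
            del self.l3[victim]

    def access(self, core, addr):
        """Return the level that served the access: 'L1', 'L2', 'L3' or 'MEM'."""
        if addr in self.l1[core]:
            self.l1[core].move_to_end(addr)
            return "L1"
        if addr in self.l2[core]:
            del self.l2[core][addr]
            self._fill_l1(core, addr)
            return "L2"
        if addr in self.l3:
            users = self.l3[addr]
            users.add(core)
            if len(users) < 2:
                del self.l3[addr]  # single user: behave like an exclusive cache
            self._fill_l1(core, addr)
            return "L3"
        self._fill_l1(core, addr)  # L3 miss: the fill goes straight into L1
        return "MEM"


h = ToyHierarchy()
for a in "ABC":
    h.access(0, a)           # B and C push A from L1 to L2 to L3
print(h.access(1, "A"))      # L3: a second core hits the line
print("A" in h.l3)           # True: the copy is kept because it looks shared
```

In this model, a line that only one core ever touches is removed from L3 when that core hits it again, which frees room for other L2 victims.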

Furthermore, the cache features bandwidth-adaptive policies that optimize latency when the requested bandwidth is low, but allow scaling to higher aggregate L3 bandwidth when required (such as in a multi-core environment).

The L3 is also dynamically shared between the cores, so that each core gets a fair share of the cache. If one core needs more of the cache while the other cores are idle, it can make use of most of the cache.

Some additional points:

  • As far as cache coherency concepts are concerned, L3 is just another independent caching entity in the system.
  • The "Shanghai" L3 is 48-way associative, whereas the "Barcelona" L3 was 32-way associative.
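To relate associativity to cache size, here is a small sketch of the usual set-associative arithmetic: sets = size / (ways × line size). The 64-byte cache line is an assumption, since this article does not state the line size, and the helper name is made up for illustration.

```python
MB = 1024 * 1024


def num_sets(cache_bytes, ways, line_bytes=64):
    """Number of sets in a set-associative cache (line_bytes=64 is an assumption)."""
    sets, remainder = divmod(cache_bytes, ways * line_bytes)
    if remainder:
        raise ValueError("cache size is not a whole number of sets")
    return sets


print("Barcelona L3 sets:", num_sets(2 * MB, ways=32))  # 1024
print("Shanghai L3 sets: ", num_sets(6 * MB, ways=48))  # 2048
```

Under that assumption, the tripled capacity comes from 1.5 times as many ways and twice as many sets.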

The image below shows the complete cache hierarchy of the "Shanghai" processor. "Barcelona" also has a similar hierarchy except that it only has 2MB of L3 cache.

Latency reduction in the new L3 cache:

Within the processor, the L3 cache is part of the North Bridge (NB) subsystem and runs at the NB frequency. Hence the L3 hit latency also depends on the NB frequency.

"Shanghai" has a best-case L3 latency of 29 CPU clocks, whereas "Barcelona" had a best-case latency of 34 CPU clocks. That is 5 clocks, or roughly 15%, fewer in the best case, so the lower latency to data stored in the L3 cache should also help performance.
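As a rough illustration of what those clock counts mean, the sketch below computes the relative reduction and converts clocks to nanoseconds. The 2.7 GHz core frequency is a hypothetical value chosen only for the example, and the result applies to the best case quoted above, since the real latency also depends on the NB frequency.

```python
BARCELONA_L3_CLOCKS = 34  # best case, from the article
SHANGHAI_L3_CLOCKS = 29   # best case, from the article


def clocks_to_ns(clocks, core_ghz):
    """Convert CPU clocks to nanoseconds at a given core frequency in GHz."""
    return clocks / core_ghz


reduction = (BARCELONA_L3_CLOCKS - SHANGHAI_L3_CLOCKS) / BARCELONA_L3_CLOCKS
print(f"Best-case latency reduction: {reduction:.1%}")  # 14.7%

core_ghz = 2.7  # hypothetical core clock, for illustration only
for name, clocks in (("Barcelona", BARCELONA_L3_CLOCKS), ("Shanghai", SHANGHAI_L3_CLOCKS)):
    print(f"{name}: {clocks_to_ns(clocks, core_ghz):.1f} ns at {core_ghz} GHz")
```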

Conclusion

Looking at the above information, it is clear that developers do not have to start recoding to take advantage of the larger L3 cache in the upcoming "Shanghai" processor. This enhancement is expected to benefit many existing programs because the processor now has access to a larger chunk of data sitting in the L3 than before.

Also, since the L3 is a shared cache, if multiple cores are going to work on the same copy of the data, it makes sense to do that work in parallel, i.e. at the same time, so that the data can be used by all the cores while it is in the cache and does not have to be loaded again.
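The point about working on shared data at the same time can be illustrated with a very simplified simulation. It models the shared L3 as a single LRU cache and ignores the private L1 and L2 caches. The capacity, the trace layout, and all names are assumptions made for the sketch and do not describe the real replacement policy.

```python
from collections import OrderedDict


def count_misses(trace, capacity):
    """Count misses of an LRU cache holding `capacity` lines over an access trace."""
    cache = OrderedDict()
    misses = 0
    for line in trace:
        if line in cache:
            cache.move_to_end(line)        # hit: mark as most recently used
        else:
            misses += 1                    # miss: load the line from memory
            cache[line] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used line
    return misses


shared = [("shared", i) for i in range(8)]  # data both cores need
other = [("other", i) for i in range(16)]   # unrelated work that fills the cache

# Both cores read the shared data close together in time.
together_misses = count_misses(shared + shared + other, capacity=16)
# The second core reads the shared data only after unrelated work has run.
apart_misses = count_misses(shared + other + shared, capacity=16)

print(together_misses)  # 24
print(apart_misses)     # 32
```

In the second schedule the unrelated work pushes the shared lines out of the cache, so the second core has to load them from memory again.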

In the second part of this article I will demonstrate the benefit of the larger L3 cache for a memory intensive program and use AMD CodeAnalyst to correlate the benefit to the Performance Monitoring Counter (PMC) events of the L3 cache which the "Shanghai" processor supports (also found on "Barcelona," since they share very similar cache architectures).

Other relevant articles on developer.amd.com:

  1. L3 cache in "Barcelona": http://developer.amd.com/documentation/articles/pages/8142007173.aspx
  2. Processor cache 101: http://developer.amd.com/documentation/articles/pages/1128200684.aspx
  3. Processor cache 102: http://developer.amd.com/documentation/articles/pages/1128200685.aspx
  4. Cache friendly programming techniques: http://developer.amd.com/documentation/articles/pages/1128200684.aspx
  5. Using AMD CodeAnalyst: http://developer.amd.com/assets/Linux_Summit_PJD_2007_v2.pdf



- Vikrant Kumar



-------------------------

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.



Edited: 01/12/2009 at 02:02 PM by AMD Developer Blogs Moderator


    Posted By: AMD DeveloperCentral @ 11/13/2008 01:43 AM

November 17, 2008

Comments


 

Really interesting read. We were about to order an HP785G5 with Barcelona for a multimedia-intensive real-time app. After reading this, I think we will put pressure on HP to get us the one with Shanghai. As you rightly say, if our process is switched to a different core by the CPU, then we still don't lose data in cache.

Keep it coming. :-)

rosh


 Posted By: Rosh Cherian @ 11/17/2008 04:22 AM

December 19, 2008
 

Couple of comments:

WRT L1 on K-10: why is it just 2-way? Even C2D has 8-way L1.

Also, it does not appear to be true that the 16-way associativity of L2 on K-10 is somehow more than standard for its class. If you take a look at its predecessors from the competition, C2D and C2Q, they have had 16-way L2 for some time.

And one thing about L3: for a chip that has touted being "born as a true QC" from the start, it is a puzzle to me why it is so painful to make four cores share data through L3. In an application note from AMD, IIRC, it is depicted as some kind of round-robin transfer between cores, where each core has to take care to fill the "buffer" just right, so that it spills from L2 into L3, where the consuming core can get to it. It's something between a Chinese circus performance and madness.

Was it so hard to let the user lock some "way" of L3 (which is 32-way, so it wouldn't be a big loss) as static RAM temporarily and let cores write to it directly? Better yet, one should also be able to lock one way for caching the locked L3 way in a similar manner, so that the user could work around most of L3's latency.

Also, the prefetchX instructions seem IMHO crippled. Since all versions of the instruction can prefetch only to L1, and since L1 is only 2-way, the data has to arrive at exactly the _right_ time to enhance performance.

If it comes even a bit too early, when the CPU needs L1 for other things, the prefetched line will be evicted and then refilled, which takes time. If it comes too late, the core will stall.

If I could prefetch into the 16-way L2, I'd pay a few extra cycles for the first read, but successive reads wouldn't be any slower and timing would be much less critical.

 

 

 


 Posted By: Branko Badrljica @ 12/19/2008 04:09 AM

December 23, 2008
 

That's all fine, but you forgot to address a couple of issues:

- K-10 was supposed to be all about close core cooperation as a "true QC".

And yet the only demonstrated way to do such a thing in your AN is more akin to a state-of-the-art Chinese circus act than multicore programming.

Why is there no provided way to use a bit of L3 as a fast, wide SRAM for intercore communication?

Or better yet, why is there not some kind of switch under L2 so that one could use some way (out of the 16 associative ways of L2, IIRC) for such a thing?

Also, I don't see 16-way L2 associativity as anything over the class-standard.

Maybe in days of first Opterons, but not now.

And finally, why is L1 on K-10 still _just_ 2-way associative?

This totally kills many cool optimised routines that could have been written and are certainly possible on C2D/C2Q, not to mention i7/i5.

Not to mention the PrefetchX instructions, which can prefetch only to L1.

So, if you are doing some optimised loop using both ways of L1 and you have prefetched the next batch of data even _a_bit_ too early, you've basically shot yourself in the foot.

And since a main feature of K-10 is CnQ, this means that the CPU clock would jump up and down and you have no way to _not_ screw it up.

I benched the RAID-6 SSE-2 generation routine in the Linux kernel, which is heavily optimized, but on K-10 it works something like 20% faster _without_ prefetch instructions, obviously because of that influence.

In short:

- get L1 some decent associativity

- get the L1-L2 bus equal throughput in both directions

- get the user some way to control each associative way of L1, L2 and L3, so he can use it as fast scratchpad RAM, and maybe even control how and when it is cached by lower-level caches and whether it maps to some area of RAM, so that after unlock it is either written back to that RAM area or just discarded...

- get the user the ability to prefetch to L1/L2/L3

And then compare your solution to the competition and think about what you can brag about.

As it is now, you're just coming across as silly...


 Posted By: Branko Badrljica @ 12/23/2008 06:51 PM
