In the dark ages of the computer world there were no caches. An application was loaded into main memory, and every time the processor needed to retrieve data it did so from main memory; there was no other place in the system to keep it close at hand. That worked OK, since at that point the 80386 had yet to see the light of day and processor speeds over 10 MHz were considered fast. As technology progressed, it became apparent that processor speeds were increasing much faster than RAM speeds (as they still are today). Soon processor speed would completely outstrip memory speed, and requiring the processor to look in main memory every time it requested data would become a major bottleneck. When this was found to be true the architects said "The processors must have cache!", and so it was done.
To understand how caches work in modern systems, it helps to think of the processor as a workshop and the memory subsystem as a warehouse. The workshop is very small and has very little room for inventory, so most items have to be stored in the warehouse. When an order comes into the workshop, the raw materials are gathered from the warehouse and assembled into the finished product. The increasing gap between processor speeds and RAM speeds could be likened to the warehouse being moved continually farther away from the workshop: the farther away the warehouse is, the more time it's going to take to move stuff into the workshop. One solution is to build a second warehouse closer to the workshop. More buildings mean more cost, though, and this workshop is in a particularly expensive neighborhood. The boss of this workshop thinks cost is just another four-letter word, so the second warehouse would have to be much smaller. However, if it was well taken care of and contained only the most commonly used materials, the long trips to the bigger warehouse could be cut down dramatically. This is the function of the processor caches: they cut down the number of trips to main memory by providing a stockpile of the most commonly needed information. I should also say that this is a common analogy for explaining processor caching; the first place I saw it was in Hannibal's article on the same material.
This smaller warehouse is the L1 (Level 1) cache. Modern processors also make use of an L2 cache, and some an L3 as well. The L1 cache is usually split between instructions and data, and is typically the smallest and fastest. The L1 is often split because data and instructions tend to have different usage patterns. Hitting the play button in your favorite MP3 player always invokes the same code, but it doesn't always use the same data. Splitting the L1 cache helps avoid problems that might come about through trying to fit both of those patterns into the same cache. The L2 is larger and slower (and the L3, if present, even more so) and usually can be used for either instructions or data. In past systems, one reason for the L2 being slower was that it was external to the processor and ran at some fraction (1/2 and 1/3 were common) of the processor speed. Today, process technologies have improved and transistor sizes have shrunk to the point where off-die L2 cache is a thing of the past.
Prior to this guide being written, the question was asked: if the L1 is the fastest, why is it the smallest and not the biggest? The reason, again, is cost. Cache is made from SRAM (static random access memory), while your average memory modules are made from DRAM (dynamic RAM). If you look at an average memory module (assuming it doesn't have a heatspreader on it), you'll see that it contains multiple chips. Each chip contains multiple cells arranged in a grid, in which the individual pieces of data are stored. SRAM is more expensive because each cell in an SRAM chip is made of four to six transistors; in a DRAM chip, one transistor equals one cell. Caches also now represent a large portion of the overall die size of the processor (as can be seen in the die photo of the K8 at Sandpile). The contrast between logic and cache isn't that dramatic in all processors, but bigger caches would still mean bigger processors. Making the processors bigger means you get fewer of them out of a single silicon wafer, which increases prices. If that increase in die size is unacceptable, a bigger L1 could instead mean less room for the core logic of the processor. To return to the workshop/warehouse analogy, that would be like spending so much on warehouse space that you can't afford to hire enough workers for the workshop.
Another common question is how much of an impact more cache will have when someone is upgrading their machine. The answer (as it so often is) is "it depends on what programs you're running". Most important here are the concepts of spatial and temporal locality. Spatial locality is just a nifty way of saying that when a processor grabs something from memory, it will probably grab something close to it soon afterwards. Temporal locality is the rule that pieces of data are often used more than once by the processor before it's completely done with them. Media apps often have very good spatial locality since they play large data streams (MP3 files, DVDs, etc.) from start to finish without jumping around much in between. However, they also tend to have very poor temporal locality, since you usually won't be going back and repeating any part of the data stream. Behavior like this, where the data just streams through the cache, doesn't really benefit from using a cache at all. If the processor is going to use a piece of data only once and throw it away afterwards, then it's not really a big deal that the data won't be somewhere quickly accessible later. As you might imagine, it also doesn't take long for this kind of behavior to fill up the cache. When that happens, any time something new enters the cache, something else has to get kicked out. It may very well be that the data that gets kicked out will be reused while the data that replaced it will not.
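To make the locality ideas concrete, here's a toy sketch of a direct-mapped cache that counts hits for two access patterns. The sizes (8 blocks of 4 words) are made up for illustration and don't correspond to any real processor; the point is just that a sequential stream benefits from spatial locality, while a small, repeatedly reused working set benefits from temporal locality:

```python
# Toy direct-mapped cache: 8 blocks of 4 words each (hypothetical sizes).
BLOCK_WORDS = 4
NUM_BLOCKS = 8

def hit_rate(addresses):
    """Simulate the cache over a list of word addresses; return the hit rate."""
    cache = [None] * NUM_BLOCKS          # each slot holds one block tag, or None
    hits = 0
    for addr in addresses:
        block = addr // BLOCK_WORDS      # which memory block this word lives in
        slot = block % NUM_BLOCKS        # direct-mapped: each block has one fixed slot
        if cache[slot] == block:
            hits += 1
        else:
            cache[slot] = block          # miss: fetch the whole block from memory
    return hits / len(addresses)

# Streaming access (good spatial locality): each block fetched on a miss
# serves the next three accesses too, so 3 of every 4 accesses hit.
print(hit_rate(list(range(256))))        # 0.75

# Reusing a small working set (good temporal locality): after the first
# pass everything stays resident, so later passes hit every time.
print(hit_rate(list(range(16)) * 8))     # 0.96875
```

Note that the streaming pattern never gets above 75% no matter how big the cache is, which is the "streaming through the cache" behavior described above.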
The rule that's used when the cache fills up is called the replacement policy. As I said above, no rule works well when data is just streaming through the cache (a problem also called cache pollution), but the replacement policy is basically how the processor "knows" what data is most common and should stay cached. A commonly used approach is to search through the cache and look for the data that is the Least Recently Used (LRU). That is, a true LRU algorithm will look through each block in the cache and kick out the data that has been in the cache longest without being used by the processor. While an LRU policy is an effective one, it's also expensive due to the control logic needed to keep tabs on each and every block of the cache. Also, for a large cache, checking each and every block can take time, which can slow the cache down overall. Again, it's a balancing act aimed at finding the sweet spot between cost, speed, and complexity. A lot of processors use a pseudo-LRU algorithm, which approximates what a true LRU algorithm does but doesn't cost as much to implement.
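As a sketch of what a true LRU policy does, here's a tiny fully associative cache that always evicts the least recently used block. This is only an illustration of the policy: real hardware tracks recency with a few bits of state per set, not with a data structure like this.

```python
from collections import OrderedDict

class LRUCache:
    """Toy fully associative cache with true LRU replacement."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()      # dict order doubles as recency order

    def access(self, tag):
        """Return True on a hit, False on a miss (evicting the LRU block if full)."""
        if tag in self.blocks:
            self.blocks.move_to_end(tag) # touched: now the most recently used
            return True
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)  # evict the least recently used block
        self.blocks[tag] = True
        return False

cache = LRUCache(3)
for tag in ['A', 'B', 'C']:
    cache.access(tag)                    # three misses fill the cache
cache.access('A')                        # hit: A becomes most recently used
cache.access('D')                        # miss: B, the LRU block, is evicted
print(cache.access('B'))                 # False -- B was the victim
print(cache.access('A'))                 # True -- A survived because it was reused
```

The bookkeeping cost is visible even here: every single access updates the recency order, which is exactly the per-block tracking that makes true LRU expensive in hardware.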
If you want to take a closer look at replacement policies, you can check out a recent (as of May 2005) security flaw that comes about through an implementation of the pseudo-LRU algorithm not being thread-aware.
With the basics covered it's time to get into some of the more complicated stuff. First up is the matter of inclusive and exclusive caching. Inclusive caching has been more common historically, with exclusive caching being comparatively recent. There's probably not a whole lot of difference in which scheme makes a better-performing cache overall, but explaining the two schemes will still help sort a few things out. The P4 uses inclusive caching, while the K7 and K8 use exclusive caching. Inclusive caching replicates the contents of the L1 cache in the L2, which can make life simpler, especially in multi-core/multi-CPU systems. The Northwood P4 has an 8K L1 data cache and a 512K L2 cache. The "usable" L2 cache size is ~504K (512 - 8), and the total cache size is 512K. Exclusive caching is just the opposite: no data is replicated between L1 and L2, so the total cache size is L1 + L2. If you've ever seen AthlonXPs advertised as having 384K (128 + 256) or 768K (128 + 512) of total cache, that's the reason why. A few extra operations are often needed to maintain the exclusivity (that is, to make sure no data is replicated between L1 and L2), which will usually slow the cache down overall, but the added cache size seems to make up for that. Exclusive caching also allows some more flexibility. The L2 cache of the Duron was actually smaller (64K) than its L1 cache (128K), but because of the exclusive caching it was still a fairly decent performer. One drawback to using an exclusive cache is that it puts a lower ceiling on how big any performance gain will be when cache size is increased. That is, a CPU that uses exclusive caching will often show less of a performance increase from bigger caches than a CPU that uses inclusive caching. Exactly how big that performance increase is in either case depends (again) on the applications being used.
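The capacity arithmetic above boils down to a one-line difference between the two schemes. The function name here is just an illustrative label for the calculation, not anything a real tool exposes:

```python
def usable_cache_kb(l1_kb, l2_kb, inclusive):
    """Total unique cache capacity in KB under each scheme."""
    if inclusive:
        # The L1's contents are duplicated in the L2, so the duplicate
        # copies add nothing: unique capacity is just the L2 size.
        return l2_kb
    # Exclusive: every block lives in exactly one cache, so sizes add.
    return l1_kb + l2_kb

print(usable_cache_kb(8, 512, inclusive=True))     # Northwood P4: 512
print(usable_cache_kb(128, 256, inclusive=False))  # 256K AthlonXP: 384
print(usable_cache_kb(128, 64, inclusive=False))   # Duron: 192
```

The Duron line shows why exclusivity mattered there: inclusively, a 64K L2 backing a 128K L1 would be pointless, since the L2 couldn't even hold a copy of the L1's contents.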
The P4 broke away a bit from the usual model described above by introducing the trace cache. In order to perform decently, the P4 needs to run at very high clock speeds, and the trace cache helps by streamlining the flow of instructions to make everything work a bit more smoothly. The trace cache replaces the standard L1 instruction cache (remember, the L1 is often split between instructions and data). In the usual model, the contents of the instruction cache are x86 instructions, just as the compiler produced them. The first few pipeline stages of the processor are then spent breaking those instructions down into the simpler internal operations the processor actually executes. The function of the trace cache is to store instructions that have already been broken down this way and are ready for execution straight out of the cache. The advantage comes with instructions that are executed multiple times in succession, since it allows those first few pipeline stages to be skipped and removes the need to decode the same instructions over and over again.
The K7 and K8 use a more common trick aimed at a similar purpose. As I mentioned above, maintaining exclusivity between the caches often takes a few extra operations, and that has consequences in a number of places. If the requested data is not in the L1 but is in the L2, a line from the L1 first has to be copied into the L2, and then the requested data moved from the L2 into the L1. The extra step must be taken since the L1 and L2 can't contain the same data, but it takes time to do (CPUID has some nifty animations that help visualize the process). One way to speed that up is with a victim buffer (so called because "victims" is the term for data that's been kicked out of a cache). The victim buffer basically allows the processor to keep working more of the time instead of sitting idle. The K7 and K8 don't have the P4's dependency on high clock speeds, but keeping the processor working instead of doing nothing is almost always a good thing. A victim buffer is usually very small, very fast, and sits in between the L1 and L2. When an L1 miss occurs, a block is evicted from the L1 into the victim buffer instead of the L2. At the same time, the processor can be looking in the L2 for the requested data, since the write to the victim buffer can be "hidden" by the higher latency of the L2 cache. If the requested data is found in the victim buffer, it can be accessed much faster than it could be from the L2.
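Here's a rough sketch of the victim buffer idea. The eight-entry size and the class name are made up for illustration: evicted L1 blocks land in a small FIFO, and a lookup there can rescue a recently evicted block without waiting on the slower L2.

```python
from collections import deque

VICTIM_ENTRIES = 8  # hypothetical size; real victim buffers are similarly tiny

class VictimBuffer:
    """Small FIFO holding blocks recently evicted from the L1."""
    def __init__(self):
        # With maxlen set, appending to a full deque silently drops the
        # oldest entry -- the oldest victim falls through to the L2.
        self.entries = deque(maxlen=VICTIM_ENTRIES)

    def insert(self, block):
        self.entries.append(block)       # L1 evicts into the buffer, not the L2

    def lookup(self, block):
        """Checked in parallel with the (slower) L2 access."""
        if block in self.entries:
            self.entries.remove(block)   # found: block moves back into the L1
            return True
        return False

vb = VictimBuffer()
vb.insert('X')                           # X gets kicked out of the L1
print(vb.lookup('X'))                    # True: recovered fast, no L2 trip needed
print(vb.lookup('Y'))                    # False: fall back to the L2 as usual
```

The win is exactly the case the text describes: a block that gets evicted and then re-requested shortly afterwards comes back at victim-buffer speed instead of L2 speed.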
If you've spotted any glaring mistakes or have any questions about this guide, you can either PM me or check out the thread linked below: http://forums.amd.com/index.php?showtopic=23904