07 Aug 2020 - Guanzhou Hu
Caching is an essential technique used broadly in computer system hierarchies. This post briefly summarizes existing cache mode configurations and cache eviction algorithms. This serves as a shallow review of cache systems before I go deeper into this field.
Cache mode generally controls when to promote data into / flush data back from the cache. We can choose from one of the following cache mode configurations [1]:

- **Write-through**: writes go to both the cache and the backing store synchronously; safe against crashes, but writes are as slow as the backing store.
- **Write-back**: writes go to the cache only and are flushed to the backing store later; fast writes, at the risk of losing dirty data on a crash.
- **Write-around**: writes bypass the cache and go directly to the backing store; only reads promote data into the cache.
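To make the distinction concrete, here is a minimal Python sketch (my own illustration - plain dicts stand in for the cache device and the backing store) of how a write is handled under each mode:

```python
cache: dict = {}
backing: dict = {}
dirty: set = set()  # keys written to cache but not yet flushed (write-back)

def write(key, value, mode: str):
    if mode == "write-through":
        cache[key] = value      # update the cache...
        backing[key] = value    # ...and the backing store, synchronously
    elif mode == "write-back":
        cache[key] = value      # update the cache only
        dirty.add(key)          # flush to the backing store later
    elif mode == "write-around":
        backing[key] = value    # bypass the cache entirely
        cache.pop(key, None)    # invalidate any stale cached copy

def flush():
    """Flush dirty write-back entries down to the backing store."""
    for key in list(dirty):
        backing[key] = cache[key]
        dirty.discard(key)
```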
The eviction algorithm is an aspect orthogonal to the cache mode. An eviction algorithm controls which entry to evict from the cache when a new entry is promoted and the cache is full. There has been a long line of research around this topic. I will briefly summarize all the existing eviction algorithms I know of, from the simplest FIFO to more optimized & complex ones [2][3].
Simple: just keep a First-In-First-Out (FIFO) queue of all cache entries and evict the head of the queue.

Pure FIFO is very easy to implement but yields very poor cache performance under real-world workloads.
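As a baseline for the fancier algorithms below, here is what a FIFO cache could look like in Python (a sketch of mine; the class name and interface are made up for illustration):

```python
from collections import OrderedDict

class FIFOCache:
    """Evict whatever entry was inserted first, regardless of use."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = OrderedDict()  # insertion order == eviction order

    def get(self, key):
        return self.entries.get(key)  # note: hits do NOT change the order

    def put(self, key, value):
        if key not in self.entries and len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict the head of the queue
        self.entries[key] = value
```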
Second-Chance is an improvement over FIFO that accounts for temporal locality. It adds a reference bit to each entry of the FIFO queue, which is set whenever the entry is referenced. At eviction, examine the entry at the head of the queue:

- if its reference bit is `1`, clear it to `0`, move the entry to the tail, and continue on the next entry in queue;
- if its reference bit is `0`, evict this entry.

CLOCK is a further implementation-level improvement to Second-Chance which eliminates the use of a FIFO queue. CLOCK keeps a circular list of cache lines called a clock, with a clock hand pointer pointing to the last examined entry in the list [4].
Figure from this video.
At eviction, examine the entry under the clock hand:

- if its reference bit is `1`, clear it to `0` and increment the hand pointer to point to the next entry in list;
- if its reference bit is `0`, evict this entry.

CLOCK acts exactly the same as Second-Chance. It is more efficient than Second-Chance since there is no dynamic FIFO queue here - it uses a fixed circular list instead.
CLOCK was introduced by Fernando Corbató in the 1960s.
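Here is a Python sketch of CLOCK as described above (my own illustration; details such as the initial value of the reference bit vary across real implementations):

```python
class ClockCache:
    """CLOCK: a fixed circular list with a reference bit per slot."""

    def __init__(self, capacity: int):
        self.slots = [None] * capacity   # circular list of (key, value)
        self.refbit = [0] * capacity     # reference bit per slot
        self.hand = 0                    # the clock hand
        self.index = {}                  # key -> slot position

    def get(self, key):
        pos = self.index.get(key)
        if pos is None:
            return None
        self.refbit[pos] = 1             # a hit sets the reference bit
        return self.slots[pos][1]

    def put(self, key, value):
        if key in self.index:            # update an existing entry in place
            self.slots[self.index[key]] = (key, value)
            self.refbit[self.index[key]] = 1
            return
        # advance the hand until a slot with reference bit 0 is found
        while self.slots[self.hand] is not None and self.refbit[self.hand] == 1:
            self.refbit[self.hand] = 0   # give this entry a second chance
            self.hand = (self.hand + 1) % len(self.slots)
        victim = self.slots[self.hand]
        if victim is not None:
            del self.index[victim[0]]    # evict the victim
        self.slots[self.hand] = (key, value)
        self.index[key] = self.hand
        self.hand = (self.hand + 1) % len(self.slots)
```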
Least-Recently Used (LRU) keeps track of the most recent time each cache line was referenced, and evicts the least-recently used one.
LRU can yield near-optimal cache performance in theory, but implementing an LRU scheme is rather expensive. The most primitive way is to keep a linked list of all cache lines, moving a line to the front of the list whenever it is referenced and evicting the line at the tail.

With this scheme, every cache reference operation triggers a traversal over the linked list, resulting in significant overhead. Segmented LRU reduces this overhead by dividing the cache into smaller segments (e.g., by logical address space). It evicts the least-recently used cache line in the segment that the new incoming entry belongs to. Each segment maintains its own LRU list, which is much shorter than an LRU list over the whole cache, so the traversal overhead of a cache reference becomes less significant.
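For illustration, here is an LRU sketch in Python; I let `OrderedDict` play the role of the linked list, so `move_to_end` is the "move to front" step. A segmented LRU would simply keep one such structure per segment, e.g., indexed by `hash(key) % num_segments`:

```python
from collections import OrderedDict

class LRUCache:
    """Whole-cache LRU: always evict the least-recently used entry."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)         # mark as most-recently used
        return self.entries[key]

    def put(self, key, value):
        if key in self.entries:
            self.entries.move_to_end(key)
        elif len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict least-recently used
        self.entries[key] = value
```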
Not Frequently Used (NFU) keeps a reference counter for each cache line. In each time interval, every cache line referenced during the interval has its counter incremented by 1. At eviction, NFU evicts the cache line with the smallest counter value.
The main problem with NFU is that “it keeps track of the frequency of use without regard to the time span” [2]. To make it aware of the time span of use, the Aging algorithm gives a higher weight to recent references and a lower weight to older references: at every time tick, each entry's counter is first shifted right by one bit, and then `0b10000000` is OR-ed into the counter if the entry was referenced during this tick.

Example of the counter value of one entry:

| Time tick | Referenced? | Counter      |
|-----------|-------------|--------------|
| #0        | no          | `0b00000000` |
| #1        | yes         | `0b10000000` |
| #2        | no          | `0b01000000` |
| #3        | yes         | `0b10100000` |
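In code, one aging step could look like the following Python sketch (mine, for illustration; `NUM_BITS` is assumed to be 8 to match the example above):

```python
NUM_BITS = 8
MSB = 1 << (NUM_BITS - 1)      # 0b10000000

counters = {}                  # cache line -> aging counter
referenced = set()             # lines referenced during the current tick

def access(key):
    counters.setdefault(key, 0)
    referenced.add(key)        # recorded now, applied at the next tick

def tick():
    """One aging step over all entries' counters."""
    for key in counters:
        counters[key] >>= 1    # older references lose half their weight
        if key in referenced:
            counters[key] |= MSB  # the newest reference gets the top bit
    referenced.clear()

def victim():
    """Like NFU, evict the entry with the smallest counter value."""
    return min(counters, key=counters.get)
```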
Adaptive Replacement Cache (ARC) is an advanced eviction algorithm that has better cache performance than LRU and NFU. It keeps track of BOTH frequently used and recently used cache lines. This Wikipedia page gives a good demonstration of how this algorithm works [5][3].
Basically, it splits the cache into two queues: `T1` for recently used entries and `T2` for frequently used entries (referenced at least twice). It also attaches two ghost queues `B1` and `B2` to the bottom of `T1` and `T2` to remember the metadata of recently evicted entries from those queues.
```
out <-[ B1 <-[ T1 <-!-> T2 ]-> B2 ]-> out
      [ . . . . [ . . . . . . ! . .^. . . . ] . . . . ]
                [   fixed cache size (c)    ]
```
This text-based representation is from the above-mentioned Wikipedia page [5].
- The `!` marker indicates the entering point of the cache. New entries enter `T1`, to the left of `!`.
- An entry in `T1` or `B1` that gets referenced once more will be pushed into `T2`, to the right of `!`.
- An entry in `T2` or `B2`, if referenced once more, will be re-pushed to the head of `T2`, to the right of `!`. This can repeat indefinitely.
- The `^` marker indicates the boundary of the two queues `T1` and `T2`, and is also the adaptive target position of `!`.
- A cache hit in `B1` will increase the size of `T1`, pushing `^` to the right. The bottom entry of `T2` will be evicted to `B2`.
- A cache hit in `B2` will shrink the size of `T1`, pushing `^` to the left. The bottom entry of `T1` will be evicted to `B1`.
- A cache miss pushes `!` towards `^`, evicting a corresponding entry on that side. The relative position of `!` and `^` controls which queue to evict an item from.

IBM holds a patent for the ARC algorithm.
CLOCK with Adaptive Replacement (CAR) is an implementation-level improvement to ARC which replaces the two queues `T1` and `T2` with two clocks.
Low Inter-reference Recency Set (LIRS) is another advanced eviction algorithm that has better cache performance than LRU. It uses the concepts of reuse distance and recency to dynamically rank cache lines’ locality, and uses these metrics to decide which cache line to evict. This Wikipedia page gives a good demonstration of how this algorithm works [3][6][7].
The reuse distance of a cache line is defined as the number of distinct cache lines accessed between the last two references of the cache line. (If a cache line is only referenced once, its reuse distance is \(\infty\).) The recency of a cache line is defined as the number of distinct cache lines accessed after the most recent reference of the cache line.
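These two metrics can be computed offline from an access trace; here is a small Python sketch of mine, for illustration:

```python
def lirs_metrics(trace):
    """Compute (reuse distance, recency) for each block in an access trace."""
    positions = {}                      # block -> list of reference indices
    for i, block in enumerate(trace):
        positions.setdefault(block, []).append(i)

    metrics = {}
    for block, pos in positions.items():
        # reuse distance: distinct blocks between the last two references
        if len(pos) < 2:
            rdist = float("inf")        # referenced only once
        else:
            rdist = len(set(trace[pos[-2] + 1 : pos[-1]]))
        # recency: distinct blocks accessed after the most recent reference
        recency = len(set(trace[pos[-1] + 1 :]))
        metrics[block] = (rdist, recency)
    return metrics
```

For example, for the trace `A B C A B D`, block `A` has reuse distance 2 (`B` and `C` lie between its two references) and recency 2 (`B` and `D` come after its last reference).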
The cache is divided into two partitions: a Low Inter-reference Recency (LIR) partition, which forms most of the cache, and a High Inter-reference Recency (HIR) partition. All LIR entries are resident in the cache, while only some of the HIR entries are resident. All recently-referenced cache lines are placed in a FIFO queue called the LIRS stack `S`. Resident HIR entries are also placed in a FIFO queue called the stack `Q`.
Calling them “stacks” can be rather misleading - they are simply two queues.
This figure from the above-mentioned Wikipedia page [6] demonstrates the status of the two queues `S` and `Q`:
- Accessing LIR entry `B` @ (a) \(\rightarrow\) goes to (b): `B` is moved to the top of `S`, and all HIR entries at the bottom of `S` will be removed.
- Accessing resident HIR entry `E` @ (a) \(\rightarrow\) goes to (c): `E` is moved to the top of `S` and becomes an LIR entry; the bottom LIR entry of `S` turns back to an HIR entry and moves to the top of `Q`.
- Accessing non-resident HIR entry `D` @ (a) \(\rightarrow\) goes to (d), and accessing entry `C` @ (a) \(\rightarrow\) goes to (e): the resident HIR entry at the bottom of `Q` will be evicted.

CLOCK-Pro is an implementation-level improvement and an approximation to LIRS. It replaces the two queues `S` and `Q` with two clocks.
Multi-Queue (MQ) algorithms maintain two or more LRU queues, each representing a different lifetime length, defined dynamically by the algorithm. Say we use \(m\) LRU queues: \(Q_{m-1}\) represents the shortest lifetime \(l_{m-1}\) and \(Q_0\) represents the longest lifetime \(l_0\). Each entry in the queues also keeps an access count (frequency).
There will also be a ghost list \(Q_{out}\) holding recently-evicted entries’ access counts.
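The following Python sketch (heavily simplified by me - in particular, the lifetime-based demotion of entries between queues is omitted) shows the core MQ bookkeeping: queues indexed by the logarithm of the access count, plus the \(Q_{out}\) ghost list:

```python
import math
from collections import OrderedDict

class MQCache:
    """Pared-down MQ sketch: m LRU queues indexed by log2 of access count."""

    def __init__(self, capacity: int, m: int = 4):
        self.capacity = capacity
        self.queues = [OrderedDict() for _ in range(m)]  # queues[0] = coldest
        self.freq = {}             # key -> access count
        self.qout = OrderedDict()  # ghost list of evicted keys' access counts

    def _level(self, key):
        return min(int(math.log2(self.freq[key])), len(self.queues) - 1)

    def access(self, key):
        hit = False
        for q in self.queues:
            if key in q:
                del q[key]         # will be re-inserted at its new level
                hit = True
        if not hit and sum(len(q) for q in self.queues) >= self.capacity:
            self._evict()
        # a ghost hit in Qout restores the remembered access count
        self.freq[key] = self.freq.get(key, self.qout.pop(key, 0)) + 1
        self.queues[self._level(key)][key] = None  # MRU of its queue

    def _evict(self):
        for q in self.queues:      # take the lowest non-empty queue...
            if q:
                victim, _ = q.popitem(last=False)  # ...and its LRU end
                self.qout[victim] = self.freq.pop(victim)
                if len(self.qout) > self.capacity:
                    self.qout.popitem(last=False)  # bound the ghost list
                return
```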
CLOCK-2Q is an implementation-level improvement and an approximation to the 2Q algorithm, replacing the LRU queues with clocks.
Of course, you can simply adopt random eviction. Random eviction is very efficient and actually yields rather good cache performance, considering that many complicated real-world workloads do not exhibit a high degree of locality.
ARM processors use random eviction to enable a very efficient cache implementation.
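For completeness, a random-eviction cache is almost trivially short (a sketch of mine):

```python
import random

class RandomCache:
    """Random eviction: no bookkeeping whatsoever on cache hits."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = {}

    def get(self, key):
        return self.entries.get(key)    # a hit updates no metadata

    def put(self, key, value):
        if key not in self.entries and len(self.entries) >= self.capacity:
            victim = random.choice(list(self.entries))  # pick any entry
            del self.entries[victim]
        self.entries[key] = value
```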
Among these cache eviction algorithms, LIRS, CLOCK-Pro, MQ, CLOCK-2Q, etc. are the work of Chinese scholars, including Song Jiang, Yuanyuan Zhou, Wengang Wang, and others - perhaps a phenomenon worth noting.
Please comment below anything you wanna say! 😉