Cache Coherence
Why you should care about cache coherence protocols.
Written:
2026-03-30
Updated:
2026-03-30
Why should I care?
Let’s say we have a mutex, and N threads constantly trying to acquire it.
Your intuition probably says that regardless of N, the throughput should be pretty similar (the mutex essentially serializes the threads). And you’d be right for any N > 1 – the throughput is about the same.
But I’ll bet you wouldn’t have guessed that for N = 1 we get ~10x the throughput. By understanding your hardware (and specifically the MESI cache coherence protocol), you’ll understand why.
Understanding MESI also helps with understanding atomic operations, which I’ll hopefully get around to writing soon.
Introduction
On most multicore machines, all cores:
- Have their own local caches
- Share the same RAM
If data is immutable or only accessed from one core ever, there’s no problem. But, once we add shared access and mutability, we run into problems.
Let’s imagine a basic world without cache coherence protocols:
- Core #1 reads `x=24`, then caches it
- Core #2 reads `x=24`, then caches it
- Core #1 writes `x=32` to cache & RAM (* write-through is not standard, covered later)
- Now Core #2’s cache holds an invalid value!
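To make the staleness problem concrete, here’s a toy model of that coherence-free world (the struct and names are mine, not real hardware – two `Option`s stand in for the private caches):

```rust
// Toy model: two cores with private caches, a shared RAM, and no
// coherence protocol. Nothing tells Core #2 when Core #1 writes.

#[derive(Default)]
pub struct World {
    pub ram: u32,
    pub cache1: Option<u32>, // Core #1's cached copy of x
    pub cache2: Option<u32>, // Core #2's cached copy of x
}

pub fn demo() -> (u32, u32) {
    let mut w = World { ram: 24, ..Default::default() };
    w.cache1 = Some(w.ram); // Core #1 reads x=24, caches it
    w.cache2 = Some(w.ram); // Core #2 reads x=24, caches it
    w.cache1 = Some(32);    // Core #1 writes x=32 to its cache...
    w.ram = 32;             // ...and writes through to RAM
    // Nobody told Core #2, so its cache still says 24:
    (w.cache2.unwrap(), w.ram)
}

fn main() {
    let (core2_view, ram) = demo();
    println!("Core #2 sees {}, RAM holds {}", core2_view, ram);
}
```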
Hardware & History
Bus-Based (Legacy)
On older machines (1990s & 2000s), all cores shared a single front-side bus (FSB). The bus connected to a northbridge, which only then actually talked to the RAM:
Given this architecture, the bus had to be mutually exclusive (only controlled by 1 core at a time). While this does make it a bottleneck, it also makes coordination simple: cores can just “snoop” on the bus to know when to invalidate (or even directly update) their cache.
Modern On-Die Interconnect
Modern CPUs have neither a northbridge nor an FSB. Instead, they have:
- an on-die interconnect (different vendors have roughly equivalent designs)
- an on-die, integrated memory controller (IMC), which sits on this interconnect
Unlike FSB, the interconnect facilitates peer-to-peer communication (it is not a broadcast). This difference has a huge effect on coherence strategies.
Basic Cache Coherence Protocols
We have two basic cache coherence protocols for bus-based architectures:
- Write-Update: When a core writes to a cache line, it broadcasts the new value, which other cores use to update their cache.
- Write-Invalidate: When a core writes to a cache line, it broadcasts a message telling other cores to invalidate their copy.
Write-update may initially seem faster – cores don’t have to refetch data. But there’s more nuance, and in practice, write-invalidate uses significantly less bandwidth. Take this example:
- If core #1 writes 10x and core #2 reads it after the writes
- Write-update had #2 update the cache 10x
- Write-invalidate had #2 invalidate once, and refetch once
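A back-of-envelope tally of that example (assuming each update broadcast, invalidate broadcast, and refetch costs one bus message – a simplification of mine):

```rust
// Write-update: every write broadcasts the new value.
pub fn write_update_traffic(writes: u32, _reads_after: u32) -> u32 {
    writes
}

// Write-invalidate: the first write broadcasts one invalidate;
// the first read afterwards refetches once.
pub fn write_invalidate_traffic(writes: u32, reads_after: u32) -> u32 {
    (if writes > 0 { 1 } else { 0 }) + (if reads_after > 0 { 1 } else { 0 })
}

fn main() {
    // 10 writes by core #1, then 1 read by core #2:
    println!("write-update:     {} messages", write_update_traffic(10, 1));
    println!("write-invalidate: {} messages", write_invalidate_traffic(10, 1));
}
```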
Both have room for improvement:
- The bus is a bottleneck – all coherence traffic and memory requests compete for the same mutually exclusive resource
- We’re sending invalidation messages even when no other cache holds the line (wasted bandwidth)
- These protocols assume write-through (every write goes to both cache and RAM). Write-back caches are more efficient (less memory traffic), but they make coherence harder — now main memory can be stale too, so the protocol needs to track which lines are dirty.
The MESI Protocol
MESI is a more advanced cache coherence protocol which enables write-back and avoids wasted bus traffic.
MESI achieves this by having each core store a small amount of state for every cache line.
States
| State | Meaning |
| --- | --- |
| Modified | Dirty / differs from RAM, only this core has it |
| Exclusive | Clean / matches RAM, only this core has it |
| Shared | Clean / matches RAM, multiple cores have it |
| Invalid | This core doesn’t have it (or the cached value is stale) |
Operations
Read
Line is Modified, Exclusive, or Shared (Hit):
- Simply return the cached value (no state change or bus activity)
Line is Invalid (Miss). Next step depends on what other cores hold:
- None: retrieve from RAM, cache as `Exclusive`
- Exclusive: retrieve from that core, both cores → `Shared`
- Modified: retrieve from that core, both cores → `Shared`, write-back to RAM (clean it)
- Shared: retrieve from any core, cache as `Shared`
Write
Line is Modified (Hit): just modify cache
Line is Exclusive (Hit): modify cache, → Modified (mark dirty)
Line is Shared (Hit): tell others → Invalid, modify cache, → Modified
Line is Invalid (Miss). Next step depends on what other cores hold:
- None: retrieve from RAM, modify, → `Modified`
- Exclusive or Shared: tell others → `Invalid`, retrieve from RAM, modify, → `Modified`
- Modified: tell the other core to write back to RAM & → `Invalid`, retrieve from RAM, modify, → `Modified`
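The read and write rules above can be sketched as transition functions (a simplified model of mine – the enums and names aren’t from any real hardware, and it tracks only this core’s state, not the actual data movement):

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum Mesi { Modified, Exclusive, Shared, Invalid }

// What the *other* cores collectively hold for this line.
#[derive(Clone, Copy)]
pub enum Others { None, SomeExclusive, SomeShared, SomeModified }

// This core reads the line: returns its next state.
pub fn on_read(state: Mesi, others: Others) -> Mesi {
    match state {
        // Hit: return the cached value, no state change.
        Mesi::Modified | Mesi::Exclusive | Mesi::Shared => state,
        // Miss: where we fetch from depends on the other cores.
        Mesi::Invalid => match others {
            Others::None => Mesi::Exclusive, // fetch from RAM, sole owner
            // Fetch from a peer (a Modified peer also writes back to RAM);
            // everyone involved ends up Shared.
            _ => Mesi::Shared,
        },
    }
}

// This core writes the line: returns (next state, bus traffic needed?).
pub fn on_write(state: Mesi) -> (Mesi, bool) {
    match state {
        Mesi::Modified => (Mesi::Modified, false),  // hit, already dirty
        Mesi::Exclusive => (Mesi::Modified, false), // silent promotion
        Mesi::Shared => (Mesi::Modified, true),     // must invalidate peers
        Mesi::Invalid => (Mesi::Modified, true),    // fetch + invalidate peers
    }
}

fn main() {
    assert_eq!(on_read(Mesi::Invalid, Others::None), Mesi::Exclusive);
    assert_eq!(on_read(Mesi::Invalid, Others::SomeModified), Mesi::Shared);
    assert_eq!(on_write(Mesi::Exclusive), (Mesi::Modified, false));
    println!("transitions ok");
}
```

Note that every write ends in `Modified` – the only question is whether getting there costs bus traffic, which is exactly what the `Exclusive` state optimizes.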
Why the Exclusive State?
Another protocol, MSI, exists – it’s the same as MESI, but without `Exclusive`. MESI has `Exclusive` to reduce bus traffic:
- If a core is the sole owner and the line is clean, it can promote to `Modified` on a write with zero bus activity
- In MSI, a line held by only one core is still marked `Shared`, so every write requires broadcasting an invalidate – even when nobody else has the line
Modern Coherence
Modern systems can’t easily broadcast like in bus-based ones. Instead, many use a directory-based approach:
- A directory tracks exactly which caches hold each line
- When a core needs to invalidate or fetch, the directory tells it who to talk to
Directory-based works with MESI, and in fact, many still use MESI (or a variant like MOESI or MESIF).
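Here’s a minimal sketch of what a directory might track (the structure and names are hypothetical – real directories live in hardware and track far more, like the MESI state of each copy):

```rust
use std::collections::{HashMap, HashSet};

// For each cache line (by address), track which cores hold a copy.
// On a write, the directory tells the writer exactly whom to
// invalidate - no broadcast needed.
pub struct Directory {
    sharers: HashMap<u64, HashSet<usize>>, // line address -> core ids
}

impl Directory {
    pub fn new() -> Self {
        Directory { sharers: HashMap::new() }
    }

    // Core `core` reads `line`: record it as a sharer.
    pub fn on_read(&mut self, line: u64, core: usize) {
        self.sharers.entry(line).or_default().insert(core);
    }

    // Core `core` writes `line`: return the cores that must invalidate,
    // and record the writer as the sole holder.
    pub fn on_write(&mut self, line: u64, core: usize) -> Vec<usize> {
        let prev = self.sharers.remove(&line).unwrap_or_default();
        self.sharers.insert(line, HashSet::from([core]));
        prev.into_iter().filter(|&c| c != core).collect()
    }
}

fn main() {
    let mut d = Directory::new();
    d.on_read(0x40, 0);
    d.on_read(0x40, 1);
    let mut must_invalidate = d.on_write(0x40, 2);
    must_invalidate.sort();
    println!("core #2 writes; invalidate cores {:?}", must_invalidate);
}
```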
Putting it Together
Let’s go back to our example from the start – why is it so much slower with N>1 than N=1?
In Rust, locking a mutex involves an atomic operation on shared state:
```rust
pub fn lock(&self) -> MutexGuard {
    // state is shared data
    // this (atomic) operation tries to acquire the lock
    while self.state.compare_exchange(0, 1, ...).is_err() {
        // spin or park
    }
}
```

Now that we understand MESI, we know that:
- If N=1, that core can keep the cache line in `Modified`. The `compare_exchange` simply modifies its local cache, without ever touching RAM or communicating over the bus
- If N>1, we’ll have a ton of bus traffic – cores constantly taking the line as `Modified` and invalidating others
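If you want to see the effect yourself, here’s a rough benchmark sketch using the standard library’s `Mutex` (a quick illustration, not a rigorous benchmark – the numbers are machine-dependent, and a serious measurement would pin threads and warm up first):

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Instant;

// N threads each acquire the mutex `iters` times.
// Returns total lock acquisitions per second.
pub fn bench(n: usize, iters: usize) -> f64 {
    let m = Arc::new(Mutex::new(0u64));
    let start = Instant::now();
    let handles: Vec<_> = (0..n)
        .map(|_| {
            let m = Arc::clone(&m);
            thread::spawn(move || {
                for _ in 0..iters {
                    // Forces the lock's cache line into Modified on this core.
                    *m.lock().unwrap() += 1;
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    (n * iters) as f64 / start.elapsed().as_secs_f64()
}

fn main() {
    // Expect the solo case to be several times faster per acquisition.
    println!("N=1: {:.0} locks/s", bench(1, 100_000));
    println!("N=4: {:.0} locks/s", bench(4, 100_000));
}
```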
Aha!
Footnotes
This was my first post going into depth about a topic outside my projects. I’d love to hear what you think!
Jon Gjengset’s talk, “The Cost of Concurrency Coordination”, was the inspiration for this post. I’d highly recommend giving it a listen, along with any of his other videos.
Additionally, I referenced (and would recommend checking out!) these sources: