Cache Coherence

Why you should care about cache coherence protocols.

  • Written: 2026-03-30

  • Updated: 2026-03-30

Why should I care?

Let’s say we have a mutex, with N threads constantly trying to acquire it.

Your intuition probably says that regardless of N, the throughput should be pretty similar (the mutex essentially serializes the threads). And you’d be right for any N > 1 – the throughput is about the same.

But I’ll bet you wouldn’t have guessed that for N = 1 we get ~10x the throughput. By understanding your hardware (and specifically the MESI cache coherence protocol), you’ll understand why.

Understanding MESI also helps with understanding atomic operations, which I’ll hopefully get around to writing soon.

Introduction

On most multicore machines, all cores:

  • Have their own local caches
  • Share the same RAM

If data is immutable or only accessed from one core ever, there’s no problem. But, once we add shared access and mutability, we run into problems.

Let’s imagine a basic world without cache coherence protocols:

  1. Core #1 reads x=24, then caches it
  2. Core #2 reads x=24, then caches it
  3. Core #1 writes x=32 to cache & RAM (write-through – not how modern caches work, covered later)
  4. Now Core #2’s cache holds a stale value!
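The four steps above can be sketched as a toy model (no real hardware involved – just two per-core "caches" and a shared "RAM" with nothing keeping them in sync):

```rust
use std::collections::HashMap;

// Toy model: each core has a private cache backed by a shared RAM.
struct Core {
    cache: HashMap<&'static str, i32>,
}

impl Core {
    // Read through the cache: fetch from RAM only on the first access.
    fn read(&mut self, ram: &HashMap<&'static str, i32>, addr: &'static str) -> i32 {
        *self.cache.entry(addr).or_insert_with(|| ram[addr])
    }
}

fn main() {
    let mut ram = HashMap::from([("x", 24)]);
    let mut core1 = Core { cache: HashMap::new() };
    let mut core2 = Core { cache: HashMap::new() };

    core1.read(&ram, "x"); // step 1: core #1 caches x=24
    core2.read(&ram, "x"); // step 2: core #2 caches x=24

    core1.cache.insert("x", 32); // step 3: core #1 writes x=32 to its cache...
    ram.insert("x", 32);         // ...and through to RAM

    // step 4: core #2 still returns the stale value from its cache
    assert_eq!(core2.read(&ram, "x"), 24);
    assert_eq!(ram["x"], 32);
}
```

The rest of the post is about the machinery real hardware adds so this can’t happen.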

Hardware & History

Bus-Based (Legacy)

On older machines (1990s & 2000s), all cores shared a single front-side bus (FSB). The bus connected to a northbridge, which in turn actually talked to the RAM.

Given this architecture, the bus had to be mutually exclusive (only controlled by 1 core at a time). While this does make it a bottleneck, it also makes coordination simple: cores can just “snoop” on the bus to know when to invalidate (or even directly update) their cache.

Modern On-Die Interconnect

Modern CPUs have neither a northbridge nor an FSB; instead, the cores and the memory controller talk over an on-die interconnect (e.g. a ring or mesh).

Unlike the FSB, the interconnect facilitates peer-to-peer communication (messages aren’t broadcast to every core). This difference has a huge effect on coherence strategies.

Basic Cache Coherence Protocols

We have two basic cache coherence protocols for bus-based architectures:

  1. Write-Update: When a core writes to a cache line, it broadcasts the new value, which other cores use to update their cache.
  2. Write-Invalidate: When a core writes to a cache line, it broadcasts a message telling other cores to invalidate their copy.

Write-update may initially seem faster – cores never have to refetch data. But there’s more nuance, and in practice, write-invalidate uses significantly less bandwidth. Take this example:

  • Core #1 writes the line 10 times, and core #2 reads it only after all the writes
  • Write-update: core #2’s cache is updated 10 times (10 bus messages)
  • Write-invalidate: core #2 invalidates once, and refetches once
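The accounting for that example can be written out explicitly (a toy model, assuming one bus message per update broadcast, per invalidate, and per refetch):

```rust
// Toy bus-message accounting for a burst of writes by core #1
// followed by a single read from core #2.
fn write_update_messages(writes: u64) -> u64 {
    writes // every write broadcasts the new value to the other cache
}

fn write_invalidate_messages(_writes: u64) -> u64 {
    1 + 1 // one invalidate on the first write, one refetch on the later read
}

fn main() {
    assert_eq!(write_update_messages(10), 10);
    assert_eq!(write_invalidate_messages(10), 2);
}
```

The gap only widens as the write burst gets longer: write-invalidate’s cost stays constant while write-update’s grows with every write.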

Both have room for improvement:

  • The bus is a bottleneck – all coherence traffic and memory requests compete for the same mutually exclusive resource
  • We’re sending invalidation messages even when no other cache holds the line (wasted bandwidth)
  • These protocols assume write-through (every write goes to both cache and RAM). Write-back caches are more efficient (less memory traffic), but they make coherence harder — now main memory can be stale too, so the protocol needs to track which lines are dirty.

The MESI Protocol

MESI is a more advanced cache coherence protocol which enables write-back and avoids wasted bus traffic.

MESI achieves this by having each core store a small amount of state for every cache line.

States

  State       Meaning
  Modified    Dirty / differs from RAM, only this core has it
  Exclusive   Clean / matches RAM, only this core has it
  Shared      Clean / matches RAM, multiple cores have it
  Invalid     This core doesn’t have it (or the cached value is stale)

Operations

Read

Line is Modified, Exclusive, or Shared (Hit):

  • Simply return the cached value (no state change or bus activity)

Line is Invalid (Miss). Next step depends on what other cores hold:

  1. None: retrieve from RAM, cache as Exclusive
  2. Exclusive: retrieve from that core, both cores → Shared
  3. Modified: retrieve from that core, both cores → Shared, write-back to RAM (clean it)
  4. Shared: retrieve from any core, cache as Shared

Write

Line is Modified (Hit): just modify cache

Line is Exclusive (Hit): modify cache, → Modified (mark dirty)

Line is Shared (Hit): tell others → Invalid, modify cache, → Modified

Line is Invalid (Miss). Next step depends on what other cores hold:

  1. None: retrieve from RAM, modify, → Modified
  2. Exclusive or Shared: tell others → Invalid, retrieve from RAM, modify, → Modified
  3. Modified: tell the other core to write back to RAM & → Invalid, retrieve from RAM, modify, → Modified
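The read and write rules above can be sketched as a toy state machine – this assumes a single cache line and instantaneous, reliable messaging (real hardware is far messier), but the transitions match the lists above:

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum State {
    Modified,
    Exclusive,
    Shared,
    Invalid,
}
use State::*;

// `states[i]` is core i's state for the (single) cache line.

// A read by core `me`: hits are free; on a miss, everyone who holds
// the line (including a Modified holder, after its write-back) and
// the reader end up Shared. If nobody holds it, the reader gets it
// Exclusive from RAM.
fn read(states: &mut [State], me: usize) {
    if states[me] != Invalid {
        return; // hit: no state change, no bus activity
    }
    if states.iter().any(|&s| s != Invalid) {
        for s in states.iter_mut() {
            if *s != Invalid {
                *s = Shared; // a Modified holder also writes back to RAM here
            }
        }
        states[me] = Shared;
    } else {
        states[me] = Exclusive; // nobody had it: fetch from RAM
    }
}

// A write by core `me`: every other holder is invalidated and the
// writer takes the line Modified. (If `me` was already Modified or
// Exclusive, the loop is a no-op – that's the zero-bus-traffic case.)
fn write(states: &mut [State], me: usize) {
    for (i, s) in states.iter_mut().enumerate() {
        if i != me {
            *s = Invalid;
        }
    }
    states[me] = Modified;
}

fn main() {
    let mut states = [Invalid, Invalid];
    read(&mut states, 0);
    assert_eq!(states, [Exclusive, Invalid]); // sole reader: Exclusive
    read(&mut states, 1);
    assert_eq!(states, [Shared, Shared]); // second reader: both Shared
    write(&mut states, 0);
    assert_eq!(states, [Modified, Invalid]); // writer invalidates the other
}
```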

Why the Exclusive State?

A simpler protocol, MSI, also exists – it is the same, but without Exclusive. MESI adds Exclusive to reduce bus traffic:

  • If a core is the sole owner and the line is clean, it can promote to Modified on a write with zero bus activity
  • In MSI, a line held by only one core is still marked Shared, so every write requires broadcasting an invalidate – even when nobody else has the line

Modern Coherence

Modern systems can’t easily broadcast like in bus-based ones. Instead, many use a directory-based approach:

  • A directory tracks exactly which caches hold each line
  • When a core needs to invalidate or fetch, the directory tells it who to talk to

Directory-based works with MESI, and in fact, many still use MESI (or a variant like MOESI or MESIF).
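A minimal sketch of the directory idea (names and structure are my own, not any real implementation): track the set of holders per line, so invalidates and fetches become point-to-point messages instead of broadcasts.

```rust
use std::collections::{HashMap, HashSet};

// Toy directory: for each line address, which cores currently hold it.
struct Directory {
    sharers: HashMap<u64, HashSet<usize>>,
}

impl Directory {
    fn new() -> Self {
        Directory { sharers: HashMap::new() }
    }

    // A core fetches a line: record it as a holder, and return the cores
    // it needs to talk to (instead of broadcasting to everyone).
    fn fetch(&mut self, addr: u64, core: usize) -> Vec<usize> {
        let holders = self.sharers.entry(addr).or_default();
        let mut others: Vec<usize> =
            holders.iter().copied().filter(|&c| c != core).collect();
        others.sort();
        holders.insert(core);
        others
    }

    // A core writes: only the other holders get an invalidate message.
    fn invalidate_others(&mut self, addr: u64, core: usize) -> Vec<usize> {
        let holders = self.sharers.entry(addr).or_default();
        let mut others: Vec<usize> =
            holders.iter().copied().filter(|&c| c != core).collect();
        others.sort();
        holders.retain(|&c| c == core);
        others
    }
}

fn main() {
    let mut dir = Directory::new();
    assert!(dir.fetch(0x40, 0).is_empty()); // first holder: no one to contact
    assert_eq!(dir.fetch(0x40, 1), vec![0]); // core 1 fetches from core 0
    assert_eq!(dir.invalidate_others(0x40, 1), vec![0]); // point-to-point invalidate
}
```

The trade-off versus snooping is extra storage and an extra hop through the directory, in exchange for coherence traffic that scales with the number of actual sharers rather than the number of cores.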

Putting it Together

Let’s go back to our example from the start – why is it so much slower with N>1 than N=1?

In Rust, locking a mutex involves an atomic operation on shared state:

pub fn lock(&self) -> MutexGuard<'_> {
    // `state` is data shared between threads (an atomic integer)
    // this atomic operation tries to acquire the lock by swapping 0 -> 1,
    // failing if another thread already holds it
    while self
        .state
        .compare_exchange(0, 1, Ordering::Acquire, Ordering::Relaxed)
        .is_err()
    {
        // spin or park
    }
    // ... construct and return the guard
}

Now that we understand MESI, we know that:

  • If N=1, that core can keep the cache line in Modified. The compare_exchange simply modifies its local cache, without ever touching RAM or communicating over the bus
  • If N>1, we’ll have a ton of coherence traffic – cores constantly taking the line as Modified and invalidating each other’s copies
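If you want to see this on your own machine, here’s a rough microbenchmark sketch using std::sync::Mutex (the exact ratio depends heavily on your hardware, and the numbers below are not calibrated – scale the iteration counts as needed):

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::{Duration, Instant};

// N threads each take the same Mutex `iters` times.
// Returns the elapsed time and the final counter value.
fn bench(n_threads: usize, iters: u64) -> (Duration, u64) {
    let lock = Arc::new(Mutex::new(0u64));
    let start = Instant::now();
    let handles: Vec<_> = (0..n_threads)
        .map(|_| {
            let lock = Arc::clone(&lock);
            thread::spawn(move || {
                for _ in 0..iters {
                    // each lock + increment pulls the cache line into
                    // Modified on this thread's core
                    *lock.lock().unwrap() += 1;
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let total = *lock.lock().unwrap();
    (start.elapsed(), total)
}

fn main() {
    // same total work, different thread counts
    println!("N=1: {:?}", bench(1, 400_000).0);
    println!("N=4: {:?}", bench(4, 100_000).0);
}
```

With N=1 the line stays Modified in one core’s cache the whole time; with N=4 it ping-pongs between cores on every acquisition.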

Aha!

Footnotes

This was my first post going in depth on a topic outside my projects. I’d love to hear what you think!

Jon Gjengset’s talk, “The Cost of Concurrency Coordination”, was the inspiration for this post. I’d highly recommend giving it a listen, along with any of his other videos.

Additionally, I referenced (and would recommend checking out!) these sources:

  1. Cache coherence in shared-memory architectures (University of Texas)
  2. The MESI protocol (University of Pittsburgh)
  3. MESI Protocol Wikipedia