When a Cache Does More Than Cache

When a Cache Does More Than Cache Link to heading

Most people think of a cache as a simple performance booster: store a result, reuse it later, move on.

But in highly concurrent systems, a cache can play a more important role. It can become a coordination mechanism that protects the rest of the system from wasteful duplicate work.

That is the idea behind cpp-cache, built by the SIMYL engineering team and based on a design that has run in production for over six years.

The problem: too many threads, same miss Link to heading

Imagine a popular key is requested at the exact moment it is not yet cached.

If ten, fifty, or one hundred threads hit that same miss at roughly the same time, a naïve implementation triggers the same expensive computation again and again. Instead of reducing load, the cache amplifies it. More CPU is consumed, more backend pressure is generated, and latency becomes harder to control.

This is the classic thundering herd problem.

flowchart LR
    subgraph without["Without coordination"]
        direction TB
        T1a[Thread 1] -->|MISS| B1[Backend]
        T2a[Thread 2] -->|MISS| B2[Backend]
        T3a[Thread 3] -->|MISS| B3[Backend]
        TNa[Thread N] -->|MISS| BN[Backend]
    end
    subgraph with["With single-flight"]
        direction TB
        T1b[Thread 1] -->|compute| C[Cache]
        T2b[Thread 2] -.->|wait| C
        T3b[Thread 3] -.->|wait| C
        TNb[Thread N] -.->|wait| C
        C -->|1 call| Backend
    end

    style without fill:#fee,stroke:#c00
    style with fill:#efe,stroke:#0a0

On the left, every thread hammers the backend independently. On the right, one thread computes while the rest wait — the backend sees a single call.

Here is the difference in practice:

// Without coordination: every thread computes independently
auto val = expensive_backend_call(key);  // x100 threads = x100 calls

// With cpp-cache: one thread computes, the rest wait and share
auto result = cache.get_or_compute(key);
// Only one backend call. All 100 threads get the same result.

if (result.is_positive())
    use(*result.value());

A good concurrent cache should not merely remember values. It should also prevent the system from stampeding when the value is missing.

The solution: one computation per key Link to heading

The most important property of the design is straightforward:

If multiple threads request the same missing key, only one of them performs the computation. The others wait and reuse the result.

This is known as single-flight behavior.

sequenceDiagram
    participant T1 as Thread 1
    participant C as Cache
    participant T2 as Thread 2
    participant T3 as Thread 3
    participant B as Backend

    T1->>C: get_or_compute("key")
    Note over C: Miss → State = COMPUTING
    T2->>C: get_or_compute("key")
    Note over C: State is COMPUTING
    T2-->>C: Wait on condition variable
    T3->>C: get_or_compute("key")
    T3-->>C: Wait on condition variable

    C->>B: Expensive Computation
    B-->>C: Result

    Note over C: State = READY
    C-->>T1: Return ComputedPositive
    C-->>T2: Notify → Return HitPositive
    C-->>T3: Notify → Return HitPositive

The beauty of this approach is that it does not sacrifice overall concurrency. Threads asking for different keys can still proceed independently. The coordination happens where it should: on the contested key, not across the entire cache.

The global lock that protects the hash table and LRU list is held only for microsecond-level structural operations — lookup, insert, reorder. It is never held while the solver runs. That means the expensive work happens with no locks held, and different keys can be computed truly in parallel.

block-beta
    columns 4
    A["Phase 1\nGlobal mutex\nlookup/insert"]:1
    B["Phase 2\nNO LOCKS\nSolver runs"]:1
    C["Phase 3\nEntry mutex\nset state + notify"]:1
    D["Phase 4\nGlobal mutex\nLRU reorder"]:1

    style A fill:#fcc,stroke:#900
    style B fill:#cfc,stroke:#090
    style C fill:#ccf,stroke:#009
    style D fill:#fcc,stroke:#900

The critical insight: Phase 2 (where the real work happens) holds zero locks. That is why multiple keys can be resolved concurrently even though there is a single global mutex.

Designed for real-world failures too Link to heading

Another notable aspect of the cache is that it treats failures as meaningful outcomes.

Successful results and failed results can both be cached, each with its own TTL.

Why does that matter? Consider a concrete scenario: your backend returns a 404 for a resource that does not exist. Without negative caching, every single request for that resource retries the backend. Under high traffic, a non-existent key can generate as much load as a popular one — perhaps more, because there is never a cached result to absorb the requests.

With negative caching, the first miss calls the solver, gets a failure, and caches that failure for 30 seconds (or whatever the negative TTL is). For the next 30 seconds, all subsequent requests see the cached failure immediately. The backend is protected.

This is a small feature with a very practical payoff — especially in systems where the set of “known absent” keys is large or where a backend failure is temporary and retrying it hundreds of times per second does no good.

sequenceDiagram
    participant T1 as Thread 1
    participant C as Cache
    participant T2 as Thread 2
    participant B as Backend

    T1->>C: get_or_compute("missing-key")
    C->>B: Solver call
    B-->>C: 404 / nullptr
    Note over C: State = FAILED (TTL=30s)
    C-->>T1: ComputedNegative

    T2->>C: get_or_compute("missing-key")
    Note over C: FAILED, within TTL
    C-->>T2: HitNegative (instant, no backend call)

The backend is called once. For the next 30 seconds, every request for that key gets an immediate cached failure — no redundant calls.

Safer lifetimes under concurrency Link to heading

The design separates logical validity from physical lifetime.

Cached values are exposed through shared_ptr, which means a value can be evicted or invalidated by the cache without immediately becoming unsafe for code that is still using it.

Why does this matter concretely? Without this separation, a thread reading a cached value could crash if another thread evicts it at the wrong moment. That is a use-after-free bug — one of the most dangerous classes of memory errors in C++, and one that would only manifest under specific timing conditions in production.

The cache also maintains an internal atomic reference count on each entry. An entry cannot be evicted if any thread is actively working with it. This invariant is enforced by the formal model (more on that below) and was the source of a real bug fix.

sequenceDiagram
    participant A as Thread A
    participant C as Cache
    participant E as Evictor

    A->>C: get_or_compute(key)
    Note over C: acquire() → refcount = 2
    A->>A: Using value (safe)
    E->>C: Need to evict this entry
    Note over C: refcount = 2 > 1 → skip!
    E->>C: Evict a different entry instead
    A->>C: Done → release() → refcount = 1
    Note over C: Entry safe for future eviction

The refcount acts as a shield: as long as any thread is actively using an entry, the evictor skips it. No use-after-free is possible.

A cache with explicit states Link to heading

Internally, entries move through explicit states: computing, ready, failed, and invalidated.

stateDiagram-v2
    [*] --> EMPTY
    EMPTY --> COMPUTING: Cache Miss
    COMPUTING --> READY: Success
    COMPUTING --> FAILED: Failure
    READY --> READY: Hit (refresh TTL)
    FAILED --> FAILED: Hit (refresh TTL)
    READY --> INVALIDATED: invalidate()
    FAILED --> INVALIDATED: TTL Expired
    INVALIDATED --> EMPTY: LRU Eviction
    READY --> EMPTY: TTL Expired + Eviction

Note the self-loops on Ready and Failed: every cache hit refreshes the TTL, implementing a sliding window that keeps frequently accessed entries alive. That clarity matters.

It makes the implementation more disciplined, the concurrency rules easier to reason about, and the intended behavior easier to verify.

When formal verification catches a real bug Link to heading

What makes this project especially interesting to me is that it does not stop at testing.

The repository includes a Promela/SPIN formal model that exhaustively explores ~3.5 million states of the concurrency protocol. That model verified 23 safety assertions and 10 liveness properties — including termination, starvation freedom, and the guarantee that every requested key is eventually computed.

But the best part of the story is not the green checkmarks. The model actually found a real bug.

The original C++ code had an eviction function that checked whether an entry was in the Computing state before evicting it — but it did not check the reference count. That meant an entry could be evicted while another thread was actively using it, leading to a potential use-after-free. The Promela model already had the correct guard (refcount <= 1 before eviction), and when SPIN explored all interleavings, it exposed the missing check in the C++ implementation.

This is exactly the kind of concurrency bug that survives code review and conventional testing — it requires a specific interleaving of eviction and active reference to trigger. The formal model caught it before production did.

That single discovery justifies the effort of building the model.

An honest look at tradeoffs Link to heading

No design is without tradeoffs, and acknowledging them is important.

Global mutex: The single lock that protects structure works well when the solver is the real bottleneck (which it usually is), but it could become a contention point under thousands of threads with very high miss rates. Sharding into N sub-caches is planned.
Dependency on Aleph-w: The cache uses Aleph-w’s intrusive data structures for performance, but this makes integration harder than a header-only library. A self-contained mode is under consideration.
No benchmarks yet: Despite production validation at 100K req/sec, the repository lacks microbenchmarks. This is the next planned addition.

These are real limitations. They do not undermine the core design, but they do constrain adoption today.

Why this is worth caring about Link to heading

Infrastructure like this rarely attracts the same attention as customer-facing features, but it has enormous leverage.

A better concurrent cache can mean:

fewer duplicate backend calls,
less waste under bursty traffic,
more predictable latency,
safer memory behavior,
and stronger confidence in correctness — not just from testing, but from exhaustive formal verification.

That is the kind of engineering improvement that quietly helps everything above it.

What’s next Link to heading

The project is openly available at lrleon/cpp-cache. If you want to try it, the building instructions show how to build and run the tests in under five minutes.

Planned next steps include:

Benchmarks with configurable thread count, key space, and miss rate
Sharding for linear scalability with cores
A try_find() variant that returns immediately if an entry is being computed
Self-contained build option without Aleph-w

Contributions, feedback, and critical reviews are welcome.

Closing thought Link to heading

What I like most about this version of cpp-cache is that it reflects a mature idea of performance engineering.

It is not just about storing results faster. It is about controlling contention intelligently, protecting the system from herd behavior, and treating correctness as part of performance work, not as an afterthought.

The fact that formal verification caught a real bug in the eviction path is the kind of story that makes the investment in rigor worth it. Not every project needs a Promela model — but when you are building concurrent infrastructure that other systems depend on, the cost of a subtle bug in production far exceeds the cost of building a model.

That is exactly the kind of infrastructure investment that pays off over time.