100K Requests per Second with cpp-cache, Hash Routing, and a Proven Scaler


Published by Simyl Research S.A.S. — High-Performance Backend Engineering


TL;DR — A high-performance C++20 in-process cache, combined with hash-based routing and horizontal scaling, sustains 100,000 requests per second in Kubernetes with p95 latency under 5 ms and a cache hit ratio above 95%. This post presents the architecture, the concurrency model, and the operational lessons learned.


The Core Problem: Thundering Herd in Caches

Consider any expensive operation — a database query, an external API call, a complex computation. Without coordination, multiple threads requesting the same key simultaneously will each trigger the expensive operation independently.

This is the problem that cpp-cache solves by implementing a Thundering Herd Shield. To understand why this mechanism is critical for system stability beyond simple data storage, we recommend reading our previous post: When a Cache Does More Than Just Cache.


System Architecture

The overall system has three layers working in concert:

flowchart TD
    Client["🌐 Clients\n(browsers, mobile, services)"]
    LB["⚖️ Load Balancer / API Gateway\n(AWS ALB · NGINX · Envoy)"]
    Router["🔀 Hash Router\n(consistent hashing ring)"]

    subgraph GW ["Gateway Layer — N worker processes / pods"]
        GW1["Worker 1\ncpp-cache shard A"]
        GW2["Worker 2\ncpp-cache shard B"]
        GWN["Worker N\ncpp-cache shard Z"]
    end

    Backend["🗄️ Origin Backend\n(database · microservice · external API)"]

    Client -->|HTTP/gRPC| LB
    LB -->|any routing| Router
    Router -->|key hash → shard| GW1
    Router -->|key hash → shard| GW2
    Router -->|key hash → shard| GWN
    GW1 -->|cache miss only| Backend
    GW2 -->|cache miss only| Backend
    GWN -->|cache miss only| Backend

The load balancer distributes connections; the hash router ensures a given key always lands on the same worker process and therefore the same cpp-cache instance. This routing affinity converts random cache misses into a near-deterministic hit rate bounded only by LRU eviction.


Layer 1 — cpp-cache: Architecture and Concurrency Model

cpp-cache is a header-only C++20 in-process cache built on Aleph-w data structures. Its concurrency model has been running in production for over six years in its predecessor gateway_cache.

Internal Components

Three key implementation choices drive performance:

  • OLhashTable (open-addressing with linear probing) gives cache-friendly lookups with no pointer chasing.
  • Per-entry mutex + condition variable — different keys never block each other; only threads competing for the same key contend.
  • Lock-free statistics via std::atomic<size_t> counters — cache hits never acquire the global mutex, maximizing throughput on the fast path.
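The third point can be illustrated with a minimal sketch of lock-free hit and miss counters. The `CacheStats` type and its member names are hypothetical and are not taken from cpp-cache; the sketch only assumes that statistics are plain `std::atomic<std::size_t>` counters, as described above.

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical statistics block; the names are illustrative, not cpp-cache's API.
struct CacheStats {
    std::atomic<std::size_t> hits{0};
    std::atomic<std::size_t> misses{0};

    // Relaxed ordering is enough here: the counters are independent tallies
    // and do not publish any other data to other threads.
    void record_hit() noexcept { hits.fetch_add(1, std::memory_order_relaxed); }
    void record_miss() noexcept { misses.fetch_add(1, std::memory_order_relaxed); }

    // Approximate under concurrent updates, since the two loads are not
    // taken as one atomic snapshot.
    double hit_ratio() const noexcept {
        const std::size_t h = hits.load(std::memory_order_relaxed);
        const std::size_t m = misses.load(std::memory_order_relaxed);
        const std::size_t total = h + m;
        return total == 0 ? 0.0 : static_cast<double>(h) / static_cast<double>(total);
    }
};
```

Because each update is a single atomic increment, recording a hit takes no mutex, which keeps the fast path free of lock contention.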

Entry State Machine

Each cache entry transitions through a well-defined state machine that is the heart of thundering herd prevention. The COMPUTING state is the shield: while a solver runs, every other thread requesting the same key finds COMPUTING and waits on the per-entry condition variable rather than launching a duplicate computation. When the solver finishes, all waiters are notified atomically and share the result.
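The following is a minimal sketch of this single-flight behaviour, not the cpp-cache implementation. `SingleFlightCache`, `Entry`, and the `READY` state name are hypothetical. For brevity it uses `std::unordered_map` behind one table mutex instead of OLhashTable, takes that mutex even on hits, and omits TTL, LRU eviction, and solver error handling.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>

// COMPUTING comes from the text above; READY is an assumed name.
enum class EntryState { COMPUTING, READY };

struct Entry {
    std::mutex m;                 // per-entry lock: other keys are unaffected
    std::condition_variable cv;   // waiters for this key sleep here
    EntryState state = EntryState::COMPUTING;
    std::string value;
};

class SingleFlightCache {
public:
    using Solver = std::function<std::string(const std::string&)>;

    // Returns the value for key, running solver at most once per key even
    // when many threads request the same key at the same time.
    std::string get(const std::string& key, const Solver& solver) {
        assert(solver);
        std::shared_ptr<Entry> entry;
        bool is_owner = false;
        {
            // Short critical section: only find or create the entry.
            std::lock_guard<std::mutex> guard(table_mutex_);
            auto it = table_.find(key);
            if (it == table_.end()) {
                entry = std::make_shared<Entry>();
                table_.emplace(key, entry);
                is_owner = true;  // this thread will run the solver
            } else {
                entry = it->second;
            }
        }

        if (is_owner) {
            std::string result = solver(key);  // expensive call, no lock held
            {
                std::lock_guard<std::mutex> guard(entry->m);
                entry->value = std::move(result);
                entry->state = EntryState::READY;
            }
            entry->cv.notify_all();  // wake every thread waiting on this key
            return entry->value;
        }

        // Non-owners block only on this entry until the owner publishes.
        std::unique_lock<std::mutex> lock(entry->m);
        entry->cv.wait(lock, [&] { return entry->state == EntryState::READY; });
        return entry->value;
    }

private:
    std::mutex table_mutex_;
    std::unordered_map<std::string, std::shared_ptr<Entry>> table_;
};
```

A production version would also need to handle a solver that throws, for example by moving the entry to a failed state and waking the waiters, so that they do not block forever.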


The Go Production Implementation: gateway_cache and gw_cache

The concurrency protocol described above did not originate in C++. It was first written in Go and proven in production, where it has run for more than six years at 100K RPS for a sports data platform. Two Go repositories carry this lineage:

  • gateway_cache — the original production library.
  • gw_cache — a standalone variant with a cleaner package boundary.

Choosing Between Go and C++

Both implementations run the same verified concurrency protocol. In practice: start with Go if your service is already Go-based — the integration is trivial and the production track record is long. Use C++ when you need sub-millisecond p99 guarantees, are embedding the cache in a native binary, or are processing payloads where GC pauses are unacceptable.


Layer 2 — Hash Routing

Routing by key hash transforms a fleet of independent caches into a logically partitioned, collectively large cache. Without it, each worker caches the full key space independently — memory is wasted and backend load multiplies with fleet size.

With 150 virtual nodes per physical node, load imbalance stays under 5% for typical key distributions.
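A minimal sketch of a consistent hashing ring with virtual nodes follows. The `HashRing` class is hypothetical. To stay dependency-free it replaces xxHash with FNV-1a plus the MurmurHash3 64-bit finalizer, so the hash function here is an assumption and not the one used in production.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// FNV-1a followed by the MurmurHash3 64-bit finalizer. The finalizer spreads
// inputs that differ only in their last characters across the whole ring.
inline std::uint64_t ring_hash(const std::string& s) {
    std::uint64_t h = 14695981039346656037ULL;
    for (unsigned char c : s) {
        h ^= c;
        h *= 1099511628211ULL;
    }
    h ^= h >> 33;
    h *= 0xff51afd7ed558ccdULL;
    h ^= h >> 33;
    h *= 0xc4ceb9fe1a85ec53ULL;
    h ^= h >> 33;
    return h;
}

class HashRing {
public:
    explicit HashRing(int vnodes = 150) : vnodes_(vnodes) { assert(vnodes > 0); }

    // Places vnodes_ points on the ring for this physical node.
    void add_node(const std::string& node) {
        for (int i = 0; i < vnodes_; ++i)
            ring_[ring_hash(node + "#" + std::to_string(i))] = node;
    }

    // Removes only the points that belong to this node.
    void remove_node(const std::string& node) {
        for (int i = 0; i < vnodes_; ++i) {
            auto it = ring_.find(ring_hash(node + "#" + std::to_string(i)));
            if (it != ring_.end() && it->second == node) ring_.erase(it);
        }
    }

    // Owner of a key: the first virtual node at or after the key's hash,
    // wrapping to the start of the ring when the end is reached.
    const std::string& node_for(const std::string& key) const {
        assert(!ring_.empty());
        auto it = ring_.lower_bound(ring_hash(key));
        if (it == ring_.end()) it = ring_.begin();
        return it->second;
    }

private:
    int vnodes_;
    std::map<std::uint64_t, std::string> ring_;
};
```

The property that matters for cache affinity is that adding a worker only moves keys onto the new worker; keys that stay on existing workers keep their warm cache entries.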

Effective Capacity Multiplication

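Without routing, any pod may be asked for any key, so all N pods converge on caching the same hot set: N pods of capacity C provide an effective cache capacity of only about C. With hash routing, each pod holds a disjoint slice of the key space, so capacity scales to N × C. As an illustration, 20 pods holding 1 million entries each would cache roughly 1 million distinct keys without routing and up to 20 million with it.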


Layer 3 — The Predictive Scaler

Unlike standard Kubernetes setups that rely on the Horizontal Pod Autoscaler (HPA) based on CPU or memory — which often reacts too slowly to rapid traffic spikes — we use a custom-built Predictive Scaler.

How it Works: PID-based Scaling

Our scaler operates as a closed-loop, PID-like controller that sizes the fleet as a function of real-time requests per second (RPS). Its efficiency stems from two key factors:

  1. Deterministic Capacity Knowledge: We have empirically measured exactly how many RPS a single pod can sustain without degrading latency.
  2. Anticipatory Logic: By analyzing historical trends and the current velocity of traffic growth, the scaler “forecasts” the required number of replicas and provisions them before the traffic hits the peak.

This prevents the “cold-start” latency spikes typical of reactive scaling and ensures the fleet is always correctly sized for the incoming load.
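The two factors above can be sketched as a single sizing function. This is an illustrative simplification and not the production controller: it keeps only a proportional term (current RPS) and a derivative-style term (growth rate) and leaves out historical trend analysis. `ScalerConfig`, `desired_replicas`, and every parameter are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// All fields are assumptions made for this sketch.
struct ScalerConfig {
    double rps_per_pod;   // empirically measured sustainable RPS for one pod
    double lead_time_s;   // how far ahead to project, e.g. pod startup time
    double headroom;      // safety factor applied to the projection
    int min_replicas;
    int max_replicas;
};

// Projects RPS lead_time_s into the future from the current growth rate,
// then converts the projection into a replica count.
inline int desired_replicas(const ScalerConfig& cfg,
                            double current_rps,
                            double previous_rps,
                            double dt_s) {
    assert(cfg.rps_per_pod > 0.0 && dt_s > 0.0);
    const double velocity = (current_rps - previous_rps) / dt_s;  // RPS per second
    // Only anticipate growth; falling traffic is sized on the current rate.
    const double projected = current_rps + std::max(0.0, velocity) * cfg.lead_time_s;
    const int wanted = static_cast<int>(
        std::ceil(projected * cfg.headroom / cfg.rps_per_pod));
    return std::clamp(wanted, cfg.min_replicas, cfg.max_replicas);
}
```

With an assumed 2,000 RPS per pod, a 60-second lead time, and no headroom, steady traffic at 50,000 RPS needs 25 replicas. If traffic rose from 40,000 to 50,000 RPS over the last 10 seconds, the projection is 110,000 RPS and the function asks for 55 replicas before the peak arrives.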


Lessons from Production

  1. Hash routing is the force multiplier — key affinity is more impactful than any cache sizing increase.
  2. Negative TTL + circuit breaker — essential pair; one prevents missing-key floods, the other bounds backend degradation.
  3. Predictive scaling over reactive HPA — scaling based on RPS and trends (PID-like) is far superior to scaling on CPU for high-throughput traffic.
  4. Failover is the weakest link — Kubernetes reschedules pods without considering cache state; the cold-cache spike is the highest operational risk.
  5. Warmup period matters — ramp traffic linearly to new pods over 2–5 minutes.
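The linear warmup in lesson 5 can be sketched as a weight that grows with pod age. Both functions are hypothetical. Mapping the weight onto the number of active virtual nodes is one possible way to apply it under hash routing, and is an assumption rather than a description of the production setup.

```cpp
#include <algorithm>
#include <cassert>

// Fraction of its full traffic share that a pod should receive, rising
// linearly from 0 to 1 over ramp_s seconds after the pod becomes ready.
inline double warmup_weight(double pod_age_s, double ramp_s) {
    assert(ramp_s > 0.0);
    return std::clamp(pod_age_s / ramp_s, 0.0, 1.0);
}

// Assumed application of the weight: register only part of the pod's
// virtual nodes on the ring while it warms up, with at least one so the
// pod starts filling its cache immediately.
inline int active_vnodes(double pod_age_s, double ramp_s, int full_vnodes = 150) {
    const int n = static_cast<int>(warmup_weight(pod_age_s, ramp_s) * full_vnodes);
    return std::max(1, n);
}
```

With a 3-minute ramp (180 seconds), a pod that has been ready for 90 seconds gets a weight of 0.5, or 75 of its 150 virtual nodes.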

Summary

| Layer | Technology | Key contribution |
| --- | --- | --- |
| L1 cache | cpp-cache (C++20) | Single-flight, LRU, ~50 µs hit latency |
| Routing | Consistent hashing (xxHash, 150 vnodes) | Key affinity, N× effective capacity |
| L2 cache | ElastiCache Redis | Shared backing store, ~500 µs hit |
| Origin | Aurora + microservices | Receives ~0.25% of total traffic at 100K RPS |
| Scaler | Custom PID Controller | RPS-based, predictive, trend-aware scaling |

The combination sustains 100K RPS with p95 under 5 ms and over 95% hit rate, sending less than 1% of traffic to the origin.
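As a rough sanity check on these figures, the arithmetic below assumes the in-process miss rate sits exactly at the 5% bound; the real value is lower, since the hit rate is above 95%.

```latex
\begin{align*}
\text{L1 misses} &\le 100{,}000 \times (1 - 0.95) = 5{,}000 \ \text{RPS} \\
\text{origin load} &\approx 100{,}000 \times 0.0025 = 250 \ \text{RPS}
\end{align*}
```

On these numbers, at most about 5,000 RPS leave the in-process cache and roughly 250 RPS reach the origin. The difference is presumably absorbed by the shared Redis tier.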


At Simyl Research, we design and build systems like this from the ground up — from high-performance architecture to Kubernetes operations. If your backend is struggling under load, reach out.


Tags: C++20 caching high-performance distributed-systems hash-routing kubernetes aws backend-engineering