100K Requests per Second with cpp-cache, Hash Routing, and a Proven Scaler
Published by Simyl Research S.A.S. — High-Performance Backend Engineering
TL;DR — A high-performance C++20 in-process cache, combined with hash-based routing and horizontal scaling, sustains 100,000 requests per second in Kubernetes with p95 latency under 5 ms and a cache hit ratio above 95%. This post presents the architecture, the concurrency model, and the operational lessons learned.
The Core Problem: Thundering Herd in Caches
Consider any expensive operation — a database query, an external API call, a complex computation. Without coordination, multiple threads requesting the same key simultaneously will each trigger the expensive operation independently.
This is the problem that cpp-cache solves by implementing a Thundering Herd Shield. To understand why this mechanism is critical for system stability beyond simple data storage, we recommend reading our previous post: When a Cache Does More Than Just Cache.
System Architecture
The overall system has three layers working in concert:
flowchart TD
Client["🌐 Clients\n(browsers, mobile, services)"]
LB["⚖️ Load Balancer / API Gateway\n(AWS ALB · NGINX · Envoy)"]
Router["🔀 Hash Router\n(consistent hashing ring)"]
subgraph GW ["Gateway Layer — N worker processes / pods"]
GW1["Worker 1\ncpp-cache shard A"]
GW2["Worker 2\ncpp-cache shard B"]
GWN["Worker N\ncpp-cache shard Z"]
end
Backend["🗄️ Origin Backend\n(database · microservice · external API)"]
Client -->|HTTP/gRPC| LB
LB -->|any routing| Router
Router -->|key hash → shard| GW1
Router -->|key hash → shard| GW2
Router -->|key hash → shard| GWN
GW1 -->|cache miss only| Backend
GW2 -->|cache miss only| Backend
GWN -->|cache miss only| Backend
The load balancer distributes connections; the hash router ensures a given key always lands on the same worker process and therefore the same cpp-cache instance. This routing affinity converts random cache misses into a near-deterministic hit rate bounded only by LRU eviction.
Layer 1 — cpp-cache: Architecture and Concurrency Model
cpp-cache is a header-only C++20 in-process cache built on Aleph-w data structures. Its concurrency model has been running in production for over six years in its predecessor gateway_cache.
Internal Components
Three key implementation choices drive performance:
- OLhashTable (open addressing with linear probing) gives cache-friendly lookups with no pointer chasing.
- Per-entry mutex + condition variable — different keys never block each other; only threads competing for the same key contend.
- Lock-free statistics via std::atomic<size_t> counters — cache hits never acquire the global mutex, maximizing throughput on the fast path.
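To make the first choice concrete, here is a minimal sketch of open addressing with linear probing. It is not Aleph-w's OLhashTable: the class name `ProbingTable`, the fixed power-of-two capacity, and the string-to-int entry type are assumptions made for illustration, and deletion and resizing are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Illustrative fixed-capacity table (hypothetical; not Aleph-w's OLhashTable).
// All slots live in one contiguous array, so a probe walks adjacent memory
// instead of following per-bucket node pointers.
class ProbingTable {
public:
    explicit ProbingTable(std::size_t capacity) : slots_(capacity) {
        // A power-of-two capacity lets the probe wrap with a mask.
        assert(capacity > 0 && (capacity & (capacity - 1)) == 0);
    }

    // Inserts or overwrites; returns false when every slot is taken.
    bool put(const std::string& key, int value) {
        const std::size_t mask = slots_.size() - 1;
        std::size_t i = std::hash<std::string>{}(key) & mask;
        for (std::size_t n = 0; n < slots_.size(); ++n, i = (i + 1) & mask) {
            if (!slots_[i] || slots_[i]->first == key) {
                slots_[i].emplace(key, value);
                return true;
            }
        }
        return false;
    }

    std::optional<int> get(const std::string& key) const {
        const std::size_t mask = slots_.size() - 1;
        std::size_t i = std::hash<std::string>{}(key) & mask;
        for (std::size_t n = 0; n < slots_.size(); ++n, i = (i + 1) & mask) {
            if (!slots_[i]) return std::nullopt;   // an empty slot ends the probe
            if (slots_[i]->first == key) return slots_[i]->second;
        }
        return std::nullopt;                       // table full, key absent
    }

private:
    std::vector<std::optional<std::pair<std::string, int>>> slots_;
};
```

Because entries are never removed in this sketch, an empty slot is a reliable signal that the key is absent; a table that supports deletion needs tombstones or backward shifting to preserve that property.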
Entry State Machine
Each cache entry transitions through a well-defined state machine that is the heart of thundering herd prevention. The COMPUTING state is the shield: while a solver runs, every other thread requesting the same key finds COMPUTING and waits on the per-entry condition variable rather than launching a duplicate computation. When the solver finishes, all waiters are notified atomically and share the result.
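The shield described above can be sketched as follows. This is a simplified illustration, not cpp-cache's actual code: `SingleFlightCache`, `Entry`, and `Solver` are hypothetical names, the `READY` and `FAILED` states are assumed (the text names only `COMPUTING`), and TTL and LRU eviction are omitted. Unlike the library as described, this sketch takes a short table-wide lock on every lookup to keep the example small.

```cpp
#include <atomic>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <memory>
#include <mutex>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>

// Only COMPUTING is named in the text; READY and FAILED are assumed here.
enum class State { COMPUTING, READY, FAILED };

struct Entry {
    std::mutex m;                 // per-entry lock: other keys never contend on it
    std::condition_variable cv;   // threads waiting for this key sleep here
    State state = State::COMPUTING;
    std::string value;
};

class SingleFlightCache {
public:
    using Solver = std::function<std::string(const std::string&)>;

    std::string get(const std::string& key, const Solver& solver) {
        std::shared_ptr<Entry> e;
        bool owner = false;
        {
            // Simplification: one short-lived table lock guards the lookup.
            std::lock_guard<std::mutex> g(table_mutex_);
            auto it = table_.find(key);
            if (it == table_.end()) {
                it = table_.emplace(key, std::make_shared<Entry>()).first;
                owner = true;     // this thread inserted COMPUTING, so it solves
            }
            e = it->second;
        }
        if (owner) {
            try {
                std::string v = solver(key);   // runs outside every lock
                publish(*e, State::READY, v);
                misses_.fetch_add(1, std::memory_order_relaxed);
                return v;
            } catch (...) {
                {
                    std::lock_guard<std::mutex> g(table_mutex_);
                    table_.erase(key);         // let a later call retry
                }
                publish(*e, State::FAILED, {});
                throw;
            }
        }
        // Everyone else waits on the entry instead of calling the solver.
        std::unique_lock<std::mutex> lk(e->m);
        e->cv.wait(lk, [&] { return e->state != State::COMPUTING; });
        if (e->state == State::FAILED) throw std::runtime_error("solver failed");
        hits_.fetch_add(1, std::memory_order_relaxed);  // no table lock needed
        return e->value;
    }

    std::size_t hits() const { return hits_.load(std::memory_order_relaxed); }
    std::size_t misses() const { return misses_.load(std::memory_order_relaxed); }

private:
    // Sets the final state under the entry lock, then wakes all waiters.
    static void publish(Entry& e, State s, std::string v) {
        {
            std::lock_guard<std::mutex> g(e.m);
            e.value = std::move(v);
            e.state = s;
        }
        e.cv.notify_all();
    }

    std::mutex table_mutex_;
    std::unordered_map<std::string, std::shared_ptr<Entry>> table_;
    std::atomic<std::size_t> hits_{0};
    std::atomic<std::size_t> misses_{0};
};
```

Holding each entry through a `std::shared_ptr` keeps it alive for threads that are still waiting on it after a failed computation has removed it from the table.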
The Go Production Implementation: gateway_cache and gw_cache
The concurrency protocol described above did not originate in C++. It was first implemented in Go and proven in production, where it has run for more than six years at 100K RPS for a sports data platform. Two Go repositories carry this lineage:
- gateway_cache — the original production library.
- gw_cache — a standalone variant with a cleaner package boundary.
Choosing Between Go and C++
Both implementations run the same verified concurrency protocol. In practice: start with Go if your service is already Go-based — the integration is trivial and the production track record is long. Use C++ when you need sub-millisecond p99 guarantees, are embedding the cache in a native binary, or are processing payloads where GC pauses are unacceptable.
Layer 2 — Hash Routing
Routing by key hash transforms a fleet of independent caches into a logically partitioned, collectively large cache. Without it, each worker caches the full key space independently — memory is wasted and backend load multiplies with fleet size.
With 150 virtual nodes per physical node, load imbalance stays under 5% for typical key distributions.
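A consistent hashing ring with virtual nodes can be sketched as below. The `HashRing` class and its method names are hypothetical, and FNV-1a stands in for xxHash only to keep the example dependency-free; the sketch makes no claim about the balance figure quoted above.

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <map>
#include <string>
#include <string_view>

// FNV-1a (64-bit), used here only as a dependency-free stand-in for xxHash.
inline std::uint64_t fnv1a(std::string_view s) {
    std::uint64_t h = 14695981039346656037ULL;
    for (unsigned char c : s) {
        h ^= c;
        h *= 1099511628211ULL;
    }
    return h;
}

// Hypothetical ring: each physical node owns `vnodes` points on the ring.
class HashRing {
public:
    explicit HashRing(int vnodes = 150) : vnodes_(vnodes) {}

    void add_node(const std::string& node) {
        for (int i = 0; i < vnodes_; ++i)
            ring_.emplace(fnv1a(node + "#" + std::to_string(i)), node);
    }

    void remove_node(const std::string& node) {
        for (auto it = ring_.begin(); it != ring_.end();)
            it = (it->second == node) ? ring_.erase(it) : std::next(it);
    }

    // Owner is the first virtual node clockwise from the key's hash,
    // wrapping around to the start of the ring.
    const std::string& owner(std::string_view key) const {
        assert(!ring_.empty());
        auto it = ring_.lower_bound(fnv1a(key));
        if (it == ring_.end()) it = ring_.begin();
        return it->second;
    }

private:
    int vnodes_;
    std::map<std::uint64_t, std::string> ring_;
};
```

The property that matters for cache affinity is that removing a worker only remaps the keys that worker owned; keys on the surviving workers stay where they are, so their caches remain warm.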
Effective Capacity Multiplication
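Without routing, N pods with capacity C each provide an effective cache capacity of only C, because every pod ends up holding the same hot keys. With hash routing, each pod holds a disjoint slice of the key space, so effective capacity scales to N × C.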
Layer 3 — The Predictive Scaler
Unlike standard Kubernetes setups that rely on the Horizontal Pod Autoscaler (HPA) based on CPU or memory — which often reacts too slowly to rapid traffic spikes — we use a custom-built Predictive Scaler.
How it Works: PID-based Scaling
Our scaler operates as a closed-loop, PID-like controller that scales as a function of real-time requests per second (RPS). Its efficiency stems from two key factors:
- Deterministic Capacity Knowledge: We have empirically measured exactly how many RPS a single pod can sustain without degrading latency.
- Anticipatory Logic: By analyzing historical trends and the current velocity of traffic growth, the scaler “forecasts” the required number of replicas and provisions them before the traffic hits the peak.
This prevents the “cold-start” latency spikes typical of reactive scaling and ensures the fleet is always correctly sized for the incoming load.
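The two factors above can be combined in a small sketch. The post does not describe the controller's gains or forecast model, so everything here is an assumption: `ScalerConfig`, `PredictiveScaler`, the linear extrapolation of traffic velocity over a fixed lead time, and the headroom factor are illustrative stand-ins, closer to a proportional-plus-derivative rule than a full PID loop.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// All fields are hypothetical tuning knobs, not values from the post.
struct ScalerConfig {
    double per_pod_rps;   // measured RPS one pod sustains without degrading
    double lead_time_s;   // how far ahead to forecast (pod start-up time)
    double headroom;      // multiplier for spare capacity, 1.0 = none
    int min_replicas;
    int max_replicas;
};

class PredictiveScaler {
public:
    explicit PredictiveScaler(ScalerConfig cfg) : cfg_(cfg) {
        assert(cfg_.per_pod_rps > 0.0);
        assert(cfg_.min_replicas <= cfg_.max_replicas);
    }

    // Feed one RPS sample taken dt_s seconds after the previous one.
    int desired_replicas(double rps, double dt_s) {
        assert(dt_s > 0.0);
        const double velocity = has_prev_ ? (rps - prev_rps_) / dt_s : 0.0;
        prev_rps_ = rps;
        has_prev_ = true;

        // Anticipate growth only; falling traffic sizes on observed load.
        const double forecast = rps + std::max(0.0, velocity) * cfg_.lead_time_s;
        const double needed = std::ceil(forecast * cfg_.headroom / cfg_.per_pod_rps);

        // Clamp while still in floating point to avoid integer overflow.
        return static_cast<int>(std::clamp(needed,
                                           static_cast<double>(cfg_.min_replicas),
                                           static_cast<double>(cfg_.max_replicas)));
    }

private:
    ScalerConfig cfg_;
    double prev_rps_ = 0.0;
    bool has_prev_ = false;
};
```

A production controller would also smooth noisy samples and rate-limit scale-down; those parts are left out to keep the core calculation visible.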
Lessons from Production
- Hash routing is the force multiplier — key affinity is more impactful than any cache sizing increase.
- Negative TTL + circuit breaker — essential pair; one prevents missing-key floods, the other bounds backend degradation.
- Predictive scaling over reactive HPA — scaling based on RPS and trends (PID-like) is far superior to scaling on CPU for high-throughput traffic.
- Failover is the weakest link — Kubernetes reschedules pods without considering cache state; the cold-cache spike is the highest operational risk.
- Warmup period matters — ramp traffic linearly to new pods over 2–5 minutes.
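The linear warmup in the last point can be expressed as a weight between 0 and 1 that grows with pod age. The function name `warmup_weight` is hypothetical, and how the weight is applied (for example as a load balancer weight or as the number of virtual nodes a pod receives) is left open here.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

// Fraction of its full traffic share a pod should receive, given how long
// it has been running and the chosen ramp duration (2 to 5 minutes above).
inline double warmup_weight(std::chrono::seconds age, std::chrono::seconds ramp) {
    assert(ramp.count() > 0);
    const double fraction = static_cast<double>(age.count()) /
                            static_cast<double>(ramp.count());
    return std::clamp(fraction, 0.0, 1.0);
}
```

With a three-minute ramp, a pod that has been up for 90 seconds gets half of its eventual share, which gives its cache time to fill before it carries full load.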
Summary
| Layer | Technology | Key contribution |
|---|---|---|
| L1 cache | cpp-cache (C++20) | Single-flight, LRU, ~50 µs hit latency |
| Routing | Consistent hashing (xxHash, 150 vnodes) | Key affinity, N× effective capacity |
| L2 cache | ElastiCache Redis | Shared backing store, ~500 µs hit |
| Origin | Aurora + microservices | Receives ~0.25% of total traffic at 100K RPS |
| Scaler | Custom PID Controller | RPS-based, predictive, trend-aware scaling |
The combination sustains 100K RPS with p95 under 5 ms and over 95% hit rate, sending less than 1% of traffic to the origin.
At Simyl Research, we design and build systems like this from the ground up — from high-performance architecture to Kubernetes operations. If your backend is struggling under load, reach out.
Tags: C++20 caching high-performance distributed-systems hash-routing kubernetes aws backend-engineering