Caches sit in front of databases and services to cut latency and load. In 2026, systems are more distributed, latency budgets are tighter, and users expect sub-100 ms response times. A well-designed distributed cache is often the difference between a snappy API and a sluggish one. This guide covers topology, eviction, consistency, and observability—and how caching fits with rate limiting and real-time collaboration at scale.
Goals
Before choosing a topology or eviction policy, be clear on what you need:
Low latency — Sub-millisecond or single-digit milliseconds for hot data. That usually means in-memory storage, minimal serialisation, and network proximity (same region or edge).
High throughput — Millions of reads per second; writes as needed. Caches are read-heavy; design for the read path first. Use connection pooling, pipelining, or request multiplexing depending on your client and workload.
Availability — Survive node and zone failures. No single point of failure. Replication (primary + replicas) or clustering with automatic failover is standard in 2026.
Consistency — A clear policy for when the cache is updated or invalidated. Cache-aside (app-managed) and write-through (cache and DB updated together) are the most common. Staleness is acceptable for many use cases; document the trade-off and set TTLs or invalidation rules accordingly.
If you are building APIs that need to stay fast under load, see rate limiter design and my services for API and backend work.
Topology Choices (2026)
Single-node cache — One Redis (or similar) instance. Simple, very fast, but no redundancy. Good for development or non-critical data. In production, you usually want at least replication.
Replicated (primary + replicas) — Writes go to the primary; reads can hit replicas. Failover (automatic or manual) promotes a replica to primary if the primary dies. Common in managed Redis (AWS ElastiCache, Google Memorystore, Redis Cloud). Use when you need HA and can fit your working set in one node’s memory.
Clustered / sharded — Data is partitioned by key (e.g. hash slot). Each node owns a subset of keys. Linear scale-out: add nodes to hold more data and serve more traffic. Use when one node cannot hold all data or when you need more throughput than a single primary-replica pair can provide.
Multi-region — Replicas or clusters per region; traffic stays local. Cross-region replication adds complexity: eventual consistency, conflict handling, and higher latency for cross-region writes. In 2026, many teams use managed offerings (e.g. Redis Enterprise, AWS Global Datastore) that handle replication and failover. Use when you have users in multiple regions and want low read latency everywhere.
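To make the sharded option concrete: Redis Cluster itself partitions by fixed hash slots (CRC16 mod 16384), but many client-side sharding setups use a consistent-hash ring instead, so adding or removing a node only remaps a fraction of keys. A minimal sketch, with hypothetical node names:

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring: each node owns slices of the key
    space; adding or removing a node only remaps a fraction of keys."""

    def __init__(self, nodes, vnodes=64):
        # vnodes: virtual nodes per physical node, for smoother balance
        self._ring = []  # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):
                bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    @staticmethod
    def _hash(key):
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def node_for(self, key):
        # First ring entry clockwise from the key's hash owns the key.
        idx = bisect.bisect(self._ring, (self._hash(key), "")) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["cache-a", "cache-b", "cache-c"])
owner = ring.node_for("user:42")  # deterministic: same key, same node
```

The same idea underlies most client-side sharding libraries; managed clustered offerings handle slot assignment and rebalancing for you.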
For more on scaling stateful systems, see real-time collaboration (pub/sub and state sync) and rate limiter design (using Redis for rate-limit state).
Eviction and Capacity
When the cache is full, you must decide what to evict. This affects hit rate and latency.
Memory limit — Set a maxmemory cap. When the cache reaches it, apply an eviction policy. Do not run without a limit; otherwise one key or a bug can exhaust memory and crash the node.
LRU (Least Recently Used) — Evict the key that was accessed longest ago. Good for workloads where “recently used” correlates with “likely to be used again.” Redis and many caches support LRU. Tune the sample size (how many keys to consider) for a balance between accuracy and CPU.
LFU (Least Frequently Used) — Evict the key that was accessed least often. Good for workloads with stable hot set (e.g. popular items). Redis 4+ supports LFU. Use when access frequency is a better predictor than recency.
TTL — Expire keys by time. Use TTL for all keys or for a subset (e.g. session data). Prevents unbounded growth and stale data. Combine TTL with LRU/LFU so that expiry handles staleness and eviction handles memory pressure; note that Redis's volatile-* policies evict only among keys that have a TTL set.
Noeviction — Do not evict; reject writes when full. Use only when you must never lose data and have sized the cache to fit the working set. Rare in general-purpose caches.
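In Redis terms, the cap and policies above map to a few configuration directives. The values here are illustrative, not recommendations:

```
# Cap memory; without a limit, one bug can OOM the node
maxmemory 4gb
# allkeys-lru: evict any key; volatile-lru: only keys with a TTL set
maxmemory-policy allkeys-lru
# Larger sample approximates true LRU more closely, at some CPU cost
maxmemory-samples 10
```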
In 2026, LRU or LFU plus TTL is the default for most apps. Choose based on your access pattern; measure hit rate and latency after deployment.
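The combined behaviour (bounded size, LRU eviction, lazy TTL expiry on read) can be sketched in a few lines; this is an in-process illustration of the semantics, not a replacement for a real cache server:

```python
import time
from collections import OrderedDict

class LRUTTLCache:
    """Bounded in-memory cache: evicts the least-recently-used entry
    when full, and treats entries older than ttl_seconds as expired."""

    def __init__(self, maxsize=1024, ttl_seconds=60.0):
        self.maxsize = maxsize
        self.ttl = ttl_seconds
        self._data = OrderedDict()  # key -> (value, stored_at)

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, stored_at = item
        if time.monotonic() - stored_at > self.ttl:
            del self._data[key]      # lazy expiry on read
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return value

    def set(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = (value, time.monotonic())
        while len(self._data) > self.maxsize:
            self._data.popitem(last=False)  # evict the LRU entry
```

Real servers refine this with sampled (approximate) LRU and background expiry sweeps, but the observable behaviour is the same.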
Consistency and Invalidation
The cache is a copy of data that lives elsewhere (usually a database). You need a clear policy for when to update or invalidate the cache.
Cache-aside — The application owns the cache. On read: check cache; on miss, load from DB, write to cache, return. On write: update DB, then invalidate (delete) or update the cache entry. Simple and flexible; staleness depends on how quickly you invalidate after writes. This is the most common pattern in 2026. Use short TTLs as a safety net so that even if invalidation fails, data eventually refreshes.
Write-through — On write, the application writes to both cache and DB (or the cache layer writes to DB). The cache always has the latest value. Higher write latency and load on the DB; use when reads dominate and you need strong consistency for a subset of keys.
Write-behind (write-back) — On write, the application writes to the cache only; the cache asynchronously flushes to the DB. Lowest write latency, but risk of data loss if the cache dies before flush. Use only when you can tolerate loss and have a clear recovery story (e.g. replay from a log).
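The write-behind trade-off is easiest to see in code: writes land in memory immediately, and anything still in the buffer when the process dies is lost. A minimal sketch, with a mutable mapping standing in for the DB and a synchronous flush standing in for the background flusher:

```python
class WriteBehindCache:
    """Write-behind sketch: writes hit memory first and reach the
    backing store only on flush, which is the data-loss window."""

    def __init__(self, backing_store):
        self.store = backing_store  # stand-in for the DB
        self._cache = {}
        self._dirty = set()         # keys written but not yet flushed

    def get(self, key):
        if key in self._cache:
            return self._cache[key]
        return self.store.get(key)

    def set(self, key, value):
        self._cache[key] = value    # fast path: memory only
        self._dirty.add(key)

    def flush(self):
        """Runs periodically in the background in a real system."""
        for key in self._dirty:
            self.store[key] = self._cache[key]
        self._dirty.clear()
```

A production version would flush on a timer or batch size and, as noted above, back the buffer with a durable log so unflushed writes can be replayed.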
Read-through — The cache layer (e.g. a sidecar or proxy) loads from DB on miss and populates the cache. The application only talks to the cache. Simplifies the app but couples you to a cache that can do DB reads. Common in CDNs and some managed cache services.
In 2026, cache-aside with explicit invalidation (or short TTL) is still the default for most applications. Write-through is used where consistency is critical and write volume is moderate. For more on building robust APIs and backends, see TypeScript in 2026 and my services.
Observability and Operations
Metrics — Track hit rate, miss rate, latency (p50, p95, p99), memory usage, evictions per second, and connection count. Without these, you cannot tune capacity or debug slowdowns.
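Hit rate is the one metric you can only get by instrumenting the client side, since the server does not know your intent on a miss. A minimal counting wrapper (in production you would export these counters to your metrics system rather than read them directly):

```python
class InstrumentedCache:
    """In-memory cache that counts hits and misses so a hit-rate
    metric can be derived and exported."""

    def __init__(self):
        self._data = {}
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._data:
            self.hits += 1
            return self._data[key]
        self.misses += 1
        return None

    def set(self, key, value):
        self._data[key] = value

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```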
Alerts — Alert on high miss rate (e.g. cache might be too small or invalidation too aggressive), latency spikes (e.g. network or CPU), OOM risk (memory approaching max), and replica lag (if using replication). Set thresholds based on SLOs.
Security — Use TLS for connections, authentication (e.g. Redis ACLs, IAM for managed services), and network isolation (VPC, private endpoints). Never expose the cache to the public internet without auth and encryption.
Capacity planning — Size the cache for your working set (the set of keys that account for most traffic). Monitor eviction rate; if it is high, consider increasing memory or optimising key design (e.g. smaller values, compression). For more on scaling and system design, see rate limiter design and real-time collaboration.
Common Pitfalls
Cache stampede — When a popular key expires, many requests miss and hit the DB at once. Use a lock (e.g. Redis SETNX) or “request coalescing” so that only one request loads from DB and others wait or get a stale value. Alternatively, use probabilistic early expiration (e.g. expire a bit before TTL for a subset of requests) to spread reloads.
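The request-coalescing idea can be sketched as a per-key "single flight": the first caller to miss becomes the leader and loads from the DB; concurrent callers for the same key block and reuse its result. Error handling is elided for brevity:

```python
import threading

class SingleFlight:
    """Coalesce concurrent loads: only one caller per key runs the
    loader; the others wait for and share its result."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> (done_event, result_box)

    def do(self, key, loader):
        with self._lock:
            entry = self._inflight.get(key)
            leader = entry is None
            if leader:
                entry = (threading.Event(), {})
                self._inflight[key] = entry
        event, box = entry
        if leader:
            try:
                box["value"] = loader()  # only one DB load per key
            finally:
                with self._lock:
                    self._inflight.pop(key, None)
                event.set()
            return box["value"]
        event.wait()  # follower: block until the leader finishes
        return box.get("value")
```

Across multiple app instances the same idea needs a shared lock (e.g. SET NX with a timeout in Redis), since an in-process lock only coalesces requests within one process.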
Stale reads — After a DB write, the cache might still serve an old value until invalidated or TTL expires. Ensure invalidation is correct (right keys, no race conditions). Use short TTLs as a backstop.
Key design — Use consistent, bounded key names. Avoid unbounded growth (e.g. user ID in key is fine; “all items user ever viewed” as one key is not). Consider key prefixes for namespacing and bulk invalidation (e.g. delete by pattern or use a tag-based cache).
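A small helper keeps key construction consistent, and prefixes make namespaced bulk invalidation possible. The sketch below operates on a dict; against Redis you would iterate with SCAN (never KEYS in production) or maintain a tag set per namespace and delete its members:

```python
def cache_key(*parts):
    """Build a consistent, namespaced key, e.g.
    cache_key("user", 42, "profile") gives "user:42:profile"."""
    return ":".join(str(p) for p in parts)

def invalidate_prefix(cache, prefix):
    """Bulk-invalidate one namespace in a dict-based cache."""
    doomed = [k for k in cache if k.startswith(prefix + ":")]
    for k in doomed:
        del cache[k]
```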
Connection pooling — Opening a new connection per request is expensive. Use a connection pool (or a client that pools) and tune pool size based on concurrency and node count.
Summary
A production distributed cache in 2026 is usually replicated or clustered, with LRU/LFU + TTL for eviction and cache-aside (or write-through) for consistency. Use managed services where possible; focus on eviction policy, invalidation, and observability to keep latency and correctness under control. Avoid stampedes, stale reads, and poor key design; plan for failover and multi-region if your users are global. Caching is one piece of a larger backend; combine it with rate limiting, real-time design, and solid API practices. Need help designing or implementing caches or APIs? Get in touch or browse my services.