
System Design: Designing a Rate Limiter for High-Traffic APIs


Rate limiters protect your APIs from abuse, ensure fair usage, and keep systems stable under load. In 2026, with more traffic, stricter SLAs, and higher expectations for API reliability, getting the design right from day one saves costly rewrites later. This guide walks through the core concepts, algorithms, and distributed design choices you need to ship a production-grade rate limiter—and how it fits alongside other system design decisions like designing a distributed cache and real-time collaboration at scale.

Why Rate Limiting?

Rate limiting is not optional for public or partner APIs. Without it, a single misbehaving client, a bug in a mobile app, or a deliberate attack can consume all capacity and degrade service for everyone. The main goals are:

Prevent overload — A few clients (or one buggy integration) cannot take down the service. By capping how many requests each key can make per second or per minute, you keep the system within its operating envelope.

Fair usage — In multi-tenant or B2B APIs, you want to share capacity fairly. Rate limits per API key or per tenant ensure that one customer’s spike does not starve others. Tiered limits (free vs paid) are a common way to monetise and prioritise.

Cost control — Many operations have variable cost: calling an external AI API, running heavy queries, or sending emails. Rate limiting those operations prevents runaway cost and helps you align usage with pricing.

Security — Slowing down requests makes brute-force attacks, credential stuffing, and scraping less economical. It is one layer in a broader security strategy; combine it with auth, validation, and monitoring.

If you are building or scaling APIs and want to go deeper on backend architecture, see distributed cache design and services I offer for API and full-stack work.

Core Algorithms (Still Relevant in 2026)

Choosing the right algorithm affects both accuracy and implementation complexity. The four most common approaches are still the same in 2026; what has evolved is how we run them in distributed systems.

Token bucket — You maintain a bucket of tokens that refill at a fixed rate (e.g. 100 tokens per second). Each request consumes one token; if no tokens are available, the request is rejected or delayed. The bucket has a maximum size, so short bursts are allowed. This is ideal for APIs where occasional bursts are acceptable (e.g. dashboard loads, batch triggers). Implementation is straightforward: track a last-refill timestamp and a current token count; on each request, refill based on elapsed time, then decrement if allowed.
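The refill-then-decrement logic described above can be sketched as a small in-memory class. This is a single-process illustration, not a production implementation; `time.monotonic` is used so clock adjustments cannot skew the refill:

```python
import time

class TokenBucket:
    """Minimal in-memory token bucket: refill on each check, then try to spend."""

    def __init__(self, capacity: float, rate: float):
        self.capacity = capacity            # maximum burst size
        self.rate = rate                    # tokens added per second
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill based on elapsed time, capped at the bucket's capacity.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Because the bucket starts full, a fresh client can burst up to `capacity` requests immediately before being throttled to the steady `rate`.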

Sliding window log — You store the timestamp of every request in the current window. To decide if a new request is allowed, you count how many timestamps fall within the last N seconds (or minutes) and compare to the limit. This is accurate but requires more storage and cleanup. In practice, many teams approximate it with a sliding window counter (e.g. in Redis) that uses weighted averages or two fixed windows to estimate the sliding count without storing every timestamp.
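The two-fixed-windows approximation mentioned above can be sketched as follows. The previous window's count is weighted by how much of it still overlaps the sliding window, so no per-request timestamps are stored. The injectable `clock` parameter is just for testability; this is an in-process illustration of the idea, not a distributed implementation:

```python
import time

class SlidingWindowCounter:
    """Approximate sliding window from the current and previous fixed windows."""

    def __init__(self, limit: int, window_seconds: float, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.index = None     # index of the current fixed window
        self.current = 0      # requests counted in the current fixed window
        self.previous = 0     # requests counted in the previous fixed window

    def allow(self) -> bool:
        now = self.clock()
        index = int(now // self.window)
        if index != self.index:
            # Crossed a boundary: "current" becomes "previous" if the windows
            # are adjacent; otherwise both effectively reset.
            self.previous = self.current if index == (self.index or 0) + 1 else 0
            self.current = 0
            self.index = index
        # Fraction of the previous window still inside the sliding window.
        overlap = 1.0 - (now % self.window) / self.window
        estimated = self.previous * overlap + self.current
        if estimated < self.limit:
            self.current += 1
            return True
        return False
```

The estimate assumes requests were evenly spread across the previous window, which is the accuracy trade-off this approximation makes.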

Fixed window — You count requests per calendar window (e.g. “per minute” or “per hour”). At the start of each window, the counter resets. Implementation is simple (increment + TTL or periodic reset), but at window boundaries you can get double the intended rate (e.g. 100 at end of minute 1 and 100 at start of minute 2). For many use cases that is acceptable; for strict fairness, sliding window or token bucket is better.
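A minimal sketch of the fixed-window counter, again single-process and with an injectable clock purely for testing. Note how the reset at each boundary is what permits the double-rate burst described above:

```python
import time

class FixedWindowCounter:
    """Counts requests per fixed window; the counter resets at each boundary."""

    def __init__(self, limit: int, window_seconds: float, clock=time.time):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.window_index = None
        self.count = 0

    def allow(self) -> bool:
        index = int(self.clock() // self.window)
        if index != self.window_index:
            # New window: reset the counter (this is the boundary weakness).
            self.window_index = index
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False
```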

Leaky bucket — Requests are processed at a constant rate (the “leak”); excess requests wait in a queue or are dropped. This smooths traffic and prevents bursts entirely. It is common in network and payment systems where you want predictable throughput. For REST APIs, token bucket or sliding window is more common because they allow controlled bursts.
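A leaky bucket used as a meter (reject rather than queue) can be sketched like this; the bucket's "level" drains at the constant leak rate, and arrivals are rejected once the level reaches capacity. As with the other sketches, this is illustrative and single-process:

```python
import time

class LeakyBucket:
    """Leaky bucket as a meter: drains at a constant rate, rejects when full."""

    def __init__(self, capacity: float, leak_rate: float, clock=time.monotonic):
        self.capacity = capacity       # maximum queue depth
        self.leak_rate = leak_rate     # requests drained per second
        self.clock = clock
        self.level = 0.0
        self.last_leak = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Drain the bucket for the time elapsed since the last check.
        self.level = max(0.0, self.level - (now - self.last_leak) * self.leak_rate)
        self.last_leak = now
        if self.level + 1 <= self.capacity:
            self.level += 1
            return True
        return False
```

Unlike the token bucket, a freshly created leaky bucket admits at most `capacity` requests and then enforces the constant drain rate with no burst credit accumulating during idle periods beyond an empty bucket.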

In 2026, most production APIs use token bucket or a sliding-window-style limit backed by a fast store. The choice depends on whether you need strict sliding fairness or can accept the simplicity of token bucket.

Distributed Rate Limiting

A single-node rate limiter does not scale when you have many API servers behind a load balancer. Each node would maintain its own counters, so a user could send N requests to each of M nodes and get N×M requests through. You need a shared view of usage, which means a shared store.

Redis — The standard choice. Use INCR with a key per user/IP/API key and a window (e.g. ratelimit:user:123:2026-01-15:minute). Set EXPIRE so keys disappear after the window. For sliding window, use a Lua script that trims old entries and counts recent ones, or use a sorted set with timestamps as scores. Redis is fast, supports atomic operations, and is widely available in managed offerings (ElastiCache, Memorystore, Redis Cloud). Latency is typically sub-millisecond in the same region.
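The INCR-plus-EXPIRE pattern above is only two commands, so it can be written against any client exposing them; with redis-py the same `incr`/`expire` calls work unchanged. The `FakeRedis` stub below is a hypothetical in-memory stand-in for illustration only:

```python
def fixed_window_allow(client, key: str, limit: int, window_seconds: int) -> bool:
    """One round trip per request: INCR the window's key, EXPIRE it on first hit."""
    count = client.incr(key)
    if count == 1:
        # First request in this window: make the key expire with the window.
        client.expire(key, window_seconds)
    return count <= limit


class FakeRedis:
    """Tiny in-memory stand-in for the two commands used above (illustration only)."""
    def __init__(self):
        self.data = {}

    def incr(self, key):
        self.data[key] = self.data.get(key, 0) + 1
        return self.data[key]

    def expire(self, key, seconds):
        pass  # a real client would schedule deletion after `seconds`
```

The key should embed the window (e.g. `ratelimit:user:123:<minute>`) so a new window gets a new counter. Note the small race if the process dies between INCR and EXPIRE, leaving a key with no TTL; wrapping both in a Lua script avoids it.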

Consistent hashing — When you have many keys, you can shard rate-limit state by key (e.g. user ID) so that the same key always hits the same node. That reduces cross-node coordination and can improve cache locality. You still need a backing store (e.g. Redis cluster) that handles the sharding; the application just uses a stable key.

Hybrid (local + central) — To reduce latency and load on the central store, you can cache “allowed” decisions locally (in-process or in a local cache like memcached). For example, if the central store says “user has 80 of 100 requests left,” you might allow the next 10 requests without rechecking, then sync again. This trades a bit of accuracy for lower latency and fewer Redis calls. Use it when you can tolerate a small overshoot (e.g. 5–10% extra requests under burst).
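The hybrid idea can be sketched as leasing quota in batches. `central_take(key, n)` below is a hypothetical stand-in for an atomic decrement against the shared store that returns how many of the requested units were actually granted:

```python
class LocalQuotaLease:
    """Lease quota from a central counter in batches to cut round trips."""

    def __init__(self, central_take, batch_size: int = 10):
        self.central_take = central_take   # callable: (key, n) -> units granted
        self.batch = batch_size
        self.local = {}                    # key -> locally leased, unspent quota

    def allow(self, key: str) -> bool:
        if self.local.get(key, 0) == 0:
            # Local lease exhausted: fetch another batch from the central store.
            self.local[key] = self.central_take(key, self.batch)
        if self.local.get(key, 0) > 0:
            self.local[key] -= 1
            return True
        return False
```

The overshoot risk is bounded by the batch size: at worst, each node holds one unspent batch when the central quota runs out.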

Edge and serverless — In 2026, many APIs run on edge or serverless. Rate limiting at the edge (e.g. in middleware) can use a distributed KV (e.g. Vercel KV, Cloudflare Durable Objects, or a Redis-compatible API) so that limits are enforced close to the user while still being global. See Edge Functions and Next.js for when edge fits your architecture.

Choosing the Limit Key

What you count matters as much as how you count. Common choices:

User ID or API key — The standard for authenticated APIs. Each user or key has its own limit. Prevents one user from monopolising the service and allows tiered limits (e.g. free vs enterprise).

IP address — Useful for unauthenticated endpoints (login, signup, password reset). Be aware of NAT and proxies: many users can share one IP, so limits are often more relaxed. Combine with user-based limits once authenticated.

Endpoint or resource — You might limit expensive operations (e.g. “export” or “bulk delete”) more tightly than read-only ones. Implement with separate limit keys per (user, endpoint) or (user, resource_type).

Composite keys — For multi-tenant SaaS, you might have limits per (tenant_id, user_id) and also a global limit per tenant_id. That way both per-user and per-tenant caps are enforced.

Document your limits in API docs and return clear headers (e.g. X-RateLimit-Limit, X-RateLimit-Remaining, Retry-After) so clients can adapt. For more on building robust APIs, check TypeScript in 2026 for type-safe API design and services for API development.
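Building those headers is a few lines; a sketch, noting that the `X-RateLimit-*` names are a widespread convention rather than a standard (only `Retry-After` is defined by the HTTP spec). Time is passed in explicitly to keep the function pure:

```python
import math

def rate_limit_headers(limit: int, remaining: int, reset_epoch: int,
                       now: float) -> dict:
    """Advisory rate-limit headers for a response."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
    }
    if remaining <= 0:
        # Seconds until the window resets; well-behaved clients wait this long.
        headers["Retry-After"] = str(max(1, math.ceil(reset_epoch - now)))
    return headers
```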

Design Checklist

Before you implement, lock in these decisions:

  1. Limit key — User, API key, IP, or a combination. Ensure it is stable and not easily spoofed where it matters.
  2. Limits — Per second, per minute, per day; different tiers (free, pro, enterprise). Start conservative; you can relax later.
  3. Response — Return 429 Too Many Requests with a Retry-After header and a JSON body that explains the limit and when to retry. Avoid generic error messages so clients and support can debug.
  4. Observability — Emit metrics (reject rate, latency to the rate limiter, cache hit/miss if you use local cache). Alert on sudden spikes in reject rate or latency. Log a sample of rejected requests for analysis.
  5. Testing — Load test with burst and steady traffic. Verify that limits are enforced and that the system does not cause a thundering herd against the central store. Run chaos tests (e.g. Redis down) to see how you degrade (e.g. fail open vs fail closed).

Implementation Sketch (Redis + Token Bucket)

A minimal token-bucket in Redis could work as follows. Key: ratelimit:{key}:{window}. Store a hash or string with: tokens (float), last_refill (timestamp). On each request:

  1. Get current state (tokens, last_refill).
  2. Refill: elapsed = now - last_refill; tokens = min(capacity, tokens + elapsed * rate).
  3. If tokens >= 1, decrement tokens, update last_refill, allow request. Otherwise reject.
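The three steps above, written out in pure Python with a dict standing in for the per-key Redis hash. In Redis this whole body would run inside one Lua script so the read-refill-decrement sequence is atomic; here `now` is injectable for testing:

```python
import time

# Rate-limit state per key, standing in for one Redis hash per key.
_state: dict[str, dict] = {}

def allow(key: str, capacity: float, rate: float, now=None) -> bool:
    """Steps 1-3 above; in Redis, run the whole body as one Lua script."""
    now = time.monotonic() if now is None else now
    # Step 1: get current state (a fresh key starts with a full bucket).
    s = _state.setdefault(key, {"tokens": capacity, "last_refill": now})
    # Step 2: refill based on elapsed time, capped at capacity.
    elapsed = now - s["last_refill"]
    s["tokens"] = min(capacity, s["tokens"] + elapsed * rate)
    s["last_refill"] = now
    # Step 3: spend a token if one is available.
    if s["tokens"] >= 1:
        s["tokens"] -= 1
        return True
    return False
```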

Use a Lua script so the refill and decrement are atomic. Set TTL on the key to the window length so keys do not leak. This gives you a single-node view per key; with Redis Cluster or a proxy, the same key always routes to the same shard, so consistency is preserved.

For sliding window, a Lua script that uses a sorted set (member = request id or timestamp, score = timestamp) and ZREMRANGEBYSCORE to drop old entries, then ZCARD to count, is a common pattern. Run it atomically so you do not overshoot under concurrency.
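The sorted-set pattern maps onto a few lines of in-process Python; each Redis command it mirrors is noted in a comment. In production the equivalent runs as one Lua script against a real ZSET so trim-count-add is atomic:

```python
import bisect

class SlidingWindowLog:
    """Sliding window log mirroring the Redis sorted-set pattern."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.timestamps = []   # kept sorted, like a ZSET scored by timestamp

    def allow(self, now: float) -> bool:
        # ZREMRANGEBYSCORE: drop everything older than the window.
        cutoff = now - self.window
        self.timestamps = self.timestamps[
            bisect.bisect_right(self.timestamps, cutoff):
        ]
        # ZCARD: count what remains, then ZADD the new request if allowed.
        if len(self.timestamps) < self.limit:
            bisect.insort(self.timestamps, now)
            return True
        return False
```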

Common Pitfalls

Clock skew — If you use “current time” for refill or window boundaries, ensure servers and Redis use NTP and are roughly in sync. Large skew can cause double-counting or under-counting at boundaries. Prefer monotonic clocks or server-side timestamps from the store where possible.

Thundering herd — When a popular key hits the limit, many requests might simultaneously try to refill or check. Use atomic operations (Lua scripts, INCR with conditional logic) so you do not overshoot. Consider a small random backoff or jitter before retry so clients do not all retry at once.
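Client-side, the jitter mentioned above is commonly implemented as "full jitter" exponential backoff: pick a uniformly random delay up to an exponentially growing, capped ceiling. A sketch:

```python
import random

def backoff_with_jitter(attempt: int, base: float = 0.5,
                        cap: float = 30.0) -> float:
    """Full-jitter delay: uniform in [0, min(cap, base * 2^attempt)].

    Spreading retries over the whole interval stops clients that were
    rejected together from all retrying at the same instant.
    """
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```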

Fail open vs fail closed — If Redis or the rate-limit service is down, do you allow all requests (fail open) or reject all (fail closed)? Fail open improves availability but removes protection; fail closed is safer but can cause outages. Many teams fail open with a circuit breaker and alert so they can fix the store quickly. Document the choice and test it.
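Whichever policy you choose, make it an explicit code path rather than an accident of exception handling. A minimal wrapper, where `limiter_call` and `on_error` are hypothetical hooks for the actual check and your alerting:

```python
def check_limit(limiter_call, fail_open: bool = True, on_error=None) -> bool:
    """Wrap the rate-limit check so a store outage has a defined outcome."""
    try:
        return limiter_call()
    except Exception as exc:         # e.g. Redis timeout or connection error
        if on_error:
            on_error(exc)            # alert/metric hook: the limiter is blind
        return fail_open             # True = allow traffic, False = reject
```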

Stale local cache — If you cache “allowed” decisions locally, set a short TTL or sync when the user approaches the limit. Otherwise, after a burst, the local cache might think the user still has quota while the central store has already limited them, leading to inconsistent 429s across nodes.

Scaling and Multi-Region

In 2026, many APIs run in multiple regions. Rate limiting must work across regions so that a user sending traffic to both the EU and the US cannot double their effective limit. Options:

Global Redis — Use a globally replicated or multi-region Redis (e.g. Redis Enterprise, or a managed offering with cross-region replication). Writes go to a primary or are replicated; reads can be local. Latency and consistency depend on the product; ensure you understand the replication model.

Per-region limits — Allow N requests per region per user. The user gets N in the EU and N in the US. Simpler to implement (each region has its own Redis), but total usage can be N × regions. Acceptable when N is high and abuse is low.

Central coordinator — All regions check a single global store (e.g. in one primary region). Higher latency for distant regions; use only when strict global limits are required and you can afford the latency.

Hybrid — Local rate limit for fast path (e.g. 90% of traffic stays under limit); periodic sync or a central store for the long tail. Reduces cross-region calls while keeping global caps.

Choose based on your latency, consistency, and cost requirements. For most APIs, per-region limits or a global Redis with careful tuning is enough. For more on scaling backend systems, see distributed cache design and real-time collaboration.

Summary

A production rate limiter in 2026 is typically token bucket or sliding window, backed by Redis (or compatible store) for distribution, with optional local caching for hot paths. Choose the limit key (user, IP, endpoint) and limits (per second/minute/day, per tier) up front, and invest in observability and testing so you can tune and operate it in production. Avoid clock skew, thundering herd, and unclear fail-open behaviour; plan for multi-region if your API is global. Rate limiting is one piece of a larger system design; combine it with caching, real-time design, and solid API practices for a robust backend. If you need help designing or implementing rate limiters or APIs, get in touch or browse my services.