A rate limiter is the bouncer at the door of every production API. The real challenge isn't the algorithm - it's making it work across a distributed fleet of servers with sub-5ms latency and graceful failure modes. This walkthrough covers token bucket vs sliding window, Redis-based distributed counting, Lua scripts for atomicity, and the fail-open vs fail-closed decision.
A rate limiter answers one question: should this request be allowed or rejected? The answer must come in under 5ms, be consistent across a distributed fleet, and handle 50K QPS without becoming the bottleneck it's supposed to prevent.
The rate limiter sits in the hot path of every API request, so resource usage scales with total QPS, not with data volume.
Bottom line: the rate limiter is CPU and latency bound, not memory or storage bound.
The rate limiter is not a user-facing API. It's middleware that intercepts every request. But it does need an admin API for rule management.
```
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 73
X-RateLimit-Reset: 1710680400
Retry-After: 12        (only on 429 responses)
```
Two endpoints: `/internal/rate-check` and `/admin/v1/rules/{rule_id}`.

Note: the internal rate-check endpoint returns 200 even for denied requests. The API gateway reads the `allowed` field and decides whether to return 429 to the client. This keeps the rate limiter decoupled from HTTP semantics.

High-Level Architecture
The rate limiter is embedded in the API gateway layer, not a standalone service that every request hops through. This is critical for latency - adding a network hop for every request would blow the 5ms budget.
1. Client sends a request to the API gateway (Nginx/Envoy/Kong)
2. The gateway's rate limit middleware extracts the client_id and endpoint from the request
3. Middleware executes a Lua script on Redis that atomically checks and updates the counter
4. If allowed: the request passes through to the backend service, and rate limit headers are added to the response
5. If denied: the gateway returns 429 immediately with a Retry-After header; the request never reaches the backend

Detailed Component Design
Two components need detailed treatment: the rate limiting algorithm and the Redis Lua script that implements it.
Algorithm: Sliding Window Counter (not Token Bucket)
Token bucket is the textbook answer, but the sliding window counter is what you should actually use. It needs only two small counters per key, smooths out the fixed-window boundary burst, and approximates a true sliding log at a fraction of the memory cost. Here's how it works:
Example: limit is 100 req/min. At 15 seconds into the current minute, previous window had 80 requests, current window has 30.
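Plugging those numbers into the weighted formula, as a quick Python sketch (the function name is mine, for illustration):

```python
# Sliding window counter: weight the previous window's count by the
# fraction of it that still overlaps the sliding 60s window.
def effective_count(prev_count: int, curr_count: int,
                    elapsed_secs: float, window_secs: float) -> int:
    weight = (window_secs - elapsed_secs) / window_secs
    return int(prev_count * weight) + curr_count

# 15s into the current minute: 45s of the previous window still counts.
count = effective_count(prev_count=80, curr_count=30,
                        elapsed_secs=15, window_secs=60)
print(count)        # 80 * 0.75 + 30 = 90
print(count < 100)  # under the 100 req/min limit -> allow
```

So the request is allowed with an effective count of 90 against a limit of 100.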
Why not fixed window: a client can send 100 requests at 11:00:59 and 100 more at 11:01:00, effectively getting 200 req/min. The sliding window eliminates this boundary problem.
The entire rate limit check must be atomic. If you do GET then SET as separate Redis commands, two concurrent requests can both read "99 remaining" and both think they're allowed. Lua scripts execute atomically on a single Redis node.
```lua
-- KEYS[1] = previous window counter, KEYS[2] = current window counter
-- ARGV = { limit, window length (secs), current unix time (secs) }
local key_prev = KEYS[1]
local key_curr = KEYS[2]
local limit  = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local now    = tonumber(ARGV[3])

local prev_count = tonumber(redis.call('GET', key_prev) or '0')
local curr_count = tonumber(redis.call('GET', key_curr) or '0')

-- Weight the previous window by how much of it still overlaps
-- the sliding window, then add the current window's count.
local elapsed = now % window
local weight = (window - elapsed) / window
local effective = math.floor(prev_count * weight) + curr_count

if effective >= limit then
  -- Denied: report zero remaining (never a negative number).
  return {0, 0, window - elapsed}
end

redis.call('INCR', key_curr)
-- TTL of two windows: the key must survive long enough to serve
-- as the "previous" window for the next interval.
redis.call('EXPIRE', key_curr, window * 2)
return {1, limit - effective - 1, window - elapsed}
```
Returns: {allowed (0/1), remaining, seconds until reset}. The gateway reads these three values and populates the response headers.
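Wiring those three return values into the response headers might look like the following hypothetical middleware sketch (the function name is mine; the header names come from the API spec above):

```python
def to_headers(allowed: int, remaining: int, reset_secs: int,
               limit: int, now_epoch: int) -> dict:
    """Translate the script's {allowed, remaining, reset} reply into
    the HTTP headers the gateway attaches to the response."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(now_epoch + reset_secs),
    }
    if not allowed:
        headers["Retry-After"] = str(reset_secs)  # only on 429 responses
    return headers

# A denied request, 12 seconds before the window resets:
h = to_headers(allowed=0, remaining=0, reset_secs=12,
               limit=100, now_epoch=1710680388)
print(h["Retry-After"])  # "12"
```

Note that `X-RateLimit-Reset` is an absolute epoch timestamp while `Retry-After` is a relative number of seconds, matching the header examples above.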
The gateway caches rules under `rules:{endpoint}:{tier}` with a 60-second TTL.
Data Model & Database Design
Two storage layers: PostgreSQL for durable rule configuration, Redis for ephemeral counters.
```sql
CREATE TABLE rate_limit_rules (
    id          SERIAL PRIMARY KEY,
    endpoint    VARCHAR(255) NOT NULL,
    tier        VARCHAR(50) NOT NULL DEFAULT 'default',
    limit_count INT NOT NULL,
    window_secs INT NOT NULL,
    created_at  TIMESTAMPTZ DEFAULT NOW(),
    updated_at  TIMESTAMPTZ DEFAULT NOW(),
    UNIQUE (endpoint, tier)
);

CREATE TABLE api_keys (
    api_key    VARCHAR(64) PRIMARY KEY,
    user_id    BIGINT NOT NULL,
    tier       VARCHAR(50) NOT NULL DEFAULT 'free',
    enabled    BOOLEAN DEFAULT TRUE,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX idx_apikeys_user ON api_keys (user_id);
```
The UNIQUE constraint on (endpoint, tier) ensures one rule per endpoint-tier combination. No ambiguity in rule resolution.
Key pattern: `rl:{client_id}:{endpoint}:{window_start}`

Example: `rl:user_abc:/api/v1/posts:1710680400`. Two keys exist per client-endpoint pair at any time: the current window and the previous window. Once the previous window's TTL expires, Redis evicts it automatically. No cleanup jobs needed.
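Deriving those two keys from the current time is a one-liner of modular arithmetic (sketch; the helper name is mine):

```python
def window_keys(client_id: str, endpoint: str,
                now_epoch: int, window_secs: int = 60):
    """Return (previous, current) counter keys. window_start is the
    epoch second the window began, so keys rotate automatically."""
    curr_start = now_epoch - (now_epoch % window_secs)
    prev_start = curr_start - window_secs
    fmt = "rl:{cid}:{ep}:{start}"
    return (fmt.format(cid=client_id, ep=endpoint, start=prev_start),
            fmt.format(cid=client_id, ep=endpoint, start=curr_start))

# 15 seconds into the window that started at 1710680400:
prev, curr = window_keys("user_abc", "/api/v1/posts", 1710680415)
print(curr)  # rl:user_abc:/api/v1/posts:1710680400
print(prev)  # rl:user_abc:/api/v1/posts:1710680340
```

The gateway passes both keys to the Lua script as `KEYS[1]` and `KEYS[2]`.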
Rules live in Redis as `rules:{endpoint}:{tier}` → `{"limit": 1000, "window_secs": 60}`. No sharding is needed for rules (< 1 MB total). Counter keys are distributed across Redis Cluster masters by hash-slot assignment (CRC16 of the key mod 16384).

Deep Dives
Deep Dive 1: Distributed Consistency - Race Conditions Across Gateway Instances
The problem: 10 API gateway instances all talk to the same Redis cluster. Two requests from the same client hit different gateways simultaneously. Both execute the Lua script. Is the count accurate?
Yes - because Redis executes Lua scripts atomically on a single thread. Both requests hash to the same Redis master (same client_id in the key), and Redis serializes the script executions. No race condition.
But what if the Redis cluster uses multiple masters? Redis Cluster assigns each key to one of 16,384 hash slots (CRC16 of the key), so both requests route to the same master as long as the key is identical. One wrinkle: the Lua script touches two keys (previous and current window), and a multi-key script requires both keys in the same slot, so the client identifier should be wrapped in a hash tag, e.g. `rl:{user_abc}:/api/v1/posts:1710680400`. The only remaining risk is during cluster rebalancing (slot migration), where a brief window of inconsistency can allow 1-2 extra requests through. This is within the 1% accuracy tolerance.
For global per-endpoint limits (e.g., 10K req/sec to /api/search across all users), the counter key doesn't include client_id. All gateways hit the same Redis key on the same master. At 50K QPS to a single key, Redis handles it fine - single-key operations are its sweet spot.
Deep Dive 2: Fail-Open vs Fail-Closed
The problem: Redis goes down, and suddenly no request can be rate-checked. What do you do? For most product APIs, fail open: briefly over-admitting traffic is far less damaging than rejecting every request. Fail closed only where over-admission is expensive or abusable (payments, login attempts). Either way, degrade to a per-instance in-memory limiter rather than no limiting at all.
Detection: use a circuit breaker pattern. After 3 consecutive Redis failures within 1 second, trip the circuit and switch to local mode. After 10 seconds, try one Redis call (half-open). If it succeeds, resume normal mode.
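A minimal version of that breaker state machine, using the thresholds above (a sketch; a production version would also need locking across gateway worker threads):

```python
import time

class RedisCircuitBreaker:
    """CLOSED -> OPEN after 3 consecutive failures within 1s;
    OPEN -> HALF_OPEN after a 10s cooldown; one probe call decides."""

    def __init__(self, failures=3, failure_window=1.0, cooldown=10.0):
        self.failures = failures
        self.failure_window = failure_window
        self.cooldown = cooldown
        self.state = "CLOSED"
        self.fail_times = []   # timestamps of recent failures
        self.opened_at = 0.0

    def allow_redis_call(self, now=None):
        """True if the gateway should try Redis; False -> local mode."""
        now = time.monotonic() if now is None else now
        if self.state == "OPEN" and now - self.opened_at >= self.cooldown:
            self.state = "HALF_OPEN"  # permit one probe call
        return self.state != "OPEN"

    def record_failure(self, now=None):
        now = time.monotonic() if now is None else now
        # Keep only failures inside the 1-second window.
        self.fail_times = [t for t in self.fail_times
                           if now - t <= self.failure_window]
        self.fail_times.append(now)
        if self.state == "HALF_OPEN" or len(self.fail_times) >= self.failures:
            self.state, self.opened_at = "OPEN", now

    def record_success(self):
        self.state, self.fail_times = "CLOSED", []

breaker = RedisCircuitBreaker()
for t in (0.0, 0.2, 0.4):          # three failures within one second
    breaker.record_failure(now=t)
print(breaker.state)               # OPEN -> gateway switches to local mode
```

The gateway calls `allow_redis_call()` before every rate check; a `False` answer means skip Redis and apply the local fallback limiter.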
Deep Dive 3: Hot Key Problem at Scale
The problem: a single API key making millions of requests per second. The Redis key for that client becomes a hot key on one master node, causing latency spikes.
Mitigation: split the hot key. Instead of a single counter `rl:user_abc:/api/search:1710680400`, split it into `rl:user_abc:/api/search:1710680400:shard_0` through `shard_4`. Each gateway picks a shard at random, INCRs it, and the Lua script sums all shards. This spreads load across Redis masters at the cost of slightly less accurate counting.

In practice, combine both defenses: local pre-filtering (a coarse in-memory counter on each gateway) catches obvious abusers before they reach Redis, and key splitting handles the gray zone.
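The shard-splitting scheme in Python terms, simulated against a plain dict standing in for Redis (a sketch; in production the INCR and the summing read both happen inside the Lua script):

```python
import random

NUM_SHARDS = 5  # shard_0 .. shard_4, as above

def shard_key(base_key: str) -> str:
    """Each write increments one randomly chosen shard of the hot key."""
    return "{}:shard_{}".format(base_key, random.randrange(NUM_SHARDS))

def total_count(counters: dict, base_key: str) -> int:
    """Reading the count means summing all shards."""
    return sum(counters.get("{}:shard_{}".format(base_key, i), 0)
               for i in range(NUM_SHARDS))

# Simulate a burst of 1000 increments spread across the shards:
counters = {}
base = "rl:user_abc:/api/search:1710680400"
for _ in range(1000):
    k = shard_key(base)
    counters[k] = counters.get(k, 0) + 1

print(total_count(counters, base))  # 1000 -- no counts lost, load spread
```

The accuracy loss comes from timing, not arithmetic: between the per-shard INCR and the summing read, other gateways may have incremented different shards, so a burst can briefly exceed the limit by a few requests.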