Rate limiting protects services from abuse and from themselves. The algorithm you pick affects bursting behavior, memory cost per client, and fairness — and the wrong choice causes either lost legitimate traffic or service overload.
Fixed window — simple, broken
Count requests in the current minute; reject when count > limit; reset every minute. Easy to implement but lets a client send 2× the limit by timing requests around the boundary (e.g., 100 at 12:59:59, 100 at 13:00:00). Use only for non-critical limits.
Sliding window log — accurate, expensive
Store every request timestamp; on each new request, count timestamps in the trailing window. Perfectly accurate but O(N) memory per client. Fine for low-RPS APIs; impossible at scale.
Token bucket — the production default
Each client has a bucket with capacity C and refill rate R/sec. Request consumes 1 token; reject if bucket empty. Allows controlled bursting up to C, then enforces sustained rate R. 2 numbers per client. This is what most production rate limiters use.
Token bucket implementation
import time
class TokenBucket:
def __init__(self, capacity, refill_per_sec):
self.cap, self.rate = capacity, refill_per_sec
self.tokens = capacity; self.last = time.monotonic()
def take(self, n=1):
now = time.monotonic()
self.tokens = min(self.cap, self.tokens + (now - self.last) * self.rate)
self.last = now
if self.tokens >= n:
self.tokens -= n; return True
return FalseDistributed rate limiting
Single-instance buckets are easy. Distributed across 100 API servers, you have options: local + sloppy (each server tracks its share, total may overshoot by 10%) — usually fine. Redis-backed (every request hits Redis) — accurate but adds 1-2 ms. Sliding window counter in Redis (Cloudflare's approach) — best of both. Pick based on whether 10% overshoot is acceptable.