Capping how many requests a client can make to your API per time window — the four algorithms, the HTTP 429 contract from RFC 6585, and the implementation pattern most solo SaaS will ship in 2026.
Research-based overview. This article synthesizes IETF RFCs (notably RFC 6585), public documentation from Cloudflare, Vercel, Upstash, Stripe, and OpenAI, and the prevailing patterns in the Node.js and Next.js ecosystems. How we research.
Every API has a maximum capacity. If a single misbehaving client consumes that entire capacity, every other client suffers. Rate limiting is the contract that says “you can have N requests per minute, then you have to wait.” In a 2026 solo-SaaS context the stakes are higher than they used to be: a buggy script can issue thousands of requests per second, and if your endpoints proxy to OpenAI or Anthropic, a runaway client can run up a four-figure AI bill in an afternoon. Rate limiting is no longer an enterprise nice-to-have; it’s a basic operational seatbelt.
Rate limits exist for four reasons, ordered roughly by how often they bite small SaaS: controlling metered upstream costs, preventing abuse and scraping, keeping one client from starving the others, and protecting your infrastructure from overload.
For most solo founders the trigger is the AI-cost story: someone writes a quick script, points it at your /api/chat endpoint, and watches your OpenAI dashboard light up. Our webhook security best practices piece covers the “authenticate inbound calls” problem; this page covers throttling.
There are four algorithms you will see in the wild. Each has a different shape of allowed traffic and different memory cost. The right choice depends on whether you want to allow bursts, smooth them, or queue them.
Fixed window is the simplest. You divide time into fixed windows (say, one-minute buckets) and count requests in the current window. If the count exceeds the limit, reject. The counter resets at the window boundary. Memory cost is O(1) per client.
Pros: trivially easy to implement, cheap to store, easy to explain. Cons: the boundary problem — a client can fire its full quota at the last second of one window and the full quota at the first second of the next, effectively doubling the limit at the boundary. For limits that need to be tight (auth, payment endpoints), this matters.
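A minimal in-memory sketch of the fixed-window counter (illustrative only; production code would keep the counters in Redis so limits survive restarts and are shared across instances):

```typescript
// In-memory fixed-window counter, keyed by client (user ID, API key, or IP).
type WindowState = { windowStart: number; count: number };

const windows = new Map<string, WindowState>();

function fixedWindowAllow(
  key: string,
  limit: number,
  windowMs: number,
  now: number = Date.now()
): boolean {
  // Snap "now" to the start of the current fixed window
  const start = Math.floor(now / windowMs) * windowMs;
  const state = windows.get(key);
  if (!state || state.windowStart !== start) {
    // First request of a new window: reset the counter
    windows.set(key, { windowStart: start, count: 1 });
    return true;
  }
  if (state.count >= limit) return false; // over quota for this window
  state.count += 1;
  return true;
}
```

Note how the counter resets wholesale at the boundary, which is exactly where the doubling problem described above comes from.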
Sliding window moves the window with the current request instead: you count how many requests this client made in the trailing 60 seconds and reject if the count exceeds the limit. Two common implementations: a precise version that stores every request timestamp (high memory) and a weighted version that combines current and previous fixed-window counts (low memory, near-precise).
Pros: no boundary edge-case; smoother enforcement than fixed window. Cons: more memory and CPU; the precise variant scales poorly at high request volume.
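A sketch of the weighted variant, which blends the previous window’s count with the current one in proportion to how much of the previous window still falls inside the trailing window (in-memory and illustrative; real deployments keep these counters in Redis):

```typescript
// Weighted sliding-window approximation: two fixed-window counters per client.
type Counts = { prevStart: number; prev: number; curr: number };

const counters = new Map<string, Counts>();

function slidingWindowAllow(
  key: string,
  limit: number,
  windowMs: number,
  now: number
): boolean {
  const start = Math.floor(now / windowMs) * windowMs;
  let c = counters.get(key);
  if (!c) {
    c = { prevStart: start, prev: 0, curr: 0 };
    counters.set(key, c);
  }
  if (c.prevStart !== start) {
    // Roll the windows forward; if more than one window elapsed, prev is empty
    c.prev = start - c.prevStart === windowMs ? c.curr : 0;
    c.curr = 0;
    c.prevStart = start;
  }
  // Weight the previous window by the fraction of it still in the trailing window
  const elapsed = (now - start) / windowMs;
  const weighted = c.prev * (1 - elapsed) + c.curr;
  if (weighted >= limit) return false;
  c.curr += 1;
  return true;
}
```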
The token bucket model: each client has a bucket that holds up to N tokens. Tokens refill at a fixed rate (say, 10 per second). Each request consumes one token. If the bucket is empty, the request is rejected. The bucket can fill up to its cap, which means a client who has been idle can burst up to N requests instantly when they return.
Pros: handles bursty traffic gracefully — ideal for B2B APIs where customers naturally batch work. It’s the most popular algorithm for SaaS rate limits because it matches how real applications use APIs (idle, then burst, then idle). Cons: two parameters to tune (bucket capacity and refill rate) instead of one, slightly more code than fixed window.
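The two knobs show up directly in a sketch of the bookkeeping (in-memory and illustrative; the lazy-refill trick below avoids any background timer by refilling on each request):

```typescript
// Token bucket: capacity bounds the burst, refillPerSec sets the sustained rate.
type Bucket = { tokens: number; lastRefill: number };

const buckets = new Map<string, Bucket>();

function tokenBucketAllow(
  key: string,
  capacity: number,
  refillPerSec: number,
  now: number
): boolean {
  let b = buckets.get(key);
  if (!b) {
    // New clients start with a full bucket, so they can burst immediately
    b = { tokens: capacity, lastRefill: now };
    buckets.set(key, b);
  }
  // Lazy refill: credit tokens for the time elapsed since the last request
  const elapsedSec = (now - b.lastRefill) / 1000;
  b.tokens = Math.min(capacity, b.tokens + elapsedSec * refillPerSec);
  b.lastRefill = now;
  if (b.tokens < 1) return false; // bucket empty: reject
  b.tokens -= 1;
  return true;
}
```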
Leaky bucket is a queue model. Requests arrive into a fixed-size queue and are processed (“leak out”) at a constant rate. If the queue is full, new requests are rejected. Unlike token bucket, this smooths bursts rather than allowing them — the consumer of the API sees a steady, throttled rate regardless of arrival pattern.
Pros: protects downstream systems that hate spikes (databases, payment processors). Cons: introduces latency for queued requests; not ideal for interactive APIs where users expect immediate response or rejection.
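A sketch of the admission bookkeeping, using the “leaky bucket as a meter” variant: it tracks queue depth and rejects overflow, but a real implementation would also delay accepted requests until they leak out at the constant rate (in-memory and illustrative):

```typescript
// Leaky bucket as a meter: the queue level drains at a constant leak rate.
type Leaky = { level: number; lastLeak: number };

const queues = new Map<string, Leaky>();

function leakyBucketAccept(
  key: string,
  queueSize: number,
  leakPerSec: number,
  now: number
): boolean {
  let q = queues.get(key);
  if (!q) {
    q = { level: 0, lastLeak: now };
    queues.set(key, q);
  }
  // Drain the queue in proportion to elapsed time
  const elapsedSec = (now - q.lastLeak) / 1000;
  q.level = Math.max(0, q.level - elapsedSec * leakPerSec);
  q.lastLeak = now;
  if (q.level >= queueSize) return false; // queue full: reject
  q.level += 1; // request enters the queue, to be processed at the leak rate
  return true;
}
```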
For most solo SaaS APIs in 2026, token bucket is the default choice. It handles real-world usage patterns, is well-supported by libraries like Upstash Ratelimit, and the API contract (“you have a budget, it refills, you can burst”) maps cleanly to plan-tier-style pricing.
When a request exceeds the limit, the server has to tell the client. The conventions here are codified in RFC 6585, which defines the HTTP 429 Too Many Requests status code, and in widely-followed (though not formally standardized) header conventions used by GitHub, Stripe, Twitter, OpenAI, and most other major APIs.
429 Too Many Requests is the canonical response. RFC 6585 explicitly recommends including a body and headers that explain when the client may try again. Returning a 503 or a generic 500 is wrong; well-behaved API clients treat 429 as a backoff signal but treat 5xx as “the server is broken” and may not back off correctly.
RFC 6585 (and earlier RFCs that defined Retry-After) specifies that the server should send a Retry-After header indicating either a delta-seconds value (e.g. Retry-After: 30) or an HTTP date. This is the most important header in the rate-limit response — it tells the client exactly when retrying makes sense. Honor it on the server, and well-built clients will back off correctly without further work.
Beyond Retry-After, three headers are de-facto standard:
- X-RateLimit-Limit — the maximum number of requests in the window (e.g. 100).
- X-RateLimit-Remaining — the number of requests left in the current window (e.g. 42).
- X-RateLimit-Reset — the time at which the limit resets, usually as a Unix timestamp.

These headers are sent on every response (not just 429s) so clients can self-throttle before they trip the limit. The IETF has a draft — draft-ietf-httpapi-ratelimit-headers — aiming to standardize a RateLimit header without the X- prefix; until it ships, the X-prefixed names remain the safe default.
The 429 body should be machine-readable JSON with a clear error code and message:
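For example (the field names are illustrative; match whatever error schema your API already uses, and keep the headers alongside the body):

```http
HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1767225600
Content-Type: application/json

{
  "error": {
    "code": "rate_limit_exceeded",
    "message": "Rate limit of 100 requests per hour exceeded. Retry after 30 seconds.",
    "retry_after": 30
  }
}
```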
This lets the client log a clear error and lets your dashboard surface a meaningful message instead of “something went wrong.”
Three layers can enforce rate limits, and the right choice depends on what you’re protecting against.
| Layer | Tool examples | Best for |
|---|---|---|
| Edge | Cloudflare, Vercel Firewall, AWS WAF | Coarse-grained DDoS, IP-based abuse, geographic blocks. Stops bad traffic before it touches your origin. |
| API gateway | Kong, AWS API Gateway, Tyk | Per-API-key limits, plan-tier enforcement when you have a real product gateway in front of multiple services. |
| Application | Upstash Ratelimit, express-rate-limit, custom middleware | Per-user, per-tenant, per-endpoint limits where you need access to the authenticated identity to key the limit. |
For most solo SaaS the answer is both edge and application: a coarse limit at the edge to absorb DDoS-like traffic, and a fine-grained limit at the application layer to enforce plan tiers and per-user quotas. The edge layer doesn’t know who your authenticated user is; the application layer can’t cheaply turn away a million requests per second.
If you are deploying a Next.js app on Vercel, the dominant pattern in 2026 is to pair Upstash’s serverless Redis with their Ratelimit SDK. The combination gives you token-bucket or sliding-window limits with sub-millisecond Redis lookups from Vercel’s edge or serverless runtime, no infrastructure to manage, and a free tier that covers low-traffic apps.
The shape of the integration is roughly:
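A minimal sketch, assuming the @upstash/ratelimit and @upstash/redis packages with the standard UPSTASH_REDIS_REST_URL / UPSTASH_REDIS_REST_TOKEN environment variables; the route path, limiter numbers, and IP fallback are illustrative:

```typescript
// app/api/chat/route.ts — hypothetical Next.js route handler
import { Ratelimit } from "@upstash/ratelimit";
import { Redis } from "@upstash/redis";

// Token bucket: refill 10 tokens every 10 seconds, burst cap of 10
const ratelimit = new Ratelimit({
  redis: Redis.fromEnv(),
  limiter: Ratelimit.tokenBucket(10, "10 s", 10),
});

export async function POST(req: Request) {
  // Key by IP here; key by user ID once the request is authenticated
  const ip = req.headers.get("x-forwarded-for") ?? "anonymous";
  const { success, limit, remaining, reset } = await ratelimit.limit(ip);

  if (!success) {
    return new Response(
      JSON.stringify({ error: { code: "rate_limit_exceeded" } }),
      {
        status: 429,
        headers: {
          "Retry-After": String(Math.max(0, Math.ceil((reset - Date.now()) / 1000))),
          "X-RateLimit-Limit": String(limit),
          "X-RateLimit-Remaining": String(remaining),
          "X-RateLimit-Reset": String(reset),
          "Content-Type": "application/json",
        },
      }
    );
  }

  return Response.json({ ok: true }); // …handle the request normally
}
```

Sending the metadata headers on the success path too (omitted above for brevity) is what lets clients self-throttle before they hit the limit.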
The pattern works equally well for plan-tier-based limits — key the limit by user ID and look up the user’s plan to choose the limit:
- Ratelimit.slidingWindow(100, "1 h") — 100 requests per hour.
- Ratelimit.slidingWindow(1000, "1 h") — 1,000 per hour.
- Ratelimit.slidingWindow(10000, "1 h") — 10,000 per hour.

If you’re wiring this together with auth, our how to add OAuth to your SaaS guide covers the identity side, and what is a JWT covers the token format you’ll typically use to identify the calling user.
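A sketch of the plan-tier lookup, assuming the same Upstash packages; the tier names and the limitForUser helper are hypothetical:

```typescript
import { Ratelimit } from "@upstash/ratelimit";
import { Redis } from "@upstash/redis";

const redis = Redis.fromEnv();

// One limiter per plan tier (tier names and quotas are illustrative)
const limiters: Record<string, Ratelimit> = {
  free: new Ratelimit({ redis, limiter: Ratelimit.slidingWindow(100, "1 h") }),
  pro: new Ratelimit({ redis, limiter: Ratelimit.slidingWindow(1000, "1 h") }),
  scale: new Ratelimit({ redis, limiter: Ratelimit.slidingWindow(10000, "1 h") }),
};

export async function limitForUser(userId: string, plan: string) {
  const limiter = limiters[plan] ?? limiters.free; // unknown plans fall back to free
  // Keying by user ID means the quota follows the account, not the IP
  return limiter.limit(`user:${userId}`);
}
```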
The key you rate-limit on matters at least as much as the algorithm. Common choices:

- Per IP address — the only option on unauthenticated endpoints (login, signup), though shared NATs and proxies make it imprecise.
- Per user or API key — the standard choice once the caller is authenticated.
- Per tenant — for B2B, so one seat can’t exhaust the whole organization’s quota.
- Per endpoint — tighter limits on expensive or sensitive routes.
The pragmatic combination for most solo SaaS: per-IP limits on auth endpoints, per-user limits on the rest of the API, with per-endpoint limits added as you discover specific abuse patterns.
The most common mistake is inconsistency: if half your endpoints return Retry-After and half return a custom X-Wait-Seconds header, client SDKs can’t back off correctly. Pick the standard names (Retry-After, X-RateLimit-*) and use them everywhere. The same goes for malformed or missing X-RateLimit-Reset values. Write a test that asserts the headers are present and numeric on every 429.

The dual problem of rate-limiting your own API is gracefully handling rate limits imposed by the APIs you depend on. OpenAI, Anthropic, Stripe, GitHub — every upstream provider has a 429 in its future for your service. Best practices on the client side:
The most important one is honoring Retry-After. Your client should read the header, sleep, and retry. Anything else either gives up too soon (lost work) or hammers the upstream (worse rate limits). Add jitter to the wait (e.g. retryAfter + random(0, 5000ms)) so retries from concurrent workers spread out.

API rate limiting is the seatbelt that prevents one client — abusive, buggy, or just enthusiastic — from ruining the experience for everyone else and from running up unbounded upstream bills. The four algorithms (fixed window, sliding window, token bucket, leaky bucket) cover almost every real-world need; token bucket is the default for SaaS APIs in 2026. The HTTP contract is small but matters: 429 status, Retry-After header per RFC 6585, and the X-RateLimit-* trio of metadata headers. The implementation is increasingly boring — Upstash Ratelimit on Vercel handles the hot path for most Next.js solo SaaS — which means the interesting decisions are about what to limit (per IP, per user, per tenant, per endpoint) and how to communicate the limit gracefully to your clients.
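The read-Retry-After-then-jitter logic on the client side can be sketched as a small pure helper (backoffDelayMs is a hypothetical name; wire it into whatever HTTP client you use, and note it handles only the delta-seconds form of Retry-After, not the HTTP-date form):

```typescript
// Compute how long to sleep before retrying after an upstream 429.
function backoffDelayMs(
  retryAfterHeader: string | null,
  attempt: number,
  maxJitterMs: number = 5000
): number {
  // Prefer the server's Retry-After (delta-seconds) when present and numeric
  const retryAfterSec = retryAfterHeader ? Number(retryAfterHeader) : NaN;
  const baseMs = Number.isFinite(retryAfterSec)
    ? retryAfterSec * 1000
    : Math.min(60_000, 1000 * 2 ** attempt); // exponential fallback, capped at 60s
  // Random jitter spreads out retries from concurrent workers
  return baseMs + Math.floor(Math.random() * maxJitterMs);
}
```

A caller would sleep for backoffDelayMs(res.headers.get("retry-after"), attempt) milliseconds, then retry, giving up after a bounded number of attempts.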