When you deploy your first API, it feels like something real has shipped. You’ve tested your endpoints, your JSON is valid, and the responses are coming back in milliseconds. Everything feels smooth until thousands of requests come flooding in at once.
Image Source: https://stytch.com/blog/api-rate-limiting/
It starts with a slight spike in latency. A specific endpoint, maybe the search function or the image uploader, starts taking two hundred milliseconds instead of twenty. Then the CPU usage graph on your admin dashboard, which was hovering at 5%, suddenly shoots up to 70% or higher. Memory usage starts climbing too.
Suddenly the logs are not scrolling gently anymore; they are flying past in a screen of red: **500 Internal Server Error**, **504 Gateway Timeout**. Your database connections are under strain.
You check the source of the traffic and realize with horror what happened. It wasn’t a malicious hacker. It was a frontend developer on your team who accidentally placed an API call inside a **useEffect** hook without a dependency array, causing the client to retry the request ten thousand times a second.
Your server just collapsed under the weight of its own users. For every backend engineer, this is the moment you realize that functionality alone isn’t enough. You need protection. You need control. You need a Rate Limiter.
The Dangerous Illusion of Infinite Capacity
When you are learning to code, you usually run your server on “localhost”. In this environment, you are the only user. If you click a button 10 times, the server handles it 10 times. It feels instant. This creates a dangerous illusion: it teaches you to assume that the server will always handle any amount of requests.
In the real world, a server is just a computer. It has a specific number of CPU cores and a specific amount of RAM. Every time a request arrives at your API, the server has to set aside some of those resources to handle it.
Without a rate limiter, your API invites everyone in at once and promises to serve them all, even though no server can handle infinite requests. A rate limiter steps in to control the flow, serving only what the server can handle and making the rest wait.
Three Silent Killers of APIs
- **The Accidental Loop** happens when human error, like an app retrying too aggressively or a script stuck in an infinite loop, causes “friendly fire” that can be just as deadly as an attack.
- **The Noisy Neighbor** shows up because, in cloud computing, your application often shares resources with other parts of the system. Think of an API that resizes images. This is a heavy operation. If a user uploads 1,000 high-quality photos, the CPU becomes almost 100% dedicated to resizing. Meanwhile, another user who just wants to log in cannot get through, because the CPU is busy. Without rate limiting, one heavy user can starve the entire system.
- **The Wallet Drain** happens when uncontrolled API calls to paid services like OpenAI or AWS Lambda let a single bug or malicious user rack up thousands of dollars overnight, and rate limiting is the only thing standing between you and bankruptcy.
Why Counting is Not Enough
The solution may seem simple: let us allow 60 requests per minute. It sounds reasonable. You set up a counter, add one every time a request comes in, and once the count exceeds 60 you block everything else until the minute is over.
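To make the naive approach concrete, here is a minimal sketch of such a fixed-window counter in Go. The type and field names are invented for illustration, and it assumes the standard sync and time packages are imported:

```go
// FixedWindowCounter is a naive "N requests per minute" limiter.
type FixedWindowCounter struct {
	mu          sync.Mutex
	limit       int       // e.g. 60
	count       int       // requests seen in the current window
	windowStart time.Time // when the current minute began
}

func (c *FixedWindowCounter) Allow() bool {
	c.mu.Lock()
	defer c.mu.Unlock()

	// A full minute has passed: start a fresh window.
	if time.Since(c.windowStart) >= time.Minute {
		c.windowStart = time.Now()
		c.count = 0
	}
	if c.count >= c.limit {
		return false
	}
	c.count++
	return true
}
```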
But there is a flaw in this logic, “Per minute” is a measure of duration, not distribution.
Imagine your “minute” starts at 12:00:00. Between 12:00:00 and 12:00:01, a malicious user sends 60 requests. Your server allows them all. The counter hits 60. The user is now blocked for the next 59 seconds. This seems fine.
But what if they wait until 12:01:00? The counter resets. Between 12:01:00 and 12:01:01, they send another 60 requests.
Did you see what happened there? By exploiting the transition from one minute to the next, the attacker hit the server with 120 requests in what feels like an instant.
To solve this, we need a smarter algorithm. We need something that smooths out the traffic, something that understands the flow of time rather than just checking a calendar. We need the Token Bucket.
Think of Tokens Like Arcade Credits
The Token Bucket algorithm is the gold standard for rate limiting because it matches our intuition about capacity. Suppose you are at an arcade, standing in front of your favorite cabinet. To play the game, you need a physical token. Next to your machine there is a bucket, which is your “allowance”.
Every ten seconds, an employee walks by and drops exactly one token into the bucket. This is the **Refill rate**. However, the bucket can only hold up to five tokens. If you have not been playing for a while, the bucket fills up, and when the employee tries to drop in another token, it bounces off the rim and falls on the floor. You cannot save tokens forever. This is the **Capacity**, or burst limit.
This analogy perfectly captures the two behaviors we want in our API:
- The Burst: If you step away for lunch, your bucket fills up to five tokens. When you return, you can play five games in rapid succession without waiting. This is great for an API! It means that when a user first opens your app, they can load their profile, their feed, and their notifications instantly. We don’t want to slow them down when they are acting normally.
- The Throttle: Once you have used up that initial burst, you can no longer play fast. You are forced to play at the speed of the employee, one game every ten seconds. You have been successfully “rate limited” to a steady pace.
Why Go Was Born for This Problem
If we tried to implement this in a language like Python, Java, or C++, we would quickly run into an issue called concurrency control.
In a real web server, thousands of requests are happening at the same time. This means thousands of “hands” are reaching into the “bucket” simultaneously.
What happens if two requests try to grab the last token at the same time? Without careful design, both requests might see the token, and both might go through. The token count can become negative. The shared state can get corrupted.
To prevent this in other languages, we have to use “Locks” or “Mutexes” (Mutual Exclusions). We have to tell the computer: “Freeze the entire universe. Let Request A check the bucket. Okay, unfreeze. Now freeze again for Request B.” This works, but it is slow, and writing the code is error-prone.
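For comparison only, a lock-based bucket in Go would look roughly like this. It is a sketch with invented names, not the implementation we are about to build, and it assumes the standard sync package is imported:

```go
// LockedBucket is a mutex-protected token bucket, shown only for comparison.
type LockedBucket struct {
	mu     sync.Mutex
	tokens int
	max    int
}

func (b *LockedBucket) TryTake() bool {
	b.mu.Lock()         // "freeze" access to the bucket
	defer b.mu.Unlock() // "unfreeze" when we are done

	if b.tokens == 0 {
		return false
	}
	b.tokens--
	return true
}

func (b *LockedBucket) Refill() {
	b.mu.Lock()
	defer b.mu.Unlock()

	if b.tokens < b.max {
		b.tokens++
	}
}
```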
Go offers a different path: a primitive type called a Channel. You can think of a channel as a conveyor belt: you put things in at one end and take them out at the other. Most importantly, the channel is thread safe by design. Go’s runtime manages the chaos; if a thousand goroutines try to pull from a channel at once, Go ensures they form an orderly line.
Building the Rate Limiter
Now, let’s build a rate limiter in Go, keeping the implementation simple.
Step 1: The Blueprint
First, we need to define what our Rate Limiter actually is. We need a struct, a container for the data we want to store.
```go
type RateLimiter struct {
	tokens     chan struct{}
	refillTime time.Duration
}
```
The **tokens** channel is our bucket. Notice the type **chan struct{}**? In Go, an empty struct uses zero bytes of memory, meaning it carries no information. It is a pure signal in our case. The **refillTime** tells our system how often the “employee” should drop in a new token.
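If you want to convince yourself that an empty struct really occupies no memory, a quick throwaway check with the standard unsafe package shows it (this snippet is not part of the limiter):

```go
package main

import (
	"fmt"
	"unsafe"
)

func main() {
	// An empty struct occupies zero bytes...
	fmt.Println(unsafe.Sizeof(struct{}{})) // prints 0
	// ...while even a bool token would cost a byte each.
	fmt.Println(unsafe.Sizeof(true)) // prints 1
}
```

That is why chan struct{} is the idiomatic choice when all you need is a signal.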
Step 2: The Constructor
Now we need a constructor that creates a new limiter.
```go
func NewRateLimiter(maxTokens int, refillTime time.Duration) *RateLimiter {
	rl := &RateLimiter{
		tokens:     make(chan struct{}, maxTokens),
		refillTime: refillTime,
	}

	// Phase 1: The Initial Fill
	// We want our users to have their "burst" allowance immediately.
	for i := 0; i < maxTokens; i++ {
		rl.tokens <- struct{}{}
	}

	// Phase 2: Start the Background Refill
	// This is the "employee" who works in the background.
	go rl.startRefill()

	return rl
}
```
Look at the loop: **for i := 0; i < maxTokens; i++**. When we create the bucket, we immediately fill it to the top, putting in as many tokens as it can hold. Why? Because when a server restarts, or when a new user arrives, we want them to start with a full battery. We want their first experience to be smooth and fast. If we started with an empty bucket, the very first request would be denied, which would be a terrible user experience.
Now have a look at **go rl.startRefill()**. The go keyword spins up a goroutine, a lightweight background routine that runs for as long as the program does, in parallel with everything else. It is the “employee” walking over to the bucket.
Step 3: The Heartbeat
Now we will define what the employee actually does. This is the logic that runs inside startRefill.
```go
func (rl *RateLimiter) startRefill() {
	ticker := time.NewTicker(rl.refillTime)
	defer ticker.Stop()

	for {
		// Wait for the tick
		<-ticker.C

		// Try to add a token
		select {
		case rl.tokens <- struct{}{}:
			// Success! Token added.
		default:
			// Bucket is full. Drop the token.
		}
	}
}
```
This is an infinite loop, but it does not eat up the CPU, because of **ticker.C**. The program pauses at that line until the clock ticks.
The **select** statement is the heart of this method. In Go, if you try to push an item into a channel that is already full, the program “blocks”: it freezes and waits for space to open up. We do not want our background worker to freeze. If the bucket is full, we want the worker to throw the token away and wait for the next tick. The default case handles this. It effectively says: “Try to send a token. If you can’t do it instantly, never mind. Just skip it.”
Step 4: The Gatekeeper
Finally we need the method that our web server will actually call. This is the method that says “Yes” or “No”.
```go
func (rl *RateLimiter) Allow() bool {
	select {
	case <-rl.tokens:
		return true
	default:
		return false
	}
}
```
This method is simple. It tries to read from the channel (<-rl.tokens). If there is a token, it consumes it and returns true; the request is allowed. If the channel is empty, it hits the default case, returns false, and the request is denied.
Seeing it in Action
Now that we have built our engine, we need to verify that it actually behaves the way we want.
```go
// This example assumes "fmt" and "time" are imported.
func main() {
	// 1. Create the Limiter
	// Capacity: 100 tokens (The Burst)
	// Refill Rate: 1 token every 1 second (The Throttle)
	limiter := NewRateLimiter(100, time.Second)

	fmt.Println("STARTING TRAFFIC SIMULATION")

	// 2. Simulate 300 incoming requests
	for i := 1; i <= 300; i++ {
		// Ask the gatekeeper for permission
		allowed := limiter.Allow()

		// Log the result
		if allowed {
			fmt.Printf("Request %d: [ALLOWED]\n", i)
		} else {
			fmt.Printf("Request %d: [DENIED]\n", i)
		}

		// Simulate the speed of the user (fast! every 50ms)
		time.Sleep(50 * time.Millisecond)
	}

	fmt.Println("SIMULATION COMPLETE")
}
```
Analyzing the Output: A Story of Three Phases
When you run this program, the console output will fly by. But if you scroll back up and read the logs, you will see a fascinating story unfold in three distinct phases.
**Phase 1: The Open Floodgates (Requests 1 to 100).** Because we initialized the bucket with 100 tokens, the first 100 requests are all [ALLOWED]. Even though the user is hammering the server every 50ms, the rate limiter doesn’t stop them. It says: “You have a large allowance, go ahead.”
This is exactly what we want. If a legitimate user opens your app and their phone needs to fetch 50 images or 20 JSON files instantly, we should not block them. We absorb the spike because the bucket was full.
**Phase 2: The Draining (Requests 101 to ~105).** Around request 100, the bucket hits zero. The initial battery is dead. However, notice something interesting: the program takes time to run. We are sleeping 50ms per request, so by the time we hit request 100, roughly 5 seconds have passed. During those 5 seconds, our “employee” has woken up 5 times and dropped 5 tokens into the bucket. So requests 101, 102, 103, 104, and 105 might still go through. The user feels the system slowing down, but they are not fully blocked yet.
**Phase 3: The Wall (Requests 106 to 300).** The user is still hitting the server 20 times a second, but the server is only refilling 1 token per second. A supply of 1 cannot meet a demand of 20.
The console becomes a sea of red: Request 110: [DENIED], Request 111: [DENIED]... Then, suddenly, a single green line appears: Request 120: [ALLOWED] (one second passed, so one token appeared). Then it is back to red.
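Put together, the log looks roughly like this (abridged; the exact request numbers may shift by one or two depending on timing and scheduling):

```
Request 1: [ALLOWED]
Request 2: [ALLOWED]
...
Request 105: [ALLOWED]
Request 106: [DENIED]
Request 107: [DENIED]
...
Request 119: [DENIED]
Request 120: [ALLOWED]
Request 121: [DENIED]
...
Request 140: [ALLOWED]
...
```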
Understanding the Console Output
1. The Math Behind the “Wall”
To understand why Request 120 is the one that suddenly turns green, we need to look at elapsed time, not just raw request count.
Each request arrives every 50ms, so by the time we reach Request 100, exactly 5 seconds have passed.
The bucket starts with 100 tokens, which are fully consumed by Request 100. However, during those 5 seconds, the background **refiller** has quietly added 5 new tokens (1 token per second).
Those refilled tokens create a small buffer. Requests 101 through 105 consume them immediately. At Request 106, the bucket is officially empty.
This is the moment the wall appears.
2. Why Request 120 Is Allowed
Now let us see what happens after the bucket is drained.
The last available token is consumed by Request 105, which arrives at the 5.25 second mark. From this point onwards, the system depends entirely on the refiller.
The **refiller** adds one token every second. Since the fifth token was added exactly at the 5 second mark, the sixth token will not exist until the clock hits the 6 second mark.
So which request lands exactly on the 6 second mark?
Request number × 50 ms = 6000 ms, so Request number = 6000 / 50 = 120.
So at request 120, the background goroutine has just dropped a fresh token into the channel, and the Allow() call succeeds. That is why this request is suddenly accepted.
3. Explaining the “Sea of Red”
Between Requests 106 and 119, the user keeps sending requests every 50ms, but the server can only generate one token per second.
This creates a dead zone where every incoming request fails; once the pattern settles, that dead zone covers roughly 950ms out of every second.
- Requests 106–119: 14 consecutive failures (700 ms total)
- Request 120: Success (6th second refill)
- Requests 121–139: 19 consecutive failures
- Request 140: Success (7th second refill)
The failures aren’t random. They’re a direct consequence of time-based token generation.
This is the elegance of the Token Bucket algorithm. It doesn’t crash your server or abruptly cut clients off. Instead, it calmly enforces the rate you’ve decided is safe, turning uncontrolled bursts into predictable, steady traffic.
What Changes in Production
What we have built here is a fully functional rate limiter. However, moving from a blog post to a production environment requires a shift in mindset. Here are a few challenges that experienced engineers keep in mind.
The “Wall Clock” vs. The “CPU Clock”
In our code, we used time.NewTicker. In an ideal world, if you tell the computer "tick every 1 second," it ticks exactly every 1.0000 seconds. In the real world, computers get busy. If the CPU is under heavy load (which is exactly when you need rate limiting!), the operating system might pause your program for a few milliseconds to handle something else. This means your refill might happen at 1.05 seconds or 1.1 seconds. This "drift" is usually acceptable for rate limiting. We aren’t launching a rocket; we are protecting a server. If we let in 59 requests in a minute instead of 60, nobody will complain. But it is important to remember that software timing is rarely perfect.
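If you are curious how much drift you actually see, a tiny standalone experiment like this one (purely a measurement sketch, not part of the limiter) makes it visible:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()

	prev := time.Now()
	for i := 0; i < 5; i++ {
		<-ticker.C
		now := time.Now()
		// On an idle machine this stays very close to 1s;
		// under heavy load the gap can grow slightly.
		fmt.Printf("tick %d after %v\n", i+1, now.Sub(prev))
		prev = now
	}
}
```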
The User Experience of Rejection
In our code, we simply printed “DENIED” to the console, but in a real web API you can’t just drop the request silently, which would look like a crash to the client. HTTP already defines the correct response for this scenario: 429 Too Many Requests. When Allow() returns false, your handler should respond with a 429 status code, and ideally include a Retry-After: 60 header to tell the client exactly when it’s safe to try again. That small detail turns a blunt rejection into a clear, respectful contract between server and client.
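As a rough sketch of how that fits around the limiter we built, here is a minimal net/http middleware; the 60-second Retry-After value and the handler wiring are illustrative assumptions, not fixed requirements:

```go
// RateLimitMiddleware rejects requests with 429 when the bucket is empty.
// It assumes the RateLimiter type from earlier and the "net/http" import.
func RateLimitMiddleware(rl *RateLimiter, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !rl.Allow() {
			// Tell the client when it is reasonable to try again.
			w.Header().Set("Retry-After", "60")
			http.Error(w, "Too Many Requests", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}
```

Wiring it up is one line, for example http.Handle("/api/", RateLimitMiddleware(limiter, apiHandler)), where apiHandler is whatever handler you already have.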
Distributed Systems
The code we wrote today works perfectly for a single server. But what if your app grows huge, like Netflix or Uber? You might have 50 servers running the same code. If each one has its own bucket of 100 tokens, your system effectively allows bursts of 5,000 requests. Worse, a user might hit server A for their first request and server B for their second, and server B has no idea that the user already spent a token on server A. To solve this, large systems move the **bucket** out of the memory of a single process and into a shared store like **Redis**. The logic stays the same, but the storage becomes centralized.
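One way to picture that shift, as a sketch only: hide the token storage behind an interface, so the in-memory version and a Redis-backed version are interchangeable. Everything here (the interface name, the key parameter) is invented for illustration:

```go
// TokenStore is the piece that moves out of process memory in a
// distributed setup. The handler only ever calls Allow.
type TokenStore interface {
	// Allow atomically takes one token for the given key
	// (e.g. a user ID or API key) and reports whether it succeeded.
	Allow(key string) (bool, error)
}

// InMemoryStore wraps the channel-based limiter for a single server.
// It ignores the key because our simple limiter has one global bucket.
type InMemoryStore struct {
	rl *RateLimiter
}

func (s *InMemoryStore) Allow(key string) (bool, error) {
	return s.rl.Allow(), nil
}

// A RedisStore would implement the same interface, typically by running
// a small Lua script so that "check the count and decrement it" happens
// as one atomic operation shared by all 50 servers.
```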
Final Thoughts
By implementing a rate limiter, you have taken a massive step in your journey. You have acknowledged that your system has limits, and you have built a mechanism to enforce them. You have moved from “hoping” your server stays up to “ensuring” it stays up.
The Token Bucket algorithm is elegant, fair, and resilient. And thanks to Go’s channels, it is also incredibly simple to implement. You don’t need a heavy framework. You don’t need a complex library. You just need a bucket, a ticker, and the wisdom to say “No” when the traffic gets too loud.
Thanks for taking the time to read this. If this helped clarify how rate limiting works in practice, or saved you from learning it the hard way, I’m glad. I’ll be writing more about real world backend engineering, system behavior under load, and the small details that matter at scale, so feel free to follow if that sounds useful.