Your biggest traffic day will find every weakness in your system. It always does.
You can prepare. You can test. You can run load simulations and review query plans and feel confident that your infrastructure is solid. And then real users show up in numbers you have never seen before, and suddenly you are watching your database buckle under pressure you thought you were ready for.
I have been building software for years. I have seen high traffic systems before. I have dealt with scaling problems and database bottlenecks and all the fun that comes with production load. But every system has its own breaking point, and sometimes you find it the hard way. There is nothing quite like watching response times climb from milliseconds to seconds to timeouts while your monitoring dashboard turns red and your community channels fill with frustrated users.
This is the story of how that happened to us at Reveel during a major campaign, how we diagnosed and fixed a cascading database failure under live fire, and what I learned about Postgres behavior at scale that I wish I had internalized earlier.
REVA, Reveel’s latest flagship product, had been live for about three months. We had just partnered with Binance for an extended campaign spanning multiple phases over two months. Each phase launch brought a predictable pattern: traffic would spike hard in the first 12 hours as users rushed to complete tasks, then settle into a steady state until the next phase.
Our stack is Node.js with Prisma as the ORM, running on Heroku with a managed Postgres instance. We had the fundamentals in place. Redis caching for frequently accessed data, reasonable connection settings, monitoring dashboards. The early phases had gone smoothly enough that we started feeling confident.
Confidence is dangerous.
One of the later phase launches brought traffic that exceeded anything we had seen before. Binance had pushed harder on promotion. The user influx in those first 12 hours was massive. And our database, which had been handling things just fine, started falling apart.
The symptoms hit fast. H12 (request timeout) and H13 (connection closed without response) errors flooding our Heroku logs. Simple database lookups taking 15 to 20 seconds. Campaign activity verification failing. PayID registrations timing out. Users bouncing off error pages.
Our Discord exploded with reports. Our Telegram channel with the Binance team turned into a crisis room. Real users, real partner, real stakes. And our Postgres instance was choking.
I pulled up the database metrics. Connection count pegged at the limit. CPU spiking. Query times through the roof. Our caching layer was doing its job, but every request that actually needed to hit Postgres was piling up and waiting.
What followed was several intense hours of live debugging. We identified the problems. We fixed them in record time. The system stabilized. But those hours taught me more about database behavior under stress than years of reading documentation ever could.
Here is what we found and how we fixed it.
From traffic spike to stabilization in under 3 hours
The first thing I noticed was our connection count. We were hitting the maximum connections allowed by our Postgres plan, and we were hitting it hard.
Here is what was happening under the hood.
Each Heroku dyno runs multiple workers. Each worker maintains its own connection pool through Prisma. Under normal load, this is fine. But when traffic spikes, you end up with a multiplication problem: dynos times workers times pool size equals a lot of connections.
Making it worse, when a query runs slow (which they were, but I will get to that), the connection stays open longer. Other requests pile up waiting for connections. Eventually Postgres hits its connection ceiling and starts refusing new connections entirely. That is when everything falls apart.
The Fix: Tuning PgBouncer Configuration
We already had PgBouncer as a connection pooler between our application and Postgres. The problem was the configuration. Pool sizes were not tuned for the traffic we were now seeing. The key change was switching to transaction pooling mode (which releases connections back to the pool as soon as a transaction completes rather than keeping them tied to a client session) and adjusting the numbers to match our actual load.
Here is the configuration that actually mattered:
Prisma side (per dyno):
We dropped the connection pool size to 3 to 5 connections per dyno. This felt counterintuitive at first. Fewer connections per dyno? But the goal was to stop individual dynos from hoarding connections during spikes.
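For reference, Prisma's per-process pool is controlled through parameters on the datasource URL. A minimal sketch with illustrative values (your host, database name, and exact numbers will differ):

DATABASE_URL="postgresql://user:pass@pgbouncer-host:6432/app_db?connection_limit=5&pool_timeout=10"

connection_limit caps the pool each process opens, and pool_timeout bounds how long a request waits for a free connection before failing fast instead of piling up.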
PgBouncer side:
We set default_pool_size to start around 50 and scaled up cautiously to 100. We added a reserve_pool_size of about 10 to 20 percent of the default pool to handle burst traffic without immediately rejecting connections.
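In PgBouncer terms, the change looked roughly like this (an illustrative pgbouncer.ini excerpt, not our literal file):

[pgbouncer]
pool_mode = transaction
default_pool_size = 50       ; started here, scaled toward 100 cautiously
reserve_pool_size = 10       ; roughly 10 to 20 percent of the default pool
reserve_pool_timeout = 3     ; seconds to wait before dipping into the reserve
max_client_conn = 1000       ; clients can fan out; the server-side pool stays capped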
The math we followed:
We calculated our effective connection usage as: dynos × workers × prisma_pool_size × 1.2 (for headroom). The rule was simple: never let this number exceed 60 to 70 percent of our Postgres plan’s hard connection limit. Leave room for admin connections, migrations, and the unexpected.
What went through PgBouncer:
All Node.js API dynos using Prisma. All HTTP request driven database access. Essentially every hot path touching Postgres during the campaign.
What stayed on direct connections:
One-off admin scripts and migrations. Background jobs that did not need pooling and had minimal connection requirements.
Transaction pooling has one gotcha: it does not play well with prepared statements. Prisma has a configuration option to disable these, which we enabled. A small performance tradeoff for massive stability gains.
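In Prisma, this is a flag on the datasource URL (illustrative values again; the exact URL depends on your setup):

DATABASE_URL="postgresql://user:pass@pgbouncer-host:6432/app_db?connection_limit=5&pool_timeout=10&pgbouncer=true"

With pgbouncer=true, Prisma avoids relying on prepared statements, which is what lets transaction-mode pooling hand the same server connection to different clients between transactions.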
The result was immediate. Instead of thousands of connections fighting for access, we had a controlled, capped connection set with PgBouncer efficiently multiplexing requests. Slow queries could no longer starve the entire system.
The difference was in the configuration, not the tooling
Why the numbers matter: 96% usage left no room for spikes
With connection exhaustion under control, we could finally see what was actually slow. And oh, it was slow.
Two query patterns were responsible for most of our pain.
The ILIKE Problem
We had search functionality using queries like:
SELECT * FROM users WHERE name ILIKE '%john%'
This looks innocent. It works fine in development with a few thousand rows. It becomes a disaster at scale.
The problem is that leading wildcard. When you search for `%john%`, Postgres cannot use a standard B-tree index. The leading percent sign means the search term could appear anywhere in the string, so Postgres has to scan every single row and check. Every time. It does not matter if you have an index on the column. Postgres ignores it.
Under normal traffic, you do not notice. The scans complete fast enough. Under campaign traffic with thousands of concurrent users, you are running thousands of full table scans per minute. CPU spikes, queries queue up, and everything downstream suffers.
The Fix: Trigram Indexes
Postgres has an extension called pg_trgm that breaks strings into three character chunks (trigrams) and indexes those. With a GIN index using trigram operations, Postgres can actually use an index for wildcard searches.
CREATE EXTENSION pg_trgm;
CREATE INDEX idx_users_name_trgm ON users USING GIN (name gin_trgm_ops);
After adding this, our `ILIKE '%john%'` queries could use an index. The improvement was dramatic.
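One caveat if you are adding the index to a large table that is taking live writes: a plain CREATE INDEX blocks writes while it builds. Postgres can build the same index without blocking writes, at the cost of a slower build (and it cannot run inside a transaction block):

CREATE INDEX CONCURRENTLY idx_users_name_trgm ON users USING GIN (name gin_trgm_ops);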
The OFFSET Pagination Problem
We also had pagination using OFFSET:
SELECT * FROM users ORDER BY created_at OFFSET 10000 LIMIT 20
Here is what most developers do not realize about OFFSET. It does not skip rows. It fetches all of them up to the offset, discards them, and then returns your requested rows. An OFFSET of 10000 means Postgres reads and throws away 10000 rows before giving you 20.
The deeper someone paginates, the slower it gets. Linear degradation. During heavy traffic, with users browsing deep into lists, these queries were taking forever.
The Fix: Cursor-Based Pagination
We switched to keyset pagination:
SELECT * FROM users
WHERE created_at > $last_seen_timestamp
ORDER BY created_at
LIMIT 20
Instead of saying “skip 10000 rows,” we say “get rows after this timestamp.” Postgres can use an index to jump directly to the right position. Performance is constant regardless of how deep you paginate.
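One refinement worth calling out: if created_at is not unique, rows that share a timestamp can be skipped or repeated across pages. Assuming the table has a unique id column (an assumption for illustration), the usual answer is a composite cursor:

SELECT * FROM users
WHERE (created_at, id) > ($last_seen_timestamp, $last_seen_id)
ORDER BY created_at, id
LIMIT 20

Postgres can still serve this from an index on (created_at, id), so the constant-time property holds.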
The combination of trigram indexes and cursor pagination took our worst queries from 20 seconds down to sub-second.
Even with connection pooling and optimized queries, we noticed some requests were holding database connections open for uncomfortably long periods. Tracing these down revealed the issue: we were doing too much synchronous work inside API request handlers.
External validations, third-party checks, additional API calls. All happening inside the request path while a database connection sat open waiting. The actual database work might take 50 milliseconds, but the request held the connection for 2 seconds while waiting on external services.
Under normal load, this inefficiency is invisible. Under campaign load, it meant connections were tied up far longer than necessary, contributing to the exhaustion problem we had already addressed with PgBouncer.
The Fix: Background Jobs
We moved heavy, non-immediate work to background job queues. The API request does the minimum necessary, returns quickly to the user, and releases its database connection. Background workers pick up the heavier tasks asynchronously.
This had a multiplier effect on throughput. Faster response times for users, better connection utilization, and more predictable system behavior under load.
One pattern we observed during the incident was runaway queries. A poorly optimized query would run for 30 seconds, holding connections and resources while accomplishing nothing useful. Meanwhile, other queries backed up behind it.
Postgres has configuration options to prevent this, but we had not set them aggressively enough.
The Fix: Timeout Configuration
We configured three critical timeouts:
statement_timeout: Maximum time any single query can run before Postgres kills it. We set this low enough that genuinely runaway queries get terminated before they cause cascading problems.
lock_timeout: Maximum time a query will wait to acquire a lock. Prevents deadlock scenarios from blocking everything indefinitely.
idle_in_transaction_session_timeout: Kills connections that have been sitting idle inside an open transaction for too long. Prevents forgotten transactions from holding locks and connections.
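These can be set at the database or role level so every pooled connection inherits them. A sketch with hypothetical values and a hypothetical database name, since the right numbers depend entirely on your latency budget (some managed providers expect these to be set per role or per session instead):

ALTER DATABASE app_db SET statement_timeout = '5s';
ALTER DATABASE app_db SET lock_timeout = '2s';
ALTER DATABASE app_db SET idle_in_transaction_session_timeout = '30s';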
The philosophy here is that a fast failure is better than a slow failure. If a query is going to fail anyway because something is wrong, fail it quickly so resources can be freed for requests that might succeed.
The final piece was our Heroku dyno configuration. We had autoscaling enabled, which sounds good in theory. In practice, during traffic spikes, autoscaling was making things worse.
Here is why. When a new dyno spins up, it opens new database connections. If traffic is spiking and you are already near connection limits, adding more dynos actually accelerates the connection exhaustion problem. The autoscaler sees high load, adds dynos, which causes more connections, which causes more pressure, which looks like more load.
The Fix: Right-Sizing for Traffic Events
We pre-scaled for subsequent campaign phases instead of relying on reactive autoscaling. We analyzed expected traffic patterns, sized our dyno count appropriately, and reduced the autoscaling sensitivity to prevent thrashing.
For high-stakes traffic events, predictable infrastructure beats reactive infrastructure.
Critical queries dropped from 20 seconds to under 500 milliseconds. Campaign activity verification started working again. PayID registrations flowed through. Our Discord calmed down. The Telegram channel stopped being a crisis room.
40x improvement in query response times
The fixes we implemented became permanent parts of our infrastructure. What started as a firefight became our standard playbook. The rest of the Binance campaign ran without major database incidents.
Looking back, most of these issues were discoverable before they became emergencies. Here is the checklist I run through now before any major traffic event.
Connection Math
Calculate your maximum possible connections: dynos × workers × pool_size. Compare this to your database plan’s connection limit. If you are above 60 percent of the limit under normal operations, you have no headroom for spikes. Either reduce pool sizes, implement connection pooling, or upgrade your database plan.
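A quick way to see where you actually stand on a running instance, using standard Postgres system views:

SELECT count(*) AS connections_in_use,
       current_setting('max_connections')::int AS connection_limit
FROM pg_stat_activity;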
Query Audit
Run `EXPLAIN ANALYZE` on your most common queries. Look for sequential scans on large tables. Look for queries that get slower as data grows. Pay special attention to any search functionality using LIKE or ILIKE with leading wildcards.
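As a concrete example, this is the kind of check that would have caught our search problem earlier (same query shape as the incident; exact plan wording varies by Postgres version):

EXPLAIN ANALYZE SELECT * FROM users WHERE name ILIKE '%john%';
-- Without a trigram index, expect a sequential scan on users.
-- With the GIN index from earlier, expect a bitmap index scan on idx_users_name_trgm.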
Pagination Review
Find every OFFSET in your codebase. For user-facing pagination on large tables, convert to cursor-based pagination. OFFSET works fine for admin interfaces with small result sets. It does not work for high-traffic endpoints.
Request Path Analysis
Trace a few representative API requests. How long do they hold database connections open? Is there external work happening while connections are open? Consider moving anything that does not need to be synchronous to background processing.
Timeout Configuration
Set aggressive but reasonable timeouts on your database. A 30-second query timeout might feel generous, but it means one bad query can tie up resources for 30 seconds. Consider what timeout would actually be acceptable for your users and set limits accordingly.
Load Testing at Scale
Test with realistic data volumes and realistic traffic patterns. Your local database with 1000 rows will not reveal the problems that show up with millions of rows and thousands of concurrent requests. Simulate your expected peak, then simulate 2x that peak. See what breaks.
Beyond the immediate fixes, there are strategies worth considering for teams building systems that need to handle unpredictable traffic.
Read Replicas for Read-Heavy Workloads
If your application is read-heavy (and most are), consider read replicas. Route read queries to replicas and reserve your primary for writes. This multiplies your read capacity without requiring a larger primary instance.
Connection Pooling at Multiple Levels
Beyond PgBouncer, consider application-level connection management. Some frameworks support connection pooling that can intelligently route queries and shed load under pressure. The more layers of defense, the more gracefully you fail.
Database Observability
Invest in query-level observability. Know which queries run most frequently, which take longest, which cause the most lock contention. Tools like pg_stat_statements or dedicated database monitoring services can surface problems before they become incidents.
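If pg_stat_statements is available on your plan, even a simple query over it is revealing (column names shown are for Postgres 13 and later; older versions use total_time and mean_time):

SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;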
Capacity Planning for Events
For planned traffic events, do the math ahead of time. Estimate expected traffic multipliers. Pre-scale infrastructure. Have a runbook for what to do if reality exceeds estimates. The worst time to figure out your scaling strategy is when everything is already on fire.
Every startup that survives long enough will have a moment like this. You build something, it works, you start to trust it, and then reality shows up with traffic you did not expect and finds every weakness you did not know you had.
The Binance campaign was that moment for us. Two months, multiple phases, thousands of users. The database incident forced us to level up our infrastructure in ways we should have done proactively but learned to do reactively.
I would be lying if I said those hours were not stressful. There is a particular kind of tension when you are debugging production systems under load, watching your community channels fill with frustrated users, coordinating with partners, knowing that real people are affected by every minute the problem persists. You do not have the luxury of careful analysis. You have to be fast and right at the same time.
What I took away from this experience goes beyond the specific technical fixes. Systems behave differently under stress. Patterns that work fine at normal scale can become disasters at high scale. The queries that never showed up in your performance monitoring can suddenly dominate your database load when traffic multiplies.
The fixes we implemented were not exotic. Connection pooling, query optimization, timeouts, background processing. These are well-known techniques. The lesson is that they need to be in place before you need them. By the time your database is melting, it is too late to start reading documentation.
Good engineering under pressure is not about avoiding problems entirely. It is about recognizing them fast, staying calm, and solving them systematically. We did that. REVA came out of it stronger. And now we have a playbook that we run before every major launch.
If you are building something that might face sudden traffic spikes, do the work now. Run the checklist. Do the connection math. Audit your queries. Set your timeouts. Your future self, staring at red dashboards while your community channels explode, will thank you.