Killing me softly with his bash song, Zero‑downtime all along…
Artwork by Bojan Jevtić. Used with permission.
Introduction
Upgrading PostgreSQL in production is never just about running pg_upgrade.
At Fresha we run more than 200 PostgreSQL databases, and by the end of 2024 some of our most critical and heavily loaded ones were still stuck on Postgres 12 — about 20 databases in total. PostgreSQL 12 was heading for end‑of‑life on Amazon RDS between November 2024 and February 2025, but the risk and pain of upgrades had been high enough that we postponed them again and again.
This created a dangerous situation: business‑critical data on a soon‑to‑be unsupported version, and no safe path forward. We had Debezium streaming change events into Kafka, outbox connectors producing ordered domain events, and fleets of replicas serving reads. Dropping into “maintenance mode” for even a few minutes was unacceptable.
For years, upgrades had the deer‑in‑headlights effect: everyone froze. I wanted to break that cycle. By combining database internals, streaming knowledge, and a healthy dose of Bash automation, I showed how zero‑downtime upgrades could work in practice. Once we proved it on the hardest PG12 databases, we scaled the method into a repeatable solution for dozens of teams and hundreds of databases.
Let’s step back for a moment. The obvious path was always “schedule maintenance windows” and take the hit. But that’s not realistic when your system is 24/7 and globally distributed. The real challenge wasn’t just moving the data — it was dealing with the entire ecosystem around it:
- Debezium CDC connectors continuously streaming changes into Kafka
- Outbox pattern connectors generating ordered business events
- Replication slots and WAL management that can’t be dropped and recreated without data loss
- Physical replicas serving production reads
- PgBouncer pools handling thousands of concurrent connections

Instacart’s zero‑downtime cutover post describes a well‑known approach in the DBA world:
- Take a consistent base backup (RDS snapshot or pg_basebackup)
- Keep a replication slot open on the primary so WAL keeps accumulating
- Align the replica’s replication origin with the backup’s last known LSN
- Stream the changes from the slot until the replica catches up
- Promote the replica and cut traffic over
Logical replication during a blue‑green upgrade: accumulate WAL during restore, then advance replication origin and catch up without loss.
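To make the alignment step concrete, here is a minimal sketch, assuming a slot created on Blue before the snapshot, a publication already defined, and the slot/publication/subscription names from the sample config later in this article. $BLUE_URL and $GREEN_URL are illustrative connection strings, and how you obtain the snapshot’s LSN is environment‑specific, so the value below is a placeholder:

# 1) On Blue, create the logical slot before the snapshot so WAL accumulates from that point
psql "$BLUE_URL" -c "SELECT pg_create_logical_replication_slot('dms_slot', 'pgoutput');"

# 2) Restore the snapshot to Green and note the LSN the backup is consistent at
SNAPSHOT_LSN="0/1A2B3C4D"   # placeholder; comes from the backup/restore metadata

# 3) On Green, attach a subscription to the pre-existing slot without copying data,
#    and keep it disabled until the origin is aligned
psql "$GREEN_URL" -c "CREATE SUBSCRIPTION dms_sub
  CONNECTION 'host=localhost port=5433 dbname=testdb user=testuser'
  PUBLICATION dms_pub
  WITH (create_slot = false, slot_name = 'dms_slot', copy_data = false, enabled = false);"

# 4) Advance the subscription's replication origin (named pg_<subscription oid>) to the
#    snapshot LSN so changes already present in the restore are skipped, then start streaming
ORIGIN=$(psql "$GREEN_URL" -Atc "SELECT 'pg_' || oid FROM pg_subscription WHERE subname = 'dms_sub';")
psql "$GREEN_URL" -c "SELECT pg_replication_origin_advance('$ORIGIN', '$SNAPSHOT_LSN');"
psql "$GREEN_URL" -c "ALTER SUBSCRIPTION dms_sub ENABLE;"

Treat this as the conceptual shape of the recipe rather than something to copy‑paste: managed services differ in how they expose the backup LSN and the origin functions.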
That mechanism works, but it’s only the foundation. In practice it doesn’t address the messy parts we faced: Debezium connectors tied to those slots, outbox event ordering, sequence alignment, PgBouncer switchovers, and idempotency guarantees. Those were the hard problems we had to solve to make upgrades safe and repeatable at Fresha. And yes, we didn’t stop at a single major step — we did 12 → 17.
Why Classic Approaches Don’t Work
The classic “restore from snapshot, upgrade, and cut over” is conceptually simple. But in reality:
- Logical replication must continue through the upgrade process.
- Every physical replica needs to remain consistent and available for read‑only queries.
- Kafka connectors tied to slots on the old cluster must transition cleanly — otherwise they either lose data or emit duplicates in unpredictable order.
Before building our own method, we looked at the obvious options:
RDS Blue/Green Deployments
Amazon RDS offers Blue/Green upgrades that spin up a parallel environment and promise minimal downtime. Sounds perfect until you realize it doesn’t integrate with the logical replication slots and Debezium connectors we rely on. Blue/Green can swap endpoints, but it doesn’t handle Kafka offsets, connector states, or downstream deduplication requirements. For us, that’s a non‑starter.
Direct pg_upgrade on Production
In‑place upgrades sound easy: stop Postgres, run pg_upgrade, restart. But that implies minutes or hours of downtime on terabyte‑scale databases. It also assumes you’re comfortable with no rollback option beyond restoring backups. We weren’t.
So we built something different:
- Start the target database as a logical replica restored from a snapshot
- Rebuild its physical replicas alongside, so the new cluster mirrors the old one
- Use PgBouncer as the traffic switch, with a scripted pause/resume to freeze connections during switchover
- Design explicit Debezium handling modes for both CDC and outbox connectors
Synchronizing slots and connectors: duplicate replication slots, create new Kafka Connect connectors, then switch PgBouncer to move traffic to the upgraded database.
The rest of this article walks through how we turned that into an automated, developer‑friendly upgrade framework that anyone on our team could run in ~5 minutes per database.
Winter Is Coming: Prewarm & Analyze
Catching up a replica is only half the story. Before you flip traffic, you need to make sure the new cluster can actually carry production.
Consistency checks. On staging we ran PG17 with full app tests: migrations, queries, extensions. Don’t skip this. If the app isn’t green here, it won’t be green in prod.
Prewarm and analyze. A fresh RDS restore is cold. First queries crawl while blocks come in from object storage. We pre‑warmed critical tables and ran ANALYZE so stats were correct and the planner didn’t go wild on the first real query.
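A minimal sketch of that warm‑up, assuming the pg_prewarm extension is available on the target and using illustrative table names ($GREEN_URL is a placeholder connection string):

# pg_prewarm pulls a relation's blocks into shared buffers; ANALYZE refreshes planner stats
psql "$GREEN_URL" -c "CREATE EXTENSION IF NOT EXISTS pg_prewarm;"

for tbl in bookings payments customers; do   # illustrative list of hot tables
  psql "$GREEN_URL" -c "SELECT pg_prewarm('$tbl');"
done

psql "$GREEN_URL" -c "ANALYZE;"   # fresh statistics before real traffic lands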
Two‑phase switchover.
1) Read‑only flip — move replicas first. All writes still hit Blue, but reads come from Green. This is the production smoke test — if queries fail here, you still have a rollback button.
2) Full flip — once RO is solid, pause PgBouncer pools, wait for replication to be fully caught up, then point writes at Green. Under the hood, “full” just means “RO first, then RW.”
PgBouncer routes traffic: switch read‑only first, then read‑write.
PgBouncer tricks. Running it on Kubernetes has its quirks. One big one: how do you safely reschedule pods without jolting the on‑call at 3 a.m.? Our pain point was ConfigMaps. Redeploying them was too slow. The fix: pre‑mount all configs on every pod, and when it’s time to switch, skip redeploy and just fire off a few admin commands.
SET conffile = '/etc/pgbouncer/pgbouncer_new_rw.ini';
RELOAD;
No pod restarts, no reconciliation delay.
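In script form the whole flip is a handful of admin‑console statements, shown here combined with the pause/resume described above (connection details are illustrative):

PGB_ADMIN="host=pgbouncer port=6432 dbname=pgbouncer user=pgbouncer_admin"   # illustrative

psql "$PGB_ADMIN" -c "PAUSE;"                                                # let in-flight queries finish, hold new ones
psql "$PGB_ADMIN" -c "SET conffile = '/etc/pgbouncer/pgbouncer_new_rw.ini';" # point at the pre-mounted Green config
psql "$PGB_ADMIN" -c "RELOAD;"                                               # re-read config; pools now target Green
psql "$PGB_ADMIN" -c "RESUME;"                                               # release queued clients onto the new primary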
Mind the Gap (setval +100k)
- Advance early, outside the pause. Right before switchover we run a sync_sequences script which iterates all sequences from pg_sequences on Blue, reads each sequence’s last_value, and then bumps the same sequence on Green to last_value + sync_sequences_gap (default 100000). This keeps the flip window short.
- Exact mechanics. For each schemaname.sequencename:
-- On Green
SET transaction_read_only = OFF;
SELECT setval('<schema.seq>', <blue.last_value> + <gap>, true);
Why this is fine. We only need sequences to never collide after the flip. Pushing Green well ahead of Blue guarantees new inserts won’t reuse a value, and sequences are monotonic — no need to “pull back” on rollback.
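A simplified sketch of that loop, with illustrative connection strings and identifier quoting omitted for brevity:

GAP=100000   # sync_sequences_gap from the config

# Read every sequence's last_value on Blue and bump the matching sequence on Green by the gap
psql "$BLUE_URL" -Atc "SELECT schemaname || '.' || sequencename || '|' || COALESCE(last_value, 0) FROM pg_sequences;" |
while IFS='|' read -r seq last; do
  psql "$GREEN_URL" -c "SET transaction_read_only = OFF; SELECT setval('$seq', $((last + GAP)), true);"
done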
Handling Debezium During Switchover
The problem. Debezium ties each connector to a logical replication slot on the source (Blue). When you bring up the target (Green) and want the same data to keep flowing into the same topics, the slot cannot simply move with it: depending on the connector, you either accept a short overlap window where both Blue and Green emit the same changes, or you orchestrate a strict handover.
CDC to Warehouse (easy mode: dedup)
For CDC → Debezium → Snowflake, we already deduplicate downstream. So we intentionally create a small overlap:
Managing WAL overlap: both Blue and Green CDC slots accumulate changes, with a controlled overlap to ensure continuity.
- Keep Blue’s connector running
- Create Green’s connector with the same schema/topic config, start from the fresh slot on Green
- Allow a short duplication window so both produce the same tail of WAL
- Stop Blue’s connector

Result: Snowflake reconciles quickly and retains one copy of each row/version. Reliable, fast, and zero‑risk if dedup is in place.
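For the “create Green’s connector” step above, registration is a single call to the Kafka Connect REST API. A hedged sketch with illustrative connector and slot names (credentials and the table list omitted); in practice only the host, slot, and connector name differ from Blue’s config, while the topic prefix stays the same so events continue into the same topics:

curl -s -X PUT http://localhost:8083/connectors/cdc-green/config \
  -H 'Content-Type: application/json' \
  -d '{
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "localhost",
    "database.port": "5434",
    "database.dbname": "testdb",
    "slot.name": "cdc_slot_green",
    "plugin.name": "pgoutput",
    "snapshot.mode": "never",
    "topic.prefix": "testdb"
  }'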
Outbox/Event Streams (stricter)
Outbox topics care about ordering semantics. Our rule of thumb:
During the flip there should be no “funny ordering” — the consumer either sees an event it has already processed, or it’s a brand‑new event. We guarantee that by:
- Pausing writes on Blue right before the RW flip (brief), giving Debezium time to flush the tail (config‑controlled)
- Only then starting the Green connector
- Keeping topic keying and partitioning identical, so any duplicates are the same key in the same partition
Outbox connector cutover: old connector flushes WAL, new connector starts reading from the overlapped slot on the upgraded database.
If your consumers are idempotent (exactly‑once at the business level), this pattern is safe: duplicates are replays of the same message, never re‑ordered across the cut.
Slots & offsets. You don’t “move” a slot; you create a new slot on Green and start the connector there. The old connector keeps its Blue slot until you stop it. Kafka offsets continue per topic/partition and duplicates are identical payloads, not new keys.
Overlap discipline. Keep the window short. Long overlaps create needless churn and noisy metrics.
Sensible checks. Only flip RW after:
- Blue → Green replication lag under threshold
- Outbox writes paused and flush complete (give it time to drain)
- Green connector running and healthy

Rollback. If anything looks off, stop the Green connector, resume Blue, unpause writes. Because we haven’t changed keys/ordering, consumers remain consistent.
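The lag gate in that checklist boils down to one query against Blue’s slot. A sketch using the slot name and max_lag_bytes threshold that appear in the YAML config later in this article ($BLUE_URL is a placeholder connection string):

MAX_LAG_BYTES=20000   # replication.max_lag_bytes

LAG=$(psql "$BLUE_URL" -Atc "SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)::bigint FROM pg_replication_slots WHERE slot_name = 'dms_slot';")

if [ "$LAG" -gt "$MAX_LAG_BYTES" ]; then
  echo "Blue -> Green lag is ${LAG} bytes (limit ${MAX_LAG_BYTES}); refusing to flip RW" >&2
  exit 1
fi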
Orchestration Mode: Strict Ordering, Zero Confusion
When idempotency isn’t a sure bet for outbox consumers, we split the stream cleanly: old DB emits only “old” events, new DB emits only “new” events. No interleaving, no ambiguity.
How we mark events
- Add a boolean version column to outbox tables, e.g. use_pg17 BOOLEAN DEFAULT NULL (present on all partitions)
- Right before the flip, change the default on the target (Green) to TRUE (and keep it NULL/FALSE on source/Blue) — new writes on Green automatically carry use_pg17 = true
How we route events
- Create a new outbox connector on Green with a JSR‑223 Groovy filter that only passes rows where use_pg17 == true. The old connector keeps running on Blue but filters to “not true” (null/false). This cleanly divides the topic stream.
- Green filter: value.after.use_pg17 == true
- Blue filter: value.after.use_pg17 == null || value.after.use_pg17 == false
- In the connector config set:
transforms=filter,outbox
transforms.filter.type=io.debezium.transforms.Filter
transforms.filter.language=jsr223.groovy
transforms.filter.condition=<condition above>
transforms.outbox.type=io.debezium.transforms.outbox.EventRouter
snapshot.mode=never
The same configuration is applied to both the old and the new connector; only the filter condition differs.
Switchover choreography
- Prep: ensure use_pg17 exists on all outbox tables; the script checks this and fails fast if missing
- Flip default on Green: set use_pg17 DEFAULT TRUE so any new events after the write flip carry the tag
- Start the Green outbox connector with the use_pg17 == true filter; the old Blue connector still runs with the “not true” filter
- Pause writes on Blue, let Debezium flush and the old connector process current events, switch over to Green, resume writes — Green produces new (tagged) events; filtering removes duplicates from old
- Retire the Blue connector once the tail is drained; drop the filter on Green if desired (after Blue is off, everything is “new” anyway)

Connector lifecycle and config mutation are automated via the Connect REST API (see the sketch below).
Outbox orchestration: old and new connectors overlap, using a version flag to filter duplicate events during switchover.
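Two of those choreography steps in sketch form, using the outbox table names and Connect endpoint from the YAML config shown later ($GREEN_URL and the connector name are illustrative):

# Flip the default on Green so every post-flip insert is tagged as a "new" event
for tbl in outbox_events_1 outbox_events_2; do
  psql "$GREEN_URL" -c "ALTER TABLE $tbl ALTER COLUMN use_pg17 SET DEFAULT TRUE;"
done

# Once the Blue tail is drained, retire the old connector through the Connect REST API
curl -s -X DELETE http://localhost:8084/connectors/outbox-blue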
Why this works
- Topics never receive mixed “old/new” events during the window. Consumers either see already‑processed events (from Blue) or brand‑new events (from Green). There’s no reordering across the boundary because we pause writes, drain, then resume.
- Rollback is symmetric: flip the defaults back, adjust filters (the script handles forward/reverse conditions automatically).
The Greatest Teacher, Failure Is
No matter how carefully you plan, switchover days have a way of surfacing unknowns. We built in multiple safety nets to make sure “wrong” never meant “irreversible.”
CDC connectors are the easy part. If a Debezium CDC connector stumbles, our downstream (Snowflake) is already deduplicating on keys, so we can always re‑consume the stream. Worst‑case, we spin up a new connector with snapshot.mode=initial against a trimmed dataset, relying on table indices to cut over at a known safe point. Consumers will reconcile cleanly.
Outbox connectors are trickier. Because they carry business events, we can’t just blindly replay snapshots. That’s why we lean on idempotent consumers and, when possible, the Kafka Connect REST API to drop offsets. On older Connect versions, the fallback is heavier: create a fresh connector/slot pair, trim the table online to the events we care about, then re‑bootstrap. You can use something like my previous article as a recipe.
If the new database turns out to be unusable. Our orchestration doesn’t just go forward — it can reverse. Once the switchover completes, we can run the same scripts in reverse: switch PgBouncer back, flip replication slots/publications, and re‑attach connectors to the old cluster. In theory, you can loop this forward/backward multiple times until you’re satisfied.
Dry‑run mode as insurance. We added a “test catch‑up” mode in switchover.sh: it pauses writes, lets logical replication and slots catch up, but never actually flips traffic to the new cluster. This gave us two superpowers:
- Measure how long the pause would last under real workload
- Test whether the application tolerated read‑only windows

If it didn’t, we’d just resume writes on the source and treat it as a rehearsal, not a disaster.
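In command form, a rehearsal is just the entrypoint shown later in this article with the catch‑up test switched on (assuming enabled is the counterpart of the disabled value used there):

./run.sh switchover \
  --direction forward \
  --mode full \
  --debezium-mode orchestration \
  --transaction-mode wait \
  --test-catchup enabled \
  --config configs/switchover-config.yaml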
Takeaway: A zero‑downtime upgrade isn’t just about the happy path. It’s about building escape hatches: reset CDC, replay outbox carefully, reverse the switchover if needed, and dry‑run until you trust the process.
Everything Everywhere All at Once, known as YAML
One of the biggest lessons we learned after the first few upgrades was: you don’t want hardcoded scripts. Each database had its quirks — different connector sets, replica layouts, outbox usage — so we pushed everything into a YAML config and a single entrypoint.
Yeah, you can do modules with Bash :)
Our switchover.sh script is a thin wrapper around sub‑modules in lib/, reading from switchover-config.yaml. That config defines:
Source and target clusters
source:
internal_name: postgres_source
host: localhost
port: 5433
replicas:
- name: postgres_source_ro_1
- name: postgres_source_ro_2
target:
internal_name: postgres_target
host: localhost
port: 5434
replicas:
- name: postgres_target_ro_1
- name: postgres_target_ro_2
This lets us model not just primaries, but physical replicas, so PgBouncer can safely switch read‑only traffic separately from read‑write.
Database‑level details (slots, publications, subscriptions)
database:
name: testdb
user: testuser
password: testpass
logical_slot: dms_slot
logical_publication: dms_pub
logical_subscription: dms_sub
transaction_wait_time: 3 # allow txs to finish before going RO on Blue
PgBouncer pools and pre‑delivered switchover configs
pgbouncer:
switchover_rw_config_file: pgbouncer_new_rw.ini
switchover_ro_config_file: pgbouncer_new_ro.ini
Kafka Connect clusters and outbox mapping
kafka:
connect_clusters:
- name: cdc
url: http://localhost:8083
- name: outbox
url: http://localhost:8084
debezium:
outbox:
version_column: use_pg17
tables:
- outbox_events_1
- outbox_events_2
Replication tolerances (lag thresholds, sequence gaps, catch‑up timeout)
replication:
max_lag_bytes: 20000 # what to consider "no lag"
catchup_timeout: 10 # how long to pause to catch up
sync_sequences_gap: 100000 # safe gap for sequences
Why expose so many options? Because upgrades are never uniform:
- Sometimes you want duplication mode for Debezium (let duplicates flow, downstream deduplicates)
- Sometimes you need orchestration mode (JSR filters + version columns to separate old vs new events)
- Some teams only wanted to test read‑only switchover, without touching writes
- On the biggest DBs, we needed to pause transactions, wait for catch‑up, then flip. On smaller ones, we could skip the wait entirely.

Instead of maintaining a dozen scripts, we ended up with a single command:
./run.sh switchover \
--direction forward \
--mode full \
--debezium-mode orchestration \
--transaction-mode wait \
--test-catchup disabled \
--config configs/switchover-config.yaml
Every option maps directly to one of those operational decisions. This made the process reproducible, reversible, and explainable.
Results
Between January and February, we upgraded 20+ PostgreSQL databases from PG12 to PG17 without a single minute of downtime. Even the largest and most sensitive database — the one where we couldn’t assume idempotent consumers — was migrated safely using the orchestration trick with version columns and JSR filters.
And yes, after the final switchover, we had a cake. 🎂
More importantly, we ended up with a repeatable framework:
- The same scripts and configs could be applied across clusters
- Every option (duplication vs orchestration, test catch‑up vs full switchover) was codified and reusable
- Rollback and dry‑run modes gave us confidence under pressure
End‑to‑end switchover flow: from prewarming and replication slot setup, through pausing PgBouncer and Debezium orchestration, to resuming traffic on Green.
Conclusion
Zero‑downtime PostgreSQL upgrades are not just possible — they’re now our reality. With careful orchestration of CDC connectors, outbox event streams, replication slots, and PgBouncer traffic, we moved our production fleet forward to PG17 without disruption.
That said, the switchover is only half the story. The cutover itself is elegant, but the supporting work around it can still feel tedious:
- Disabling migrations while replicas catch up
- Tweaking PgBouncer configs
- Changing Terraform definitions for connectors
- Creating PRs, waiting for approvals, and fixing the inevitable typo

We’re not stopping here. The scripts and configs are good enough today that teams across Fresha can already run upgrades on their own by following the documented process. The next step is to take the human glue out of it:
- Templated configs
- PR generation
- Orchestration with AWS Step Functions
- And finally a workflow where upgrading Postgres feels like a safe, boring routine

We’ve proven this works at Fresha scale, and now we’re investing in making it a one‑button operation. When PG18 lands, we don’t want anyone on‑call to break a sweat.
Credits
- Anton Borisov — rule‑breaker and architect of the impossible; brought the idea nobody thought would fly, built the first script, solved the hardest DB hurdle, and set the pace as the project’s boldest innovator. Basically the one person you shouldn’t let near production… unless you want results.
- Emiliano Mancuso — Head of Infrastructure; believed in the process when others said it was impossible. Also doubles as our premier rubber duck.
- Jan Zaremba — leader of Infrastructure; encyclopedic knowledge and steady hand that made every iteration safer than the last.
- Robert Pyciarz — doomsday architect; mapped out every failure path and tucked Kubernetes into all the right corners.
- Paritosh Anand & Paweł Michna — deadline wrestlers; ran upgrades at crunch time and wrote down every step so the rest of us could sleep.
- Blend Halilaj — snapshot tactician; turned the idea of blue‑green RDS restores into a clean, repeatable workflow.
- Special encore for Rehan Ullah, Sam Stewart, and (again) Paweł Michna, the crew turning all of this into a one‑button upgrade right now.