If your organization is standardizing on Kubernetes, this question shows up fast:
“Should PostgreSQL run on Kubernetes too?”
The worst answers are the confident ones:
- “Yes, because everything else is on Kubernetes.”
- “No, because databases are special.”
Both are lazy. The right answer depends on what you’re optimizing for: delivery velocity, platform consistency, latency predictability, operational risk, compliance constraints, and, most importantly, who is on-call when things go sideways.
I have seen PostgreSQL run very well on Kubernetes. I’ve also seen teams pay a high “complexity tax” for benefits they never actually used. This post is an attempt to give you a technical evaluation you can use to make a decision that fits your environment.
Start with the real question: are you running a database, or building a database platform?
This is the cleanest framing I have found:
- Running a database: You have a small number of production clusters that are business-critical. You want predictable performance, understandable failure modes, straightforward upgrades, and clean runbooks.
- Building a database platform: You want self-service provisioning, standardized guardrails, GitOps workflows, multi-tenancy controls, and a repeatable API so teams can spin up PostgreSQL clusters without opening tickets.
Kubernetes shines in the second world. VMs shine in the first.
Yes, you can do either on either platform. But the default fit differs.
A neutral comparison model: 6 dimensions that actually matter
Here is a practical rubric you can use in architecture reviews. Score each platform against the six dimensions from the introduction:
- Delivery velocity (provisioning speed, self-service)
- Platform consistency (one deployment model across workloads)
- Latency predictability (storage and network jitter)
- Operational risk (failure modes, debugging surface)
- Compliance constraints (isolation, auditability)
- On-call ownership (who debugs it at 2am, and with what expertise)
If you want a quick decision shortcut:
If your main goal is self-service and standardization, Kubernetes is compelling. If your main goal is predictable performance and a lower operational surface area, VMs or bare metal are compelling.
What Kubernetes adds (and why it’s both good and risky)
Kubernetes wasn’t designed primarily for databases. It was designed for scheduling workloads, handling health checks, rolling updates, and service discovery. PostgreSQL can run well there, but you typically stack multiple control layers:
- Stateful identity and scheduling
- Persistent volumes
- CSI/storage drivers
- Operators for lifecycle management
- Sidecars for backups/metrics/log shipping
That’s not inherently bad. It’s powerful. But each layer is another thing to understand, upgrade, monitor, and debug. There is also an ‘agony of choice’ when selecting an operator for lifecycle management: quite a few are available, and none are perfect.
The biggest Kubernetes “gotcha” for PostgreSQL isn’t that it doesn’t work. It’s that when something goes wrong, the failure analysis can shift from “what is Postgres doing?” to “which Kubernetes subsystem is influencing Postgres right now?”
A very common pattern: a performance incident that starts as “write latency spiked” turns out to be tied to eviction behavior, scheduling pressure, or storage-layer hiccups. Those are solvable problems, but only if you already have deep Kubernetes operational maturity.
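To make “is Kubernetes throttling me?” checkable rather than speculative, one quick place to look is the cgroup v2 `cpu.stat` file for the Postgres container. A minimal sketch (the helper name is mine; the field names match the kernel’s `cpu.stat` format):

```python
# Rough throttling check from cgroup v2 stats. On a node, this file lives
# under /sys/fs/cgroup/<pod-cgroup>/cpu.stat; cgroup v1 exposes the same
# counters via cpu.cfs_* files instead.

def throttle_ratio(cpu_stat_text: str) -> float:
    """Fraction of scheduler periods in which the cgroup was throttled."""
    stats = {}
    for line in cpu_stat_text.splitlines():
        key, _, value = line.partition(" ")
        if value.strip().isdigit():
            stats[key] = int(value)
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0
    return stats["nr_throttled"] / periods

# Illustrative sample, not real data:
sample = "usage_usec 8000000\nnr_periods 1000\nnr_throttled 240\nthrottled_usec 900000\n"
print(f"{throttle_ratio(sample):.0%} of periods throttled")  # 24% of periods throttled
```

A sustained ratio well above a few percent during incidents is a strong hint that limits, not Postgres, are the story.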
What VMs give you (and what they don’t)
VMs are boring in the best way: fewer abstraction layers between PostgreSQL and the hardware.
That usually means:
- More predictable latency (especially disk + network)
- Easier kernel-level tuning (huge pages, I/O scheduler, NUMA considerations)
- Simpler operational failure analysis (“the host is slow” is a real thing you can measure and act on)
- More straightforward incident response for teams that already have VM/host tooling
But VMs aren’t “free” either. The cost shows up in different places:
- Slower provisioning and less self-service
- More configuration drift risk (“snowflake servers”)
- More manual day-2 operations unless you build good automation
- Higher discipline required for patching, backups, and failover testing
The platform might be simpler; the process still needs maturity.
The performance reality: storage and network decide more than “K8s vs VM”
Most “Postgres on Kubernetes is slow” stories are really one of these:
- The storage class wasn’t suited for database workloads.
- CPU throttling or noisy neighbor effects were introduced through cgroups / limits / oversubscription.
- Network paths became less predictable (overlay, MTU issues, cross-zone routing).
- Failover / restart behavior wasn’t tested under real load.
Storage: the durability and jitter problem
PostgreSQL is very sensitive to storage behavior because it relies heavily on fsync semantics, WAL throughput, and predictable latency for sync writes. On bare metal or a well-provisioned VM, you can often get very stable performance by:
- Using fast SSD/NVMe
- Separating WAL and data volumes when appropriate
- Benchmarking with fio and Postgres tools (pg_test_fsync) before you commit to architecture
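Whichever tool produces the latency samples, summarize jitter rather than averages; a tail-to-median ratio tells you more about sync-write stability than mean IOPS does. A hedged sketch with made-up numbers:

```python
# Summarize sync-write latency samples (e.g. collected via fio or
# pg_test_fsync runs). The sample profiles below are illustrative only.
import statistics

def latency_summary(samples_ms):
    s = sorted(samples_ms)
    def pct(p):
        return s[min(len(s) - 1, int(p / 100 * len(s)))]
    return {
        "p50_ms": pct(50),
        "p99_ms": pct(99),
        "jitter_ratio": pct(99) / pct(50),  # a large ratio suggests unstable storage
        "mean_ms": statistics.fmean(s),
    }

# A stable NVMe-like profile vs. a spiky network-attached volume:
stable = [0.4] * 95 + [0.6] * 5
spiky = [0.8] * 90 + [12.0] * 10
print(latency_summary(stable)["jitter_ratio"])  # 1.49...
print(latency_summary(spiky)["jitter_ratio"])   # 15.0
```

Two volumes with identical averages can behave completely differently under WAL pressure; the ratio is what surfaces that.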
On Kubernetes, you can do this too, but you must be intentional:
- Prefer storage classes built for sustained IOPS and latency stability (not just “it supports PVCs”)
- Validate snapshot/restore behavior end-to-end (because snapshots that exist but can’t restore correctly are theatre)
- Consider dedicated node pools and careful volume placement if you’re chasing low jitter
Network: the “multi-region makes everything harder” lesson
Replication lag is a good example of why network matters more than platform ideology. In one benchmark study [1] (single-region vs multi-region), single-region replication lag averaged a few milliseconds, while multi-region averaged tens of milliseconds with occasional spikes under load. The big takeaway: geography and network dominate lag behavior far more than whether you run inside a pod or on a VM.
So if your decision is driven by “we want multi-region active-active,” focus on replication architecture and network reality first. Kubernetes won’t save you from physics.
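To make lag concrete: PostgreSQL reports WAL positions as LSNs, and the byte distance between the primary’s current LSN and a standby’s replay LSN is the lag. A small sketch of the same arithmetic `pg_wal_lsn_diff()` performs server-side:

```python
# An LSN prints as "XXX/YYYYYYYY": two hex numbers forming the high and
# low 32 bits of a 64-bit WAL position. Lag in bytes is a subtraction.

def lsn_to_int(lsn: str) -> int:
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def lag_bytes(primary_lsn: str, replica_lsn: str) -> int:
    return lsn_to_int(primary_lsn) - lsn_to_int(replica_lsn)

# e.g. pg_current_wal_lsn() on the primary vs. replay_lsn from
# pg_stat_replication (values here are illustrative):
print(lag_bytes("16/B374D848", "16/B3740000"))  # 55368
```

Bytes of lag per second, tracked over time, is the number to alert on; it is independent of whether the standby sits in a pod or on a VM.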
Reliability and HA: Kubernetes gives you rescheduling, not correctness
A controversial statement that’s still true:
Kubernetes gives you rescheduling. PostgreSQL needs correctness.
If a Postgres pod dies, Kubernetes will restart it. Great. But high availability for PostgreSQL is about:
- avoiding split brain
- promoting the right node at the right time
- fencing the old primary
- ensuring replicas are consistent
- ensuring client traffic shifts cleanly
- ensuring backups and restore paths are proven
Kubernetes can help you automate that with mature operators. VMs can help you automate it with mature HA tooling (Patroni/repmgr + a DCS + load balancers, etc.). In both cases, correctness comes from your HA design, your fencing strategy, and your tests, not from the platform’s marketing.
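The heart of that correctness, on either platform, is a leader lease with self-fencing: a primary that can no longer prove it holds the lease must stop accepting writes before anyone else is promoted. A toy sketch of that rule (all names are illustrative; this is the shape of the logic tools like Patroni implement, not a production HA manager):

```python
# Minimal "promote only while holding the DCS leader lease" sketch.

class LeaderLease:
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.acquired_at = None

    def acquire(self, now: float) -> None:
        self.acquired_at = now

    def is_valid(self, now: float) -> bool:
        return self.acquired_at is not None and (now - self.acquired_at) < self.ttl

def may_serve_writes(lease: LeaderLease, now: float) -> bool:
    # If the lease has expired, the node demotes itself *before* the DCS
    # can elect a new primary. That self-fencing step is what prevents
    # two writable primaries (split brain).
    return lease.is_valid(now)

lease = LeaderLease(ttl_seconds=30)
lease.acquire(now=100.0)
print(may_serve_writes(lease, now=110.0))  # True: lease still valid
print(may_serve_writes(lease, now=140.0))  # False: expired, stop accepting writes
```

The TTL is the core trade-off: shorter means faster failover but more false demotions under network blips.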
When Kubernetes is a strong fit for PostgreSQL
Kubernetes becomes a very rational choice when:
1. You already run a mature Kubernetes platform
- You have stable storage classes
- You have strong observability
- You have SREs who understand scheduling, disruption, and capacity planning
2. You want an internal “Postgres-as-a-service” model
- Developers request databases via a ticket/API and get guardrails by default
- Standardized backups, monitoring, parameter baselines, and security policies
3. You need many isolated Postgres clusters
- Multi-tenant environments where per-tenant isolation is valuable
- Frequent creation/destruction of clusters (CI, preview environments, ephemeral staging)
4. Your org operates with GitOps discipline
- Declarative config changes
- Reviewable diffs
- Automated drift detection
In these cases, the platform benefits can outweigh the complexity, because you’re actually using the platform benefits.
When VMs are a stronger fit
VMs tend to be the better choice when:
1. Your Postgres cluster is “crown jewel” infrastructure
- Latency-sensitive OLTP
- Predictable I/O behavior matters more than provisioning speed
2. You don’t have Kubernetes specialists on-call
- The fastest path to reliability is fewer moving parts, not more automation
3. You’re running a small number of large databases
- Dedicated instances, tuned for workload
- Scaling is mostly vertical and carefully planned
4. You need tight control over kernel + host settings
- NUMA behavior, huge pages, I/O scheduling, direct-attached NVMe, etc.
If you’re in this world, “boring infrastructure” is a feature.
Two reference architectures you can copy
Option A: Kubernetes with an operator (platform-oriented)
Key design choices:
- Use a mature Postgres operator for day-2 operations (backups, failover, upgrades)
- Use dedicated node pools for Postgres
- Use pod anti-affinity so replicas land on different nodes
- Use PodDisruptionBudgets so maintenance doesn’t take you down
- Keep backups off-cluster (object storage) and run restore drills
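As a rough illustration, the anti-affinity and disruption-budget pieces might look like this (all labels and names are placeholders; most operators generate equivalents for you):

```yaml
# Keep at least 2 of 3 pods up during voluntary disruptions (node drains).
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: postgres-main-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: postgres
      cluster: main
---
# In the pod template: force each replica onto a different node.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: postgres
            cluster: main
        topologyKey: kubernetes.io/hostname
```

The point of `minAvailable: 2` with three pods is that routine maintenance can evict at most one replica at a time, which keeps quorum intact.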
And your operator-managed cluster spec should include:
- explicit resource requests
- storage class selection
- monitoring enablement
- backup configuration
- replication settings
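As one concrete, hedged example of that spec, here is the shape using CloudNativePG’s `Cluster` resource. Field names follow its CRD, but verify them against the operator and version you actually deploy; storage class and bucket path are placeholders:

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: main-db
spec:
  instances: 3                       # one primary, two replicas
  minSyncReplicas: 1                 # replication settings live here too
  maxSyncReplicas: 1
  resources:
    requests:
      cpu: "4"
      memory: 16Gi
  storage:
    storageClass: fast-nvme          # placeholder: a DB-grade storage class
    size: 500Gi
  monitoring:
    enablePodMonitor: true
  backup:
    barmanObjectStore:
      destinationPath: s3://backups/main-db   # off-cluster object storage
  postgresql:
    parameters:
      max_connections: "200"
      shared_buffers: 4GB
```

Whatever operator you pick, the value is that this entire lifecycle is declarative and reviewable, which is exactly the GitOps fit described above.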
Option B: VMs with Patroni (database-runbook oriented)
Key design choices:
- 3-node cluster (1 primary, 2 replicas)
- Patroni for HA with a DCS (etcd/Consul)
- HAProxy for routing writes to primary and reads to replicas (optional)
- PgBouncer for connection pooling
- pgBackRest (or similar) for backups and PITR
- Monitoring stack: node metrics + Postgres metrics + log analysis
This model is widely understood, auditable, and tends to fail in more predictable ways.
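For orientation, a trimmed sketch of a per-node `patroni.yml` for that design (hosts, addresses, and credentials are placeholders; consult Patroni’s documentation for the full schema):

```yaml
scope: main-db                  # cluster name, shared by all three nodes
name: pg-node-1
restapi:
  listen: 0.0.0.0:8008
  connect_address: 10.0.0.11:8008
etcd3:
  hosts: 10.0.0.21:2379,10.0.0.22:2379,10.0.0.23:2379
bootstrap:
  dcs:
    ttl: 30                     # leader lease TTL: failover speed vs. flap risk
    loop_wait: 10
    retry_timeout: 10
    postgresql:
      use_pg_rewind: true       # lets a demoted primary rejoin as a replica
postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.0.0.11:5432
  data_dir: /var/lib/postgresql/16/main
  authentication:
    replication:
      username: replicator
      password: change-me       # placeholder; use a secret store
```

Every line here is greppable on a host during an incident, which is a big part of why this model fails in predictable ways.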
Common gotchas (the ones that create 2am incidents)
Kubernetes gotchas
1. CPU limits causing throttling
You can meet “CPU request” but still get throttled under burst if limits are too tight.
2. Pod evictions during load
Especially if PDBs, priorities, and eviction policies aren’t designed for stateful workloads.
3. Storage that looks fast on paper but has latency spikes
Sustained performance is what matters, not peak IOPS marketing.
4. Backups that exist but restores that fail
Test restores on a schedule as a drill, not during an incident.
5. Operator upgrades as a hidden dependency
Your database lifecycle now depends on the operator lifecycle.
VM gotchas
1. Unvalidated failover
You “have HA” but haven’t practiced it under load with real application behavior.
2. Backup confidence without restore drills
The only backup that matters is the one you restored successfully.
3. Configuration drift
Two replicas that aren’t actually identical are a slow-motion outage.
4. Noisy neighbor on shared hypervisors
“It’s on a VM” doesn’t mean you own the underlying contention story.
5. OS patching and reboots without a runbook
Routine maintenance becomes risky without clear procedures.
The punchline: choose the platform that matches your org’s operating model
My take is simple:
- Kubernetes is excellent when you’re building a database platform.
- VMs are excellent when you’re running a database.
Both can be production-grade. Both can be disasters. The difference is whether your organization is set up to operate the platform you choose.
If you want one practical recommendation that avoids regret, this is it:
Run dev/test Postgres on Kubernetes if it helps delivery speed. Run production Postgres where you can guarantee predictable storage, clear failure modes, and strong operational ownership. That might be Kubernetes, or it might not.
Related
[1] Benchmark Study on Replication Lag in PostgreSQL using Single Region and Multi-Region Architectures
[2] Is Your PostgreSQL Deployment Production Grade?
[4] Clustering in PostgreSQL: Because One Database Server is Never Enough (and neither is two)
[5] Database in Kubernetes: Is that a good idea?
[6] Databases on K8s — Really? [Part 1] [Part 2] [Part 3] [Part 4]