System outages, hardware failures, or accidental data loss can strike without warning. What determines whether operations resume smoothly or grind to a halt is the strength of the disaster recovery setup. PostgreSQL is built with powerful features that make reliable recovery possible.
This post takes a closer look at how these components work together behind the scenes to protect data integrity, enable consistent restores, and help your database recover from a wide range of failure scenarios.
What is Disaster Recovery?
Disaster Recovery (DR) refers to a set of practices and strategies designed to back up and restore databases in the event of a disaster. In this context, a disaster means any event that renders the entire database environment unusable, such as:
- Cloud region outages – e.g., when AWS us-east-1 goes down and takes half the internet with it
- Physical disasters – fire, flood, earthquakes, or even a backhoe cutting fiber lines
- Catastrophic human error – like a faulty migration that corrupts critical tables
- Major security incidents – where you must rebuild from known good backups
- Power outages – extended downtime impacting availability
- Hardware failures – disks, memory, or server crashes
- Software failures – bugs, crashes, or corrupted processes
- Cyberattacks – ransomware, data breaches, or malicious tampering
- …and whatever else you can’t imagine!
The goal of DR is to ensure that the system can be quickly restored to a normal operational state in the event of an unexpected incident.
The Pillars of Business Continuity: RTO and RPO
Before designing a disaster recovery strategy, we must understand the two metrics that define it:
- RPO
- RTO
Recovery Point Objective (RPO) is the maximum amount of data you can afford to lose in a failure, measured in time. It answers the question, “When we recover, how far back in time will we be?” If you can only tolerate 5 minutes of data loss, your RPO is 5 minutes. RPO is primarily a Disaster Recovery metric, directly tied to your backup and WAL archiving strategy.
Recovery Time Objective (RTO) is the maximum acceptable time to restore service after a failure. It answers, “How long can the application be down?” If you need to be back online in 30 minutes, your RTO is 30 minutes. RTO is primarily a High Availability metric, often addressed through replication and failover mechanisms.
Both metrics are in tension with cost: the goal is to bring RTO and RPO as close to zero as possible, but a low RPO requires frequent backups and robust WAL archiving, which can be costly, and a low RTO requires on-demand compute resources and automation, which also adds cost. The art of DR planning is finding the balance between these opposing forces through risk management and cost-efficiency analysis.
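On the RPO side, for example, WAL archiving settings directly bound how much data you can lose. Here is a minimal sketch, assuming a hypothetical archive path; the values are illustrative, not recommendations:

```bash
# Illustrative only: archive every completed WAL segment and force a segment
# switch at least every 5 minutes, so the archive never lags more than
# roughly 5 minutes behind the primary (an RPO of ~5 minutes).
psql -U postgres <<'SQL'
ALTER SYSTEM SET archive_mode = 'on';
ALTER SYSTEM SET archive_command = 'cp %p /mnt/wal_archive/%f';  -- hypothetical archive location
ALTER SYSTEM SET archive_timeout = '5min';
SQL
pg_ctl restart -D "$PGDATA"   # archive_mode only takes effect after a restart
```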
The PostgreSQL Evolution
One of the most impressive aspects of Postgres is its flexibility and resilience right out of the box. Back in 2001, Postgres took a major leap forward by introducing crash recovery through Write-Ahead Logging (WAL), a milestone for data durability.
In 2005, the introduction of continuous backup and point-in-time recovery fortified Postgres through online physical backups and WAL archiving, enabling effective disaster recovery.
Over the following decade, Postgres evolved its continuous backup framework into the sophisticated replication system we know today, purpose-built to meet the high availability demands of modern organizations.
Before looking into Postgres backup and recovery infrastructure, let’s cover some fundamental concepts.
What is a Backup?
A backup is a consistent copy of the data that can be used to recover the database. Without backups, there’s no real disaster recovery plan. There are two main types:
- Logical Backups
- Physical Backups
Logical Backups
Logical backups capture the contents of the database by exporting it as SQL scripts or another portable format: a set of commands that can recreate the database’s structure (tables, schemas, constraints) and reinsert all the data. Common tools are:
- pg_dump – Creates a backup of a single database.
- pg_dumpall – Captures an entire cluster, including all databases, roles, and global objects.
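As a rough sketch of how these are typically used (the database name, user, and file paths are illustrative):

```bash
# Dump a single database in the custom format (compressed, restorable with pg_restore)
pg_dump -U postgres -Fc -f /backups/appdb.dump appdb

# Dump the entire cluster, including roles and other global objects, as plain SQL
pg_dumpall -U postgres -f /backups/cluster.sql

# Restore the single-database dump into a freshly created database
createdb -U postgres appdb_restored
pg_restore -U postgres -d appdb_restored /backups/appdb.dump
```

Because a logical restore replays SQL, it works across major versions and even across architectures, which is why this format is popular for migrations.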
Physical Backups
Physical backups are low-level copies of the database’s actual files at the storage layer. They capture the exact state of the database by copying its underlying data files directly. Common tools are:
- pg_basebackup – The standard tool for online physical backups.
- pgBackRest or Barman – These are external tools that add automation and better management for large-scale environments.
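A minimal sketch of an online physical backup with pg_basebackup (the host, user, and target directory are illustrative):

```bash
# Take a full online physical backup of the running cluster.
# -D         target directory for the copy
# -Ft        write tar-format output; -z compresses it with gzip
# -X stream  stream the WAL generated during the backup so the copy is self-consistent
# -P         report progress
pg_basebackup -h db.example.internal -U replicator \
  -D /backups/base_$(date +%F) -Ft -z -X stream -P
```

The connecting role needs the REPLICATION privilege, and the resulting base copy, combined with archived WAL, is what makes point-in-time recovery possible.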
Pros and Cons of Logical vs. Physical Backups
Here’s a quick comparison of how logical and physical backups differ in capabilities and use cases:
| Feature | Logical Backups | Physical Backups |
| --- | --- | --- |
| Data Copy | Yes | Yes |
| Backup Single Tables | Yes | No |
| Migrate Between Versions | Yes | No |
| Recover to a Point in Time | No | Yes |
| Incremental/Delta Restore | No | Yes |
In short, logical backups are flexible, portable, and ideal for migrations, while physical backups are faster, enable point-in-time recovery, and scale more effectively for large production environments.
Write-Ahead Logging (WAL)
To understand PostgreSQL recovery, we must understand WAL. It is the single most important component for durability.
In the infrastructure diagram below, there are four main components:
- Shared Buffers
- PGDATA
- pg_wal
- Postgres Backend
Let’s understand these first. Postgres stores data in 8-kilobyte pages within a directory called PGDATA. Transaction logs are stored in Write-Ahead Log (WAL) files, located in the pg_wal directory. Shared Buffers act as an in-memory cache to improve performance, and each client connection is handled by a dedicated process known as a Postgres Backend.
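For a concrete picture, here is roughly what those pieces look like on disk (the PGDATA path is illustrative and varies by installation):

```bash
# Top level of the data directory (PGDATA)
ls /var/lib/postgresql/data/
# base/  global/  pg_wal/  postgresql.conf  pg_hba.conf  ...

# WAL segments: fixed-size files (16MB by default) with sequential names
ls /var/lib/postgresql/data/pg_wal/
# 000000010000000000000001  000000010000000000000002  archive_status/  ...
```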
The golden rule of PostgreSQL is that any modification to a data page must be logged in the Write-Ahead Log (WAL) before the updated (“dirty”) page is written back to the data files.
When a backend requests a page from disk, that page is first loaded into the Shared Buffers before being returned to the backend. If the backend modifies the page, the change is first recorded in the Write-Ahead Log (WAL), not directly in the data files. This information is written to a WAL segment, which is why the mechanism is called Write-Ahead Logging, or simply WAL.
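You can watch this mechanism from SQL; a small illustration (the output will differ on every cluster):

```bash
# Current write position in the WAL stream, reported as an LSN (e.g. 0/16B3748)
psql -U postgres -c "SELECT pg_current_wal_lsn();"

# Name of the WAL segment file in pg_wal that contains that position
psql -U postgres -c "SELECT pg_walfile_name(pg_current_wal_lsn());"

# The data page size and WAL segment size this cluster was initialized with
psql -U postgres -c "SHOW block_size;"         # typically 8192 bytes
psql -U postgres -c "SHOW wal_segment_size;"   # typically 16MB
```

A commit only has to flush the WAL record to the current segment; the dirty pages in Shared Buffers can reach the data files later, which is exactly the golden rule described above.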