Replicating a managed database service like Amazon RDS for PostgreSQL is not as simple as pointing pg_dump at it and calling it a day. The managed environment, designed for stability and security, imposes a unique set of constraints that require non-trivial engineering solutions. For the past three months, since SerenDB’s founding in September 2025, we’ve been working 12-hour days to build a high-performance, open-source database replication tool for SerenAI agentic backend services. In this post, we’ll distill our experience into five key technical takeaways for engineers working with AWS RDS, Amazon’s managed PostgreSQL service.
Lesson 1: State Dumps Require a Compatibility Layer
Standard and trusty tooling like pg_dumpall generates a perfect snapshot of a self-hosted cluster, but that snapshot is incompatible with Amazon RDS. Attempting to restore it verbatim fails because AWS RDS restricts superuser-only commands (ALTER ROLE ... SUPERUSER), privileged operations (GRANT pg_checkpoint), and modifications to certain GUCs (ALTER ROLE ... SET log_statement).
The engineering challenge is to transform this incompatible state dump into a portable format. We solved this by building a multi-pass sanitization pipeline that parses the SQL dump and comments out non-portable commands. It’s not just a simple filter; it’s a state-aware parser that has to understand context.
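To make that concrete, here is a deliberately simplified, single-pass sketch of the idea in Rust. The real pipeline is multi-pass and state-aware (it tracks statements that span multiple lines); the pattern list and the sanitize_dump helper below are illustrative assumptions, not our actual code.

```rust
use regex::Regex;

/// Statements RDS rejects for non-superuser roles. We comment them out
/// rather than delete them, so the sanitized dump stays auditable.
fn non_portable_patterns() -> Vec<Regex> {
    vec![
        Regex::new(r"(?i)^ALTER\s+ROLE\s+.*\s(NO)?SUPERUSER").unwrap(),
        Regex::new(r"(?i)^ALTER\s+ROLE\s+.*\sSET\s+log_statement").unwrap(),
        Regex::new(r"(?i)^GRANT\s+pg_checkpoint\b").unwrap(),
    ]
}

/// One pass over a dump: prefix non-portable statements with a SQL comment.
fn sanitize_dump(dump: &str) -> String {
    let patterns = non_portable_patterns();
    dump.lines()
        .map(|line| {
            if patterns.iter().any(|re| re.is_match(line.trim_start())) {
                format!("-- [rds-sanitized] {line}")
            } else {
                line.to_string()
            }
        })
        .collect::<Vec<_>>()
        .join("\n")
}
```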
This approach is essentially a compatibility layer for database state, allowing us to treat AWS RDS as just another PostgreSQL instance, despite its underlying limitations.
Lesson 2: Statically-Linked TLS is a Deployment Superpower
AWS RDS enforces SSL/TLS, making the choice of a TLS library critical. We started with native-tls, which dynamically links against the host system’s OpenSSL libraries. This created immediate CI/CD and deployment headaches: builds would fail on runners that didn’t have exactly the right version of OpenSSL installed, and it complicated producing portable, statically-linked binaries for distribution.
We migrated to rustls for two primary reasons:
- Build Portability: rustls is a pure Rust implementation, which allowed us to compile a single, dependency-free binary that runs consistently across different Linux distributions and container environments (e.g., Alpine, Debian).
- Security and Control: rustls provides memory safety guarantees and offers a more explicit, verifiable API for certificate handling, which is crucial when dealing with customer data.
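For illustration, here is a hedged sketch of the rustls wiring through the tokio-postgres-rustls adapter. The CA bundle filename and the connect_rds helper are assumptions, and exact builder calls vary slightly across rustls versions.

```rust
use std::{fs::File, io::BufReader};
use tokio_postgres_rustls::MakeRustlsConnect;

async fn connect_rds(
    conn_str: &str,
) -> Result<tokio_postgres::Client, Box<dyn std::error::Error>> {
    // Trust the Amazon RDS CA bundle (the local filename here is an assumption).
    let mut roots = rustls::RootCertStore::empty();
    let mut reader = BufReader::new(File::open("global-bundle.pem")?);
    for cert in rustls_pemfile::certs(&mut reader) {
        roots.add(cert?)?;
    }

    // No system OpenSSL involved: the TLS stack is compiled into the binary.
    let config = rustls::ClientConfig::builder()
        .with_root_certificates(roots)
        .with_no_client_auth();

    let (client, connection) =
        tokio_postgres::connect(conn_str, MakeRustlsConnect::new(config)).await?;

    // Drive the connection state machine on a background task.
    tokio::spawn(async move {
        if let Err(e) = connection.await {
            eprintln!("connection error: {e}");
        }
    });
    Ok(client)
}
```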
Lesson 3: The Network is Unreliable (Especially in the AWS Cloud)
Long-running database replications often involve periods where the connection is idle from the client’s perspective. On AWS, network infrastructure such as Elastic Load Balancers (and even some NAT gateways) enforces idle connection timeouts (e.g., 350 seconds for an NLB). These services silently drop idle TCP connections, causing the replication to fail with a cryptic “connection reset by peer” error hours into the process.
The solution is to operate under the assumption that the network is unreliable and proactively keep the connection alive. We implemented TCP keepalives directly in our connection logic. This is a network-layer fix for a problem that manifests as a database-layer failure. Keep-alives save lives!
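A minimal sketch of that configuration using tokio-postgres’s built-in keepalive options follows. The specific intervals are assumptions chosen to stay comfortably under an NLB’s 350-second idle timeout, and the interval/retries knobs may not be available in older tokio-postgres releases.

```rust
use std::time::Duration;
use tokio_postgres::Config;

fn rds_config(host: &str, user: &str, password: &str, db: &str) -> Config {
    let mut cfg = Config::new();
    cfg.host(host)
        .user(user)
        .password(password)
        .dbname(db)
        // Probe the connection before cloud middleboxes can drop it as idle.
        .keepalives(true)
        .keepalives_idle(Duration::from_secs(300))
        .keepalives_interval(Duration::from_secs(30))
        .keepalives_retries(3);
    cfg
}
```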
Lesson 4: Abstract Away Cloud-Specific Error Noise
Error handling is about creating useful abstractions. A raw tokio-postgres error might tell you “Connection refused”, but it won’t tell you why. In an AWS RDS context, the “why” is often specific to the cloud environment: a misconfigured security group, a bad AWS IAM policy, or connecting to the wrong endpoint.
We built a diagnostic layer on top of the raw database driver errors. This layer inspects the error message and provides actionable, RDS-specific advice. This transforms a generic network error into a specific, solvable problem for the user.
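As an illustration, here is a trimmed-down version of the idea. The matching heuristics and hint strings are examples, not our actual diagnostic rules.

```rust
/// Map a raw driver error to an RDS-aware hint the user can act on.
fn diagnose(err: &tokio_postgres::Error) -> String {
    let msg = err.to_string();
    let hint = if msg.contains("Connection refused") || msg.contains("timed out") {
        "Check that the RDS security group allows inbound 5432 from this host, \
         and that you are connecting to the instance endpoint."
    } else if msg.contains("no pg_hba.conf entry") {
        "RDS is rejecting the connection; verify sslmode and the database name."
    } else if msg.contains("password authentication failed") {
        "Verify the credentials, or the IAM authentication token if IAM auth is enabled."
    } else {
        "See the raw driver error above."
    };
    format!("{msg}\nhint: {hint}")
}
```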
Lesson 5: Reverse-Engineering the Managed Service Internals of AWS RDS
As we dug deeper, we realized that simply filtering a fixed list of commands wasn’t enough: every time we patched one case, another exception surfaced. RDS has its own internal metadata and objects that are not part of standard PostgreSQL, and these need to be identified and handled with surgical precision.
This is essentially a process of reverse-engineering. For example, we discovered that RDS uses its own internal tablespaces like rds_temp_tablespace, which don’t exist on other PostgreSQL instances. We also found that pg_dumpall would try to include the internal rdsadmin database. Our sanitization logic had to be extended to pattern-match and exclude these RDS-specific constructs.
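A small sketch of what those exclusion rules look like is below; the patterns cover only the two cases mentioned above, whereas the real list is longer and tracks RDS versions.

```rust
/// Heuristic check for RDS-internal constructs that must not reach the
/// target instance. Illustrative only; our pipeline matches many more cases.
fn is_rds_internal(line: &str) -> bool {
    let line = line.trim_start();
    // Internal tablespaces such as rds_temp_tablespace.
    line.contains("rds_temp_tablespace")
        // The rdsadmin maintenance database that pg_dumpall tries to include.
        || line.starts_with("CREATE DATABASE rdsadmin")
        || line.starts_with("\\connect rdsadmin")
        // RDS-managed roles.
        || line.contains("ROLE rdsadmin")
}
```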
This is an ongoing effort. With every new RDS version, there’s a risk of new internal objects or commands that will require us to update our compatibility layer. We stand at the ready.
Lessons Hard Learned
Working with a managed service like RDS is a constant exercise in navigating abstractions. The key takeaway is that you cannot treat it as a black box. You have to build tools that are aware of the managed environment’s specific constraints and behaviors. For us, this meant building a portable, dependency-free binary with robust networking, creating a compatibility layer for database state, and wrapping it all in an abstraction that provides actionable, context-aware diagnostics.
Explore and fork the replication code: