PostgreSQL is highly suitable for powering critical applications in all industries. However, to run critical applications, there are two key requirements which are absolutely essential: high availability and automatic failover.
This document explains which options are available and which problems can be solved with PostgreSQL. We have listed the most common setups along with the recommendations that go with them.
Scenario 1: Automatic failover with Patroni
The first scenario has been designed to provide a solution to the following requirements and scenario:
- Run servers on premise or on VMs
- Automatically handle server errors
- Make the cluster transparent to the application
- Handle recovery automatically
This is a typical scenario which is commonly seen in real-world deployments. An application has to rely on a highly available database, and in case of an error we simply want to switch to a backup system.
The solution in this case is a Patroni cluster. It uses distributed consensus based on etcd and provides fully automated failover as well as database recovery. The following image shows how this works:

The image shows a 3-node cluster. The primary server is able to handle read as well as write transactions. The remaining two servers will accept reads.
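To see which role a node currently has, you can ask PostgreSQL directly. The following check is plain PostgreSQL and does not assume anything Patroni-specific:

```
-- returns false on the primary, true on a read-only replica
SELECT pg_is_in_recovery();
```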
Advantages of this approach
This approach has a couple of advantages that are important to understand:
- The primary server is a single source of truth
- Replicas allow read scalability
- A “shared nothing” infrastructure is highly resilient
- Easy to scale as needed
- Replicas can exist in various locations
In general, achieving high availability is possible if you follow a simple design pattern: Keep it simple. The more complexity you add to the system, the more likely it is that something goes wrong or something impacts operations. Simplicity is a key advantage of this approach.
Transparent connection handling
When dealing with clusters, there is an important question which has to be answered: How does the application know where the desired server is? In general, there are various solutions to the problem:
- Multi-host connect strings
- vip-manager IP management
- haproxy routing
What does this actually mean? Let us walk through this step by step.
Multihost connect strings in PostgreSQL
Multihost connect strings mean that your application is made aware of ALL servers in the cluster. Simply pass all hosts to the application in the connect string and decide on the driver level what this means. Most drivers support this. Here are some examples:
```
postgresql://host1:123,host2:456/somedb?target_session_attrs=any&application_name=myapp
```
As you can see, “host1” and “host2” are both part of the connect string. By defining the target session attributes, you can tell the library whether to use any host, force a primary and so on. This type of connect string works for all languages that are based on the PostgreSQL C library (Python, Perl, C++ and so on). If you happen to use Java, the PostgreSQL JDBC driver offers a similar mechanism. Here is how it works:
```
jdbc:postgresql://node1,node2,node3/accounting?targetServerType=primary
jdbc:postgresql://node1,node2,node3/accounting?targetServerType=preferSecondary&loadBalanceHosts=true
```
The trick here is to set the targetServerType - it defines what type of connection you are looking for (reading, writing).
vip-manager IP management
The more elegant variation is the use of vip-manager, which is ideal if you have the chance to manage your own IPs. Let us again imagine our 3-node cluster: Each node has its own IP address BUT we add one more IP address, which is associated with the current primary.
In a real world setup it looks like this:
- Primary server: Normal IP + 1 floating IP
- Replicas: Normal IP
The application will only talk to the floating IP, making the cluster totally transparent to the application. vip-manager takes care of all the rest for you.
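From the application's point of view, the connect string simply points at the floating IP. The address below is just a placeholder, not something vip-manager prescribes:

```
postgresql://192.0.2.10:5432/somedb?application_name=myapp
```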
haproxy connection routing
Finally, we can use haproxy to handle connectivity. How does it work? haproxy runs on every node. You can connect to any node and haproxy will automatically route you to the desired primary. This makes it easy for the application to find its peer.
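Assuming haproxy listens on a dedicated port on each node and forwards to the current primary (the port below is purely an example and depends entirely on your haproxy configuration), the application connects like this:

```
postgresql://localhost:5000/somedb
```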
Keep in mind that haproxy adds one more network hop to your setup, which increases latency - if you are sending thousands of very small statements, this can be an issue.
Why can we not just send queries anywhere?
There is a question which customers keep asking. “Why can we not just send queries anywhere and let the database sort it out if we are dealing with a read or a write?” While the question is totally valid, it is a lot harder than people might think.
Consider the following query:
```
SELECT * FROM whatever;
```
Most people would say: Sure, this is a read …
Well, are you sure?
```
CREATE VIEW whatever AS SELECT writing_function();
```
In other words: By just looking at the SQL statement and by parsing it, we don’t know if we are dealing with reads or writes - this information is only available at runtime and therefore no middleware in the world can ever reliably make this decision.
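To make this concrete, here is a minimal sketch of such a situation. The table, the function body and the log entry are made up for the example:

```
-- a function that performs a write every time it is called
CREATE TABLE audit_log (id serial PRIMARY KEY, event text);

CREATE FUNCTION writing_function() RETURNS integer AS
$$
    INSERT INTO audit_log (event) VALUES ('somebody read from whatever')
    RETURNING id
$$ LANGUAGE sql;

CREATE VIEW whatever AS SELECT writing_function();

-- looks like a harmless read, but inserts a row into audit_log on every call
SELECT * FROM whatever;
```

On a streaming replica this statement would even fail outright, because a hot standby cannot execute writes.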
Scenario 2: Multi data center clusters
Once we have fully understood the first scenario, we can move to the next scenario: What if we want to spread our cluster across multiple data centers?
Here is what we want to solve:
- Have servers in multiple data centers
- Switch to a new data center in case of failure
- Make sure single node failures produce a new primary in the same data center
- Ensure that simplicity is maintained
The solution to this requirement looks as follows:

Patroni allows us to configure a “standby cluster”. What does this mean? A standby cluster is essentially an entire cluster following another entire cluster. If a node fails in DC1, a new primary will be found in DC1 - however, in case all of DC1 fails, DC2 can be promoted (manually) to be the new primary data center, turning the standby cluster into the production setup.
Some people might ask: “Why manual promotion? Didn’t you just say all is cool and automatic?” This is true within each cluster, but remember: we only have two data centers, and we always need an odd number of voters to make clear majority decisions - if the link between two equally sized data centers is cut, neither side has a majority. Therefore, the ideal solution needs a third location.
We have various options to do that: One option is to just use a single VM outside those two data centers to enable a majority decision in case one data center is gone:

Alternatively, we can of course also use a full-blown data center, taking care of an even more redundant setup:

Of course this needs a lot of hardware, but sometimes applications are simply so critical that the amount of hardware does not matter in the grand scheme of things. Just imagine something like air traffic control, emergency services, nuclear infrastructure and so on - all those things need extreme reliability and so hardware is more of a secondary concern.
Scenario 3: Geo-redundancy and multimaster
The next scenario is all about geo-redundancy. Consider the following requirements:
- Running offices in various countries
- Each office is allowed to write data
- Writes should happen locally
- Copies of all data should be everywhere
Here is what such a setup might look like:

While this sounds sane at first glance, it definitely comes with a couple of issues, which make this type of setup not that desirable in most scenarios. It makes sense to spend some time to understand why this is actually the case.
Handling replication conflicts
The most important problem every active-active system suffers from is related to replication conflicts. Why is that the case and what does it mean? Here is a picture outlining the general problem:

The scenario starts with a simple assumption: accountid = 123 has a balance of 100 Euros. Both servers agree that this is indeed correct. Now, two people want to withdraw 70 Euros in cash. Node 1 and Node 2 start to run the update and a couple of bank notes are issued from both ATMs. Both sides commit and data is sent to the other side after the commit has happened. This means that both sides have issued bank notes worth 70 euros each (= 140 in total). But: Due to the nature of the operation, only half the amount has been deducted from the balance.
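In SQL terms, both nodes run roughly the following statement at the same time. The table and column names are made up for the example:

```
-- executed concurrently on Node 1 and on Node 2
BEGIN;
UPDATE account SET balance = balance - 70 WHERE accountid = 123;
COMMIT;
-- each node computes 100 - 70 = 30 locally and ships the new row to the
-- other side only after COMMIT: 140 euros have left the ATMs, but the
-- stored balance has only dropped by 70
```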
When the change made by Node 1 finally hits Node 2 we figure out that there is not much money around (and the same happens when the change made on Node 2 is received by Node 1).
This is a huge problem - we have given out more money than we have actually accounted for and we cannot get the money back from the client.
The natural question is: Why does this happen? It does not happen on a single server, does it? This is true - but there is a major difference. If we are running concurrent transactions on the same machine, we can simply lock rows and synchronize our operations to end up with a correct answer (ACID compliant). However, this is totally impossible in case the system is spread all over the globe. The reason is simple: performance. To handle locking over such distances, hard requirements would have to be met which would lead to, say, 0.1 transactions per second or something along those lines. In short: The overhead would be so big that useful interaction would be literally impossible.
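For comparison, here is how a single server serializes the same two withdrawals with a row lock (again just a sketch with the made-up names from above):

```
BEGIN;
-- the second withdrawal blocks on this row lock until the first one commits,
-- then sees the already reduced balance and can refuse to hand out the cash
SELECT balance FROM account WHERE accountid = 123 FOR UPDATE;
UPDATE account SET balance = balance - 70 WHERE accountid = 123;
COMMIT;
```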
All long distance solutions we are aware of rely on asynchronous replication, which naturally leads to conflicts and inconsistencies.
Documented divergence conflicts
One might throw in an argument and say: But there are conflict resolution algorithms and they will take care of it. Well, this might be true in some cases, but not in all of them. Consider the following scenario:
| Node 1 | Node 2 | Comment |
| BEGIN; | BEGIN; | |
| UPDATE customer SET id = 19 WHERE id = 18; | UPDATE customer SET id = 17 WHERE id = 18; | both nodes make the change without knowing each other’s operation |
| COMMIT; | COMMIT; | both COMMIT operations succeed |
| … | … | … replication happens … |
| “row 18 is not found” | “row 18 is not found” | … divergence happens … |
Congratulations, we have just successfully created geo-redundant inconsistency. Often handbooks say that “the administrator has to fix that”. In our judgement this is pure fantasy. When dealing with billions of rows and thousands and thousands of changes per second, uncovering such inconsistencies is, well, incredibly tricky or even impossible.
It is also an illusion to assume that active-active will make things faster - it does not, because every node still has to write all the data. Writes can only be scaled by applying sharding, but never by writing everything to all database servers - all you can scale without sharding is reading.
Managing schema changes reliably
The next tricky problem with active-active concerns changes to the data structure. Consider a simple example:
```
ALTER TABLE x DROP COLUMN y;
```
Let us reflect on this a bit: The purpose of our cluster is to increase redundancy, right? In case a data link between two locations is not available, we want to keep working. Now: We cannot just run with inconsistent table definitions. One location might add a column, some other place might drop a column - what does this mean when data finally arrives after the connection failure? It is not hard to imagine that what you will be facing is just a big disaster.
The solution is that changes to data structures can only happen under carefully designed circumstances, which usually include (but are often not limited to):
- All nodes MUST be available
- No other writes are accepted while changes to table definitions are made
Sounds easy? In reality this can mean several minutes of downtime, which is exactly what we want to prevent.
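To illustrate why writes have to stop: even on a single node, such a change takes an exclusive lock on the table, and in an active-active setup this coordination has to span every node at the same time. A small sketch, reusing the table from the example above:

```
BEGIN;
-- ALTER TABLE takes an ACCESS EXCLUSIVE lock:
-- no reads or writes on "x" are possible until COMMIT
ALTER TABLE x DROP COLUMN y;
COMMIT;
```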
Don’t get us wrong - similar issues can of course happen in active-passive setups, BUT the requirements and implications are by far less punishing than in an active-active setup, which comes with a ton of additional complexity.
Mitigating potential problems
However, active-active is not only a problem - there are scenarios where it can make sense. One scenario that comes to mind: imagine a workload without conflicts. Here is an example:
- Interest rates are centrally set by the London office
- Everybody else has no permissions
- Everybody else is just a receiver
- Employees in Germany can only edit German customers
- Employees in Stockholm can only edit Swedish customers
While we are still replicating data back and forth, the chances of a conflict have been reduced to 0 by design. Sure, an employee in Stockholm might see slightly outdated data but in most cases this is fine.
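One way to enforce such “writes stay local” rules inside PostgreSQL is row-level security. The following sketch assumes a country column and a per-office role, neither of which appears in the original setup:

```
-- hypothetical: the Stockholm office role may only touch Swedish customers
ALTER TABLE customer ENABLE ROW LEVEL SECURITY;

CREATE POLICY stockholm_only ON customer
    FOR ALL
    TO stockholm_office
    USING (country = 'SE')
    WITH CHECK (country = 'SE');
```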
Note that this is one of the rare scenarios where active-active can make sense. In 95% of all cases, a single primary is definitely the recommended scenario.
Finally...
If you are not sure which setup is ideal for you - just reach out to us and we can explain in detail what makes sense in your specific scenario. Keep in mind that low complexity is your friend and it makes sense to think about the specifics of your setup in detail before introducing components that can hurt you in the long run.