I have a question about how Patroni and etcd handle the following scenarios.
Assume a deployment with two sites, Site1 and Site2, and a five-node cluster spread across them. This is a primary/replica HA deployment.
- Site1 has
S1node1(Patroni + etcd leader + PostgreSQL master),
S1node2(Patroni + etcd follower + PostgreSQL replica) and
S1node3(Patroni + etcd follower + PostgreSQL replica)
- Site2 has
S2node1(Patroni + etcd follower + PostgreSQL replica) and
S2node2(Patroni + etcd follower + PostgreSQL replica).
My understanding is that since the etcd leader is in Site1 and quorum is (5/2)+1 = 3, in the case of a network split between the two sites, Site1 will continue to accept updates since it holds a majority and meets quorum. Site2 will stop accepting updates since quorum is not met. Once the network glitch is resolved, the Site2 nodes simply rejoin the cluster; no action is required.
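The quorum arithmetic above can be sketched in a few lines (a plain illustration of the majority rule, not Patroni or etcd code):

```python
def quorum(members: int) -> int:
    """Minimum number of etcd members needed to form a majority."""
    return members // 2 + 1

cluster_size = 5
site1_nodes = 3
site2_nodes = 2

print(quorum(cluster_size))                 # 3
print(site1_nodes >= quorum(cluster_size))  # True  -> Site1 keeps accepting writes
print(site2_nodes >= quorum(cluster_size))  # False -> Site2 loses quorum
```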
If for some reason the entire Site1 went down (a disaster scenario), then Site2 is left with only two nodes, which do not meet quorum. There will be no etcd leader, which means Patroni cannot promote a PostgreSQL replica to master at Site2.
Now the question is: how can we ensure that a Site2 replica can be promoted to master? Does Patroni take care of this scenario?
Let me just copy the answer and the follow-up remarks from the main Patroni developer and maintainer:
You got nearly everything right. If Site1 is down, Site2 will not promote due to the lack of quorum. There is no way to recover from such a situation automatically.
The only solution would be to run one etcd node somewhere else, say Site3. You don't have to run Postgres there. Keep the number of etcd members odd.
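To see why a single etcd member on a third site helps, one possible layout (assumed here purely for illustration) is two members on each data site and one on Site3. Losing any one site then still leaves at least three of five members, so quorum survives:

```python
def quorum(members: int) -> int:
    """Minimum number of etcd members needed to form a majority."""
    return members // 2 + 1

# Hypothetical 2/2/1 member distribution across three sites.
sites = {"Site1": 2, "Site2": 2, "Site3": 1}
total = sum(sites.values())  # 5 members, quorum = 3

for lost_site, members in sites.items():
    surviving = total - members
    # Every line prints True: no single-site loss breaks quorum.
    print(lost_site, surviving >= quorum(total))
```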
Please note that the Postgres master doesn't necessarily have to run on the same node as the etcd leader. They are independent of each other.
In any case, with only two data centers it is sometimes better to run two independent etcd clusters, one on Site1 and one on Site2, and run a Patroni standby cluster on Site2. If the first site goes down, you can manually promote the standby cluster.
Automatic promotion in this case is not possible, because Site2 will never be able to figure out the state of Site1.
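For reference, a Patroni standby cluster is configured by adding a `standby_cluster` section to the cluster's dynamic configuration. A minimal sketch of the Site2 bootstrap config follows; the hostname and port are placeholders, and the full set of options is described in the Patroni documentation:

```yaml
# Site2 cluster bootstrap config (sketch; host/port are placeholders)
bootstrap:
  dcs:
    standby_cluster:
      host: site1-primary.example.com   # endpoint of the Site1 primary cluster
      port: 5432
```

To promote the standby cluster manually after Site1 is lost, remove the `standby_cluster` section from the dynamic configuration, for example with `patronictl edit-config`.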