I am looking for some guidance on architecting our SQL Server 2016 SP1 Always On Availability Groups HADR solution on Windows Server 2012 R2. We have a primary site A with a primary replica and a secondary replica and a disaster recovery (DR) site B with a secondary replica and a file share witness. Our goal is that if the primary replica server 1 at site A is down, the Always On Availability Group (AG) fails over to the secondary replica server 2 at site A, and if both servers at site A are down, the AG fails over to site B.
We are attempting to do a Node and File Share Majority configuration per https://technet.microsoft.com/en-us/library/cc731739(v=ws.11).aspx and this diagram:
This diagram shows that when one node and the “disk” / file share witness are in communication the cluster runs, but in our testing of this situation, the cluster fails due to a loss of quorum of the WSFC. The automatic AG failover works if we test a failure of one server at a time by disabling the NIC in vmWare since SQL Server 2016 supports two automatic failover target replicas. However, it does not work if we fail both servers at site A at the same time to simulate a point-to-point network failure or a site power failure.
The following approach with manual intervention to force the quorum will work but it is not automatic, which is what we want in the ideal situation:
$node = "SQLServerC" Stop-ClusterNode –Name $node Start-ClusterNode –Name $node –FixQuorum ALTER AVAILABILITY GROUP SQLServerAO FORCE_FAILOVER_ALLOW_DATA_LOSS; $node = "SQLServerC" Stop-ClusterNode –Name $node Start-ClusterNode –Name $node
I appreciate any suggestions you can offer and thank you in advance!
Our goal is that if the primary replica server 1 at site A is down, the Always On Availability Group (AG) fails over to the secondary replica server 2 at site A, and if both servers at site A are down, the AG fails over to site B.
Except for a very specific scenario, this will not work. In all other scenarios you’ll lose quorum before you’d ever be able to automatically fail over (which requires… quorum).
The best answer would be to have the DR side be manual failover (it could still be synchronous) with the proper documentation on how to force quorum and bring the AG online.
You could also invest more into VMWare and use their technologies but that assumes the infrastructure, licensing, and ability to implement those products for a specific service such as this.
You should consider windows server 2016. blogs.msdn – introducing-cloud-witness
We have a primary site A with a primary replica and a secondary
replica and a disaster recovery (DR) site B with a secondary replica
and a file share witness.
Okay, you have 3 node cluster, 2 nodes are hosted in Site A and one hosted on Site B.
Site A has a primary replica (node1) and a secondary replica (node2) Site B has a secondary replica (node3)
And your quorum model is:
Node and File Share Majority.
By default, each node will have 1 vote and file share witness will have 1 vote.
The total quorum votes will be
node1 + node2 + node3 + file share witness = 1 + 1 + 1 + 1 = 4 and the current
quorum majority will be 4 out of 4, i.e. 100% survival.
Let’s assume there is network connectivity issue between subset (2,2) of the cluster nodes, now the quorum majority will be 2 out 4, i.e. 50 % chance for survival. The problem here is, each subset will think, it has majority and it will try to own the
resource groups to keep the cluster up and running, but
**Windows Server Failover Cluster (WSFC)** supports only one active nodes at a time, two subset of nodes trying to access to same resource end up in conflicting issue, this is called
even number of nodes in cluster is not the right approach.
SQL Server 2016 SP1 Always On Availability Groups HADR solution on
Windows Server 2012 R2
To avoid the above mentioned issue,
Windows Server 2012 R2 introduced a new feature called Dynamic Witness.
In your case, (starting from Windows Server 2012 R2) the default voting will be as follows:
node1 + node2 + node3 = 1 + 1 + 1 = 3 & File Share Witness = 0
Here, the total quorum votes will be 3 and majority is 3 out of 3. There is no voting for File Share Witness to keep the total voting as odd number.
If there is a failure of node, the cluster will assign a vote to the File Share Witness dynamically.
The process of assigning or revoking quorum vote to witness dynamically called as Dynamic Witness.
And the new quorum will be calculated dynamically post failure of node. For example, if a node is failed, and witness has a vote, the quorum will be 3 out of 3, i.e. node2 + node3 + File Share Witness.
The process of adopting quorum after subsequent failure is called Dynamic Quorum.
Our goal is that if the primary replica server 1 at site A is down,
the Always On Availability Group (AG) fails over to the secondary
replica server 2 at site A, and if both servers at site A are down,
the AG fails over to site B.
So let see the failure scenario for your case:
node1 node2 node3 File Share Witness Quorum Majority 1 1 1 0 3 out of 3 100 %
1 node failure:
node1 node2 node3 File Share Witness Quorum Majority 0 1 1 1 3 out of 3 100 % --> **`Dynamic Quorum and Dynamic Witness`**
In this case, the cluster become,
2 node cluster with witness.
Here, the default quorum configuration will be
node2 + node3 + File Share Witness = 1 + 1 + 1 = 3 out of 3 (because dynamically quorum updated as 3)
another node fails (node2)
node2 node3 File Share Witness Quorum Majority 0 1 1 2 out of 3 66.66 % --> It has majority - more than 50% - the cluster will survive
Now the new cluster will be:
node3 File Share Witness Quorum Majority 1 1 2 out of 2 --> Dynamically quorum updated as 2
In case of another node failure, the cluster will be down. I hope this is clear. Any node may fail in any order, but the quorum majority is calculated same as above.
To setting up DR, I suggest a simple solution:
Site A set up Always On Availability Group (AG) with
2 node cluster and configure quorum model as Node and File Share Majority.
Site B set up SQL Server Standalone instance and configure – automated
backup and restoration / log shipping from Site A.
There will be different options, this was the one, I have used earlier. In this case, failover should be performed manually. Consider your business requirement and design the DR strategy.
Few points to be noted while designing
- Recovery Time Objective (RTO)
- Recovery Point Objective (RPO)
- Retention Policy,
- Failover, etc…
I recommend you to check my page for better understanding. I hope this answer helps, thank you.