Question :
I am looking for some guidance on architecting our SQL Server 2016 SP1 Always On Availability Groups HADR solution on Windows Server 2012 R2. We have a primary site A with a primary replica and a secondary replica and a disaster recovery (DR) site B with a secondary replica and a file share witness. Our goal is that if the primary replica server 1 at site A is down, the Always On Availability Group (AG) fails over to the secondary replica server 2 at site A, and if both servers at site A are down, the AG fails over to site B.
We are attempting to do a Node and File Share Majority configuration per https://technet.microsoft.com/en-us/library/cc731739(v=ws.11).aspx and this diagram:
This diagram shows that when one node and the “disk” / file share witness are in communication the cluster runs, but in our testing of this situation, the cluster fails due to a loss of quorum of the WSFC. The automatic AG failover works if we test a failure of one server at a time by disabling the NIC in vmWare since SQL Server 2016 supports two automatic failover target replicas. However, it does not work if we fail both servers at site A at the same time to simulate a point-to-point network failure or a site power failure.
The following approach with manual intervention to force the quorum will work but it is not automatic, which is what we want in the ideal situation:
$node = "SQLServerC"
Stop-ClusterNode –Name $node
Start-ClusterNode –Name $node –FixQuorum
ALTER AVAILABILITY GROUP SQLServerAO FORCE_FAILOVER_ALLOW_DATA_LOSS;
$node = "SQLServerC"
Stop-ClusterNode –Name $node
Start-ClusterNode –Name $node
I appreciate any suggestions you can offer and thank you in advance!
Answer :
Our goal is that if the primary replica server 1 at site A is down, the Always On Availability Group (AG) fails over to the secondary replica server 2 at site A, and if both servers at site A are down, the AG fails over to site B.
Except for a very specific scenario, this will not work. In all other scenarios you’ll lose quorum before you’d ever be able to automatically fail over (which requires… quorum).
The best answer would be to have the DR side be manual failover (it could still be synchronous) with the proper documentation on how to force quorum and bring the AG online.
You could also invest more into VMWare and use their technologies but that assumes the infrastructure, licensing, and ability to implement those products for a specific service such as this.
You should consider windows server 2016. blogs.msdn – introducing-cloud-witness
We have a primary site A with a primary replica and a secondary
replica and a disaster recovery (DR) site B with a secondary replica
and a file share witness.
Okay, you have 3 node cluster, 2 nodes are hosted in Site A and one hosted on Site B.
Site A has a primary replica (node1) and a secondary replica (node2)
Site B has a secondary replica (node3)
And your quorum model is: Node and File Share Majority.
By default, each node will have 1 vote and file share witness will have 1 vote.
The total quorum votes will be node1 + node2 + node3 + file share witness = 1 + 1 + 1 + 1 = 4
and the current quorum majority will be 4 out of 4
, i.e. 100% survival.
Let’s assume there is network connectivity issue between subset (2,2) of the cluster nodes, now the quorum majority will be 2 out 4, i.e. 50 % chance for survival. The problem here is, each subset will think, it has majority and it will try to own the resource groups
to keep the cluster up and running, but **Windows Server Failover Cluster (WSFC)**
supports only one active nodes at a time, two subset of nodes trying to access to same resource end up in conflicting issue, this is called split-brain scenario
.
So having even number of nodes in cluster
is not the right approach.
SQL Server 2016 SP1 Always On Availability Groups HADR solution on
Windows Server 2012 R2
To avoid the above mentioned issue, Windows Server 2012 R2 introduced a new feature called Dynamic Witness.
In your case, (starting from Windows Server 2012 R2) the default voting will be as follows:
node1 + node2 + node3 = 1 + 1 + 1 = 3 & File Share Witness = 0
Here, the total quorum votes will be 3 and majority is 3 out of 3. There is no voting for File Share Witness to keep the total voting as odd number.
If there is a failure of node, the cluster will assign a vote to the File Share Witness dynamically. The process of assigning or revoking quorum vote to witness dynamically called as Dynamic Witness.
And the new quorum will be calculated dynamically post failure of node. For example, if a node is failed, and witness has a vote, the quorum will be 3 out of 3, i.e. node2 + node3 + File Share Witness. The process of adopting quorum after subsequent failure is called Dynamic Quorum.
Our goal is that if the primary replica server 1 at site A is down,
the Always On Availability Group (AG) fails over to the secondary
replica server 2 at site A, and if both servers at site A are down,
the AG fails over to site B.
So let see the failure scenario for your case:
Default:
node1 node2 node3 File Share Witness Quorum Majority
1 1 1 0 3 out of 3 100 %
1 node failure:
node1 node2 node3 File Share Witness Quorum Majority
0 1 1 1 3 out of 3 100 %
--> **`Dynamic Quorum and Dynamic Witness`**
In this case, the cluster become, 2 node cluster with witness
.
Here, the default quorum configuration will be node2 + node3 + File Share Witness = 1 + 1 + 1 = 3 out of 3 (because dynamically quorum updated as 3)
Let’s assume another node fails (node2)
node2 node3 File Share Witness Quorum Majority
0 1 1 2 out of 3 66.66 %
--> It has majority - more than 50% - the cluster will survive
Now the new cluster will be:
node3 File Share Witness Quorum Majority
1 1 2 out of 2 --> Dynamically quorum updated as 2
In case of another node failure, the cluster will be down. I hope this is clear. Any node may fail in any order, but the quorum majority is calculated same as above.
Having cloud witness will be better option, but it will not solve your problem, still it is just an option. Please check my page for deeper understanding of Windows Server Failover Cluster Quorum
To setting up DR, I suggest a simple solution:
On Site A set up Always On Availability Group (AG)
with 2 node cluster
and configure quorum model as Node and File Share Majority.
On Site B set up SQL Server Standalone instance
and configure – automated backup and restoration / log shipping from Site A.
There will be different options, this was the one, I have used earlier. In this case, failover should be performed manually. Consider your business requirement and design the DR strategy.
Few points to be noted while designing HADR solutions
:
- Recovery Time Objective (RTO)
- Recovery Point Objective (RPO)
- Scalability
- Retention Policy,
- Failover, etc…
I recommend you to check my page for better understanding. I hope this answer helps, thank you.