Question :
New to AG – have inherited my first existing system. Have spent quite some time reading MS docs and watching a few training videos but a) concepts seem blurred (terminology similar between different technologies) and b) all this reading raises more questions than answers.
My question is essentially how to proceed with taking over and managing this AG, and how to prepare for, and carry out, some failovers in order to upgrade the OS on the servers?
- There are 3 replicas – SQL01, SQL02, SQl04. SQL01 is currently primary.
Firstly – is there a good script to give me an overview of AG configuration and health? I understand the console in SSMS gives me an overview, but is there no script that can automate the task and be used to detect or report on servers lagging behind, etc?
Secondly, how can I check whether any secondary has been set up for read-only routing, and/or backups? I believe backups are run only against SQL01, so what happens if/when there is a failover making SQL02 the primary?
Thirdly and probably related – if the AG exists to route queries to a given node, I can understand that all databases are kept in sync so that applications can continue querying and any failover between nodes should be transparent to the application, but if backups (or mailing of query/report results) are run by scheduled Agent jobs, how are these handled after failover? Is this something manually kept in sync by the DBA?
Fourth, along similar lines – is it conceivable that there are differences between databases across two nodes. ie. I’m not talking about differences that will synchronise automatically; I mean might someone have made a change – such as adding a trigger to a table – on a replica that won’t have replicated across all nodes?
Fifth – to start thinking about manual failovers, what things should be checked ahead of failover? I saw a question here talking about checking how quorum is formed – do I need some kind of report on Quorum setup before I fail over? If so, what am I looking for?
Sixth, another question here mentions checking the is_failover_ready column on sys.dm_hadr_database_cluster_states – but when I query that table, I’m told “invalid object name”. What gives?
Seventh To the specifics of OS upgrade, what is the basic approach – Should I, say, take SQL04 off the AG (or pause syncing), then upgrade its OS, restart the machine, then re-add to the AG / continue syncing? If so, how? And how do I know at the end of everything that all is ok? What would be the “rollback plan” if something doesn’t work out as hoped?
Eighth Continuing with the OS update, do I do the same thing for SQL02, then failover from SQL01 to SQL02, upgrade SQL01 OS then fail back? Again, how to determine each step is “successful”?
Ninth – what happens if I initiate a manual failover but SQL01 is in the middle of backing up 20 or so databases? Is there something to check for me ahead of time, on how database backups might be affected, or something to check after failover, to ensure database backups are still functioning as required, post failover? We have full nightly backups on all databases plus partial and incremental backups – but only on some databases.
Tenth – similarly, what if automated jobs are running at the time of failover? These might be Agent jobs updating data or running queries for mailout, or they might be a third-party BI system pulling data for ETL prior to import into the BI end. Is it safe to presume the BI systems will continue receiving their data post-failover, or should I be checking how they’re making their connections to the DB first? (Is it possible they are connecting to SQL01 specifically (or any other node really), in spite of the AG, so that when SQL01 goes down for the OS upgrade, the BI extract fails?
Like I said, have put a fair amount of reading into AGs overall, but not easy to find these big picture answers :/
Answer :
Firstly – is there a good script to give me an overview of AG
configuration and health? I understand the console in SSMS gives me an
overview, but is there no script that can automate the task and be
used to detect or report on servers lagging behind, etc?
Keep in mind that there are many things that are important regarding Always On Availability Groups that cannot be answered in one or multiple answers alone, below are some considerations.
The dashboard in ssms that you mention is a view into the AG, but you there are many columns unchecked. A few of these:
This dashboard is running queries behind the scenes, you could in theory trace your login with the profiler
/ extended events
and get that you are running with SSMS
.
An example of important monitoring is the secondary not being able to follow in redoing the log records, resulting in longer failover times.
This can be checked with the REDO QUEUE
size metric.
More on that here
Querying the redo queue size is not that hard:
SELECT redo_queue_size
FROM master.sys.dm_hadr_database_replica_states;
Some of the important dmv’s you should look through:
master.sys.availability_groups
master.sys.dm_hadr_database_replica_states
master.sys.availability_replicas
master.sys.dm_hadr_database_replica_states
Secondly, how can I check whether any secondary has been set up for
read-only routing, and/or backups? I believe backups are run only
against SQL01, so what happens if/when there is a failover making
SQL02 the primary?
This depends on your backup preferences in the AG properties:
In your case, either Primary
, Prefer Secondary
or Any replica
will be set.
If Primary
is set then you should change this to one of these two other options.
Even better is that you opt for a solution that is AG
aware, one free solution is Ola Hallengren’s Maintenance Solution Backups
You could check if read only routing is set up by querying the read_only_routing_url
column.
SELECT replica_server_name
, read_only_routing_url
, secondary_role_allow_connections_desc
FROM sys.availability_replicas
Thirdly and probably related – if the AG exists to route queries to a
given node, I can understand that all databases are kept in sync so
that applications can continue querying and any failover between nodes
should be transparent to the application, but if backups (or mailing
of query/report results) are run by scheduled Agent jobs, how are
these handled after failover? Is this something manually kept in sync
by the DBA?
Backups running on the primary will be rolled back when a failover occurs.
Not sure about when a secondary transfers to primary
You could add retry attempts to your jobstep / other process used to take backups.
Fourth, along similar lines – is it conceivable that there are
differences between databases across two nodes. ie. I’m not talking
about differences that will synchronise automatically; I mean might
someone have made a change – such as adding a trigger to a table – on
a replica that won’t have replicated across all nodes?
Operations on your primary Always On AG databases are transferred over to other nodes.
The trigger will be replicated in normal circumstances.
What you do need to manage yourself are Jobs and Logins.
You can find more of that in this post: Management of Logins and Jobs for the Databases of an Availability Group (SQL Server)
An idea for managing these is by using the commands Copy-DbaLogin and Copy-DbaAgentJob from dbatools to copy these jobs over.
Running these on a set schedule via a job is a possibility.
Fifth – to start thinking about manual failovers, what things should
be checked ahead of failover? I saw a question here talking about
checking how quorum is formed – do I need some kind of report on
Quorum setup before I fail over? If so, what am I looking for?
The main database health (Synchronized for Sync replica’s, Synchronizing for Async replica’s), redo_queue size, log_send_queue_size, if any big queries are running on the primary that will take a while to rollback.
Regarding quorum, starting from Windows Server 2012
you have something called dynamic quorum, meaning that votes are recalculated when the quorum changes. This is done by changing the vote state. For example when you have two nodes without a fileshare witness in your WSFC
, one of the secondaries will have a vote state of 0
instead of 1
.
Read up on dynamic quorum here
Sixth, another question here mentions checking the is_failover_ready
column on sys.dm_hadr_database_cluster_states – but when I query that
table, I’m told “invalid object name”. What gives?
I remember that question and asked him about it, that person meant sys.dm_hadr_availability_replica_cluster_nodes.
Seventh To the specifics of OS upgrade, what is the basic approach –
Should I, say, take SQL04 off the AG (or pause syncing), then upgrade
its OS, restart the machine, then re-add to the AG / continue syncing?
If so, how? And how do I know at the end of everything that all is ok?
What would be the “rollback plan” if something doesn’t work out as
hoped?
A main walktrough of patching AG
‘s can be found here
Do not remove these databases from the AG
, as synchronization might not work when log backups where taken. You would have to restore the log backups before being able to add them again.
Regarding the steps you should execute, there are some differences between what AG
setup you have.
When you have only two nodes to remember to change the failover mode to manual because you don’t want unexpected failovers while you are patching this secondary.
If you have only two possible secondaries you should do the same when patching these nodes. In short, always set the failover mode of secondaries that are being patched to manual.
The other considerations before turning of the secondary / failing over align with these for a manual failover process discussed earlier.
After the node comes back online, validate the synchronization & database states.
The rollback plan would be that you still have the other nodes, re-add the databases to the node that failed.
Eighth Continuing with the OS update, do I do the same thing for
SQL02, then failover from SQL01 to SQL02, upgrade SQL01 OS then fail
back? Again, how to determine each step is “successful”?
Answered in the answer to question #8.
Ninth – what happens if I initiate a manual failover but SQL01 is in
the middle of backing up 20 or so databases? Is there something to
check for me ahead of time, on how database backups might be affected,
or something to check after failover, to ensure database backups are
still functioning as required, post failover? We have full nightly
backups on all databases plus partial and incremental backups – but
only on some databases.
Backups on the primary will rollback on failover, you would have to add retry logic.
Not sure about when a secondary transfers to primary
Tenth – similarly, what if automated jobs are running at the time of
failover? These might be Agent jobs updating data or running queries
for mailout, or they might be a third-party BI system pulling data for
ETL prior to import into the BI end. Is it safe to presume the BI
systems will continue receiving their data post-failover, or should I
be checking how they’re making their connections to the DB first? (Is
it possible they are connecting to SQL01 specifically (or any other
node really), in spite of the AG, so that when SQL01 goes down for the
OS upgrade, the BI extract fails?
These uncommitted transactions will be rolled back and the job will fail, you would have to add retry logic.
If it is a big operation, consider splitting it up in smaller parts.