Recommendations to prevent failover (scheduled)

Posted on

Question :

I have a two node SQL Server failover cluster with availability group. This is used for SharePoint.
SharePoint rebuilds the indexes where necessary every night. This routinely causes failover.
I can’t get this resolved with the infrastructure team, so I want to mitigate.
Since the index rebuild is scheduled, I am considering:

  1. Suspend secondary node for duration (1 hour)
  2. Change failover from automatic to manual for duration (1hour)

Anything else? Which is best? I know it’s not ideal and the root cause should be fixed but I am out of options.

Thanks
Andrew

Answer :

The AG usually fails over when the secondary and witness can’t talk to the primary for a length of time. (I’m simplifying that a little, but for more technical details, check out MS’s documentation on cluster quorum.)

To fix this, I’m going to suggest a few different things. You can do one, two, or all of ’em – just start with the things that are easiest for you, and add more as necessary.

First, leave some breathing room between index rebuilds. If you’re using the built-in maintenance plans to rebuild indexes, consider switching to Ola Hallengren’s maintenance scripts instead. (They do backups, checkdb, and index rebuilds, but for now, let’s only focus on your index rebuilds.) They operate via a stored procedure, and one of the parameters for the IndexOptimize stored proc is Delay. By default, there’s no delay between index rebuilds, but you would want to add a 5-10 second delay. This will give your IO subsystem and network some time to clear the air.

That by itself may not be enough, so keep reading.

Next, consider tweaking the default AG timeouts. In sections 7-8 of this Microsoft article, they explain how the default timeouts may be too quick for low-quality network connections. In the comments, you also noted that you’re hitting lease timeouts, so check out these additional resources:

  • Microsoft support post: How It Works: SQL Server Always On Lease Timeout – in which they write, “If the lease timeout is expiring you likely have a system wide, impacting event taking place that needs to be resolved. If you are experiencing lease timeouts troubleshoot the entire computer you are likely encountering a broader issue than just a SQL Server lease timeout.” My money’s on this – since you’re seeing it during index rebuilds, I’m guessing you have an issue that makes SQL Server unresponsive during that time. For community help with that, follow the instructions in Getting Help with a Slow SQL Server, especially the last section around including sp_Blitz’s results in Markdown format in your question.
  • Improvements in 2012/2014 patches for lease timeouts – KB #3112363 adds better diagnostic data for lease timeouts in 2012 SP3 and 2014 SP2. This is only going to give you more troubleshooting data – it’s not going to fix the timeouts. (The link I posted above is more about fixing the timeouts long term by fixing the underlying performance problem.)

Then, consider adding network capacity. I’m guessing your servers have just one 1Gb Ethernet port each, and it’s getting saturated during your index rebuilds. You could consider upgrading to teamed cards with load balancing, or to 10Gb cards, or to separate heartbeat network cards. I’m not usually a big fan of separate heartbeat networks because they increase complexity, but that might be the cheapest/easiest solution here. (Start with teaming first though.)

The above steps will let you keep your automatic failover AND your index rebuilds. However, I’d be a bad database guy if I didn’t suggest that maybe rebuilding every index, every night, is a little bit overboard. Not only is it causing your AG problems, but it’s also inflating your tranaction logs, causing those backups to take longer too. For more on that, watch my GroupBy session, Why Defragmenting Your Indexes Isn’t Helping.

Couple of things to add on Brent’s suggestions :

  • Switch to Async when starting Index maintenance routine and monitor the log_send_queue_size. If it exceeds certain threshold (you have to test it as per your environment) then do not proceed with rebuild instead wait for it to go down and then resume index maintenance. You can also put some time limit and carry on the operation. Make sure to switch back to SYNC – your original setting.
  • Use MAXDOP =1 when rebuilding indexes and ONLINE=ON options. Also in newer versions, you can leverage WAIT_AT_LOW_PRIORITY, MAX_DURATION,ABORT_AFTER_WAIT options.
  • Use high performance power plan to help for the single threaded redo process on replicas.

Make sure to check max memory as I have encountered that due to memory pressure, the AG redo thread dies leaving AGs not syncing causing all sort of issues.

Doing Index rebuilds, should not cause AG failover unless you are running out of worker threads or hitting an unknown bug in sql server.

I agree with Brent that doing Index rebuilds daily tells you to look into adjusting fill factor and investigate if it is really necessary to do it ?

Ref: Recommendations for Index Maintenance with AlwaysOn Availability Groups

Leave a Reply

Your email address will not be published. Required fields are marked *