I have a quite busy (1000+ batch requests per second, 350+ workers on average) system built on sql server 2016. It’s an availability group with 3 synchronous replicas.
I’d like to redirect read requests on secondaries, but due to the redo lag (and the way the app works), I’m concerned about the fetched data being stale.
I know I can get the last hardened lsn and last redone lsn from sys.dm_hadr_database_replica_states – but is there any way to wait for the redo to catch up avoiding a while loop?
It sounds like you have a few different questions in here:
Could a query be modified to check to see if redo is current on the underlying data is done, and if not, stall execution until redo catches up? This seems like a really dangerous idea because when replication gets behind, you’ll have more and more queries using up worker threads, waiting for replication to catch up. I can easily see worker thread exhaustion hitting here and causing threadpool waits because it’s essentially the same scenario that blocking causes.
Could queries be redirected to the most current secondary? Yes, by using a load balancer between your app and the SQL Servers. You can build logic into load balancers to run periodic health checks, and when a server fails the load check, it’s taken out of the available connection list until it’s able to pass health checks again. This is a pretty common strategy for web servers, but fairly rare for database servers (since it’s a lot of work to build out.)
Can redo be done faster? Yes, test to see whether serial or parallel redo is a better fit for your workloads, and make sure you’re getting the one you want. Microsoft has blogged about the gotchas with parallel redo, including gems like only the first databases in your AG (by database ID) get parallel redo.