I have a third party incident management tool, that creates tickets from errors in the SQL logs.
Occasionally replication stops replicating without creating errors in the SQL logs.
I want to create a job that runs every 10 minutes or so to check on replication and then uses RAISERROR to start our ticketing (and alerting) process.
There are a number of stored procedures to Programmatically Monitor Replication that will show issues, and there is also a pretty good solution at Monitoring Transactional Replication in SQL Server(by Francis Hanlon 11 April 2013) that is good start on what I want. But would need to be tweaked a bit to meet my needs. Before I start reworking Francis’s solution I am wondering if there are any other solutions I might leverage on.
I have searched around here and google, and Francis’s solution is the only one I found that gets close to inhouse monitoring, without new third party tools.
Are there any solutions to monitor SQL replication real time with T-SQL?
- SQL 2008 to SQL 2019
- Mostly transactional replication
- Same AND cross server replication
TL:DR, there is a good reason, you are not finding any good solutions for what you want. It does not seem possible, within the constraints you have.
At first glance getting details from the existing stored procedures in Programmatically Monitor Replication putting them together in a table (temp or permanent) to use for tracking and reporting seems like a great idea. But when you start googling around you don’t find anything as simple as you imagined it would be.
The two top solutions google finds involve
OPENROWSET, but they have issues and ultimately most googling leads to How to Share Data between Stored Procedures, by Erland Sommarskog
It’s a method that is seemingly very appealing, because it’s very simple to use and understand… It Can’t Nest
Msg 8164, Level 16, State 1, Procedure BigSalesByStore, Line 8
An INSERT EXEC statement cannot be nested.
OPENQUERY and its cousin OPENROWSET
Moreover, it is not aimed at improving performance. It may save you from rewriting your stored procedure, but most likely you will have to put in more work overall – and in the end you get a poorer solution. While I’m not enthusiastic over INSERT-EXEC, it is still a far better choice than OPENQUERY.
The first thing I tried that should have been a really good start.
IF OBJECT_ID('tempdb..#SP_RMHP') IS NOT NULL DROP TABLE #SP_RMHP GO Create Table #SP_RMHP( publisher SYSNAME , distribution_db SYSNAME , status INT , warning INT , publicationcount INT , returnstamp CHAR (16) ) INSERT #SP_RMHP EXEC distribution.dbo.sp_replmonitorhelppublisher Select * from #SP_RMHP drop table #SP_RMHP
The problem is that, sometimes you get the error
An INSERT EXEC statement cannot be nested. and sometimes you don’t. In SQL 2017 I almost always get it. In SQL 2014 sometimes I don’t, sometimes I do. Very frustrating and not a path to solution.
So now what?
Focus on pulling data directly from the distribution database tables.
Data supporting the replication process is kept in tables on 4 different database. MSN details are here
- msdb Database
- Distribution Database
- Publication Database
- Subscription Database
In one of my replication instances I do not have access to the Subscription servers, so I focused on just the first 3.
The table msdb.sysreplicationalerts has some errors, but everyone I saw was also displayed in the SQL logs, I am looking for evidence of broken replication when there are no errors reported in the SQL logs. So not what I need, but might be helpful to others.
select top (100) * from msdb..sysreplicationalerts
I looked through all of the tables, examined the column descriptions and looked at the data durring uptime. I found a few potential source, so I set the subscription database offline, in theory this would duplicate an issue with the subscriber that did not directly impact replication or the publication.
With the subscriber database offline (for hours during this testing)
The replication monitor showed green, no Latency, and the Last Synchronization time was updated every couple of minutes to be current.
- Clearly Replication is not “working” as the subscriber is offline 🙁
A possible candidate is ‘MSreplication_monitordata’ (distribution database), but essentially shows the same thing as the replication monitor, I did not dig deeper, but it is probably the source for Last Synchronization time in the replication monitor
Select last_distsync , DATEDIFF ( MINUTE , last_distsync , GETDATE() ) as 'MinDifflast_distsync' , agentstoptime , time_stamp , DATEDIFF ( MINUTE , time_stamp, GETDATE() ) as 'MinDifftime_stamp' , GETDATE() as 'GETDATE' , agent_id --may want this for reporting , agent_name --may want this for reporting --, * from MSreplication_monitordata where agent_type = 3 --3=Distribution Agent
- The only other thing that looked promising is the table MSsubscriptions on the publication database. There are some sequence numbers there that differ when the subscriber is offline, unfortunately they still differ when you bring the subscriber database back online. Looking at one of live publication replications, there are 100’s that differ at any given time.
select subscription_seqno , publisher_seqno , * from MSsubscriptions where subscription_seqno <> publisher_seqno
Now I am going to look into Canary tables as discussed by Kendra Little in this article This is problematic, because it requires modification of databases (add the table) with can be problematic if you are responsible for monitoring, but prohibited from altering the user database.
Apex SQL has some information on monitoring the agents: https://solutioncenter.apexsql.com/how-to-monitor-sql-server-replication-agents-status/
I think this is more relevant for you though and even talks about a setting up a SQL agent job to continually monitor: https://www.mssqltips.com/sqlservertip/4892/monitor-sql-replication-log-reader-agent-latency/
From there you can RAISE ERROR in the job based on the results of the query.