This problem has been haunting me for a while but I actually fixed the queries that caused the problems (after I finally found them) but i’m curious if it’s an SQL Process that’s the underlying problem our that our raidcontrollers are shit or what the problem truely is. So here’s the scenario that’s been bothering us.
@ 11:10~ and 01:10~ the server would give these IO errors, even through Remote desktop the servers would fail, management studio would give timeouts, everything would point towards imminent failure of the cluster.
(The servers are identical, we run Windows Server 2012 standard, clustered in an AlwaysOn running SQL Server 2012 Enterprise edition).
Around 5-15 minutes later, the servers would come back to live (without anything in the eventlogs except for the Event ID 833 which is the “well known”: SQL Server has encountered 16870 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [E:XXXXXXXX.mdf] in database [XXXXX] (10). The OS file handle is 0x0000000000001284. The offset of the latest long I/O is: 0x0000041fb3c000).
Now I was running profilers to find out which queries were running, doing anything I could to figure out what would cause the servers to just freeze up and refuse to service properly. The event log aims towards a specific Database to I searched in that direction and yesterday I found a table that looked odd to me.
The specific table was 11 gigs large, had indexes worth 23 gigs (haven’t looked into those yet but that seemed wrong to me) and had 80 million records. Considering the table should hold around 2 million and be growing relatively slowly this sparked my interested and I found three stored procedures which fill this table.
All three storedprocedures had flaws which made them insert not a few hundred records but over 500.000 each time they ran. (These are now fixed so the issue doesn’t present itself but I feel there is an underlying problem here).
These stored procedures would run @ 09:00 and 23:00 (excatly 2 hours before the problems occurred). The procedures themselves would take around 5 minutes to run to completion.
Now I’m wondering why it would take the servers two hours to start dying on me after the actual storedprocedures in question would already have been done. I don’t run any index maintenance at that point, or anything else for that matter (it’s the first thing I checked for). For all I know nothing runs @ the specific times. Could it be SQL Server does some ‘self maintenance’ and notices that some indexes need to be defragmentated or something even if I haven’t asked for any such thing?
Could it be our raidcontrollers (or even the SSD’s that are connected to them) that are flawed and after A LOT of activity decide to show signs of degration after 2 hours?
I simply can not explain this behavior and I’m worried our production servers might be unreliable (which would suck since they’re around 4 months old). Now i’ve fixed this issue last evening, and for 24 hours we haven’t had this problem (I wouldn’t call it ‘resolved’ unless it’s been stable for multiple days) but I’m still very curious what the hell my servers are doing.
This problem can be caused system performance issues, hardware errors, firmware errors, device driver problems, or filter driver intervention in the IO process.
What it means is that
- SQL server is doing more I/O’s that what the disk subsystem can handle.
- There could be some rogue process running on the system that is saturating the disks with all the I/O requests.
Below is the approach I would follow:
Troubleshoot using Wait Statistics – DMV’s
WITH [Waits] AS (SELECT [wait_type], [wait_time_ms] / 1000.0 AS [WaitS], ([wait_time_ms] - [signal_wait_time_ms]) / 1000.0 AS [ResourceS], [signal_wait_time_ms] / 1000.0 AS [SignalS], [waiting_tasks_count] AS [WaitCount], 100.0 * [wait_time_ms] / SUM ([wait_time_ms]) OVER() AS [Percentage], ROW_NUMBER() OVER(ORDER BY [wait_time_ms] DESC) AS [RowNum] FROM sys.dm_os_wait_stats WHERE [wait_type] NOT IN ( N'CLR_SEMAPHORE', N'LAZYWRITER_SLEEP', N'RESOURCE_QUEUE', N'SQLTRACE_BUFFER_FLUSH', N'SLEEP_TASK', N'SLEEP_SYSTEMTASK', N'WAITFOR', N'HADR_FILESTREAM_IOMGR_IOCOMPLETION', N'CHECKPOINT_QUEUE', N'REQUEST_FOR_DEADLOCK_SEARCH', N'XE_TIMER_EVENT', N'XE_DISPATCHER_JOIN', N'LOGMGR_QUEUE', N'FT_IFTS_SCHEDULER_IDLE_WAIT', N'BROKER_TASK_STOP', N'CLR_MANUAL_EVENT', N'CLR_AUTO_EVENT', N'DISPATCHER_QUEUE_SEMAPHORE', N'TRACEWRITE', N'XE_DISPATCHER_WAIT', N'BROKER_TO_FLUSH', N'BROKER_EVENTHANDLER', N'FT_IFTSHC_MUTEX', N'SQLTRACE_INCREMENTAL_FLUSH_SLEEP', N'DIRTY_PAGE_POLL', N'SP_SERVER_DIAGNOSTICS_SLEEP') ) SELECT [W1].[wait_type] AS [WaitType], CAST ([W1].[WaitS] AS DECIMAL(14, 2)) AS [Wait_S], CAST ([W1].[ResourceS] AS DECIMAL(14, 2)) AS [Resource_S], CAST ([W1].[SignalS] AS DECIMAL(14, 2)) AS [Signal_S], [W1].[WaitCount] AS [WaitCount], CAST ([W1].[Percentage] AS DECIMAL(4, 2)) AS [Percentage], CAST (([W1].[WaitS] / [W1].[WaitCount]) AS DECIMAL (14, 4)) AS [AvgWait_S], CAST (([W1].[ResourceS] / [W1].[WaitCount]) AS DECIMAL (14, 4)) AS [AvgRes_S], CAST (([W1].[SignalS] / [W1].[WaitCount]) AS DECIMAL (14, 4)) AS [AvgSig_S] FROM [Waits] AS [W1] INNER JOIN [Waits] AS [W2] ON [W2].[RowNum] <= [W1].[RowNum] GROUP BY [W1].[RowNum], [W1].[wait_type], [W1].[WaitS], [W1].[ResourceS], [W1].[SignalS], [W1].[WaitCount], [W1].[Percentage] HAVING SUM ([W2].[Percentage]) - [W1].[Percentage] < 95; -- percentage threshold GO
Look at the pending I/O’s
SELECT database_id ,file_id ,io_stall ,io_pending_ms_ticks ,scheduler_address FROM sys.dm_io_virtual_file_stats(NULL, NULL) t1 ,sys.dm_io_pending_io_requests AS t2 WHERE t1.file_handle = t2.io_handle
Below are PERFMON counters that will help you:
- Avg. Disk sec/Transfer => Time taken to perform the I/O operation
- Disk Bytes /sec => Total reads and writes to disk per second in bytes.
- Process:IO Data Bytes/Sec => Total reads and writes to disk per second in bytes by each process.
- Buffer Manager: Page Read/sec + Page Writes/sec =>Total reads and writes to disk per second in bytes by SQL Server process.
Also, sometimes such errors can also occur when a different Power Management scheme is selected for the server. So better check your Windows Power Plan setting, and make sure you are using the High Performance Power Plan.
All three stored procedures had flaws which made them insert not a few hundred records but over 500.000 each time they ran. (These are now fixed so the issue doesn’t present itself but I feel there is an underlying problem here).
If you are inserting a huge amount of data, then you need to refer to The Data Loading Performance Guide to cater your server for such operations e.g. enabling trace flag 610.
Note: Others have mentioned that Auto Update Stats might be causing it, which is also worth looking at, but I would primarily concentrate on finding out rouge process that is causing lots of Disk I/Os and checking the storage subsystem to see if all the divers, etc are up-to-date and is configured correctly.
Good reading links :
I might be teaching my grandmother to suck eggs here. If so, apologies.
DMVs (http://technet.microsoft.com/en-us/library/ms188754.aspx) could help you rule out a lot of things inside the RDMS. At the very least you’ll be able to see what is running when and where… assuming the DBAs let you have access.
If the RDMS turns out not to be the culprit, you could look at enabling some PerfMon perf counters or event collectors to see if you can identify any driver or app that may be hitting your OS (http://technet.microsoft.com/en-us/library/cc749249.aspx).
If you are using auto-update statistics, the update wouldn’t occur until a query attempts to use them. So you can have this large insert that would exceed a threshold, but then occur 2 hours later when a query that would use them is run. You could then experience long I/O requests or timeouts while they wait for the statistics to be updated. In addition, after statistics are updated the queries will recompile which also adds overhead.
You have options to change this behavior but any that you choose will have consequences that must be considered as well.
You mention that the stored procedures should be done. My guess is that without any transaction marshaling the data is accumulated and then finally committed about 2 hours later.
You mention you ran profiler, but I would look at Adam Machanic’s sp_whoisactive at http://whoisactive.com/
It will give you all sorts of information.