What could be the cause for my high disk IO latency (from sys.dm_io_virtual_file_stats DMV)?

Question :

These are the results from querying the sys.dm_io_virtual_file_stats DMV. (I added some calculated fields for clarity.)

[Screenshot: sys.dm_io_virtual_file_stats query results]
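The query was roughly along these lines, with average latency calculated as stall time divided by I/O count (the column aliases here are approximate, not necessarily the exact ones in the screenshot):

SELECT  vfs.database_id,
        vfs.file_id,
        vfs.num_of_reads,
        vfs.num_of_writes,
        vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)  AS avg_read_latency_ms,
        vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_latency_ms
FROM    sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
ORDER BY avg_read_latency_ms DESC;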

I’m finding certain basic operations in Database 7 are bottlenecking core operations I have running in Database 6.

For example, I have a table in Database 7 that isn’t used anywhere and has a lot of rows (about 300 million). In Database 6, a bulk insert of hundreds of thousands of records runs every hour throughout the day.
If I run a simple SELECT COUNT(1) FROM Database7.dbo.UnusedTableInDatabase, it takes about 10 minutes to return and appears to eat up a lot of the server resources, I/O in particular. Meanwhile, my continuously running bulk insert job on Database 6 slows almost to a halt and a backlog starts to build up.

One thing I know is that Database 7 has its data and log files on the same logical drive.
Database 6 is set up appropriately, with its data and log files on separate drives.

Where should I start digging next to try to resolve these issues?

UPDATE: My server is hosted on AWS with the Provisioned IOPS SSD (io1) drive. It sounds like I’m supposed to be seeing up to 64,000 IOPS per volume, but I’m seeing about 2,000 IOPS right now, whereas another server I have on AWS with the General Purpose SSD (gp2) drive is giving me about 6,000 IOPS.

Answer :

My problem with this DMV is that its counters are cumulative since the instance last started, so it lacks the critical context of time and makes it difficult to pinpoint when the spikes in latency occur. If you’re running a VM rather than an RDS instance, I’d recommend setting up a Perfmon trace to collect disk latency metrics so you can narrow down the times. If that’s not an option, push the results of your query to a user table every 5 minutes or so and then trend out the deltas so you can see which periods of time experience spikes.
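If you go the user-table route, a minimal sketch looks something like this (the table and column names are just placeholders for illustration): capture the raw cumulative counters on a schedule, then diff consecutive samples to find the spike windows.

-- Snapshot table for the cumulative DMV counters
CREATE TABLE dbo.FileStatsSnapshot
(
    capture_time       datetime2 NOT NULL DEFAULT SYSDATETIME(),
    database_id        int       NOT NULL,
    file_id            int       NOT NULL,
    num_of_reads       bigint    NOT NULL,
    io_stall_read_ms   bigint    NOT NULL,
    num_of_writes      bigint    NOT NULL,
    io_stall_write_ms  bigint    NOT NULL
);

-- Run this from a SQL Agent job every 5 minutes or so
INSERT INTO dbo.FileStatsSnapshot
    (database_id, file_id, num_of_reads, io_stall_read_ms, num_of_writes, io_stall_write_ms)
SELECT  database_id, file_id, num_of_reads, io_stall_read_ms, num_of_writes, io_stall_write_ms
FROM    sys.dm_io_virtual_file_stats(NULL, NULL);

-- Trend the deltas between consecutive samples
-- (counters reset on instance restart, so ignore any negative deltas)
SELECT  capture_time,
        database_id,
        file_id,
        num_of_reads      - LAG(num_of_reads)      OVER (PARTITION BY database_id, file_id ORDER BY capture_time) AS reads_delta,
        io_stall_read_ms  - LAG(io_stall_read_ms)  OVER (PARTITION BY database_id, file_id ORDER BY capture_time) AS read_stall_ms_delta,
        num_of_writes     - LAG(num_of_writes)     OVER (PARTITION BY database_id, file_id ORDER BY capture_time) AS writes_delta,
        io_stall_write_ms - LAG(io_stall_write_ms) OVER (PARTITION BY database_id, file_id ORDER BY capture_time) AS write_stall_ms_delta
FROM    dbo.FileStatsSnapshot
ORDER BY database_id, file_id, capture_time;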

Once you figure out when the spikes are occurring, you’ll have an easier time identifying the cause of the latency via Extended Events (e.g. capturing long-running queries), Job History, etc. Who knows, maybe you’ve run into a situation where nothing local is directly affecting the I/O during the spikes, at which point you need to escalate to Amazon with this information in hand. Noisy neighbor scenarios happen, so don’t rule anything out.
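As a starting point for the Extended Events piece, a basic session like the one below captures statements that run longer than five seconds to a file target (the session name, duration threshold, and file name are just placeholders, tune them to your environment):

CREATE EVENT SESSION LongRunningQueries ON SERVER
ADD EVENT sqlserver.sql_statement_completed
(
    ACTION (sqlserver.sql_text, sqlserver.database_name, sqlserver.client_app_name)
    WHERE duration > 5000000   -- duration is in microseconds, so this is 5 seconds
)
ADD TARGET package0.event_file (SET filename = N'LongRunningQueries.xel')
WITH (MAX_DISPATCH_LATENCY = 5 SECONDS);

ALTER EVENT SESSION LongRunningQueries ON SERVER STATE = START;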

On a side note, if you want to break the DMV stats down further by database file, check out Paul Randal’s post on this DMV, which explains how to do that.
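The gist of the per-file breakdown (a rough sketch of the idea, not Paul’s exact script) is to join the DMV to sys.master_files so you can see which data and log files the latency belongs to:

SELECT  DB_NAME(vfs.database_id) AS database_name,
        mf.name                  AS logical_file_name,
        mf.physical_name,
        mf.type_desc,            -- ROWS (data) vs LOG
        vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)  AS avg_read_latency_ms,
        vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_latency_ms
FROM    sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN    sys.master_files AS mf
        ON  mf.database_id = vfs.database_id
        AND mf.file_id     = vfs.file_id
ORDER BY avg_read_latency_ms DESC;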
