After recently migrating MySQL 5.1 to a new, faster server, I’m seeing occasional and seemingly random periods where some queries are running slowly. Despite a lot of investigation, I can’t find the source of the problem.
I’ve listed a few concrete questions at the end of this writeup, but otherwise I’m not sure where to go at this point. If you can suggest any possible solutions or tools I can use, I would be very grateful.
Symptoms and Background
Every day or two, a small group of queries will appear at random in my slow query log, normally within a period of two or three minutes. These queries, which typically execute within 30-60 milliseconds, are taking vastly longer — in the range of 6 to 20 seconds. The slow query log doesn’t indicate any lock time for these queries.
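For context, slow query logging is configured along these lines in my.cnf (the file path here is an example, not my actual path):

```ini
[mysqld]
# Log any query slower than long_query_time seconds;
# fractional values are supported from MySQL 5.1.21 on.
slow_query_log      = 1
slow_query_log_file = /var/log/mysql/mysql-slow.log
long_query_time     = 5
```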
The problems became obvious after migrating to a new database server. Both servers run MySQL 5.1 under CentOS 6, and both use SSDs for database storage. The new server, unlike the old one, runs two SSDs using Linux software RAID-1. Although the hardware has changed (different datacenter, different network), the software hasn’t. The CentOS and application versions were the same at the time of the migration. The application code has also remained the same.
I have no reason to believe the queries themselves are problematic or that the database is experiencing performance problems; these aren’t complex or unoptimized queries. They normally execute in <100ms, and the database server has a typical load average less than 1.0.
I don’t believe that all queries are affected during these periods of slowness. The server sees, on average, about 8000 queries per hour. But my slow query log is showing only a handful (maybe 5-10) during the slow periods.
The queries appear to have nothing in common. They are mostly SELECT statements against several different databases, involving tables that use both the InnoDB and MyISAM engines. Sometimes UPDATE queries appear as well.
I tried reducing the slow query threshold (long_query_time) from 5 seconds to 2 seconds, to find out whether there were more, milder periods of slowness leading up to the very long-running queries. I don’t see that happening. The slow periods seem to start and stop suddenly and inexplicably.
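To line slow events up against other logs minute by minute, the timestamps and durations can be pulled straight out of the slow query log. A self-contained sketch (the log entries below are made up for illustration, in the 5.1 log format):

```shell
# Write a small sample file in MySQL slow-log format (hypothetical entries).
cat > /tmp/sample-slow.log <<'EOF'
# Time: 120113 10:05:27
# User@Host: app[app] @ localhost []
# Query_time: 12.345678  Lock_time: 0.000000 Rows_sent: 1  Rows_examined: 1
SELECT name FROM users WHERE id = 42;
# Time: 120113 10:05:29
# User@Host: app[app] @ localhost []
# Query_time: 0.042000  Lock_time: 0.000000 Rows_sent: 10  Rows_examined: 10
SELECT id FROM orders WHERE status = 'open';
EOF

# Print the timestamp and duration of every query slower than 2 seconds.
awk '/^# Time:/ { ts = $3 " " $4 }
     /^# Query_time:/ { if ($3 + 0 > 2) print ts, $3 }' /tmp/sample-slow.log
```

Running this prints one line per slow event (here, only the 12-second query), which is easy to diff against cron schedules or sar samples.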
Suspecting that IO was a problem, I set up the sar utility to capture system stats every minute. I can find no correlation between the periods of slowness and the system or IO load reported by sar. The server appears to be nearly idle when the slow queries are logged.
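For reference, the per-minute collection was done with a cron entry like the following (paths as installed by the sysstat package on CentOS 6 x86_64), and past samples can then be replayed around a slow event:

```shell
# /etc/cron.d/sysstat -- sample system activity every minute instead of every 10
* * * * * root /usr/lib64/sa/sa1 1 1

# Replay stats around a slow event (e.g. 10:00-10:10 on the 13th of the month):
sar -u -f /var/log/sa/sa13 -s 10:00:00 -e 10:10:00   # CPU utilization
sar -b -f /var/log/sa/sa13 -s 10:00:00 -e 10:10:00   # IO transfer rates
sar -d -f /var/log/sa/sa13 -s 10:00:00 -e 10:10:00   # per-device stats, incl. await
```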
I examined all cronjobs as well as all logfiles in /var/log, but can’t find anything interesting happening at the same time as the slowness.
Just to be safe, I disabled r1soft (backup software) and disabled other backup-related tasks. I also reduced the IO priority of Sphinx, a search engine that runs on the same server. Neither had any effect.
I tried to disable the query cache, as I’ve read that it can cause strange lock situations. No effect.
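For anyone trying the same thing: both settings are dynamic in 5.1/5.5, so the cache can be switched off without a restart (persist the change in my.cnf to survive one):

```sql
SET GLOBAL query_cache_type = 0;   -- stop using the cache
SET GLOBAL query_cache_size = 0;   -- free its memory and avoid its global mutex
SHOW GLOBAL STATUS LIKE 'Qcache%'; -- the hit/insert counters should stop moving
```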
I upgraded MySQL to version 5.5, just to see what would happen. This also had no effect on the slow query issue.
Questions

- Is there any way I can find out whether some strange, intermittent hardware issue with the new server is in play here? As I wrote above, I haven’t noticed anything while carefully checking the system logs.
- Is there any chance that network latency or packet loss has any role in this? I assume not, as the query time recorded in the slow query log should exclude any network delay, right?
- I see that MySQL 5.5 offers a Performance Schema feature that is new to me. But it’s not obvious to me how this can be used. Is there some way I can use this to troubleshoot the problem?
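On the Performance Schema question: in 5.5 it instruments waits (file, mutex, rwlock) rather than whole statements (statement digests only arrived in 5.6), and it must be enabled at startup with `performance_schema=ON` in my.cnf. Once it is running, the file-IO wait summaries can show whether MySQL itself is stalling on disk during a slow period:

```sql
-- Make sure file-IO wait instruments are enabled and timed:
UPDATE performance_schema.setup_instruments
   SET ENABLED = 'YES', TIMED = 'YES'
 WHERE NAME LIKE 'wait/io/file/%';

-- Which file-IO waits accumulate the most time (SUM_TIMER_WAIT is in picoseconds):
SELECT EVENT_NAME, COUNT_STAR, SUM_TIMER_WAIT / 1e12 AS total_wait_sec
  FROM performance_schema.events_waits_summary_global_by_event_name
 WHERE EVENT_NAME LIKE 'wait/io/file/%'
 ORDER BY SUM_TIMER_WAIT DESC
 LIMIT 10;
```

A sudden jump in InnoDB log or data-file wait time that coincides with a slow period would point at the storage layer rather than the queries.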
Thanks very much for any help with this frustrating problem!
Short solution: The problem was IO-related. It was solved by moving to a new, more modern system with faster SSDs.
I invested a lot of time over several months trying to get to the bottom of the performance issue described above. As a stopgap measure, I did everything possible to keep IO-heavy tasks (such as backups and search engine indexing) from running concurrently with database activity. However, this never really solved the problem or answered the questions I posted; it only masked the underlying issue of an IO-intensive process blocking database activity.
My “answer,” if you can call it that, was finally realized by moving to more modern hardware. Unlike the old system, which used consumer-grade Micron SSDs, the new server uses IBM datacenter SSDs and a faster processor. Otherwise, the software configuration is identical to that of the system described above (same OS, same use case for MySQL, same mdraid RAID-1 config, etc.).
I can only assume that the new system’s better SSDs and/or more modern motherboard and chipset handle concurrent operations better. I no longer experience slow queries of any sort, no matter what’s going on with the database server. Database operations no longer seem to be affected by disk activity nearly as much as they were before.
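If you want to test a drive for this before committing to it, one approach is to measure synchronous write latency while a competing sequential job runs, mimicking a backup racing against database writes. A hypothetical sketch (assumes the fio benchmarking tool is installed; file paths and sizes are placeholders):

```shell
# Run a sequential reader and a fsync-heavy random writer at the same time.
fio --name=bg-seqread   --filename=/var/tmp/fio.bg --size=1G \
    --rw=read --bs=1M --runtime=60 --time_based \
    --name=db-randwrite --filename=/var/tmp/fio.db --size=256M \
    --rw=randwrite --bs=16k --fsync=1 --runtime=60 --time_based
# Look at the "clat" (completion latency) percentiles of the db-randwrite job:
# multi-second p99.9 latencies under load point at the drive, not at MySQL.
```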
So, if you’re experiencing strange slowdowns with database queries and can’t find any application-level reasons for this, don’t forget to look at the underlying hardware as a possible source. Try renting a new dedicated server or VPS for a month from a different hosting provider, replicate your software configuration, and give it a try. Worked for me!