Question :
For the first time, I am running up against a horrifically long execution time of a MySQL query (~5 minutes).
The data in the database is highly (and not arbitrarily) normalized. It very efficient at organizing and shuffling data to be displayed in a lot of very helpful ways for different purposes, except this one particular query is throwing a wrench into it.
I can’t understand the reason for it. However, some background information that may shed some light onto otherwise arbitrarily convoluted queries.
The company has divided the world up into many many teams (macroregions). Everybody belongs to one or two teams, based on their expertise.
-
For example, there are many diverse teams. A few examples are
Spanish
,Sahara
,Iberia
,Portuguese
,Jungle
teams. Each of the teams has considerable overlap with other teams, but is in some senses independent. -
The
Arabic
team works quite closely with theSahara
team, by virtue of the fact that the database tells them they have to work together on certain assignments, because of overlapping geographic locations. TheSpanish
and thePortuguese
team also work closely together, they both work with theAmericas
andEurope
teams and thePortuguese
team also works with theAfrica
team, as does theArabic
team. -
Each team has a given set of regions, which are also not unique to that particular team. For example, the
Mediterranean
region belongs to around 12 teams, and when an event occurs there, they all work on it together. -
Each country belongs to one or more regions.
Turkey
belongs toCentral Asia
,Europe
and evenMediterranean
, and a few others.
Given all that, it is necessary to show each individual what other people on their teams are doing, as well as people who are not on their team, but have overlapping regions.
Query 1 accomplishes this perfectly, and very quickly less than .09 seconds.
SELECT report_name
FROM reports
WHERE region IN (
SELECT distinct region
FROM macroregions
WHERE macroregion IN (
SELECT distinct macroregion
FROM users
WHERE callsign = '$thisuser'
)
)
It is also important people can see every occurrence of the report_name
, when it Query 1 deems that the individual logged in should know about it. Reports_names
are automatically generated and reused for each geographic location
-
If I am on the
East Asia
team and someone in my area is working with theSpanish
team, there would be two entries in the database:-
20120210JOMX01
Japan
Okinawa
-
20120210JOMX01
Mexico
Nuevo Leon
-
Query 2 takes Query 1 and wraps one layer of WHEN-IN
s around it. It allows the East Asia
team members know that someone on their team is working in Mexico
. But the performance delay is unacceptable and totally unusable; it takes nearly 5 minutes to complete a single query!
SELECT *
FROM reports
WHERE report_name IN (
SELECT report_name
FROM reports
WHERE region IN (
SELECT distinct region
FROM macroregions
WHERE macroregion IN (
SELECT distinct macroregion
FROM users
WHERE callsign = '$thisuser'
)
)
)
Despite the fact that the query works, it takes too long to be effective. Again, it is startling how long this particular query takes. Whereas nothing else in the interface has given me any sort of performance issue (i.e. they all take less than one second).
What steps can I take to fix this?
Answer :
Let’s start with your original query
SELECT *
FROM reports
WHERE report_name IN (
SELECT report_name
FROM reports
WHERE region IN (
SELECT distinct region
FROM macroregions
WHERE macroregion IN (
SELECT distinct macroregion
FROM users
WHERE callsign = '$thisuser'
)
)
)
You can gather keys only (in stages), then join the keys with the reports
table. Here is my new proposed query
SELECT reports.* FROM
(
SELECT rpts.report_name FROM
(
SELECT DISTINCT regions.region FROM
(SELECT DISTINCT macroregion
FROM users WHERE callsign = '$thisuser') users
INNER JOIN
(SELECT DISTINCT macroregion,region FROM regions) regions
USING (macroregion)
) regionkeys
INNER JOIN
(SELECT region,report_name FROM reports) rpts
USING (region)
) reportnamekeys
INNER JOIN reports
USING (report_name);
If you are not comfortable with the USING clause here is my new propsed query without using the USING clause:
SELECT reports.* FROM
(
SELECT rpts.report_name FROM
(
SELECT DISTINCT regions.region FROM
(SELECT DISTINCT macroregion
FROM users WHERE callsign = '$thisuser') users
INNER JOIN
(SELECT DISTINCT macroregion,region FROM regions) regions
ON users.macroregion = regions.macroregion
) regionkeys
INNER JOIN
(SELECT region,report_name FROM reports) rpts
ON regionkeys.region = rpts.region
) reportnamekeys
INNER JOIN reports
ON reportkeys.report_name = reports.report_name;
You will need some indexes to support the subqueries
ALTER TABLE users ADD INDEX callsign_macroregion_ndx (callsign,macroregion);
ALTER TABLE regions ADD INDEX macroregion_region_ndx (macroregion,region);
ALTER TABLE reports ADD INDEX region_report_name_ndx (region,report_name);
ALTER TABLE reports ADD INDEX report_name_ndx (report_name);
The first 3 indexes are called covering indexes. They are called such because the subqueries calls for only those exact columns. Therefore, no need to read from the table. The data is only fetched from the index.
- http://peter-zaitsev.livejournal.com/6949.html
- http://ronaldbradford.com/blog/tag/covering-index/
- http://www.mysqlperformanceblog.com/2006/11/23/covering-index-and-prefix-indexes/
You should also study how to refactor queries to gather keys, perform WHERE clauses early, and perform JOINs last. Here is great YouTube Video on how to do this: http://youtu.be/ZVisY-fEoMw (Based on this book : Refactoring SQL Applications)
Give it a Try !!!
First step: get rid of subqueries. Convert them to proper joins, or co-related subqueries.