Question :
I have two tables – comments
and votes
:
comments(id, text, user_id, page_id)
votes(id, value)
There are 2,000,000 rows in the votes
table.
I created the following query:
SELECT SUM(votes.value),
comments.text,
comments.comment_id
FROM comments, votes
WHERE comments.comment_id = votes.comment_id AND comments.page_id = $page_id
GROUP BY comment_id;
The query runs for about 12 seconds, is this normal?
SHOW PROCESSLIST
says that its copying to a temp table. This seems slow, is it necessary to use a temp table?
Answer :
In the world of RDBMS’s, temp tables are a fact of life. It only rears its ugly head in JOINs.
Even the worst case of a JOIN is the degenerate JOIN, a query of one table.
Since temp tables are always making its way into our queries (into our lives), the best thing you can do is starve temp tables. Making them as compact as possible. What do I mean ???
Here is your query :
SELECT SUM(votes.value),
comments.text,
comments.comment_id
FROM comments, votes
WHERE comments.comment_id = votes.comment_id AND comments.page_id = $page_id
GROUP BY comment_id;
You query will actually create a table resulting from a join of comments and votes whose rowcount is the number of comments times 2,000,000. Since temp table have no indexes, there are always cartesians joins occuring with the temp tables. The WHERE clause is applied along the way, then the GROUP BY summation. Don’t forget, the temp table will also contain the text column. That’s a lot of text data to drag around in the JOIN phase.
Let’s refactor your query
You can head off the WHERE clause at the pass, so to speak. Here is how:
SELECT comment_id FROM comments WHERE page_id = $page_id;
This query only has the necessary keys from the comments table.
Next, collect comment_ids from votes
SELECT comment_id,SUM(value) sumofvalues FROM votes;
This is actually the worst part. With 2 million rows and 4 bytes per comment_id and 4 bytes for the sum, that’s an 16MB table in the absolute worst case.
Next, combine the keys of comments with the matching keys in votes.
SELECT BB.* FROM
(SELECT comment_id FROM comments WHERE page_id = $page_id) AA
INNER JOIN
(SELECT comment_id,SUM(value) sumofvalues FROM votes GROUP BY comment_id) BB
USING (comment_id);
Now that keys from comments and the sum of the values from votes are retrieved, the final part is to connect the comment-ids back to the original comments table and get the text fields.
SELECT
B.sumofvalues,A.text,A.comment_id
FROM
comments A INNER JOIN
(
SELECT BB.*
FROM
(SELECT comment_id FROM comments WHERE page_id = $page_id) AA
INNER JOIN
(SELECT comment_id,SUM(value) sumofvalues
FROM votes GROUP BY comment_id) BB
USING (comment_id)
) B
USING (comment_id);
Before this refactored query can work as fast as possible you will need proper indexing.
Here are the indexes you need:
ALTER TABLE comments ADD INDEX pageid_commentid_ndx (page_id,comment_id);
ALTER TABLE votes ADD INDEX commentid_value_ndx (comment_id,value);
You want the first index because it will group rows by page_id. You want the second index because It will group rows by comment_id. In fact, both these indexes are called covering indexes. Why is that important ??? It is important because the subqueries will retrieve needed data from the index only, never touching the main table. The comments table is only accessed once when all needed keys are compiled together in the subqueries.
Give it a Try !!!
If any syntax is not working, please comment on the question and let me know !!!
Assuming this is MySQL, you can use EXPLAIN
syntax to help figure out how the query is being run:
EXPLAIN SELECT SUM(votes.value),comments.text,comments.comment_id FROM comments,votes
WHERE comments.comment_id = votes.comment_id AND comments.page_id = $page_id
GROUP BY comment_id;
First guess is you are going to need indexes on the columns that you’re joining on the votes table.