Question :
After looking at my mysql slow log file I see that almost every entry there is related to this query:
SELECT
n.nid AS nid, AVG(v.value) AS rating
FROM
node n
LEFT OUTER JOIN votingapi_vote v
ON v.entity_id = n.nid
AND v.timestamp < UNIX_TIMESTAMP()-304800
AND v.timestamp > UNIX_TIMESTAMP()-1524000
WHERE
( (n.type = 'gallery') AND (n.status = '1') )
GROUP BY
n.nid
ORDER BY
rating DESC, n.created DESC
LIMIT 20 OFFSET 0;
Slow log shows this stats: Query_time: 1 Lock_time: 0 Rows_sent: 20 Rows_examined: 1981034
As you may see it joins 2 tables – node (45000 rows) and votingapi_vote (only 705 rows for now). votingapi_vote will grow in future and will have thousands rows.
Node table has a content rows like n.nid, n.title, n.created, n.type, n.status.
Votingapi_vote table has a voting rows like v.entity_id, v.value, v.timestamp
So I need to get list of 20 nodes with type gallery who has the best average rating for selected period of time.
Can you pls help me to optimize this query?
Thanks!
UPDATE: This is EXPLAIN results:
+----+-------------+-------+------+----------------------------+------------------+---------+-------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+----------------------------+------------------+---------+-------------+------+----------------------------------------------+
| 1 | SIMPLE | n | ref | node_status_type,node_type | node_status_type | 102 | const,const | 830 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | v | ALL | NULL | NULL | NULL | NULL | 759 | |
+----+-------------+-------+------+----------------------------+------------------+---------+-------------+------+----------------------------------------------+
2 rows in set (0.00 sec)
UPDATE 2: This is EXPLAIN for query where I changed LEFT OUTER JOIN to INNER JOIN:
mysql> EXPLAIN SELECT n.nid AS nid, AVG(v.value) AS rating FROM node n INNER JOIN votingapi_vote v ON v.entity_id = n.nid AND v.timestamp < UNIX_TIMESTAMP()-304800 AND v.timestamp > UNIX_TIMESTAMP()-1524000 WHERE ( (n.type = 'gallery') AND (n.status = '1') ) GROUP BY n.nid ORDER BY rating DESC, n.created DESC LIMIT 20 OFFSET 0;
+----+-------------+-------+--------+------------------------------------+---------+---------+-----------------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+------------------------------------+---------+---------+-----------------------------+------+----------------------------------------------+
| 1 | SIMPLE | v | ALL | NULL | NULL | NULL | NULL | 653 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | n | eq_ref | PRIMARY,node_status_type,node_type | PRIMARY | 4 | drupal.v.entity_id | 1 | Using where |
+----+-------------+-------+--------+------------------------------------+---------+---------+-----------------------------+------+----------------------------------------------+
2 rows in set (0.01 sec)
Is INNER JOIN better? Btw it returned 0 results (without EXPLAIN). So looks like INNER JOIN wrong in this case.
Answer :
If presented with this situation I would test the performance improvement of this:
- Add two fields to the
node
table:vote_count
,vote_sum
- Add a generated field to the
node
tablevote_avg
which is stored and indexed with a descending sort on the index key. - Add a trigger to the
votingapi_vote
table so that when row is inserted the respectivevote_count
is incremented by one and the value of the vote is added to thevote_sum
. (Likewise forUPDATE
orDELETE
.)
The drawbacks to this approach are that now your writes take longer so that you can do faster reads, and this might not be suitable if the filter that you use in your query on the timestamp
column isn’t predictable.
It’s tough to optimise Aggregation (AVG), Group By and Order by. These operations even if well indexed will generally cost internal temp tables and regularly on-disk temp tables for grouping and sorting. It’s widely accepted to cache the results for re-use rather then rerunning this at any level of concurrency. You may have heard of summary tables to record such values. Worth a read, google is your friend.