Question :
Could you please help me to optimize this query – currently it takes ~4 minutes to run this query. It looks like keys: ip
and timestamp
are non-unique so shall I add some indexes?
mysql> EXPLAIN
SELECT max(url) as url,
max(title) as title,
keyword as keyword,
COALESCE(max(domain), 'example.com') as domain
FROM `test_url` as tu
JOIN `test_log` AS tl
ON tl.shorturl = tu.keyword
AND tl.click_id >
(SELECT option_value
from `test_options`
WHERE option_name = 'click_id' )
WHERE 1 = 1
GROUP BY keyword
HAVING count( keyword ) > 50
ORDER BY count( keyword ) DESC
LIMIT 10;
+----+-------------+--------------+--------+------------------+-------------+---------+-----------------------+---------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+--------+------------------+-------------+---------+-----------------------+---------+----------------------------------------------+
| 1 | PRIMARY | tl | range | PRIMARY,shorturl | PRIMARY | 4 | NULL | 5748586 | Using where; Using temporary; Using filesort |
| 1 | PRIMARY | tu | eq_ref | PRIMARY | PRIMARY | 602 | test_urls.tl.shorturl | 1 | |
| 2 | SUBQUERY | test_options | ref | option_name | option_name | 194 | | 1 | Using where |
+----+-------------+--------------+--------+------------------+-------------+---------+-----------------------+---------+----------------------------------------------+
3 rows in set (0.00 sec)
–
mysql> SHOW CREATE TABLE test_urlG
*************************** 1. row ***************************
Table: test_url
Create Table: CREATE TABLE `test_url` (
`keyword` varchar(200) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
`url` text CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
`title` text,
`timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`ip` varchar(41) NOT NULL,
`clicks` int(10) unsigned NOT NULL,
`username` varchar(255) NOT NULL DEFAULT '',
`company_id` int(11) NOT NULL DEFAULT '0',
`domain` varchar(31) NOT NULL DEFAULT '',
PRIMARY KEY (`keyword`),
KEY `timestamp` (`timestamp`),
KEY `ip` (`ip`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
1 row in set (0.00 sec)
–
mysql> SHOW INDEXES FROM test_url;
+----------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+----------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| test_url | 0 | PRIMARY | 1 | keyword | A | 176798 | NULL | NULL | | BTREE | | |
| test_url | 1 | timestamp | 1 | timestamp | A | 176798 | NULL | NULL | | BTREE | | |
| test_url | 1 | ip | 1 | ip | A | 11 | NULL | NULL | | BTREE | | |
+----------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
3 rows in set (0.08 sec)
Edit: Tue Jun 17 15:17:36 BST 2014
mysql> SHOW CREATE TABLE test_logG
*************************** 1. row ***************************
Table: test_log
Create Table: CREATE TABLE `test_log` (
`click_id` int(11) NOT NULL AUTO_INCREMENT,
`click_time` datetime NOT NULL,
`shorturl` varchar(200) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
`referrer` varchar(200) NOT NULL,
`user_agent` varchar(255) NOT NULL,
`ip_address` varchar(41) NOT NULL,
`country_code` char(2) NOT NULL,
PRIMARY KEY (`click_id`),
KEY `shorturl` (`shorturl`)
) ENGINE=InnoDB AUTO_INCREMENT=51636348 DEFAULT CHARSET=utf8
1 row in set (0.00 sec)
–
mysql> SHOW CREATE TABLE test_optionsG
*************************** 1. row ***************************
Table: test_options
Create Table: CREATE TABLE `test_options` (
`option_id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`option_name` varchar(64) NOT NULL DEFAULT '',
`option_value` longtext NOT NULL,
PRIMARY KEY (`option_id`,`option_name`),
KEY `option_name` (`option_name`)
) ENGINE=InnoDB AUTO_INCREMENT=6 DEFAULT CHARSET=utf8
1 row in set (0.00 sec)
Edit: Fri Jun 20 10:40:35 BST 2014
Database size is 4975MB (3965MB data size + 1010MB index size)
MySQL veriosn: 5.5.35
OS: Ubuntu 12.04
This is basically this application: http://yourls.org/
The main problem is that this query is causing a high load on the server using temporary tables created on disk (State: Copying to tmp table)
Answer :
I bear bad news. I’ve studied your tables and your query and looked at many variations on them, ran them against a synthetic data set approximately the same size as yours. And I’m sorry to report that I’m 98% sure that your query is nearly optimal and that you already have the right indexes in place. I welcome anyone who can disprove this and come up with additional indexes + rewritten sql that can significantly improve performance.
Aggregation != Searching.
You’re doing OLAP. Your query seems to be doing aggregation over about 5 million rows. Normally if I see a query doing a full table scan or a wide index range scan, I would suspect that indexes might not be optimal unless (big unless here) you’re doing an aggregation (such as COUNT
, SUM
, AVG
, etc), then there’s no way around accessing all those rows since you need to compute on them.
I don’t believe you are going to get significantly quicker results simply by adding indexes and rewriting the SQL alone. You’ll need to incorporate some other approaches:
-
Pre-aggregating data. You could have some background job that runs and pre-aggregates your data by time intervals. This could be by 1-day intervals, 1-hour, 5-minutes, 1-minute … whatever makes sense for your application in terms of how much latency you can tolerate.
-
Partitioning your data could help. I was thinking you might get some improvement by hash partitioning
test_log
overHASH(shorturl)
in 32 hash partitions. But you’d need to redefine your PK and then maintaining the lifecycle of the data could become more difficult (not sure whether you’re purging data already, but you probably should, or at least archive it).
Bigger picture, it looks like you’re using a RDBMS (MySQL database) to do real-time keyword/click analysis. However, I don’t believe an RDBMS is ideally suited for this task. Big data and stream processing are really becoming hot on this exact problem. The following search might get you in the right direction: http://www.google.com/search?q=stream+computing+real-time+click+analysis
Since the short_url is a keyword, maybe refactoring the query could be a game changer
Start with getting the top 10 keywords whose count > 50
Then, join the 10 words to the other tables
SELECT B.url,B.title,B.keyword,B.domain
FROM
(
SELECT keyword FROM
(
SELECT COUNT(1) keyword_count,shorturl keyword
FROM test_log AA INNER JOIN test_options BB
ON AA.click_id = BB.option_value
WHERE BB.option_name <> 'click_id'
GROUP BY shorturl HAVING COUNT(1) > 50
) KW
ORDER BY keyword_count,keyword DESC LIMIT 10
) A
LEFT JOIN test_url B USING (keyword);
The worse part is the KW
subquery. If KW
performs fast, the rest of the query will perform fast
Give it a Try !!! (Hope it works)
I’m not very optimistic, but I’d try these two rewrite variations:
SELECT
tu.url,
tu.title,
tu.keyword,
tu.domain
FROM
( SELECT tl.shorturl, COUNT(*) AS cnt
FROM
( SELECT CAST(option_value AS UNSIGNED) AS option_value
FROM test_options
WHERE option_name = 'click_id'
) AS op
JOIN
test_log AS tl
ON tl.click_id > op.option_value
GROUP BY tl.shorturl
HAVING COUNT(*) > 50
ORDER BY cnt DESC
LIMIT 10
) AS lim
JOIN
test_url as tu
ON lim.shorturl = tu.keyword
ORDER BY
lim.cnt DESC ;
Tested at SQL-Fiddle (with much less data of course).
Variation 2 may return slightly different results but it will use the index on test_log (shorturl)
instead of a full table scan so it might be more efficient:
SELECT
tu.url,
tu.title,
tu.keyword,
tu.domain
FROM
( SELECT tl.shorturl,
COUNT(CASE WHEN tl.click_id > op.option_value THEN 1 END) AS cnt
FROM
( SELECT CAST(option_value AS UNSIGNED) AS option_value
FROM test_options
WHERE option_name = 'click_id'
) AS op
CROSS JOIN
test_log AS tl
GROUP BY tl.shorturl
HAVING cnt >= 0
ORDER BY cnt DESC
LIMIT 10
) AS lim
JOIN
test_url as tu
ON lim.shorturl = tu.keyword
ORDER BY
lim.cnt DESC ;
test_options need not be a subquery in the WHERE clause since it will always return the same single value. The table can be in the FROM clause.
For example:
SELECT
MAX(tu.url) url,
MAX(tu.title) title,
tu.keyword,
COALESCE
(
MAX(tu.domain),
'example.com'
) domain
FROM
`test_url` tu,
`test_log` t1,
`test_options` to
WHERE
tl.shorturl = tu.keyword
AND to.option_value <= tl.click_id
GROUP BY
tu.keyword
HAVING
COUNT(tu.keyword) > 50
ORDER BY
COUNT(tu.keyword) DESC
LIMIT
10;