Aggregate query optimization

Posted on

Question :

Could you please help me to optimize this query – currently it takes ~4 minutes to run this query. It looks like keys: ip and timestamp are non-unique so shall I add some indexes?

mysql> EXPLAIN
    SELECT max(url) as url, 
           max(title) as title, 
           keyword as keyword,
           COALESCE(max(domain), 'example.com') as domain 
    FROM `test_url` as tu 
      JOIN `test_log` AS tl 
        ON  tl.shorturl = tu.keyword 
        AND tl.click_id > 
            (SELECT option_value 
             from `test_options` 
             WHERE option_name = 'click_id' ) 
    WHERE 1 = 1 
    GROUP BY keyword 
    HAVING count( keyword ) > 50 
    ORDER BY count( keyword ) DESC 
    LIMIT 10;
+----+-------------+--------------+--------+------------------+-------------+---------+-----------------------+---------+----------------------------------------------+
| id | select_type | table        | type   | possible_keys    | key         | key_len | ref                   | rows    | Extra                                        |
+----+-------------+--------------+--------+------------------+-------------+---------+-----------------------+---------+----------------------------------------------+
|  1 | PRIMARY     | tl           | range  | PRIMARY,shorturl | PRIMARY     | 4       | NULL                  | 5748586 | Using where; Using temporary; Using filesort |
|  1 | PRIMARY     | tu           | eq_ref | PRIMARY          | PRIMARY     | 602     | test_urls.tl.shorturl |       1 |                                              |
|  2 | SUBQUERY    | test_options | ref    | option_name      | option_name | 194     |                       |       1 | Using where                                  |
+----+-------------+--------------+--------+------------------+-------------+---------+-----------------------+---------+----------------------------------------------+
3 rows in set (0.00 sec)

mysql> SHOW CREATE TABLE test_urlG
*************************** 1. row ***************************
       Table: test_url
Create Table: CREATE TABLE `test_url` (
  `keyword` varchar(200) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `url` text CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `title` text,
  `timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `ip` varchar(41) NOT NULL,
  `clicks` int(10) unsigned NOT NULL,
  `username` varchar(255) NOT NULL DEFAULT '',
  `company_id` int(11) NOT NULL DEFAULT '0',
  `domain` varchar(31) NOT NULL DEFAULT '',
  PRIMARY KEY (`keyword`),
  KEY `timestamp` (`timestamp`),
  KEY `ip` (`ip`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
1 row in set (0.00 sec)

mysql> SHOW INDEXES FROM test_url;
+----------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table    | Non_unique | Key_name  | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+----------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| test_url |          0 | PRIMARY   |            1 | keyword     | A         |      176798 |     NULL | NULL   |      | BTREE      |         |               |
| test_url |          1 | timestamp |            1 | timestamp   | A         |      176798 |     NULL | NULL   |      | BTREE      |         |               |
| test_url |          1 | ip        |            1 | ip          | A         |          11 |     NULL | NULL   |      | BTREE      |         |               |
+----------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
3 rows in set (0.08 sec)

Edit: Tue Jun 17 15:17:36 BST 2014

mysql> SHOW CREATE TABLE test_logG
*************************** 1. row ***************************
       Table: test_log
Create Table: CREATE TABLE `test_log` (
  `click_id` int(11) NOT NULL AUTO_INCREMENT,
  `click_time` datetime NOT NULL,
  `shorturl` varchar(200) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `referrer` varchar(200) NOT NULL,
  `user_agent` varchar(255) NOT NULL,
  `ip_address` varchar(41) NOT NULL,
  `country_code` char(2) NOT NULL,
  PRIMARY KEY (`click_id`),
  KEY `shorturl` (`shorturl`)
) ENGINE=InnoDB AUTO_INCREMENT=51636348 DEFAULT CHARSET=utf8
1 row in set (0.00 sec)

mysql> SHOW CREATE TABLE test_optionsG
*************************** 1. row ***************************
       Table: test_options
Create Table: CREATE TABLE `test_options` (
  `option_id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `option_name` varchar(64) NOT NULL DEFAULT '',
  `option_value` longtext NOT NULL,
  PRIMARY KEY (`option_id`,`option_name`),
  KEY `option_name` (`option_name`)
) ENGINE=InnoDB AUTO_INCREMENT=6 DEFAULT CHARSET=utf8
1 row in set (0.00 sec)

Edit: Fri Jun 20 10:40:35 BST 2014

Database size is 4975MB (3965MB data size + 1010MB index size)

MySQL veriosn: 5.5.35
OS: Ubuntu 12.04

This is basically this application: http://yourls.org/

The main problem is that this query is causing a high load on the server using temporary tables created on disk (State: Copying to tmp table)

Answer :

I bear bad news. I’ve studied your tables and your query and looked at many variations on them, ran them against a synthetic data set approximately the same size as yours. And I’m sorry to report that I’m 98% sure that your query is nearly optimal and that you already have the right indexes in place. I welcome anyone who can disprove this and come up with additional indexes + rewritten sql that can significantly improve performance.

Aggregation != Searching.
You’re doing OLAP. Your query seems to be doing aggregation over about 5 million rows. Normally if I see a query doing a full table scan or a wide index range scan, I would suspect that indexes might not be optimal unless (big unless here) you’re doing an aggregation (such as COUNT, SUM, AVG, etc), then there’s no way around accessing all those rows since you need to compute on them.

I don’t believe you are going to get significantly quicker results simply by adding indexes and rewriting the SQL alone. You’ll need to incorporate some other approaches:

  • Pre-aggregating data. You could have some background job that runs and pre-aggregates your data by time intervals. This could be by 1-day intervals, 1-hour, 5-minutes, 1-minute … whatever makes sense for your application in terms of how much latency you can tolerate.

  • Partitioning your data could help. I was thinking you might get some improvement by hash partitioning test_log over HASH(shorturl) in 32 hash partitions. But you’d need to redefine your PK and then maintaining the lifecycle of the data could become more difficult (not sure whether you’re purging data already, but you probably should, or at least archive it).

Bigger picture, it looks like you’re using a RDBMS (MySQL database) to do real-time keyword/click analysis. However, I don’t believe an RDBMS is ideally suited for this task. Big data and stream processing are really becoming hot on this exact problem. The following search might get you in the right direction: http://www.google.com/search?q=stream+computing+real-time+click+analysis

Since the short_url is a keyword, maybe refactoring the query could be a game changer

Start with getting the top 10 keywords whose count > 50

Then, join the 10 words to the other tables

SELECT B.url,B.title,B.keyword,B.domain
FROM
(
    SELECT keyword FROM
    (
         SELECT COUNT(1) keyword_count,shorturl keyword
         FROM test_log AA INNER JOIN test_options BB
         ON AA.click_id = BB.option_value
         WHERE BB.option_name <> 'click_id'
         GROUP BY shorturl HAVING COUNT(1) > 50
    ) KW
    ORDER BY keyword_count,keyword DESC LIMIT 10
) A
LEFT JOIN test_url B USING (keyword);

The worse part is the KW subquery. If KW performs fast, the rest of the query will perform fast

Give it a Try !!! (Hope it works)

I’m not very optimistic, but I’d try these two rewrite variations:

SELECT 
    tu.url, 
    tu.title, 
    tu.keyword,
    tu.domain
FROM
    ( SELECT tl.shorturl, COUNT(*) AS cnt
      FROM
          ( SELECT CAST(option_value AS UNSIGNED) AS option_value 
            FROM test_options 
            WHERE option_name = 'click_id'
          ) AS op 
        JOIN
          test_log AS tl
            ON  tl.click_id > op.option_value 
      GROUP BY tl.shorturl 
      HAVING COUNT(*) > 50
      ORDER BY cnt DESC 
      LIMIT 10
    ) AS lim
  JOIN 
    test_url as tu 
      ON  lim.shorturl = tu.keyword 
ORDER BY 
    lim.cnt DESC ;

Tested at SQL-Fiddle (with much less data of course).

Variation 2 may return slightly different results but it will use the index on test_log (shorturl) instead of a full table scan so it might be more efficient:

SELECT 
    tu.url, 
    tu.title, 
    tu.keyword,
    tu.domain
FROM
    ( SELECT tl.shorturl, 
             COUNT(CASE WHEN tl.click_id > op.option_value THEN 1 END) AS cnt
      FROM
          ( SELECT CAST(option_value AS UNSIGNED) AS option_value 
            FROM test_options 
            WHERE option_name = 'click_id'
          ) AS op 
        CROSS JOIN
          test_log AS tl
      GROUP BY tl.shorturl 
      HAVING cnt >= 0
      ORDER BY cnt DESC 
      LIMIT 10
    ) AS lim
  JOIN 
    test_url as tu 
      ON  lim.shorturl = tu.keyword 
ORDER BY 
    lim.cnt DESC ;

test_options need not be a subquery in the WHERE clause since it will always return the same single value. The table can be in the FROM clause.

For example:

SELECT
    MAX(tu.url)     url,
    MAX(tu.title)       title,
    tu.keyword,
    COALESCE
    (
     MAX(tu.domain),
     'example.com'
    )           domain
FROM
    `test_url`  tu,
    `test_log`  t1,
    `test_options`  to
WHERE
    tl.shorturl = tu.keyword
  AND   to.option_value <= tl.click_id
GROUP BY
    tu.keyword 
HAVING
    COUNT(tu.keyword) > 50 
ORDER BY
    COUNT(tu.keyword) DESC
LIMIT
    10;

Leave a Reply

Your email address will not be published.