slow mysql query need help

Posted on

Question :

I want to retrieve count of unread message for each phone number from the table user_message. Below mentioned query takes 2996 ms to give result which is super slow.

Extra Info of table:
table contains 6.6 million rows

I have Indexed Column phone_num and m_status

    SELECT  COUNT ( * ) AS `__count`
        FROM  `user_message`
        WHERE  ( `user_message` . `phone_num` = "12312312312"
          AND  `user_message` . `m_status` = "unread" 
               ) 
Excexution Plan

{
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "293495.86"
    },
    "table": {
      "table_name": "user_message",
      "access_type": "index_merge",
      "possible_keys": [
        "user_m_statu_997f64_idx",
        "phone_num_7f5e39_idx"
      ],
      "key": "intersect(user_m_statu_997f64_idx,phone_num_7f5e39_idx)",
      "key_length": "402,2002",
      "rows_examined_per_scan": 922210,
      "rows_produced_per_join": 922210,
      "filtered": "100.00",
      "using_index": true,
      "cost_info": {
        "read_cost": "201274.86",
        "eval_cost": "92221.00",
        "prefix_cost": "293495.86",
        "data_read_per_join": "10G"
      },
      "used_columns": [
        "m_status",
        "phone_num"
      ],
      "attached_condition": "((`db`.`user_message`.`m_status` = 'unread') and (`db`.`user_message`.`phone_num` = '123213123'))"
    }
  }
}

Answer :

Indexing the columns individually means that you will need to find out all the rows that match the phone_num = "12312312312" predicate and then find out all the rows that match the m_status = "unread" predicate, and then merge the results sets together. Alternatively, it could just use one of the indexes and read the rows by index lookups and check the other column.

If neither filter is suitably selective on it’s own, you will want to use a composite index:

create index user_message_number_status
       on user_message (phone_num, m_status);

For this particular query, the order of columns in the index won’t matter much. But if you are looping this query for several different phone_num values, you might be better off with m_status being the lead column (although it probably won’t make a huge difference).

“index merge” “intersect” is the clue that it would be better to add a composite INDEX(m_status, phone_num). The columns can be in either order, but Andrew’s answer gives an argument for this order.

And, while you are at it, Drop the index containing just the first column. (That is, Drop the INDEX(m_status) if you pick this order.)

To understand the advantage of this composite index, read about “covering index”.

You have to consider not just this query, but how you will access messages.

Suppose you offer a few different display options:

Unread on top, then read, both sorted by date:

WHERE phone_num=? ORDER BY m_read, m_date DESC LIMIT n

Ordered by date, with read and unread in a different style, for example bold for unread like in email clients:

WHERE phone_num=? ORDER BY m_date DESC LIMIT n

Unread only, sorted by date:

WHERE phone_num=? AND m_read=0 ORDER BY m_date DESC LIMIT n

From one sender only, sorted by date:

WHERE phone_num=? AND sender_num=? ORDER BY m_date DESC LIMIT n

The messages table will grow quite large, so you want all of these to be indexed. There is a tradeoff with indices, they use space and updating them also uses resources, so every index should be useful. If an index is not useful, it is a waste of resources. So it pays off to design the whole system to use a minimum number of indices. If an index can optimize several queries, it’s even better.

In addition, people never delete their messages, so the table will grow and most of it will never be accessed. This is why the LIMIT is important, the app will only ever display the last N messages, unless a search is done, and this means there should never be any sorts done without indices.

Using an index for a sort+limit is very fast. The database simply picks rows in index order and stops once the limit is reached. It’s like picking the dictionary and retrieving words in alphabetic order. They’re already sorted, so you just have to read the dictionary in order. On the other hand, if the database does an actual sort, it will have to retrieve ALL messages for this phone number, even the very old ones that no-one will ever read again, then do a big sort, and throw away most of the data to display only one page of results. It won’t just make a slow sort, it will also pull all the useless old rows into your cache, filling it, and pushing out the new rows that you really want to keep in cache. This will completely destroy your performance. You absolutely want the most recent messages to be cached, so they are accessed as quickly as possible, which means the rows with ancient messages must never be read.

So, considering the above example queries, you’d need:

A more convenient status column. It should be an INTEGER, not a string, unless you used an ENUM, then that’s an integer, so that’s okay.

Then, you could make the following indices:

(phone_num, m_read, m_date DESC)

Note the “m_date DESC” means the rows are put in the index in reverse order, because you’re always gonna ORDER BY date DESC anyway. I used a bool m_read column in this example, so the query does “ORDER BY m_read, m_date DESC”, so m_read=0, which is unread, comes first. If you use a different encoding, you may have to put DESC in the index definition for the read/status column also. I don’t remember if MySQL can use an index for sorting if the column sort orders in the index are not a good match for the requested ORDER BY, so you’ll have to test and make sure it uses the index for sorting. This shows up in EXPLAIN.

Now you’ll also need an index with

(phone_num, m_date DESC)

when you’re not filtering on unread messages.

And you’ll need this one to display a conversation:

(phone_num, sender_num, m_date DESC)

Note to display a conversation, you need two selects:

WHERE phone_num=me AND sender_num=you ORDER BY m_date DESC LIMIT n
WHERE phone_num=you AND sender_num=me ORDER BY m_date DESC LIMIT n

Then merge the results with a UNION ALL, and sort again.

It is possible to optimize the display of a conversation by giving each conversation a unique identifier. A good value for this identifier is the concatenation of the minimum then maximum of both phone numbers. So, if my phone number is “12” and yours is “34”, the identifier for our conversation is “12-34”, and it is the same whether you or I are using the app to view it. If more than two people can join a conversation, you’ll need to generate real identifiers though.

Anyway. These indices should do the trick. Check EXPLAIN to make sure there are no sorts, and always use LIMIT!

There’s one little thing that bothers me though:

(phone_num, m_read, m_date DESC)

This index will be huge, and quite useless for rows with read messages. Since the majority of the messages will be read, it would be better to index only the unread messages. Unfortunately MySQL does not support conditional indices, but you can always use a hack.

Leave a Reply

Your email address will not be published.