How to speed up update join query between 2 mysql tables?

Posted on

Question :

alright long story short.,
i have powerful dedicated server.

Intel  I7-6700K -
64GB DDR4 2400 MHz
1x480GB   SSD

running mysql server along with nginx,php

innodb-ft-min-token-size = 1
innodb-ft-enable-stopword = 0

innodb_buffer_pool_size = 40G
max_connections = 2000

[deploy@ns540545 ~]$ free -h
              total        used        free      shared  buff/cache   available
Mem:            62G         45G         11G        107M        6.4G         16G
Swap:          2.0G        1.4G        640M

it was expensive so i got another dedicated server for cost cutting lets call it
not-so powerful dedicated server

Intel  i3-2130  
8GB DDR3 1333 MHz   
2TB 

running mysql server along with nginx,php

innodb-ft-min-token-size = 1
innodb-ft-enable-stopword = 0

innodb_buffer_pool_size = 4G
max_connections = 2000

[root@privateserver deploy]# free -h
              total        used        free      shared  buff/cache   available
Mem:           7.7G        7.5G         73M         24M        150M         79M
Swap:           39G        7.8G         32G

i moved database from powerful server to not-so powerful server.

I can feel slight performance degradation while running simple queries which is fine, but this one query which used to take 2 minutes on powerful server now it takes around 26.6525 hours and counting on not-so powerful server.

UPDATE content a JOIN peers_data b ON a.hash = b.hash SET a.seeders = b.seeders, a.leechers = b.leechers, a.is_updated = b.is_updated

More info about tables which are exactly same on both the dedicated server

CREATE TABLE `peers_data` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `hash` char(40) COLLATE utf8mb4_unicode_ci NOT NULL DEFAULT '',
  `seeders` int(11) NOT NULL DEFAULT '0',
  `leechers` int(11) NOT NULL DEFAULT '0',
  `is_updated` int(1) NOT NULL DEFAULT '1',
  PRIMARY KEY (`hash`),
  UNIQUE KEY `id` (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

CREATE TABLE `content` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `content_id` int(11) unsigned NOT NULL DEFAULT '0',
  `hash` char(40) CHARACTER SET ascii NOT NULL DEFAULT '',
  `title` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL DEFAULT '',
  `tags` varchar(1000) COLLATE utf8mb4_unicode_ci NOT NULL DEFAULT '',
  `category` smallint(3) unsigned NOT NULL DEFAULT '0',
  `category_name` varchar(50) CHARACTER SET ascii COLLATE ascii_bin DEFAULT '',
  `sub_category` smallint(3) unsigned NOT NULL DEFAULT '0',
  `sub_category_name` varchar(50) CHARACTER SET ascii COLLATE ascii_bin NOT NULL,
  `size` bigint(20) unsigned NOT NULL DEFAULT '0',
  `seeders` int(11) unsigned NOT NULL DEFAULT '0',
  `leechers` int(11) unsigned NOT NULL DEFAULT '0',
  `upload_date` datetime DEFAULT NULL,
  `uploader` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL DEFAULT '0',
  `uploader_level` varchar(10) CHARACTER SET ascii COLLATE ascii_bin NOT NULL DEFAULT '',
  `comments_count` int(11) unsigned NOT NULL DEFAULT '0',
  `is_updated` tinyint(1) unsigned NOT NULL DEFAULT '0',
  PRIMARY KEY (`content_id`),
  UNIQUE KEY `unique` (`id`) USING BTREE,
  KEY `hash` (`hash`),
  KEY `uploader` (`uploader`),
  KEY `sub_category` (`sub_category`),
  KEY `category` (`category`),
  KEY `title_index` (`title`),
  KEY `category_sub_category` (`category`,`sub_category`),
  KEY `seeders` (`seeders`),
  KEY `uploader_sub_category` (`uploader`,`sub_category`),
  KEY `upload_date` (`upload_date`),
  KEY `uploader_upload_date` (`uploader`,`upload_date`),
  KEY `leechers` (`leechers`),
  KEY `size` (`size`),
  KEY `uploader_seeders` (`uploader`,`seeders`),
  KEY `uploader_size` (`uploader`,`size`),
  FULLTEXT KEY `title` (`title`),
  FULLTEXT KEY `tags` (`tags`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;


mysql> explain UPDATE content a JOIN peers_data b ON a.hash = b.hash SET a.seeders = b.seeders, a.leechers = b.leechers, a.is_updated = b.is_updated ;
+----+-------------+-------+------------+--------+---------------+---------+---------+------+---------+----------+-------------+
| id | select_type | table | partitions | type   | possible_keys | key     | key_len | ref  | rows    | filtered | Extra       |
+----+-------------+-------+------------+--------+---------------+---------+---------+------+---------+----------+-------------+
|  1 | UPDATE      | a     | NULL       | ALL    | NULL          | NULL    | NULL    | NULL | 4236260 |   100.00 | NULL        |
|  1 | SIMPLE      | b     | NULL       | eq_ref | PRIMARY       | PRIMARY | 160     | func |       1 |   100.00 | Using where |
+----+-------------+-------+------------+--------+---------------+---------+---------+------+---------+----------+-------------+
2 rows in set (0.00 sec)

records in peers_data 6,367,417
records in content 4,236,268

How can i speed up the above update join query ?
i was expecting about 1 hour on not-so powerful server, but 26 hours+ is too much.

what i am doing wrong ? or missing here ?

i have tried to compensate for RAM on not-so powerful server by setting 32 GB + swap space.
is innodb buffer pool 4 gb too much ?


so far i have tried.

  1. fixing both charset type same
  2. setting innodb buffer pool size to default 128M

and now the task completed in 19.93 Hours.


as per RolandoMySQLDBA suggestion,

i am trying SUGGESTION #2 as those Indexes are required for site to function properly.

i have set innodb_buffer_pool_size = 5G

and trying

SET GLOBAL innodb_change_buffer_max_size = 50; UPDATE content a JOIN
peers_data b ON a.hash = b.hash SET a.seeders = b.seeders, a.leechers
= b.leechers, a.is_updated = b.is_updated; SET GLOBAL innodb_change_buffer_max_size = 25;

It’s been 25+ hours and the query is still running.
will update when the query finishes.

one more thing is data is not ultra important, its readily available data and i take daily backups., so later ill try suggestion 3 also. and will use whichever is faster.


update 2 :

suggestion 2 took 38.36667 hours to complete.

now trying suggestion 3.

Answer :

YOUR QUERY

Your query should be taking a long time. Why ??? Look at your query:

UPDATE content a JOIN peers_data b ON a.hash = b.hash
SET a.seeders = b.seeders, a.leechers = b.leechers, a.is_updated = b.is_updated;

Please note the columns begin updated by the query in the content table:

  • seeders
  • leechers
  • is_updated

Which of these columns are indexed ?

  • seeders
  • leechers

What indexes do you have in the content table involving those columns ?

KEY `seeders` (`seeders`), <<<<-------------------------------- THIS ONE !!!
KEY `uploader_sub_category` (`uploader`,`sub_category`),
KEY `upload_date` (`upload_date`),
KEY `uploader_upload_date` (`uploader`,`upload_date`),
KEY `leechers` (`leechers`), <<<<------------------------------ THIS ONE !!!
KEY `size` (`size`),
KEY `uploader_seeders` (`uploader`,`seeders`), <<<<------------ THIS ONE !!!

What is going on under the hood ???

If you are changing the values for seeders and leeches, these three(3) indexes are having their leaf nodes reshuffled.

Even if the the majority of the values are not being changed, rows are being locked and copies of your data are being stockpiled in your undo logs (for the sake of MVCC). This results is additional disk I/O (ibdata1 should be growing)

INNODB

The InnoDB Buffer Pool will be going through the “Perfect Storm”. Why ???

Please note the InnoDB Architecture (Picture from Percona CTO Vadim Tkachenko)

InnoDB Plumbing

Please note the Insert Buffer. Here is what the MySQL Documentation says about it:

The change buffer is a special data structure that caches changes to
secondary index pages when affected pages are not in the buffer pool.
The buffered changes, which may result from INSERT, UPDATE, or DELETE
operations (DML), are merged later when the pages are loaded into the
buffer pool by other read operations.

Unlike clustered indexes, secondary indexes are usually nonunique, and
inserts into secondary indexes happen in a relatively random order.
Similarly, deletes and updates may affect secondary index pages that
are not adjacently located in an index tree. Merging cached changes at
a later time, when affected pages are read into the buffer pool by
other operations, avoids substantial random access I/O that would be
required to read-in secondary index pages from disk.

Periodically, the purge operation that runs when the system is mostly
idle, or during a slow shutdown, writes the updated index pages to
disk. The purge operation can write disk blocks for a series of index
values more efficiently than if each value were written to disk
immediately.

Change buffer merging may take several hours when there are numerous
secondary indexes to update and many affected rows. During this time,
disk I/O is increased, which can cause a significant slowdown for
disk-bound queries. Change buffer merging may also continue to occur
after a transaction is committed. In fact, change buffer merging may
continue to occur after a server shutdown and restart (see Section
14.21.2, “Forcing InnoDB Recovery” for more information).

In memory, the change buffer occupies part of the InnoDB buffer pool.
On disk, the change buffer is part of the system tablespace, so that
index changes remain buffered across database restarts.

Here is where the “Perfect Storm” comes in: By default, InnoDB storage engine reserves up to 25% of the Buffer Pool for Change Buffering. All changes to your three(3) secondary indexes has to pile up in the InnoDB Bufffer Pool’s Insert Buffer. When the actual index pages land in the Buffer Pool, the merge process will subsequently be pushed to disk (Note the Insert Buffer inside ibdata1), producing more disk I/O.

SUGGESTIONS

SUGGESTION #1

Get rid of those indexes. Why ??? This will eliminate the change buffering needed for managing those indexes during your mass UPDATE.

Run the following query:

SELECT
    COUNT(1) rowcount,
    COUNT(DISTINCT seeders) seeders_count,
    COUNT(DISTINCT leechers) leechers_count
FROM content;

If seeders_count is less than 5% of rowcount, then the cardinality of seeders_count can disqualify the seeders from being of any use. Smae going with leechers.

For that index uploader_seeders, run this query:

SELECT COUNT(1) uploader_seeders_count FROM
(SELECT DISTINCT uploader,seeders FROM content) A;

If uploader_seeders_count is less than 5% of rowcount (from previous query), then get rid of uploader_seeders index.

To get rid of those indexes, run this:

ALTER TABLE content DROP INDEX uploader_seeders,DROP INDEX seeders,DROP INDEX leechers;

SUGGESTION #2

Change the Insert Buffer size to the max value, the run the query:

SET GLOBAL innodb_change_buffer_max_size = 50;
UPDATE content a JOIN peers_data b ON a.hash = b.hash
SET a.seeders = b.seeders, a.leechers = b.leechers, a.is_updated = b.is_updated;
SET GLOBAL innodb_change_buffer_max_size = 25;

Also, set the Buffer Pool Size (innodb_buffer_pool_size) to 20G. Change that value in my.cnf.

If you have MySQL 5.7, simply run

mysql> SET GLOBAL innodb_buffer_pool_size = 1024 * 1024 * 1024 * 20;

If you have MySQL 5.6 and prior, you must restart mysqld.

SUGGESTION #3 (RISKY)

If you do not want to drop the indexes, you could disable change buffering during the mass UPDATE, then enable it afterwards:

Run the following:

SET GLOBAL innodb_change_buffering = 'none';
UPDATE content a JOIN peers_data b ON a.hash = b.hash
SET a.seeders = b.seeders, a.leechers = b.leechers, a.is_updated = b.is_updated;
FLUSH TABLES;
SET GLOBAL innodb_change_buffering = 'all';

This is risky because this speeds up changes to indexes in favor of not having buffering to recover in the event of a crash or reboot.

SUGGESTION #4 (RISKY)

Another cavalier approach would be to disable the Double Write Buffer. Since a restart is required, do this:

  • service mysql restart --skip-innodb_doublewrite
  • Do your mass UPDATE
  • service mysql restart

This is not recommended for Production. Dev and Staging only please !!!

hash char(40) COLLATE utf8mb4_unicode_ci NOT NULL DEFAULT ''

The hash is hex, correct? Don’t make it utf8mb4; use ascii.

Pack the hash into BINARY(20) using UNHEX(). Now it is only 20 bytes.

Hashes of any kind are very “random” What is probably happening — the hash cannot be cached, so you are hitting the disk for many, maybe most, rows. This means disk I/O instead of simply CPU.

Even your faster machine will eventually slow down drastically when the table gets too big (relative to the buffer_pool_size).

Hashes, UUIDs, etc have this drawback. Try to avoid them.

If seeders is just a flag, use TINYINT (1 bytes), instead of INT (4 bytes).

Why do you have both of these?

  PRIMARY KEY (`hash`),
  UNIQUE KEY `id` (`id`)

Do other things to shrink the disk footprint of the tables. Smaller -> more cacheable -> less I/O -> faster.

You mentioned that the updated above use to take couple of minutes,

  • how many rows were updated in that time?
  • have the number of records in the tables been increased?
  • was the cache completely full ?

Also that query seems very agresive, I would suggest to create a file with all the updates in it and later using the percona pt-fifo-split to exexute the file something like:

mysql -e “select concat(‘UPDATE content a SET a.seeders =
‘,b.seeders,’, a.leechers = ‘, b.leechers,’, a.is_updated = ‘,
b.is_updated ,’ where a.hash=’, b.hash) from peers_data b ” >
/tmp/updating_seeders.sql

(Please double check the syntax).

This process wont take 2 minutes, more likely couple of hours adding a 1 sec sleep every 500 updates to reduce hight load/lag, plus the time to create the above file:

(4236268(total records)/500(records per second))/60(minutes)/60 (hours)

Server 1

              total        used        free      shared  buff/cache   available
Mem:            62G         45G         11G        107M        6.4G         16G
Swap:          2.0G        1.4G <--    640M

Server 2

              total        used        free      shared  buff/cache   available
Mem:           7.7G        7.5G         73M         24M        150M         79M
Swap:           39G        7.8G <--     32G

i have tried to compensate for RAM on not-so powerful server by setting 32 GB + swap space

Swap should not be a compensate for not-enough RAM. Yes, it’s used when there is not enough RAM available, but only to prevent program crash (respect to cloud servers). You could see that the OS aggressively swaps in Server 2.

In development machine, it’s ok to compensate with swap. But for dedicated servers, swapping causes all programs in OS to slow down (causing web application slowdown) even though it’s triggered by one application. I could assume the main bottleneck here is IO (aggressive swap + mysql IO). Try increasing the server RAM to 16GB with swappiness set to 1.


so far i have tried.

  1. fixing both charset type same

  2. setting innodb buffer pool size to default 128M

Even after reducing buffer pool from 4G to 128M, your system still needs 4G (7.8G – 3.8G) RAM memory for swap to no-swap. Always keep in mind that application servers should not swap at all except to prevent program crash. Also provide whether your 2TB disk is SSD or HDD.

is_updated column definitions are different. They need to be identical.

Repost your create tables by ADDING it with a date so we can see where you are.

Please post complete results of
A) SHOW INDEX FROM seeders;
B) SHOW INDEX FROM leechers;
when you have an opportunity.

Leave a Reply

Your email address will not be published. Required fields are marked *