MySQL partitioning relatively large DB Table

Question :

I have a table structure similar to the following:

CREATE TABLE `ProductCatalog` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `SerialNumber` varchar(20) DEFAULT NULL,
  `BasePrice` decimal(10,2) DEFAULT NULL,
  `BatchCode` tinyint(3) unsigned DEFAULT NULL,
  `Type` varchar(5) DEFAULT NULL,
  `ItemCode` varchar(5) DEFAULT NULL,
  `ArrivalDate` datetime DEFAULT NULL,
  `InsertTimestamp` int(10) unsigned NOT NULL,
  `BrandID` tinyint(3) unsigned DEFAULT NULL,
  `Model` varchar(10) NOT NULL DEFAULT 'RX209',
  `Description` text,
  PRIMARY KEY (`id`),
  KEY `idx_ic_sn_ad` (`ItemCode`, `SerialNumber`, `ArrivalDate`),
  KEY `idx_sn_ad` (`SerialNumber`, `ArrivalDate`, `ItemCode`, `BasePrice`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

There are about 600 million rows in this table, and it is growing fast. About 0.5 million records are inserted every day, sometimes more, sometimes less, so there is massive write activity during a certain period of the day (6 am to 8 pm).

Following are the queries I run against this table

SELECT * 
FROM ProductCatalog  
WHERE SerialNumber='1234567890' 
  AND ItemCode!="ABCD" 
ORDER BY id DESC LIMIT 1;
    
SELECT BasePrice 
FROM ProductCatalog 
WHERE SerialNumber='123456789' 
  AND ItemCode!="ABCD" 
  and ItemCode!="PQRS" 
  AND ItemCode!="MNOP" 
ORDER BY ID Desc LIMIT 1
    
SELECT * 
FROM ProductCatalog 
WHERE SerialNumber='123456789' AND ItemCode='ABCD'
  AND (ArrivalDate>='2019-01-01 00:00:00' AND ArrivalDate<='2020-12-31 23:59:59')  
ORDER BY ArrivalDate ASC

SELECT BatchCode
FROM ProductCatalog 
WHERE SerialNumber='123456789' 
  AND ItemCode!="ABCD" 
  and ItemCode!="PQRS" 
  AND ItemCode!="MNOP" 
ORDER BY ID Desc LIMIT 1

The queries and indexes above were optimized as suggested in my previous question:
MySQL table proper indexes for performance optimization

PROBLEM

Since the ProductCatalog table keeps getting bigger and the third query above is used frequently, I am thinking of partitioning it by range on the ArrivalDate column. This is the statement I came up with:

ALTER TABLE ProductCatalog PARTITION BY RANGE (TO_DAYS(ArrivalDate)) (
PARTITION p11 VALUES LESS THAN (TO_DAYS('2011-01-01')),
PARTITION p12 VALUES LESS THAN (TO_DAYS('2012-01-01')),
PARTITION p13 VALUES LESS THAN (TO_DAYS('2013-01-01')),
PARTITION p14 VALUES LESS THAN (TO_DAYS('2014-01-01')),
PARTITION p15 VALUES LESS THAN (TO_DAYS('2015-01-01')),
PARTITION p16 VALUES LESS THAN (TO_DAYS('2016-01-01')),
PARTITION p17 VALUES LESS THAN (TO_DAYS('2017-01-01')),
PARTITION p18 VALUES LESS THAN (TO_DAYS('2018-01-01')),
PARTITION p19 VALUES LESS THAN (TO_DAYS('2019-01-01')),
PARTITION p20 VALUES LESS THAN (TO_DAYS('2020-01-01')),
PARTITION p21 VALUES LESS THAN (TO_DAYS('2021-01-01')),
PARTITION p22 VALUES LESS THAN MAXVALUE);

Later, every year, I will reorganize the last partition like this:

ALTER TABLE ProductCatalog 
REORGANIZE PARTITION p22 INTO (
    PARTITION p22 VALUES LESS THAN (TO_DAYS('2022-01-01')),
    PARTITION p23 VALUES LESS THAN (TO_DAYS('2023-01-01')),
    PARTITION p24 VALUES LESS THAN MAXVALUE
);

QUESTIONS

On average, about 200 million rows are inserted every year. Taking this into consideration:

  1. Does this partitioning do any good in terms of performance with my existing table structure and indexes?
  2. Does this partitioning do any good for the SELECT statements I have specified above?
  3. Does this partitioning do any good for the large number of inserts happening all the time?
  4. I know we can delete a particular partition, but is it easily possible to archive a particular partition to another archive database before deleting it?
  5. Is there any better idea for getting the maximum possible read and write performance from the database?
  6. Do I need to worry about the column id int(10) when data is increasing rapidly?
  7. How will this partitioning affect the other three queries, where ArrivalDate is not used in the WHERE condition?

Answer :

So going down your list of questions:

Does this partitioning do any good in terms of performance with my
existing table structure and indexes?

No! If you always query on date, maybe.

Does this partitioning do any good for the SELECT statements I have
specified above?

Only your third query, and that’s a maybe.

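How will this partitioning affect the other three queries, where
ArrivalDate is not used in the WHERE condition?

In MySQL every index on a partitioned table is partitioned along with the data (there are no global indexes), so a query that does not filter on ArrivalDate has to probe the index in every partition that could contain matching rows, which adds overhead. Partitioning also restricts other MySQL features (for example, partitioned InnoDB tables cannot use foreign keys, and every unique key must include the partitioning column), so it shouldn’t be undertaken lightly.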
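
One way to check which partitions a query touches, assuming the table has already been partitioned as proposed in the question, is the partitions column of EXPLAIN (shown by default in MySQL 5.7 and later; older versions need EXPLAIN PARTITIONS). This is only a sketch, and the partition names are the ones from the question's ALTER TABLE:

```sql
-- Query 1 has no ArrivalDate filter, so nothing can be pruned:
-- the "partitions" column is expected to list every partition.
EXPLAIN
SELECT *
FROM ProductCatalog
WHERE SerialNumber = '1234567890'
  AND ItemCode != 'ABCD'
ORDER BY id DESC LIMIT 1;

-- Query 3 filters on an ArrivalDate range, so the optimizer can prune.
-- The list should be much shorter, centered on the partitions that hold
-- 2019 and 2020 (p20 and p21 in the question's naming); MySQL may also
-- list the first partition for range conditions on TO_DAYS().
EXPLAIN
SELECT *
FROM ProductCatalog
WHERE SerialNumber = '123456789'
  AND ItemCode = 'ABCD'
  AND ArrivalDate >= '2019-01-01 00:00:00'
  AND ArrivalDate <= '2020-12-31 23:59:59'
ORDER BY ArrivalDate ASC;
```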

Does this partitioning do any good for the large number of inserts
happening all the time?

No. You’re still appending according to the meaningless row identifier. Everything will be stuck on the last page.

I know we can delete particular partition, but is it easily possible
to archive particular partition to another archive database before
deleting?

I’m not sure about MySQL – moving/deleting partitions is relatively simple in other databases.
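
For what it is worth, MySQL 5.6 and later has ALTER TABLE ... EXCHANGE PARTITION, which swaps a partition with a non-partitioned table of the same structure. A minimal sketch of archiving the oldest partition this way follows; the database name archive_db and the table name ProductCatalog_p11 are made up for illustration, and the archive database is assumed to live on the same server:

```sql
-- 1. Create an empty table with the same structure, then strip the
--    partitioning that CREATE TABLE ... LIKE copies along.
CREATE TABLE ProductCatalog_p11 LIKE ProductCatalog;
ALTER TABLE ProductCatalog_p11 REMOVE PARTITIONING;

-- 2. Swap contents: afterwards ProductCatalog_p11 holds the rows that
--    were in p11, and p11 holds the (empty) contents of the new table.
ALTER TABLE ProductCatalog
  EXCHANGE PARTITION p11 WITH TABLE ProductCatalog_p11;

-- 3. Move the detached table into the archive database.
RENAME TABLE ProductCatalog_p11 TO archive_db.ProductCatalog_p11;

-- 4. Drop the now-empty partition from the main table.
ALTER TABLE ProductCatalog DROP PARTITION p11;
```

The exchange does not copy rows one by one, so it is normally much quicker than INSERT ... SELECT followed by DELETE.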

Do I need to worry about column id int(10) when data is increasing
rapidly?

Yes! You either fix the problem now or have to rebuild the table with a larger, equally meaningless row identifier and further your woes.
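
To put rough numbers on this, here is a sketch that checks how much of the INT UNSIGNED range (maximum 4294967295) is left and how the column could be widened. The arithmetic uses the figures from the question and assumes the ids are dense, with no large gaps:

```sql
-- How many ids remain before INT UNSIGNED overflows?
SELECT MAX(id)              AS current_max,
       4294967295 - MAX(id) AS ids_left
FROM ProductCatalog;

-- Back-of-the-envelope estimate: about 600 million ids used,
-- about 200 million new rows per year.
SELECT (4294967295 - 600000000) / 200000000 AS years_left;  -- roughly 18.5

-- Widening the column. Changing the data type rebuilds the whole table,
-- so on 600 million rows this is a long-running operation.
ALTER TABLE ProductCatalog
  MODIFY id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT;
```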

Is there any other better idea, to get maximum possible read and write
performance in the database?

Yes. It’s called define the primary key, and cluster on that.

Right now you have a big unorganized mess of rows. I’d call it a heap, but people reserve that for a certain type of unorganized mess of rows that hides the row pointer from the user instead of declaring it a primary key.

Your rows are located in the table according to their approximate insert order. To locate these rows more efficiently you slap an index on top so you can find where they’re hidden, but even with the index you could be reading one page per row, or the entire table, depending on how the data is distributed. See my answer here for a very basic illustration: Save performance with large update on Index with Included Column

So it’s not entirely obvious from your question if (SerialNumber, ArrivalDate, ItemCode) is unique, but let’s assume it is (we can adjust later if an additional column is necessary to define uniqueness). If we define your table like so:

CREATE TABLE `ProductCatalog` (
  `SerialNumber`    varchar(20) NOT NULL,  -- primary key columns must be NOT NULL
  `ItemCode`        varchar(5) NOT NULL,
  `ArrivalDate`     datetime NOT NULL,
  `BasePrice`       decimal(10,2) DEFAULT NULL,
  `BatchCode`       tinyint(3) unsigned DEFAULT NULL,
  `Type`            varchar(5) DEFAULT NULL,
  `InsertTimestamp` int(10) unsigned NOT NULL,
  `BrandID`         tinyint(3) unsigned DEFAULT NULL,
  `Model`           varchar(10) NOT NULL DEFAULT 'RX209',
  `Description`     text,
  CONSTRAINT PK_ProductCatalog PRIMARY KEY (SerialNumber, ItemCode, ArrivalDate)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Three things happen:

  1. We save space because we’re not slapping an unnecessary row pointer on every row
  2. We are guaranteeing uniqueness
  3. Rows are now grouped (for the most part) according to (SerialNumber, ItemCode, ArrivalDate).

The consequence of #3 is that each of the queries you submitted traverses a single B-tree and usually needs to read no more than 1-2 pages to be satisfied.

Yes, this can cause page splits (fragmentation), but the impact of page splits is generally less than the cost of additional indexes (which also fragment) and more read I/O. If it gets really bad (which you’ll monitor just like anything else), you can rebuild the table (clustered index).
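
As a rough sketch of the monitoring and the rebuild mentioned above (one possible approach, not the only one): information_schema.TABLES reports approximate sizes for an InnoDB table, and a rebuild can be requested with OPTIMIZE TABLE or a no-op ALTER.

```sql
-- Approximate size figures for the table in the current database.
-- DATA_FREE is only a coarse hint of reclaimable space.
SELECT TABLE_ROWS, DATA_LENGTH, INDEX_LENGTH, DATA_FREE
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME   = 'ProductCatalog';

-- Rebuild the clustered index and all secondary indexes.
-- For InnoDB, OPTIMIZE TABLE is carried out as a table rebuild plus ANALYZE.
OPTIMIZE TABLE ProductCatalog;

-- The same rebuild, spelled out explicitly.
ALTER TABLE ProductCatalog ENGINE=InnoDB;
```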

Other general observations:

  1. Don’t store timestamps as anything other than DATETIME/TIMESTAMP. The data type exists for a reason; if you have to convert on insert, that’s better than having to convert on read.
  2. If one or more of Brand, Model, or Description is dependent on the ItemCode, then that data should be stored in its own table. Makes your main table even more compact.
  3. If SerialNumber and ItemCode are fixed length, you’re probably better off using CHAR(<the max length>) versus VARCHAR.
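
As an illustration of the first observation, here is a minimal sketch of converting InsertTimestamp to a real date type. It assumes the integer holds Unix epoch seconds, which the question does not state, and the column name InsertedAt is made up for the example:

```sql
-- Add a proper date column next to the integer one.
ALTER TABLE ProductCatalog
  ADD COLUMN InsertedAt DATETIME NULL;

-- Backfill it. FROM_UNIXTIME() converts using the session time zone.
-- On a table this size the update would normally be run in batches
-- (for example by id range) rather than as one statement.
UPDATE ProductCatalog
SET InsertedAt = FROM_UNIXTIME(InsertTimestamp);

-- Once the application writes InsertedAt directly, the old column can go.
ALTER TABLE ProductCatalog
  DROP COLUMN InsertTimestamp;
```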

You say that no combination of columns is “unique”?

bbaird rightly suggests PRIMARY KEY (SerialNumber, ItemCode, ArrivalDate) to help with performance, but it won’t work because of lack of uniqueness?

In that case, do this:

 PRIMARY KEY (SerialNumber, ItemCode, ArrivalDate, ID),
 INDEX(ID)

This gives you

  • The clustering benefits all your queries, since you are always filtering on SerialNumber.

  • Having ID anywhere in the PK assures that it is unique.

  • Keeps AUTO_INCREMENT happy since ID is at the start of some index.

  • Takes almost exactly the same disk space as

      PRIMARY KEY(ID),
      INDEX(SerialNumber, ItemCode, ArrivalDate)
    
  • A slight downside is that any further secondary keys will have a copy of the PK in them, thereby being bigger when the PK is bigger.

  • PK(ID) orders the data chronologically, which has some benefit for some queries. All new rows go at the “end” of the data. PK(something else) arranges the data differently, thereby causing inserts to go in at multiple places. However, the benefit for SELECTs is likely to outweigh the downside.
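
Put together, the table could look roughly like the sketch below. It assumes SerialNumber, ItemCode and ArrivalDate can be made NOT NULL (they are nullable in the question, but primary key columns cannot be); the index name idx_id is made up:

```sql
CREATE TABLE `ProductCatalog` (
  `id`              int(10) unsigned NOT NULL AUTO_INCREMENT,
  `SerialNumber`    varchar(20) NOT NULL,
  `ItemCode`        varchar(5) NOT NULL,
  `ArrivalDate`     datetime NOT NULL,
  `BasePrice`       decimal(10,2) DEFAULT NULL,
  `BatchCode`       tinyint(3) unsigned DEFAULT NULL,
  `Type`            varchar(5) DEFAULT NULL,
  `InsertTimestamp` int(10) unsigned NOT NULL,
  `BrandID`         tinyint(3) unsigned DEFAULT NULL,
  `Model`           varchar(10) NOT NULL DEFAULT 'RX209',
  `Description`     text,
  -- Clustered on the columns the queries filter by; id makes it unique.
  PRIMARY KEY (`SerialNumber`, `ItemCode`, `ArrivalDate`, `id`),
  -- AUTO_INCREMENT needs id to be the first column of some index.
  KEY `idx_id` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
```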

PARTITIONing is not indicated for those queries. If you will be purging “old” data based on ArrivalDate, then your partitioning (and bbaird’s comments) do apply. More: http://mysql.rjweb.org/doc.php/partitionmaint
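
If purging by ArrivalDate is the goal, note that MySQL requires every unique key, including the primary key, to contain the partitioning column. With the original PRIMARY KEY (id), the ALTER TABLE from the question is rejected with an error saying that a PRIMARY KEY must include all columns in the table's partitioning function. A sketch that assumes the primary key has already been changed to (SerialNumber, ItemCode, ArrivalDate, ID), with made-up partition names:

```sql
-- Allowed because ArrivalDate is part of the primary key.
ALTER TABLE ProductCatalog
PARTITION BY RANGE (TO_DAYS(ArrivalDate)) (
    PARTITION p2019 VALUES LESS THAN (TO_DAYS('2020-01-01')),
    PARTITION p2020 VALUES LESS THAN (TO_DAYS('2021-01-01')),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);

-- Purging a whole year then becomes a quick DDL statement
-- instead of a huge DELETE.
ALTER TABLE ProductCatalog DROP PARTITION p2019;
```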

SUBPARTITIONing is useless (in my experience).
