Database design for a lot of rows with data archive

Question:

I’m new to database design at a larger scale. I have some knowledge of and experience with MySQL, but now I’m running into performance issues.

Currently I’m using MySQL with 3 databases:

  • History
  • Snapshot
  • Static

I insert data into the history database. For one user there are approximately 5 million rows per month (and the user count grows slowly every week). As my SELECT queries have become slow, probably due to the large amount of data, I use triggers that copy inserts into the snapshot database and delete older entries there. SELECT statements are very quick on the snapshot database, but as I add more triggers, MySQL seems unable to withstand the load. The static database holds just the users table and some metadata.

So my question is: how can I achieve push/pop behaviour with my tables, using MySQL or any other database technology?

I want to insert data into the snapshot database, and all rows with a timestamp older than 12 hours should be moved to the history database (for later data analysis, once or twice a week) and deleted from the snapshot database, so it keeps a minimal amount of data to work with. Is this achievable with MySQL or something else? What would be the recommended hardware setup for such a design?

As for writing to and reading from the database: I do most of the writing from Python with pymysql, plus some reading; most of the reading is done with the Node.js mysql driver.

I hope there is a not-too-complicated solution to my problem, but I will be happy with any advice or pointers on what to use.

Answer:

A sliding time window is very efficient to implement with PARTITIONs. Create 14 hourly partitions using PARTITION BY RANGE(..). Use REORGANIZE PARTITION to create a new partition every hour. Use “transportable tablespaces” (assuming you are on 5.6, or preferably 5.7, with InnoDB) to move the ‘old’ partition out of the main table and into the archive table.
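
As a concrete sketch of that rotation (the table, column, and partition names here are invented for illustration; adapt them to your schema):

```sql
-- Snapshot table, partitioned by hour. The partitioning column must be
-- part of every unique key, hence PRIMARY KEY (id, ts).
CREATE TABLE snapshot_data (
  id      BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  user_id INT NOT NULL,
  ts      TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  payload VARCHAR(255),
  PRIMARY KEY (id, ts)
) ENGINE=InnoDB
PARTITION BY RANGE (UNIX_TIMESTAMP(ts)) (
  PARTITION p2016030100 VALUES LESS THAN (UNIX_TIMESTAMP('2016-03-01 01:00:00')),
  -- ... one partition per hour ...
  PARTITION pfuture     VALUES LESS THAN MAXVALUE
);

-- Every hour: split the (empty) pfuture to open a new hourly partition.
ALTER TABLE snapshot_data REORGANIZE PARTITION pfuture INTO (
  PARTITION p2016030114 VALUES LESS THAN (UNIX_TIMESTAMP('2016-03-01 15:00:00')),
  PARTITION pfuture     VALUES LESS THAN MAXVALUE
);

-- Detach the oldest partition: swap its data into an empty, identically
-- structured, non-partitioned staging table, then drop the now-empty partition.
ALTER TABLE snapshot_data EXCHANGE PARTITION p2016030100 WITH TABLE archive_staging;
ALTER TABLE snapshot_data DROP PARTITION p2016030100;

-- From archive_staging the rows can be copied into the history database
-- (INSERT ... SELECT) or shipped off as a transportable tablespace.
```

The point of this pattern: EXCHANGE PARTITION, DROP PARTITION, and a REORGANIZE of the empty pfuture are all nearly instantaneous metadata operations, unlike a DELETE of millions of rows.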

More on timed partitions (including code).

More thoughts:

  • Although TRIGGERs may be functional, I worry that they are costly in your situation. Can you elaborate? Perhaps partitioning can serve a purpose there, too.
  • I hope the Query cache is completely turned off (query_cache_type = 0 and query_cache_size = 0); otherwise it could be causing some of the slowdown.
  • How do you perform the millions(?) of INSERTs per day? One row at a time, or batched (much better)?
  • For the ‘data analysis’, do you build and maintain “Summary tables”? With those, you may not need the triggers, the archive, etc. Reporting from a summary table is easily 10x faster, and does not touch the main table.
  • Have you looked into Master-Slave replication as a way of (1) archiving the data, and (2) providing a Reporting platform separate from the main table? This, too, could eliminate the need for the triggers/archive/etc.
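
To illustrate the summary-table idea, here is a hypothetical per-user, per-hour rollup (names and columns invented; adapt to what your reports actually need), maintained once an hour by cron or a MySQL EVENT:

```sql
CREATE TABLE summary_hourly (
  user_id   INT NOT NULL,
  hr        DATETIME NOT NULL,   -- timestamp truncated to the hour
  row_count INT NOT NULL,
  PRIMARY KEY (user_id, hr)
) ENGINE=InnoDB;

-- Run once per hour, summarizing the hour that just ended.
INSERT INTO summary_hourly (user_id, hr, row_count)
SELECT user_id,
       DATE_FORMAT(ts, '%Y-%m-%d %H:00:00'),
       COUNT(*)
FROM snapshot_data
WHERE ts >= DATE_FORMAT(NOW() - INTERVAL 1 HOUR, '%Y-%m-%d %H:00:00')
  AND ts <  DATE_FORMAT(NOW(), '%Y-%m-%d %H:00:00')
GROUP BY user_id, DATE_FORMAT(ts, '%Y-%m-%d %H:00:00');
```

Reports then read summary_hourly instead of scanning millions of raw rows, which is why the raw data can be archived aggressively.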
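
On the batching point, a minimal Python sketch (table and column names are placeholders): group incoming rows into chunks and hand each chunk to pymysql’s cursor.executemany, so each round trip inserts up to 1,000 rows instead of one.

```python
import itertools

def batches(rows, size=1000):
    """Yield consecutive lists of at most `size` rows each."""
    it = iter(rows)
    while True:
        chunk = list(itertools.islice(it, size))
        if not chunk:
            return
        yield chunk

# Hypothetical usage with an existing pymysql connection `conn`:
# with conn.cursor() as cur:
#     for chunk in batches(rows):
#         cur.executemany(
#             "INSERT INTO snapshot_data (user_id, ts, payload) "
#             "VALUES (%s, %s, %s)",
#             chunk)
# conn.commit()
```

Committing once per batch (or once per load) rather than per row is where most of the speedup comes from.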

Could you provide more numbers — how many users? inserts/sec?
