Database/Storage engine suggestion for a project involving heavy inserts?

Question :

I am looking into a project that involves a large number of inserts daily. I will have a list of users (for example, a set of 500k users) for whom I need to monitor certain daily activities.

For example, let there be a set of 100 users say U1,U2,…,U100

I need to insert their daily scores into my database.

Consider that the total score obtained by user U1 for the period June 30 – July 6 is as follows:

June 30 - 99
July 1 - 100
July 2 - 102
July 3 - 102
July 4 - 105
July 5 - 105
July 6 - 107

The database should keep the daily scores of each user, like:

For user U1,

July 1- 1pt (100-99)
July 2- 2pt (102-100) 
July 3- 0pt (102-102) 
July 4- 3pt (105-102) 
July 5- 0pt (105-105) 
July 6- 2pt (107-105) 
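The daily points above are simply the differences between consecutive cumulative totals; a minimal sketch in Python, using the example data from the question:

```python
# Cumulative daily scores for user U1 (June 30 - July 6), as in the example
cumulative = [99, 100, 102, 102, 105, 105, 107]

# Daily points = today's total minus yesterday's total
daily_points = [b - a for a, b in zip(cumulative, cumulative[1:])]

print(daily_points)  # [1, 2, 0, 3, 0, 2]
```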

Similarly, the database should hold daily details for the full set of users.

At a later phase, I envision taking aggregate reports from this data (total points scored per day, week, month, etc.) and comparing them with older data.

I need to start things from scratch. I am experienced with PHP as a server-side language and with MySQL, but I am confused about the database side.
Since I need to process about a million insertions daily, what should be taken care of?

Does MySQL fit my requirements? If so, what storage engine should be used? Initially I envisioned creating a user table with a foreign-key user id and monthly score tables with dates as fields. Later I got a suggestion to write things into a CSV/Excel file first and then load them into the table after a particular period.

Does file-based insertion make things more favorable in this regard?

Or should I try some other database, or NoSQL methods?


To summarize my requirements: I need a database of a million users whose points are to be updated every day as separate entries.
This will be done regularly, such that there is a field per day for each user showing daily points, which could be aggregated on a weekly/monthly/yearly basis.
I am confused about the database design, as well as about issues which could come up after deployment.
A million or more DB operations will be done every day; how should the server and related concerns be handled in this case?

Any help will be highly appreciated. Thanks in advance.

Answer :

Let’s break this question up into a few parts.

Q: I need to insert 1mm rows a day. Is that a lot?

Not really. 1mm divided by 24 hours, then by 60 minutes, then by 60 seconds gives you about 12 inserts per second. For a rough frame of reference, it's not unusual to see 1,000 inserts per second on typical commodity servers with no tuning.
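That back-of-the-envelope arithmetic is worth sanity-checking:

```python
# 1 million inserts spread evenly over one day
inserts_per_day = 1_000_000
seconds_per_day = 24 * 60 * 60  # 86,400

rate = inserts_per_day / seconds_per_day
print(round(rate, 1))  # ~11.6 inserts per second
```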

Granted, your load won’t be perfectly averaged out like that – you’ll have bursts of load – but I wouldn’t make database platform decisions based on less than 10k-20k inserts per second. Any platform out there will work fairly well.

Q: How should I structure the data?

Zoom out – don’t think tables, think databases. If you’re going to be keeping this data permanently, and it’s truly insert-only with no updates, then you probably want to partition by time at the database level. Your inserts may only go into one table in one database, but every year, create a new database (MyApp_2015) and seal the old 2014 data as read-only. You can stop backing it up (as long as you’ve still got a good backup once), stop doing index maintenance, statistics updates, etc.
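A minimal sketch of that convention, with an in-memory SQLite database standing in for one yearly MySQL database (the `daily_points` table and its columns are illustrative, not from the original post):

```python
import sqlite3
from datetime import date

def current_db_name(today=None):
    """The app only ever writes to the database for the current year."""
    today = today or date.today()
    return f"MyApp_{today.year}"

# SQLite in-memory DB standing in for one yearly MySQL database
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE daily_points (
        user_id    INTEGER NOT NULL,
        score_date TEXT    NOT NULL,  -- would be DATE in MySQL
        points     INTEGER NOT NULL,
        PRIMARY KEY (user_id, score_date)
    )
""")
db.execute("INSERT INTO daily_points VALUES (1, '2015-07-01', 1)")

print(current_db_name(date(2015, 7, 1)))  # MyApp_2015
```

The application code only ever needs the one name returned by `current_db_name()`; sealing last year's database is a separate, offline task.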

The PHP will only ever have to know about the current database for inserts, making your design a lot easier. The archival process becomes a DBA task much later down the road as long as you go in knowing that there will be more than one database involved.

If you were doing more than 1,000 inserts per second sustained, and you wanted easier performance management, then I’d also suggest building sharding into the initial design regardless of the database platform. Don’t get me wrong, any modern database can handle over 1,000 inserts per second, but designing sharding in now just gives you more flexibility later on. At 12 inserts per second, it’s just not worth the design/testing hassle.
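Sharding here could be as simple as a stable hash of the user id choosing which database receives the insert; a sketch (the shard count of 4 is an arbitrary assumption):

```python
NUM_SHARDS = 4  # illustrative; pick based on capacity planning

def shard_for(user_id: int) -> int:
    """Route a user's rows to a fixed shard by hashing the key."""
    return user_id % NUM_SHARDS

# Every row for a given user always lands on the same shard,
# so per-user reads never have to fan out across shards.
print([shard_for(u) for u in (1, 2, 3, 4, 5)])  # [1, 2, 3, 0, 1]
```

The catch, and the reason to decide this up front, is that changing `NUM_SHARDS` later means rehashing and moving existing data.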

Q: How should I do reporting?

In an ideal world, reports would not be done against the live server. Run the reports against a restored or replicated copy of the database. This does two things: it reduces load on the live server, and it validates your backups, guaranteeing that you’ve got your valuable data elsewhere.

If it’s a data warehouse system you’re looking at, you may wish to consider InfiniDB. It’s an Open Source columnar storage engine designed for DW-type loads – it falls into the NewSQL paradigm as defined by Michael Stonebraker. There’s also InfoBright, which is a similar offering. A note about InfiniDB: it has since changed hands, and it’s not clear where the project will go in the future.

Of course, you may not require any special software at all – vanilla MySQL on a good server may be right up your alley. As has been pointed out, roughly a dozen operations per second is not a great deal on modern (even relatively modest) servers.

Obviously, this depends on your particular queries/application, load &c. and of course, budget :-). I would urge you to evaluate solutions and test before committing to any given approach.


I realized that I had only considered MySQL-related solutions. Take a look here (feature comparisons) at what’s available in the free-tier servers from the big vendors. There’s also PostgreSQL, which is, in the eyes of many (and especially mine), a more capable DB server than MySQL.

[EDIT – in response to comments]

There is not so much of a learning curve here – one of the key ideas behind NewSQL is that it explicitly retains much of OldSQL such as the SQL language and ACID transactions. The fundamental point is that NewSQL splits data processing into two types of task and takes a different approach for each.

OLTP (Online Transaction Processing – banking, shopping &c.) is done on a shared nothing sharded in-memory architecture and OLAP (Online Analytical Processing – your issue – data warehousing/DW – i.e. reporting/aggregating over long periods) is done with compressed columnar stores of data.

As regards the speed issue – MySQL may be faster (and not by a whole lot) for simple read-heavy applications with little data-processing, but once you start writing complex DW type queries, PostgreSQL begins to shine!

Take a look here and here. If you are looking at comparisons, pay no attention to any sites that mention MyISAM table types – the default for MySQL now is InnoDB which does enforce ACID transactions and referential integrity.

[EDIT – in response to OP’s further comment about having no knowledge of PostgreSQL]

The fundamentals of SQL are essentially the same for MySQL and PostgreSQL. I had never worked with PostgreSQL before last year (I’m a mature student) and had to install a PostgreSQL instance on College machines. It worked a treat – straight out of the box – compiled from source. At heart, the fundamentals of database servers are reasonably similar (without being identical!).

However, it is worth noting that PostgreSQL is a) much more standards-compliant than MySQL, and b) its SQL dialect is significantly richer. Check constraints work in PostgreSQL; they don’t in MySQL (I still find this a shocking lack!). What’s even worse is that if you include them in your CREATE TABLE statements, the MySQL server will accept them without throwing an error – and then silently ignore them!

PostgreSQL has windowing functions and common table expressions (CTEs) – MySQL doesn’t (at least not in the mainstream release yet – InfiniDB does). If you want a fast, read-heavy OLTP database, I’d go with MySQL. If you want a serious DW database and you have a choice from the get-go, I’d go with PostgreSQL.
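To make the windowing-function point concrete, here is the U1 example from the question solved with LAG() – shown here in SQLite (which also supports window functions, from version 3.25) via Python, though the same SQL runs unchanged on PostgreSQL:

```python
import sqlite3  # SQLite >= 3.25 supports window functions, like PostgreSQL

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE scores (user_id INT, score_date TEXT, total INT)")

# Cumulative totals for user 1, July 1-6, from the question
rows = [(1, f"2015-07-0{d}", t) for d, t in
        zip(range(1, 7), [100, 102, 102, 105, 105, 107])]
db.executemany("INSERT INTO scores VALUES (?, ?, ?)", rows)

# LAG() turns cumulative totals into daily points in one pass:
# no self-join and no application-side loop needed.
points = db.execute("""
    SELECT score_date,
           total - LAG(total) OVER (PARTITION BY user_id
                                    ORDER BY score_date) AS points
    FROM scores
    ORDER BY score_date
""").fetchall()

# July 1 has no prior row in the table, so its delta is NULL;
# the remaining days match the question: 2, 0, 3, 0, 2.
print(points)
```

Doing the same thing without window functions means a self-join or pushing the subtraction into PHP, which is exactly the extra code this feature saves you.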

IMHO, you’ll save yourself a lot of work in the long run by not having to implement CTEs and windowing functions in your own code. Take a look at what Baron Schwartz wrote about this here and here. Schwartz wrote the book on MySQL high performance.

My read is that your main challenge will not be storing and updating the data – as others have suggested, I would expect most DBMS can handle that on commodity hardware.

I expect your main challenge will be aggregations at scale. For that I would look into cube/OLAP technology e.g. SQL Server Analysis Services.

MySQL will fit your scenario, and the InnoDB engine will suit it. Also, please have a look at the link below on storage engines.
