Question:
We developed a system with a very high volume of data but a very simple data structure. We had only `cellX`, `cellY`, `timeStamp` and `value` columns.
Operations are:
- Insert 455,000+ rows per day.
- Query using range filters on `cellX`, `cellY` and `timeStamp` (a sample query is sketched after this list).
- Queries are not required to return instantly; we can notify the user when the requested data is ready.
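For concreteness, a query of this shape might look like the following (the table name, parameter names and values here are illustrative only, not our actual code):

```sql
-- Illustrative sketch of the range-filter query described above;
-- object names and values are assumptions.
DECLARE @x1 int = 10, @x2 int = 20,
        @y1 int = 10, @y2 int = 20,
        @t1 int = 100, @t2 int = 200;

SELECT cellX, cellY, [timeStamp], value
FROM   dbo.CellData
WHERE  cellX       BETWEEN @x1 AND @x2
  AND  cellY       BETWEEN @y1 AND @y2
  AND  [timeStamp] BETWEEN @t1 AND @t2;
```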
Because the data was too large and we required an index for querying, we used the following scheme (a T-SQL sketch follows the list):
- Use SQL Server.
- Clustered index on `cellX`, `cellY` and `timeStamp`.
- Separate tables for every year, so the total number of rows per table remains within a limit (~166 million).
- A custom format for `timeStamp`: skip the year and keep only month, day and hour. We were able to keep it within a 16-bit int.
- Partition each year's table.
- Insert one day of data at a time, using partition switching to keep the database live while inserting data.
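For illustration, a minimal T-SQL reconstruction of this scheme is below. The object names, the daily partition boundaries, and the exact bit-packing of the timestamp are illustrative assumptions rather than our actual DDL.

```sql
-- Hypothetical reconstruction; all names and encodings are assumptions.

-- Pack month/day/hour into 14 bits: month*1024 + day*32 + hour.
-- Maximum value 12*1024 + 31*32 + 23 = 13303, which fits in a SMALLINT.

-- One boundary per day, e.g. Jan 1 = 1056, Jan 2 = 1088, Jan 3 = 1120, ...
CREATE PARTITION FUNCTION pfDay (smallint)
    AS RANGE RIGHT FOR VALUES (1056, 1088, 1120 /* ...remaining days... */);

CREATE PARTITION SCHEME psDay
    AS PARTITION pfDay ALL TO ([PRIMARY]);

CREATE TABLE dbo.CellData_2015
(
    cellX       int      NOT NULL,
    cellY       int      NOT NULL,
    [timeStamp] smallint NOT NULL,   -- packed month/day/hour
    value       float    NOT NULL
) ON psDay ([timeStamp]);

CREATE CLUSTERED INDEX cixCellData
    ON dbo.CellData_2015 (cellX, cellY, [timeStamp])
    ON psDay ([timeStamp]);

-- Daily load: bulk-insert into an identically structured and identically
-- partitioned staging table, then switch the finished day's partition in:
-- ALTER TABLE dbo.CellData_2015_Staging
--     SWITCH PARTITION $PARTITION.pfDay(@dayKey)
--     TO dbo.CellData_2015 PARTITION $PARTITION.pfDay(@dayKey);
```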
This has worked well so far. Although we notify the user after the data is ready, the delay is never more than a few seconds as long as the query is reasonable.
But recently we got the opportunity to receive more precise data, and the data volume has increased 68-fold! Therefore, we now have to:
- Insert 30+ million rows per day.
- Store 11 billion rows in a table per year. This can be reduced by making quarterly (2.7 billion) or monthly (1 billion) tables.
It is possible that we will be able to receive even more precise data in a year or two, so the data volume may again increase by a significant factor.
The question is: will the scheme we are using sustain this, or should we migrate to another scheme, perhaps leaving SQL Server for another database system?
Edit
The three dimension columns `cellX`, `cellY` and `timeStamp` are very regular in nature. You can define all of them by f(x) = mx + c, for integer x ranging over (0, 1, 2, ..., X).
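In other words, each dimension is an arithmetic progression: given m and c, any stored value maps back to its integer index via x = (value - c) / m. For example (hypothetical numbers), m = 0.5 and c = 10 would give the values 10, 10.5, 11, ... for x = 0, 1, 2, ...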
Answer:
I’ve worked with a 30+ billion row monthly partitioned table with page compression and 10 years of history. The table schema was fairly simple, with a datetime2(2) clustered index, three non-clustered indexes on varchar columns, and a couple of non-indexed columns. Storage was about 2 TB and it performed reasonably well. SqlBulkCopy was used to insert about 15M rows continuously throughout the day, as the data was needed in near real time.
Based on this anecdote, I’m confident SQL Server could handle your expected volume with adequately sized hardware. That being said, I completely agree with @DamianoVerzulli that your application is an excellent candidate for a less costly NoSQL solution due to your tolerance for delay.
Are you not getting rapid fragmentation of that clustered index? That would negatively impact both inserts and selects.
Consider a different index. You are loading data daily (I assume for the current or prior day). With the primary key on (cellX, cellY, timeStamp), each day's rows land all over the index, which gives maximum fragmentation. Consider instead a primary key on (timeStamp, cellX, cellY), and load the data sorted in that key order, even if you have to load it into a staging table to sort it first. That gives minimum fragmentation, since each day's rows append to the end of the index.
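A rough T-SQL sketch of that change, reusing the assumed object names from the earlier sketch rather than the asker's actual schema, might look like:

```sql
-- Hypothetical sketch; object names mirror the earlier sketch and are
-- assumptions, not from the question.

-- Re-create the clustered index with timeStamp leading the key:
CREATE CLUSTERED INDEX cixCellData
    ON dbo.CellData_2015 ([timeStamp], cellX, cellY)
    WITH (DROP_EXISTING = ON)
    ON psDay ([timeStamp]);

-- Load each day's rows pre-sorted in clustered-key order, so new pages
-- fill sequentially instead of splitting existing ones:
INSERT INTO dbo.CellData_2015 WITH (TABLOCK)
        ([timeStamp], cellX, cellY, value)
SELECT [timeStamp], cellX, cellY, value
FROM   dbo.DailyFeed            -- staging table holding one day's data
ORDER BY [timeStamp], cellX, cellY;
```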
If you really need a (cellX, cellY) index for query performance, then put it in a separate non-clustered index, possibly on its own partition scheme, with a fill factor below 100%, and perform regular index maintenance. If the load is done off hours, it might be faster to disable the index, insert, and then rebuild it (in which case you can rebuild with a fill factor of 100%).
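In T-SQL terms, and again with assumed rather than actual names, that might look like the sketch below.

```sql
-- Hypothetical sketch; index and table names are assumptions.

-- Secondary index for (cellX, cellY) range queries, leaving free space
-- on each page to absorb out-of-order inserts:
CREATE NONCLUSTERED INDEX ixCellXY
    ON dbo.CellData_2015 (cellX, cellY, [timeStamp])
    WITH (FILLFACTOR = 90);

-- Off-hours alternative: disable before the load, rebuild afterwards
-- (the rebuild can then use fully packed pages, FILLFACTOR = 100):
ALTER INDEX ixCellXY ON dbo.CellData_2015 DISABLE;
-- ... bulk insert the day's data ...
ALTER INDEX ixCellXY ON dbo.CellData_2015 REBUILD WITH (FILLFACTOR = 100);
```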