best design for my data storage

Posted on

Question :

I am looking for advice on storing rows of data.

What I have a bunch of users (id), and data about their activities. Each user can perform a type of activity (type_id). And then about 20 columns of integer data related to the type of activity.

This next bit is slightly wrong. Originally I said: I currently fetch the data and store it once per hour, so I have a unique index over 3 columns (id, type_id, timestamp) + the 20 integers.

But actually I currently fetch and store the data every 15 mins, but only keep the last record for each hour. I have been doing this with “on duplicate key update”

The data for type_id might not change for hours, days, weeks or even never again. Some users have 3 or 4 types, others 20 or more types of activities. In a typical hour I am adding 1000’s of rows of data, most of which are duplicates (eg when sleeping none of the users data changes but I still store a row for each type_id for each hour, each row contains the 20 identical integers)

I need to look at this data and generate graphs and other reports so I am storing all the data even though it is mostly duplicates but I think I am looking for a way to do the reverse of “insert or on duplicate key update” where I only insert a new row if the integer data is different for each type_id for each user id.

If I could do that, can I then use SQL to fill in the blanks when I select data for a time period? Buy that I mean if users A type B has not changed since January I don’t want to store rows for them in February and March, but still be able to select and get data for those months. In other words select the last row before a given timestamp?

Are there better ways to store and retrieve this data?

Answer :

You could expand your activities table so that instead of a single timestamp column, it includes two timestamps, ts_from and ts_to, that indicate the time period for which the row’s data is applicable. As long as no data changes, you do not insert any new rows into your table. When data does change, you insert the new data into a new row and update the old row’s ts_to column.

Then you can create a calendar table that contains all timestamps of interest to you (in your case that table would contain a row for every day and every hour in the period of time that is of interest to you).

Finally, you get the result by cross-joining the two tables and keeping only those records whose date (from calendar table) falls between ts_from and ts_to.

For example (my example uses only dates and not datetimes, but you can change that easily):

create table calendar (
    d date
);

create table time_data (
    d_from date,
    d_to date,
    val1 int(10),
    val2 int(10)
);

insert into time_data values('2015-01-01', '2015-01-15', 10, 15);
insert into time_data values('2015-01-15', '2015-02-21', 7, 18);
insert into time_data values('2015-02-21', '2015-03-04', 1, 55);
insert into time_data values('2015-03-04', '2015-04-21', 3, 22);
insert into time_data values('2015-04-21', null, 9, 99);

I filled the calendar table with all dates from 2015-01-01 to now and did the simple query

select d, val1, val2 
from calendar, time_data  
where d >= d_from and d < ifnull(d_to, curdate()) 
order by d

The resultset contains an entry for every date in the range (2015-01-01, 2015-05-01) and val1, val2 from last entry from time_data before that date.

For a nice procedure to fill the calendar table, see this article.

However, since a cross join is an expensive operation, and you have large amounts of data, thoroughly test this (or any other) solution before applying it to your production system.

Leave a Reply

Your email address will not be published. Required fields are marked *