I have a web application where users can view performance data. There will be around 200 to 250 users, and each user will have roughly 100,000 rows of data per day.
There is only one table; here is its structure:
CREATE TABLE `performance` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `user_id` int(11) NOT NULL,
  `OfferId` varchar(128) COLLATE utf8_unicode_ci NOT NULL,
  `title` varchar(256) COLLATE utf8_unicode_ci NOT NULL,
  `CampaignName` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
  `Impressions` int(11) NOT NULL,
  `Clicks` int(11) NOT NULL,
  `Ctr` double(10,2) NOT NULL,
  `AverageCpc` int(11) NOT NULL,
  `Cost` int(11) NOT NULL,
  `ConvertedClicks` int(11) NOT NULL,
  `Conversions` int(11) NOT NULL,
  `CostPerConvertedClick` int(11) NOT NULL,
  `ClickConversionRate` double(10,2) NOT NULL,
  `ConversionValue` int(11) NOT NULL,
  `Roi` double(10,2) NOT NULL,
  `AdGroupName` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
  `Brand` varchar(32) COLLATE utf8_unicode_ci NOT NULL,
  `Price` double(10,2) NOT NULL,
  `L1` varchar(32) COLLATE utf8_unicode_ci NOT NULL,
  `L2` varchar(32) COLLATE utf8_unicode_ci NOT NULL,
  `L3` varchar(32) COLLATE utf8_unicode_ci NOT NULL,
  `L4` varchar(32) COLLATE utf8_unicode_ci NOT NULL,
  `L5` varchar(32) COLLATE utf8_unicode_ci NOT NULL,
  `Attribute0` varchar(32) COLLATE utf8_unicode_ci NOT NULL,
  `Attribute1` varchar(32) COLLATE utf8_unicode_ci NOT NULL,
  `Attribute2` varchar(32) COLLATE utf8_unicode_ci NOT NULL,
  `Attribute3` varchar(32) COLLATE utf8_unicode_ci NOT NULL,
  `Attribute4` varchar(32) COLLATE utf8_unicode_ci NOT NULL,
  `Date` date NOT NULL,
  `DataStatus` tinyint(1) NOT NULL DEFAULT '0',
  `valueChanged` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
  `attribute_changed` int(11) DEFAULT NULL,
  `updated_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`,`user_id`),
  KEY `performance_offerid_index` (`OfferId`)
) ENGINE=InnoDB AUTO_INCREMENT=0 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
/*!50100 PARTITION BY HASH (user_id) PARTITIONS 256 */;
The volume of data in this table will grow to billions of rows and keep increasing.
The GROUP BY clause can be on any of the following columns, and the combination is random: it is controlled from the front-end. For example, if a user wants to see the Brand column on the front-end, then the query will include the Brand column; otherwise it won't.
OfferId will always be part of the GROUP BY clause.
Group-by columns:
OfferId, CampaignName, AdGroupName, Brand, Attribute0, Attribute1, Attribute2, Attribute3, Attribute4, L1, L2, L3, L4, L5, Title
The aggregate functions will apply to these columns:
Impressions, Clicks, Ctr, AverageCpc, Cost, ConvertedClicks, Conversions, CostPerConvertedClick, ClickConversionRate, ConversionValue, Roi
Every query must include a user_id clause and a date clause. Beyond that, there can be multiple conditions ANDed together, as the front-end interface allows the user to filter the data.
The queries are taking too much time. I loaded 2.5 million records for user_id 1, and when I run a query with a simple GROUP BY on 3 columns, it takes around 250 seconds, but the maximum acceptable time is 5 seconds. How can I optimize it?
The query is
SELECT `OfferId` AS `OfferId`,
       `Attribute0` AS `Attribute0`,
       SUM(`Impressions`) AS `Impressions`,
       SUM(`Ctr`) AS `Ctr`,
       ROUND(AVG(`AverageCpc`),2) AS `AverageCpc`,
       SUM(`Cost`) AS `Cost`,
       SUM(`ConvertedClicks`) AS `ConvertedClicks`,
       SUM(`Conversions`) AS `Conversions`,
       ROUND(AVG(`CostPerConvertedClick`),2) AS `CostPerConvertedClick`,
       ROUND(AVG(`ClickConversionRate`),2) AS `ClickConversionRate`,
       ROUND(AVG(`ConversionValue`),2) AS `ConversionValue`,
       ROUND(AVG(`Roi`),2) AS `Roi`,
       `id` AS `id`
FROM `performance`
WHERE (`user_id` = 13)
  AND (`Date` BETWEEN '2016-01-08 00:00:00' AND '2016-02-08 23:59:59')
GROUP BY `OfferId`, `Attribute0`
HAVING Cost > 100
ORDER BY Cost
LIMIT 0, 10;
I could add a covering index, but the problem is that the query is generated randomly: sometimes certain columns will be included in the GROUP BY and sometimes not.
For example, say I add a covering index `attribute_index` on
Attribute0, Attribute1, Attribute2, Attribute3, Attribute4
in that order. But there is no guarantee that all 5 columns will always be used in the query, so if only a subset of them (or columns in a different order) appears, attribute_index will not be used.
I am using RDS Aurora. Is there any solution to my problem?
Do not use double(10,2); it involves two roundings. Use either DECIMAL(10,2) or plain DOUBLE. For monetary values, use DECIMAL(...) so that there will be no rounding. DOUBLE is 8 bytes; DECIMAL(10,2) is 5 bytes.
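A sketch of the corresponding type changes (an illustration of the advice above, not a tested migration; run it on a copy first, since ALTER on a billion-row table is expensive):

```sql
-- Replace double(10,2) with DECIMAL(10,2) for the rate/money columns:
ALTER TABLE performance
  MODIFY `Ctr`                 DECIMAL(10,2) NOT NULL,
  MODIFY `ClickConversionRate` DECIMAL(10,2) NOT NULL,
  MODIFY `Roi`                 DECIMAL(10,2) NOT NULL,
  MODIFY `Price`               DECIMAL(10,2) NOT NULL;
```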
Can you have "negative" Clicks? If not, I suggest making such counters UNSIGNED.
It is usually a bad idea to splay arrays across columns (L1, …, L5 and Attribute0, …, Attribute4). Instead, have another table for them. (OK, I don't know what impact that would have on queries that GROUP BY the attributes.)
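One way that normalization could look, as a sketch (the child table and column names here are illustrative, not from the original schema):

```sql
-- Hypothetical child table replacing the splayed Attribute0..Attribute4
-- (and similarly L1..L5). One row per (fact row, attribute position).
CREATE TABLE performance_attribute (
  performance_id BIGINT UNSIGNED NOT NULL,  -- references performance.id
  position       TINYINT UNSIGNED NOT NULL, -- 0..4 for Attribute0..Attribute4
  value          VARCHAR(32) COLLATE utf8_unicode_ci NOT NULL,
  PRIMARY KEY (performance_id, position),
  KEY (value)
) ENGINE=InnoDB;
```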
PARTITION BY HASH has no known case of improving performance.
attribute_changed int(11) DEFAULT NULL: if that is just a flag, make it
TINYINT UNSIGNED NOT NULL DEFAULT '0'; that will save 3 bytes per row, plus the space needed to track NULLs.
Will you have a billion different Campaigns? CampaignName varchar(255) should be normalized and replaced by, say, a
MEDIUMINT UNSIGNED (3 bytes) to save a lot of space. Ditto for any other varchar that repeats a lot.
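A minimal sketch of that normalization (the `campaign` table and its column names are my illustration, not part of the original schema):

```sql
-- Lookup table: each distinct campaign name is stored once.
CREATE TABLE campaign (
  campaign_id MEDIUMINT UNSIGNED NOT NULL AUTO_INCREMENT,
  name        VARCHAR(255) COLLATE utf8_unicode_ci NOT NULL,
  PRIMARY KEY (campaign_id),
  UNIQUE KEY (name)
) ENGINE=InnoDB;

-- The fact table then carries the 3-byte id instead of the long name:
-- ALTER TABLE performance
--   ADD COLUMN campaign_id MEDIUMINT UNSIGNED NOT NULL,
--   DROP COLUMN CampaignName;
```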
If you are expecting billions of rows, squeezing out a few bytes per row adds up. That, in turn, decreases the I/O, thereby speeding up queries.
Your covering index on the 5 attributes could consume a huge amount of space. Furthermore, with a billion rows, it may slow INSERTs down to one row per disk hit; on traditional drives, that is only about 100/sec, and you need more than 300/sec.
Since you say the users must include a Date clause, it may be practical to use
PARTITION BY RANGE(TO_DAYS(Date)). But user_id is also a requirement in every query, so instead I would suggest
INDEX(user_id, Date) without partitioning. That composite index is much better than an index on Date alone. Adding more columns to that index will not help.
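Concretely, the two options sketched (the index name and the partition boundaries below are illustrative):

```sql
-- Option A (suggested): composite index on the unpartitioned table.
ALTER TABLE performance ADD INDEX idx_user_date (user_id, `Date`);

-- Option B: range-partition by date instead. A sketch only; real
-- boundaries would be scripted per month, and new partitions added over time.
-- ALTER TABLE performance
--   PARTITION BY RANGE (TO_DAYS(`Date`)) (
--     PARTITION p201601 VALUES LESS THAN (TO_DAYS('2016-02-01')),
--     PARTITION p201602 VALUES LESS THAN (TO_DAYS('2016-03-01')),
--     PARTITION pmax    VALUES LESS THAN MAXVALUE
--   );
```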
Your example has a Date range of one month plus one day; is that intentional, or just a goof?
One table per client does not help.
Do not partition by user; 250 partitions brings its own performance problems. About 50 is the practical limit for the number of partitions.
Summary tables (plural) are the only way you will make this multi-billion-row table perform adequately. The
PRIMARY KEY of each summary table would include user_id and date (probably truncated to the day), plus a couple of other "dimensions". Then several aggregations (mostly
SUMs) would be the other columns. These tables would be significantly smaller than your 'Fact' table, and they can afford to have multiple indexes. Beware of
AVG, because the average of averages is not mathematically correct. Instead, store the SUM and the
COUNT, then calculate the average as SUM(sums) / SUM(counts).
More on Data Warehousing and Summary Tables.
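A sketch of one such daily summary table for the example query's dimensions (OfferId, Attribute0); the table and column names are illustrative:

```sql
-- One row per (user, day, OfferId, Attribute0). Store SUMs and a row count,
-- never pre-computed averages, so results can be re-aggregated correctly.
CREATE TABLE performance_daily (
  user_id       INT UNSIGNED NOT NULL,
  `Date`        DATE NOT NULL,
  OfferId       VARCHAR(128) COLLATE utf8_unicode_ci NOT NULL,
  Attribute0    VARCHAR(32)  COLLATE utf8_unicode_ci NOT NULL,
  Impressions   BIGINT UNSIGNED NOT NULL,
  Clicks        BIGINT UNSIGNED NOT NULL,
  Cost          BIGINT NOT NULL,
  SumAverageCpc BIGINT NOT NULL,          -- SUM of AverageCpc, for later AVG
  RowCount      INT UNSIGNED NOT NULL,    -- needed to compute correct averages
  PRIMARY KEY (user_id, `Date`, OfferId, Attribute0)
) ENGINE=InnoDB;

-- Reading from it: derive the average from SUMs, not AVG of AVGs.
-- SELECT OfferId, ROUND(SUM(SumAverageCpc) / SUM(RowCount), 2) AS AverageCpc
-- FROM performance_daily
-- WHERE user_id = 13 AND `Date` BETWEEN '2016-01-08' AND '2016-02-08'
-- GROUP BY OfferId;
```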
(Yes, some of my statements do disagree with previous Comments and Answers.)
You say that "In every query the user_id clause and date clause is a must", so an index on
(user_id, Date) should help everything. Since the other conditions are variable, this is probably the best you can do for indexing this query. You can add other columns to the end of the index if they are commonly used. With the current structure, the only thing the optimizer can do is partition pruning, which will help, but probably not as much as my suggested index. I'd also compare performance with and without partitioning, since with this new index partitioning may not help much.
After some discussion, I have revised my suggestions. If the partitioning remains intact, with one user per partition, the best index is on
(Date). Without partitioning, however, the best index is on
(user_id, Date), as originally suggested.