Question:
I have a web application where users can view performance data. There will be around 200 to 250 users, and each user will have roughly
100,000 rows of data per day.
There is only one table, and here is its structure:
CREATE TABLE `performance` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`user_id` int(11) NOT NULL,
`OfferId` varchar(128) COLLATE utf8_unicode_ci NOT NULL,
`title` varchar(256) COLLATE utf8_unicode_ci NOT NULL,
`CampaignName` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
`Impressions` int(11) NOT NULL,
`Clicks` int(11) NOT NULL,
`Ctr` double(10,2) NOT NULL,
`AverageCpc` int(11) NOT NULL,
`Cost` int(11) NOT NULL,
`ConvertedClicks` int(11) NOT NULL,
`Conversions` int(11) NOT NULL,
`CostPerConvertedClick` int(11) NOT NULL,
`ClickConversionRate` double(10,2) NOT NULL,
`ConversionValue` int(11) NOT NULL,
`Roi` double(10,2) NOT NULL,
`AdGroupName` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
`Brand` varchar(32) COLLATE utf8_unicode_ci NOT NULL,
`Price` double(10,2) NOT NULL,
`L1` varchar(32) COLLATE utf8_unicode_ci NOT NULL,
`L2` varchar(32) COLLATE utf8_unicode_ci NOT NULL,
`L3` varchar(32) COLLATE utf8_unicode_ci NOT NULL,
`L4` varchar(32) COLLATE utf8_unicode_ci NOT NULL,
`L5` varchar(32) COLLATE utf8_unicode_ci NOT NULL,
`Attribute0` varchar(32) COLLATE utf8_unicode_ci NOT NULL,
`Attribute1` varchar(32) COLLATE utf8_unicode_ci NOT NULL,
`Attribute2` varchar(32) COLLATE utf8_unicode_ci NOT NULL,
`Attribute3` varchar(32) COLLATE utf8_unicode_ci NOT NULL,
`Attribute4` varchar(32) COLLATE utf8_unicode_ci NOT NULL,
`Date` date NOT NULL,
`DataStatus` tinyint(1) NOT NULL DEFAULT '0',
`valueChanged` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
`attribute_changed` int(11) DEFAULT NULL ,
`updated_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`id`,`user_id`),
KEY `performance_offerid_index` (`OfferId`)
) ENGINE=InnoDB AUTO_INCREMENT=0 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
/*!50100 PARTITION BY HASH (user_id)
PARTITIONS 256 */;
The volume of data in this table will be in the billions of rows and will keep increasing.
The GROUP BY clause can be on any of the following columns, and the combination is effectively random; it is controlled from the front-end. For example, if a user wants to see the Brand column on the front-end, then the query will include Brand in the GROUP BY; otherwise it won't.
OfferId will always be part of the GROUP BY clause.
GROUP BY column list:
OfferId
CampaignName
AdGroupName
Brand
Attribute0
Attribute1
Attribute2
Attribute3
Attribute4
L1
L2
L3
L4
L5
Title
Aggregate functions will be applied to these columns:
Impressions
Clicks
Ctr
AverageCpc
Cost
ConvertedClicks
Conversions
CostPerConvertedClick
ClickConversionRate
ConversionValue
Roi
In every query the user_id condition and the Date condition are a must. Beyond that, there can be multiple conditions ANDed together, since the front-end interface allows the user to filter the data.
The queries are taking too much time. I loaded 2.5 million records for user_id 1, and when I run a query with a simple GROUP BY on 3 columns it takes around 250 seconds, but the maximum acceptable time is 5 seconds. How can I optimize it?
The query is:
SELECT `OfferId` AS `OfferId`,`Attribute0` AS `Attribute0`,
SUM(`Impressions`) AS `Impressions`,SUM(`Ctr`) AS `Ctr`,
ROUND(AVG(`AverageCpc`),2) AS `AverageCpc`,SUM(`Cost`) AS `Cost`,
SUM(`ConvertedClicks`) AS `ConvertedClicks`,SUM(`Conversions`) AS `Conversions`,
ROUND(AVG(`CostPerConvertedClick`),2) AS `CostPerConvertedClick`,
ROUND(AVG(`ClickConversionRate`),2) AS `ClickConversionRate`,
ROUND(AVG(`ConversionValue`),2) AS `ConversionValue`,
ROUND(AVG(`Roi`),2) AS `Roi`,`id` AS `id` FROM `performance`
WHERE (`user_id` = 13)
AND (`Date` BETWEEN '2016-01-08 00:00:00'
AND '2016-02-08 23:59:59')
GROUP BY `OfferId`,`Attribute0`
having Cost > 100
order by Cost
LIMIT 0,10;
I can add a covering index, but the problem is that the query is generated randomly: sometimes certain columns will be included in the GROUP BY and sometimes not.
For example, let's say I add a covering index 'attribute_index' on
Attribute0
Attribute1
Attribute2
Attribute3
Attribute4
these columns, in that order. But there is no guarantee that those 5 columns will always be used in the query. So if only
Attribute0
Attribute4
are used in the query, the attribute_index will not be used.
I am using RDS Aurora. Is there any solution to my problem?
Answer:
Do not use double(10,2); it involves two roundings. Either use DECIMAL(10,2) or plain DOUBLE. For monetary values (Price) use DECIMAL(...) so that there will be no rounding. DOUBLE is 8 bytes; DECIMAL(10,2) is 5 bytes.
Can you have "negative" Clicks? Suggest INT UNSIGNED.
It is usually a bad idea to splay arrays across columns (L1, … and Attribute1, …); instead, have another table for them. (OK, I don't know what impact that would have on queries that GROUP BY the attributes.)
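If you did go that route, a rough sketch of such a child table, purely for illustration (the table and column names here are hypothetical, not part of your schema):

-- Hypothetical child table: one row per (performance row, attribute slot)
CREATE TABLE performance_attribute (
  performance_id BIGINT NOT NULL,      -- references performance.id
  slot TINYINT UNSIGNED NOT NULL,      -- e.g. 0-4 for Attribute0-4, 10-14 for L1-L5
  value VARCHAR(32) COLLATE utf8_unicode_ci NOT NULL,
  PRIMARY KEY (performance_id, slot)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;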
PARTITION BY HASH has no known case of improving performance.
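If you decide to drop it, the statement is simply the following (rebuilding a table of this size takes time, so plan for it):

ALTER TABLE performance REMOVE PARTITIONING;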
attribute_changed int(11) DEFAULT NULL: if that is just a flag, make it TINYINT UNSIGNED NOT NULL DEFAULT '0'; that will save 3 bytes, plus the space for NULL.
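If you adopt those datatype suggestions, a minimal sketch of the change (column names are taken from your CREATE TABLE; an ALTER like this rewrites every row, so measure the cost first):

ALTER TABLE performance
  MODIFY `Ctr` DECIMAL(10,2) NOT NULL,
  MODIFY `Price` DECIMAL(10,2) NOT NULL,
  MODIFY `Clicks` INT UNSIGNED NOT NULL,
  MODIFY `attribute_changed` TINYINT UNSIGNED NOT NULL DEFAULT '0';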
Will you have a billion different Campaigns? CampaignName varchar(255) should be normalized and replaced by, say, a MEDIUMINT UNSIGNED (3 bytes) to save a lot of space. Ditto for any other varchars that repeat a lot.
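A sketch of that normalization (the campaign table and campaign_id column are illustrative names, not part of your current schema):

CREATE TABLE campaign (
  campaign_id MEDIUMINT UNSIGNED NOT NULL AUTO_INCREMENT,
  CampaignName VARCHAR(255) COLLATE utf8_unicode_ci NOT NULL,
  PRIMARY KEY (campaign_id),
  UNIQUE KEY (CampaignName)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

-- performance.CampaignName then becomes campaign_id MEDIUMINT UNSIGNED (3 bytes),
-- and you JOIN back to campaign only when the name is needed for display.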
If you are expecting billions of rows, squeezing out a few bytes per row adds up. That, in turn, will decrease the I/O, thereby speeding up queries.
Your covering index on the 5 attributes could consume a huge amount of space. Furthermore, with a billion rows, it may slow INSERTs down to one row per disk hit! On traditional drives, that is only 100/sec; you need more than 300/sec.
Since you say the users must include a Date clause, it may be practical to use PARTITION BY RANGE(TO_DAYS(Date)). It is unclear, but it sounds like user_id is also a requirement in the queries? At that point, I would suggest INDEX(user_id, Date) without partitioning. That composite index is much better than an index on Date alone. Adding more columns to that index will not help.
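A minimal sketch of that composite index (the index name is arbitrary):

ALTER TABLE performance ADD INDEX user_date_idx (`user_id`, `Date`);

-- The PARTITION BY RANGE(TO_DAYS(`Date`)) route would additionally require putting
-- `Date` into the PRIMARY KEY, since the partitioning column must be part of every
-- unique key on the table.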
Your example has a Date range of one month plus one day; is that reasonable, or just a goof?
One table per client does not help.
Do not partition by user: 250 partitions have their own performance problems. About 50 is the 'practical' limit for the number of partitions.
Summary tables (plural) are the only way you will make this multi-billion-row table perform adequately. The PRIMARY KEY of each summary table would include user_id and date (probably truncated to the day), plus a couple of other "dimensions". Then several aggregations (mostly COUNTs and SUMs) would be the other columns. These tables would be significantly smaller than your 'Fact' table, and they can afford to have multiple indexes. Beware of AVG, because the average of averages is not mathematically correct. Instead, store the SUM and COUNT, then calculate the average as SUM(sums)/SUM(counts).
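One such summary table might look roughly like this (names and the exact column set are illustrative; you would build one per common grouping):

CREATE TABLE performance_daily_offer (
  user_id         INT NOT NULL,
  `Date`          DATE NOT NULL,
  OfferId         VARCHAR(128) COLLATE utf8_unicode_ci NOT NULL,
  Impressions     BIGINT NOT NULL,        -- SUM(Impressions) over the raw rows
  Clicks          BIGINT NOT NULL,        -- SUM(Clicks)
  Cost            BIGINT NOT NULL,        -- SUM(Cost)
  AverageCpc_sum  BIGINT NOT NULL,        -- SUM(AverageCpc), deliberately not AVG
  row_count       INT UNSIGNED NOT NULL,  -- COUNT(*) of the raw rows rolled up
  PRIMARY KEY (user_id, `Date`, OfferId)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

-- The correct average is then SUM(sums)/SUM(counts), for example:
-- SELECT OfferId, SUM(Cost) AS Cost,
--        ROUND(SUM(AverageCpc_sum) / SUM(row_count), 2) AS AverageCpc
-- FROM performance_daily_offer
-- WHERE user_id = 13 AND `Date` BETWEEN '2016-01-08' AND '2016-02-08'
-- GROUP BY OfferId;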
More on Data Warehousing and Summary Tables.
(Yes, some of my statements do disagree with previous Comments and Answers.)
You say that "In every query the user_id clause and date clause is a must", so an index on (user_id, Date) should help everything. Since the other conditions are variable, this is probably the best you can do for indexing this query. You can add other columns to the end of the index if they are commonly used. With the current structure, it looks like the only thing the optimizer can do is partition pruning, which will help, but probably not as much as my suggested index. I'd also compare performance with and without partitioning, since with this new index, partitioning may not help much.
After some discussion, I have the following revised suggestions. If the partitioning remains intact with one user per partition, the best index should be on (Date). However, without partitioning, the best index should be on (user_id, Date), as originally suggested.
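In SQL terms (index names are arbitrary):

-- If the one-user-per-partition HASH partitioning stays:
ALTER TABLE performance ADD INDEX date_idx (`Date`);

-- If partitioning is dropped, use the (user_id, Date) index shown earlier instead.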