Question :
I currently have a pretty hefty database (for me, anyway) with multiple tables exceeding 15 million rows. I'm trying to improve performance of the whole-table lookups. I implemented memcached to cache the already-parsed results, but the initial lookups are very slow: 20+ seconds on a big query.
SELECT `something`, ROUND(SUM((`amount` / `something_of_value`) * `column`)) FROM ... WHERE ... (the WHERE clause is huge, based on many conditionals)
Current setup:
A managed VPS server
InnoDB storage for the big tables
MariaDB 10.0.36
18 GB RAM
8 x 2.40 GHz CPU cores
We don't use the query cache because 25% of our queries are writes, and from what I've researched, a write/update to a table invalidates its cached results. Is this the correct approach?
What's the best way to improve the initial lookup without upgrading the hardware?
I probably left out a lot of important details, so if you need more info just ask.
If you can steer me in a good general direction, that would be appreciated too. I know about query optimization, but we are trying to improve lookup time and MySQL performance in general.
EDIT:
Full query:
SELECT `metric`, SUM( `amount` )
FROM big_table
WHERE 1
AND (`metric` = '1' )
AND (`source` = 'some_unique_source'
OR `source` = 'some_other_unique_source'
OR `source` = 'yet_another_unique_source'
OR `source` = 'some_most_unique_source'
OR `source` = 'almost_last_unique_source'
OR `source` = 'the_last_unique_source'
)
AND (`platform` = '2'
OR `platform` = '1'
OR `platform` = '3'
)
AND (`account` = '1'
OR `account` = '2'
OR `account` = '3'
OR `account` = '4'
OR `account` = '5'
OR `account` = '6'
OR `account` = '7'
OR `account` = '8'
OR `account` = '9'
OR `account` = '10'
OR `account` = '11'
OR `account` = '12'
)
AND (`something_id` = 'some.unique.id' )
AND `target_date` >= '2018-08-27'
AND `target_date` <= '2018-08-27'
GROUP BY `metric`;
The CREATE TABLE statement:
CREATE TABLE `big_table` (
`stat_id` int(8) NOT NULL AUTO_INCREMENT,
`metric` tinyint(1) NOT NULL DEFAULT '0',
`source` varchar(16) DEFAULT NULL ,
`platform` tinyint(1) NOT NULL DEFAULT '0' ,
`account` int(2) DEFAULT NULL ,
`something_id` varchar(128) DEFAULT 'EMPTY SOMETHING_ID',
`target_date` date DEFAULT NULL,
`country` varchar(2) DEFAULT NULL ,
`version` varchar(16) DEFAULT NULL ,
`amount` decimal(16,6) NOT NULL DEFAULT '0.000000' ,
`tax` decimal(12,6) NOT NULL DEFAULT '0.000000' ,
`currency` varchar(3) DEFAULT NULL,
`currency_rate` decimal(12,6) DEFAULT '500.000000',
`rate_updated` int(11) NOT NULL DEFAULT '0',
`multiplier` decimal(16,6) NOT NULL DEFAULT '1.000000',
`unique_key` varchar(180) DEFAULT NULL ,
`caption` varchar(128) DEFAULT NULL,
`transaction_timestamp` timestamp NULL DEFAULT NULL ,
`finalised` tinyint(1) NOT NULL DEFAULT '0',
`created` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`modified` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`stat_id`),
UNIQUE KEY `unique_key` (`unique_key`) USING BTREE,
KEY `target_date` (`target_date`,`currency`),
KEY `index_country` (`country`),
KEY `currency_rate` (`currency_rate`,`multiplier`)
) ENGINE=InnoDB AUTO_INCREMENT=43453089 DEFAULT CHARSET=utf8
Edit:
The date range is always a single day: the script runs a foreach over each date in the user-supplied range. Each MySQL result is parsed into a multidimensional array and then written to a JSON file once the date range finishes.
Now that I think about it, a better approach may be a more intelligent query whose result is grouped by date, though I don't know how much of a speed improvement that would give. There are 5 main tabs, each selecting a different [main] thing: grouping and selecting accounts, metrics, sources, platforms, countries or something_id. The WHERE clauses are also constructed from user input and may differ. This is a custom analytics dashboard, if that helps explain what we are using it for.
A lot of different selections can be made by the user, and a custom query is constructed from that input. I've reduced the SELECT size by excluding the countries, because by default it loaded all 250+ countries as WHERE clauses, making the query length pretty ridiculous and embarrassing. For clarification: all countries are marked as selected by default. Pasting it into the question made me realise it could be vastly improved. Removing the countries when all are selected reduced the load time, including parsing, from roughly 21 seconds to 8-10 seconds (a 30-day foreach loop with basically the SELECT you see, on 14 million rows). However, if the user excludes even one country, the SQL is still constructed with ~250 country conditions. I'm thinking about adding logic so that, when only a few countries are unselected, the query uses country <> 'unselected' instead of listing all the selected ones, ultimately making the query smaller.
Answer :
Your date filter is written as a range even though it covers only a single day. If you really want just one day and the column is a DATE, then simply say target_date = '2018-08-27'. This will help in designing the optimal index.
Start with the = columns (in any order):
INDEX(something_id, metric, target_date,
      xxxx) -- then add one other column in the WHERE
If there are only 3 "platforms" and all of them are selected, then leave out the test for it entirely. I assume you are dynamically building the query, so put a little more effort into the UI code to optimize things as you go.
Unfortunately, you may need multiple indexes to handle multiple use cases. It is not reasonable to have more than, say, 10 indexes, so design the indexes that are most likely to be used, and start them with the = columns. Don't bother with indexes longer than, say, 4 columns.
Other issues:
- int(2): INT always takes 4 bytes; see TINYINT (etc.) for space savings.
- DEFAULT NULL: use NOT NULL when appropriate.
- Change the ORs to INs (see the sketch below). This won't improve performance, but it will be cleaner.
- After moving to IN, you can say country NOT IN (short-list).
- A Comment mentions JOIN vs LEFT JOIN, but I don't see either in the query.
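For example, a sketch of the posted query rewritten with IN / NOT IN (the source list is shortened and the excluded country codes are placeholders):

SELECT `metric`, SUM(`amount`)
FROM big_table
WHERE `metric` = '1'
AND `source` IN ('some_unique_source', 'some_other_unique_source', 'the_last_unique_source')
AND `platform` IN ('1', '2', '3')
AND `account` IN ('1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12')
-- when only a few countries are unticked, a short exclusion list beats listing ~250 selected ones
AND `country` NOT IN ('AA', 'BB')
AND `something_id` = 'some.unique.id'
AND `target_date` = '2018-08-27'
GROUP BY `metric`;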
If the data is never modified after it is inserted, then building and maintaining Summary Tables is the way to go. Carefully designed, about 5 such tables could handle maybe 50 different queries. And the queries against these would run perhaps 10 times as fast. For more discussion, provide a list of the likely queries.
(I added some Comments about summary tables onto danblack’s Answers.)
Variants:
- foo LIKE '%something': the foo part of an index can't be used because of the leading wildcard.
- foo <> 'blah' or foo NOT IN ('blah'): probably treated like a "range"; therefore foo may be harmful (to performance) if too early in the index, but may be helpful if last.
- country NOT IN ('Foostan') versus country IN (very-long-list): the very long list takes some time to parse, etc., so NOT IN will be slightly better.
- Elimination of "all options checked": do it. The Optimizer can't tell the difference between that and "some checked", so yes, it is a potential optimization.
Once you have done things like those, the next challenge will be to pick a limited set of 'composite' indexes, perhaps 5 indexes with 3 columns each. Keep track of the queries users actually hit you with; the slow query log is one way to do this.
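As a sketch only (which column combinations deserve an index depends on the filters your users actually combine, so treat these as placeholders):

-- a few composite indexes, each starting with columns tested with '='
ALTER TABLE big_table
ADD INDEX idx_sid_metric_date (something_id, metric, target_date),
ADD INDEX idx_account_metric_date (account, metric, target_date),
ADD INDEX idx_source_metric_date (source, metric, target_date);

-- the slow query log shows which queries (and hence which indexes) matter most
SET GLOBAL slow_query_log = ON;
SET GLOBAL long_query_time = 2; -- seconds; lower it to capture more queries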
As you are going to be grabbing 30 days anyway, you may as well use a GROUP BY target_date and put target_date in the result fields; this will save some query-parsing overhead.
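For instance, a sketch of a single 30-day query (the date range and filter values are copied or assumed from the posted example):

SELECT `target_date`, `metric`, SUM(`amount`)
FROM big_table
WHERE `metric` = '1'
AND `something_id` = 'some.unique.id'
AND `source` IN ('some_unique_source', 'some_other_unique_source')
AND `platform` IN ('1', '2', '3')
AND `target_date` >= '2018-08-01'
AND `target_date` <= '2018-08-30'
GROUP BY `target_date`, `metric`;
-- one query and one result set instead of 30 round trips; the application splits rows by target_date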
As the query always iterates over a range of target_dates, that is likely the best column to start the index with. To save the query from going back to the table space to fetch each row, either to filter it further or to read result columns such as amount, the index should start with target_date and include all fields of the WHERE clause plus amount. As this ends up being a largish index, I recommend creating it separately from the existing target_date index.
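A sketch of such a covering index; the exact column list is an assumption based on the WHERE clause and SELECT list shown above:

-- range column first, then the other filtered columns, then amount,
-- so the query can be answered from the index alone
ALTER TABLE big_table
ADD INDEX idx_date_covering (target_date, metric, something_id, source, platform, account, amount);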
To make this query faster when the user, say, selects one country (assuming this is a common case), an index on country, target_date, {other fields} will aid this form of query. Likewise if a single metric, something_id or other field is commonly selected as a single value (i.e. x = 5, but not x = 5 OR x = 9).
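For example (a sketch; the trailing columns are assumptions about what else is commonly filtered or summed):

ALTER TABLE big_table
ADD INDEX idx_country_date (country, target_date, metric, amount);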
Option 2: since the data for a given target_date is generally static after creation, you could create a summary table based on date.
CREATE TABLE `summary_of_big_table` (
`target_date` date NOT NULL,
`metric` tinyint(1) NOT NULL DEFAULT '0',
`source` varchar(16) DEFAULT NULL ,
`platform` tinyint(1) NOT NULL DEFAULT '0' ,
`account` int(2) DEFAULT NULL ,
`something_id` varchar(128) DEFAULT 'EMPTY SOMETHING_ID',
`country` varchar(2) DEFAULT NULL ,
`amount` decimal(16,6) NOT NULL DEFAULT '0.000000' ,
KEY `index_target_date` (`target_date`),
KEY `index_country` (`country`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
Add indexes on the fields that are likely to be queried with only a few single values.
Populate it with the data grouped by all the parameters and the amount totalled:
INSERT INTO summary_of_big_table
SELECT target_date, metric, source,
platform, account, something_id, country,
SUM(amount) AS amount
FROM big_table
GROUP BY target_date, metric, source,
platform, account, something_id, country;
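The dashboard queries then run against the much smaller summary table, e.g. (a sketch mirroring the original query):

SELECT `metric`, SUM(`amount`)
FROM summary_of_big_table
WHERE `metric` = '1'
AND `something_id` = 'some.unique.id'
AND `source` IN ('some_unique_source', 'some_other_unique_source')
AND `platform` IN ('1', '2', '3')
AND `account` IN ('1', '2', '3')
AND `target_date` = '2018-08-27'
GROUP BY `metric`;

Since each day's data is static after creation, new days can be appended by re-running the INSERT ... SELECT with a WHERE restricted to the new target_date.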