Question :
Let’s say I have a very large lookup table that looks like this:
CREATE TABLE `MyLookup` (
`FKToTableA` bigint(20) NOT NULL,
`FKToTableB` bigint(20) NOT NULL,
`Count` bigint(20) NOT NULL,
`Bytes` bigint(20) NOT NULL,
`Packets` bigint (20) NOT NULL,
) ENGINE=InnoDB;
The first 2 attributes are FKs to Table A and B respectively. But doesn’t need to be. the FKs can be dropped if it’s more optimal not to have it. FKToTableA and FKToTableB is the candidate key for this table. Thus FKToTableA and FKToTableB could be a composite primary key if needed.
My question is what’s the most optimal way of indexing this table? Storage size and insert times are not a concern.
Most of the time my queries will be joining on to MyLookup on both FK columns and SUM the Count, Bytes and Packets columns.
select
a.something, b.something, SUM(c.Count), SUM(c.Bytes), SUM(b.Packets)
from
A a
inner join MyLookup c on a.Id = c.FKToTableA
inner join B b on b.Id = c.FKToTableB
where
a.something = 1 and
a.Time >= 'blah' and
a.Time <= 'blah'
group by
a.something, b.something
I see three options.
1) Put a HASH index on FKToTableA and FKToTableB.
2) Put a composite BTREE index on all columns.
3) Put a Primary Key on FKToTableA and FKToTableB.
I’m kind of leaning towards 1. From what I gather HASH indexes are great at equality comparisons and that’s all joins are is a big equality comparison is it not? Still has to go hit the table again to sum the other columns though which may be a bottle neck. Not sure.
Or maybe it’s possible to have a hash index with Count, Bytes, Packets? Not sure how composite hash indexes works though.
Has anyone dealt with this type of thing before and can shed some knowledge and suggestions?
EDIT:
Here’s my schema for Table A and B to make it a bit more clear
CREATE TABLE `B` (
`Id` bigint(20) NOT NULL AUTO_INCREMENT,
`CustomerIdent` int(11) NOT NULL,
`MetaData1` varchar(256) NULL,
`MetaData2` varchar(256) NULL,
`MetaData3` varchar(256) NULL,
`MetaData4` varchar(256) NULL,
`MetaData5` varchar(256) NULL,
`Time` datetime NOT NULL
CONSTRAINT `PK_A` PRIMARY KEY (`Id` ASC)
) ENGINE=InnoDB;
CREATE TABLE `B` (
`Id` bigint(20) NOT NULL AUTO_INCREMENT,
`SourceIp` bigint(20) NULL,
`DestinationIp` bigint(20) NULL,
`SourcePort` int(11) NULL,
`DestinationPort` int(11) NULL,
CONSTRAINT `PK_B` PRIMARY KEY (`Id` ASC)
) ENGINE=InnoDB;
Basically the lookup table is in charge of telling me how many times an row in B happened for one row of A.
Answer :
a
: Satisfy theWHERE
withINDEX(something, time)
MyLookup
— If the pair(FKToTableA, FKToTableB)
is unique, then make that thePRIMARY KEY
, and put the columns in that order so that yourSELECT
can quickly get intoMyLookup
.- Don’t use
BIGINT
(8 bytes) unless you expect to exceed 4 billion, the limit forINT UNSIGNED
, which takes only 4 bytes. - IP addresses — for the old IPv4, there are convenient routines for converting to
INT UNSIGNED
. For the new IPv6, it won’t fit intoBIGINT
. See 5.6.3. - Do those Metadata columns need utf8? Can they be combined into a TEXT column? And other questions.
- Build and maintain a “Summary table” rather than scanning large chunks of the “Fact” table to get this “report”.
- Do the
SUMs
beforeJOINing
tob
. - Do you really need a many-to-many mapping table? Seems like this is 1:many.
- More tips on many-to-many.
This design looks like that you stored log data into MySQL, and you want to do something analytic on it.
I would suggest you to look at Cassandra (or ScyllaDB, a C++ re-written compatible solution to Cassandra) + Presto, they are all open source softwares, and they can process your SQL query parallelly and effectively.
Especially because Presto’s SQL language is very similiar to MySQL (because it’s developed by Facebook in the beginning), it should be very easy to learn.
Anyway, if you must use InnoDB, there are serveral optimizations you should do for this SQL query:
- Remove
a.something
inGROUP BY
part, and changea.something
inSELECT
part to1
, because it will always be1
due to yourWHERE
codition. - table
A
should create index on(something, Time)
or(something, Time, ...)
because yourWHERE
condition. If using MyISAM, choose BTREE due to its range search condition. - However, I would suggest you to change from MyISAM to InnoDB, becase InnoDB is a crash-safe engine and Oracle is still putting a lot of effort to improve it.
Also, If you can modify your application, then you can do more optimizations:
- Select all possible
b.something
in the given time, then run all SQL queries in the same time. It will use your CPU resource much efficiently.