MySQL – Optimal indexing for a lookup table. HASH index, BTREE index or composite PK?

Posted on

Question :

Let’s say I have a very large lookup table that looks like this:

   CREATE TABLE `MyLookup` (
      `FKToTableA` bigint(20) NOT NULL,
      `FKToTableB` bigint(20) NOT NULL,
      `Count` bigint(20) NOT NULL,
      `Bytes` bigint(20) NOT NULL,
      `Packets` bigint (20) NOT NULL,
   ) ENGINE=InnoDB;

The first 2 attributes are FKs to Table A and B respectively. But doesn’t need to be. the FKs can be dropped if it’s more optimal not to have it. FKToTableA and FKToTableB is the candidate key for this table. Thus FKToTableA and FKToTableB could be a composite primary key if needed.

My question is what’s the most optimal way of indexing this table? Storage size and insert times are not a concern.
Most of the time my queries will be joining on to MyLookup on both FK columns and SUM the Count, Bytes and Packets columns.

select
    a.something, b.something, SUM(c.Count), SUM(c.Bytes), SUM(b.Packets)
from
    A a 
    inner join MyLookup c on a.Id = c.FKToTableA
    inner join B b on b.Id = c.FKToTableB
where
    a.something = 1 and
    a.Time >= 'blah' and
    a.Time <= 'blah'
group by 
    a.something, b.something

I see three options.

1) Put a HASH index on FKToTableA and FKToTableB.

2) Put a composite BTREE index on all columns.

3) Put a Primary Key on FKToTableA and FKToTableB.

I’m kind of leaning towards 1. From what I gather HASH indexes are great at equality comparisons and that’s all joins are is a big equality comparison is it not? Still has to go hit the table again to sum the other columns though which may be a bottle neck. Not sure.

Or maybe it’s possible to have a hash index with Count, Bytes, Packets? Not sure how composite hash indexes works though.

Has anyone dealt with this type of thing before and can shed some knowledge and suggestions?

EDIT:
Here’s my schema for Table A and B to make it a bit more clear

CREATE TABLE `B` (
  `Id` bigint(20) NOT NULL AUTO_INCREMENT,
  `CustomerIdent` int(11) NOT NULL,
  `MetaData1` varchar(256) NULL,
  `MetaData2` varchar(256) NULL,
  `MetaData3` varchar(256) NULL,
  `MetaData4` varchar(256) NULL,
  `MetaData5` varchar(256) NULL,
  `Time` datetime NOT NULL
  CONSTRAINT `PK_A` PRIMARY KEY (`Id` ASC)
) ENGINE=InnoDB;

CREATE TABLE `B` (
  `Id` bigint(20) NOT NULL AUTO_INCREMENT,
  `SourceIp` bigint(20) NULL,
  `DestinationIp` bigint(20) NULL,
  `SourcePort` int(11) NULL,
  `DestinationPort` int(11) NULL,
  CONSTRAINT `PK_B` PRIMARY KEY (`Id` ASC)
) ENGINE=InnoDB;

Basically the lookup table is in charge of telling me how many times an row in B happened for one row of A.

Answer :

  • a: Satisfy the WHERE with INDEX(something, time)
  • MyLookup — If the pair (FKToTableA, FKToTableB) is unique, then make that the PRIMARY KEY, and put the columns in that order so that your SELECT can quickly get into MyLookup.
  • Don’t use BIGINT (8 bytes) unless you expect to exceed 4 billion, the limit for INT UNSIGNED, which takes only 4 bytes.
  • IP addresses — for the old IPv4, there are convenient routines for converting to INT UNSIGNED. For the new IPv6, it won’t fit into BIGINT. See 5.6.3.
  • Do those Metadata columns need utf8? Can they be combined into a TEXT column? And other questions.
  • Build and maintain a “Summary table” rather than scanning large chunks of the “Fact” table to get this “report”.
  • Do the SUMs before JOINing to b.
  • Do you really need a many-to-many mapping table? Seems like this is 1:many.
  • More tips on many-to-many.

This design looks like that you stored log data into MySQL, and you want to do something analytic on it.

I would suggest you to look at Cassandra (or ScyllaDB, a C++ re-written compatible solution to Cassandra) + Presto, they are all open source softwares, and they can process your SQL query parallelly and effectively.

Especially because Presto’s SQL language is very similiar to MySQL (because it’s developed by Facebook in the beginning), it should be very easy to learn.

Anyway, if you must use InnoDB, there are serveral optimizations you should do for this SQL query:

  • Remove a.something in GROUP BY part, and change a.something in SELECT part to 1, because it will always be 1 due to your WHERE codition.
  • table A should create index on (something, Time) or (something, Time, ...) because your WHERE condition. If using MyISAM, choose BTREE due to its range search condition.
  • However, I would suggest you to change from MyISAM to InnoDB, becase InnoDB is a crash-safe engine and Oracle is still putting a lot of effort to improve it.

Also, If you can modify your application, then you can do more optimizations:

  • Select all possible b.something in the given time, then run all SQL queries in the same time. It will use your CPU resource much efficiently.

Leave a Reply

Your email address will not be published. Required fields are marked *