I need to create a log table to stores the connections to out network (last 3 years, after that the log will go into backup). The hardware/software used is proprietary, and for accounting it just call our custom script with some arguments like this:
- when a user connects (
our_script START user mac ip);
- when a user disconnects (
our_script STOP user mac ip in_bytes out_bytes more)
Sometimes we do not receive the disconnect message. So we need to adapt to this.
So far I came up with this structure for the accounting table:
CREATE TABLE `accounting` ( `user` varchar(50) NOT NULL DEFAULT '', `mac` varchar(20) NOT NULL DEFAULT '', `ip` varchar(15) NOT NULL DEFAULT '', `ipv6` varchar(39) DEFAULT NULL, `start_datetime` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP, `stop_datetime` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00', `in_bytes` bigint(32) unsigned DEFAULT '0', `out_bytes` bigint(32) unsigned DEFAULT '0', `more_columns` varchar(255) default NULL, PRIMARY KEY (`user`,`mac`,`ip`,`start_datetime`,`stop_datetime`), KEY `prim_ipv6` (`user`,`mac`,`ipv6`,`start_datetime`,`stop_datetime`), KEY `user` (`user`) USING HASH, KEY `mac` (`mac`) USING HASH, KEY `ip` (`ip`) USING HASH, KEY `ipv6` (`ipv6`) USING HASH, KEY `start_datetime` (`start_datetime`), KEY `stop_datetime` (`stop_datetime`) ) ENGINE=MyISAM DEFAULT CHARSET=latin1
The query used for writing are (these needs to be real fast):
- when a user connects a simple insert
- when a user disconnects
update accounting set stop_datetime=now(), in_bytes=$in_bytes, out_bytes=$out_bytes, more_columns=$more where user="$user" and ip="$ip" and stop_datetime="0000-00-00 00:00:00" order by start_datetime DESC limit 1;, this is a workaround in the case we have more then one row for start, we just update the last start.
A better workaround, is to update the stop_datetime for rows without it, at every start, and use another column for storing that this is not a normal stop, I think I will go with the second workaround.
We have 2 ways to select from this table (this could be slower max 1 – 2 sec):
- select the last 50 connection for a user:
select ... where user="$user" order by start_datetime DESC limit 50;, for this the user HASH key is handy, beacause AFAIK HASH is better than BTREE for equality.
- select witch user was connected with ip=$ip on date=$date
select ... where ip="$ip" and $date between start_datetime and stop_datetime;
This table will hold between 0.5 to 1 bilion rows, that’s why I thought to partition it.
The best option is to partition it by month, but I have 2 relevant dates, and at the begining/end of the month start will be in one partition, stop will be in another.
Witch is the best way to do partitions, by start_datetime, by stop_datetime, by start_datetime with subpartitions for stop_datetime ?
Do I realy need to partitionate ?
Do you have any other suggestion on how to improve this ?
Since stop_datetime will not be known at insert time the partition must be done by start_datetime.
I think partitioning by day (
date_format(start_datetime, '%Y-%m-%d')) would be adequate since 1 billion rows divided by (3 years * 365 days) ≃ 900,000 rows per partition
Updating indexes is expensive and it happens at inserts, updates and deletes. Keep the indexes to a minimum. If the main queries will be those selects then I would keep the primary key like the following and trash all the other keys:
PRIMARY KEY (`user`,`start_datetime`,`stop_datetime`,`mac`,`ip`)
PRIMARY KEY (
The second one is totally redundant, DROP it. (A PRIMARY KEY is a KEY!)
MyISAM does not have HASH indexing; you were probably given BTree, which is nearly as good.
where user=”$user” and ip=”$ip” and stop_datetime=”0000-00-00 00:00:00″ order by start_datetime DESC limit 1 —
That begs for this index:
INDEX(user, ip, stop_datetime, start_datetime)
where user=”$user” order by start_datetime DESC limit 50 —
begs for INDEX(user, start_datetime)
This is very hard to optimize:
where ip=”$ip” and $date between start_datetime and stop_datetime
INDEX(ip, start_datetime), INDEX(stop_datetime)
PARTITIONing will not help unless you can take advantage of “partition pruning”.
Since most of these queries include ip=”$ip”, and then get into trouble with timestamps, it might be best to partition on ip. I would not do more than 64 partitions; there are inefficiencies otherwise.
Do not (usually) put the “partition key” first in the PRIMARY KEY.
Because of your “inside-out” use of start/end times, I don’t think the timestamps would be useful for partitioning.
ipvarchar(15) NOT NULL DEFAULT ”,
ipv6varchar(39) DEFAULT NULL,
How are you dealing with that? I think the standard says that this is how to encode ipv4 in IPv6 format:
Perhaps you should switch to that, and abandon the