Delete from GROUP BY with multiple GROUP columns

Question :

I have a simple table consisting of id, channel_id, value and timestamp:

`channel_id` int(11) DEFAULT NULL,
`timestamp` bigint(20) NOT NULL,
`value` float NOT NULL,
KEY `IDX_ADF3F36372F5A1AA` (`channel_id`),

The combination of channel_id, timestamp MUST be unique, but the table has gotten corrupted. My ultimate goal is to add a unique index, which currently fails because of the duplicates.

The duplicates are:

select channel_id,timestamp,count(*) cnt 
from data 
group by channel_id,timestamp having cnt>1

How can I delete the duplicates? I’m stuck, since obviously I can’t get the row id from inside the duplicates subselect.

UPDATE I could probably do

delete d1.* from data d1 join data d2 on d1.channel_id=d2.channel_id and d1.timestamp=d2.timestamp where d1.id > d2.id;

but since the table has almost 1 billion records, a self-join of that size does not seem like a good idea?

Answer :

To delete the duplicates while keeping one row per (channel_id, timestamp):

DELETE t1
FROM data t1
JOIN data t2 USING (channel_id, timestamp)
WHERE t1.id > t2.id;
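With nearly a billion rows, a single DELETE over the whole table can run for a very long time and hold large locks. One option (my suggestion, not part of the original answer) is to chunk the cleanup by channel_id range and repeat until all ranges are covered:

```sql
-- Hypothetical chunked cleanup: run once per channel_id range,
-- advancing the BETWEEN bounds each pass.
DELETE t1
FROM data t1
JOIN data t2 USING (channel_id, timestamp)
WHERE t1.id > t2.id
  AND t1.channel_id BETWEEN 1 AND 100;
```

Each chunk is a smaller transaction, so the undo log stays bounded and other sessions are blocked for less time.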

A different kind of corruption is looming: AUTO_INCREMENT=2126679551, and a signed INT tops out at 2,147,483,647 (about 2.1 billion). You could change it to BIGINT. But read on; I have an even better solution (namely, to get rid of id):
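If you do decide to keep id and simply widen it, the change could look like this (a sketch only; an ALTER like this rebuilds the entire table, which will take a long time at this size):

```sql
ALTER TABLE data
  MODIFY id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT;
```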

I recommend replacing your two indexes with these two:

PRIMARY KEY(channel_id, timestamp),
INDEX(id)
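Once the duplicates are gone, the index swap might be done in a single ALTER. This is a sketch under two assumptions: that id is currently the table’s primary key, and that IDX_ADF3F36372F5A1AA (the key name from the question) is the index being replaced. INDEX(id) is included because MySQL requires an AUTO_INCREMENT column to be part of some key:

```sql
ALTER TABLE data
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (channel_id, timestamp),
  ADD INDEX (id),
  DROP INDEX IDX_ADF3F36372F5A1AA;
```

Doing all the changes in one ALTER means the billion-row table is rebuilt once rather than once per change.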

That will help in many of your activities, plus Akina’s suggestion, and it will replace Vérace’s suggestion.

This is based on id being there only to provide a unique PK, and on the assumption that most queries (read or write) filter on channel_id, often within a time range.

If this is sensor data, and you can be sure that there are not two readings in the same ‘second’, then get rid of id completely. This will shrink the table and speed up all queries. (Akina’s > would become t1.timestamp > t2.timestamp, etc.)
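Dropping id once the composite primary key is in place is straightforward (sketch; this only works after PRIMARY KEY(channel_id, timestamp) exists, since an AUTO_INCREMENT column cannot simply be removed while it is the table’s only key):

```sql
-- Only after PRIMARY KEY(channel_id, timestamp) is in place:
ALTER TABLE data DROP COLUMN id;
```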

Another tip: I assume you have only a small number of “channels”? INT takes 4 bytes; TINYINT UNSIGNED takes 1 byte and allows values up to 255; SMALLINT UNSIGNED takes 2 bytes, max 65,535. Etc.
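If channel ids fit in two bytes, the shrink could look like this (a sketch; NOT NULL is my assumption here, and it is required anyway if channel_id becomes part of the primary key, even though the original column allowed NULL):

```sql
ALTER TABLE data
  MODIFY channel_id SMALLINT UNSIGNED NOT NULL;
```

Over a billion rows, saving 2 bytes per row trims roughly 2 GB from the data alone, plus the same again in every index that contains channel_id.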
