Question :
So it looks like for some entries to the table the data is being written multiple times from our system, so all the rows entered for a specific data item are written 3 times, with exactly the same data apart from the auto-generated ID is different (incremented by 1 each time).
How would I retrieve all the essentially unique rows from the table? A join on itself doesn’t seem to be working as the IDs are different.
Answer :
Omit the ID
column from your SELECT ...
statement and and a DISTINCT
. As in:
SELECT DISTINCT COLUMN2, COLUMN3, .... FROM TABLE1
If you want to find how many rows exist for the same values then do a COUNT()
combined with a GROUP BY
. As in:
SELECT COUNT(ID), COLUMN2, COLUMN3, COLUMN4, ....
FROM TABLE1
GROUP BY COLUMN2, COLUMN3, COLUMN4, ....
Reference Material (MySQL Dev Site)
To get an idea of your dupes, run this:
SELECT value_1, value_2, COUNT(d_id) AS cnt
FROM dupe
GROUP BY value_1, value_2
HAVING COUNT(d_id) > 1
ORDER BY value_1, value_2;
Result:
value_1 value_2 cnt
1 3 3
1 4 2
2 11 4
2 12 2
So, for example, the values (2, 11) occurs 4 times in the table.
What I would do is this:
CREATE TABLE dupe
(
d_id SERIAL,
value_1 INTEGER NOT NULL,
value_2 INTEGER NOT NULL,
CONSTRAINT dupe_pk PRIMARY KEY (d_id)
);
The value_x
columns will need to be specified by you – as many as you need.
Populate the table:
INSERT INTO dupe (value_1, value_2)
VALUES
(1, 1), (1, 2), (1, 3), (1, 3), (1, 3), (1, 4), (1, 4), (1, 5),
(2, 10), (2, 11), (2, 11), (2, 11), (2, 11), (2, 12), (2, 12), (2, 13);
Then run:
SELECT
d_id,
value_1,
value_2,
ROW_NUMBER() OVER (PARTITION BY value_1, value_2 ORDER BY d_id) AS rn
FROM dupe
Result:
d_id value_1 value_2 rn
1 1 1 1
2 1 2 1
3 1 3 1 <<== dupe in d_id order (i.e. v_1 = 1,v_2 = 3)
4 1 3 2 <<==
5 1 3 3
6 1 4 1 <<==
7 1 4 2 <<==
8 1 5 1
9 2 10 1
10 2 11 1 <<==
11 2 11 2 <<==
12 2 11 3 <<==
13 2 11 4 <<==
14 2 12 1 <<==
15 2 12 2 <<==
16 2 13 1
Now, we can take two possible views here (depends on your circumstances) – to DELETE
records with the lowest d_id or the highest.
So, we run the following query:
-- DELETE all duplicate records except those for d_id is the lowest
DELETE FROM dupe
WHERE d_id IN
(
SELECT d_id
FROM
(
SELECT
d_id,
value_1,
value_2,
ROW_NUMBER() OVER (PARTITION BY value_1, value_2 ORDER BY d_id ASC) AS rn
FROM dupe
) AS t1
WHERE rn > 1
);
Note that all fields in the inner loop with rn = 1 are the start of a sequence of dupes. Then we run:
SELECT * FROM dupe
ORDER BY d_id;
Result:
d_id value_1 value_2
1 1 1
2 1 2
3 1 3
6 1 4
8 1 5
9 2 10
10 2 11
14 2 12
16 2 13
Et voilĂ – no dupes!
The beauty of this approach is that to DELETE
the duplicates with the highest
d_id, all that is required is to reverse the sort order of the ROW_NUMBER()
function (shown in the fiddle).
ROW_NUMBER() OVER (PARTITION BY value_1, value_2 ORDER BY d_id DESC) AS rn
Note the ORDER BY d_id DESC
– the important bit is the DESC
– DESCending.