How do retrieve unique rows where all all columns are the same except id

Posted on

Question :

So it looks like for some entries to the table the data is being written multiple times from our system, so all the rows entered for a specific data item are written 3 times, with exactly the same data apart from the auto-generated ID is different (incremented by 1 each time).

How would I retrieve all the essentially unique rows from the table? A join on itself doesn’t seem to be working as the IDs are different.

Answer :

Omit the ID column from your SELECT ... statement and and a DISTINCT. As in:

SELECT DISTINCT COLUMN2, COLUMN3, .... FROM TABLE1

If you want to find how many rows exist for the same values then do a COUNT() combined with a GROUP BY. As in:

SELECT COUNT(ID), COLUMN2, COLUMN3, COLUMN4, .... 
    FROM TABLE1 
    GROUP BY COLUMN2, COLUMN3, COLUMN4, ....

Reference Material (MySQL Dev Site)

To get an idea of your dupes, run this:

SELECT value_1, value_2, COUNT(d_id) AS cnt
FROM dupe
GROUP BY value_1, value_2
HAVING COUNT(d_id) > 1
ORDER BY value_1, value_2;

Result:

value_1 value_2 cnt
      1       3   3
      1       4   2
      2      11   4
      2      12   2

So, for example, the values (2, 11) occurs 4 times in the table.

What I would do is this:

CREATE TABLE dupe
(
  d_id SERIAL,
  value_1 INTEGER NOT NULL,
  value_2 INTEGER NOT NULL,
  CONSTRAINT dupe_pk PRIMARY KEY (d_id)
);

The value_x columns will need to be specified by you – as many as you need.

Populate the table:

INSERT INTO dupe (value_1, value_2)
VALUES
  (1, 1), (1, 2), (1, 3), (1, 3), (1, 3), (1, 4), (1, 4), (1, 5), 
  (2, 10), (2, 11), (2, 11), (2, 11), (2, 11), (2, 12), (2, 12), (2, 13);

Then run:

SELECT 
  d_id,
  value_1,
  value_2,
  ROW_NUMBER() OVER (PARTITION BY value_1, value_2 ORDER BY d_id) AS rn
FROM dupe

Result:

d_id    value_1 value_2 rn
1       1   1   1
2       1   2   1
3       1   3   1  <<== dupe in d_id order (i.e. v_1 = 1,v_2 = 3)
4       1   3   2  <<==
5       1   3   3
6       1   4   1  <<==
7       1   4   2  <<==
8       1   5   1
9       2   10  1
10      2   11  1  <<==
11      2   11  2  <<==
12      2   11  3  <<==
13      2   11  4  <<==
14      2   12  1  <<==
15      2   12  2  <<==
16      2   13  1

Now, we can take two possible views here (depends on your circumstances) – to DELETE records with the lowest d_id or the highest.

So, we run the following query:

-- DELETE all duplicate records except those for d_id is the lowest
DELETE FROM dupe
WHERE d_id IN 
(
  SELECT d_id
  FROM
  (
    SELECT 
      d_id,
      value_1,
      value_2,
      ROW_NUMBER() OVER (PARTITION BY value_1, value_2 ORDER BY d_id ASC) AS rn
    FROM dupe
  ) AS t1
  WHERE rn > 1
);

Note that all fields in the inner loop with rn = 1 are the start of a sequence of dupes. Then we run:

SELECT * FROM dupe
ORDER BY d_id;

Result:

d_id    value_1 value_2
1       1       1
2       1       2
3       1       3
6       1       4
8       1       5
9       2      10
10      2      11
14      2      12
16      2      13

Et voilĂ  – no dupes!

The beauty of this approach is that to DELETE the duplicates with the highest d_id, all that is required is to reverse the sort order of the ROW_NUMBER() function (shown in the fiddle).

  ROW_NUMBER() OVER (PARTITION BY value_1, value_2 ORDER BY d_id DESC) AS rn

Note the ORDER BY d_id DESC – the important bit is the DESC – DESCending.

Leave a Reply

Your email address will not be published. Required fields are marked *