Best way to delete large set of rows knowing field to NOT delete

Posted on

Question :

Coming from a Rails background, I have a large production database that I have a copy of where I only need records from 4 of the 2000+ companies. I’m trying to delete all the rows except the ones belonging to the 4, and I know the way I have it is not optimal.

DELETE FROM appointments 
WHERE (appointments.company_id NOT IN (6, 753, 785, 1611))

another example is when I have to delete records on a table where the company_id is on an associated table:

DELETE FROM mappings 
WHERE mappings.id IN (SELECT mappings.id 
                      FROM code_mappings 
                      INNER JOIN codes ON codes.remote_id = mappings.code_remote_id 
                      WHERE (codes.company_id NOT IN (6, 753, 785, 1611)))

Answer :

In relation to the first table, appointments, make sure that you have an index on company_id column.

In relation to the mappings table, using EXISTS rather than IN may yield better performance. You can re-write your query as following:

DELETE FROM mappings AS m
WHERE EXISTS (  SELECT 1
                FROM code_mappings AS cm
                  INNER JOIN codes AS c
                    ON c.remote_id = cm.code_remote_id
                WHERE 
                (
                c.company_id NOT IN (6, 753, 785, 1611)
                AND cm.id = m.id
                )
)

In the above query, you will also benefit from indexes on the mappings and code tables.

Documentation for creating indexes is @ https://www.postgresql.org/docs/current/static/sql-createindex.html. In your case, you can create indexes on the relevant tables as following:

CREATE INDEX company_id_idx ON appointments (company_id);

CREATE INDEX remote_id_company_id_idx ON codes (remote_id, company_id);

CREATE INDEX code_remote_id_id_idx ON code_mappings (code_remote_id, id);

-- If you don't already have a primary key OR index on `id` column in the `mappings` table, then create one:

ALTER TABLE mappings ADD PRIMARY KEY (id);
-- Choose primary key, or index: CREATE INDEX id_idx ON mappings (id);

I think you’re fine, honestly. You could always dig in and analyze the query plans for various options suggested here.

The performance benefit of the EXISTS predicate (or NOT EXISTS), for instance, comes from letting the query optimizer choose how to use indexes and it works best when the subquery is simplest:

EXISTS (select * from blah where outertable.key = key)

In this instance, you are dealing with very low selectivity. You’re essentially saying delete everything! Well, except these four, leave them alone.

The query optimizer isn’t going to choose to utilize your indexes, because there isn’t really any point. It’s going to do a table scan. If it did choose an index (it won’t), it would actually perform worse (probably).

Leave a Reply

Your email address will not be published. Required fields are marked *