PostgreSQL: Non-continuous replication of new data with COPY

Question :

I have a database table with millions of measurements. New data is coming in every day. For analysis, I want to replicate the data to my laptop (a local Postgres DB). There is no need for automatic replication; it's OK to start a script for this.

I feel like the standard replication solutions are inadequate and
overengineered for my use case because

  • I need asynchronous replication,
  • some solutions are inefficient because they replicate row-wise
    (yet data on the server is bulk-inserted),
  • and they need too much configuration.

I would like to use COPY TO STDOUT | COPY FROM STDIN, but then I need to
select only the new data. How can I do that?
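For a full copy, this kind of pipeline works (connection details are just placeholders), but it always transfers the whole table:

psql "host=source.example.com dbname=measurements" -c "copy data to stdout" \
  | psql "dbname=measurements_local" -c "copy data from stdin"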

The table has this form:

   Column   |           Type
------------+--------------------------
 devicename | character varying(30)
 id         | integer
 timestamp  | timestamp with time zone
 value      | numeric
 variable   | text

PK would be (devicename, id). Note that id alone is not unique because
the data is coming from multiple devices.

How can I select only new data for COPY?
Are there any other approaches for these replication requirements?

Answer :

I see the following possible solutions:

Change the source table

Change the source table to include a column that you can use to (uniquely) identify the data you have already synchronized. This could be a column based on a sequence (serial, or identity in Postgres 10), or a timestamp column that is populated through a trigger.

If that column is indexed, this is probably the most efficient solution.
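A minimal sketch, assuming Postgres 10 or later and a hypothetical column name sync_id: add an identity column on the source, index it, and export only rows whose sync_id is higher than the largest value already present on the laptop.

-- on the source server; sync_id is a made-up name
alter table data add column sync_id bigint generated always as identity;
create index on data (sync_id);

-- export only rows newer than the last value already on the laptop
-- (determine that value locally with: select max(sync_id) from data)
copy (select * from data where sync_id > 123456) to stdout;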

Keep track of copied data

Create a table in the source database that contains the already synchronized rows.

create table synchronized 
(
  devicename  character varying(30),
  id          integer,
  primary key (id, devicename)
);

The following statement would return all rows that are not yet synchronized and “mark” them as transferred in a single query. It should be possible to use that as the source for the COPY statement (but I haven’t tested it).

with todo as (
  select *
  from data 
  where (id,devicename) not in (select id,devicename from synchronized)
), done as (
  insert into synchronized (id, devicename)
  select id, devicename 
  from todo
)
select *
from todo;
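Untested, as noted above, but the whole transfer could then be a single pipeline, wrapping that query in COPY (host and database names are placeholders):

psql "host=source.example.com dbname=measurements" -c "
copy (
  with todo as (
    select *
    from data
    where (id, devicename) not in (select id, devicename from synchronized)
  ), done as (
    insert into synchronized (id, devicename)
    select id, devicename
    from todo
  )
  select * from todo
) to stdout" \
  | psql "dbname=measurements_local" -c "copy data from stdin"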

Create a foreign table and use that for synchronization

Using a foreign data wrapper, you create a foreign table on your local Postgres server that connects to the remote server. Then whenever you want to synchronize the data, you can do that with a single statement:

insert into local_table
select *
from remote_fdw_table
on conflict (devicename, id) do update
    set "timestamp" = excluded."timestamp",
        value = excluded.value,
        variable = excluded.variable;

That is probably the least efficient solution as it will always transfer the complete table to your local computer and overwrite/insert as needed. But it requires no changes on the source database and, other than the I/O needed for the SELECT, has no impact on the source.

The connection that is used by the FDW does not need to be valid all the time; it only needs to work when you run the synchronization. As long as you don’t access the foreign table, it doesn’t matter whether your computer can connect to the source or not.
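For reference, the one-time setup of the foreign table on the laptop might look roughly like this (using postgres_fdw; host, user and password are placeholders):

-- on the local database, done once
create extension if not exists postgres_fdw;

create server measurement_source
  foreign data wrapper postgres_fdw
  options (host 'source.example.com', dbname 'measurements');

create user mapping for current_user
  server measurement_source
  options (user 'readonly_user', password 'secret');

-- mirrors the structure of the source table "data"
create foreign table remote_fdw_table (
  devicename  character varying(30),
  id          integer,
  "timestamp" timestamp with time zone,
  value       numeric,
  variable    text
) server measurement_source
  options (schema_name 'public', table_name 'data');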

Copy everything into a staging table, then use “upsert”

A variation of the solution with the foreign table is to simply export the whole source table, import it into a staging table with the same structure, then use the above INSERT ... ON CONFLICT approach to insert/update the real table.

That requires one more step for synchronization but might be easier to handle depending on your environment.
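A rough sketch of that variant, assuming a hypothetical staging table named staging_data on the laptop:

-- once, on the laptop: a staging table with the same columns
create table staging_data (like data);

Each synchronization run then copies everything into the staging table and merges it into the real table:

psql "host=source.example.com dbname=measurements" -c "copy data to stdout" \
  | psql "dbname=measurements_local" -c "copy staging_data from stdin"

-- then, in the local database
insert into data
select * from staging_data
on conflict (devicename, id) do update
    set "timestamp" = excluded."timestamp",
        value       = excluded.value,
        variable    = excluded.variable;

truncate staging_data;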
