Question :
I have a database table with millions of measurements. New data is coming in every day. For analysis, I want to replicate the data to my laptop (local Postgres db). No need for automatic replication. It’s OK to start a script for this.
I feel like the standard replication solutions are inadequate and overengineered for my use because
- I need asynchronous replication,
- some solutions are inefficient because they replicate row-wise (yet data on the server is bulk-inserted),
- and they need too much configuration.
I would like to use COPY TO STDOUT | COPY FROM STDIN, but then I have to select only the new data. How can I do that?
The table has this form:
Column | Type
------------+--------------------------
devicename | character varying(30)
id | integer
timestamp | timestamp with time zone
value | numeric
variable | text
PK would be (devicename, id). Note that id alone is not unique because the data is coming from multiple devices.
How can I select only new data for COPY?
Are there any other approaches for these replication requirements?
Answer :
I see the following possible solutions:
Change the source table
Change the source table to include a column that you can use to (uniquely) identify the data you have already synchronized: a column based on a sequence (serial, or identity in Postgres 10), or a timestamp column that is populated through a trigger.
If that column is indexed, this is probably the most efficient solution.
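A minimal sketch of that approach (the column name change_id and the literal 42 are placeholders; adding an identity column rewrites the table and backfills the existing rows):
alter table data add column change_id bigint generated always as identity;
create index on data (change_id);

-- on each run, export only the rows added since the highest change_id copied last time
copy (select devicename, id, "timestamp", value, variable
      from data
      where change_id > 42) to stdout;
The 42 would be replaced by the maximum change_id you stored after the previous run, e.g. in a small bookkeeping table on the laptop.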
Keep track of copied data
Create a table in the source database that contains the already synchronized rows.
create table synchronized
(
devicename character varying(30),
id integer,
primary key (id, devicename)
);
The following statement would return all rows that are not yet synchronized and “mark” them as transferred in a single query. It should be possible to use that as the source for the copy statement (but I haven’t tested it).
with todo as (
    -- all rows that have not been synchronized yet
    select *
    from data
    where (id, devicename) not in (select id, devicename from synchronized)
), done as (
    -- record them as synchronized
    insert into synchronized (id, devicename)
    select id, devicename
    from todo
)
select *
from todo;
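If COPY accepts the data-modifying CTE (as said, untested), the query can be plugged in directly as the source of the export:
copy (
    with todo as (
        select * from data
        where (id, devicename) not in (select id, devicename from synchronized)
    ), done as (
        insert into synchronized (id, devicename)
        select id, devicename from todo
    )
    select * from todo
) to stdout;
The matching import on the laptop would then be a plain copy data from stdin.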
Create a foreign table and use that for synchronization.
Using a foreign data wrapper you create a foreign table in your local Postgres server that connects to the remote server. Then whenever you want to synchronize the data, you can do that with a single statement:
insert into local_table
select *
from remote_fdw_table
on conflict (devicename, id) do update
    set "timestamp" = excluded."timestamp",
        value = excluded.value,
        variable = excluded.variable;
That is probably the least efficient solution as it will always transfer the complete table to your local computer and overwrite/insert as needed. But it requires no changes to the source database and, other than the I/O needed for the SELECT, has no impact on the source.
The connection that is used by the FDW does not need to be valid all the time. It only needs to work when you run the synchronization. As long as you don’t access the foreign table, it doesn’t matter whether your computer can connect to the source or not.
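Setting up such a foreign table on the laptop with postgres_fdw might look roughly like this (server name, connection options, credentials, and the schema names are placeholders for your environment):
create extension if not exists postgres_fdw;

create server measurement_source
    foreign data wrapper postgres_fdw
    options (host 'source.example.com', dbname 'measurements');

create user mapping for current_user
    server measurement_source
    options (user 'reader', password 'secret');

-- make the remote table visible locally, e.g. in a schema of its own
create schema if not exists remote;
import foreign schema public limit to (data)
    from server measurement_source into remote;
Then remote.data plays the role of remote_fdw_table in the statement above.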
Copy everything into a staging table, then use “upsert”
A variation of the solution with the foreign table is to simply export the whole source table, import it into a staging table with the same structure, and then use the above INSERT ON CONFLICT approach to insert/update the real table.
That requires one more step for synchronization but might be easier to handle depending on your environment.
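A sketch of that flow (the staging table name data_staging is a placeholder, and the commented pipeline only shows one way to move the dump):
-- one-time setup on the laptop: a staging table with the same structure
create table data_staging (like data);

-- each run: clear the staging table, then load the full export, e.g.
--   psql "host=source dbname=measurements" -c "copy data to stdout" \
--     | psql measurements -c "copy data_staging from stdin"
truncate data_staging;

-- finally, upsert from the staging table into the real table
insert into data
select *
from data_staging
on conflict (devicename, id) do update
    set "timestamp" = excluded."timestamp",
        value = excluded.value,
        variable = excluded.variable;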