Right now I’m developing a DB schema for an application that versions several different sets of numerical data, which must be joined together for use in views. The application requires that we be able to ‘look back’ and find the state of these multiple variables at an arbitrary point in time.
I’ve built up a few view definitions that use subqueries to fetch the various inputs along arbitrary join criteria, checking each version against a
time_created field. This isn’t a strict SCD2 implementation, but it’s close. (link to Slowly Changing Dimensions page on Wikipedia)
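For context, each versioned input lives in its own table shaped roughly like this (a sketch with hypothetical names and columns, not the real schema):

```sql
-- Hypothetical SCD-style input table; names and columns are illustrative.
CREATE TABLE tax (
    id           serial PRIMARY KEY,
    rate         numeric NOT NULL,       -- the versioned value
    time_created timestamptz NOT NULL    -- when this version became effective
);

-- A new version is simply a new row; old rows are kept for 'look back' queries.
INSERT INTO tax (rate, time_created) VALUES (0.10, '2020-01-01');
INSERT INTO tax (rate, time_created) VALUES (0.12, '2021-06-01');
```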
You can see a similar (albeit greatly simplified) implementation of our schema here: DB Fiddle link. The real schema has 6 subqueries/joins.
This works very well for my purposes, except that once we scale up into the thousands of ‘purchase’ records we incur a significant query cost. A query against 2000+ rows takes ~1s, which is unsuitable for our application.
The first solution that springs to mind is a table that joins a ‘purchase’ to its various versioned input tables, populated by the application when a purchase is created. This should be much more efficient, but requires more schema complexity.
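To make that alternative concrete, the idea would be something like the following (hypothetical table and column names; the application would fill in the version ids at purchase time):

```sql
-- Hypothetical 'snapshot' table: pins a purchase to the exact input versions
-- that were current when it was created, so views can join directly by id
-- instead of searching each history table by timestamp.
CREATE TABLE purchase_inputs (
    purchase_id integer NOT NULL REFERENCES purchase (id),
    tax_id      integer NOT NULL REFERENCES tax (id),
    -- ... one FK per versioned input table ...
    PRIMARY KEY (purchase_id)
);
```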
Can anyone offer me advice on how to best plan my schema/queries so as to be most optimal for this kind of setup?
Admittedly this is a cross-post from SO, but it received zero interest there over 24 hours, so I figured it may be better suited to the DBA Stack Exchange. Apologies in advance if this causes any offence.
The original can be found here.
I found a somewhat satisfactory solution to this issue. I feel a bit foolish for overlooking this, but you could put it down to a case of ‘premature optimization’ on my part. Following the academic example on Wikipedia, I had omitted the
end_date columns from my SCD tables. I figured they weren’t strictly necessary, since all you need to do to ‘rewind’ a table and find the correct SCD row is take the first row with a
start_date less than or equal to the date you’re querying against, once the table is sorted by start_date. But that approach requires a table sort to function correctly.
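The sort-based lookup I had been relying on looks roughly like this (simplified sketch; `p` is the purchase row):

```sql
-- Without an end date, finding the version in effect at p.time_created
-- means ordering the history and taking the newest qualifying row.
LEFT JOIN LATERAL (
    SELECT *
    FROM tax t
    WHERE t.time_created <= p.time_created
    ORDER BY t.time_created DESC   -- this sort was the bottleneck
    LIMIT 1
) current_tax ON true
```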
The major bottleneck in my query was sorting a moderately large (2500+ row) table by a timestamp. I managed to resolve this by adding an ‘end_date’ column and changing the query to use (simplified example):
```sql
LEFT JOIN LATERAL (
    SELECT *
    FROM tax t
    WHERE p.time_created >= t.time_created
      AND p.time_created <= COALESCE(t.time_ended, '9999-12-30'::date)
) current_tax ON true
...
```
This cuts the query down from ~1 sec to <100 ms. I tried adding indexes on fields in the 2500+ row table, but that didn’t seem to have a significant effect on the query time; perhaps I made some errors there. At any rate, this version seems better optimized.
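For reference, the kind of index I experimented with was along these lines (a sketch; whether it actually helps depends on how the planner uses it for the range predicates above):

```sql
-- Composite index covering both range predicates in the lateral join.
CREATE INDEX tax_effective_range_idx
    ON tax (time_created, time_ended);
```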
An updated DBFiddle can be seen here.
I would love some advice on how to better represent this, or how to better account for
NULL values in the time_ended column.