Schema for analytics table in Postgres

Question :

We use Postgres for analytics (star schema).
Every few seconds we get reports on ~500 metric types.
The simplest schema would be:

timestamp      metric_type     value
78930890       FOO              80.9
78930890       ZOO              20

Our DBA has come up with a suggestion to flatten all reports from the same 5-second window into a single row:

timestamp   metric1     metric2     ...  metric500
78930890    90.9        20          ...  

Some developers push back on this, saying it adds a lot of complexity to development (data must be batched so that each row is written in one shot) and hurts maintainability (just looking at the table or adding fields becomes harder).

Is the DBA's model standard practice in such systems, or only a last resort once the original model is clearly not scalable enough?

EDIT: the end goal is to draw a line chart for the users. So queries will mostly be selecting a few metrics, folding them by hours/minutes, and selecting min/max/avg per hour (or any other time period).
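For illustration, the query shape described here can be sketched in plain Python over rows of the simple (timestamp, metric_type, value) layout. This is only a sketch: the function name `hourly_stats` and the sample rows are hypothetical, and it assumes the timestamp is in epoch seconds, which the post does not state.

```python
from collections import defaultdict

def hourly_stats(rows, wanted):
    """Return {(hour_start, metric_type): (min, max, avg)} for the wanted metrics."""
    buckets = defaultdict(list)
    for ts, metric_type, value in rows:
        if metric_type in wanted:
            # Assumption: ts is epoch seconds, so this truncates to the hour.
            hour_start = ts - ts % 3600
            buckets[(hour_start, metric_type)].append(value)
    return {
        key: (min(vals), max(vals), sum(vals) / len(vals))
        for key, vals in buckets.items()
    }

# Hypothetical sample rows: (timestamp, metric_type, value)
rows = [
    (7200, "FOO", 80.0),
    (7205, "FOO", 90.0),
    (7205, "ZOO", 20.0),
    (10800, "FOO", 70.0),
]
stats = hourly_stats(rows, {"FOO"})
```

In the database itself, the same folding would be a GROUP BY on the truncated timestamp and the metric, with min, max and avg as aggregates.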

EDIT: The DBA's main argument is that reducing the number of rows by a factor of 500 will allow more efficient indexes and better memory use (the table will contain hundreds of millions of rows before this optimization). Also, when selecting multiple metrics, the suggested schema allows one pass over the data instead of a separate index search for each metric.

EDIT: 500 metrics is an “upper bound” but in practice most of the time only ~40 metrics are reported per 5 seconds (not the same 40 though)

Answer :

As commented, arguments from both sides are valid. Let’s call them “star” (the flattened schema of your DBA) and “EAV” (entity-attribute-value). The latter name is a hint: EAV is a well-known design that is often criticized. Details in this related answer:
Is there a name for this database structure?

Well, if your 500 metrics are of well-known types and you don’t invent new ones / drop old ones all the time, it’s not the worst-case EAV scenario, but similar.

There is no “standard” way to do it. The more flexible approach would be the “EAV” schema. Then you can easily add and delete metric types: add a row to the referenced metric_type table, or delete one (cascading to the metrics table). In the “star” model, that would need a schema change and be more expensive.

You can do quite a bit for either model with smart covering / partial / multicolumn indexes.

Some decision guidance

Aggregates (min/max/avg/…) on a single metric-type? -> “star”
Aggregates considering all or many different metrics? -> “EAV”

Do these attributes describe a common entity? -> “star”
Or is it just a bunch of numbers that may be grouped / split up any other way? -> “EAV”

Your data is written once and then never changed? -> “star”
Or do you run UPDATES on selected metrics? -> “EAV”

Is your set of 500 metrics complete? The set hardly ever or never changes? -> “star”
New metrics are added, existing metrics are dropped all the time? -> “EAV”

Storage size

Concerning your comment:

Storage is less important now for optimization, we are focusing on query times.

Storage size is a major factor for query times. The number of data pages that have to be read to satisfy a query is probably the most important single factor for performance.

Let’s start with your casual remark: Data type is int or double.

int occupies 4 bytes.
double (double precision in Postgres) occupies 8 bytes.

Assuming all columns are NOT NULL, 500 integer columns, plus 1 timestamp plus row overhead (no padding) would occupy 2036 bytes in the “star” schema. No compression possible. Here is how you calculate that:
Configuring PostgreSQL for read performance

If you mix int with double, be sure not to waste space for padding. For instance, group integer and double metrics.
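To see why column order matters, here is a small Python sketch of alignment padding. It is a simplified model, not Postgres code: it assumes 4-byte alignment for int, 8-byte alignment for double, and a data area that starts on an 8-byte boundary, and it ignores the timestamp column and the row header. The names `data_bytes`, `interleaved` and `grouped` are made up for the example.

```python
# Assumed sizes and alignments in bytes (simplified model of the two types).
SIZE = {"int": 4, "double": 8}
ALIGN = {"int": 4, "double": 8}

def data_bytes(columns):
    """Bytes of column data for one row, including padding between columns."""
    offset = 0
    for col in columns:
        # Pad up to the next multiple of this column's alignment.
        offset += -offset % ALIGN[col]
        offset += SIZE[col]
    return offset

# 250 int and 250 double metrics, in two different column orders.
interleaved = ["int", "double"] * 250
grouped = ["double"] * 250 + ["int"] * 250

interleaved_size = data_bytes(interleaved)
grouped_size = data_bytes(grouped)
```

Under these assumptions, alternating the two types wastes 4 bytes of padding after every int, while grouping all double columns first leaves no gaps.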

In the “EAV” model, you’d need at least 44 bytes per row (int values) or 52 bytes per row (double values). That is 22000 or 26000 bytes for the 500 metrics of one timestamp, 11 to 13 times as much as in the “star” schema. That matters. For one hour’s worth of data, you need to fetch 2000 data pages or more (default page size 8 kB) vs. around 180 pages for the “star” schema.
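The arithmetic behind these figures can be reproduced with a short Python sketch. The constants are assumptions chosen to match the numbers above: 24 bytes of row header, 4 bytes of item pointer per row, 8 bytes for the timestamp, 4 bytes for the metric type, a 24-byte page header, one report every 5 seconds, and all 500 metrics present and NOT NULL.

```python
TUPLE_HEADER = 24   # assumed row header size in bytes, including padding
ITEM_POINTER = 4    # assumed per-row pointer in the page's item array
PAGE_SIZE = 8192    # default page size
PAGE_HEADER = 24    # assumed page header size
METRICS = 500
REPORTS_PER_HOUR = 3600 // 5   # one report every 5 seconds

def pages(row_bytes, n_rows):
    """Pages needed if rows are packed as densely as possible."""
    rows_per_page = (PAGE_SIZE - PAGE_HEADER) // row_bytes
    return -(-n_rows // rows_per_page)   # ceiling division

# "star": one row per timestamp, with 500 int columns.
star_row = TUPLE_HEADER + ITEM_POINTER + 8 + METRICS * 4
# "EAV" with int value: timestamp + metric_type + value.
eav_row_int = TUPLE_HEADER + ITEM_POINTER + 8 + 4 + 4
# "EAV" with double value: 4 bytes of padding precede the 8-byte value.
eav_row_double = TUPLE_HEADER + ITEM_POINTER + 8 + 4 + 4 + 8

star_pages = pages(star_row, REPORTS_PER_HOUR)
eav_pages_int = pages(eav_row_int, REPORTS_PER_HOUR * METRICS)
eav_pages_double = pages(eav_row_double, REPORTS_PER_HOUR * METRICS)
```

With these assumptions, the row sizes come out at 2036, 44 and 52 bytes, and one hour of data takes about 180 pages in the “star” layout versus roughly 1900 to 2300 pages in the “EAV” layout. Real tables will differ, for example once NULLs or a fillfactor below 100 come into play.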

Here are some tools to measure size:
Measure the size of a PostgreSQL table row

I think storage size can be the key to performance here. If you are focusing on query times, and I had to make a wild guess, the “star” schema is probably the better choice.

But as I said, it depends on a lot of details.

Either way, you may be interested in cross tabulation (“pivot table”). The tablefunc module provides the crosstab() function:
Have aggregate function group results in one row
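To make the reshaping concrete, here is a minimal Python sketch of what a pivot does to “EAV” rows. It does not call crosstab(); it only illustrates the idea. The function name `pivot` and the second timestamp in the sample data are hypothetical.

```python
def pivot(rows, metric_types):
    """Turn (timestamp, metric_type, value) rows into one wide row per timestamp."""
    wide = {}
    for ts, metric_type, value in rows:
        # Every timestamp starts with None for each metric (like NULL in SQL).
        row = wide.setdefault(ts, dict.fromkeys(metric_types))
        if metric_type in row:
            row[metric_type] = value
    # Emit (timestamp, metric1, metric2, ...) tuples ordered by timestamp.
    return [(ts, *(wide[ts][m] for m in metric_types)) for ts in sorted(wide)]

# Sample rows in the simple layout; the 78930895 row is made up.
rows = [
    (78930890, "FOO", 80.9),
    (78930890, "ZOO", 20),
    (78930895, "FOO", 81.5),
]
table = pivot(rows, ["FOO", "ZOO"])
```

A metric that was not reported for a timestamp stays None in the wide row, which corresponds to a NULL column in the flattened table.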
