Estimated hardware requirement for a planned “mostly-write” database

Posted on

Question :

I was asked to create a server application that needs to receive and store relatively simply messages (about 7 fields, uints, strings and a timestamp), but a lot of them and at a high frequency. The messages fly in at an average rate of about 100 messages per second, with peaks going up to about 400 per second. The database will mainly consist of one large archive table and some reference tables with station names etc.

As I do not have experience with Postgres, I’m having a hard time defining the required hardware for the use case, is there some rule of thumb or some lookup table that you know of? Of course, bigger and faster is always “better”, but I would like to get a feeling for what will do the job with a good performance and reserve without going directly to the top of the shelf.

Here are some more details:

  • The data need to be accessible for at least 5 years before it is acceptable to export and remove them from the DB
  • the data need to be filtered by time ranges and some additional fields
  • Search and filtering results should be reasonably fast (from a GUI user’s point of view who knows nothing about the amount of data), so maybe some milliseconds or seconds
  • From what I already know, new messages do not need to be written to the DB instantly, it is ok to collect them in intervals of 1 to 2 seconds and bulk put them into the database
  • 99% of the time, nobody will look at the data, so there’s usually just the server application pumping messages into the archive, from time to time a user starts a client application to look at some values, so there are very few concurrent connections to the DB

Here’s a message definition from my current server application written in Go:

type Message struct {
    Sender     string   `json:"sender"`
    MessageID  int      `json:"message_id"`
    Text       string   `json:"text"`
    Priority   int      `json:"priority"`
    Parameters []string `json:"parameters"`
}

MessageID references a Messages table, for the Parameters string array I would like to use a Postgres string array field.

What do you think is a good hardware platform in terms of CPU & cores, disk space, memory? Thanks in advance for your estimation.

Answer :

How many connections are there going to be pumping in the data? Is every insert naturally its own transaction or are they naturally bundled up somehow?

From what I already know, new messages do not need to be written to the DB instantly, it is ok to collect them in intervals of 1 to 2 seconds and bulk put them into the database

Where are they going to be stored for 1 to 2 seconds? What happens if the system crashes/restarts and those 1 to 2 seconds get forgotten about, is that acceptable? If a small amount of data loss is acceptable on rare occasions, then simply setting synchronous_commit to off is probably a better solution then coming up with some complicated way to re-arrange small transactions so they are bundled into larger transactions.

All the above is mostly to answer the question of how fast does the fdatasync on your storage need to be.

the data need to be filtered by time ranges and some additional fields

How wide will the queries time ranges be? If they are narrow, then you can have an index just on the timestamp. Then read every row in the time range and apply the additional field filters to each row. This would make index maintenance pretty efficient, as the rows should be inserted roughly in timestamp order, meaning the index leaf pages that need to be updated will have just recently been used and so still in memory. If you need more indexes than this, then index maintenance will likely become your bottleneck, especially once the total size of your indexes becomes a substantial fraction of your RAM.

Also, will the time range in those queries be mostly for recent times, or for any time in the last 5 years?

The data need to be accessible for at least 5 years before it is acceptable to export and remove them from the DB

Do you envision this working by removing data from 5 years + 1 hour ago, every hour? Or just archive and drop an entire year at a time once a year? If the latter, then partitioning by year would be a good idea. But at 3 billion records per year, you might want to partition by month instead.

I think it is hard to do much capacity planning without knowing which indexes will need to exist.

I don’t think a large number of CPUs will help you. Spend the money on fast storage, or maybe RAM, instead.

Leave a Reply

Your email address will not be published. Required fields are marked *