To partition or not to partition?


Question :

Having already read several questions on SO, external blog posts and the manual, I still find myself wondering whether I should go with partitioning in my case or not.

The case – simplified

Storing customer data. All table names mentioned below are made up for clarity.

  1. There are non-physical objects, identifiable by customer, and the physical objects in which they are actually stored, in case some objects need to be sent back to the customer on demand or processed in some other way. They are mapped in a many-to-many relationship: objects_nonphysical, objects_physical, objects_mapping_table.

  2. The second many-to-many relationship is between those non-physical objects and their metrics; some objects are bound to certain metrics: metrics, metrics_objects_nonphysical.

  3. Both non-physical and physical objects have hierarchy tables holding child-parent relations: objects_nonphysical_hierarchy, objects_physical_hierarchy. (A rough DDL sketch of the whole model follows this list.)
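For concreteness, here is a minimal DDL sketch of that model; the column names and surrogate keys are my own assumptions, nothing is fixed yet:

CREATE TABLE objects_nonphysical (
    id          bigserial PRIMARY KEY,
    customer_id int NOT NULL
);

CREATE TABLE objects_physical (
    id bigserial PRIMARY KEY
);

-- many-to-many: which physical object stores which non-physical object
CREATE TABLE objects_mapping_table (
    nonphysical_id bigint NOT NULL REFERENCES objects_nonphysical(id),
    physical_id    bigint NOT NULL REFERENCES objects_physical(id),
    PRIMARY KEY (nonphysical_id, physical_id)
);

CREATE TABLE metrics (
    id bigserial PRIMARY KEY
);

-- many-to-many: non-physical objects bound to metrics
CREATE TABLE metrics_objects_nonphysical (
    metric_id      bigint NOT NULL REFERENCES metrics(id),
    nonphysical_id bigint NOT NULL REFERENCES objects_nonphysical(id),
    PRIMARY KEY (metric_id, nonphysical_id)
);

-- child-parent hierarchies
CREATE TABLE objects_nonphysical_hierarchy (
    child_id  bigint NOT NULL REFERENCES objects_nonphysical(id),
    parent_id bigint NOT NULL REFERENCES objects_nonphysical(id),
    PRIMARY KEY (child_id, parent_id)
);

CREATE TABLE objects_physical_hierarchy (
    child_id  bigint NOT NULL REFERENCES objects_physical(id),
    parent_id bigint NOT NULL REFERENCES objects_physical(id),
    PRIMARY KEY (child_id, parent_id)
);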

Depending on each customer's needs and requirements, the data about physical objects may be supplied to us or may need to be created from scratch. Basically, what I need to do is:

  • Maintain an internal system for fast INSERT and SELECT statements, because this is where the mapping is going to take place.

  • Maintain a system for external customers to view and operate on their non-physical objects, with fast retrieval of data. There is a strong need for efficient SELECT statements, since this data is available for many customers to search whenever they want.

My consideration

There can be a customer who may access, view and operate on the data, but who does not need to be the contractor we got the data from / are processing the data for.

This has led me to consider table partitioning for my system, since I always know which partition data should fall into (partitioning by contractor for the internal system), and then to maintain the system for external customers where I need partitioning by customer (this would be done with some delay, using automation tools and a set of rules to rewrite the data per customer), so that for each customer we would only scan one partition of each table.
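For illustration only, a standalone toy example of that kind of partitioning in the inheritance style (the contractor_id column and values are made up); with constraint_exclusion enabled, the planner scans only the matching child table:

CREATE TABLE objects_nonphysical (
    id            bigserial,
    contractor_id int NOT NULL
);

-- one child table per contractor, constrained to its own rows
CREATE TABLE objects_nonphysical_c1 (
    CHECK (contractor_id = 1)
) INHERITS (objects_nonphysical);

CREATE TABLE objects_nonphysical_c2 (
    CHECK (contractor_id = 2)
) INHERITS (objects_nonphysical);

SET constraint_exclusion = partition;  -- the default setting

-- scans only objects_nonphysical_c2 (plus the empty parent)
SELECT * FROM objects_nonphysical WHERE contractor_id = 2;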

Data volume

My data is going to grow constantly, especially when importing new customers' objects and metrics. The pace at which new data will arrive is unpredictable in the long run; there is really no way to measure it without knowing who the next customer will be. Right now there are just 2 customers with roughly 1M rows per customer in every table, but in the future I expect new customers to come with volumes of 10M rows or so.

Questions

These questions are all related to each other.

  1. Should partitioning really be considered here, or is that overkill? I think it would be of use, since I'm always scanning exactly one partition.
  2. If partitioning is the way to go, how do I enforce FK constraints most effectively given my needs? Should I go for constraint triggers (a rough sketch of what I mean follows this list), or just keep it in the application layer for the internal system, or maybe some other method?
  3. If partitioning isn’t the way to go, what should I dive into?
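Regarding question 2, this is roughly what I mean by a constraint trigger; a sketch only, with made-up column names, and the lookup would still have to hit the right partition:

-- emulate the FK objects_mapping_table.nonphysical_id -> objects_nonphysical.id
CREATE OR REPLACE FUNCTION check_nonphysical_exists() RETURNS trigger AS $$
BEGIN
    IF NOT EXISTS (SELECT 1 FROM objects_nonphysical WHERE id = NEW.nonphysical_id) THEN
        RAISE EXCEPTION 'objects_nonphysical id % does not exist', NEW.nonphysical_id;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE CONSTRAINT TRIGGER fk_mapping_nonphysical
    AFTER INSERT OR UPDATE ON objects_mapping_table
    DEFERRABLE INITIALLY DEFERRED
    FOR EACH ROW EXECUTE PROCEDURE check_nonphysical_exists();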

If I have not supplied enough information, please let me know in the comments below.

Answer :

There are many open ends in your question, but partitioning by customer could be the way to go – especially if:

  • you expect many customers,
  • each of them could have tons of data (“tons” meaning much more than the RAM cache size),
  • most of their datasets will be mutually exclusive (each customer sees a different subset of data).

RULEs or triggers are a performance overhead, and can be avoided.

Consider something along these lines:

BEGIN;

CREATE USER tenant1;
CREATE USER tenant2;

CREATE SCHEMA app;
CREATE SCHEMA tenant1;
CREATE SCHEMA tenant2;

-- parent tables owned by the application schema
CREATE TABLE app.objects_nonphysical(id int);
CREATE TABLE app.objects_physical(id int);
CREATE TABLE app.objects_mapping(id int);

-- per-tenant child tables inherit the structure of the parents
CREATE TABLE tenant1.objects_nonphysical() INHERITS(app.objects_nonphysical);
CREATE TABLE tenant1.objects_physical() INHERITS(app.objects_physical);
CREATE TABLE tenant1.objects_mapping() INHERITS(app.objects_mapping);
CREATE TABLE tenant2.objects_nonphysical() INHERITS(app.objects_nonphysical);
CREATE TABLE tenant2.objects_physical() INHERITS(app.objects_physical);
CREATE TABLE tenant2.objects_mapping() INHERITS(app.objects_mapping);

GRANT USAGE ON SCHEMA tenant1 TO tenant1;
GRANT USAGE ON SCHEMA tenant2 TO tenant2;
GRANT SELECT,INSERT,UPDATE,DELETE ON ALL TABLES IN SCHEMA tenant1 TO tenant1;
GRANT SELECT,INSERT,UPDATE,DELETE ON ALL TABLES IN SCHEMA tenant2 TO tenant2;

/* TEST: simulate login as customer */
SET SESSION AUTHORIZATION tenant2;
/* No schema needed - default search_path works */
SELECT count(*) FROM objects_nonphysical; 

ROLLBACK;

You do not need any triggers / rules to maintain it.

There are open ends here – that’s just a draft… Some issues:

  • PK, FK constraints and indexes are not “inherited” by the child tables.
  • Even if you create them, the PK on the master table is not enforced for rows in the child tables, so there is no global uniqueness guarantee.
  • You can overcome this by using the same sequence for all tenants (see the draft below).
  • Obviously, the application must be adjusted to this model.
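For example, a draft of that last point based on the snippet above: per-tenant primary keys drawing their values from one shared sequence, so ids stay unique across tenants as long as nobody inserts explicit ids:

CREATE SEQUENCE app.objects_nonphysical_id_seq;

ALTER TABLE tenant1.objects_nonphysical
    ADD PRIMARY KEY (id),
    ALTER COLUMN id SET DEFAULT nextval('app.objects_nonphysical_id_seq');

ALTER TABLE tenant2.objects_nonphysical
    ADD PRIMARY KEY (id),
    ALTER COLUMN id SET DEFAULT nextval('app.objects_nonphysical_id_seq');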

It won’t hurt if you implement partitioning now but use a single partition until your system really requires a new one. Performance-wise, there will be only a tiny overhead to deal with primary keys and such.

I recommend using rules for redirecting inserts, and an external table for the primary keys (e.g. CREATE TABLE objects_physical_ids (id bigserial NOT NULL PRIMARY KEY)), along with a trigger function that inserts a row into the ids table and copies it to NEW.id (e.g. INSERT INTO objects_physical_ids DEFAULT VALUES RETURNING id INTO NEW.id;), other trigger functions that deal with deletion and updates, and a trigger executing those functions on each inherited table (do not forget to add these when you create a new inherited table!). Then all related tables can have a FOREIGN KEY to the relevant ids table (including any foreign key actions like ON UPDATE or ON DELETE).
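A rough sketch of that idea for the objects_physical hierarchy, reusing the tenant1 table from the draft above (insert path only; the delete/update triggers are left out, and the shipments table is just a hypothetical example of a referencing table):

-- shared table holding the actual primary keys for all partitions
CREATE TABLE objects_physical_ids (
    id bigserial NOT NULL PRIMARY KEY
);

-- trigger function: allocate an id and copy it into the new row
CREATE OR REPLACE FUNCTION objects_physical_new_id() RETURNS trigger AS $$
BEGIN
    INSERT INTO objects_physical_ids DEFAULT VALUES RETURNING id INTO NEW.id;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- attach it to every inherited table (remember to do this for each new partition!)
CREATE TRIGGER objects_physical_new_id
    BEFORE INSERT ON tenant1.objects_physical
    FOR EACH ROW EXECUTE PROCEDURE objects_physical_new_id();

-- related tables reference the ids table instead of the partitioned table
CREATE TABLE shipments (
    physical_id bigint NOT NULL
        REFERENCES objects_physical_ids (id) ON UPDATE CASCADE ON DELETE CASCADE
);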
