I need to calculate reach, grouped by different criterions, and would prefer to keep everything in my database. I’m looking for either a pre-built bitset implementation for PostgreSQL, or pointers on how I may achieve my goal.
What I’m currently doing is storing serialized Java instances of EWAHCompressedBitSet in bytea columns, but this means I have to round-trip to a Java process to continue my calculations.
Ideally, I’d like to be able to do the following:
-- Unique people reached during a time window SELECT service_name, last_updated_at, bitset_or(followers_bs) FROM followers GROUP BY service_name, last_updated_at
I have used a commercial solution (Truviso) that had such a feature, and it was based on PostgreSQL as well.
Note that I do have a table that has the data in an uncompressed format, but because the data is so huge (10M rows and counting), JOINing on it simply takes too much time than I want to devote to it.
Can you use the
"bit" type instead of
"bytea", and use your own user-defined aggregate?
postgres=> CREATE AGGREGATE bitset_or ("bit") postgres-> ( sfunc = bitor, postgres-> stype = "bit", postgres-> initcond = '0' );
bitset_or should work with your query as written (again, assuming bytea -> bit).