Biological Sequences of UniProt in PostgreSQL

Question:

What is the best way to store UniProt biological sequences in PostgreSQL?

Data Details

  • We pull in 12 million sequences from UniProt – this number is likely to double every 3–10 months.
  • The length of a sequence can vary from 10 to 50 billion characters
  • Less than 1% of the sequences are longer than 10 thousand characters
    • Would it improve performance to store the longer sequences separately?
  • A sequence can be of either Protein or DNA alphabet
    • The DNA alphabet has 5 characters (A, T, C, G, or -).
    • The Protein alphabet will have around 30 characters.
    • We don’t mind storing the sequences of the two different alphabets in different columns or even different tables. Would that help?

Data Access Details

To answer Jeremiah Peschka’s comment:

  • Protein and DNA sequences would be accessed at different times
  • Would not need to search within the sequence (that’s done outside of db)
  • Would either access single rows at a time or pull out sets of rows by IDs. We would not need to scan rows. All sequences are referenced by other tables – several biologically and chronologically meaningful hierarchies exist in the database.
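Given those access patterns, a minimal sketch of the split-by-alphabet layout mentioned above might look like this (table and column names are illustrative, not prescriptive):

```sql
-- Illustrative schema: one table per alphabet, referenced by id from the
-- hierarchy tables. Plain text columns; TOAST handles the long values.
CREATE TABLE dna_sequence (
    id       bigserial PRIMARY KEY,
    seguid   varchar(27) NOT NULL UNIQUE, -- SEGUID: base64 SHA-1, '=' stripped
    sequence text        NOT NULL         -- alphabet: A, T, C, G, -
);

CREATE TABLE protein_sequence (
    id       bigserial PRIMARY KEY,
    seguid   varchar(27) NOT NULL UNIQUE,
    sequence text        NOT NULL         -- ~30-character protein alphabet
);
```

Since protein and DNA sequences are accessed at different times, separate tables would also let each one live in its own tablespace.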

Backwards Compatibility

It would be nice to continue to be able to apply the following hashing function (SEGUID – SEquence Globally Unique IDentifier) to the sequences.

CREATE OR REPLACE FUNCTION gfam.get_seguid(p_sequence character varying)
  RETURNS character varying AS
$BODY$
declare
  result varchar := null;
  x      integer;
begin
  -- base64-encoded SHA-1 digest of the sequence
  select encode(gfam.digest(p_sequence, 'sha1'), 'base64')
  into   result;

  -- strip the trailing base64 padding character, per the SEGUID convention
  x := length(result);
  if substring(result from x for 1) = '=' then
     result := substring( result from 1 for x-1 );
  end if;

  return result;
end;
$BODY$
  LANGUAGE plpgsql
  COST 100;
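Assuming `gfam.digest` is a thin wrapper around pgcrypto's `digest()`, the function can be exercised like this (the sequence literal is only an illustration):

```sql
-- pgcrypto provides digest(text, text); gfam.digest is assumed to wrap it.
CREATE EXTENSION IF NOT EXISTS pgcrypto;

-- Returns the base64-encoded SHA-1 of the input with the trailing '=' removed.
SELECT gfam.get_seguid('MKTAYIAKQR');
```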

Answer:

Exploring the functions at PostBio, it looks like they offer a couple of ways of encoding sequences. However, given that those extensions are optimized for searching, they make multiple references to simply using the text data type.

According to the documentation:

Long strings are compressed by the system automatically, so the physical requirement on disk might be less. Very long values are also stored in background tables so that they do not interfere with rapid access to shorter column values. In any case, the longest possible character string that can be stored is about 1 GB.

Therefore, putting the table into its own very large tablespace on dedicated hardware should be sufficient for your performance goals. If 1 GB is too small for your data, the int_interval from PostBio should provide excellent performance:

A sequence feature corresponds to a triplet (id, orient, ii) where id is a sequence identifier (possibly the primary key for a sequence table), orient is a boolean indicating if the feature is in the same or contrary orientation of the sequence, and ii is the int_interval representing the feature as a subsequence.
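If you do stay with plain text columns, note that the automatic compression the documentation mentions can be tuned per column. For high-entropy or pre-compressed data, EXTERNAL storage skips compression and speeds up substring access on long values (the table and column names here are hypothetical):

```sql
-- EXTENDED (the default for text) = compress, then store out-of-line.
-- EXTERNAL = store out-of-line without compression; faster substring().
ALTER TABLE uniprot_sequence ALTER COLUMN sequence SET STORAGE EXTERNAL;
```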

Hashing the sequence with SHA-1 looks to be a very painful way of making a GUID, considering the potential lengths of the sequences.

If the different sequences are unrelated, store them on different tablespaces on different disks for maximum performance.

I think 50 billion characters will push the limits of what you can do with PostgreSQL without splitting your records. I suspect you will have to break things apart in some way. I don’t know what sort of encoding PostBio allows, but…

Quick calculations here: 5 characters require 3 bits to encode, but 4 bits will make searching easier, since two characters can be encoded per byte. On the other hand, 3 bits may be sufficient if you are searching for groups of 10 or more letters, since you could pack 10 characters into 4 bytes. Optimized for short string searches, 50 billion characters takes approximately 25 GB of storage, well beyond what you can do in a single column. Compression may help, but that’s a huge compression ratio required, beyond the minimal uncompressed binary representation, in order to get down to 1 GB. Optimized for longer searches, we get only 20 GB, so I think even with dedicated genetic-information types you’d have to break things up. Proteins at that complexity will be even more of a challenge, since the best you can hope for is 5-bit notation, which packs 6 characters into 32 bits, meaning your best case for storage is around 30 GB per column. Compression may again help, but that’s a large compression ratio required. I have seen good compression rates, but keep in mind you may be pushing it.
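The back-of-envelope arithmetic above can be checked directly in SQL (decimal GB; the 50-billion-character figure is the worst case from the question):

```sql
-- Storage estimates for a single 50e9-character sequence at each encoding.
SELECT (50e9 * 4 / 8) / 1e9  AS dna_4bit_gb,     -- 25.00: 2 chars per byte
       (50e9 / 10 * 4) / 1e9 AS dna_3bit_gb,     -- 20.00: 10 chars per 4 bytes
       (50e9 * 5 / 8) / 1e9  AS protein_5bit_gb; -- 31.25: 5 bits per char
```

Each figure is far above the roughly 1 GB per-value limit, which is the point: the longest sequences cannot fit in a single column regardless of encoding.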

So my recommendation is to be aware of this problem and do some testing with real data. Be prepared to decompose your readings in some cases.
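If decomposition does become necessary, one possible sketch is a chunk table keyed by (sequence_id, chunk_no); the chunk size and all names here are arbitrary assumptions, not recommendations:

```sql
-- Hypothetical chunking layout: long sequences are split into ordered
-- pieces and reassembled with string_agg() on read.
CREATE TABLE sequence_chunk (
    sequence_id bigint  NOT NULL,
    chunk_no    integer NOT NULL,  -- 0-based position of this piece
    piece       text    NOT NULL,  -- e.g. up to 100e6 characters each
    PRIMARY KEY (sequence_id, chunk_no)
);

-- Reassemble one sequence (fine for occasional single-row access):
SELECT string_agg(piece, '' ORDER BY chunk_no)
FROM   sequence_chunk
WHERE  sequence_id = 42;
```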
