I need to store binary data files in a PostgreSQL database which runs on an Ubuntu server. Initially there will be a few dozen files of approximately 250 kB each. However, the number of files will increase over time. I may at times need to extract data from the files for other downstream analyses.
I’ve done some research regarding the age-old question of storing binary data as BLOBs in the database or as references to files on the filesystem. Both obviously have their pros and cons. Are there any issues specific to PostgreSQL that I should be aware of? Is one method or the other preferable if I want to extract data from the files, either via a PostgreSQL function or via an external Python program?
If I were to store the data files directly in the database, would it be better to store them in a separate table with a foreign key referencing the “main” table, rather than in the table containing all the other fields?
I have read the question and answers here; a comment there suggests that storing binary files by reference (in the filesystem) on Linux is better. My questions here relate to PostgreSQL specifically and to extracting data from the files for various analyses.
Update: Similar question.
I think you should store the data in the database as a regular bytea column. That way you get all the advantages of a database, and you can process the data using database functions (and even PL/Python, if you want that). Larger values will automatically be stored out of line via TOAST, so there is no reason for you to introduce another level of reference indirection yourself.
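To make the bytea approach concrete, here is a minimal sketch using any DB-API connection (e.g. one obtained from `psycopg2.connect(...)`). The table name `data_files` and its columns are my own illustration, not anything from the question; `fetch_slice` shows that part of a file can be extracted server-side with `substring`, which works on bytea values.

```python
# Illustrative schema; the table name data_files and its columns are
# assumptions made for this sketch, not part of the original question.
CREATE_SQL = """
CREATE TABLE IF NOT EXISTS data_files (
    id   bigserial PRIMARY KEY,
    name text  NOT NULL,
    data bytea NOT NULL
)
"""

def store_file(conn, path):
    """Insert a file's raw bytes as a bytea value; return the new row id."""
    with open(path, "rb") as f:
        payload = f.read()
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO data_files (name, data) VALUES (%s, %s) RETURNING id",
            (path, payload),  # psycopg2 adapts Python bytes to bytea
        )
        new_id = cur.fetchone()[0]
    conn.commit()
    return new_id

def fetch_slice(conn, file_id, length):
    """Extract only the first `length` bytes of the stored value
    server-side, without pulling the whole file across the wire."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT substring(data FROM 1 FOR %s) FROM data_files WHERE id = %s",
            (length, file_id),
        )
        row = cur.fetchone()
    return bytes(row[0]) if row else None
```

The same kind of extraction can of course also be wrapped in a PostgreSQL function (plain SQL or PL/Python) if the analysis should live inside the database.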
The main reasons for storing large binary objects outside the database would be that they are too large to store and retrieve in acceptable time, that they bloat the database beyond practicality, or that you need to access them as regular files from a separate application. As far as I can tell, none of that applies here.