Question:
Currently we (an active open-source project) have a large PostgreSQL production database that contains all the site's records. We can't just give every developer or contributor access to the database (it contains emails, passwords, phone numbers, etc.), but we do need to give the developers a somewhat up-to-date SQLite database. We currently do the following:
- dump the database
- analyze and modify the dump with custom Python scripts: truncate tables to make the export to SQLite more bearable and remove sensitive data such as passwords (a rough sketch of this step is below)
- export this to SQLite
This is very slow and error-prone.
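To illustrate, the sanitising step currently looks roughly like the sketch below. The `users` table and its columns are simplified placeholders, not our real schema, and the psycopg2/sqlite3 combination is just one way to wire it up:

```python
# Minimal sketch of the dump/sanitise/export step, with made-up table and
# column names. Requires psycopg2 (pip install psycopg2-binary).
import hashlib
import sqlite3

import psycopg2

ROW_LIMIT = 1000  # truncate each table so the SQLite file stays small


def scrub_email(email):
    """Replace a real address with a deterministic, non-reversible fake one."""
    digest = hashlib.sha256(email.encode()).hexdigest()[:12]
    return "user_{}@example.com".format(digest)


def export_users(pg_dsn, sqlite_path):
    pg = psycopg2.connect(pg_dsn)
    lite = sqlite3.connect(sqlite_path)
    lite.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, email TEXT)")
    with pg.cursor() as cur:
        # Cutting off at LIMIT is what breaks foreign keys elsewhere,
        # as described further down.
        cur.execute("SELECT id, email FROM users ORDER BY id LIMIT %s", (ROW_LIMIT,))
        for user_id, email in cur:
            # Passwords are never copied; emails are replaced with fakes.
            lite.execute("INSERT INTO users (id, email) VALUES (?, ?)",
                         (user_id, scrub_email(email)))
    lite.commit()
    lite.close()
    pg.close()


if __name__ == "__main__":
    export_users("dbname=production", "dev_snapshot.sqlite")
```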
Is there a recommended/best-practice way to do this, both the whitewashing and the table truncation, without breaking the relational mappings (foreign keys) between tables?
As I analyze the problem, the main difficulty I see is the relational mapping between tables. I can't just pull the first thousand records from every table (with LIMIT), because a row at, say, position 900 in one table may hold a foreign key pointing to row 1001 in another table. I guess the changing and sanitizing of data could be done with views, replacing certain columns with calculated ones (based on the original table columns). A cron job could then just dump the views.
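For example, a sanitising view over a hypothetical `users` table might look like the sketch below (the column names and the md5-based replacement are only illustrations of the idea); a cron job could then dump `users_sanitized` instead of `users`:

```python
# Sketch of the view idea: the view exposes calculated values instead of the
# real ones, so an export job never sees the sensitive columns.
import psycopg2

SANITIZED_VIEW = """
CREATE OR REPLACE VIEW users_sanitized AS
SELECT
    id,
    'user_' || md5(email) || '@example.com' AS email,  -- calculated, non-reversible
    NULL::text AS password_hash,                        -- secrets never leave the view
    NULL::text AS phone
FROM users;
"""


def create_sanitized_view(dsn):
    # The psycopg2 connection context manager commits on success
    # and rolls back on error.
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(SANITIZED_VIEW)


if __name__ == "__main__":
    create_sanitized_view("dbname=production")
```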
I'd be glad for any help/references.
Answer:
What you’re referring to is usually called Test Data Management. There are several commercial solutions, e.g. IBM Optim Test Data Management (disclosure: I work for IBM, but don’t speak for them). I’m not aware of any free products that can perform that function, so I’m afraid you’ll have to stick to your Python scripts unless you have deep pockets.
On a side note, I would suggest using the same database platform, i.e. PostgreSQL, for both production and development — it will probably save you some troubleshooting time later.
Generally speaking, the best-practice approach is to have a set of test data that isn't based directly on live data, particularly if you work in environments where you might be holding sensitive data such as personal or financial information.
The bulk of such a data set can be created in an automated manner by putting together a set of scripts that generate records following set patterns. If you make the process customisable, you can use it both for simple testing (a small DB for new dev work) and for performance testing (a large DB to make sure your code scales in that direction). For instance, in our training and competence recording system you'd want to ask it to "give me a DB with 10 teams spread over 4 branches of around 8 people/team in random statuses (trainee, full comp adviser, …) and suitable records going back 6 months" or "give me a DB with 1000 teams spread over 600 branches of around 8 people/team in random statuses (trainee, full comp adviser, …) and suitable records going back 24 months".

This way you have nothing to whitewash, and each developer can create a new test set to work with easily and quickly whenever they need to. You can also make sure that your sample data includes all the edge cases you have ever come across (or expect to come across), so it can become an important part of any automated regression testing you perform.
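As a very rough illustration of what such a generator can look like, here is a seedable sketch driven by the sizing parameters from the examples above; the field names and statuses are invented stand-ins rather than anything from a real system:

```python
# Parameterisable, seedable test-data generator sketch. Swap the dict output
# for your own persistence layer (ORM objects, INSERT statements, ...).
import random
from datetime import date, timedelta

STATUSES = ["trainee", "full comp adviser"]  # extend with the real status list


def generate_dataset(n_teams, n_branches, people_per_team=8, months_back=6, seed=None):
    rng = random.Random(seed)  # a fixed seed makes a test set reproducible
    today = date.today()
    people = []
    for team in range(n_teams):
        branch = rng.randrange(n_branches)
        for _ in range(people_per_team):
            joined = today - timedelta(days=rng.randint(0, months_back * 30))
            people.append({
                "branch": branch,
                "team": team,
                "status": rng.choice(STATUSES),
                "joined": joined.isoformat(),
            })
    return people


# "10 teams spread over 4 branches ... going back 6 months":
small_db = generate_dataset(n_teams=10, n_branches=4, months_back=6, seed=42)
# "1000 teams spread over 600 branches ... going back 24 months":
large_db = generate_dataset(n_teams=1000, n_branches=600, months_back=24, seed=42)
```

Because the generator is seedable, a developer can recreate exactly the same data set when chasing a bug, while a fresh seed gives a brand-new one.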
Of course, maintaining the scripts that generate this sample data, fixing bugs and adding support for new features as your application grows, becomes extra effort that you need to factor in, and you may still occasionally need to test against "more real" data, so it isn't all "win", unfortunately.