Question :
I’m looking for (ideally free, open-source) data masking tools. Do any such exist?
Note: this related question deals with tools for generating test data, but in this question I’m more interested in starting with real data, and masking it for use in test without losing any special relationships that make it interesting for test purposes. Generated data is fine for some test purposes, but real-world data will bring up issues you never thought of.
Tool to generate large datasets of test data
Answer :
I would be very surprised if there were a generic tool for this – how would it “know” what is sensitive data and what isn’t? It would need to examine all your data and recognize every possible format of credit card number, phone number, postcode, email address, and whatever other data you consider sensitive. It would also need to be smart about your schema – e.g. should it rewrite all customer email addresses to “nobody@company.com”, or does some part of your database, applications, or other tools assume that a customer’s email address (or SSN, or whatever) is unique? Do you have some part of the application that checksums credit card numbers, which would break if you reset them all to 0000 0000 0000 0000? Does your telephony system assume that a customer’s dialing code corresponds to the country in their address?
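To make the checksum point concrete: if some part of the application validates card numbers, the masking script has to regenerate a valid check digit rather than zero everything out. A minimal sketch in Python, using the standard Luhn algorithm (the function names are my own, not from any tool):

```python
import random

def luhn_checksum_digit(partial: str) -> str:
    """Compute the Luhn check digit for a card number body."""
    digits = [int(d) for d in partial + "0"]  # "0" holds the check position
    total = 0
    # From the right: double every second digit, subtracting 9 if it exceeds 9
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)

def mask_card_number(card: str) -> str:
    """Replace a card number with a random one of the same length
    that still passes a Luhn check."""
    body = "".join(str(random.randint(0, 9)) for _ in range(len(card) - 1))
    return body + luhn_checksum_digit(body)
```

So a masked value like `mask_card_number("4111111111111111")` keeps the length and the checksum property, without being a real account number.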
Basically, configuring any tool to do it will be as much work as – or more than – just writing your own script using your knowledge of the application. At my site, after an initial audit to find all the columns holding such data and write version 1 of the script, we simply made it policy that anyone who adds a column containing such data updates the anonymization script at the same time.
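For illustration, a script of the kind described might look like this in Python against SQLite – the `customers` table and its columns are hypothetical, and the point is that the rewritten emails stay unique per row, so any uniqueness constraints or joins on that column survive the masking:

```python
import sqlite3

# Hypothetical schema for illustration: a customers table with id, name, email.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                 [(1, "Alice Jones", "alice@real-domain.example"),
                  (2, "Bob Smith", "bob@real-domain.example")])

# Rewrite each email to a unique, obviously fake address keyed on the row id,
# rather than one shared placeholder that would violate uniqueness.
conn.execute("UPDATE customers SET email = 'customer-' || id || '@example.com'")

for row in conn.execute("SELECT id, email FROM customers ORDER BY id"):
    print(row)
```

A real script would do this for every sensitive column the audit found, which is exactly the list the policy above keeps up to date.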
If your database is tiny, has a simple data model and is well understood by the current DBAs, scripting “might” be the answer. However, the effort (and cost) of manually analyzing and masking typical databases can get out of hand pretty quickly as requirements change, functionality is added, and developers and DBAs come and go.
While I’m not aware of any open source data masking products, there are commercial offerings available that are reasonably comprehensive, relatively easy to use and may be surprisingly reasonable cost-wise. Many of them include out-of-the-box discovery capability to identify and classify sensitive data (SSN, credit cards, phone numbers) as well as functionality to maintain the checksums, email address formatting, data grouping, etc. so that masked data looks and feels real.
But you don’t have to take my (admittedly biased) word for it. Ask the industry analysts such as Gartner or Forrester who have a number of unbiased reports available on masking that may help.
Hopefully these comments will encourage you to consider exploring commercial products as well as internal script development. At the end of the day, the most important thing is to protect the sensitive data that many of us see day-in and day-out that we really don’t need to see to do our jobs – putting us and the people whose personal data we hold at risk.
Kevin Hillier,
Senior Integration Specialist,
Camouflage Software Inc.
I’ve never seen such a tool, but having worked with a few sensitive data sets in my time, I’d say the main thing that needs to be scrambled is people’s identities or personally identifying information. This should only make an appearance in a few places in the database.
Your masking operation should retain the statistical properties and relationships of the data, and probably needs to retain actual reference codes (or at least some sort of controlled translation mechanism) so you can reconcile it to the actual data.
This sort of thing can be achieved by getting a distinct list of the names in the fields and replacing each with something like FirstNameXXXX (where XXXX is a sequence number, one for each distinct value). Credit card numbers and similar information that could be used for identity theft are quite likely to be a no-no in a development environment, but you only need real ones if you’re testing payment processing systems – typically the vendor will give you special codes for dummy accounts.
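That distinct-list replacement might be sketched like this in Python (the FirstNameXXXX format follows the description above; the helper name and sample data are invented for illustration):

```python
def build_name_map(names):
    """Build a controlled translation table: each distinct name maps to
    FirstNameXXXX, where XXXX is a sequence number per distinct value."""
    mapping = {}
    for name in names:
        if name not in mapping:
            mapping[name] = f"FirstName{len(mapping) + 1:04d}"
    return mapping

rows = ["Alice", "Bob", "Alice", "Carol"]
name_map = build_name_map(rows)
masked = [name_map[n] for n in rows]
print(masked)  # repeated originals map to the same masked value
```

Keeping the mapping (securely) also gives you the controlled translation mechanism mentioned above, so masked records can still be reconciled against production when needed.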
It’s not particularly difficult to write anonymising procedures of this sort, but you will need to agree with the business exactly what needs to be anonymised. If necessary, go through the database field by field. Simple yes/no questions will give you false positives that you don’t want; ask the business rep to explain why, and what the consequences or regulatory implications of not anonymising particular data would be.
I had the same task a few weeks ago. We evaluated some software systems, but most of them only support exactly one type of database (e.g. Oracle), and they are often very complicated to use – not the nicest things to evaluate. It took us weeks.
We decided to buy the Data Masking Suite Professional version, as it was the easiest one to use.
It also has some nice options for masking data – e.g. it can change email addresses to realistic-looking ones, such as …@siemens.com to mike.miller@seimsen.com.
You can try it free for about 500(?) records as far as I remember.
Here is the link http://www.data-masking-tool.com/
My way of doing this:
- Make a new database with only view and select rights for the users
- Make views to tables that should be viewable in other databases
- Mask columns that need masking with: repeat('*', char_length(column_to_be_masked))
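A runnable sketch of that view-based approach, using Python with SQLite (SQLite has no repeat() or char_length(), so substr over a run of stars stands in for them; the table and column names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, ssn TEXT)")
conn.execute("INSERT INTO accounts VALUES (1, '123-45-6789')")

# The view masks the sensitive column but preserves its length, so code that
# expects a value of the right shape still works. Users get SELECT rights on
# the view only, never on the base table.
conn.execute("""
    CREATE VIEW accounts_masked AS
    SELECT id, substr('*****************', 1, length(ssn)) AS ssn
    FROM accounts
""")
print(conn.execute("SELECT * FROM accounts_masked").fetchall())
```

On MySQL the view body would be the repeat('*', char_length(ssn)) expression from the list above.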
I first started down this path several years ago and have since built up a consultancy based around this practice.
I’m assuming that the purpose is to build test data for use in test environments where the personnel accessing the data do not have rights to view the production information.
The first thing to establish is exactly which data elements you need to mask. For that, it is best to start with a data discovery tool such as SchemaSpy (open source); you will need the relevant JDBC driver, but it is a very useful step in the process.
Talend Open Studio is one of the best tools I have used in recent years for some of the ETL functions, and you can also do some basic masking with it by replacing values with a random value, or with a lookup/replace – to maintain consistency – using the map component.
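The consistency point – the same input must always map to the same masked value so joins across tables and loads still line up – can also be achieved without a stored lookup table, by deriving the replacement from a keyed hash. A sketch in Python (the key, prefix, and function name are my own assumptions, not part of any tool):

```python
import hashlib

SECRET = b"rotate-me"  # hypothetical masking key; keep it out of test environments

def consistent_mask(value: str, prefix: str = "CUST") -> str:
    """Deterministically map a value to a masked token; equal inputs always
    give equal outputs, so foreign keys and joins remain consistent."""
    digest = hashlib.sha256(SECRET + value.encode()).hexdigest()[:8]
    return f"{prefix}-{digest}"

token = consistent_mask("alice@example.com")
print(token)  # the same address produces this same token on every run
```

Unlike purely random replacement, this survives re-running the masking job, at the cost that anyone holding the key could recompute the mapping.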
But if you’re looking for a real data masking tool, I have not found a suitable open source one. If you have even a moderate budget for tools, I would suggest Data Masker, but you will need to do some import and export through MS SQL or Oracle, as it only connects via those protocols.
Check out http://www.datakitchen.com.au/2012-08-14-15-04-20/data-masking/data-masker-toolset for info about data masking, data masking methodology, data discovery and test data management. There is also a useful blog at http://www.dataobfuscation.com.au