I have a requirement for a database table to hold sensitive patient information about their lifestyle (alcohol / drug use / injecting, etc.).
On screen the information is sectioned into tabs – “alcohol”, “drugs”, “injecting” and so on. Some of the data is co-dependent (i.e. data in one tab may depend on what is input in another).
I was considering storing all of the data captured in the tabs (circa 25 fields – maybe more, but not much more) in a single table, because of the aforementioned co-dependency. All of the fields in the table would be NULLable (edit: for two reasons: (1) because people might have alcohol problems but not drug problems, and vice versa, and (2) the data can be in DRAFT or PRODUCTION state – in draft, incomplete data is acceptable, but in production all relevant data must be supplied).
To improve data integrity I would employ CHECK constraints to ensure that data was supplied for sets of fields in the table IF certain BIT flags were set (e.g. if we are in PRODUCTION mode and a flag named HasUsedAlcohol is set, this mandates that all alcohol-related fields be populated; HasUsedDrugs would mandate that all drug-related fields be populated).
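To illustrate what I have in mind, a conditional constraint might look something like this (the table, column, and flag names here are placeholders, not my actual schema):

```sql
-- Sketch only: names are illustrative.
-- DRAFT rows always pass; PRODUCTION rows must supply the alcohol
-- fields whenever the HasUsedAlcohol flag is set.
ALTER TABLE dbo.PatientLifestyle
ADD CONSTRAINT CK_Alcohol_Fields_Required CHECK
(
    RowStatus = 'DRAFT'
    OR HasUsedAlcohol = 0
    OR (AlcoholUnitsPerWeek IS NOT NULL AND AlcoholYearsOfUse IS NOT NULL)
);
```

There would be one such constraint per flag (HasUsedDrugs, and so on).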
Additionally this table may be used to produce reports.
My question is – is this a sensible, maintainable solution? And if not, why not?
Database target is SQL Server 2012.
Thanks in advance.
I wouldn’t be too concerned about 25 columns. That’s not an extremely high number – it doesn’t even approach violating Swart’s Ten Percent Rule (102 columns). Of course, the “rule” there is discussing a limit, not a goal, so keep that in mind.
In my mind, ideally, each of these separate “forms” (tabs) would be backed by its own table (AlcoholAbuse, DrugAbuse, Injections, etc), which would then be linked back to the patient by a foreign key constraint.
However, you mentioned that there are data validation concerns that cross forms. Maybe if you answer “yes” to certain questions on the drugs form, then certain questions on the injections form are required.
CHECK constraints are really the best way to enforce what you’re describing, but that would mean they all need to be in the same table.
Your other options, if things were in separate tables, would be to move validation into the application layer, or use triggers to ensure the data in the different form tables is valid on each insert / update.
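For completeness, the trigger option might be sketched like this (table and column names are invented for illustration – this is the approach I’d avoid):

```sql
-- Hypothetical sketch: reject drug-form rows that claim injecting
-- drug use when the related injections row is missing or incomplete.
CREATE TRIGGER trg_DrugAbuse_Validate
ON dbo.DrugAbuse
AFTER INSERT, UPDATE
AS
BEGIN
    IF EXISTS (
        SELECT 1
        FROM inserted AS i
        LEFT JOIN dbo.Injections AS inj ON inj.PatientId = i.PatientId
        WHERE i.InjectsDrugs = 1
          AND (inj.PatientId IS NULL OR inj.InjectionSite IS NULL)
    )
    BEGIN
        RAISERROR('Injection details are required when InjectsDrugs = 1.', 16, 1);
        ROLLBACK TRANSACTION;
    END
END
```

Note that you would need a mirror-image trigger on the other table as well, which is part of why the trigger route gets complex quickly.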
Since the number of columns isn’t outrageous, and you probably don’t want to incur the complexity of dealing with triggers, I think one table with conditional
CHECK constraints is a valid solution.
This solution seems maintainable to me, and fairly easy to reason about, since all of the “rules” are in one place (one table, and the constraints applied to that one table).
Here I may be placing an emphasis on your words which you did not intend, but I think it is a point worth making. You say “I would employ CHECK constraints to ensure that data was supplied for sets of fields.” You cannot. No setting within the DBMS can affect what comes in from an upstream system. Incoming data is what it is: complete or not, correct or not. The DBMS can choose to accept the data or reject it. That is all.
I understand that you would use DRAFT / PRODUCTION to allow for this. However that works both ways. PRODUCTION would prevent incomplete data from entering the system. Conversely incomplete data would prevent the row from being set to PRODUCTION. How desirable would that be?
Is there no value to your users in part-complete field sets? If, say, nine out of ten columns have values but the last is NULL is this row really worthless junk? Could you imagine a scenario where a deadline approaches, data is incomplete, but the senior manager still says “It’s good enough – let’s proceed with what we’ve got?” Is it useful in any circumstance for a clinician to know the patient has alcohol issues even if all the details aren’t present?
You can’t choose what comes in but you can choose what to consume and to pass on. It is possible to put in place a view that returns only rows that meet the strict data quality requirements. Indeed a set of views could be constructed – one for complete alcohol data only, one for complete drug data and so on. It is possible to make only these views accessible to consumers of the data and hide the underlying tables and incomplete rows.
My suggestion is to make the table permissive – accept any data that is provided and allow for missing data by making columns nullable. The number of columns you are considering is not excessive. There is no advantage to partitioning this table vertically.
Do not put constraints on the table. Do not have a state that changes from DRAFT to PRODUCTION. Instead define a number of views. One view will be called PRODUCTION (or whatever your naming convention dictates). In this view implement the predicates for what is “good” data. It will look something like
```sql
create view dbo.PRODUCTION as
select *
from dbo.patients
where (HasUsedAlcohol = 0
       or (HasUsedAlcohol = 1
           and alcohol_column_a is not null
           and alcohol_column_b is not null
           ..
           and alcohol_column_h is not null))
  and (HasUsedDrugs = 0
       or (HasUsedDrugs = 1
           and drug_column_a is not null
           ..
           and drug_column_f is not null))
```
With your proposed solution a row would have to move from draft to production. Perhaps a person would do this as data is typed. More likely a scheduled job would just look at all rows and flip those that meet the criteria, or a trigger would examine each update to the table. The job would have to contain the same logic as the constraint in order to find rows that can be flipped. Duplicate logic leads to maintenance errors. To remove a value the user would first have to reset the status to DRAFT. All this is more complexity with little benefit. Using view(s) will remove this overhead.
In terms of performance, constraints as you propose will have practically no impact on reads or writes. My view may have overhead depending on query patterns. Reads that are single-row lookups by patient key will be efficient (assuming an index on the patient key). Scans that read all rows will be affected by the fraction of rows that meet the “production” criteria. If most rows meet it, there will be a table scan in either design. If few rows meet it, the view has no status column to filter on and so must read every row before rejecting the draft ones. This can be addressed by indexing the view. Range scans would likely benefit from the indexing, too.
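Indexing the view would look roughly like the following. Be aware that an indexed view must be created WITH SCHEMABINDING and must list its columns explicitly (SELECT * is not permitted), so the definition above would need adjusting; the column names here are illustrative:

```sql
CREATE VIEW dbo.PRODUCTION
WITH SCHEMABINDING
AS
SELECT PatientId,
       HasUsedAlcohol,
       AlcoholUnitsPerWeek  -- ...and the remaining columns, listed explicitly
FROM dbo.patients
WHERE HasUsedAlcohol = 0
   OR (HasUsedAlcohol = 1 AND AlcoholUnitsPerWeek IS NOT NULL);
GO

-- A unique clustered index materialises the view's rows on disk.
CREATE UNIQUE CLUSTERED INDEX IX_PRODUCTION
ON dbo.PRODUCTION (PatientId);
```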
One further caveat: if the queries are very complicated having the view at the base may be the last straw for the compiler so you get compilation time-outs and poor plans. This is unlikely, but worth mentioning.
Having slept on it I think I have a better approach. Most of the analysis above still holds – no constraints, all columns in one table, accept everything but choose what to use. However, I would retain the status column. The difference is that I would implement it as a computed column. This can be indexed more simply than the view I proposed previously. If the status column is also PERSISTED, its run-time performance will be identical to your own suggestion. The benefits will be:
- all data can be accepted
- no additional or duplicated programming will be required to flip the status value
- status is set automatically as data is set or unset
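As a sketch, the computed column might be defined like this (column names are placeholders; the CASE expression would carry the same predicates as the view I showed earlier):

```sql
-- Hypothetical: the status is derived from the data itself, so it can
-- never be out of step with the column values.
ALTER TABLE dbo.patients ADD RowStatus AS
    CASE
        WHEN (HasUsedAlcohol = 0
              OR (AlcoholUnitsPerWeek IS NOT NULL
                  AND AlcoholYearsOfUse IS NOT NULL))
         AND (HasUsedDrugs = 0
              OR DrugOfChoice IS NOT NULL)
        THEN 'PRODUCTION'
        ELSE 'DRAFT'
    END PERSISTED;

-- The persisted column can then be indexed like any other.
CREATE INDEX IX_patients_RowStatus ON dbo.patients (RowStatus);
```

Because the CASE expression is deterministic, SQL Server will accept it as a PERSISTED computed column.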