How to optimally structure a table with sparse data

Question:

I am working on a project to create a series of tables that are easier to use and more intuitive for fairly naive SQL users. Because of the structure of our data, we are running into a trade-off between information density and discoverability. By discoverability, I mean things like ease of looking at a column and knowing immediately what it contains or the intuitiveness of columnar autocomplete.

Below is a simplified example of the tables we are working with. The item_table in the actual database has over 30 attribute columns that contain codes, but no item has codes in every attribute column; the most attributes any one item has is 19. The code_table has about ten thousand codes. To add to the complexity, not every field in the code_table is useful to the end-user: in the example below, sometimes only the description is relevant, sometimes only the value.

item_table
---------------------------------------------
|     item | attr1 | attr2 | attr3 |  attr4 |
|-------------------------------------------|
|    house | code3 | code7 |  NULL | code11 |
|-------------------------------------------|
|      car | code2 | code6 | code9 |   NULL |
|-------------------------------------------|
| skeleton | code1 | code5 |  NULL | code12 |
|-------------------------------------------|
|   pencil | code4 | code8 |  NULL |   NULL |
---------------------------------------------
code_table
------------------------------------------------------
|   code |                name | value |        desc |
|----------------------------------------------------|
|  code1 |            material |   5.0 |        bone |
|----------------------------------------------------|
|  code2 |            material |   9.0 |       steel |
|----------------------------------------------------|
|  code3 |            material |   1.0 |       brick |
|----------------------------------------------------|
|  code4 |            material |   3.0 |    graphite |
|----------------------------------------------------|
|  code5 |               color |   1.0 |       white |
|----------------------------------------------------|
|  code6 |               color |   2.0 |        blue |
|----------------------------------------------------|
|  code7 |               color |   3.0 |         red |
|----------------------------------------------------|
|  code8 |               color |   4.0 |      yellow |
|----------------------------------------------------|
|  code9 | engine_displacement |   3.4 |        3.4L |
|----------------------------------------------------|
| code10 |      heating_system |   1.0 |     exhaust |
|----------------------------------------------------|
| code11 |      heating_system |   2.0 | natural gas |
|----------------------------------------------------|
| code12 |    intact_phalanges |  42.0 |   forty-two |
------------------------------------------------------
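As a concrete baseline, here is a minimal sketch of the simplified schema above in SQLite (an assumption for illustration; `"desc"` is quoted because DESC is a SQL keyword). It shows the pain point of the current layout: reading one item's attributes takes a separate join to code_table for every attr column.

```python
import sqlite3

# Sketch of the simplified tables above, in an in-memory SQLite database.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE code_table (code TEXT PRIMARY KEY, name TEXT, value REAL, "desc" TEXT);
CREATE TABLE item_table (item TEXT PRIMARY KEY,
                         attr1 TEXT, attr2 TEXT, attr3 TEXT, attr4 TEXT);
INSERT INTO code_table VALUES
  ('code1','material',5.0,'bone'),('code2','material',9.0,'steel'),
  ('code3','material',1.0,'brick'),('code4','material',3.0,'graphite'),
  ('code5','color',1.0,'white'),('code6','color',2.0,'blue'),
  ('code7','color',3.0,'red'),('code8','color',4.0,'yellow'),
  ('code9','engine_displacement',3.4,'3.4L'),
  ('code10','heating_system',1.0,'exhaust'),
  ('code11','heating_system',2.0,'natural gas'),
  ('code12','intact_phalanges',42.0,'forty-two');
INSERT INTO item_table VALUES
  ('house','code3','code7',NULL,'code11'),
  ('car','code2','code6','code9',NULL),
  ('skeleton','code1','code5',NULL,'code12'),
  ('pencil','code4','code8',NULL,NULL);
""")

# Each attr column needs its own join to code_table to become readable.
row = con.execute("""
    SELECT c1."desc", c2."desc", c4."desc"
    FROM item_table i
    LEFT JOIN code_table c1 ON c1.code = i.attr1
    LEFT JOIN code_table c2 ON c2.code = i.attr2
    LEFT JOIN code_table c4 ON c4.code = i.attr4
    WHERE i.item = 'house'
""").fetchone()
print(row)  # ('brick', 'red', 'natural gas')
```

With 30+ attr columns in the real table, this join chain is exactly what the naive users would have to write by hand today.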

So far, I have come up with two approaches.

The first approach would be to join each attr column with the code_table, shift every non-NULL value to the left, and then trim off any columns that always contain NULLs. In this example, no item has four attributes, so three sets of columns are enough to capture every item's details.

Approach the first
----------------------------------------------------------------------------------------------------------------------------------------------
|     item | attr_name1 | attr_value1 | attr_desc1 | attr_name2 | attr_value2 | attr_desc2 |          attr_name3 | attr_value3 |  attr_desc3 |
|--------------------------------------------------------------------------------------------------------------------------------------------|
|    house |   material |           1 |      brick |      color |           3 |        red |      heating_system |           2 | natural gas |
|--------------------------------------------------------------------------------------------------------------------------------------------|
|      car |   material |           9 |      steel |      color |           2 |       blue | engine_displacement |         3.4 |        3.4L |
|--------------------------------------------------------------------------------------------------------------------------------------------|
| skeleton |   material |           5 |       bone |      color |           1 |      white |    intact_phalanges |          42 |   forty-two |
|--------------------------------------------------------------------------------------------------------------------------------------------|
|   pencil |   material |           3 |   graphite |      color |           4 |     yellow |                NULL |        NULL |        NULL |
----------------------------------------------------------------------------------------------------------------------------------------------

The second option is to do the heavy lifting up front: figure out which fields are relevant to the end-user and structure the table with a column for each unique attribute name. This example works out very simply, but applying the solution to the real-world data increases our NULLs considerably; the more complexity there is in the item_table and code_table, the sparser this table becomes. Having the attributes as column names improves discoverability by letting users rely on autocomplete.

Approach the second
------------------------------------------------------------------------------------------
|     item | material |  color | engine_displacement | heating_system | intact_phalanges |
|----------------------------------------------------------------------------------------|
|    house |    brick |    red |                NULL |    natural gas |             NULL |
|----------------------------------------------------------------------------------------|
|      car |    steel |   blue |                3.4L |           NULL |             NULL |
|----------------------------------------------------------------------------------------|
| skeleton |     bone |  white |                NULL |           NULL |        forty-two |
|----------------------------------------------------------------------------------------|
|   pencil | graphite | yellow |                NULL |           NULL |             NULL |
------------------------------------------------------------------------------------------
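Approach 2 is a classic pivot: conditional aggregation over the attribute name. A sketch in SQLite, with the same caveat that the real table would need one CASE branch per distinct name (in practice that column list would be generated from `SELECT DISTINCT name FROM code_table`, not typed by hand):

```python
import sqlite3

# Same simplified schema as in the question, set up in memory.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE code_table (code TEXT PRIMARY KEY, name TEXT, value REAL, "desc" TEXT);
CREATE TABLE item_table (item TEXT PRIMARY KEY,
                         attr1 TEXT, attr2 TEXT, attr3 TEXT, attr4 TEXT);
INSERT INTO code_table VALUES
  ('code1','material',5.0,'bone'),('code2','material',9.0,'steel'),
  ('code3','material',1.0,'brick'),('code4','material',3.0,'graphite'),
  ('code5','color',1.0,'white'),('code6','color',2.0,'blue'),
  ('code7','color',3.0,'red'),('code8','color',4.0,'yellow'),
  ('code9','engine_displacement',3.4,'3.4L'),
  ('code11','heating_system',2.0,'natural gas'),
  ('code12','intact_phalanges',42.0,'forty-two');
INSERT INTO item_table VALUES
  ('house','code3','code7',NULL,'code11'),
  ('car','code2','code6','code9',NULL),
  ('skeleton','code1','code5',NULL,'code12'),
  ('pencil','code4','code8',NULL,NULL);
""")

# Pivot: one output column per attribute name. The IN-join matches whichever
# attr slot holds the code; NULL slots simply never match.
rows = con.execute("""
    SELECT i.item,
           MAX(CASE WHEN c.name = 'material' THEN c."desc" END) AS material,
           MAX(CASE WHEN c.name = 'color'    THEN c."desc" END) AS color,
           MAX(CASE WHEN c.name = 'engine_displacement'
                    THEN c."desc" END)       AS engine_displacement
    FROM item_table i
    LEFT JOIN code_table c ON c.code IN (i.attr1, i.attr2, i.attr3, i.attr4)
    GROUP BY i.item
""").fetchall()
for r in rows:
    print(r)
```

Every item that lacks an attribute gets a NULL in that column, which is exactly where the sparsity in the summary table below comes from.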

I have been able to create both tables to see what each solution looks like in practice. Some high-level analysis of the two resulting tables is below:

Summary
-------------------------------------------------------------------------
|           |    rows | columns |  elements |     nulls | perc_of_nulls |
|-----------------------------------------------------------------------|
| approach1 |  578351 |      23 |  13302073 |   8412026 |        63.23% |
|-----------------------------------------------------------------------|
| approach2 |  578351 |      46 |  26604146 |  21714099 |        81.61% |
-------------------------------------------------------------------------

What are the strengths and weaknesses of each approach that I may not have considered? When working with the resulting tables, are vast numbers of NULLs problematic? Does shifting elements to the left significantly help the discoverability of the data?

Answer:

Both approaches seem wrong. You shouldn’t have the name of an attribute stored in the table. Rather, that should be the name of the column.

Different kinds of items should be stored in different tables if they have different attributes and properties. If they are quite similar, it might be better to use a single table – a couple of NULL values won't hurt.

There is a third, hybrid approach: have one table that stores attributes common to all items, and have additional tables with the specific extra attributes, one table per item type.
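The hybrid approach could be sketched like this (again in SQLite; the table and column names `item`, `car_detail`, `house_detail` are hypothetical, chosen from the example data): attributes every item has live in one table, and each item type gets a narrow extension table with no NULL padding at all.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Attributes common to all items.
CREATE TABLE item (item TEXT PRIMARY KEY, material TEXT, color TEXT);

-- One extension table per item type, holding only that type's extras.
CREATE TABLE car_detail (
    item TEXT PRIMARY KEY REFERENCES item(item),
    engine_displacement REAL
);
CREATE TABLE house_detail (
    item TEXT PRIMARY KEY REFERENCES item(item),
    heating_system TEXT
);

INSERT INTO item VALUES ('house','brick','red'), ('car','steel','blue');
INSERT INTO car_detail VALUES ('car', 3.4);
INSERT INTO house_detail VALUES ('house', 'natural gas');
""")

# A type-specific query joins the base table to exactly one extension table.
row = con.execute("""
    SELECT i.material, i.color, d.engine_displacement
    FROM item i JOIN car_detail d USING (item)
""").fetchone()
print(row)  # ('steel', 'blue', 3.4)
```

The attribute names become real column names, so autocomplete works, and the sparsity is gone because a row only exists in an extension table when the item actually has those attributes.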

The readability of the data model for inexperienced users is not an important concern. It should be readable to people who understand relational databases.

One guideline is: If you are tempted to design a model with lots of tables or with lots of columns per table, you are probably doing something wrong.
