Question :
We have items of three types to be stored and queried according to some business logic.
create table a_table
(
item_a varchar2(30),
item_b varchar2(16),
item_c varchar2(2),
-- other columns
);
And the indexes
create unique index idx_1 on a_table (item_a);
create unique index idx_2 on a_table (item_b);
create unique index idx_3 on a_table (item_c);
Data for a specific item of a specific type would be read simply as:
-- reading item_a
select ... from a_table where item_a = '...';
-- reading item_b
select ... from a_table where item_b = '...';
-- ...
If an item is of type a, it can’t be of type b, so each row must have only one of item_a
, item_b
or item_c
with a value, the others must be null. (This can be enforced in any ways, it does not matter if it impacts insert
ions.)
In theory the number of type of items could grow (maybe a fourth item type could be added somewhere in the future).
This fact pushed towards another solution which also avoid the ugly mutually exclusive (unchecked in the given example) columns:
create table a_table
(
item varchar2(30) not null,
item_type varchar2(10) not null,
-- other columns
);
create unique index idx_1 on a_table (item_type, item);
-- reading item_X
select ... from a_table where item='...' and item_type='item_X';
Disregarding the obvious disadvantage of the first solution when the number of item types increases and other possible design flaws, is there any important performance difference between accessing (select of a row using indexed columns) a table as done in the first case and accessing a table as done in the second one?
The cardinality of item_a
and item_b
can be millions, the cardinality of item_c
is surely under 200.
Notes
- Assume that there will be an equal number of selects for
item_a
,item_b
anditem_c
- In the first solution, consider a composite index
(item_c, item_a, item_b)
iff it would be better (performance-wise) than three separate indexes, but do not forget that the comparison must be done with the second solution where both indexed columns will be used in thewhere
condition: assume the use of the best “correct†index/indexes for both cases.
Answer :
I think, from performance perspective, it will be very close.
But some thoughts:
-
One column indexes do not store NULL values. So individual indexes will be smaller. Composite index (item_type, item ) size will be bigger than sum of individual indexes. In extreme cases even the height of indexes may differ.
-
Two of three individual indexes will be unique, so it takes one less consistent read compared to nonunique index.
-
For item_c column oracle can build precise histogram.
-
In the case of (item_type, item ), oracle will correctly calculate cardinalities for queries with item types A and B, but for “item_type=’C’ and item=:X” it can underestimate. This is may be not a problem, but in complex queries with joins can give ineffective plan. So you need to check.
-
About to compare index (item_a,item_b,item_c) with (item_type, item ) – like others said it will work almost only with queries like “item_a=:X”, and will not work when “item_b=:Y”. So using index (item_a,item_b,item_c) for first case is just a bad idea when we have equal number of queries on item_b and item_c columns.
create table hr.tt nologging
as
select case when mod(rownum,3) = 0 then rownum end AS item_a,
case when mod(rownum,3) = 1 then rownum end AS item_b,
case when mod(rownum,3) = 2 then round(dbms_random.value(1,200)) end AS item_c,
o.*
from all_objects o,
(select * from dual connect by level < 100) d
where rownum <= 5e6;
create unique index hr.item_a_idx on hr.tt(item_a) nologging;
create unique index hr.item_b_idx on hr.tt(item_b) nologging;
create index hr.item_c_idx on hr.tt(item_c) nologging;
create table hr.tt2 nologging
as
select case when mod(rownum,3) in (0,1) then rownum
else round(dbms_random.value(1,200))
end AS item,
case when mod(rownum,3) = 0 then 'A'
when mod(rownum,3) = 1 then 'B'
when mod(rownum,3) = 2 then 'C' end AS item_type,
o.*
from all_objects o,
(select * from dual connect by level < 100) d
where rownum <= 5e6;
create index hr.item_item_type_idx on hr.tt2(item_type, item) nologging;
some queries
SYS@mydb> select * from hr.tt where item_a = 300;
------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 197 | 3 (0)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID| TT | 1 | 197 | 3 (0)| 00:00:01 |
|* 2 | INDEX UNIQUE SCAN | ITEM_A_IDX | 1 | | 2 (0)| 00:00:01 |
------------------------------------------------------------------------------------------
Statistics
----------------------------------------------------------
...
4 consistent gets
SYS@mydb> select * from hr.tt2 where item_type = 'A' and item = 300;
1 row selected.
--------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 115 | 4 (0)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID| TT2 | 1 | 115 | 4 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | ITEM_ITEM_TYPE_IDX | 1 | | 3 (0)| 00:00:01 |
--------------------------------------------------------------------------------------------------
Statistics
----------------------------------------------------------
5 consistent gets
Below is an example where optimizer calculations are wrong (oracle 11.2).
Optimizer thinks there will be only one row in the “INDEX RANGE SCAN”.
Actually we see 8438 rows. Extended optimizer statistics on these two columns may be helpful, may be not.
SYS@mydb> select * from hr.tt2 where item_type = 'C' and item = 150;
8438 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 4063424880
--------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 115 | 4 (0)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID| TT2 | 1 | 115 | 4 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | ITEM_ITEM_TYPE_IDX | 1 | | 3 (0)| 00:00:01 |
--------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("ITEM_TYPE"='C' AND "ITEM"=150)
Statistics
----------------------------------------------------------
8 recursive calls
0 db block gets
8618 consistent gets
. . .
8438 rows processed
And below is the same case for individual item_c column.
Here optimizer calculations are correct – 8133 rows estimated, 8290 rows selected.
SYS@mydb> select * from hr.tt where item_c = 150;
8290 rows selected.
Execution Plan
----------------------------------------------------------
------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 8133 | 826K| 7737 (1)| 00:01:33 |
| 1 | TABLE ACCESS BY INDEX ROWID| TT | 8133 | 826K| 7737 (1)| 00:01:33 |
|* 2 | INDEX RANGE SCAN | ITEM_C_IDX | 8133 | | 19 (0)| 00:00:01 |
------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("ITEM_C"=150)
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
8499 consistent gets
. . .
8290 rows processed
I don’t speak Oracle, but I can’t see any advantages, logical or performance-wise – with your first scenario. As @brian-leach suggests in his comment, you should enforce the mutual-exclusivity, but that will slow down inserts.
Using this scenario, not only would I be worried about adding item_d to the collection, what happens when item_x gets an additional attribute? Model two is easily extendable in a decent way, Model one will end up in a hell hole of nullable attributes
You are obviously convinced that the second option is better (I strongly agree), to convince others you should set up a test where you compare them. Generating 1000000 item_a, 1000000 item_b and 1000 item_c in either model should be an easy task.
As ever, it all depends on how you’re going to access the data.
If you only ever query the table by item_1, then indexing the other two would be a waste of time.
If you most commonly query the table by item_1 and sometimes by item_2 as well then a composite index on item_1 and item_2 might be a good idea. Your database can use the composite index for item_1-only queries as well, but you do have to “start” that index with the field that you most often query by.
Best way to find out? Test it.
Spin up a test database, load some representative data, create some indexes and get execution plans for your typical queries.
Your item/item_type table is using a concept called “Entity/Attribute Value” and that, generally, scales very badly. Proper indexing should give you better performance.
The composite index would be better than three separate indexes because it would be able to use it for all three columns but you will want to pay attention to the order of the columns because it DOES matter. Could you not test and compare in dev/test environment?
See quote below.
RedGate:
“For composite indexes, take into consideration the order of the columns in the index definition. Columns that will be used in comparison expressions in the WHERE clause (such as WHERE FirstName = ‘Charlie’) should be listed first. Subsequent columns should be listed based on the uniqueness of their values, with the most unique listed first.”