Question :
Consider 5 tables with more than 1 billion records each.
A query needs to join them, and I know that less than 10% of their records will be required. Say they all have a date dimension, and only data from the current month is needed.
Which should be faster:
1) Use a simple SELECT: join all the tables, then filter (WHERE) each table's date dimension to the current month.
2) Create 5 temp tables, filtering each source table down to current-month records (here we can also take the opportunity to select only the required columns), then join these temp tables.
Extra possibility:
3) Maintain secondary tables holding only the current month's/year's worth of data, maintained by the same ETL that feeds the main tables.
Answer :
Build up the query and see.
The query optimizer is often smart enough to filter early.
SQL is logical: a filter being in the WHERE clause does not mean it will be processed last.
Clearly you want indexes on the join and filter columns.
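For example, something along these lines (a sketch only, using the hypothetical table/column names from the example further down):

-- Filter column first, then the join key, so the optimizer can seek on the
-- date range and resolve the join from the same index.
create nonclustered index IX_tableA_date1_fkB on tableA (date1, fkB);

-- Supports the 'do me early' filter and the join back on id.
create nonclustered index IX_tableC_filter1_id on tableC (filter1, id);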
When you get to 5 or more joins, the optimizer will often get defensive and fall back to a (nested) loop join. At that point, pulling the filter conditions up into the join can help the optimizer. Do I have a citation? No, it is an observation.
select *
from tableA
join tableB
    on tableA.fkB = tableB.id
    and tableA.date1 >= @date1start
    and tableA.date1 < @date1end
join tableC
    on tableA.fkC = tableC.id
    and tableC.filter1 = 'do me early'
Build it up one join at a time and see when it goes stupid.
Optimize one join at a time.
If you are going to materialize, then put a primary key on the #temp table.
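A minimal sketch of that, reusing the names from the example above (and assuming tableA has an id key):

-- Pull only current-month rows and only the columns you need,
-- then give the #temp something the later joins can seek on.
select id, fkB, fkC, date1
into #A
from tableA
where date1 >= @date1start
  and date1 < @date1end;

alter table #A add constraint PK_A primary key clustered (id);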
Start where you think you are going to get the most bang.
OR conditions in the WHERE / JOIN clauses are the most trouble and often lead to a loop join.
Those should be the first to materialize.
You can force a hash join, but that is a slippery slope that can go bad.
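For reference, the hint looks like this (a sketch, same hypothetical tables as above):

-- Join hint: forces a hash join for this one join.
select tableA.fkB, tableB.id
from tableA
inner hash join tableB
    on tableA.fkB = tableB.id
where tableA.date1 >= @date1start
  and tableA.date1 < @date1end;

-- Or force it for the whole query by appending:  option (hash join)

Note that a join hint in the FROM clause also forces the join order as written, which is part of why it is a slippery slope.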
It seems kind of odd that all the tables have a date dimension; you would typically have some lookup-type tables with more static data.
If you don’t need output from a table, then a WHERE EXISTS may perform better.
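Something like this (sketch, same names as the earlier example):

-- tableC is only used to filter here, not for output, so EXISTS lets the
-- optimizer stop at the first matching row instead of joining every row.
select tableA.fkB, tableA.date1
from tableA
where tableA.date1 >= @date1start
  and tableA.date1 < @date1end
  and exists (select 1
              from tableC
              where tableC.id = tableA.fkC
                and tableC.filter1 = 'do me early');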