Filter first then join or join first then filter?

Posted on

Question :

Consider 5 tables with more than 1 billion records each.

A query needs to join them. And I know that less than 10% of their records will be require. Say they all have a Date dimension, and only data from current month is needed.

What should be faster:

1) Use a simple SELECT: join all tables, then filter (WHERE) each table’s dimension for current month.

2) create 5 temp tables, filtering each source table for current month records, here we can also take the opportunity to select only required columns, then join these temp tables.

Extra possibility:

3) Maintain secondary tables, having only current month/year worth of data. These tables are maintained by same ETL that feeds main ones.

Answer :

Build up the query and see.
The query optimizer is often smart enough to filter early.
SQL is logical – in the where does not mean it will process last.

Clearly you want indexes on join and filter.

When you get to 5 or more joins the optimizer will often get defensive and go into a loop join.
Do I have a citation – no. It is an observation.

When it gets to 5 or more pulling the conditions into the join can help the optimizer.
Do I have a citation – no. It is an observation.

select * 
from tableA 
join tableB  
      on tableA.fkB = tableB.id 
     and tableA.date1 >= @date1start   
     and tableA.date1 <  @date1end 
join tableC  
      on tableA.fkC = tableC.id 
     and tableC.filter1 = 'do me early'  

Build it up one join at a time and see when it goes stupid.
Optimize one join at at time.

If you are going to materialize then put a pk on the #temp.
Start with where you think you are going to get the most bang.

OR condition in where / join are the most trouble and often lead to a loop join.
Those should be the first to materialize.

You can force hash join but that is a slippery slope that can go bad.

It seems kind of odd all tables have a date dimension, you would typically have some lookup type tables with more static type data.

If you don’t need output from the table then a where exists may preform better.

Leave a Reply

Your email address will not be published. Required fields are marked *