Is there an execution difference between a JOIN condition and a WHERE condition?

Posted on

Question :

Is there a performance difference between these two example queries?

Query 1:

select count(*)
from   table1 a
join   table2 b
on     b.key_col=a.key_col
where  b.tag = 'Y'

Query 2;

select count(*)
from   table1 a
join   table2 b
on     b.key_col=a.key_col
   and b.tag = 'Y'

Notice the only difference is the placement of the supplemental condition; the first uses a WHERE clause and the second adds the condition to the ON clause.

When I run these queries on my Teradata system, the explain plans are identical and the JOIN step shows the additional condition in each case. However, on this SO question regarding MySQL, one of the answers suggested that the second style is preferred because WHERE processing occurs after the joins are made.

Is there a general rule to follow when coding queries like this? I’m guessing it must be platform dependent since it obviously makes no difference on my database, but perhaps that is just a feature of Teradata. And if it is platform dependent, I’d like very much to get a few documentation references; I really don’t know what to look for.

Answer :

According to Chapter 9 (Parser and Optimizer), Page 172 of the Book Understanding MySQL Internals by Sasha Pachev

Understanding MySQL Internals

here is the breakdown the evaluation of a query as the following tasks:

  • Determine which keys can be used to retrieve the records from tables, and choose the best one for each table.
  • For each table, decide whether a table scan is better that reading on a key. If there are a lot of records that match the key value, the advantages of the key are reduced and the table scan becomes faster.
  • Determine the order in which tables should be joined when more than one table is present in the query.
  • Rewrite the WHERE clauses to eliminate dead code, reducing the unnecessary computations and changing the constraints wherever possible to the open the way for using keys.
  • Eliminate unused tables from the join.
  • Determine whether keys can be used for ORDER BY and GROUP BY.
  • Attempt to simplify subqueries, as well as determine to what extent their results can be cached.
  • Merge views (expand the view reference as a macro)

On that same page, it says the following:

In MySQL optimizer terminology, every query is a set of joins. The term join is used here more broadly than in SQL commands. A query on only one table is a degenerate join. While we normally do not think of reading records from one table as a join, the same structures and algorithms used with conventional joins work perfectly to resolve the query with only one table.

EPILOGUE

Because of the keys present, the amount of data, and the expression of the query, MySQL Joins may sometimes do things for our own good (or to get back at us) and come up with results we did not expect and cannot quickly explain.

I wrote about this quirkiness before

because the MySQL Query Optimizer could make dismiss certain keys during the query’s evaluation.

@Phil’s comment help me see how to post this answer (+1 for @Phil’s comment)

@ypercube’s comment (+1 for this one too) is a compact version of my post because MySQL’s Query Optimizer is primitive. Unfortunately, it has to be since it deals with outside storage engines.

CONCLUSION

As for your actual question, the MySQL Query Optimizer would determine the performance metrics of each query when it is done

  • counting rows
  • selecting keys
  • massaging intermittent results sets
  • Oh yeah, doing the actual JOIN

You would probably have to coerce the order of execution by rewriting (refactoring) the query

Here is the first Query you gave

select count(*)
from   table1 a
join   table2 b
on     b.key_col=a.key_col
where  b.tag = 'Y';

Try rewriting it to evaluate the WHERE first

select count(*)
from   table1 a
join   (select key_col from table2 where tag='Y') b
on     b.key_col=a.key_col;

That would definitely alter the EXPLAIN plan. It could produce better or worse results.

I once answered a question in StackOverflow where I applied this technique. The EXPLAIN was horrendous but the performance was dynamite. It only worked because of having the correct indexes present and the use of LIMIT in a subquery.

As with stock prices, when it comes to Queries and trying to express them, restrictions apply, results may vary, and past performance is not indicative of future results.

For Oracle, since mySQL had a lengthy description, we’ve got 2 high level ways of leveraging the optimizer.

First is Rule Based Optimization (or RBO). Oracle has 15 set-in-stone rules that each query it parses attempts to follow in a set order. If it cannot generate an optimized query from rule 1, it will move forward to rule 2 and onward until it hits rule 15.

for more info:
https://docs.oracle.com/cd/B10500_01/server.920/a96533/rbo.htm

These affect Oracle RDBMS kernels from 11.1 and below that have not been converted to Cost Based Optimizer (aka CBO). Oracle 11.2 and up require the CBO optimizer, but can force specific Sql IDs to optimize in the old RBO method if the user wants to.

The CBO for Oracle 11.1+ instead makes several execution plans for the same SQL ID and executes the one with the least overall anticipated cost. It leverages a lot of the logic from RBO, but analyzes table statistics to create dynamic execution plan costs for each operation the DB has to do to provide the end-user their data. Executing full table scans on very large tables is really costly; executing full table scans on a table with 10 rows is cheap. In RBO these were considered equal operations.

for more info:
https://oracle-base.com/articles/misc/cost-based-optimizer-and-database-statistics

For your specific query example:
Oracle would likely parse the information to make different execution plans and thus one will be technically better than the other. However, this can be a minimal difference. Eyeballing it, both Oracle RBO and CBO would like query 1 more because it is executing on a join on less conditions and then filtering out a specific column from the temporary table it made from the join.

If you have two queries and you think that they are equivalent then the following can happen:

  1. Both queries have the same execution plan. That is fine and that is what we expect. Let’s hope that it is the optimal execution plan for the query.
  2. there are different execution plans. We have two subcases here.

    2.1 The queries have different execution plans but both plans perform equally good. That is fine, too. There is no need that for equivalent queries the same plan must be generated. But the performance should be equal. And again we hope that it is the best possible.

    2.2 The queries have different execution plans and one plan is better than the other. Again we have subcases:

    2.2.1 The plans are different because the queries are not equivalent. So carefully check if they are really equivalent. In your case they really are equivalent.

    2.2.2 The plans are different but the queries are equivalent. This means the optimizer is not matured enough. In a perfect world with perfect optimizers this should not happen. So yes, it is platform dependent and you have to study platform specific documents to find out why this happens.

    2.2.3 The plans are different,the queries are equivalent, the database software has a bug.

Leave a Reply

Your email address will not be published. Required fields are marked *