Prune unused joins

Posted on

Question :

I have expected that selecting from a view would not join the tables from which I don’t query any values, but it doesn’t seem to be the case.

I have a bunch of tables with the following structure:

CREATE TABLE ind1 (year integer, id integer, ind1 float);
ALTER TABLE ind1 ADD PRIMARY KEY (year, id);
INSERT INTO ind1 VALUES (2000, 1, 0.0);
INSERT INTO ind1 VALUES (2000, 2, 0.3);
INSERT INTO ind1 VALUES (2000, 3, 1.1);
INSERT INTO ind1 VALUES (2001, 1, 0.0);
INSERT INTO ind1 VALUES (2001, 2, 0.3);
INSERT INTO ind1 VALUES (2001, 3, 1.1);
INSERT INTO ind1 VALUES (2002, 1, 0.0);
INSERT INTO ind1 VALUES (2002, 2, 0.3);
INSERT INTO ind1 VALUES (2002, 3, 1.1);
VACUUM ANALYZE ind1;

There is also ind2, ind3, … . The set of possible year and id values is identical in all tables, and also given in an extra table:

CREATE TABLE id (id integer PRIMARY KEY);
INSERT INTO id VALUES (1);
INSERT INTO id VALUES (2);
INSERT INTO id VALUES (3);
VACUUM ANALYZE id;

Now I create a view to show all entries for a given year:

CREATE VIEW ind_2000 AS SELECT id, ind1, ind2, ind3
  FROM (SELECT id, 2000 AS year FROM id) T
  LEFT JOIN ind1 USING (year, id)
  LEFT JOIN ind2 USING (year, id)
  LEFT JOIN ind3 USING (year, id);

However, when selecting only the columns id, ind1 and ind2, still all tables are joined, although this is not necessary to perform the query correctly:

EXPLAIN ANALYZE SELECT id, ind1, ind2 FROM ind_2000;

                                                      QUERY PLAN                                                      
----------------------------------------------------------------------------------------------------------------------
 Hash Left Join  (cost=3.45..4.63 rows=3 width=20) (actual time=0.150..0.157 rows=3 loops=1)
   Hash Cond: (id.id = ind1.id)
   ->  Hash Left Join  (cost=2.30..3.43 rows=3 width=12) (actual time=0.118..0.124 rows=3 loops=1)
         Hash Cond: (id.id = ind2.id)
         ->  Hash Left Join  (cost=1.15..2.23 rows=3 width=4) (actual time=0.087..0.089 rows=3 loops=1)
               Hash Cond: (id.id = ind3.id)
               ->  Seq Scan on id  (cost=0.00..1.03 rows=3 width=4) (actual time=0.007..0.007 rows=3 loops=1)
               ->  Hash  (cost=1.11..1.11 rows=3 width=4) (actual time=0.028..0.028 rows=3 loops=1)
                     ->  Seq Scan on ind3  (cost=0.00..1.11 rows=3 width=4) (actual time=0.019..0.024 rows=3 loops=1)
                           Filter: (2000 = year)
         ->  Hash  (cost=1.11..1.11 rows=3 width=12) (actual time=0.012..0.012 rows=3 loops=1)
               ->  Seq Scan on ind2  (cost=0.00..1.11 rows=3 width=12) (actual time=0.007..0.011 rows=3 loops=1)
                     Filter: (2000 = year)
   ->  Hash  (cost=1.11..1.11 rows=3 width=12) (actual time=0.011..0.011 rows=3 loops=1)
         ->  Seq Scan on ind1  (cost=0.00..1.11 rows=3 width=12) (actual time=0.007..0.009 rows=3 loops=1)
               Filter: (2000 = year)
 Total runtime: 0.304 ms
(17 rows)

The entire code is in this gist.

How do I tell the optimizer to avoid the useless joins?

(Using PostgreSQL 8.4)

EDIT: The error persists even when dropping the time dimension, see option1.sql in the gist.

Answer :

Outer join removal is implemented in PostgreSQL 9.0. I suggest you try your example with that version (or an even more recent one).

You’re only considering that there could be zero or one joined rows in ind3 — if there were seven then it would affect the number of rows returned from the query, so it’s not correct to say that the join is irrelevant.

Even if that were not the case, it’s a matter of whether a potential optimising operation is implemented in the query optimiser. Recent Oracle versions will avoid an unnecessary join in querying a view or inline view if constraints are in place to enforce the logic that omitting the join will make no difference to the query but I’m not sure that they would deal with this sort of situation.

You would almost certainly still see the potential join in the execution plan, so that is not really the way to judge whether any work has been expended in the join.

below function replaces all select fields with count(*) and 2nd part removes unnecessary joins. This function works only with tables that has aliases and should be tested for very complex queries and wont work if there is inner queries in join condition.

function sql_query_count($sql) {
        //replace select fields with count(*)
        $a = true;
        $b = 0;
        $first_select = stripos($sql, 'select ');
        $last_from = 0;
        $i = 0;
        while($a){
            $i++;
            $b = stripos($sql, ' from ',$last_from);
            $c = strripos(substr($sql, $last_from, $b), 'select ');
            if ($c == $first_select || $c === false || $i>100) $a = false;
            $last_from = $b+6;
        }        
        if (stripos($sql, 'order by') !== false)
            $sql = substr($sql, 0, stripos($sql, 'order by'));
        $sql1 = 'select count(*) as c ' . substr($sql, $b);

        //remove unnecessary joins
        $joins = preg_split("/ join /i", $sql1);
        $join_count = count($joins);
        $join_type = '';
        if (count($joins)>1){
            for ($index = 0; $index < $join_count+2; $index++) {
                $sql_new = '';
                $where = '';
                $i = 0;
                foreach ($joins as $key => $value) { $i++;
                    $parts = preg_split("/ where /i", trim($value));
                    $value = $parts[0];
                    unset($parts[0]);
                    $where = implode(' where ', $parts);
                    $occurence_count = 0;
                    if ($i > 1) {
                        $a = explode(' on ', $value);
                        $c = preg_replace('!s+!', ' ', trim($a[0]));
                        $c = explode(' ', $c);
                        $occurence_count = substr_count($sql1, ' '.$c[1].'.')+substr_count($sql1, '='.$c[1].'.');
                    }
                    $t = explode(' ', $value);
                    $j = '';
                    if (trim(strtolower($t[count($t) - 1])) == 'inner'){
                        $j = 'inner';
                        unset($t[count($t) - 1]);
                    } else if (trim(strtolower($t[count($t) - 2])).' '.trim(strtolower($t[count($t) - 1])) == 'left outer'){
                        $j = 'left outer';
                        unset($t[count($t) - 1]);
                        unset($t[count($t) - 1]);
                    }
                    if ($occurence_count == 0 || $occurence_count > 1) $sql_new.= ' '.$join_type.(($join_type!='')?' join ':'').implode(' ', $t);                    
                    $join_type = $j;
                }
                $sql_new .= ' where '.$where;
                $sql1 = $sql_new;
                $joins = preg_split("/ join /i", $sql1);
            }
        }
        return $sql1;
    }

Leave a Reply

Your email address will not be published. Required fields are marked *