WL#13377: Add support for hash outer, anti and semi join

Affects: Server-8.0 — Status: Complete

WL#2241 added support for hash inner join as a replacement for block nested-
loop. This worklog aims to implement the remaining types of joins supported in 
mysql; outer-, anti- and semijoin. As with WL#2241, this worklog will simply 
replace block nested-loop with hash join. The optimizer will still generate 
execution plans thinking it will execute a block nested-loop.

In WL#2241, we chose not to replace block nested-loop with hash join if the only 
join condition was a non equi-join. We will lift this restriction in this 
worklog. This means that all queries using BNL will execute using hash join 
given that the query does not contain anything that the iterator executor does 
not yet support (i.e. BKA outer join).

Example plans

Inner non equi-join

EXPLAIN FORMAT=tree SELECT * FROM t1 JOIN t2 ON t1.col1 < t2.col1;
-> Filter: (t1.col1 < t2.col1)  (cost=0.70 rows=1)
    -> Inner hash join  (cost=0.70 rows=1)
        -> Table scan on t2  (cost=0.35 rows=1)
        -> Hash
            -> Table scan on t1  (cost=0.35 rows=1)

Semijoin

EXPLAIN FORMAT=tree SELECT * FROM t1 WHERE (t1.i) IN (SELECT t2.i FROM t2);
-> Hash semijoin (t2.i = t1.i)  (cost=0.83 rows=2)
    -> Table scan on t1  (cost=0.45 rows=2)
    -> Hash
        -> Table scan on t2  (cost=0.18 rows=2)

Antijoin

EXPLAIN FORMAT=tree
  SELECT * FROM t2 WHERE NOT EXISTS (SELECT * FROM t1 WHERE t1.col1 = t2.col1);
-> Hash anti-join (t1.col1 = t2.col1)  (cost=1.15 rows=6)
    -> Table scan on t2  (cost=0.55 rows=3)
    -> Hash
        -> Table scan on t1  (cost=0.30 rows=2)

Left outer join

EXPLAIN FORMAT=tree SELECT * FROM t1 LEFT JOIN t2 ON t1.col1 = t2.col1;
-> Left hash join (t2.col1 = t1.col1)  (cost=0.63 rows=2)
    -> Table scan on t1  (cost=0.45 rows=2)
    -> Hash
        -> Table scan on t2  (cost=0.18 rows=1)

Right outer join
Note that MySQL rewrites all right outer joins to left outer joins.

EXPLAIN FORMAT=tree SELECT * FROM t1 LEFT JOIN t2 ON t1.col1 = t2.col1;
 -> Left hash join (t1.col1 = t2.col1)  (cost=0.70 rows=1)
    -> Table scan on t2  (cost=0.35 rows=1)
    -> Hash
        -> Table scan on t1  (cost=0.35 rows=1)

F-1: All queries using block nested-loop should execute using hash join. This 
does not apply if the query contains something else that the iterator executor 
does not yet support (i.e. outer join with BKA).

NF-1: A query executed using hash join should be equal to or faster than 
executing the same query with block nested loop.

This worklog continues the work from "WL#2241 - Hash join". WL#2241 only 
implemented hash inner join, where we simply replaced block nested-loop (BNL) 
with   hash join. The optimizer were still makin execution plans thinking it was 
going to execute BNL, and there were some (BNL) inner joins that we did not 
convert to hash join. More specifically, we chose not to convert BNL inner joins 
where the only join condition(s) were non equi-join conditions (i.e. SELECT * 
FROM t1 JOIN t2 ON t1.col1 < t2.col1;). This worklog aims to make all BNL inner 
joins execute using hash join, as well as implementing the remaining join types 
(semi-, anti- and outer joins).

Inner hash join with non equi-join condition
Hash semi-join
Hash anti-join
Hash left outer join
- Method 1
- Method 2
Hash right outer join
Hash full outer join
User-visible changes
Planning hash join

Inner hash join with non equi-join condition

SELECT * FROM t1 JOIN t2 ON t1.col1 < t2.col1;

In order to execute this query using hash join, we will do a cross-join between t1 and t2, and attach the join condition as a filter after the join. The execution plan will look like this:

-> Filter: (t1.col1 < t2.col1)  (cost=0.70 rows=1)
    -> Inner hash join  (cost=0.70 rows=1)
        -> Table scan on t2  (cost=0.35 rows=1)
        -> Hash
            -> Table scan on t1  (cost=0.35 rows=1)

Hash semi-join

SELECT * FROM t1 WHERE t1.col1 IN (SELECT t2.col1 FROM t2);

In the above example, a semi-join returns every row from t1 where there is at least one matching row in t2. Note that nothing from t2 is returned to the user.

To execute this using hash join, we will assign the IN side of the join (t2) as the build input, using the join attribute (t2.col1) as the hash table key. We then scan the other table (t1) and do a lookup in the hash table using the join attribute (t1.col1) as the lookup key. If at least one match is found, we output the row from t1.

Hash anti-join

SELECT * FROM t1 WHERE t1.col1 NOT IN (SELECT t2.col1 FROM t2);

An anti-join is quite similar to a semi-join. In the above example, an anti-join returns every row from t1 where there is no matching row in t2. Again, note that nothing from t2 is returned to the user.

To execute this using hash join, we will assign the NOT IN side of the join (t2) as the build input, using the join attribute (t2.col1) as the hash table key. We then scan the other table (t1) and do a lookup in the hash table using the join attribute (t1.col1) as the lookup key. If no match is found, we output the row from t1.

Hash left outer join

SELECT * FROM t1 LEFT JOIN t2 ON t1.col1 = t2.col1;

There are in general two ways of executing a left join using hash join.

Method 1

Assign the right table (t2) as the build input, using the join attribute (t2.col1) as the hash table key. We then scan the left table (t1) and do a lookup in the hash table using the join attribute (t1.col1). For each matching row found in the hash table, output the joined row. If not, output a null-complemented row.

Method 2

Assign the left side of the join (t1) as the build input, using the join attribute (t2.col1) as the hash table key. Scan the right table (t2) and do a lookup in the hash table using the join attribute (t2.col1). For each matching row found in the hash table, output the joined row and set a match flag for the matching rows in the hash table.. After all the rows from t2 has been read, go through all the rows in the hash table where the match flag is not set and output a null-complemented row.

We have seem both methods mentioned in the literature, and we choose method 1 as it is simpler to implement.

Hash right outer join

SELECT * FROM t1 RIGHT JOIN t2 ON t1.col1 = t2.col1;

The optimizer rewrites all right outer join to left outer join, so there is no need for a right outer join implementation.

Hash full outer join

SELECT * FROM t1 FULL OUTER JOIN t2 ON t1.col1 = t2.col1;

MySQL does not support full outer joins, but it is worth mentioning that it is possible to implement it using hash join by doing a combination of both methods mentioned for left outer join:

Assign one of the tables (t1) as the build input, using the join attribute (t1.col1) as the hash table key. We then scan the other table (t2) and do a lookup in the hash table using the join attribute (t2.col1). For each matching row found in the hash table, mark the hash table row as "matched" and output the joined row. If not, output a null-complemented row. When we are done reading the probe input, do a pass through the hash table and ouput a null-complemented row for each row where the "matched" flag is not set.

User-visible changes

The only user-visible changes is that EXPLAIN FORMAT=tree will print out execution plans for hash semi-, anti- and outer joins:

EXPLAIN FORMAT=tree SELECT * FROM t1 WHERE (t1.i) IN (SELECT t2.i FROM t2);
-> Hash semijoin (t2.i = t1.i)  (cost=0.83 rows=2)
    -> Table scan on t1  (cost=0.45 rows=2)
    -> Hash
        -> Table scan on t2  (cost=0.18 rows=2)

EXPLAIN FORMAT=tree
SELECT * FROM t2 WHERE NOT EXISTS (SELECT * FROM t1 WHERE t1.col1 = t2.col1);
-> Hash anti-join (t1.col1 = t2.col1)  (cost=1.15 rows=6)
    -> Table scan on t2  (cost=0.55 rows=3)
    -> Hash
        -> Table scan on t1  (cost=0.30 rows=2)


EXPLAIN FORMAT=tree SELECT * FROM t1 LEFT JOIN t2 ON t1.col1 = t2.col1;
EXPLAIN
-> Left hash join (t2.col1 = t1.col1)  (cost=0.63 rows=2)
    -> Table scan on t1  (cost=0.45 rows=2)
    -> Hash
        -> Table scan on t2  (cost=0.18 rows=1)

It is also worth mentioning that hash join does not have a deterministic output order. Due to this, we expect that a lot of test will change their output order in this worklog.

Other than this, there are no additional user-visible changes beyond those mentioned in WL#2241.

Planning hash join

The optimizer will not be changed in this worklog. This means that the optimizer will generate plans thinking it will execute BNL. This is obviously not ideal, but we leave this effort to a later worklog.

Most of the needed functionality was implemented in WL#2241, and this worklog mostly extends the usage of this functionality. However, there are a few notable changes that will be done.

We will add the method HashJoinBuffer::find(const Key &key). This is useful for anti-join and semi-join, as we are only interested in the first matching row. This may give a speedup as find() has better average complexity compared to equal_range (constant vs. linear in the number of matching elements).
The HashJoinIterator will now support multiple tables for the probe input. For inner joins, there will be at most one table in the probe input. For semi-, anti- and outer join, this is not true anymore.
The HashJoinIterator will get a list of optional "extra" conditions. These conditions are non equi-join conditions, but needs to be evaluated _before_ any null-complemented row is generated. Consider the following query:
```
SELECT * FROM t1 LEFT JOIN t2 ON t1.col1 < t2.col2
```
t1.col1 < t2.col2 is a condition that must be evaluated before any null-complemented row is generated. Attaching this condition as a filter after the join would yield wrong results.
As an optimization for semi- and antijoin, we will add a new argument to HashJoinBuffer::StoreRow(); bool reject_duplicate_keys. Since semi-/anitjoin only needs the first match from the hash table, there is no need in storing multiple rows with the same key. Note that since we can have these "extra" conditions mentioned above, this optimization can only be applied if there are no extra conditions.