MySQL :: Antijoin in MySQL 8

In MySQL 8.0.17, we made an observation in the well-known TPC-H benchmark for one particular query. The query was executing 20% faster than in MySQL 8.0.16. This improvement is because of the “antijoin” optimization which I implemented. Here is its short mention in the release notes:

“The optimizer now transforms a WHERE condition having NOT IN (subquery), NOT EXISTS (subquery), IN (subquery) IS NOT TRUE, or EXISTS (subquery) IS NOT TRUE internally into an antijoin, thus removing the subquery.”

Today I will explain what this optimization does, and present performance numbers.

The use-case for this optimization is questions like

“Show me the objects in this set which are not in that other set”
“Show me the customers for whom there is no purchase order this quarter”
“Show me the students who passed no exam this year”
“Show me the doctor’s patients who didn’t have a medical check-up in the last three years”.

In SQL, this usually translates to queries of the form:

SELECT * FROM patients
WHERE
NOT EXISTS
  (
   # subquery lists recent check-ups of this patient
   SELECT * FROM exams
   WHERE exams.type="check-up"
         AND exams.date>=DATE_SUB(NOW(), INTERVAL 3 YEAR)
         AND exams.patient_id=patients.patient_id
  )

SELECT * FROM patients

WHERE

NOT EXISTS

(

# subquery lists recent check-ups of this patient

SELECT * FROM exams

WHERE exams.type="check-up"

AND exams.date>=DATE_SUB(NOW(), INTERVAL 3 YEAR)

AND exams.patient_id=patients.patient_id

)

If the query remains in this form, there is not much potential for optimization. We have to read every record in the table of patients, and for each such record, evaluate the subquery and check for containment. Yes, we need that many evaluations of the subquery, as its WHERE clause depends on patients.patient_id, which changes for every record of patients (we call that “a correlated subquery”).

The first step towards optimizing this query, is to break the boundaries between the top query and the subquery, effectively merging the latter into the former, yielding:

SELECT * FROM patients ANTIJOIN exams
              ON exams.type="check-up"
                 AND exams.date>=DATE_SUB(NOW(), INTERVAL 3 YEAR)
                 AND patients.patient_id=exams.patient_id;

SELECT * FROM patients ANTIJOIN exams

ON exams.type="check-up"

AND exams.date>=DATE_SUB(NOW(), INTERVAL 3 YEAR)

AND patients.patient_id=exams.patient_id;

This new query uses the antijoin operator ; it is like the join operator except that instead of looking for matches it looks for non-matches ; precisely, it selects records from the left side for which the right side has no record matching the ON condition.

From there, MySQL can choose between two strategies, to evaluate the antijoin.

First, the “First Match” strategy: read a record from patients, find matches in exams, if there are none then emit the record of patients ; we recognize that it is equivalent to if we had kept the subquery.

Alternatively, the “Materialization” strategy: observe that in the ON clause, there are three sub-conditions, of which only one depends on patients; so, MySQL can automatically build a temporary table tmp, made of records of exams which match the two first sub-conditions (exam’s type and date); a bit like if it did:

CREATE TABLE tmp SELECT patient_id FROM exams
WHERE exams.type="check-up"
      AND exams.date>=DATE_SUB(NOW(), INTERVAL 3 YEAR)

CREATE TABLE tmp SELECT patient_id FROM exams

WHERE exams.type="check-up"

AND exams.date>=DATE_SUB(NOW(), INTERVAL 3 YEAR)

Then MySQL automatically adds an index on tmp.patient_id and does: read a record from patients, use the index to find matches in tmp, if there are none then emit the record of patients.

Compared to “First Match”, this strategy can be beneficial, as:

it reads exams only once (to build tmp)
tmp likely has less records than exams, so it is faster to probe in tmp than in exams
the probing is done through a purposely-built index over tmp, while exams might not have an index itself.

However, building tmp may have a significant up-front cost: MySQL needs to allocate memory for storing its records (or perhaps even allocate disk space, if there are many records), it also needs time to write records to tmp. Thus, which one of the two strategies is best, depends on the situation. Fortunately MySQL has a cost-based optimizer which will consider the two different strategies, calculate their cost based on number of records in tables, selectivities of conditions, usability of indexes, and pick the cheapest strategy.

So far, we have already understood that the antijoin optimization can speed a query up by allowing a cost-based choice between two execution strategies instead of one.

But, there is more to it, if we use more than two tables. So let’s jump to this TPC-H query which I mentioned previously. I will be using the DBT-3 implementation of TPC-H, and the query is number 21, which can be directly read here.

In this query, we have four tables, and additionally we have two subqueries in the WHERE clause. The first one is of type EXISTS, and MySQL handles it as a semijoin (an optimization introduced in MySQL 5.6). The second subquery is of type NOT EXISTS, so can be handled as an antijoin (the new thing). Subqueries are thus merged into the top query, of which the FROM clause now looks like:

(
  (supplier, lineitem l1, orders, nation)
  SEMIJOIN
  lineitem l2 ON ...
)
ANTIJOIN
lineitem l3 ON condition-on-l1-and-l3

(

(supplier, lineitem l1, orders, nation)

SEMIJOIN

lineitem l2 ON ...

)

ANTIJOIN

lineitem l3 ON condition-on-l1-and-l3

And this is where we can start to understand another key benefit of the antijoin transformation: because its ON condition depends only on l1 and l3, the antijoin operator can be moved left or right at any place in the FROM clause, as long as it remains after l1. Which place is the right one, depends at least on the number of records in l3 (the bigger this number, the more it costs to evaluate the antijoin), and on the selectivity of the antijoin condition. A costly operator should run late, when lots of records have already been eliminated by previous operators; on the other hand, if that operator is very selective, it should run early, to eliminate many records as soon as possible. So, there is no simple answer, cost calculations and comparisons are necessary. MySQL’s cost-based optimizer will consider the different orderings of accessing tables, and pick the cheapest.

Very good.

There is an obvious objection though. One could say: “there is no need to use an antijoin operator, MySQL could just keep the subquery, not merge it, and evaluate it at the best place (right after reading l1, or orders, or nation…), doing the cost-based choice of the place as you explained”.

True. But, remember, MySQL optimizes the top query, and after that, it optimizes subqueries. So when it is optimizing the top query, and wonders to which table it should attach the NOT EXISTS(subquery) condition, it knows neither the cost of evaluating the subquery, nor the selectivity of NOT EXISTS. So it just assumes that NOT EXISTS(subquery) is “transparent “: that it costs nothing and has a selectivity of 100%. I know, I know, it’s not good. But note that, if instead MySQL optimized subqueries before optimizing the top query, this problem would be solved, but another one would appear, as sometimes the best strategy to execute a not-merge-able subquery depends on how many times it will be evaluated, which is known only if we have already optimized the top query. So sometimes the dependency is from top to bottom, sometimes it’s from bottom to top, most often it’s … both 🙁

By merging the subquery into an antijoin, we circumvent this problem: we bring all tables together into a single planning phase, therefore such planning can make an informed choice.

Let’s illustrate this by playing with TPC-H.

After creating tables, I run query number 21. It emits 100 records, but we care more about its execution time:

...
| Supplier#000098830 | 17 |
+--------------------+---------+
100 rows in set (1 min 3.75 sec)

...

| Supplier#000098830 | 17 |

+--------------------+---------+

100 rows in set (1 min 3.75 sec)

Now, I run this query again, but I use a hint to disable the antijoin optimization and thus keep NOT EXISTS as a subquery, to show things as they happened before MySQL 8.0.17. The hint is the same as the one which disables semijoins (NO_SEMIJOIN), I write:
NOT EXISTS(SELECT /*+ NO_SEMIJOIN() */ …).

Now the execution time is:

| Supplier#000098830 | 17 |
+--------------------+---------+
100 rows in set (1 min 19.05 sec)

| Supplier#000098830 | 17 |

+--------------------+---------+

100 rows in set (1 min 19.05 sec)

We can see the antijoin optimization saves 15 seconds, which is a 19% gain 🙂

Here is the good execution plan with antijoin, as shown by EXPLAIN FORMAT=TREE (antijoin is on line 5):

-> Limit: 100 row(s)
    -> Sort: <temporary>.numwait DESC, <temporary>.s_name, limit input to 100 row(s) per chunk
        -> Table scan on <temporary>
            -> Aggregate using temporary table
                -> Nested loop anti-join
                    -> Nested loop semijoin
                        -> Nested loop inner join 
                            -> Inner hash join
                                -> Filter: (nation.n_name = 'IRAN')
                                    -> Table scan on nation
                                -> Hash
                                    -> Nested loop inner join
                                        -> Filter: (orders.o_orderstatus = 'F')
                                            -> Table scan on orders
                                        -> Filter: ((l1.l_receiptDATE > l1.l_commitDATE) and (l1.l_suppkey is not null))
                                            -> Index lookup on l1 using PRIMARY (l_orderkey=orders.o_orderkey)
                            -> Filter: (supplier.s_nationkey = nation.n_nationkey)
                                -> Single-row index lookup on supplier using PRIMARY (s_suppkey=l1.l_suppkey)
                        -> Filter: (l2.l_suppkey <> l1.l_suppkey)
                            -> Index lookup on l2 using PRIMARY (l_orderkey=orders.o_orderkey)
                    -> Filter: ((l3.l_suppkey <> l1.l_suppkey) and (l3.l_receiptDATE > l3.l_commitDATE))
                        -> Index lookup on l3 using PRIMARY (l_orderkey=orders.o_orderkey)

-> Limit: 100 row(s)

-> Sort: <temporary>.numwait DESC, <temporary>.s_name, limit input to 100 row(s) per chunk

-> Table scan on <temporary>

-> Aggregate using temporary table

-> Nested loop anti-join

-> Nested loop semijoin

-> Nested loop inner join

-> Inner hash join

-> Filter: (nation.n_name = 'IRAN')

-> Table scan on nation

-> Hash

-> Nested loop inner join

-> Filter: (orders.o_orderstatus = 'F')

-> Table scan on orders

-> Filter: ((l1.l_receiptDATE > l1.l_commitDATE) and (l1.l_suppkey is not null))

-> Index lookup on l1 using PRIMARY (l_orderkey=orders.o_orderkey)

-> Filter: (supplier.s_nationkey = nation.n_nationkey)

-> Single-row index lookup on supplier using PRIMARY (s_suppkey=l1.l_suppkey)

-> Filter: (l2.l_suppkey <> l1.l_suppkey)

-> Index lookup on l2 using PRIMARY (l_orderkey=orders.o_orderkey)

-> Filter: ((l3.l_suppkey <> l1.l_suppkey) and (l3.l_receiptDATE > l3.l_commitDATE))

-> Index lookup on l3 using PRIMARY (l_orderkey=orders.o_orderkey)

and here is the bad one without antijoin (instead, it still has a subquery on line 16):

-> Limit: 100 row(s)
    -> Sort: <temporary>.numwait DESC, <temporary>.s_name, limit input to 100 row(s) per chunk
        -> Table scan on <temporary>
            -> Aggregate using temporary table
                -> Nested loop semijoin
                    -> Nested loop inner join
                        -> Inner hash join
                            -> Filter: (nation.n_name = 'IRAN')
                                -> Table scan on nation
                            -> Hash
                                -> Nested loop inner join
                                    -> Filter: (orders.o_orderstatus = 'F')
                                        -> Table scan on orders
                                    -> Filter: ((l1.l_receiptDATE > l1.l_commitDATE) and exists(select #3) is false and (l1.l_suppkey is not null))
                                        -> Index lookup on l1 using PRIMARY (l_orderkey=orders.o_orderkey)
                                        -> Select #3 (subquery in condition; dependent)
                                            -> Limit: 1 row(s)
                                                -> Filter: ((l3.l_suppkey <> l1.l_suppkey) and (l3.l_receiptDATE > l3.l_commitDATE))
                                                    -> Index lookup on l3 using PRIMARY (l_orderkey=l1.l_orderkey)
                        -> Filter: (supplier.s_nationkey = nation.n_nationkey)
                            -> Single-row index lookup on supplier using PRIMARY (s_suppkey=l1.l_suppkey)
                    -> Filter: (l2.l_suppkey <> l1.l_suppkey)
                        -> Index lookup on l2 using PRIMARY (l_orderkey=orders.o_orderkey)

-> Limit: 100 row(s)

-> Sort: <temporary>.numwait DESC, <temporary>.s_name, limit input to 100 row(s) per chunk

-> Table scan on <temporary>

-> Aggregate using temporary table

-> Nested loop semijoin

-> Nested loop inner join

-> Inner hash join

-> Filter: (nation.n_name = 'IRAN')

-> Table scan on nation

-> Hash

-> Nested loop inner join

-> Filter: (orders.o_orderstatus = 'F')

-> Table scan on orders

-> Filter: ((l1.l_receiptDATE > l1.l_commitDATE) and exists(select #3) is false and (l1.l_suppkey is not null))

-> Index lookup on l1 using PRIMARY (l_orderkey=orders.o_orderkey)

-> Select #3 (subquery in condition; dependent)

-> Limit: 1 row(s)

-> Filter: ((l3.l_suppkey <> l1.l_suppkey) and (l3.l_receiptDATE > l3.l_commitDATE))

-> Index lookup on l3 using PRIMARY (l_orderkey=l1.l_orderkey)

-> Filter: (supplier.s_nationkey = nation.n_nationkey)

-> Single-row index lookup on supplier using PRIMARY (s_suppkey=l1.l_suppkey)

-> Filter: (l2.l_suppkey <> l1.l_suppkey)

-> Index lookup on l2 using PRIMARY (l_orderkey=orders.o_orderkey)

In the bad one, we can see that NOT EXISTS is evaluated right after reading l1 (i.e. as soon as possible). Whereas in the good one, we can see that it is evaluated at the very end (l3 is the last table), which is apparently a smarter choice as it runs faster.

By the way, we can see that the antijoin has been handled with “First Match”, as in the “Nested loop anti-join” node there is no mention of an internal temporary table.

We’re coming to the end of this blog post. It’s time for a recap. We learnt that the antijoin optimization:

applies to NOT EXISTS, NOT IN(subquery)
allows MySQL’s planner to have a choice of strategy (First Match, or Materialization)
allows MySQL’s planner to have more choices of orderings of tables.

More usage details are on the dedicated manual page.

As always, thank you for using MySQL!

Featured image credit: Pixabay.