WL#4245: Subquery optimization: Transform NOT EXISTS and NOT IN to anti-join

Affects: Server-8.0 — Status: Complete

Description
Requirements
Dependent Tasks
High Level Architecture
Low Level Design

MOTIVATION
==========

Provide better cost planning: by bringing subquery tables
into the top query's plan, and also by merging semi-joins and anti-joins
together, we will gain more freedom to re-order tables in the execution plan,
and thus sometimes find better (faster) plans than before this WL.

ABSTRACT
========

An anti-join (also called anti-semi-join):
https://en.wikipedia.org/wiki/Relational_algebra#Antijoin_(%E2%96%B7)
A ANTISEMIJOIN B ON COND
returns all rows of for which there is no row in B matching COND.

Subqueries in form 

  SELECT ... FROM ot WHERE oe NOT IN (SELECT ie FROM it1, it2 ...)

can be rewritten to

  SELECT ... FROM ot ANTIJOIN (it1, it2 ...) ON oe=ie;

and then can be executed with an inverted FirstMatch strategy or inverted
Materialization strategy:

Assuming an outer-to-inner join order  

  ot it1 it2

For FirstMatch, whenever we reach it1, we first enumerate all record
combinations from 'it1' and 'it2'. If we find a first matching combination, we
short-cut back to 'ot'. If we find no matching combinations, we proceed with
joining with the rest of the plan.
For Materialization, we materialize the inner SELECT into a temporary table
having a unique index on 'ie', set up plan order

  ot tmp-tbl

and then do as for FirstMatch.

This approach is similar to, and reuses, the already existing IS NULL
optimization for outer joins (called "Not exists" in
https://dev.mysql.com/doc/refman/8.0/en/explain-output.html).

F1 - there should be a new strategy in the Optimizer, called anti-join; user
should see in EXPLAIN if it's used (in the Note showing the rewritten query: the
rewritten query should show "antijoin")

F2 - it should be used for "expr NOT IN (subquery)" and "NOT EXISTS (subquery)",
if these conditions apply: the said predicate should be the WHERE or some JOIN
ON condition; or should be be connected to WHERE or to some JOIN ON condition
via a chain of AND; and, for NOT IN, neither 'expr' nor the SELECT list of
'subquery' should be nullable.

F3 - the regular greedy search cost optimizer should be used to choose between
the supported strategies for anti-join; in EXPLAIN this should be visible by the
absence of "SUBQUERY" lines in the output: tables of the subquery should be
listed under the same "id" as tables of the top query.

NF4 - no general performance degradation is expected.

NF5 - Queries that contain NOT EXISTS and NOT IN clauses should generally run
without performance degradation and sometimes with better performance than
before this WL.

NF6 - antijoin should apply to the following DBT3's queries: Q16, Q21 and Q22.

WL#4389: Subquery optimizations: Make IN optimizations also handle EXISTS

Antijoin (also known as antisemijoin, depending on sources) can be
treated like semijoin, but not quite. Below are notable differences.

DIFFERENCE 1: Possible semijoin strategies
==========================================

Taking this sample data:
- outer table 'ot' has an INT column, contains rows 1,2,3.
- inner table 'it' has an INT column, contains row 3,4.

SELECT * FROM ot WHERE ot.a NOT IN (SELECT it.a FROM it);

Consider a plan which is reading first 'it' then 'ot'.
If we read value 3 of 'it', then find it in 'ot', we can know that 3
of 'ot' must not be emitted; then we read 4 of 'it', don't find it in
'ot', so emit nothing either; we are now done with 'it'; how do we
remember to emit '1' and '2' of 'ot'? The only way would be to keep
"marks" on records of 'ot' ("this record matched something from 'it'")
and do a final pass to emit all records of 'ot' which don't have the
mark. This is like the "match flag" in the join buffering code.
But then such plan is effectively like plan 'ot-it' with join buffering
enabled on 'it'.

So 'ot-it' is the only possible plan which we will consider.

Loosescan and subquery-materialization-scan are impossible as they are
'it-ot'.

Firstmatch is ok: when you find a matching row in 'it', don't emit the row of
'ot', and
jump back to next row of 'ot'; if you find no matching row in 'it', emit the row
of 'ot'. This is similar to NULL-complementing in outer joins.
Subquery-materialization-lookup: works in the same way.
Duplicates-weedout is ok too: similar to firstmatch without the jumping
back.

DIFFERENCE 2: handling of NULLs
===============================

a) For NOT IN
-------------

consider
WHERE ot.a NOT IN (SELECT it.a ...) i.e.
WHERE NOT(ot.a IN (SELECT it.a...)):
- the antijoin (AJ) algorithm would check the equality of ot.a and it.a
(with a '=' operator; so NULLs are not equal), and it would emit the
row of 'ot' if no row of 'it' matches both the equality and the
conditions inside the subquery's SELECT.
- let's examine all possible cases, to see if AJ correctly emulates
the original query:
- if subquery's SELECT is empty, SQL std says: IN is FALSE, NOT IN is
TRUE, so the row of 'ot' is emitted, AJ does the same -> ok
- if ot.a is not NULL and subquery's SELECT has a matching row, IN is
TRUE, NOT IN is FALSE, row of 'ot' is not emitted, AJ does the same -> ok
- if ot.a is not NULL and subquery's SELECT has no matching row and
has rows where 'it.a' is NULL, IN is UNKNOWN, NOT IN is UNKNOWN, row
of 'ot' is not emitted, AJ emits it -> problem
- if ot.a is not NULL and subquery's SELECT has no matching row and
has no rows where 'it.a' is NULL, IN is FALSE, NOT IN is TRUE, row of
'ot' is emitted, AJ does the same -> ok
- if ot.a is NULL, IN is UNKNOWN, NOT IN is UNKNOWN, row of 'ot' is not emitted,
AJ emits it -> problem

Conclusion: for NOT IN, antijoin cannot apply if ot.a or it.a are nullable.
A test for this case is added to Item_exists_subselect::semijoin_checks_ok().

An enhancement:
in the presence of NULLs, we can still apply antijoin to NOT IN if it's
(NOT IN) IS NOT FALSE,
or, equivalently,
(IN) IS NOT TRUE.
They are constructs where, in the problematic scenarios above, the returned
value is TRUE (so, in accordance with AJ's response; somehow it means that a
NULL is invisible to the construct: only non-NULL values on left and right side
count to make a TRUE value for IN). This is implemented in the tree.

For NOT EXISTS
--------------

For WHERE NOT EXISTS (SELECT it.a FROM it ...): it's equivalent to
1 IN (SELECT 1 FROM it...), and as 1 is not NULL, there is no problem,
AJ will work.


DIFFERENCE 3: cost-based choice of strategy
===========================================

Let's compare the cost of the (anti)semijoin op with the semijoin op.

For possible strategies (firstmatch, subq-mat-lookup, dups weedout):
- the number of rows which the operation has to read is identical: for
firstmatch, we read until we find a match and then we decide what to do of the
outer row (send or reject); for subq-mat-lookup, we do one single index lookup
and decide; for dups-weedout, we do a join.
- the number of rows output by the operation is different. For firstmatch and
subq-mat-lookup it's impossible to predict if AJ/SJ produces more or less rows.
For dups-weedout, we know that, before applying the "uniqueness" filter, SJ produces
card(outer-table)*card(inner-table)*0.1 (COND_FILTER_EQUALITY==0.1); whereas AJ
produces at most card(outer-table) rows. But we expect that dups-weedout will
rarely be chosen (indeed, only ot-it plans are possible, where
Firstmatch is legible, and should cost less than Dups-weedout as it
doesn't use any tmp table and doesn't have "many duplicate rows" in a
pre-uniqueness filter step).

For simplicity we will use, for AJ, the existing cost calculations of SJ. So we
don't need to make the planner AJ-aware.

IMPLEMENTATION DETAILS
======================

TABLE_LIST::is_sj_nest() and is_aj_nest() say if this is a SJ or AJ nest.
is_sj_nest() replaces the old call sj_cond().

TABLE_LIST::on_expr_dep_tables is renamed to join_cond_dep_tables.

Before the AJ nest is created, Item_exists_subselect needs to know if
it has a NOT above (to know that it should do antijoin instead of
semijoin). For that, it gets a neg_transformer(), which absorbs the
parent NOT and sets a new member inside the class ("value_transform").
The item's val_bool() has to take care of applying NOT to its return value.
Item_in_subselect's val_bool() does not take care, this is left to its parent
Item_in_optimizer which "knows better" (handles caching of value,
etc) (Item_in_optimizer::val_bool() gets the value of Item_in_subselect,
consults Item_in_subselect::value_transform and applies negation).
As Item_in_subselect's val_bool() returns the value of the "naked" item (without
NOT), it's renamed to val_bool_naked().

<>ALL also uses the same neg_transformer(), as it's equivalent to NOT IN.

resolve_subquery() identifies a candidate for AJ.

convert_subquery_to_semijoin() converts the candidate to a AJ nest;
for example:
SELECT * FROM ot WHERE ot.a NOT IN (SELECT it.a FROM it);
becomes
SELECT * FROM ot AJ it ON ot.a=it.a
where the AJ nest contains 'it'.
The AJ nest's sj_cond() is the ON.
The AJ nest is also made to be a LEFT JOIN nest (join_cond() is set
to ON).

Let's explain why we introduce a LEFT JOIN.
The result of AJ is equivalent to that of
SELECT * FROM ot LEFT JOIN it ON ot.a=it.a WHERE SPECIAL_COND
where SPECIAL_COND is:
"<if>(is_not_null_compl(it), <if>(found_match(it), 0, true), true)"
which is made of two triggered conditions. It is good to remember here
some old principles:
- ON and WHERE end up AND-ed together into QEP_TAB::condition()
(where QEP_TAB is 'it').
- the job of triggered conditions is to disable certain AND-ed bits in
certain phases
- e.g. the job of found_match is to disable the evaluation of WHERE, so
it's false when we want to evaluate ON, and true when we want to
evaluate WHERE.
So, SPECIAL_COND has this effect, for a row of 'ot':
- before evaluating ON, found_match is set to false
- QEP_TAB::condition() is evaluated, for ON (SPECIAL_COND returns true)
- if ON is true (there is a match in 'it'): found_match is set to
true, is_not_null_compl is set to true, QEP_TAB::condition() is
evaluated, for WHERE, SPECIAL_COND returns 0, row of 'ot' is rejected
- if there is no match in 'it', NULL-complementing is done,
is_not_null_compl is set to false, found_match is set to true;
QEP_TAB::condition() is evaluated, for WHERE, SPECIAL_COND returns
"true", row of 'ot' is emitted. 
- All in all, the row of 'ot' is emitted only if it has no match in
'it' for 'ON ot.a=it.a'. This is the same result as the original query
with NOT IN.

Note that we can end up with an ON condition having a reference to external 
nests: indeed,
SELECT * FROM ot1 LEFT JOIN ot2 ON ot1.a NOT IN (SELECT it.a FROM it);
becomes
SELECT * FROM ot1 LEFT JOIN (ot2 AJ it ON ot1.a=it.a);
and then
SELECT * FROM ot1 LEFT JOIN (ot2 LEFT JOIN it ON ot1.a=it.a) ON SPECIAL_COND;
So we have the ON condition of "ot2 LJ it" referencing ot1 i.e. referencing 
outside of "ot2 LJ it"; this is not allowed with user-specified LEFT JOIN but 
doesn't seem to cause any implementation problem.

An AJ nest is characterized by sj_cond()!=nullptr &&
join_cond()!=nullptr;
a SJ nest is characterized by sj_cond()!=nullptr &&
join_cond()==nullptr;
a LEFT JOIN nest is characterized by sj_cond()==nullptr &&
join_cond()!=nullptr.

AJ/SJ-nest is now pushed to front of the join list, and not the back
anymore; this way it's last in join order; this gives more logical
printouts in EXPLAIN's Note (closer to the order specified by the
user).
And it makes some "swapping" code unnecessary in print_join().
Implied a minor fix in hint code (update_semijoin_strategies()).

The AJ nest is added to SELECT_LEX::sj_nests.

Thanks to the LEFT JOIN, dependencies of tables are automatically put
in place (so 'it' won't go before 'ot' in plan).
And normal NULL-complementing code will automatically be used in
execution (both with or without join buffering).
And AJ conditions properly remain non-merged with other conditions
like WHERE (unlike conditions of SJ nests which are merged with WHERE).

During planning, this AJ nest is optimized by advance_sj_state().

advance_sj_state() may pick Firstmatch, Materialization or Duplicates
Weedout (other strategies are disabled by
SELECT_LEX::update_semijoin_strategies).

There was an old limitation (in semijoin_types_allow_materialization()):
SJ-Mat was prevented when on the right side of LEFT JOIN, as in:
SELECT FROM t1 LEFT JOIN t2 ON x IN (SELECT FROM t3)
i.e.
t1 LEFT JOIN (t2 SJ t3).
Thus, AJ (which uses LEFT JOIN internally) couldn't use SJ-Mat.
This was a short-term problem, because if resolve_subquery() chose
AJ, and AJ was limited to Firstmatch, then the WL would have this
effect: a query with NOT IN, previously served with Subquery
materialization, would be served with AJ Firstmatch, possibly much
slower. It would be a regression.
There were two solutions:
- make resolve_subquery() not choose AJ if NOT IN was inside the
condition of a LEFT JOIN (=> continue using Subquery Mat), or
- implement SJ-Mat for such case (=> our example query would switch to
SJ-Mat which shouldn't be slower than Subquery Mat).
The 1st possibility was less work and enough for the 3 DBT3 queries
(which don't have LEFT JOIN).
I chose the 2nd possibility. So the limitation has been lifted in this
WL. Independently of AJ it required these changes, when SJ-mat was used:
- in make_outerjoin_info(), in the case of nested outer joins: do not
set first_upper() of a SJ-inner table to point to a table external to the
SJ-nest; indeed when the SJ-inner table gets used in execution, it
must be as if the nest would materialize alone, independently of the
external nests; it's the sj-tmp table which must have first_upper()
set now, as it's the representative of the SJ nest from the POV of
external nests.
- The SJ-tmp table is inserted into the LEFT JOIN nest (which, if AJ,
is also the AJ nest), again because it's the representative of the SJ
nest from the POV of external nests.
- some parts of the ON condition (e.g. equality-based triggered
conditions) need to be attached to the SJ-tmp table and not to the SJ-inner
tables; but some other conditions (SJ-inner ones, from the original
subquery's WHERE) need to be attached to the SJ-inner tables;
before this WL JOIN::attach_join_conditions() split ON among tables
between "first_inner" and "last_inner" of the LEFT JOIN nest;
but SJ-inner tables are not in this range (remember: from the POV of
execution, they're like an independent JOIN); the SJ-mat tmp table is
in this range; so it gets its part of ON, and it sends the other part
to the SJ-inner tables. For that, some code is added to
JOIN::attach_join_conditions() specifically for SJ-Mat.

When it finds an AJ nest, JOIN::attach_join_conditions() builds the
SPECIAL_COND described earlier and attaches it to the first AJ-inner
table (which, if AJ-mat is used, is the AJ-mat tmp table, as that
table has been inserted into the AJ nest; and it is as it should, as
SPECIAL_COND must check null-complementing of the AJ-mat tmp table,
not of SJ-inner tables which don't participate in execution anymore).

Finally, JOIN::attach_join_conditions() turns on the
"not_exists_optimize" flag for the first AJ-inner table (or AJ-mat
tmp table).
So, if SPECIAL_COND returns false (i.e. there is a match) the existing
code in evaluate_join_record() sets a return_tab which ignores all next
rows of 'it' and goes to next record of 'ot'.
If SPECIAL_COND returns true (which means we have exhausted 'it' and
done NULL-complementing), we emit the row of 'ot' and automatically go
to next record of 'ot' (there's no next row in 'it' to skip).
It is worth nothing that if the planner selected Firstmatch,
execution does not use the Firstmatch machinery (do_firstmatch()). It
uses "Not exists" instead.
As it doesn't use the Firstmatch machinery, the SJ_OPT_FIRSTMATCH flag
is turned off from QEP_TABs at end of planning, and before join
buffering is set up (because join buffering has limitations when
together with Firstmatch, and we don't need to have these
limitations for AJ as we don't use Firstmatch; join buffering already
has limitations when together with LEFT JOIN, which we have for AJ).

simplify_joins() used to dissolve a SJ nest into a parent SJ nest (the
inner being changed to a plain JOIN nest as its duplicates are anyway
absorbed by the parent).
Now, with the advent of AJ: SJ can be dissolved into AJ nest. But
AJ nest cannot be dissolved into AJ or SJ nest (would not be
semantically correct).
So we have this new situation of: AJ inside AJ/SJ.
As advance_sj_state() cannot manage nested SJ nests (see the
comment added to it; e.g. there's only one JOIN_TAB::emb_sj_nest
pointer): AJ can't be inside AJ/SJ nest, and so we block it: in
resolve_subquery(), if a subquery contains AJ candidates, it will
stay as a subquery.

Actually this blocking is done in a sub-function of resolve_subquery():
semijoin_checks_ok(). This same function blocks AJ if there is a
problem with NULLs.

Semijoin table pullout isn't applied to AJ, it wouldn't have any
interest (the LEFT JOIN would stay anyway).

Before this WL, SJ didn't apply to
WHERE (x IN (subq)) IS TRUE .
This is fixed; Item_exists_subselect now absorbs the upper IS [NOT]
TRUE/FALSE operator, if there is. Item_exists_subselect gets an enum
members (enum BOOL_IS_TRUE, BOOL_NOT_TRUE, etc) to remember the IS operator it
absorbed. In the enum there is also BOOL_NEGATED to remember it's inside NOT.
Simplifications are done: e.g.
(NOT IN) IS TRUE is changed to IN IS FALSE.
In top_level_item(), we tell the Item_in_subselect that it can behave as if it
were inside IS TRUE.
So WHERE NOT IN becomes WHERE (NOT IN) IS TRUE then WHERE IN IS FALSE.

top_level_item() is renamed to propagate_is_true().
is_top_level_item() is renamed to ignore_unknown().

Item_func_true/false are deleted, their logic goes into parent Item_func_truth.
The enum BOOL_IS_TRUE/etc is used in Item_func_truth instead of booleans.

Since before this WL, when the contextualization phase sees
NOT(expr)
it calls negate_expression(expr), which calls expr->neg_transformer(): expr has
a choice to ignore this, in which case an Item_func_not would be created around
it; or to take action, internally recording the presence of NOT around, in which
case 'expr' would take the place of NOT(expr) (Item_func_not would not be created).
For example, not(a<b) becomes (a>=b).
To build more on this, in this WL a neg_transformer() is added to
Item_func_truth, so NOT (x IS TRUE) becomes x IS NOT TRUE.
Reusing the idea of negate_expression(), the same is done to "push down" IS
[NOT] TRUE/FALSE to the operator's argument. That's how Item_in_subselect gets
the signal that it's, for example, inside IS FALSE (and can thus do antijoin).
This is done by a new function change_truth_value_of_expression(); which
actually also does the job of negate_expression() which is suppressed from code.
It calls truth_transformer(), which, likewise, also does the job of
neg_transformer() which is suppressed.

IS NULL and IS NOT NULL are not changed by this WL.

The truth_transformer() is further extended to simplify, thanks to a matrix
(thanks Roy!), some "nested" cases :
- (expr IS TRUE) IS FALSE becomes: expr IS NOT TRUE
This reduces the depth of the Item tree.

"::print()" functions of subquery items now take care of printing the "NOT" and
"IS etc" that they absorbed.