WL#4389: Subquery optimizations: Make IN optimizations also handle EXISTS

Affects: Server-8.0   —   Status: Complete

The purpose of this worklog is to extend the IN semi-join optimizations to
handle EXISTS subqueries as well. EXISTS subqueries may be handled with the same
semi-join strategies as IN, including first-match, materialization, duplicate
weedout and loose index scan.

In addition, the WHERE condition attached to the subquery is decorrelated into
IN-like expression lists, further expanding the possible semi-join strategies.
Decorrelation will be done for both EXISTS and IN subqueries.
Functional requirements

FR1. Transform EXISTS subqueries into semi-join operations. Criteria for
     transformation are the same as for IN subqueries (see existing code).
     Among other things, this means that aggregated subqueries cannot be

FR2. Decorrelate trivially-correlated equality predicates in an IN or EXISTS
     subquery's WHERE condition, so that they can be treated similarly to
     expressions in IN subqueries.

FR3. Optimize EXISTS subqueries transformed to semi-join operations, considering
     all semi-join strategies, including firstmatch, materialization,
     duplicate weedout and loose scan.

FR4. All hints and optimizer switches applicable to IN subqueries transformed to
     semi-join operations are also applicable to transformed EXISTS subqueries.

Non-functional requirements.

NF1. Performance should be better or equal to current MySQL version.
     Significant performance improvement is expected when optimizer can take
     advantage of semi-join strategies like materialization or loose scan,
     however the improvement will depend heavily on schema and data volumes.
1. Observations about EXISTS subqueries

1.1 EXISTS subquery transformed to a semi-join operation
When an EXISTS subquery is part of the WHERE condition or a join condition, it
can be handled by a semi-join strategy in the same way as an IN subquery. We can
make this trivial conversion



  WHERE ... 1 IN (SELECT 1 FROM ...)

and then proceed to handle the subquery operation as a semi-join.

1.2 Trivially-correlated EXISTS

In a good number of cases, a correlated EXISTS subquery has the form:

  EXISTS (SELECT ... WHERE uncorrelated_cond AND inner_expr=outer_ref)

which allows us to easily change it into an uncorrelated IN subquery, which
can be handled by Materialization and InsideOut strategies.

Notice that this decorrelation can be applied to all kinds of EXISTS subqueries,
not only those that can be transformed into a semi-join. However, in this
worklog we limit the work to semi-join operations.

See LLD section 3.1 for more details.

1.3 EXISTS subquery always ignore NULL values

Notice that the WHERE condition (according to the SQL standard) has an implicit
IS TRUE added, making the subquery's WHERE condition ignore NULL values in the
de-correlated equality predicates. Thus, a de-correlated subquery will always
ignore NULLs, even though the equivalent IN subquery does not.

Let's say we have:

  SELECT ... FROM ot WHERE oe1 IN (SELECT ie1 FROM it)

If oe1 or ie1 are nullable, the result of the IN subquery may be UNKNOWN.
However, since the WHERE has an implicit IS TRUE, UNKNOWN results are converted
to FALSE and are thus effectively filtered out.

However, look at the similar correlated EXISTS query:


Here, the WHERE clause in the subquery filters out NULL values in the equality,
so the query effectively ignores all NULL values.

As long as the EXISTS subquery is placed in the WHERE clause, this has no effect
since UNKNOWN results are filtered out anyway. However, we could take advantage
of this property and implement semi-join for EXISTS queries in the SELECT list.
We could also implement NOT EXISTS as anti-semi-join. 
1.0 Item_subselect class modifications.

The Item_subselect derived classes are organized as follows:


Item_in_subselect is a derived class of Item_exists_subselect, making one
believe that IN is a specialization of EXISTS, which is not the case.
Further, Item_allany_subselect seems to be a derivation of
Item_in_subselect, which is exactly the opposite way.

Later we may refactor the subquery classes. See 5.0 Future work for some notes.

2.0 Transformation of EXISTS queries to semi-join

Imagine that we have the query:
  SELECT ...
  FROM ot1, ot2, ...
               FROM it1, it2, ...
               WHERE search-cond)
    AND outer-search-cond;

SELECT_LEX::resolve_subquery will look for EXISTS queries on the form above and
attempt to transform them into semi-join queries.

The actual transformation takes place in SELECT_LEX::flatten_subqueries() and
SELECT_LEX::convert_subquery_to_semijoin(). A semi-join nest is created and
added to the join nest structure of the outer query. The tables referenced in
the FROM clause of the table subquery of the EXISTS predicate are attached to
this semi-join nest. So far, this is the same that happens with an IN subquery
predicate. The set of outer and inner expressions (sj_outer_exprs and
sj_inner_exprs are initialized to empty sets.

After this, the WHERE clause of the subquery is decorrelated (see next section
for details).

Semi-join optimization does not support semi-join nests with empty semi-join
conditions, thus if the WHERE condition of the subquery is empty, a TRUE
expression is inserted in it to make sure it is not empty. However, semi-join
optimization does support the degenerate case of empty semi-join inner and outer
expression lists.

After these changes have been made, semi-join nests for IN and EXISTS
subquery predicates are equivalent and can be treated similarly in subsequent
optimization. Some consolidation is made: After analysis of IN expressions and
decorrelation, sj_outer_exprs and sj_inner_exprs contain possibly empty sets of
expressions, and the subquery WHERE condition contains the non-trivial part of
the condition after decorrelation. After this, a semi-join condition is
synthesized by creating equality predicates from pair-wise expressions in
sj_outer_exprs and sj_inner_exprs. These are collected in an AND condition,
before the remaining subquery predicate is appended.

Calculations of sj_depends_on (the complete set of outer tables that the
subquery depends on) and sj_corr_tables (the set of outer tables that the
non-trivially correlated WHERE condition depends on) is improved.

3.0 Decorrelation of subquery search conditions

3.1 Definitions of correlated subqueries

Subquery predicates are classified according to the correlation between
the tables in the subquery (the inner tables) and the tables in the main query
(the outer tables) as follows:

- Non-correlated.
  There is no correlation between the inner and the outer tables.
  A query containing an EXISTS subquery with no WHERE clause is non-correlated.
  A query containing an IN subquery where neither the left-hand expression of
  the IN nor the SELECT list of the subquery contain column references is

- Trivially-correlated.
  A "trivially-correlated subquery" is defined as a subquery used in an
  IN/=ANY or EXISTS predicate on the form:
    (SELECT select-list
     FROM from_clause
     WHERE uncorrelated_cond AND
             correlated_cond1 AND ...
   1. from_clause is a list of one or more tables or a join expression.
   2. GROUP BY, HAVING, ORDER BY and aggregated expressions are disallowed.
   3. uncorrelated_cond contains references to tables inside the subquery
      exclusively, and may contain negations and disjunctions.
   4. correlated_cond is an equality predicate where one side is an
      expression that refers tables outside the subquery exclusively and
      the other side is an expression that refers tables inside the
      subquery exclusively. correlated_cond may be written as

    SELECT * FROM ot
    WHERE ot.x IN (SELECT it.y FROM it
                   WHERE ot.z=it.w AND it.v>100);
    SELECT * FROM ot
                  WHERE ot.z=it.w AND it.v>100);

- Non-trivially-correlated.
  Any subquery with a WHERE clause that contains references to columns from
  outer tables, but does not adhere to the definition of trivially-correlated
  subquery above.

    SELECT * FROM ot
    WHERE ot.x IN (SELECT it.y FROM it
                   WHERE ot.z=it.w OR ot.z=it.v);
    SELECT * FROM ot
    WHERE ot.x IN (SELECT sum(it.y) FROM it
                  WHERE ot.z=it.w GROUP BY it.w);

A JOIN clause in a subquery will have no effect on whether the subquery
is considered to be correlated or not, because the JOIN .. ON clause cannot
contain outer references.

Notice that this definition of correlated subqueries is made for internal 
implementation work and is incompatible with the normal definition of 
correlated subqueries in SQL. When used on a syntactical level, a non-
correlated IN subquery is a subquery that does not have outer references in the 
WHERE clause.

Notice also that predicates consisting of only outer references are marked as
non-trivially correlated by this definition. It would be possible to move them
out of the subquery and into the outer query block's WHERE condition, and thus
let the subquery be trivially correlated, but this is not done in this WL.

3.2 Decorrelation

The function SELECT_LEX::decorrelate_where_cond() and the helper function
decorrelate_equality() are used to analyze a WHERE clause and classify the
subquery correlation based on the above criteria.

For each trivially-correlated pair of expressions, the outer expression is
appended to the list sj_outer_exprs, and the inner expression is appended
to the list sj_inner_exprs. If the WHERE clause contains predicates that are
not trivially-correlated, the semijoin nest will be classified as non-trivially-
correlated (sj_corr_tables is non-empty). After decorrelation, the equality is
removed from the WHERE condition.

3.3 Use of correlation information

After the decorrelation has been performed, the lists of correlated expressions
will be used in subsequent optimization and execution to:

- replace the SELECT list of the transformed subquery, meaning that we
  do not need to consider the corresponding query block (SELECT_LEX) after
  semi-join transformation.

- use with LooseScan semi-join strategy to see whether there are outer
  references that classify as LooseScan lookup columns.

- use with Materialization semi-join strategy to define materialization keys
  from the inner expressions.

NOTE: sj_corr_tables is updated when a table is pulled out of a
semijoin (a partial execution strategy for semijoin). This should probably imply
that calculation of correlated expressions and the trivial correlation
classification should be redone. The current behaviour disables some possible 
optimizations of the remaining tables in the semi-join nest, but it will not 
cause execution problems.


WHERE ot.a IN (SELECT it1.pk
               FROM it1, it2
               WHERE it1.x>it2.y);

Table pullout transforms this to:

SELECT * FROM ot, it1 SEMIJOIN it2 ON it1.x>it2.y
WHERE ot.a=it1.pk;

When it1 is pulled out, the semi-join subquery is no longer trivially
correlated. Materialization can be used for the original query, but not after

4.0 Testing effort

- Verify that IN and EXISTS queries have equivalent representations and
  produce the same results after semi-join transformation.

- Test variations of key types (primary, unique, non-unique, none).

- Test support for nullable and non-nullable columns.

- Test behaviour of optimizer-switch settings combined with semi-join

- Test behaviour with empty tables and single-row tables.

- Test IS TRUE, IS FALSE and IS UNKNOWN applied to subquery predicates.

- Test conjunctions and disjunctions of multiple subquery predicates.

- Test nested subquery predicates.

- Validate that decorrelation of IN and EXISTS predicates triggers
  materialization semi-join strategy.

5.0 Future work

- Refactor the Item_subselect class hierarchy into the following classes:

  For EXISTS, IN, ALL and ANY subqueries, the following set of derived classes
  should be considered instead of the existing classes:


- Mark predicates consisting of only outer references as non-trivially
  correlated (or move them from the inner where condition to the outer
  where condition).

- Enhance calculation of sj_corr_tables when performing table pullout of
  functionally dependent tables.