WL#4389: Subquery optimizations: Make IN optimizations also handle EXISTS
Affects: Server-8.0
—
Status: Complete
The purpose of this worklog is to extend the IN semi-join optimizations to handle EXISTS subqueries as well. EXISTS subqueries may be handled with the same semi-join strategies as IN, including first-match, materialization, duplicate weedout and loose index scan. In addition, the WHERE condition attached to the subquery is decorrelated into IN-like expression lists, further expanding the possible semi-join strategies. Decorrelation will be done for both EXISTS and IN subqueries.
Functional requirements
FR1. Transform EXISTS subqueries into semi-join operations. Criteria for
transformation are the same as for IN subqueries (see existing code).
Among other things, this means that aggregated subqueries cannot be
transformed.
FR2. Decorrelate trivially-correlated equality predicates in an IN or EXISTS
subquery's WHERE condition, so that they can be treated similarly to
expressions in IN subqueries.
FR3. Optimize EXISTS subqueries transformed to semi-join operations, considering
all semi-join strategies, including firstmatch, materialization,
duplicate weedout and loose scan.
FR4. All hints and optimizer switches applicable to IN subqueries transformed to
semi-join operations are also applicable to transformed EXISTS subqueries.
Non-functional requirements.
NF1. Performance should be better or equal to current MySQL version.
Significant performance improvement is expected when optimizer can take
advantage of semi-join strategies like materialization or loose scan,
however the improvement will depend heavily on schema and data volumes.
1. Observations about EXISTS subqueries ======================================= 1.1 EXISTS subquery transformed to a semi-join operation -------------------- When an EXISTS subquery is part of the WHERE condition or a join condition, it can be handled by a semi-join strategy in the same way as an IN subquery. We can make this trivial conversion WHERE ... EXISTS (SELECT ... FROM ...) to WHERE ... 1 IN (SELECT 1 FROM ...) and then proceed to handle the subquery operation as a semi-join. 1.2 Trivially-correlated EXISTS ------------------------------- In a good number of cases, a correlated EXISTS subquery has the form: EXISTS (SELECT ... WHERE uncorrelated_cond AND inner_expr=outer_ref) which allows us to easily change it into an uncorrelated IN subquery, which can be handled by Materialization and InsideOut strategies. Notice that this decorrelation can be applied to all kinds of EXISTS subqueries, not only those that can be transformed into a semi-join. However, in this worklog we limit the work to semi-join operations. See LLD section 3.1 for more details. 1.3 EXISTS subquery always ignore NULL values --------------------------------------------- Notice that the WHERE condition (according to the SQL standard) has an implicit IS TRUE added, making the subquery's WHERE condition ignore NULL values in the de-correlated equality predicates. Thus, a de-correlated subquery will always ignore NULLs, even though the equivalent IN subquery does not. Let's say we have: SELECT ... FROM ot WHERE oe1 IN (SELECT ie1 FROM it) If oe1 or ie1 are nullable, the result of the IN subquery may be UNKNOWN. However, since the WHERE has an implicit IS TRUE, UNKNOWN results are converted to FALSE and are thus effectively filtered out. However, look at the similar correlated EXISTS query: SELECT ... FROM ot WHERE EXISTS (SELECT * FROM it WHERE ot.oe=it.ie) Here, the WHERE clause in the subquery filters out NULL values in the equality, so the query effectively ignores all NULL values. As long as the EXISTS subquery is placed in the WHERE clause, this has no effect since UNKNOWN results are filtered out anyway. However, we could take advantage of this property and implement semi-join for EXISTS queries in the SELECT list. We could also implement NOT EXISTS as anti-semi-join.
1.0 Item_subselect class modifications.
The Item_subselect derived classes are organized as follows:
Item_subselect
Item_singlerow_subselect
Item_maxmin_subselect
Item_exists_subselect
Item_in_subselect
Item_allany_subselect
Item_in_subselect is a derived class of Item_exists_subselect, making one
believe that IN is a specialization of EXISTS, which is not the case.
Further, Item_allany_subselect seems to be a derivation of
Item_in_subselect, which is exactly the opposite way.
Later we may refactor the subquery classes. See 5.0 Future work for some notes.
2.0 Transformation of EXISTS queries to semi-join
Imagine that we have the query:
SELECT ...
FROM ot1, ot2, ...
WHERE EXISTS(SELECT *
FROM it1, it2, ...
WHERE search-cond)
AND outer-search-cond;
SELECT_LEX::resolve_subquery will look for EXISTS queries on the form above and
attempt to transform them into semi-join queries.
The actual transformation takes place in SELECT_LEX::flatten_subqueries() and
SELECT_LEX::convert_subquery_to_semijoin(). A semi-join nest is created and
added to the join nest structure of the outer query. The tables referenced in
the FROM clause of the table subquery of the EXISTS predicate are attached to
this semi-join nest. So far, this is the same that happens with an IN subquery
predicate. The set of outer and inner expressions (sj_outer_exprs and
sj_inner_exprs are initialized to empty sets.
After this, the WHERE clause of the subquery is decorrelated (see next section
for details).
Semi-join optimization does not support semi-join nests with empty semi-join
conditions, thus if the WHERE condition of the subquery is empty, a TRUE
expression is inserted in it to make sure it is not empty. However, semi-join
optimization does support the degenerate case of empty semi-join inner and outer
expression lists.
After these changes have been made, semi-join nests for IN and EXISTS
subquery predicates are equivalent and can be treated similarly in subsequent
optimization. Some consolidation is made: After analysis of IN expressions and
decorrelation, sj_outer_exprs and sj_inner_exprs contain possibly empty sets of
expressions, and the subquery WHERE condition contains the non-trivial part of
the condition after decorrelation. After this, a semi-join condition is
synthesized by creating equality predicates from pair-wise expressions in
sj_outer_exprs and sj_inner_exprs. These are collected in an AND condition,
before the remaining subquery predicate is appended.
Calculations of sj_depends_on (the complete set of outer tables that the
subquery depends on) and sj_corr_tables (the set of outer tables that the
non-trivially correlated WHERE condition depends on) is improved.
3.0 Decorrelation of subquery search conditions
3.1 Definitions of correlated subqueries
Subquery predicates are classified according to the correlation between
the tables in the subquery (the inner tables) and the tables in the main query
(the outer tables) as follows:
- Non-correlated.
There is no correlation between the inner and the outer tables.
A query containing an EXISTS subquery with no WHERE clause is non-correlated.
A query containing an IN subquery where neither the left-hand expression of
the IN nor the SELECT list of the subquery contain column references is
non-correlated.
Examples:
SELECT * FROM ot WHERE EXISTS (SELECT * FROM it);
SELECT * FROM ot WHERE 1 IN (SELECT 1 FROM it);
- Trivially-correlated.
A "trivially-correlated subquery" is defined as a subquery used in an
IN/=ANY or EXISTS predicate on the form:
(SELECT select-list
FROM from_clause
WHERE uncorrelated_cond AND
correlated_cond1 AND ...
correlated_condN)
where:
1. from_clause is a list of one or more tables or a join expression.
2. GROUP BY, HAVING, ORDER BY and aggregated expressions are disallowed.
3. uncorrelated_cond contains references to tables inside the subquery
exclusively, and may contain negations and disjunctions.
4. correlated_cond is an equality predicate where one side is an
expression that refers tables outside the subquery exclusively and
the other side is an expression that refers tables inside the
subquery exclusively. correlated_cond may be written as
inner_expr=outer_expr.
Examples:
SELECT * FROM ot
WHERE ot.x IN (SELECT it.y FROM it
WHERE ot.z=it.w AND it.v>100);
SELECT * FROM ot
WHERE EXISTS (SELECT * FROM it
WHERE ot.z=it.w AND it.v>100);
- Non-trivially-correlated.
Any subquery with a WHERE clause that contains references to columns from
outer tables, but does not adhere to the definition of trivially-correlated
subquery above.
Examples:
SELECT * FROM ot
WHERE ot.x IN (SELECT it.y FROM it
WHERE ot.z=it.w OR ot.z=it.v);
SELECT * FROM ot
WHERE ot.x IN (SELECT sum(it.y) FROM it
WHERE ot.z=it.w GROUP BY it.w);
A JOIN clause in a subquery will have no effect on whether the subquery
is considered to be correlated or not, because the JOIN .. ON clause cannot
contain outer references.
Notice that this definition of correlated subqueries is made for internal
implementation work and is incompatible with the normal definition of
correlated subqueries in SQL. When used on a syntactical level, a non-
correlated IN subquery is a subquery that does not have outer references in the
WHERE clause.
Notice also that predicates consisting of only outer references are marked as
non-trivially correlated by this definition. It would be possible to move them
out of the subquery and into the outer query block's WHERE condition, and thus
let the subquery be trivially correlated, but this is not done in this WL.
3.2 Decorrelation
The function SELECT_LEX::decorrelate_where_cond() and the helper function
decorrelate_equality() are used to analyze a WHERE clause and classify the
subquery correlation based on the above criteria.
For each trivially-correlated pair of expressions, the outer expression is
appended to the list sj_outer_exprs, and the inner expression is appended
to the list sj_inner_exprs. If the WHERE clause contains predicates that are
not trivially-correlated, the semijoin nest will be classified as non-trivially-
correlated (sj_corr_tables is non-empty). After decorrelation, the equality is
removed from the WHERE condition.
3.3 Use of correlation information
After the decorrelation has been performed, the lists of correlated expressions
will be used in subsequent optimization and execution to:
- replace the SELECT list of the transformed subquery, meaning that we
do not need to consider the corresponding query block (SELECT_LEX) after
semi-join transformation.
- use with LooseScan semi-join strategy to see whether there are outer
references that classify as LooseScan lookup columns.
- use with Materialization semi-join strategy to define materialization keys
from the inner expressions.
NOTE: sj_corr_tables is updated when a table is pulled out of a
semijoin (a partial execution strategy for semijoin). This should probably imply
that calculation of correlated expressions and the trivial correlation
classification should be redone. The current behaviour disables some possible
optimizations of the remaining tables in the semi-join nest, but it will not
cause execution problems.
Example:
SELECT * FROM ot
WHERE ot.a IN (SELECT it1.pk
FROM it1, it2
WHERE it1.x>it2.y);
Table pullout transforms this to:
SELECT * FROM ot, it1 SEMIJOIN it2 ON it1.x>it2.y
WHERE ot.a=it1.pk;
When it1 is pulled out, the semi-join subquery is no longer trivially
correlated. Materialization can be used for the original query, but not after
pullout.
4.0 Testing effort
- Verify that IN and EXISTS queries have equivalent representations and
produce the same results after semi-join transformation.
- Test variations of key types (primary, unique, non-unique, none).
- Test support for nullable and non-nullable columns.
- Test behaviour of optimizer-switch settings combined with semi-join
transformation.
- Test behaviour with empty tables and single-row tables.
- Test IS TRUE, IS FALSE and IS UNKNOWN applied to subquery predicates.
- Test conjunctions and disjunctions of multiple subquery predicates.
- Test nested subquery predicates.
- Validate that decorrelation of IN and EXISTS predicates triggers
materialization semi-join strategy.
5.0 Future work
- Refactor the Item_subselect class hierarchy into the following classes:
For EXISTS, IN, ALL and ANY subqueries, the following set of derived classes
should be considered instead of the existing classes:
Item_predicate_subquery
Item_predicate_allany
Item_predicate_in_subquery
Item_predicate_exists
- Mark predicates consisting of only outer references as non-trivially
correlated (or move them from the inner where condition to the outer
where condition).
- Enhance calculation of sj_corr_tables when performing table pullout of
functionally dependent tables.
Copyright (c) 2000, 2025, Oracle Corporation and/or its affiliates. All rights reserved.