WL#4389: Subquery optimizations: Make IN optimizations also handle EXISTS
Affects: Server-8.0
—
Status: Complete
The purpose of this worklog is to extend the IN semi-join optimizations to handle EXISTS subqueries as well. EXISTS subqueries may be handled with the same semi-join strategies as IN, including first-match, materialization, duplicate weedout and loose index scan. In addition, the WHERE condition attached to the subquery is decorrelated into IN-like expression lists, further expanding the possible semi-join strategies. Decorrelation will be done for both EXISTS and IN subqueries.
Functional requirements FR1. Transform EXISTS subqueries into semi-join operations. Criteria for transformation are the same as for IN subqueries (see existing code). Among other things, this means that aggregated subqueries cannot be transformed. FR2. Decorrelate trivially-correlated equality predicates in an IN or EXISTS subquery's WHERE condition, so that they can be treated similarly to expressions in IN subqueries. FR3. Optimize EXISTS subqueries transformed to semi-join operations, considering all semi-join strategies, including firstmatch, materialization, duplicate weedout and loose scan. FR4. All hints and optimizer switches applicable to IN subqueries transformed to semi-join operations are also applicable to transformed EXISTS subqueries. Non-functional requirements. NF1. Performance should be better or equal to current MySQL version. Significant performance improvement is expected when optimizer can take advantage of semi-join strategies like materialization or loose scan, however the improvement will depend heavily on schema and data volumes.
1. Observations about EXISTS subqueries ======================================= 1.1 EXISTS subquery transformed to a semi-join operation -------------------- When an EXISTS subquery is part of the WHERE condition or a join condition, it can be handled by a semi-join strategy in the same way as an IN subquery. We can make this trivial conversion WHERE ... EXISTS (SELECT ... FROM ...) to WHERE ... 1 IN (SELECT 1 FROM ...) and then proceed to handle the subquery operation as a semi-join. 1.2 Trivially-correlated EXISTS ------------------------------- In a good number of cases, a correlated EXISTS subquery has the form: EXISTS (SELECT ... WHERE uncorrelated_cond AND inner_expr=outer_ref) which allows us to easily change it into an uncorrelated IN subquery, which can be handled by Materialization and InsideOut strategies. Notice that this decorrelation can be applied to all kinds of EXISTS subqueries, not only those that can be transformed into a semi-join. However, in this worklog we limit the work to semi-join operations. See LLD section 3.1 for more details. 1.3 EXISTS subquery always ignore NULL values --------------------------------------------- Notice that the WHERE condition (according to the SQL standard) has an implicit IS TRUE added, making the subquery's WHERE condition ignore NULL values in the de-correlated equality predicates. Thus, a de-correlated subquery will always ignore NULLs, even though the equivalent IN subquery does not. Let's say we have: SELECT ... FROM ot WHERE oe1 IN (SELECT ie1 FROM it) If oe1 or ie1 are nullable, the result of the IN subquery may be UNKNOWN. However, since the WHERE has an implicit IS TRUE, UNKNOWN results are converted to FALSE and are thus effectively filtered out. However, look at the similar correlated EXISTS query: SELECT ... FROM ot WHERE EXISTS (SELECT * FROM it WHERE ot.oe=it.ie) Here, the WHERE clause in the subquery filters out NULL values in the equality, so the query effectively ignores all NULL values. As long as the EXISTS subquery is placed in the WHERE clause, this has no effect since UNKNOWN results are filtered out anyway. However, we could take advantage of this property and implement semi-join for EXISTS queries in the SELECT list. We could also implement NOT EXISTS as anti-semi-join.
1.0 Item_subselect class modifications. The Item_subselect derived classes are organized as follows: Item_subselect Item_singlerow_subselect Item_maxmin_subselect Item_exists_subselect Item_in_subselect Item_allany_subselect Item_in_subselect is a derived class of Item_exists_subselect, making one believe that IN is a specialization of EXISTS, which is not the case. Further, Item_allany_subselect seems to be a derivation of Item_in_subselect, which is exactly the opposite way. Later we may refactor the subquery classes. See 5.0 Future work for some notes. 2.0 Transformation of EXISTS queries to semi-join Imagine that we have the query: SELECT ... FROM ot1, ot2, ... WHERE EXISTS(SELECT * FROM it1, it2, ... WHERE search-cond) AND outer-search-cond; SELECT_LEX::resolve_subquery will look for EXISTS queries on the form above and attempt to transform them into semi-join queries. The actual transformation takes place in SELECT_LEX::flatten_subqueries() and SELECT_LEX::convert_subquery_to_semijoin(). A semi-join nest is created and added to the join nest structure of the outer query. The tables referenced in the FROM clause of the table subquery of the EXISTS predicate are attached to this semi-join nest. So far, this is the same that happens with an IN subquery predicate. The set of outer and inner expressions (sj_outer_exprs and sj_inner_exprs are initialized to empty sets. After this, the WHERE clause of the subquery is decorrelated (see next section for details). Semi-join optimization does not support semi-join nests with empty semi-join conditions, thus if the WHERE condition of the subquery is empty, a TRUE expression is inserted in it to make sure it is not empty. However, semi-join optimization does support the degenerate case of empty semi-join inner and outer expression lists. After these changes have been made, semi-join nests for IN and EXISTS subquery predicates are equivalent and can be treated similarly in subsequent optimization. Some consolidation is made: After analysis of IN expressions and decorrelation, sj_outer_exprs and sj_inner_exprs contain possibly empty sets of expressions, and the subquery WHERE condition contains the non-trivial part of the condition after decorrelation. After this, a semi-join condition is synthesized by creating equality predicates from pair-wise expressions in sj_outer_exprs and sj_inner_exprs. These are collected in an AND condition, before the remaining subquery predicate is appended. Calculations of sj_depends_on (the complete set of outer tables that the subquery depends on) and sj_corr_tables (the set of outer tables that the non-trivially correlated WHERE condition depends on) is improved. 3.0 Decorrelation of subquery search conditions 3.1 Definitions of correlated subqueries Subquery predicates are classified according to the correlation between the tables in the subquery (the inner tables) and the tables in the main query (the outer tables) as follows: - Non-correlated. There is no correlation between the inner and the outer tables. A query containing an EXISTS subquery with no WHERE clause is non-correlated. A query containing an IN subquery where neither the left-hand expression of the IN nor the SELECT list of the subquery contain column references is non-correlated. Examples: SELECT * FROM ot WHERE EXISTS (SELECT * FROM it); SELECT * FROM ot WHERE 1 IN (SELECT 1 FROM it); - Trivially-correlated. A "trivially-correlated subquery" is defined as a subquery used in an IN/=ANY or EXISTS predicate on the form: (SELECT select-list FROM from_clause WHERE uncorrelated_cond AND correlated_cond1 AND ... correlated_condN) where: 1. from_clause is a list of one or more tables or a join expression. 2. GROUP BY, HAVING, ORDER BY and aggregated expressions are disallowed. 3. uncorrelated_cond contains references to tables inside the subquery exclusively, and may contain negations and disjunctions. 4. correlated_cond is an equality predicate where one side is an expression that refers tables outside the subquery exclusively and the other side is an expression that refers tables inside the subquery exclusively. correlated_cond may be written as inner_expr=outer_expr. Examples: SELECT * FROM ot WHERE ot.x IN (SELECT it.y FROM it WHERE ot.z=it.w AND it.v>100); SELECT * FROM ot WHERE EXISTS (SELECT * FROM it WHERE ot.z=it.w AND it.v>100); - Non-trivially-correlated. Any subquery with a WHERE clause that contains references to columns from outer tables, but does not adhere to the definition of trivially-correlated subquery above. Examples: SELECT * FROM ot WHERE ot.x IN (SELECT it.y FROM it WHERE ot.z=it.w OR ot.z=it.v); SELECT * FROM ot WHERE ot.x IN (SELECT sum(it.y) FROM it WHERE ot.z=it.w GROUP BY it.w); A JOIN clause in a subquery will have no effect on whether the subquery is considered to be correlated or not, because the JOIN .. ON clause cannot contain outer references. Notice that this definition of correlated subqueries is made for internal implementation work and is incompatible with the normal definition of correlated subqueries in SQL. When used on a syntactical level, a non- correlated IN subquery is a subquery that does not have outer references in the WHERE clause. Notice also that predicates consisting of only outer references are marked as non-trivially correlated by this definition. It would be possible to move them out of the subquery and into the outer query block's WHERE condition, and thus let the subquery be trivially correlated, but this is not done in this WL. 3.2 Decorrelation The function SELECT_LEX::decorrelate_where_cond() and the helper function decorrelate_equality() are used to analyze a WHERE clause and classify the subquery correlation based on the above criteria. For each trivially-correlated pair of expressions, the outer expression is appended to the list sj_outer_exprs, and the inner expression is appended to the list sj_inner_exprs. If the WHERE clause contains predicates that are not trivially-correlated, the semijoin nest will be classified as non-trivially- correlated (sj_corr_tables is non-empty). After decorrelation, the equality is removed from the WHERE condition. 3.3 Use of correlation information After the decorrelation has been performed, the lists of correlated expressions will be used in subsequent optimization and execution to: - replace the SELECT list of the transformed subquery, meaning that we do not need to consider the corresponding query block (SELECT_LEX) after semi-join transformation. - use with LooseScan semi-join strategy to see whether there are outer references that classify as LooseScan lookup columns. - use with Materialization semi-join strategy to define materialization keys from the inner expressions. NOTE: sj_corr_tables is updated when a table is pulled out of a semijoin (a partial execution strategy for semijoin). This should probably imply that calculation of correlated expressions and the trivial correlation classification should be redone. The current behaviour disables some possible optimizations of the remaining tables in the semi-join nest, but it will not cause execution problems. Example: SELECT * FROM ot WHERE ot.a IN (SELECT it1.pk FROM it1, it2 WHERE it1.x>it2.y); Table pullout transforms this to: SELECT * FROM ot, it1 SEMIJOIN it2 ON it1.x>it2.y WHERE ot.a=it1.pk; When it1 is pulled out, the semi-join subquery is no longer trivially correlated. Materialization can be used for the original query, but not after pullout. 4.0 Testing effort - Verify that IN and EXISTS queries have equivalent representations and produce the same results after semi-join transformation. - Test variations of key types (primary, unique, non-unique, none). - Test support for nullable and non-nullable columns. - Test behaviour of optimizer-switch settings combined with semi-join transformation. - Test behaviour with empty tables and single-row tables. - Test IS TRUE, IS FALSE and IS UNKNOWN applied to subquery predicates. - Test conjunctions and disjunctions of multiple subquery predicates. - Test nested subquery predicates. - Validate that decorrelation of IN and EXISTS predicates triggers materialization semi-join strategy. 5.0 Future work - Refactor the Item_subselect class hierarchy into the following classes: For EXISTS, IN, ALL and ANY subqueries, the following set of derived classes should be considered instead of the existing classes: Item_predicate_subquery Item_predicate_allany Item_predicate_in_subquery Item_predicate_exists - Mark predicates consisting of only outer references as non-trivially correlated (or move them from the inner where condition to the outer where condition). - Enhance calculation of sj_corr_tables when performing table pullout of functionally dependent tables.
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.