WL#3740: Subquery optimization: Semijoin: Pull-out of inner tables
Affects: Server-6.0
—
Status: Complete
This is a subtask of WL#2980 "Subquery optimization: Semijoin". It includes: - Conversion of subquery IN predicates to semi-join nests (i.e. TABLE_LISTs) - Elimination of nested semi-join nests. - A procedure that pulls out tables out of semi-join nests when this is possible.
1. Task scope 2. SJ applicability criteria 2.1 MAX_TABLES limitation 3. Subquery to semijoin conversion 3.1 Conversion steps and their location 3.2 The conversion procedure 3.1 Nested semi-join nests removal 3.2 Treatment of semi-join nests by other optimizations 4. The Pull-out procedure 1. Task scope ============= This WL entry covers: - Conversion of subquery IN predicates to semi-join nests (i.e. TABLE_LISTs) - Elimination of nested semi-join nests. - A procedure that pulls out tables out of semi-join nests when this is possible. After this WL is implemented, the code will be in this state: - Subqueries that can't be executed with SJ will be handled as before; - Subqueries that can be executed by SJ and allow to eliminate all semi-join nests (by pulling out all of the tables) will be handled correctly. - Subqueries that can be executed with SJ but do not allow for all tables to be pulled out will not be executed, a query failure will be reported (without any special error message). That is, this WL will leave the subquery implementation in a state where some subqueries are not handled. The next task, WL#3741, will restore server's ability to handle all kinds of subqueries. 2. SJ applicability criteria ============================ (WL#2980 has a copy of this section contents) A subquery predicate is replaced with semi-join nest if the following conditions are met: SUBQ-TO-SJ-CRITERIA: 1. Subquery predicate is of the form (...) IN (SELECT ...) (...) =ANY (SELECT ...) 2. Subquery is a single SELECT (no UNIONs) 3. Subquery does not have GROUP BY or ORDER BY 4. Subquery does not use aggregate functions or HAVING 5. Subquery predicate is at the AND-top-level of ON/WHERE clause 6. #tables-in-parent-query + #tables-in-subquery < MAX_TABLES Note that the subquery may have DISTINCT or LIMIT clauses. A more precise wording for condition #5 is: The WHERE/ON clause can be converted to "subquery_predicate AND remainder" while the check will be performed as follows: Subquery is at the AND-top-level of the clause if we can reach it by starting from the root Item* and descending into the arguments of Item_cond_and() functions. 2.1 MAX_TABLES limitation ~~~~~~~~~~~~~~~~~~~~~~~~~ The condition #6 is: #tables-in-parent-query + #tables-in-subquery < MAX_TABLES The reason for this limitation is that we must not end up in a situation where the total number of tables in a join is >= MAX_TABLES. If there are several candidate subqueries and the total number of tables is >= MAX_TABLES we'll have to make a choice which of the subqueries to convert to semi-joins. If the "children" subqueries have "grandchildren" subqueries we may also have a choice between pulling grandchild into the child or child into the parent. We are not able to make an educated choice due to lack of information, in particular we don't know subquery's output cardinality. We'll settle for this set of preferences, in the order of their importance: Parents/children 1. Pull the most deeply nested subqueries first; Siblings 2.1. Prefer correlated subqueries over uncorrelated; 2.2. Prefer subqueries that have greater number of outer tables; The preferences place some restrictions on how the subqueries to semi-join nests conversion will need to be organized. - The transformation must proceed as a depth-first traversal. - In order to make the choice we will first need to run JOIN::prepare() for all "sibling" subqueries, then collect and compare all candidates, and then perform the transformation. 3. Subquery to semijoin conversion ================================== 3.1 Conversion steps and their location --------------------------------------- Currently the execution proceeds according to this scenario: parent_join->prepare() { where->fix_fields() { Item_subselect->fix_fields() { engine->prepare() { child_join->prepare() { // Preparation part #1 (fix_fields(), etc) [row|single_value]_select_transformer(child_join) { Inject predicates into the child select, convert IN to EXISTS } [ select_lex->fix_prepare_information(); // GROUP BY and PROCEDURE fix-ups; ](**) } } } } } parent_join->optimize(); parent_join->exec() { ... subq_predicate->val_int() { child_join->optimize(); child_join->exec(); } } The two notable properties are 1. IN->EXISTS conversion is performed "early" 2. Subqery's optimization is performed lazily In order to use the strategy described in section 2.1 we must return out of child_join->prepare() without doing the IN->EXISTS conversion. Since subqueries are processed recursively, this applies to parent_join->prepare() as well. The actions marked (**) seem to have a prerequisite that we either have perfomed IN->EXISTS conversion or have decided not to do it at all, so they will have to be delayed as well. This gives us the following strategy: parent_join->prepare() { where->fix_fields() { Item_subselect->fix_fields() { engine->prepare() { child_join->prepare() { // Preparation part #1 (fix_fields(), etc) if (this is a subquery that meets SUBQ-TO-SJ-CRITERIA #1-#5) { mark it for further processing // see below } else { [row|single_value]_select_transformer(child_join); select_lex->fix_prepare_information(); // GROUP BY and PROCEDURE fix-ups; } } } } } } if (this a top-level-select) { fix_subqueries() { // 1. fix children for each marked subquery S S->fix_subqueries(); // 2. Pick which subqueries to convert, and convert as many as we can // Materialization-check: // 3. Not coded as part of this WL: Pick which subqueries are to be // handled via Materialization. // 4. Finalize join->prepare() for the subqueries that we haven't // converted: for every marked subquery that wasn't processed { select_lex->fix_prepare_information(); // GROUP BY and PROCEDURE fix-ups; } } } parent_join->optimize(); parent_join->exec(); ... Things to note: - The above is not completely in line with WL#1110 "Subquery optimization: Materialization". Timour's intent was to have the "Materialization-check" (see above) at the beginning of JOIN::optimize() and we've moved it out and ahead (but didn't move it over anything significant?) - In the current code subqueries are optimized lazily (if the subquery is never invoked it is never optimized). For subqueries executed by SJ this property will be lost as such subqueries are optimized/executed together with their parent select. - Conversion to semi-join nests will occur recursively, in depth-first fashion. 3.2 The conversion procedure ---------------------------- In order to turn subquery predicate into a semi-join nest we'll need to do the following: * In the parent's WHERE/ON clause, replace Item_subselect with Item_int(1). An alternative approach is to remove the Item_subselect item, i.e: - replace it with NULL if the said item is the complete WHERE/ON clause, or - remove it from the list of arguments of its parent Item_cond_and and remove the Item_cond_and if it becomes "unary" AND. We'll make a choice between the two approaches in LLD or at coding time. * Walk through child's tables and adjust table->map * Adjust the parent_join->tables counter * Create a TABLE_LIST join nest that describes the semi-join (its exact definition will be given in the LLD) * Attach this TABLE_LIST as a child of parent select's global TABLE_LIST or of the appropriate outer join nest. * Move the subquery's WHERE into semi-join's ON condition. - For subquery's WHERE/ON conditions, recalculate all attributes that involve table bitmaps: = call update_used_tables() = do something to update not_null_tables() = adjust other attributes as necessary (to be covered in the LLD) * Insert the IN equalities into semi-join's ON condition, i.e for (oe1 .. oeN) IN (SELECT ie1, ... ieN) create oe1=ie1 AND ... AND oeN=ieN and insert it into the semi-join's ON condition. Note: the question whether semijoin's ON condition should be in TABLE_LIST::on_expr is to be resolved in the LLD. - Pro: this is the "natural place" - Contra: all over the code the (on_expr!=NULL) check is used to check if this is an outer join. * Do nothing for "Item_refs". When a subquery is located in the WHERE clause, references "out of the subquery" are Item_direct_ref objects which are just call redirectors, and consequently can be simply ignored. (The reason for their creation is that we must have Item_direct_ref( Item_field(outer_tbl.col)).used_tables() == OUTER_REF_TABLE_BIT while Item_field(outer_tbl.col).used_tables() = outer_tbl.map ). * (TODO: if any of the execution strategies will need more information to be saved we could cache it here. One candidate is the is_correlated flag) The above list is not inclusive. Details to be covered in the LLD. Once these actions are performed we can discard the child st_select_*/JOIN and operate only on the structures that are in the parent join. 3.1 Nested semi-join nests removal ---------------------------------- We do not need to keep nested semi-join TABLE_LISTs to produce correct query results. This comes from the fact that for some nested semi-join ot_i SJ (it_k SJ iit_j ON sj_cond2) ON sj_cond1 the result is defined as all row combinations of ot_i such that exists a matching row combination it_k such that exists a matching row combination iit_j which is the same as all row combinations of ot_i such that exist a matching row combination {it_k, iit_j}. Consequently, we can make simplify_joins() to remove the nested semi-joins. NOTE: This flattening will destroy e.g. information about whether the grandchild subselect was uncorrelated. We don't need it now but may need it for some further optimizations. 3.2 Treatment of semi-join nests by other optimizations ------------------------------------------------------- This WL code will cause semi-join nests to be seen by the following optimizations: Equality Propagation optimize_cond() can use semi-join nest's ON condition together with parent select's WHERE condition. Partition Pruning For an inner table, its semi-join nest's ON condition must be used. Aggregate Functions optimization, opt_sum_query call and code around it. It seems we don't need to do anything here. Constant table detection Like with outer joins, constant table detection should detect constant inner tables. The pull-out procedure will pull them out of the semijoin. Ref-analysis update_ref_and_keys() should process semi-join's ON expression. It is ok to create candidate ref accesses that mix inner and outer tables (e.g. for outer_tbl.key =inner_tbl.key two KEYUSE elements should be created). 4. The Pull-out procedure ========================= Consider a semi-join (ot1 ... otN) SJ (it1 , ... itK) ON sj_cond Suppose we execute it as inner join. We will violate INNER-UNIQ-REQ when we construct two row combinations (ot1.row1 ... otN.row1, it1.rowX, it2.row1, it3.row1 ...) (ot1.row1 ... otN.row1, it1.rowY, it2.row1, it3.row1, ...) that have the same rows for outer tables but different rows for the inner table. We can guarantee that this won't happen if there is a functional dependency between the inner table and the outer tables: it1.row = func(ot1.row, ... otN.row) We can detect and use those two cases of functional dependency: * the inner table has at most one row * the inner table can be accessed using eq_ref(consts,outer-tables). Presense of functional dependency allows us to pull the table out of the semi-join nest. The pull-out can be performed once we've detected the constant tables and ran update_ref_and_keys(). These two actions are performed inside make_join_statistics(): make_join_statistics() { ... update_ref_and_keys(); ... do { for each non-const table if (can make table constant) const_tables |= table; } while(const_tables was widened); ... // do range analysis } We can use analogous approach to pull tables out of the semi-join nests: for (each semi-join nest) { // Action #1: Mark the constant tables to be pulled out pulled_tables= {}; for (each table T in semi-join nest) { if (T has at most one row) { pulled_tables.insert(T); } } // Action #2: Find which tables we can pull out based on // update_ref_and_keys() data. Note that pulling one table out // can allow us to pull out some other tables too. do { pulled_a_table= FALSE; for each possible ref access to some inner table T { if (ref access is eq_ref && all used tables are outer tables) { pulled_a_table=TRUE; pulled_tables.insert(T) } } } while (pulled_a_table); // Remove pulled_tables from the join nest; if (we've removed all of the tables) destroy the nest; else { //we're not capable of executing semi-join nests yet report query failure; } } NOTE: It is possible that the above may be accomplished by another strategy: create a graph of functional dependencies, topologically sort the graph, and then retrieve all const tables in one pass. This is potentially more efficient/elegant but it is harder to do: - "standard" topological sort requires the graph to be acyclic. Functional dependency graph is not (and there seems to be no trivial way to remove the cycles). - The graph edges are actually "multi-edges", they have one source and several destinations. This seems to make building the graph too complicated and not worth it?
1. SJ applicability check 1.1 Finding out if we at the AND-top-level of ON/WHERE 2. Conversion 3. Interplay with other optimizations 4. Nested semi-joins removal 5. Table pull-out 6. Testcases 7. Schedule 8. Relationship with Prepared Statements 8.1 subquery to semi-join nest conversion 8.2 Nested semi-joins flattening 8.3 Table pullout 8.3.1 Implementation of per-execution table pull-out 1. SJ applicability check ========================= SUBQ-TO-SJ-CRITERIA (copying from the HLS) are: 1. Subquery predicate is of the form (...) IN/=ANY (SELECT ...) 2. Subquery is a single SELECT (no UNIONs) 3. Subquery does not have GROUP BY or ORDER BY 4. Subquery does not use aggregate functions or HAVING 5. Subquery predicate is at the AND-top-level of ON/WHERE clause 6. #tables-in-parent-query + #tables-in-subquery < MAX_TABLES These conditions will be checked as follows: #1: check if the subquery is an Item_in_subselect #2: test(master_unit()->children == 1) #3: single 'if' #4: single 'if' 1.1 Finding out if we at the AND-top-level of ON/WHERE ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The condition #5 is: Subquery predicate is at the AND-top-level of ON/WHERE clause: Currently it is not possible to determine out of item->fix_fields() if we're at the AND-top-level or not. We can check this as follows: * Add THD::marker= 0; * In setup_conds(), for WHERE/ON conditions do: thd->marker= MAGIC_NO; cond->fix_fields(); thd->marker= 0; * In Item_cond_and: - Save thd->marker into local variable and restore after each arg_i->fix_fields() call. * In Item_cond_or::fix_fields() and Item_func::fix_fields(): thd->marker= 0; * In Item_subselect::fix_fields(): if (thd->marker == MAGIC_NO) { we're at AND-top-level of the WHERE or ON } This will also check condition #6. We'll collect a list of Item_subselects that are to be converted to semi-joins. The list will be made of in class JOIN: Item_subselect** sj_subqs; in Item_subselect: Item** next; We'll store pointers-to-pointers as we will need to replace Item_subselect with Item_int(1) later on. Startegy for the satisfying #6 is described in the HLD. 2. Conversion ============= The subquery predicate will be converted to Item_int(1). When wrapping the tables into semi-join nest, we'll need to keep the following invariants: * The semi-join nest will have sj_nest->nested_join = {NESTED_JOIN structure with the list of the children}; sj_nest->join_list= { table list this semi-join is in. This is either embedding->nested_join->list or select_lex->top_level_list }; sj_nest->next_leaf= NULL ; // nest can't be leaf * When we insert a semi-join nest into the subquery we'll need to make sure that all leaves (i.e. base tables) are connected via TABLE_LIST::next_leaf chain. * ::next_local is same as ::next_leaf but views are considered leaves. * ::next_global - it seems we don't need to update this. NOTE Look at information_schema and see what's with the Gluh's in_subquery flag. Re semi-join condition we have those considerations: - semi-join nests need to be distinguishable from outer join nests and from inner join nests. - it seems there is no merit in placing semi-join's ON condition into the sj_nest->on_expr, as on_expr is used to identify outer joins. - Igor's idea is to add semi-joins' ON condition into the WHERE (and use some flag in TABLE_LIST to tell semi-join nests from inner join nests). This makes dealing with other optimizations simpler. On the other hand we might need semi-join's ON condition for some forthcoming optimizations. 3. Interplay with other optimizations ===================================== Make those optimizations treat semi-join nest's ON conditions as if they were AND-parts of the WHERE: * Equality Propagation * Ref-analysis * Constant table detection Verify that those optimizations "properly ignore" semi-join nests: * Aggregate Functions optimization 4. Nested semi-joins removal ============================ We can first do all the outer->inner conversions (and we will not miss anything due to semi-join nest bounds), and then, at "Flatten nested joins" phase we can remove the nested semi-joins. 5. Table pull-out ================= See HLS, nothing to add here. 6. Testcases ============ Create a set of testcases that demonstrate that subqueries that are fully pulled-out are executed correctly. Query EXPLAINSs should show that the tables are moved into the upper SELECT. Create a testcase with a subquery that can't be fully pulled out into the parent and show that it produces an error. 7. Schedule =========== 2007-03-05 StartOfDay: Implement the SJ applicability check 8 hrs Implement the subquery to semi-join nest conversion procedure. 12 hrs 2007-03-09 EndOfDay: MS1: subquery predicates are converted to [possibly nested] semi-join nests. Implement flattening of nested semi-joins. 12 hrs Adjust all "other optimizations" (see LLD text) to handle the semi-join nests. 12 hrs 2007-03-19 EndOfDay: MS2: semi-join nests are flattened, i.e. everything except the pull-out prodcedure is done. Implement PullOut procedure 4 hrs Create a set of testcases (actually some of testcases 12 hrs will probably be created for earlier milestones but the delivery is in this milestone) 2007-03-23 EndOfDay MS3: WL entry complete, the code is in the state as described in "Task scope" section of the HLD. Address some unforeseen problems that will pop up 16 hrs 2007-03-29 EndOfDay MS4: WL entry complete 8. Relationship with Prepared Statements ======================================== 8.1 subquery to semi-join nest conversion ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ (See WL#3813 for a general contract between PS runtime and the optimizer) The conditions listed in SUBQ-TO-SJ-CRITERIA remain true (or false) for the entire lifetime of the statement, so subquery predicate to semi-join nest conversion should be done once per statement execution. To achieve that, we'll need - Block search/conversion attempts on non-1st execution - Make sure the TABLE_LIST allocations and Item tree changes are performed on statement's arena. 8.2 Nested semi-joins flattening ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ To be done on per-statement arena. 8.3 Table pullout ~~~~~~~~~~~~~~~~~ 1. According to the WL#3813 rule, we can pull out a table permanently if it is functionally dependent on outer tables. This is because we can count on corresponding UNIQUE/PK index not to be dropped. 2. We can't permanently pull out a 1-row constant table, because it may have more than one row on a subsequent execution. There are also cases when a table can be pulled with rule #1 once a table with rule #2 was pulled out. It looks like there is little merit in having distinct "permanent" and "per-execution" pull-out phases. A simpler solution of one single phase that is invoked for every execution seems to be more appropriate. 8.3.1 Implementation of per-execution table pull-out ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Don't really pull the tables out, leave the SJ-nest as is. // Join flattening: for each sj-nest { Reset the bitmap of pulled-out tables; Accumulate a bitmap of tables that are pulled out if (!all tables are pulled out) { abort query with an error; } } // Join optimizer, plan refinement: Ignore the SJ-nests when processing outer join nests EOF
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.