WL#1110: Subquery optimization: Materialization
Affects: Server-6.0
—
Status: Complete
Speed up subquery execution by materialising the subquery as a temporary (normally in memory) table. The materialized temporary table must be indexed so that the subquery predicate can be computed efficiently via this index. This task is a complement to WL#2980 semi-join, as it covers most of the cases not covered by semi-join.
======================================================================== HLS for WL#1110 Subquery optimization: Materialization ======================================================================== CONTENTS ======================================================================== 1. Requirements 1.1 External requirements (from management) 1.2 Internal requirements (from the developers) 1.3 Semantics of [NOT] IN 2. Introduction 3. Applicability 4. Query processing 5. Explain 6. Related work 7. Future work 1. Requirements ======================================================================== 1.1 External requirements (from management) -------------------------------------------- - The task should be designed so it can be done short time frame so it can be implemented in 5.2. - Monty required that we add the new optimizations incrementally without first removing the current (often wrong) subquery transformations, and without first simplifying the current code. 1.2 Internal requirements (from the developers) ----------------------------------------------- - Limit the scope to only IN subqueries. We do this because the support of NOT IN in the general case when there are NULL tuple components is rather complex, and is a well-defined separate task. - Limit the scope to only top-level predicates, where NULL <==> FALSE. This requirement results from the same problem as above - handling of the general case with NULLs on either side of [NOT] IN is a separate relatively complex task. 1.3 Semantics of IN ------------------- Let's denote with: - "*_i" - any value at position "i", - "NULL_i" - NULL at position "i", - "complete match"- means that all components match, and there are no NULL components, - "partial match" - means that two tuples match except in those components that contain NULL (that is, NULL acts as wildcard), - "A | T" - means ANY expression in query | TOP-LEVEL expression - "T" - TRUE - "F" - FALSE - "N" - NULL (UNKNOWN) Then the truth table in the general case of [NOT] IN is: Table 1: ================================================================================ Case | Left operand | Right operand | IN | NOT IN ================================================================================ |, | | C1| | | | Exists i, v_i == NULL | Any or Top-level expression ->| A | T | A | T -------------------------------------------------------------------------------- A.1 | | | F | F | T | T -------------------------------------------------------------------------------- A.2 | | exists complete match | impossible -------------------------------------------------------------------------------- A.3 | | (... ...) | N | F | N | F | | exists partial match | | | | -------------------------------------------------------------------------------- A.4 | | | F | F | T | T | | no partial match | | | | ------------------------------------------------------------------------------- | , | | v_i != NULL, i = 1..n | ------------------------------------------------------------------------------- B.1 | | | F | F | T | T | | | | | | ------------------------------------------------------------------------------- B.2 | | (... ...) | T | T | F | F | | exists complete match | | | | ------------------------------------------------------------------------------- B.3 | | (... ...) | N | F | N | F | | exists partial match, but no | | | | | | complete match | | | | ------------------------------------------------------------------------------- B.4 | | , neither | F | F | T | T | | complete, nor partial match | | | | ------------------------------------------------------------------------------- From the requirements in Sect. 1.2 it follows that Table 1 can be reduced to the much simpler Table 2 for top-level IN: Table 2 =============================================================== Left operand | Right operand | IN | --------------------------------------------------------------- , | | | Exists i, v_i == NULL | | | --------------------------------------------------------------- TL.1A | Anything | F | --------------------------------------------------------------- , | v_i != NULL, i = 1..n | --------------------------------------------------------------- TL.2A | (... ...) | T | | exists complete match | | ----------------------------------------- TL.3A | All other cases | F | | | | --------------------------------------------------------------- This WL will implement IN according to Table 2. 2. Introduction ======================================================================== The SQL subquery predicates (IN, ANY, EXISTS, ALL) are variants of the *logical* semi-join operation, where the left operand is a tuple from an outer query (or queries), and the right operand is the result of the subquery under the predicate. There are various *physical* semi-join algorithms that can compute a semi-join, all of them being variants of well known join algorithms. In MySQL currently we consider the semi-join variants of [block-] [index-] nested loops join, and single-pass hash join. Some of these are applicable in all cases, while others require some specific properties of the queries, and/or the schema. Depending on syntactic properties of the query and its subqueires, on data statistics, and index availability, one or another physical semi-join can be the cheapest for a particular query and table instances. The goal of this task is to implement an efficient physical semi-join implementation for the cases when a subquery predicate cannot be "flattened" to an nested-loops semi-join due to operation(s) in the subquery operand that require its materialization. The general idea is to materialize the result of the subquery into a table, and to create a hash index on all materialized fields so that the subquery predicate can be computed via hash lookups. Effectively, this results in a single-pass hash semi-join specialized for subqueries. Since the result of semi-join consists only of the tuples of the left operand that pass the join condition, the right operand of the semi-join should not contain duplicates so that each left tuple is tested exactly once against each unique right tuple. This guarantees that each of the left operands of a subquery predicate is present in the result at most once. Thus we define "subquery materialization" as an operation that materializes all distinct result tuples of a subquery when it is the right operand of a semi-join. When we use the term "subquery materialization" we will assume that it includes duplicate elimination as well. 3. Applicability ======================================================================== For this task, the decision how to execute a subquery predicate will be made in two steps. We will first decide whether to execute a subquery via a logical semi-join operation based on syntactic properties of the query. If semi-join was chosen, then we will further decide whether to execute the semi-join through subquery materialization (that is hash semi-join, the topic of this WL), or through index-nested loops semi-join (described in WL#2980). The decision about logical semi-join will be made during the PREPARE phase of the outer query, just before the IN => EXISTS transformation, so that it is possible to disable the transformation if semi-join is chosen. The decision about the physical semi-join method has to be made in the beginning of the OPTIMIZE phase of the outer query, because we need to be able to analyze the available indexes in all subqueries, but the join method must be chosen before equality propagation, and join simplification, because they depend on the chosen physical semi-join. Below we describe these rules in more detail. Let the subquery Q_i is a direct child of an outer query Q_[i-1]. 3.1 During the PREPARE phase of Q_[i-1] we determine that: ---------------------------------------------------------------------- The subquery predicate SQP_i(outer_ref, Q_i) should be executed via logical semi-join if: 3.1.1 SQP_i is IN (or "= ANY"). 3.1.2 SQP_i is a top-level predicate, that is: - referenced in an expression in the WHERE or ON clauses, - not an argument of an expression - not under NOT so that for the subquery predicate, FALSE <=> NULL. 3.1.3 Q_i is not a UNION, and is non-correlated. TODO: In addition, 3.1.3 can be extented so that Q_i is either of: - correlated to Q_[i-1] but - has none of GROUP BY, ORDER BY [LIMIT], aggregate functions, and - is not under NOT IN (such that we can execute Q_i without materialization), or - correlated to queries that contain Q_[i-1], thus the left IN operand is constant inside Q_[i-1], which allows us to materialize Q_i. 3.2 In the beginnig of the OPTIMIZE phase, we determine that: ---------------------------------------------------------------------- If in step 3.1 above the query preprocessor chose to execute SQP_i via logical semi-join, then the optimizer applies the following rules to choose the physical semi-join algorithm: 3.2.1 SQP_i is executed via subquery materialization (that is, via single-pass hash semi-join) if Q_i contains at least one of: - GROUP BY - ORDER BY ... LIMIT or - there is no index for the SELECT fields of the subquery TODO: Check why do we have the limitation: mysql> select * from t1 where a1 in (select b1 from t2 order by b1 limit 1); ERROR 1235 (42000): This version of MySQL doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery' 3.2.2 Otherwise SQP_i is executed via index-nested loops semi-join, or IN=>EXISTS transformation, whichever is applicable as described in WL#2980 "Subquery optimization: Semijoin". The above rules will be applied during the OPTIMIZE phase of the outer query Q_[i-1], before before equality propagation, and simplify_join. QUESTION: - ANY (SOME), ALL are handled by MIN/MAX rewrite. - Are there any cases when it is better to materialize? - What about "= ALL (...)" ? It should be handled as MIN = MAX. Is that so? - Check if it is always the case that ALL is rewritten into MIN/MAX. If that is the case, it should be always cheaper to execute as MIN/MAX without materialization, no matter if there is a suitable index or not. 4. Query processing ======================================================================== Since the subject of this WL is semi-join execution via subquery materialization, this section provides an outline of the processing of IN subqueries with materialization. The implementation details are given in the HLS. For an outer query Q_[i-1], having a top-level IN predicate SQP_i in an ON or WHERE clause, with a subquery Q_i: 4.1 The query preprocessor: --------------------------- - Decides whether to use logical semi-join or not for the predicate, as described in Sect. 3.1 above. - If it chose semi-join, it blocks the transformation of non-correlated IN into correlated EXISTS. - Prepares the subquery Q_i. 4.2 The query optimizer: ------------------------ - Chooses the physical semi-join method for the subquery predicate as described in Sect. 3.2. - If the chosen execution method is index-nested loops semi-join then proceed as described in WL#2980 "Subquery optimization: Semijoin". else /* subquery materialization */ - Set up subquery execution-related structures. - Creates a temporary table for the materialized result of the subquery. - Creates a unique hash index on the SELECT list of the subquery so that we skip duplicate subquery results. - Sets up the execution plan so that when it executes subquery for the first time, its result is stored in the temoprary table. 4.3 The execution engine: ------------------------- - At the first execution of SQP_i (Item_in_subselect::val_int): - Optimize the subquery Q_i. - Execute the subquery through the JOIN object created for the subquery during PREPARE and materialize its result in the temporary table created during the prepare phase. While materializing, skip all duplicates through the unique hash-index on the temporary table. - Use the existing uniquesubquery engine to do index lookup into the temporary table's unique hash index. - Perform a lookup with the left IN operand into the temporary table, and compute the result for the IN predicate. - Cache the left IN operand, and the predicate result. - At each subsequent execution of the IN predicate: - If the left IN operand is the same as the cached one, - return the corresponding cached IN result - else - Perform a lookup with the left IN operand into the temporary table, and compute the result for the IN predicate. - Cache the left IN operand, and the predicate result. 5. Explain ======================================================================== The output of EXPLAIN for a subquery must be changed so that it displays "materialized" for the subquery. TODO: check the approach for WL#2980, and make it similar. 6. Related work ======================================================================== - PostGresSQL uses hash indexes on disk with multi-pass hashing which in general will give them better performance. - We have materialization via hash tables only if in the table with index fits in main memory, while otherwise MySQL swaps the table to disk and indexes it with a Btree index. - Didn't check it, but it is very likely the other DB vendors also use similar technique to PG. => Therefore to be competitive we need multi-pass hashing as described in Sect. 7.5 below. 7. Future work ======================================================================== 7.0 Implement correct NULL semantics according to Table 1 in Sect. 1.3. 7.1 Choose the subquery execution method on a cost-based basis during the cost-based optimization phase. 7.2 Consider correlated (dependent) subqueries for which it is possible to isolate a selective non-correlated fragment (invariant), and materialize this fragment. See the paper (implemented in Sybase): "Reusing invariants: A new strategy for correlated queries" http://portal.acm.org/citation.cfm?doid=276304.276309 7.3 Consider correlated subqueries in the SELECT clause which can't be transformed to joins and for which there is no index that can be used to find the rows in the subquery tables, e.g.: SELECT A IN (SELECT no_key FROM t2 WHERE key=10 AND t1.a=t2.b) FROM t1; 7.4 If there is no aggregate function, GROUP BY <=> DISTINCT, then it should be possible to use semi-join instead of materialization 7.5 The current implementation is limited by the current implementation of memory hash indexes. Once the table and index don't fit into main memory, the table is written to disk, and the index is changed to B-tree. This will result in much worse performance compared to competitive products. To remedy the situation, we need an implementation of multi-pass hashing as described in WL#2106. 7.6 Detect if the subquery produces 1 or 0 rows, and use this information to avoid the overhead of temporary table creation. This can be detected based on various properties: - subquery syntax (e.g. LIMIT 1) - conditon analysis of the subquery (e.g. "WHERE col > 1 and col < 1") - based on index analysis (e.g. "WHERE pk_col = const") - first execution 7.7 If a subquery is executed as a regular query, and it needs a temp table, e.g. to compute GROUP/ORDER BY, then we will create two temp tables - one to compute the operation, and one into which we will materialize the final result. In some cases it is possible to reuse the first table as the materialized table. Figure out a way to reuse the first table as a materialized table to avoid double materialization. 7.8 Improve the performance of lookups for partial matches. The current sequential scan method (LLD, Sect. 2.2.2) to perform partial matches for tuples with NULL components may be quite inefficient for tables with large numbers of NULLs. This approach can be improved in the following ways. 7.8.1 Nested loose scans This is a modification of LLD.2.2.2.1, where we use a B-tree to store tuples with NULLs instead of sequential storage. Then we can use the loose index scan approach as follows. Depending on the lenght of an index keypart prefix, we can look at all keys with this prefix as a sub-grouping of all keys. Then, for each key part from left to right, we: - check if the index contains a NULL for this keypart, - if so, set the current keypart of the lookup key to NULL, move to the next key part, and move to the next key part for a new nested loose scan. - otherwise look for the first index key that has the same value in the current keypart as the lookup key. - if one is found, then set the current keypart of the lookup key to this value, and move to the next key part for a new nested loose scan. - otherwise return "not found" The above algorithm tries to find a partial match for each keypart, and if not found, tries to find a complete match for the keypart, and then to expand it with a partial match. 7.8.2 IO_CACHE with offsets pointing to next groups The idea in 7.8.1 can be applied also to a sequential sorted storage of keys. We can consider such storage as a linear representation of an ordered index. Then for each key that is the beginning of a new group with some common prefix, we can add a tuple that tells us what is the offset of the first tuple in the next group in the same subgroup. This would allow us to "jump" from one group to another inside an outer group, in the same way as we would jump with the nested loose scans approach above. This can be implemented by modifying the key comparison function used by the Unique class, so that it stores its data in reverse order. Then we could fill the IO_CACHE buffer from the end, and set all offsets in one pass. WARNING: According to Guilhem, it is not clear whether IO_CACHE can be used for seek-ing if there is no pre-created file for the IO_CACHE. There is a confusing comment re seeks: "** Should be used when no seeks are done (only reinit_io_buff)". 7.8.3 Role inversion of IN operands (block hash semi-join) If the left IN operand has no NULL components (or has few tuples with NULLs), then we could invert the roles of the left and right operands, and instead perform lookups from the right into the left operand. One way of doing that would be to accumulate the tuples of the left operand into an in-memory hash table, and once this table is filled, test all its tuples by scanning the materialized right operand, and doing lookups into the partially materialized left operand. This in effect would implement a block hash semi-join.
======================================================================== LLD for WL#1110 Subquery optimization: Materialization ======================================================================== CONTENTS ========================================= 1. Query execution 1.1 Subquery engine for hash semi-joins 1.2 Naive IN execution with top-level predicate semantics 1.3 Item_in_subselect shortcut evaluation 2. Query compilation 2.1 Applicability test 2.2 Disabling IN => EXISTS transformations 2.3 Optimization 2.4 Code generation 3. Explain 4. Re-engineering 5. Schedule 5.1 Phases and time estimates 5.2 Plan with dates 1. Query execution =================== We first describe how do we plan to execute IN predicates via hash semi-join. Then in Sect. 2 we describe the changes to the optimizer needed to generate such query plans. Hash semi-join execution of IN predicates follows the same approach as currently - it based on execution via a subclass of subselect_engine. 1.1 Subquery engine for hash semi-joins --------------------------------------- All execution details of subquery materialization, and subsequent index lookups are encapsulated in the class subselect_hash_semi_join_engine. This class behaves as subselect_uniquesubquery_engine in the sense that during query execution it performs index lookups in the same way. However, this new class differs from subselect_uniquesubquery_engine in two ways: - Before the subquery predicate is evaluated for the first time, it has to first materialize the subquery into a temporary table. - When performing lookups it must take care of subquery result tuples with a NULL component. Currenlty if the temporary table contains such tuples, a lookup will return no result, while subquery predicate semantics requres the lookup to return NULL as its result. That is, in general: <... val ...> [NOT] IN <... NULL ...> => NULL, instead of "no match". For top-level queries, NULL is mapped to FALSE, and there is no problem with the IN predicate. Still, there is a problem with top-level NOT IN predicates, because if a NULL result of IN is mapped to FALSE, then the result of NOT IN will be NOT(FALSE) = TRUE, instead of FALSE (since NOT(NULL) = NULL, which is mapped to FALSE). NOTE: This issue will be solved by a separate task: WL#3830: Subquery optimization: Materialization: Partial matching of tuples with NULL components Based on these similarities and differences, we design subselect_hash_semi_join_engine as a subclass of subselect_uniquesubquery_engine. class subselect_hash_sj_engine: public subselect_uniquesubquery_engine { protected: /* TRUE if the subquery was materialized into a temp table. */ bool is_materialized; /* The old engine already chosen at parse time and stored in permanent memory. Through this member we can re-create and re-prepare materialize_join for each execution of a prepared statement. We also resuse the functionality of subselect_single_select_engine::[prepare | cols]. */ subselect_single_select_engine *materialize_engine; /* QEP to execute the subquery and materialize its result into a temporary table. Created during the first call to exec(). */ JOIN *materialize_join; public: subselect_hash_sj_engine(THD *thd, Item_subselect *in_predicate, subselect_single_select_engine *old_engine) :subselect_uniquesubquery_engine(thd, NULL, in_predicate, NULL), is_materialized(FALSE), materialize_engine(old_engine), materialize_join(NULL) {} ~subselect_hash_sj_engine(); bool init_permanent(List- *tmp_columns); bool init_runtime(); void cleanup(); int prepare() { return 0; } int exec(); void print (String *str); uint cols() { return materialize_engine->cols(); } virtual enum_engine_type engine_type() { return HASH_SJ_ENGINE; } }; In this section we will describe only the methods related to execution. 1.2 Naive IN execution with top-level predicate semantics --------------------------------------------------------- The execution of a [NOT] IN predicate via hash semi-join is encapsulated in the exec() method below. The algorithm below implements the IN truth Table 2 from HLS.Sect.1.3, but it doesn't implement correct NULL semantics as described in Table 1 in the HLS. The reason is that lookups with keys that do not contain NULLs into a hash table that contains keys with NULL components will return FALSE (no match) instead of NULL. int subselect_hash_semi_join_engine::exec() { /* Optimize and materialize the subquery during the first execution of the subquery predicate. */ if (!is_materialized) { ASSERT(subquery_plan is not optimized); subquery_plan->optimize(); subquery_plan->exec(); is_materialized = TRUE; if (the subquery returned no rows) { There is no need to perform lookups for empty subqueries, mark IN as false, and mark that subquery result is empty; return FALSE; } } /* Perform a hash-index lookup if the left predicate operand changed. Notice that the exec() method below updates item::value, and item::null_value, thus if we don't set these, the next call to item::val_int() will return whatever result was computed by its previous call. */ if (test_if_l_operand_changed()) return subselect_uniquesubquery_engine::exec(); return FALSE; } /* Check if any of the components of the left operand cache changed, and update the cached values of each one. TODO: this method does the same as test_if_group_changed() - needs refactoring. */ bool subselect_hash_semi_join_engine::test_if_l_operand_changed() { for each l_val in l_operand_cache if l_val->cmp() return TRUE; return FALSE; } 1.3 Item_in_subselect shortcut evaluation ----------------------------------------- TODO: THIS SECTION IS NOT IMPLEMENTED YET!!! From the [NOT] IN truth table in the HLS, Sect. 1.3, it follows that when [NOT] IN is a top-level predicate, we may compute the result of the predicate without performing lookups into the materialized subquery. We distinguish two cases: A) The left operand contains a NULL component, and the predicate is a top-level IN. - In this case the result of IN is always FALSE. - This case is currenlty (incorrectly in the case of NOT IN) handled by Item_in_optimzer, covering the following cases of the IN truth table (Sect. 1.3 in the HLS): A.1, A.3, A.4, column 2. As noted in Sect. 4.3 we will remove Item_in_optimzer and will merge it with Item_optimizer at a later implementation phase. B) The materialized subquery has an empty result. - In this case the result of IN is always FALSE, and for NOT IN is always TRUE, and there is no need to invoke the subquery engine once we find this out. - In the IN truth table in the HLS, these are cases: A.1, B.1. For case (B) we override Item_subselect::exec(), and add a new member: class Item_in_subselect { bool empty_subquery; /* Set to FALSE initially. */ bool exec() { if (empty_subquery) { /* If the subquery engine detected that the result is empty, it set correctly the Item's value as FALSE, so no need to recompute it. */ return FALSE; } else return Item_subselect::exec(); } } 2. Query optimization ===================== 2.1 Applicability test ---------------------- The decision whether to execute an IN predicate via hash semi-join is made during the JOIN::prepare phase of the outer query. This approach uses the fact that for each Item_in_subselect there is a default subselect_engine that contains a JOIN object for the subquery of the IN predicate. Item_in_subselect::fix_fields() invokes the creation of the JOIN object for the corresponding subquery, and subsequently invokes JOIN::prepare for the subquery. This results in a recursive invocation of the prepare phase for all subqueries. The applicability test is positioned after the recursive call, thus we decide whether to use a hash semi-join in a post-order manner. The result of the test is recorded in a new member of Item_in_subselect: Item_in_subselect::use_hash_sj The test is implemented inside JOIN::prepare, as follows: JOIN::prepare { ............ /* recursive call to Item::fix_fields() setup_without_group() ............ if (!thd->lex->view_prepare_mode) { Item_subselect *subselect; /* If we are in a subquery predicate. */ if ((subselect= select_lex->master_unit()->item)) { /* Determine an execution method for the subquery predicate 'subselect'. Let SP_i is the subquery predicate (called 'subselect' in the code, and Q_i is its right operand (subquery). Then P_i is executed via a hash semi-join under the following conditions: */ if ( 1. SP_i is an IN predicate && 2. Q_i is not a UNION && 3. Q_i is not a table-less query && 4. the outermost SQL statement is a SELECT && 5. SP_i is a top-level item && 6. Q_i is not correlated) { /* In this case we record this decision: */ SP_i->use_hash_sj = TRUE; } transform the subquery predicate and its subquery } } ......... } The result of the test above is used to block the current transformation from non-correlated IN => correlated EXISTS, as described in the next section. 2.2 Disabling IN => EXISTS transformations ------------------------------------------ To make it possible to execute an IN predicate via semi-join, we have to disable the current transformation of non-correlated IN predicates into correlated EXISTS, so that we can materialize the unmodified subquery. We modify Item_in_subselect::[single | row]_value_transformer() so that it still wraps Item_in_subselect in an Item_in_optimizer, but doesn't invoke the IN => EXISTS transformation in the case of semi-join as follows: Item_in_subselect::[single | row]_value_transformer() { Wrap 'this' into a new Item_in_optimizer as it is done now. if (use_hash_sj) return OK Transform IN => EXISTS as it is done now. } TODO: The previous version of the WL said that it isn't necessary to change Item_in_subselect::single_value_transformer() because this WL will not change the execution of single-row subqueries. I don't understand why this was decided. Is it known to be generally more efficient to use IN=>EXISTS for IN with one column? 2.3 Subquery optimization ------------------------- Currently no cost-based optimization will be performed for subquery execution. Once the applicability test in 1.1 decides in a rule-based manner to use hash semi-join to execute an IN predicate, the decision is irreversible. The subqueries themselves are optimized in a lazy manner, during the first execution of the IN predicate. For details see Sect. TODO. 2.4 Code generation ------------------- We add a new phase that creates and initializes all necessary structures for hash semi-join execution of IN close to the end of JOIN::optimize, after all optimizations completed, particularly after substitute_for_best_equal_field(), so that we get all field references correct. The general outline of the change in JOIN::optimize is: bool JOIN::optimize() { ....... substitute_for_best_equal_field() ....... make_join_readinfo() ........ /* Create all structures needed for materialized subquery execution. */ if (this is the outer-most query) { For each subquery SQ_i at any level in the global subquery list { subquery_predicate = predicate that SQ_i is an operand of if (subquery_predicate is IN && subquery_predicate->use_hash_sj) if (in_subs->setup_hash_sj_engine() return 1; // error } } Create other types of subselect_engines. ....................... } The new method Item_in_subselect::setup_hash_sj_engine is the entry point for all code generation needed for hash semi-join execution. The outline of this method is: Item_in_subselect::setup_hash_sj_engine() { if (the current engine is a single_select_engine) { set memory root to permanent memory current engine = new hash_sj_engine perform all permanent initialization of current engine restore old memory } perform all per-execution initialization of the current engine } With the method above: - all code generation is in one place - there is a clear distinction of what data structures and initialization is done in permanent memory, and what is repeated with each subquery (re-)execution. 2.4.1 Permanent initialization All initialization that should live across prepared statements (re)execution is done in the method: bool subselect_hash_sj_engine::init_permanent(List of subselect fields) { 1. Create/initialize materialization related objects. Create and initialize a select result interceptor that stores the result stream in a temporary table. The temporary table itself is managed (created/filled/etc) internally by the interceptor. 2. Create/initialize execution related objects. - Create the JOIN_TAB to be used by unique_subquery_engine. - Set its table to be the temp table with the subquery result. - Set its TABLE_REF (JOIN_TAB::ref) to represent index lookup into the temp table. } 2.4.1 Runtime initialization Due to the fact that the current design is such that the default subquery engine subselect_single_select_engine recreates and reoptimizes its JOIN object before each execution, and since we reuse this object to materialize the subquery, some parts of the initialization of our class need to be repeated for each execution. These are seprated in the method: bool subselect_hash_sj_engine::init_runtime() { /* Create and optimize the JOIN that will be used to materialize the subquery if not yet created. */ materialize_engine->prepare(); /* Let our engine reuse this query plan for materialization. */ materialize_join= materialize_engine->join; materialize_join->change_result(result); /* Initialize the cache of the left predicate operand. */ return parent_item->init_left_expr_cache(); } Above we do the initialization of the left operand cache here, because the way Cached_item is architected, it needs to be reallocated before each execution. 3. Explain ========== TODO Currently the only change related to explain is the new method: subselect_hash_sj_engine::print(). One specific thing with it is that it may be called by DBUG_PRINT before the engine is completely initialized. 4. Re-engineering ================= * Must do for this task: 4.1 Move the choice and creation of a subquery execution engine from the parsing phase (Item_subselect::init is called at parse time) to the prepare phase of the outer select. Most likely this decision can be done in Item_subselect::fix_fields and/or Item_subselect::init_hash_semi_join(). 4.2 Rename (and eventually re-engineer) the 'select_union' result sink to a generic 'select_materialize' class that stores all results in temporary table. * Good to do, but not crucial: 4.3 Merge the functionality of Item_in_optimizer with Item_in_subselect, and remove Item_in_optimizer. 4.4 Move the optimization of independent subqueries outside of the execution phase to either the prepare or optimize phase of the outer query. 5. Schedule =========== 5.1 Phases and time estimates The general approach is to get as quickly as possible to the execution code in order to reduce the project risk, and then to go back and refine the conditions under which we choose to use hash semi-join. Phase 1: Prototyping - unconditional hash semi-join execution ---------------------------------------------------------------------- - Unconditionally set hash semi-join as a subquery execution method - via a stub for Item_in_subselect::is_semi_join_applicable(). completed: 100% - Unconditionally disable the IN => EXISTS transformation (Sect. 2.2). completed: 100% - Unconditionally choose hash semi-join via a simplified version of Item_in_subselect::best_exec_method() completed: 50% - Hook Item_in_subselect::best_exec_method() to the optimizer. completed: 100% time: 24 h = 3 days Implement: - class subselect_hash_semi_join_engine, and methods: - subselect_hash_semi_join_engine::init() The main risk here is temporary table creation. time: 32 h = 4 days - subselect_hash_semi_join_engine::exec () The risk here is filling the temporary table. time: 24 h = 3 days - Item_in_subselect::exec() according to LLD.2.0. time: 1 h ~= 0 time: 56 h = 7 days - Add regression tests for IN queries, fix all discovered problems. time: 16 h = 2 days TOTAL: 96 h = 12 days M1: Query materialization and execution code implemented according to the current design, and it is possible to execute typical (simple) subqueries with top-level IN. Since hash semi-join is chosen unconditionally, other queries will fail (or crash). => Submit the code for Review 1. Phase 2: Design & schedule re-evaluation and code review ---------------------------------------------------------------------- - Update the LLD with more detailed description of temporary table creation. - 8 hours = 1 day - Analyze the implementation of Phase 1 for design flaws. - Update the LLD with more detailed description of other critical parts. - Update the WL schedule according to lessons from Phase 1. - 8 hours = 1 days - Overall design (LLD) review - 8 hours = 1 day - Post-review 1 and post-re-design code changes. time: 16 h = 2 days TOTAL: 40 hours = 5 days M2: Updated and approved design & schedule. Phase 3: applicability conditions and conditional hash semi-join ---------------------------------------------------------------------- - Implement the applicability test per Sect. 2.1 except for the branch for nested-loops semi-join. time: 8 hours = 1 day M3.a: At this point it should be possible to execute any query with subqueries, however I expect to encounter many unforceen problems with other old code. - Add new test cases that test the selection of the new execution method. Fix all encountered problems. time: 16 hours = 2 days - Run the regression test suite, investigate the failed test cases, and fix problems related to the new code. time: 24 hours = 3 days TOTAL: 48 hours = 6 days M3: Working implementation of the WL. => Submit the code for Review 2. Phase 4: EXPLAIN support and second code review ---------------------------------------------------------------------- - Add EXPLAIN support, and test cases with the new EXPLAIN output that make sure materialization is applied when it had to. time: 16 hours = 2 days - Post-review 2 changes time: 40 h = 5 days TOTAL: 56 hours = 7 days M4: Selected plan visible in EXPLAIN, all tests pass, code review passed. Phase 5: additional tests & documentation ---------------------------------------------------------------------- - Add tests for various uses of subqueries inside viewes, update statements, etc. time: 8 hours - Write documentation for User's manual. time: 8 hours TOTAL: 16 hours = 2 days M5: Task complete. Total time for the whole task: ---------------------------------------------------------------------- optimistic: 256 hours (based on the estimates above) = 32 days pessimistic: add 80 hours for unpredicted problems = 328 hours = 42 days 5.2 Plan with dates ------------------- TODO
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.