WL#1914: Plan Enumeration for Smart Explain
Affects: Server-7.1
—
Status: Assigned
This is a subtask of task #1488 "Implement Smart Explain". Implement only "SAVE PLAN SELECT ....... RETAINING ..." , starting first without RETAINING clause. 1. QUERIES HANDLED Only SELECT queries will be handled in this subtask. 2. OPTIMIZER CHANGES 2.1 EXPLAIN-related All optimizer functions will be able generate N best plans (ie. array of plans with increasing costs). Combined together, the optimizer will produce N best plans for the given query. 2.1.1 N-WAY RANGE OPTIMIZER get_N_trp_plans() { Apply current SQL_SELECT::test_quick_select algorithm and save all plans it evaluates. create SEL_TREE; put all possible range scans including ones that are more expensive then full table scan; regular index_merge: use the algorithm from the paper [1] to produce N best index_merge scans and put them into the queue; ROR-index_merge: Both ROR-index_merge search algorithms examine some sequence of possible ways to do ROR-index_merge. Save all these ways and their costs. index scans: add cost of full scan on best covering index. add cost of full table scan; return queue with all scans; } 2.1.2 N-WAY JOIN OPTIMIZER The N-best JOIN plans exhaustive search generation procedure: procedure find_N_best( partial_plan in, /* in, partial plan of tables-joined-so-far */ partial_plan_cost, /* in, cost of partial_plan */ remaining_tables, /* in, set of tables not referenced in partial_plan */ best_plan_so_far, /* in/out, best plan found so far */ best_plan_so_far_cost)/* in/out, cost of best_plan_so_far */ { for each table T from remaining_tables { for each table read plan (T) { for each in sub_selects_plan() { /* Use the algorithm from find_best but also generate conditions */ cost = complex-series-of-calculations; /* Add the cost to the cost so far. */ partial_plan_cost+= cost; if (queue.size == N && queue.bottom.cost() <= partial_plan_cost) { /* partial_plan_cost already too great, stop search */ continue; } partial_plan= expand partial_plan by best_access_method; remaining_tables= remaining_tables - table T; if (remaining_tables is not an empty set) { find_best(partial_plan, partial_plan_cost, remaining_tables, best_plan_so_far, best_plan_so_far_cost); } else queue.push(best_plan_so_far, best_plan_so_far_cost); } } } return queue; } 2.1.3 N-WAY UNION OPTIMIZER The following algorithm will be used to produce N best plans for UNION. get_first_plan() { for_each_union_part i { queue.put(plan(a1[0], ... ai[1], an[0]); } return plan(a1[0],a2[0], ..., an[0]); } get_next_plan() { elem= queue.get_prev(); for_each_union_part i { queue.put(a1[elem.a1idx], ... a1[elem.a_i_idx + 1], ... ); } } (I have a proof that it works) 2.2 BETTER COST ESTIMATES 2.2.1 TABLE ACCESS COST ESTIMATES The cost of any operation will be returned as vector < IO_OPS, // # of IO operations IO_COST // cost of one IO operation CPU_COST // CPU use cost. > The index_merge cost estimates code will be modified to return the appropriate tuple (taking 'sweep' scans into account). The range access cost estimates will be modifed to return appopriate cost vector. The table access will be considered to be always random. The 'all' access methods cost will be modified to reflect that data is read 'sweep'. 2.2.2 SINGLE TABLE CONDITION SELECTIVITY ESTIMATES The predicate pushdown will be taken into account. This will be achieved by calling make_cond_for_table in join optimizer and calculating condition selectivity. It is currently assumed that different fields values are independent (correlation=0). There will be two modes of selectivity calculation: 2.2.2.1 There are no range scans on the table. The selectivity calculated in recursive fashion, using double Item::selectivity(); function. sel('field= const_expr') == SELECTIVITY_FIELD_EQ_CONST = 0.3 sel('field moreless const_expr') == SELECTIVITY_FIELD_CMP_CONST=0.5 where moreless is >, >=, <, or <=. sel('field !=const') == 1-SELECTIVITY_FIELD_EQ_CONST sel('a AND ... AND c') and sel('a OR ... OR c') are calculated from arguments selectivities, assuming the arguments are indepenedent. sel()= SELECTIVITY_DEFAULT = 0.5. 2.2.2.2 There are range scans on the table. In this case the indexes provide selectivity info and we use it. We use it in the following way: A. In top-level AND/OR condition we save selectivities of conditions that produce NULL SEL_TREEs and later take them into account using rules of (2.2.2.1) e.g in condition "key1 < c1 AND (non_key = c2)" selectivity of "(non_key = c2)" will be taken into account. No changes in non-top level AND/OR. B. Once the SEL_TREE has been generated we produce estimates as follows: if (ROR-index_merge is possible) selectivity= selectivity of best ROR-index_merge; else { if (this is a range sel_tree) { used_fields= {empty}; for (each key scan, ordered by selectivity) { if (!(intersect(fields used in key scan, used_fields))) { used_fields |= fields used in key scan; selectivity *= key_scan.selectivity; } } } else pick best index_merge and use its selectivity. } update selectivitites with conditions saved in A. 2.2.4 CONDITION EVALUATION COST ESTIMATES class Item will be extended with cost functions: /* Return cost of item initialization. Initialization is performed before any records are retrieved. This should be 0 for everything except subselects */ COST_VECTOR Item::startup_cost(); /* Return cost of processing one record. */ COST_VECTOR Item::per_rec_cost(); Item_cond_and and Item_cond_or will take into account the probabilities of their arguments to be true. The "i-th argument is true " variates will be considered independent. 2.2.5 WHAT WILL NOT BE DONE ORDER/GROUP BY cost will not be calculated in this subtask. 3. PLAN TABLES CONTENTS 3.1 CREATION/USE POLICY The SAVED_* tables are assumed to be present in mysql database and be declared correctly. If this is not the case the error will be produced. If any of SAVED_* tables already contains a record which has a PK field with the same value as one we're going insert, it will be overwritten.' The following issue will not be addressed in this WL: It is possible to manually insert records into SAVED_* tables such that they will not be overwritten by "SAVE PLAN " command, but they will appear as additional parts of the plans that were saved by "SAVE PLAN". 3.2 COSTS The COST field in any table is calculated as follows: COST = IO*IO_COST + MEM_COST + CPU_COST MEM_COST is maximum amount of buffer memory used during query execution (w/o temp disk memory) IO_parent= SUM_children(IO_child); IO_COST_parent in set to such value that IO_parent *IO_COST_parent = SUM_children(IO_child*IO_COST_child). 3.3 STEPS TABLE The STEPS table contains a directed tree of table accesses. All cost-related step fields display the total cost of accessing the table in course of the entire query (contrary to current EXPLAIN output) 3.4 SCANS TABLE The SCANS table contains a directed tree of index scans. 3.4.1 RANGE SCAN For 'index only' range scan a single record is added into scans with all corresponding field values. For range scan retrieving full rows, two records are added into SCANS table. 1) Parent: rowid-to-row scan 2) Child: index scan (the same as in index_only case). 3.4.3 INDEX MERGE SCAN Each of the merged index scans is represented by a separate record in table SCANS. Each merging pass is represented by a separate record. Rowid-to-row scan, if present, will be represented by a separate record, too. 3.4.4 FULL INDEX SCAN For full index scan (index access method in current EXPLAIN) one record will be put into scans table. 3.4.5 'RANGE CHECKED FOR EACH RECORD' SCAN The cost of this scan is the same as cost of full table scan. 3.4.6 FULL TABLE SCAN For full table scan a record will be put into SCANS table. 3.5 CONDITIONS TABLE The CONDITIONS table contains a tree of predicates (and their selectivities) applied at given scan. 4. SYSTEM VARIABLES 4.1 @@explain_limit The "explain_limit" system variable will be created and used to limit number of evaluated plans. The optimizer code will produce @@explain_limit plans (or less if there are less plans possible). The plan condition will be applied after @@explain_limit plans were produced. If the plan condition application reduces # of plans, no additional plans will be produced by the optimizer. Default @@explain_limit value is 10. 4.2 @@memory_cost_factor Memory cost multiplier. The amount of memory (in bytes) is multiplied by @@explain_memory_cost before it is used in EXPLAIN calculations. User can adjust the value of this variable to force more/less memory-intensive plans to be preferred. 4.3 @@cpu_cost_factor CPU cost multipler. All CPU operations costs are multiplied by @@cpu_cost_factor. User can adjust the value of this variable to change the relative cost of CPU and disk/memory operations. [1] Ihab F. Ilyas, Walid G. Aref, Ahmed K. Elmagarmid. Supporting Top-k Join Queries in Relational Databases
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.