WL#2475: Batched range read functions for MyISAM/InnoDB
Affects: Server-6.0
—
Status: Complete
This task is to implement batched range read functions for the MyISAM and InnoDB engines. The implementation shall optimize batched index access to table rows for a sequence of disjoint ranges. Eventually the functions have to be implemented according to the specifications presented in WL#2474, but the first implementation should rather follow the interface of WL#2126 (this is a requirement of Monty).
1. Introduction
2. Applicability criteria
3.1 The case with MRR/MyISAM for index_only scans
3.2 The case with MRR/MyISAM when there are BLOBs and sorted output is needed
4. Implementation of MRR interface functions
5. Cost model
5.1 Number and parameters of "sort and sweep" steps
5.2 Cost of one "sort and sweep" step
6. Memory costs per record
7. Note about index condition split up
8. How to fit it into current table handler functions

1. Introduction
---------------
This worklog entry describes an implementation of the MRR interface for the
MyISAM and InnoDB table engines (it seems that everything said here can be
applied to BDB as well, but for now we don't include BDB).

The advantage of the MRR/MyISAM implementation over the default MRR
implementation (the MRR-to-non-MRR calls converter) will be better
performance. The speedup will be attained by accessing table records
sequentially, in disk "sweeps", instead of the random disk accesses done by
the default MRR implementation. In order to arrange for sequential table
record access, MRR/MyISAM will have to use more memory and CPU time, but we
expect this added cost to be smaller than the cost saved by using faster
disk I/O.

The table access will be performed using the following algorithm:

  while (have more ranges to scan)
  {
    scan index range(s) and accumulate N rowids in a buffer;
    sort the rowids buffer;
    /* Access table records in a sweep: */
    for each rowid in the rowids buffer
    {
      retrieve full table record;
      pass it to output;
    }
  }

Note that it doesn't make sense to use MRR/MyISAM if we don't need to
access full table records. Also note that the records are not passed to
output in key order. It is possible to design variations of this algorithm
that do produce records in key order (see a file attached to WL#2475), but
this will not be implemented in the scope of this worklog entry.

Since a table handler must always provide the MRR interface, MRR/MyISAM and
MRR/InnoDB will have an option to bypass the "native" implementation and
use the generic MRR implementation instead. The decision between using the
MRR/MyISAM implementation and bypassing it will be based on whether the
MRR/MyISAM implementation is applicable and whether it is cheaper to use
than the default MRR implementation.
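To make the control flow concrete, here is a minimal standalone C++ sketch
of the sort-and-sweep loop. The index scan and record read here
(fake_index, scan_next_rowid(), read_full_record()) and the integer Rowid
type are simplified stand-ins for illustration, not the actual server API:

  #include <algorithm>
  #include <cstdint>
  #include <cstdio>
  #include <vector>

  using Rowid= uint64_t;  // real rowids are engine-specific byte strings

  /* Stand-in index scan: returns false when all ranges are exhausted */
  static std::vector<Rowid> fake_index= {42, 7, 19, 3, 28, 11};
  static bool scan_next_rowid(Rowid *out)
  {
    if (fake_index.empty())
      return false;
    *out= fake_index.back();
    fake_index.pop_back();
    return true;
  }

  /* Stand-in for "retrieve full table record, pass it to output" */
  static void read_full_record(Rowid r)
  {
    std::printf("read row %llu\n", (unsigned long long) r);
  }

  void disk_sweep_scan(size_t buffer_capacity)
  {
    std::vector<Rowid> buf;
    buf.reserve(buffer_capacity);
    bool more= true;
    while (more)
    {
      buf.clear();
      /* Scan index range(s), accumulate up to N rowids in the buffer */
      Rowid r;
      while (buf.size() < buffer_capacity && (more= scan_next_rowid(&r)))
        buf.push_back(r);
      /* Sort rowids so table pages are visited in file order */
      std::sort(buf.begin(), buf.end());
      /* Access table records in one disk "sweep" */
      for (Rowid rowid : buf)
        read_full_record(rowid);
    }
  }

  int main() { disk_sweep_scan(4); }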
2. Applicability criteria
-------------------------
The described MRR/MyISAM algorithm will not be used in the following cases:

3.1 The case with MRR/MyISAM for index_only scans
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
MRR/MyISAM doesn't provide any benefit when we only need to retrieve fields
covered by the index. Therefore, when an index_only scan is requested (with
the HA_MRR_INDEX_ONLY flag), we will use the default MRR implementation.

3.2 The case with MRR/MyISAM when there are BLOBs and sorted output is needed
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Consider the case where
 * the output needs to be sorted in key order (the HA_MRR_SORTED flag is
   set), and
 * the set of needed fields includes a BLOB field.

Here there are two possible execution scenarios:

 A. Perform the scan (using the default MRR implementation) that returns
    records in key order.
 B. Perform the scan (using MRR/MyISAM) that returns records in no
    particular order, and then sort the records (using filesort).

Here is a pseudocode representation of the actions that will be done in
scenario B:

  {
    table->multi_range_read_init();
    while (table->multi_range_read_next())       // (*)
    {
      if (WHERE condition is satisfied)          // (X)
        put (rowid, sort_value) pair into buffer;
    }
    // the following is done by filesort():
    sort the buffer of {rowid, sort_value} pairs by sort_value;
    for each pair in the buffer
    {
      retrieve full table record;                // (**)
      pass it to output;
    }
  }

As we can see, in scenario B each satisfying full table record will be read
two times:
 1. the first time, at location (*), using disk "sweeps";
 2. the second time, at location (**), using random disk accesses.

Comparing this with scenario A (full table records retrieved once, using
random disk accesses), we see that scenario B will be too expensive except
for [quite rare] cases where "condition X AND NOT range_condition" has very
high selectivity.

Conclusion: to avoid making some kinds of queries run slower, we will not
use MRR/MyISAM when we're requested to produce BLOBs in key order.
4. Implementation of MRR interface functions
--------------------------------------------
A general consideration, NoInterruptingIoAssumption: we'll assume that the
amount and frequency of disk I/O between multi_range_read_next() calls will
not be high enough to break the read sweep, so we will not buffer output
tuples, but rather read them one by one. The reasoning here is that
subqueries (evaluated in the table condition) will be unnested, and since
MRR will be used together with Batched Key Access, there will be no
interrupting I/O from the same thread. Also note: SergeyP's experiments
with interrupting I/O from other threads have shown that even certain
amounts of interrupting I/O do not "break the disk sweeps".

*************************************
* Note after discussion with Monty: *
*************************************
It *is* better to read full table rows at once, as this disk sweep won't be
interrupted. This is outside of this WL entry, but in the future we should
make use of a full table row (or field tuple) buffer.

Experiment: reading two 200M files in 8K chunks is about two times faster
when the files are read one after another than when the chunk reads from
the two files are interleaved.

/* Note: the below pseudocode describes handling of all supported HA_MRR_*
   flags */

int multi_range_read_info(uint keyno, uint n_ranges, uint keys,
                          uint *bufsz, uint *flags, COST_VECT *cost)
{
  record the cost/flags/mem_usage of the default MRR implementation;

  /* Check the rules described in sections 3.1 and 3.2 */
  if (*flags & (HA_MRR_INDEX_ONLY | HA_MRR_SORTED))
  {
    return cost/flags/buffer size of the default implementation;
  }

  /* We assume HA_MRR_LIMITS is always set - we don't request a memory
     buffer larger than the one provided. */

  /* Get the cost of the MRR/MyISAM scan as a function of:
      - number of records
      - table parameters
      - available buffer size
      - test(HA_MRR_NO_ASSOCIATION):
          if HA_MRR_NO_ASSOCIATION is used, we'll store rowids in the
          buffer, else we'll store (rowid, range_data_ptr) pairs
  */

  Compare the cost of the default MRR implementation and the MRR/MyISAM
  implementation and choose/return the best.
}

ha_rows multi_range_read_info_const(uint keyno, RANGE_SEQ_IF *seq,
                                    void *seq_init_param, uint n_ranges,
                                    uint *bufsz, uint *flags,
                                    COST_VECT *cost)
{
  /* Call the default implementation: this will allow us to get the
     parameters (cost/flags/mem_usage) of the default MRR implementation,
     and will also return E(number of records). */
  handler::multi_range_read_info_const(...);

  /* Check the rules described in sections 3.1 and 3.2 */
  if (*flags & (HA_MRR_INDEX_ONLY | HA_MRR_SORTED))
  {
    return cost/flags/buffer size of the default implementation;
  }

  /* Get the cost of the MRR/MyISAM scan in the same way as it is done in
     multi_range_read_info() */

  Compare the cost of the default MRR implementation and the MRR/MyISAM
  implementation and choose/return the best.
}

int multi_range_read_init(RANGE_SEQ_IF *seq, void *seq_init_param,
                          uint n_ranges, uint mode, HANDLER_BUFFER *buf)
{
  fill_buffer();
  /* If the above call has scanned through all intervals in *seq, then
     adjust *buf to indicate that the remaining buffer space will not be
     used. */
  return 0;
}

int multi_range_read_next(char **range_info)
{
  if (!have rowids in the buffer && fill_buffer())
    return EOF;
  return read_full_row(buffer.pop_front());
}

fill_buffer()
{
  if (no more index tuples to read)
    return 1;
  read index tuples and fill the buffer (either with rowids, or with
  {rowid, range_id} pairs);
  sort the buffer contents by rowid;
  return 0;
}

5. Cost model
-------------
/* The below formula is true for all components of COST_VECT except
   mem_cost: */
cost(MRR/MyISAM scan) = index_read_cost + sorts_and_sweeps_cost;

/* This cost is disk I/O cost only; random seeks are assumed */
index_read_cost= {use the handler::index_only_read_time() function}

5.1 Number and parameters of "sort and sweep" steps
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
MRR/MyISAM uses several "sort and do sweep read" steps. The number of steps
is determined as follows (a standalone sketch of this arithmetic follows
the formulas):

/* Length of one entry in the buffer: */
buffer_entry_length = rowid_length +
                      sizeof(void*) * (!test(HA_MRR_NO_ASSOCIATION));

/* Max. number of entries in the buffer: */
max_buff_entries = buffer_size / buffer_entry_length;

/* Number of iterations we'll make with a full buffer (here
   "range_records" is the estimated total number of records in all
   ranges) */
n_full_steps= floor(range_records / max_buff_entries);
records_in_full_step= max_buff_entries;

/* Number of records we'll be processing in the last iteration, where the
   buffer will not be full */
records_in_last_step= range_records % max_buff_entries;

For sorts_and_sweeps_cost we have:

  if (n_full_steps != 0)
    sorts_and_sweeps_cost.mem_cost= buffer_size;
  else
    sorts_and_sweeps_cost.mem_cost= records_in_last_step *
                                    buffer_entry_length;

For all other components of the COST_VECT structure, regular addition rules
apply:

  sorts_and_sweeps_cost =
    n_full_steps * sort_and_sweep_cost(records_in_full_step) +
    ((records_in_last_step == 0) ?
      0 : sort_and_sweep_cost(records_in_last_step));
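The step-count arithmetic above, as a standalone C++ sketch. Names follow
the formulas; this is illustrative, not the server's cost code:

  #include <cstddef>

  struct StepPlan
  {
    size_t n_full_steps;          // iterations with a completely full buffer
    size_t records_in_last_step;  // records left for the final, partial step
  };

  StepPlan plan_steps(size_t range_records, size_t buffer_size,
                      size_t rowid_length, bool need_association)
  {
    /* Each buffer entry holds a rowid, plus a range_data_ptr unless
       HA_MRR_NO_ASSOCIATION was given */
    size_t buffer_entry_length= rowid_length +
                                (need_association ? sizeof(void*) : 0);
    size_t max_buff_entries= buffer_size / buffer_entry_length;
    if (max_buff_entries == 0)    // guard: buffer smaller than one entry
      max_buff_entries= 1;
    return StepPlan{ range_records / max_buff_entries,
                     range_records % max_buff_entries };
  }

For example, with 6-byte rowids, 8-byte pointers and a 64K buffer,
max_buff_entries = 65536/14 = 4681, so 100000 records are processed in 21
full steps plus a last step of 1699 records.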
5.2 Cost of one "sort and sweep" step
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The CPU cost component is the cost of the qsort() call:

  sort_and_sweep_cost(n_records).cpu_cost =
    {cost of a qsort() call on an array of n_records elements}

TODO: come up with some formula for the CPU cost of qsort. The cost of one
comparison is the rowid comparison cost: (1 / TIME_FOR_COMPARE_ROWID).

The IO cost component is the cost of the disk sweep. The opt_range.cc code
has the get_sweep_read_cost() function that calculates the cost of one
record retrieval sweep. We'll re-use the cost model from that function, but
with one exception: get_sweep_read_cost() assumes that if the table is
accessed within a join, then the sweep is "totally broken" and the cost of
reading one block is the cost of a random disk seek. We will not use this
assumption, because we've agreed on using the NoInterruptingIoAssumption
(search above).

So we end up with the following cost model and formulas. We need to
retrieve n_rows rows from a file that occupies n_blocks blocks. We assume
that the offsets of the rows we need are independent random variables with
uniform distribution in the [0..max_file_offset] range. We'll denote a
block as "busy" if it contains row(s) we need to retrieve, and as "empty"
if it doesn't contain rows we need.

The probability that a given block is empty is

  (1 - 1/n_blocks)^n_rows

(this applies to any block in the file). Let x_i be a variable taking the
value 1 if block #i is empty and 0 otherwise. Then

  E(x_i) = (1 - 1/n_blocks)^n_rows;

  E(n_empty_blocks) = E(sum(x_i)) = sum(E(x_i))
                    = n_blocks * (1 - 1/n_blocks)^n_rows
                   ~= n_blocks * exp(-n_rows/n_blocks);

  E(n_busy_blocks)  = n_blocks * (1 - (1 - 1/n_blocks)^n_rows)
                   ~= n_blocks * (1 - exp(-n_rows/n_blocks)).

The average size of the "hole" between neighboring busy blocks is

  E(hole_size) = n_blocks / E(n_busy_blocks).

The total cost of reading all needed blocks in one "sweep" is:

  E(n_busy_blocks) * (DISK_SEEK_BASE_COST +
                      DISK_SEEK_PROP_COST * E(hole_size))

*************************************
* Note after discussion with Monty: *
*************************************
Actually the cost of one disk seek in the sweep is not proportional to the
disk seek distance. It has more to do with platter rotations: if the next
page to read is on the same cylinder, then the read is fast, otherwise
we'll need to wait for ~0.5 rotations. The results of a quick test (see the
attached skip_read_results.png), at first glance, confirm this. On the
other hand, the rows we need to retrieve are not uniformly distributed; the
data is usually somewhat clustered (so the median distance between the
pages we'll need to read is smaller).
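The sweep-cost formula above, evaluated in a short C++ sketch. The seek
cost constants are passed in as parameters; the function just computes the
expectations derived above:

  #include <cmath>

  double sweep_read_cost(double n_rows, double n_blocks,
                         double DISK_SEEK_BASE_COST,
                         double DISK_SEEK_PROP_COST)
  {
    /* E(n_busy_blocks) ~= n_blocks * (1 - exp(-n_rows/n_blocks)) */
    double busy_blocks= n_blocks * (1.0 - std::exp(-n_rows / n_blocks));
    if (busy_blocks <= 0.0)       // nothing to read
      return 0.0;
    /* E(hole_size): average distance between neighboring busy blocks */
    double hole_size= n_blocks / busy_blocks;
    return busy_blocks * (DISK_SEEK_BASE_COST +
                          DISK_SEEK_PROP_COST * hole_size);
  }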
6. Memory costs per record
--------------------------
The following information might be required for WL#2771. Assuming there are
no limits on buffer size, memory usage per returned record is as follows:

 * For the default MRR implementation, mem_per_record = 0, as it uses O(1)
   memory.
 * For MRR/MyISAM,
   mem_per_record = rowid_length +
                    sizeof(void*) * !test(HA_MRR_NO_ASSOCIATION).

7. Note about index condition split up
--------------------------------------
The "Index condition split up" (1) optimization can be used together with
MRR/MyISAM: we could filter out non-matching index tuples in the
fill_buffer() function. The question of how to arrange for "Index condition
split up" to be taken into account in the MRR/MyISAM cost calculations
remains open (but this should not be a significant problem). The MRR/MyISAM
implementation as described in this WL entry will not use "Index condition
split up".

(1) No WL# entry exists yet for this optimization; tentative spec text
exists as a result of pair work of Timour and SergeyP at the Malta DevConf.

8. How to fit it into current table handler functions
-----------------------------------------------------
Use of table handler functions: we can't use one table handler, because
we'll need to make the following sequence of calls:

  handler->index_init();
  handler->index_next();
  ...
  handler->index_next();  // (*), rowid buffer is now full
  handler->rnd_pos();     // retrieving full records after sorting
  handler->rnd_pos();
  handler->rnd_pos();
  ...
  handler->index_next();  // (**)
  ...

The call (**) needs to continue scanning the index from the position it was
at right after the (*) call. This will not happen automatically - calling
rnd_pos() will cause the table handler to forget the previous position.

We see two solutions to this problem:

 A. Restore the index scan position by calling
    handler->index_read({last_index_tuple_value, last_index_tuple_rowid}).
    (UPDATE: this just worked with InnoDB, and the MyISAM code needed some
    small adjustments.)
 B. Create two handler objects, one to be used for scanning the index, and
    another for making rnd_pos() calls. (This is how index_merge works.
    This is not a very elegant solution, as we're already "inside the
    table handler", and creating another table handler seems to be
    overkill.)

UPDATE: We're using method A.
WL#2475 Batched range read functions for MyISAM/InnoDB - LLD
============================================================

0. Changes against title and HLS
1. Cost functions implementation
2. MRR ordered-full-record retrieval algorithm
2.1 Index scan position restore
2.1.1 MyISAM: position at {key_tuple, rowid} pair
2.1.2 InnoDB: position at {key_tuple, rowid} pair
2.1.3 For the future: introduce position save/restore functions
2.2 Record retrieval algorithm
3. EXPLAIN output changes
4. Index condition pushdown
4.1 Index condition pushdown for default MRR implementation
4.1.1 Call index condition function in MyISAM
4.1.2 Call index condition function in InnoDB
4.1.3 SQL layer support for index condition pushdown
4.1.3.1 Execution
4.1.3.2 Optimization
4.1.3.2.1 Detect presence of index condition in the optimizer
4.1.3.2.2 Effects of index condition on MRR interface
4.2 Index condition pushdown with DS-MRR, executor
5. Other notes

TODO:
 - specify how condition pushdown will work.
 - add a note that we don't support SINGLE_MATCH.
 - add a note about memory size calculation.

0. Changes against title and HLS
--------------------------------
 * Introduced the new name "Disk-Sweep MRR, DS-MRR" so we don't have to
   repeat "MyISAM or InnoDB" everywhere.
 * Added index condition pushdown.

1. Cost functions implementation
--------------------------------
Cost functions will be the same for MyISAM and InnoDB. (The following
half-pseudocode is produced by adding some more detail to the description
of the cost calculations in the HLS.)

int multi_range_read_info(uint keyno, uint n_ranges, uint keys,
                          uint *bufsz, uint *flags, COST_VECT *cost)
{
  record the cost/flags/mem_usage of the default MRR implementation;

  if ((*flags & HA_MRR_INDEX_ONLY) ||
      (doing a scan on a clustered PK) ||
      ((*flags & HA_MRR_SORTED) && output-field-set-contains-blobs))
  {
    return cost/flags/buffer size of the default implementation;
  }

  /* Indicate that we will return unordered output */
  *flags &= ~HA_MRR_SORTED;

  /* The following call will also adjust *bufsz if the expectation is that
     only part of the buffer will be used: */
  cost= get_disk_sweep_mrr_cost();

  Compare the cost of the default MRR implementation and the DS-MRR
  implementation and choose/return the best.
}

ha_rows multi_range_read_info_const(uint keyno, RANGE_SEQ_IF *seq,
                                    void *seq_init_param, uint n_ranges,
                                    uint *bufsz, uint *flags,
                                    COST_VECT *cost)
{
  Get E(#records) and the cost/flags/mem_usage of the default MRR
  implementation;
  Do the same that multi_range_read_info() does.
}

/* Return the cost of a Disk-Sweep MRR scan. This function itself is
   engine-agnostic. */
get_disk_sweep_mrr_cost(int &buffer_size)
{
  /* Calculate the number of passes: */

  /* Length of one entry in the buffer: */
  buffer_entry_length = rowid_length +
                        sizeof(void*) * (!test(HA_MRR_NO_ASSOCIATION));

  /* Max. number of entries in the buffer: */
  max_buff_entries = buffer_size / buffer_entry_length;

  /* Number of iterations we'll make with a full buffer (here
     "range_records" is the estimated total number of records in all
     ranges) */
  n_full_steps= floor(range_records / max_buff_entries);
  records_in_full_step= max_buff_entries;

  /* Number of records we'll be processing in the last iteration, where
     the buffer will not be full */
  records_in_last_step= range_records % max_buff_entries;

  if (!n_full_steps)
    buffer_size= records_in_last_step * buffer_entry_length;

  sorts_and_sweeps_cost=
    n_full_steps * sort_and_sweep_cost(records_in_full_step) +
    ((records_in_last_step == 0) ?
      0 : sort_and_sweep_cost(records_in_last_step));

  if (n_full_steps != 0)
    sorts_and_sweeps_cost.mem_cost= buffer_size;
  else
    sorts_and_sweeps_cost.mem_cost= records_in_last_step *
                                    buffer_entry_length;

  index_read_cost= handler::index_only_read_time(...);

  cost(DS-MRR scan) = index_read_cost + sorts_and_sweeps_cost;
  return cost(DS-MRR scan);
}

/* Get the cost of one sort-and-sweep step */
get_sort_and_sweep_cost(n_records)
{
  COST_VECT res;
  res.cpu_cost = {cost of a qsort() call on an array of n_records
                  elements};
  (res.io_count, res.avg_io_cost)= get_sweep_read_cost();
                              // adopt from the index_merge implementation
  return res;
}
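The CPU cost of the qsort() call is left open above (see the TODO in the
HLS, section 5.2). One plausible formula, which is an assumption and not
something fixed by this WL, is the expected n*log2(n) comparisons of
quicksort, each costing one rowid comparison (1 / TIME_FOR_COMPARE_ROWID):

  #include <cmath>

  double qsort_cpu_cost(double n_records, double TIME_FOR_COMPARE_ROWID)
  {
    if (n_records < 2.0)
      return 0.0;
    /* expected number of comparisons, each costing
       1/TIME_FOR_COMPARE_ROWID */
    return n_records * std::log2(n_records) / TIME_FOR_COMPARE_ROWID;
  }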
2. MRR ordered-full-record retrieval algorithm
----------------------------------------------

2.1 Index scan position restore
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We'll need to be able to make the following sequences of calls:

  handler->index_init();                    \
  handler->index_next[_same]();             |- fill the rowids buffer
    --//--                                  |
  handler->index_next[_same]();  // (*)     /

  handler->rnd_pos();                       \
    --//--                                  |- retrieve full records in
  handler->rnd_pos();                       /  a disk sweep

  handler->index_next();         // (**)    \
  ...                                       |- fill the rowids buffer
  handler->index_next();                    /  with the next portion

The call (**) will need to continue scanning the index from the position it
was at right after the (*) call. This will not happen automatically. We'll
make it possible in the following way (following answer #A to the dilemma
posed in section 8 of the HLS):

  handler->index_init();                    \
  handler->index_next[_same]();             |- fill the rowids buffer
    --//--                                  |
  handler->index_next[_same]();  // (*)     /
  // remember the {key_tuple, rowid} pair

  handler->rnd_init();                      \
  handler->rnd_pos();                       |- retrieve full records in
    --//--                                  |  a disk sweep
  handler->rnd_pos();                       /

  handler->index_init();                    \
  handler->index_read({key_tuple, rowid});  |  // (***)
  handler->index_next();         // (**)    |- fill the rowids buffer
  ...                                       |  with the next portion
  handler->index_next();                    /

That is, we'll "restore the index scan" by making the call (***). Some
effort will be required to make this work, see the following sections.

CAVEAT: I'm not sure the above will work for R-TREE indexes. If it does
not, we'll disable MRR/MyISAM for them.

2.1.1 MyISAM: position at {key_tuple, rowid} pair
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
With the current code, calling

  ha_myisam::index_read({key_tuple, rowid},
                        length(key_tuple) + length(rowid),
                        HA_READ_KEY_EXACT);  // or whatever flag

will not produce the desired effect, because the ha_key_cmp function will
not compare the rowid keypart. ha_key_cmp() will compare the rowid keypart
if it is passed the appropriate comparison flags. We'll change
ha_myisam::index_read to check whether it has been passed a
{key_tuple, rowid} pair, and if so, arrange for the appropriate flag to be
passed down to ha_key_cmp().

2.1.2 InnoDB: position at {key_tuple, rowid} pair
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Experiments in the debugger show that if one calls

  ha_innobase::index_read({key_tuple, rowid},
                          length(key_tuple) + length(rowid),
                          HA_READ_KEY_EXACT);

then InnoDB will find the {key_tuple, rowid} pair and "restore the scan
position" correctly (looking at the InnoDB code along the execution path
also gives some evidence that calls like the above will do what is
desired).
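To summarize 2.1-2.1.2, here is a C++ sketch of building the search key for
the (***) call. SavedPos and the byte-vector key representation are
illustrative assumptions; the real index_read() calls take key buffers and
keypart maps:

  #include <cstdint>
  #include <vector>

  struct SavedPos
  {
    std::vector<uint8_t> key_tuple;  // last index tuple read
    std::vector<uint8_t> rowid;      // its rowid, to disambiguate duplicates
  };

  /* Appending the rowid to the key tuple makes the position unique even
     when many index entries share the same key value. */
  std::vector<uint8_t> make_restore_key(const SavedPos &pos)
  {
    std::vector<uint8_t> search_key= pos.key_tuple;
    search_key.insert(search_key.end(),
                      pos.rowid.begin(), pos.rowid.end());
    return search_key;
    // then: handler->index_read(search_key.data(), search_key.size(),
    //                           HA_READ_KEY_EXACT);
  }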
2.1.3 For the future: introduce position save/restore functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In the future we should introduce position save/restore functions into the
handler interface, so that a table handler will be able to save and later
on restore the index scan position. This is more efficient than the
approach used in this WL. This task will not be executed in the scope of
this WL (but when it is done, this WL's code will need to be changed to use
the save/restore functions).

(Position save/restore could also be used for the index_merge code, so that
code doesn't use the unclean way of creating table handler objects that it
is currently using. However, a cloning handler::fork() function would be a
better approach for the index_merge code.)

2.2 Record retrieval algorithm
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
int multi_range_read_init(RANGE_SEQ_IF *seq, void *seq_init_param,
                          uint n_ranges, uint mode, HANDLER_BUFFER *buf)
{
  fill_buffer();
  /* If the above call has scanned through all intervals in *seq, then
     adjust *buf to indicate that the remaining buffer space will not be
     used. */
  return 0;
}

int multi_range_read_next(char **range_info)
{
  if (!have rowids in the buffer && fill_buffer())
    return EOF;
  return read_full_row(buffer.pop_front());
}

fill_buffer()
{
  restore index position;                // <-- added in LLD
  if (no more index tuples to read)
    return 1;
  while (rowid buffer is not full)
  {
    get next {index_tuple, rowid} pair;
    if (index condition doesn't hold)    // See below for details about
      continue;                          // the "index condition"
    put rowid, or {rowid, range_id} pair, into the buffer;
  }
  save index position;                   // <-- added in LLD
  sort the buffer contents by rowid;
  return 0;
}
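The "sort the buffer contents by rowid" step, sketched in C++. Entries pair
an opaque fixed-length rowid with a range_id pointer; the bytewise
comparison matches positional MyISAM-style rowids and is an illustrative
assumption, since rowid comparison is really engine-specific:

  #include <algorithm>
  #include <cstdint>
  #include <cstring>
  #include <vector>

  struct BufferEntry
  {
    const uint8_t *rowid;  // points into the rowid buffer
    void *range_id;        // absent when HA_MRR_NO_ASSOCIATION is given
  };

  void sort_buffer_by_rowid(std::vector<BufferEntry> &buf,
                            size_t rowid_length)
  {
    std::sort(buf.begin(), buf.end(),
              [rowid_length](const BufferEntry &a, const BufferEntry &b)
              { return std::memcmp(a.rowid, b.rowid, rowid_length) < 0; });
  }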
3. EXPLAIN output changes
-------------------------
The "Extra" column in EXPLAIN output will reflect:
 * Whether Disk-Sweep MRR (= MRR/MyISAM or MRR/InnoDB) is used. If it is,
   the "Extra" column will contain "Using MultiRangeRead".
 * Whether condition pushdown is used. If it is, the "Extra" column will
   contain "Using condition pushdown".

4. Index condition pushdown
---------------------------
We will implement "Index Condition Pushdown" (ICP) within the scope of this
worklog entry. "Index condition" here means the part of the WHERE condition
that can be evaluated when we have only the fields covered by the index.
Note that the condition doesn't need to be SARGable.

ICP will work when
 - the DS-MRR implementation is used, or
 - the default MRR implementation is used for accessing a MyISAM or InnoDB
   table.

In both cases, cost functions will take ICP into account.

4.1 Index condition pushdown for default MRR implementation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Codewise, it is difficult to implement ICP entirely on the SQL layer in the
default MRR implementation, because we would have to do something like the
following:

  get_next_record()
  {
    do
    {
      // get {key_tuple, rowid} for the next row;
    } while (!index condition is satisfied);
    // save index scan position;
    file->rnd_init();
    file->rnd_pos();
    file->index_init();
    // restore scan position;
  }

Performing all of the above operations is likely to cause much CPU
overhead. It seems that an easier solution would be for the SQL layer to
provide the table engine with a pointer to a C function that performs the
filtering. The calling contract for the function will be as follows:

  /*
    Perform index condition filtering, assuming the fields covered by the
    index are in table->record[0].

    RETURN
      TRUE  - Ok, index condition is satisfied
      FALSE - Index condition is not satisfied
  */
  my_bool (*index_cond_func)();
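For illustration, here is how an engine-side index read loop could invoke
the pushed-down function. The loop structure and read_next_index_entry()
are stand-ins for what would actually go into mi_rkey() /
row_search_for_mysql() (see 4.1.1 and 4.1.2 below):

  #include <cstddef>

  typedef char my_bool;              // MySQL's my_bool is a char

  /* Set via handler->cond_push(); NULL when no condition was pushed. */
  static my_bool (*index_cond_func)()= NULL;

  /* Stand-in for the engine's "read next index entry into
     table->record[0]" step; returns false when the range is exhausted. */
  static bool read_next_index_entry()
  {
    return false;                    // stub for illustration
  }

  static bool get_next_matching_index_entry()
  {
    for (;;)
    {
      if (!read_next_index_entry())
        return false;                // range exhausted
      /* Skip rows rejected by the pushed index condition without ever
         touching the full table record. */
      if (!index_cond_func || index_cond_func())
        return true;
    }
  }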
4.1.1 Call index condition function in MyISAM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The call should be placed somewhere in the mi_rkey() function.

4.1.2 Call index condition function in InnoDB
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The index condition check function call should be placed in the
row_search_for_mysql() function. InnoDB calls
row_sel_get_clust_rec_for_mysql() to access the full table record.

4.1.3 SQL layer support for index condition pushdown
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We need to make two provisions on the SQL layer to support Index Condition
Pushdown:

4.1.3.1 Execution
~~~~~~~~~~~~~~~~~
The operation of "pushing the condition down" will be performed within the
handler->cond_push() call. The SQL layer will provide a
make_cond_for_index() function that will extract the index condition from
the table condition.

make_cond_for_index() shall produce a condition that has the following
properties:
 - It is the most complete expression that can be evaluated when you have
   only the index fields tuple.
 - It may depend on
   = constants and fields of "const" tables,
   = fields that are fully covered by the given index.

The implementation of make_cond_for_index() will be modeled after
make_cond_for_table() (a sketch of the recursion follows):
 * It will use self-recursion for AND/OR, like make_cond_for_table() does.
 * For all other kinds of Items, it will recurse down the item tree
   (ultimately reaching Item_field* and item_field->field->part_of_key)
   and determine whether a particular expression can be evaluated using
   only the given index.
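A skeleton of this recursion over a simplified expression tree. The Cond
type stands in for the server's Item hierarchy, and whether a leaf is
index-covered is assumed to be precomputed; this is a sketch of the scheme,
not the actual implementation:

  #include <memory>
  #include <vector>

  struct Cond
  {
    enum Type { AND_COND, OR_COND, LEAF } type;
    std::vector<std::shared_ptr<Cond>> args;  // children, for AND/OR
    bool covered_by_index;                    // meaningful for LEAF only
  };

  /* Return the part of cond evaluable from index columns alone, or NULL.
     An AND keeps whichever conjuncts survive; an OR is usable only if
     every branch is. */
  std::shared_ptr<Cond>
  make_cond_for_index(const std::shared_ptr<Cond> &cond)
  {
    if (cond->type == Cond::LEAF)
      return cond->covered_by_index ? cond : nullptr;

    auto result= std::make_shared<Cond>(*cond);
    result->args.clear();
    for (const auto &arg : cond->args)
    {
      auto pushed= make_cond_for_index(arg);
      if (pushed)
        result->args.push_back(pushed);
      else if (cond->type == Cond::OR_COND)
        return nullptr;  // one non-index branch spoils the whole OR
    }
    return result->args.empty() ? nullptr : result;
  }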
4.1.3.2 Optimization
~~~~~~~~~~~~~~~~~~~~
We need to know which fraction of the records retrieved by scanning the
ranges will be filtered out by the index condition (as the total cost of an
MRR scan depends on the number of full table records that we'll need to
retrieve). That is rather difficult to achieve because:

 a. We don't have any solution to the problem "produce a selectivity
    estimate for some condition C when it will be used to filter out
    records within a list of ranges R".
 b. Index condition extraction is an expensive operation, and we'd like to
    avoid doing it at the optimization phase.

Therefore we will not extract the index condition but do the following
instead:

4.1.3.2.1 Detect presence of index condition in the optimizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We'll modify the range optimizer to try to detect whether an index
condition will be present. Within the range optimizer (i.e. in
get_mm_tree):

Introduce RANGE_OPT_PARAM::index_conds_bitmap, where
(index_conds_bitmap[i] == TRUE) <=> "there is an extra condition for
index i". At range optimizer start, index_conds_bitmap[i] == FALSE.

Within the range optimizer, when processing a condition cond:

# For SARGable elementary conditions (i.e. leaves of the AND/OR tree):

  get_mm_leaf(expr)
  {
    ...
    /* The condition will be guaranteed to be true for rows that fit
       within the ranges. */
    index_conds_bitmap= 0;
  }

# For non-SARGable elementary conditions (i.e. leaves of the AND/OR tree):

  {
    if (!(cond->used_tables() & ~(head->table->map | const_table_map)))
    {
      // Ok, the condition uses only this table. Find out if we can check
      // it for index tuples of some indexes.
      context.index_cond_present_bitmap= 0;
      context.index_cond_bitmap= ~0;
      cond->walk(&context,
        Item_field::process()
        {
          if (this field is not covered by any index)
            return TRUE; // abort
          indexes= {bitmap of indexes that cover this field};
          context->index_cond_present_bitmap |= indexes;
          context->index_cond_bitmap &= indexes;
        });
      index_conds_bitmap= context.index_cond_bitmap |
                          context.index_cond_present_bitmap;
    }
  }

# For AND:

  tree_and(tree1, tree2)
  {
    index_conds_bitmap= tree1.index_conds_bitmap union
                        tree2.index_conds_bitmap;
    for each index IDX
    {
      /* Check if the produced list of intervals is not equivalent to the
         condition on that index. This is the case when the resulting
         SEL_ARG* tree for the index IDX represents a condition of the
         form:

           non_singlepoint_range(keypart_N) AND some_range(keypart_M)
           AND other_kp

         (here N > M, and other_kp doesn't refer to keypart_N or
         keypart_M). */
      if (the resulting SEL_ARG* tree has the above form)
        index_conds_bitmap[IDX]= TRUE;
    }
  }

# For OR:

  tree_or(tree1, tree2)
  {
    /*
      (index_cond1 AND non_index_cond1) OR
      (index_cond2 AND non_index_cond2)
        = (index_cond1 OR index_cond2) OR non_index_result_cond

      Here we assume that if index_cond_1 and index_cond_2 are
      non-confluent (not just "TRUE"), then
      "index_cond_1 OR index_cond_2" will be non-confluent too.
    */
    index_conds_bitmap= tree1.index_conds_bitmap intersect
                        tree2.index_conds_bitmap;

    /* tree_or can also perform this conversion:

         (key1condA AND key2cond) OR (key1condB AND key3cond)
           = (key1condA OR key1condB) AND remainder_cond
           --> (key1condA OR key1condB).

       Here we should do:
         index_conds_bitmap[key2]= TRUE;
         index_conds_bitmap[key3]= TRUE;
    */
  }

4.1.3.2.2 Effects of index condition on MRR interface
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The multi_range_read_info[_const]() functions will accept an additional
flag: HA_MRR_IDX_COND_SELECTIVITY_05. The SQL layer will set the flag
depending on the results of range analysis (see the previous section). The
MRR implementation will analyze the flag and make the obvious adjustments
to the estimate of the number of full records it is going to retrieve.

4.2 Index condition pushdown with DS-MRR, executor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Since DS-MRR will perform index-only scans, the table handler will not
check the ICP condition itself. Therefore the DS-MRR functions will have to
check the condition.

5. Other notes
--------------
 * In the scope of this worklog entry we will not support the
   semi-join-related HA_MRR_* flags.
 * The implementation will correctly process the presence/absence of the
   HA_MRR_NO_ASSOCIATION flag, but there will be no way to check that.
 * In the scope of the first milestone of this worklog entry we will not
   implement MRR for the MERGE table type.
 * It would be nice to see if our cost model matches reality. We could run
   some benchmark in simple settings (one query on the system, a synthetic
   dataset, etc.) and collect the data to plot the dependency of query
   execution time on certain parameters. While running the benchmark, we
   could also collect the numbers produced by the DS-MRR cost functions,
   and see if there is a reasonable correspondence between the two.

SCHEDULE
========
 - Implement Disk-Sweep MRR executor code              24 hrs
MS1: Disk-Sweep MRR works for InnoDB tables            - Apr 3, end of day

 Index position restore for MyISAM:
 - Implementation                                      16 hrs
 - Testing                                              8 hrs
MS2: Disk-Sweep MRR works for MyISAM tables            - Apr 11, end of day

 Reflect DS-MRR in EXPLAIN output:
 - Implementation and testing                          12 hrs
 - Implement Disk-Sweep MRR cost functions             16 hrs
MS3: DS-MRR works for both engines, the choice is
     based on the mrr_*info() functions                - Apr 20, end of day

 Implement Index Condition Split-up:
 - Make a generic split-up function                    16 hrs
 - Perform filtering on execution in DS-MRR             4 hrs
 - Account for filtering in the cost function           4 hrs
MS4: Index Condition Split-up works                    - Apr 28, end of day

 Other checks:
 - Check if MRR works for GEOM, disable if not          8 hrs
 - Testing and fixing whatever pops up                 24 hrs
MS5: Code ready for review                             - May 12, end of day

 - Post-review fixes                                   20 hrs
MS6: Post-review fixes done                            - May 22-26, end of day