This work will fulfill a feature request from Google/Facebook. The task will enable IN queries with row value expressions to be executed using range scans. E.g. SELECT * FROM table WHERE ( column_1, column_2 ) IN (( 'a', 'b' ), ( 'c', 'd' )) Currently the only way to enable range scan is to rewrite the where condition into its equivalent DNF form, i.e. SELECT * FROM table WHERE ( column_1 = 'a' AND column_2 = 'b' ) OR ( column_1 = 'c' AND column_2 = 'd' ) The limitation arises due to the range optimizer not being able to translate the former query into its native representation. In queries over scalars are currently translatable, whereas row values are not. This worklog task will add this capability. User Documentation ================== http://dev.mysql.com/doc/relnotes/mysql/5.7/en/news-5-7-3.html http://dev.mysql.com/doc/refman/5.7/en/range-optimization.html#row-constructor- range-optimization
Functional Requirements: F-1: In predicates over row expressions in the where condition must be executable using range scan. Non-Functional Requirements: NF-1: In predicates over row expressions in the where condition must be executed using the same execution plan as the equivalent AND-OR expression.
Changes to the interface specification: I-1: No new files. I-2: No new syntax. I-3: No new commands I-4: No new tools. I-5: Impact on existing functionality: Query plans will change from full index/table scan to index range scan where the feature is applicable. Consider the following table definition: CREATE TABLE t1 (a INT, b INT, c INT, KEY x(a, b)); A query with an in predicate over more than one row constructor, e.g. SELECT a, b FROM t1 WHERE (a, b) IN ((0, 0), (1, 1)) will currently have the following execution plan if the table has 4098 rows in total, two of which match the where condition: *************************** 1. row *************************** id: 1 select_type: SIMPLE table: t1 type: index possible_keys: NULL key: x key_len: 10 ref: NULL rows: 4098 Extra: Using where; Using index I.e. the whole index will be scanned unconditionally. After the feature is implemented, the plan will change to: *************************** 1. row *************************** id: 1 select_type: SIMPLE table: t1 type: range possible_keys: x key: x key_len: 10 ref: NULL rows: 2 Extra: Using where; Using index The changes will also be visible by reading the status variables Handler_read_first, Handler_read_key and Handler_read_next.
What Is To Be Done ------------------ The range optimizer takes as input a parse tree and attempts to absorb portions of the query that can be used for index range scans. In the simplest case, the WHERE condition is 'column > constant'. In this case the range optimizer will translate the condition into a range scan expression tree. During execution this tree will be used to do an index lookup starting at the constant's position in the index and scan from there until the end of the index. A wide range of conditional expressions can be translated into the range optimizer's internal format, including expression trees consisting of AND and OR nodes. But so far the absorption of IN predicates has been limited. The folloing special case could so far be translated into a range tree. column IN (constant1, ..., constantN) The equivalent range condition would be the same as for the equivalent expression column = constant1 OR ... OR column = constantN However, the general case of an <in predicate>, where the left-hand side is a row constructor, was so far rejected altogether by the range optimizer. This work will add the capability of the translator to translate the following type of condition: (col1, col2, ...) IN ((const11, const12, ...), (const21, const22, ...), ...) The translated condition will be similar in structure to the condition ( col1 = const11 AND col2 = const12 AND ... ) OR ( col1 = const21 AND col2 = const22 AND ... ) OR ... The translating procedure will be a nested loop as described in pseudocode below: orExpr := false for each row r on the IN right-hand side andExpr := true for each column c_i in the IN left-hand side andExpr := andExpr AND c_i = r[i] orExpr := orExpr OR andExpr What Is Not To Be Done ---------------------- This work is subject to the following limitations: - Only positive predicates will be considered, i.e. NOT IN will not be considered. - There may only be column references in the row constructor on the IN operator's left-hand side. - There may only be run-time constants in all rows on the right-hand side of the IN operator. Run-time constants are: * literals, including those found during constant propagation. * local column references that are bound to constants during execution. Limitations Imposed by Other Modules ------------------------------------ This work is limited to the range optimizer. However, the chosen access method is influenced not only by the range optimizer and hence there may be differences in execution plans between an in predicate and the equivalent and-or expression even with this feature. The ICP implementation will reject all expressions that are not on a strict AND-OR form. Altering ICP is outside the scope of this work. Implementation -------------- This work will simply add to the existing functions that translate a parse tree into a range tree. The parse tree is a tree of Item classes, and the range tree is a SEL_TREE structure. Functions of interest are: static SEL_TREE *get_mm_tree(RANGE_OPT_PARAM *param, Item *cond) This is the top-level translating function. It starts from scratch with an Item tree and returns a SEL_TREE pointer. The NULL pointer is returned if the condition was rejected. Changed functions ----------------- get_mm_tree() This function is changed to not reject row constructor expressions on the IN predicate left-hand side. get_full_func_mm_tree() This function gets its signature changed from static SEL_TREE *get_full_func_mm_tree(RANGE_OPT_PARAM *param, Item_func *cond_func, Item_field *field_item, Item *value, bool inv) to static SEL_TREE *get_full_func_mm_tree(RANGE_OPT_PARAM *param, Item *predicand, Item_func *op, Item *constant, bool inv) Instead of taking a column as argument, it now takes an expression tree node, which may be a column or a row. The order of arguments is also changed to <predicand> <operator> <rows>, which corresponds to the order in the SQL syntax, and in logic/mathematics in general. The parameter names are also changed to reflect the standard SQL names for the constructs rather than nonexistent words or words that carry no information. The function is also changed to accept row constructors (Item_row) and pass them on to the next function in the call chain. Please note that calling the right-hand side arguments of the predicate 'constant' is rather misleading as there is a call in get_mm_tree() that runs this procedure with flipped arguments. On the other hand, there is also a call where a number 1 is cast to a pointer. One can only do so much to make this code comprehensible. The double C cast was changed to a reinterpret_cast to make the abuse obvious. get_func_mm_tree() This function gets its signature changed from static SEL_TREE *get_func_mm_tree(RANGE_OPT_PARAM *param, Item_func *cond_func, Field *field, Item *value, Item_result cmp_type, bool inv) to static SEL_TREE *get_func_mm_tree(RANGE_OPT_PARAM *param, Item *predicand, Item_func *cond_func, Item *value, Item_result cmp_type, bool inv) The motivation of which is the same as for get_full_func_mm_tree(). get_func_mm_tree_from_in_predicate() This is a new function, the bulk of which comes from extracting all code out of get_func_mm_tree() that translates an IN predicate. (The body of the switch() label case Item_func::IN_FUNC) This is where the predicand is actually translated, so the check if it is a row or column is done here. The new code is a rendition of the pseudocode snippet in the section "What Is To Be Done". add_key_fields() This function is part of the ref optimizer, yet it is somewhat tangled with the rest of the optimizer and affects whether the range optimizer is involved. It would appear that this function does a lot more than it advertises. It populates a bitmap, JOIN_TAB::const_keys, the set of indexed columns that will stay constant during access to a table in the execution phase. Unless indexed columns appear here, the range optimizer will not get involved to access them. There is a (well known) problem with the cost model, which would cause a ref scan to be chosen, even though it would scan the entire index and table. Fixing the cost model is outside the scope of this worklog. Hence we have the problem that we have to run the ref analysis procedure for its side effects, yet we must not succeed in producing a ref scan plan. Fortunately this hack is frequently performed already, so we only have to follow suit.