WL#7019: Add support for row value constructors in in predicates to range optimizer

Status: Complete

Description
Requirements
High Level Architecture
Low Level Design

This work will fulfill a feature request from Google/Facebook. 

The task will enable IN queries with row value expressions to be executed using 
range scans. E.g.

SELECT * FROM table WHERE ( column_1, column_2 ) IN (( 'a', 'b' ), ( 'c', 'd' ))

Currently the only way to enable range scan is to rewrite the where condition into 
its equivalent DNF form, i.e.

SELECT * FROM table WHERE ( column_1 = 'a' AND column_2 = 'b' ) OR
( column_1 = 'c' AND column_2 = 'd' )

The limitation arises due to the range optimizer not being able to translate the 
former query into its native representation. In queries over scalars are currently 
translatable, whereas row values are not. This worklog task will add this 
capability.

User Documentation
==================

http://dev.mysql.com/doc/relnotes/mysql/5.7/en/news-5-7-3.html
http://dev.mysql.com/doc/refman/5.7/en/range-optimization.html#row-constructor-
range-optimization

Functional Requirements:
F-1: In predicates over row expressions in the where condition must be
     executable using range scan.

Non-Functional Requirements:
NF-1: In predicates over row expressions in the where condition must
      be executed using the same execution plan as the equivalent
      AND-OR expression.

Changes to the interface specification:
I-1: No new files.
I-2: No new syntax.
I-3: No new commands
I-4: No new tools. 
I-5: Impact on existing functionality: Query plans will change from full
index/table scan to index range scan where the feature is applicable.

Consider the following table definition:

  CREATE TABLE t1 (a INT, b INT, c INT, KEY x(a, b));

A query with an in predicate over more than one row constructor, e.g.

SELECT a, b FROM t1 WHERE (a, b) IN ((0, 0), (1, 1))

will currently have the following execution plan if the table has 4098 rows in
total, two of which match the where condition:

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: t1
         type: index
possible_keys: NULL
          key: x
      key_len: 10
          ref: NULL
         rows: 4098
        Extra: Using where; Using index

I.e. the whole index will be scanned unconditionally. After the feature is
implemented, the plan will change to:

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: t1
         type: range
possible_keys: x
          key: x
      key_len: 10
          ref: NULL
         rows: 2
        Extra: Using where; Using index

The changes will also be visible by reading the status variables
Handler_read_first, Handler_read_key and Handler_read_next.

What Is To Be Done
------------------

The range optimizer takes as input a parse tree and attempts to absorb
portions of the query that can be used for index range scans. In the
simplest case, the WHERE condition is 'column > constant'. In this
case the range optimizer will translate the condition into a range
scan expression tree. During execution this tree will be used to do an
index lookup starting at the constant's position in the index and scan
from there until the end of the index.

A wide range of conditional expressions can be translated into the
range optimizer's internal format, including expression trees
consisting of AND and OR nodes. But so far the absorption of IN
predicates has been limited. The folloing special case could so far be
translated into a range tree.

  column IN (constant1, ..., constantN)

The equivalent range condition would be the same as for the equivalent
expression

  column = constant1 OR ... OR column = constantN

However, the general case of an , where the left-hand side is
a row constructor, was so far rejected altogether by the range
optimizer. This work will add the capability of the translator to
translate the following type of condition:

  (col1, col2, ...) IN ((const11, const12, ...), (const21, const22, ...), ...)

The translated condition will be similar in structure to the condition

  ( col1 = const11 AND col2 = const12 AND ... )
  OR
  ( col1 = const21 AND col2 = const22 AND ... )
  OR
  ...

The translating procedure will be a nested loop as described in
pseudocode below:

  orExpr := false
  for each row r on the IN right-hand side
    andExpr := true
    for each column c_i in the IN left-hand side
      andExpr := andExpr AND c_i = r[i]
    orExpr := orExpr OR andExpr


What Is Not To Be Done
----------------------

This work is subject to the following limitations:

- Only positive predicates will be considered, i.e. NOT IN will not be
  considered.

- There may only be column references in the row constructor on the IN
  operator's left-hand side.

- There may only be run-time constants in all rows on the right-hand
  side of the IN operator. Run-time constants are:

  * literals, including those found during constant propagation.
  * local column references that are bound to constants during
    execution.


Limitations Imposed by Other Modules
------------------------------------

This work is limited to the range optimizer. However, the chosen
access method is influenced not only by the range optimizer and hence
there may be differences in execution plans between an in predicate
and the equivalent and-or expression even with this feature.

The ICP implementation will reject all expressions that are not on a
strict AND-OR form. Altering ICP is outside the scope of this work.


Implementation
--------------

This work will simply add to the existing functions that translate a
parse tree into a range tree. The parse tree is a tree of Item
classes, and the range tree is a SEL_TREE structure. Functions of
interest are:

static SEL_TREE *get_mm_tree(RANGE_OPT_PARAM *param, Item *cond)

This is the top-level translating function. It starts from scratch
with an Item tree and returns a SEL_TREE pointer. The NULL pointer is
returned if the condition was rejected.


Changed functions
-----------------

get_mm_tree()

This function is changed to not reject row constructor expressions on
the IN predicate left-hand side.


get_full_func_mm_tree()

This function gets its signature changed from

 static SEL_TREE *get_full_func_mm_tree(RANGE_OPT_PARAM *param,
                                        Item_func *cond_func,
                                        Item_field *field_item, Item *value, 
                                        bool inv)

to

 static SEL_TREE *get_full_func_mm_tree(RANGE_OPT_PARAM *param,
                                        Item *predicand,
                                        Item_func *op,
                                        Item *constant,
                                        bool inv)

Instead of taking a column as argument, it now takes an expression
tree node, which may be a column or a row. The order of arguments is
also changed to   , which corresponds to
the order in the SQL syntax, and in logic/mathematics in general. The
parameter names are also changed to reflect the standard SQL names for
the constructs rather than nonexistent words or words that carry no
information.

The function is also changed to accept row constructors (Item_row) and
pass them on to the next function in the call chain.

Please note that calling the right-hand side arguments of the
predicate 'constant' is rather misleading as there is a call in
get_mm_tree() that runs this procedure with flipped arguments. On the
other hand, there is also a call where a number 1 is cast to a
pointer. One can only do so much to make this code comprehensible.
The double C cast was changed to a reinterpret_cast to make the abuse
obvious.


get_func_mm_tree()

This function gets its signature changed from

static SEL_TREE *get_func_mm_tree(RANGE_OPT_PARAM *param, Item_func *cond_func, 
                                  Field *field, Item *value,
                                  Item_result cmp_type, bool inv)

to

static SEL_TREE *get_func_mm_tree(RANGE_OPT_PARAM *param,
                                  Item *predicand,
                                  Item_func *cond_func,
                                  Item *value,
                                  Item_result cmp_type,
                                  bool inv)

The motivation of which is the same as for get_full_func_mm_tree().


get_func_mm_tree_from_in_predicate()

This is a new function, the bulk of which comes from extracting all
code out of get_func_mm_tree() that translates an IN predicate. (The
body of the switch() label case Item_func::IN_FUNC) This is where the
predicand is actually translated, so the check if it is a row or
column is done here. The new code is a rendition of the pseudocode
snippet in the section "What Is To Be Done".


add_key_fields()

This function is part of the ref optimizer, yet it is somewhat tangled
with the rest of the optimizer and affects whether the range optimizer
is involved. It would appear that this function does a lot more than
it advertises. It populates a bitmap, JOIN_TAB::const_keys, the set of
indexed columns that will stay constant during access to a table in
the execution phase. Unless indexed columns appear here, the range
optimizer will not get involved to access them.

There is a (well known) problem with the cost model, which would cause
a ref scan to be chosen, even though it would scan the entire index
and table. Fixing the cost model is outside the scope of this worklog.
Hence we have the problem that we have to run the ref analysis
procedure for its side effects, yet we must not succeed in producing a
ref scan plan. Fortunately this hack is frequently performed already,
so we only have to follow suit.