WL#8170: Expression analyzer for GC

Status: Complete   —   Priority: Medium

The goal of this WL is to allow our range and ref optimizers to find
opportunities to use indexes defined over generated columns (GC). Intended use
of this feature is to allow building indexes over JSON documents.

Basic use case:
CREATE TABLE t1(f1 INT, gc INT AS (f1+1) STORED, KEY gc_idx(gc));
INSERT ...
SELECT * FROM t1 WHERE f1 + 1 > 9;

Optimizer should be able to use RANGE access for this query. This WL will
operate only STORED GCs as indexes over VIRTUAL GCs aren't available at the
moment. When they'll be ready, they'll be automatically used (with exception
of the primary key case below) as optimizer doesn't do any explicit checks for
a GC being stored or virtual.

Also, optimizer should take into account the comparison type: result type of
GC should match the result type of the expression it's compared to. E.g
following query wouldn't use index:

SELECT * FROM t1 WHERE f1 + 1 > 9.1;

The intended use case for it is following:

CREATE TABLE t1 (uuid BINARY(12), data JSON,
                 _gc1 INT AS MYJSON_EXTRACT(data, '$.id'),
                 _gc2 DOUBLE AS MYJSON_EXTRACT(data, '$.id'),
                 KEY _gc1_idx(_gc1),
                 KEY _gc2_idx(_gc2));

Before implementing native JSON indexes (in scope of another WL) result of
MYJSON_XX() functions would have to be casted to a particular SQL type in order
to allow proper indexing by InnoDB.

INSERT ...

SELECT ... FROM t1 WHERE MYJSON_EXTRACT(data, '$.id')=1;

After this WL is implemented, server should be able to generate ref access for
idx index. 

Another application of this WL is to allow to use an index over GC for sorting
and grouping:

CREATE TABLE t1(f1 INT, gc INT AS (f1+1) STORED PRIMARY KEY);
INSERT ...
SELECT * FROM t1 WHERE ... ORDER BY f1 + 1;

Optimizer should be able to use index for resolving ORDER BY clause in this
case.

CREATE TABLE t1(f1 INT, f2 INT, gc INT AS (f1+1) STORED PRIMARY KEY);
INSERT ...
SELECT f1+1, MAX(f2) FROM t1 WHERE ... GROUP BY f1 + 1;

Optimizer should be able to use loose index scan for such query.

Note that in both examples above the primary key is used, also the key should
be clustered. Since there is only one available primary key per table,
optimizer will use whatever type of GC used in it. If type conversion would
happen, it would be handled by server automatically, according to current type
conversion rules, this WL will not add/change anything in this area.


Functional requirements
=======================
1) Optimizer should be able to substitute parts of WHERE condition with
matching GC fields, same for elements of GROUP/ORDER lists
2) Optimizer should be able to used an index defined on a generated column for
expressions that match the definition of the generated column.

Non-functional requirements
===========================
The WL's implementation shouldn't have bugs and lead to performance
regressions.

General approach
================
First step is to create list of GC to replace with with. Optimizer will
go through all available GCs and add to the list those GCs which are a part of
a non-disabled index.
On the 2nd step, optimizer will go through the WHERE condition and look for
comparison operators that allow index to be used: >, >=, =, <=, <, IN, BETWEEN.
Upon finding such operator function optimizer will replace its arguments with
generated column when all of the following criteria are met:
.) Result type of the func's argument and the GC's expression are the same
.) func's arg to be replaced matches GC's expression
.) in case of IN, BETWEEN all args, except the one being replaced (arg[0]),
have to have same result type.
On the 3rd step optimizer tried to transform ORDER/GROUP BY. This step has more
restrictions than above ones. Consider the query:

CREATE TABLE t1(f1 INT, gc INT AS (f1+1) STORED PRIMARY KEY, KEY gc_idx(gc))
  ENGINE=innodb;
SELECT f1+1 FROM t1 WHERE f1 > 3 ORDER BY f1+1;

t1 has two indexes - primary and secondary. Since optimizer considers
non-covering keys to be slower than the table scan + filesort it'll look only
for covering indexes. When the SELECT query goes through resolution phase
(fix_fields in particular) gc_idx is marked as non-covering as it doesn't
include field f1. This means that even if we replace "f1+1" expression in both
select and ORDER BY lists with gc, we can't be sure that f1 isn't used in other
parts of the query and can't easily ensure that gc_idx is covering. Due to
that, the only key that could be safely used for ordering is InnoDB's clustered
primary key as it includes all fields.  To take this into account, optimizer
checks that the primary key is clustered and if so it removes from the list of
GCs all GCs that aren't a part of the primary key. Otherwise we skip this step.
After trimming the list, optimizer will go through ORDER/GROUP BY lists and
replace matching expressions with GCs. 


Support for multiple types of the same expression
=================================================
CREATE TABLE t1 (uuid BINARY(12), data JSON,
                 _gc1 INT AS MYJSON_EXTRACT(data, '$.id'),
                 _gc2 DOUBLE AS MYJSON_EXTRACT(data, '$.id') ,
                 KEY _gc1_idx(_gc1),
                 KEY _gc2_idx(_gc2));
INSERT ...;
SELECT ... FROM t1 WHERE MYJSON_EXTRACT(data, '$.id')=1;

Note that GC expressions in this example are the same, but stored in fields of
different types. 

Due to schema-less nature of JSON, different rows could have values of
different type for the same key. Using single index, e.g over INT field,
for different constants, e.g. INTs and DOUBLEs most likely would lead to a
wrong result. To address that user would
have to define several indexes of different types but with same expression, like
_gc1_idx and _gc2_idx in the example above. Optimizer will use the type of the
SQL constant, INT in the example above, and use it to pick the index with
appropriate type. When comparing two JSON typed values, index can't be used. To
address that, a user will have to add explicit CAST to one of JSON values.
Effectively this would block index usage on the CASTed column. This is a known
deficiency and would be addressed when server would support JSON indexes.


The analyzer is represented by a new member function of JOIN class -
JOIN::substitute_gc(). It's called before join order optimizer in order to let
the latter see all necessary indexes. 
This function consists of 3 parts:
1) It goes through the list of tables and collects all fields representing GCs
that are a part of an index that's available for query (i.e not disabled by
hints / ALTER).
2) Uses Item::compile() to transform the query's condition. The
analyzer for it (called Item::gc_subst_analyzer()) picks  >, >=, =, <=, <, IN,
BETWEEN operations. The transformer function (Item::gc_subst_transformer())
checks whether picked function's arguments are matching defined criteria, and
if so it creates a new Item_field for the matched GC's field and replaces the
matching func's argument with it.
For  >, >=, =, <=, < functions either of arguments can be replaced, while for
IN and BETWEEN only the first (left) one.
3) This step is taken only if the primary key is clustered. If so it removes
from the list of GCs all fields that aren't a part of primary key, then it
goes through ORDER/GROUP BY lists (if available) and tries to replace
lists' elements with matching GCs expressions.

This WL also introduces minor supporting change: now all objects implementing
CAST function have TYPECAST_FUNC function type.