WL#8955: Add support for multi-valued indexes

Affects: Server-8.0 — Status: Complete

Description
Requirements
Dependent Tasks
High Level Architecture
Low Level Design

Currently an index can have only one index record that correspond to a data
record, i.e it's a single-valued index. In order to provide better indexing
options for JSON, server should support multi-valued indexes - i.e when multiple
index records could point to the same data record. This is useful for indexing
JSON arrays. For example:

{
user: John,
user_id: 1,
 addr: [

      {zip: 94582},
      {zip: 94536}
 ],
}

Here, if we'd like to search all zip codes, we'd have to have two records in
index for each zip code in the document, that'll point to the same document.

There's a counterpart WL for work on the InnoDB side - WL#8763.

Functional requirements
-----------------------
F1) This WL should allow multi-valued index to be created/inserted
into/updated and deleted from
F2) Any SQL type already supported by CAST(), except JSON, shall be supported
  as the type for multi-valued index.
F3) Multi-valued index can be composite, and the multi-valued key part can
  take any place within the key (first, last, in the middle). But only one
  multi-valued key part is allowed per index.
F4) NULL-ablity of m-v index:
  If multi-value key part's generation function returns empty array, nothing
  is inserted into the index, and the data record will not be accessible via
  index scan. 
  If multi-value key part's generation function returns NULL, then single
  entry containing NULL is inserted into index. If the key part is defined as 
  NOT NULL an error is thrown.
  If any of returned values is NULL, it's treated as JSON null and the
  BAD VALUE error is thrown.
F5) If multi-valued index is defined as unique, any value in the inserted
array that is already present in the index should trigger duplicate key error.
F6) Regular access method (index scan, ref, range (plain or MRR)) over
  multiple-valued indexes should be supported.
F7) Following functions should trigger m-v index usage: MEMBER OF(),
  JSON_CONTAINS(), JSON_OVERLAPS(). Ref access also allowed for multi-valued
  index.
F8) Usage of multi-valued index in any other way, except described in F6 (e.g
index scan, or using only prefix) should be prohibited, to avoid performance
regressions/wrong results.
F9) Multi-valued index should follow same applicability rules (e.g matching
  type, expression, collation, etc) as an index over a virtual generated column.
F10) Even when a record matches multiple indexed values, it should be returned
only once, not as many times as there're matches.

Non-functional requirements
---------------------------
NF1) Typed arrays shouldn't be visible to a user.
NF2) Only JSON typed expressions should be allowed in a multi-valued key part.
NF3) CAST(... AS ... ARRAY) should be available only in functional index
  definition in CREATE/ALTER TABLE/CREATE INDEX.
NF4) Typed array field name shouldn't appear in any thrown warnings/error
  messages/optimizer trace.

WL#1075: Add support for functional indexes
WL#8763: support multi-value functional index for InnoDB

Feature usage overview
----------------------
This WL implements support for multi-valued indexes in the server.
It allows user to declare such index, server to provide multiple values to
be inserted into index and define requirements/expectations from InnoDB.
Such index is intended to index JSON arrays. Effectively it's a functional
index which uses CAST() function to cast JSON arrays to an array of SQL type.
For example:

CREATE TABLE t1 (data JSON, key zips((CAST(data->'$.zip' AS UNSIGNED ARRAY))));

or

CREATE TABLE t1 (data JSON);
ALTER TABLE t1 ADD INDEX zips((CAST(data->'$.zip' AS UNSIGNED ARRAY)));

or 

CREATE TABLE t1 (data JSON);
CREATE INDEX zips ON t1((CAST(data->'$.zip' AS UNSIGNED ARRAY)));

these statements create multi-valued index with unsigned int values of '$.zip'
key from 'data' field. Any expression that returns JSON ARRAY can be used.
Statement for creating composite index can look like following:

CREATE TABLE t1 (id int, date DATE, data JSON);
ALTER TABLE t1 ADD INDEX zips(id, date, (CAST(data->'$.zip' AS UNSIGNED ARRAY)));

In general, multi-valued index is a regular functional index, with the exception
that it requires additional handling under the hood on INSERT/UPDATE for
multi-valued key parts. It will be used automatically, like any single-valued
index.  For example for such query:

SELECT * FROM t1 WHERE 123 MEMBER OF (data->'$.zip');  // (1)

and data like

INSERT INTO t1 VALUES
('{"id":1, "zip": [0,111,333]}'),
('{"id":2, "zip": [123,456,0]}'),
('{"id":3, "zip": [123,123,111]}'),
('{"id":4, "zip": [456,567,222]}'),
('{"id":5, "zip": [333,111,777]}');

the server should use ref access on index created above and return records with
id == 2 and id == 3.

Other two functions that allow multi-valued index to be used by optimizer are
JSON_CONTAINS and JSON_OVERLAPS. The former searches multiple keys but passes
only those documents in which all keys are present. The latter also searches
for multiple keys, but passes when at least one key is present in the document.
For example, on the data above the query

SELECT * FROM t1 WHERE JSON_CONTAINS(data->'$.zip', cast('[123,456]' as json));

should return only one record with id == 2.

SELECT * FROM t1 WHERE JSON_OVERLAPS(data->'$.zip', cast('[123,456]' as json));

should return records with id == 2, id == 3 and id == 4.


Architecture overview
---------------------
This WL consists of several key points, that allows server to use m-v indexes:
*) typed array field - a JSON field that implements storing arrays of
  same-typed scalars and converts arbitrary json to typed array
*) CAST(... AS ... ARRAY) function - acts as a type-holder and indication os
  multi-valued index 
*) changes to ref/range optimizer to use m-v index for fetching records when
  MEMBER OF()/JSON_CONTAINS()/JSON_OVERLAPS() are specified in the WHERE clause

The whole design works as follows: on CREATE/ALTER TABLE for a functional
index, server creates virtual generated column using the typed array field
(instead of a regular field) for a function for which is_returns_array() method
returns true. This WL adds one such function - CAST(... AS ... ARRAY). The
typed array field (Field_typed_array class) essentially is a JSON field, a
descendant of Field_json, but it reports itself as a regular field which type
is typed array element's type. For such field InnoDB creates a regular btree
index using field's type, with additional internal handling of its
multi-valued property. 

On INSERT/UPDATE the value saved to the typed array field is checked to be an
array or a scalar, and its elements are coerced to the field's array element
type. If a conversion error occur, appropriate message is thrown. Otherwise,
the result array is saved to the typed array field. Then, InnoDB parses field's
binary JSON value on it's own and stores obtained array's values to index.

For a SELECT query, substitute_gc() checks pre-conditions for m-v index to be
used. If they're met, the matching expression is replaced with the typed array
field. Further on, when ref optimizer is called, it assumes that if a typed
array field is met, then pre-conditions check was passed and generates a
ref access over typed array field in exactly the same way as for regular field.
Same for range optimizer, when it meets an optimizable expression with typed
array field, it generated appropriate SEL_ARG tree for range access.
The rest of optimization and execution goes as usual, requiring no additional
handling for m-v indexes.

Limitations
-----------
Only one multi-valued key part is allowed per index, to avoid exponential
explosion. E.g if there would be two multi-valued key parts, and server would
provide 10 values for each, SE would have to store 100 index records. This
will significantly increase storage/lookup/updating/deleting costs.
Yet, the path expression of the functional index's generating
expression could refer to multiple arrays within JSON document,
e.g CAST(jf->'$.arr[*][*]' AS UNSIGNED ARRAY). In such case all values
matching the path expression are stored into the index as a single flat array.
This is similar to MongoDb's limitation, yet allows to index paths with
multiple arrays in them.

A key with multi-valued key part won't support ordering or be a primary key
(as the consequence), but can be unique. In the latter case, presence in the
index of any of the newly inserted multiple values will cause duplicate key
error.

A multi-valued index is never covering as we need to extract actual source
array in order to properly match the WHERE condition.

Index applicability rules are the same as for indexes over virtual generated
columns.

There are two limitations on how much keys per single data row one can index.
It's max number of keys and total keys length. Both are dictated by the
storage engine via new handler::ha_mv_key_capacity() function. In case of
InnoDB it's 655335 and 10000 bytes respectively. InnoDB limits it by how much
data can be stored in an single undo log page. To give an idea how much it is:
currently InnoDB allows to index 1250 integer values at most.

Currently mult-valued index supports only two charsets - binary and
utf8mb4_0900_as_cs. The reason is that they are most close to how memcmp()
compares data (this is what our JSON type implementation uses to compare
strings). However, it's possible that two binary different strings will be
mapped to be equal in utf8mb4_0900_as_cs charset. Multi-valued index removes
duplicate values before storing them in SE, so one of such 'duplicates' will
be lost and selecting it won't yield any results. For example:

mysql> CREATE TABLE t1 (f1 JSON, UNIQUE KEY i1((CAST(f1->"$[*]" AS CHAR(10)
ARRAY)))); 
Query OK, 0 rows affected (0,16 sec)

mysql> INSERT INTO t1(f1) VALUES(JSON_ARRAY(_utf8mb4 0xE284AB, _utf8mb4 0xc385));
Query OK, 1 row affected (0,02 sec)

mysql> SELECT * FROM t1 WHERE _utf8mb4 0xc385 MEMBER OF (f1->"$[*]");
+------------+
| f1         |
+------------+
| ["Å", "Å"] |
+------------+
1 row in set (0,00 sec)

mysql> SELECT * FROM t1 WHERE _utf8mb4 0xE284AB MEMBER OF (f1->"$[*]");
Empty set (0,00 sec)

This issue can be fixed only when non-binary charsets will be introduced for
JSON type. Until that users will have to address this limitation (by
sanitizing input) on their own.

Change in syntax
----------------
This WL adds new keyword ARRAY (ARRAY_SYM token). It's optionally can be
specified after the type in the CAST() function:

  CAST_SYM '(' expr AS cast_type opt_array_cast ')'
 
opt_array_cast:
        /* empty */ { $$= false; }
        | ARRAY_SYM { $$= true; }
        ;

when the ARRAY is specified, create_func_cast() will create
Item_func_array_cast() for any specified type. Otherwise, a regular, one per
type, cast function will be created.


Typed array field
-----------------
A new class Field_typed_array is introduced in this WL. It's a descendant of
Field_json, i.e essentially a JSON field, but reports itself as of its
element's type. The typed array field checks that everything that is stored
into it is an array and each element of the array can be coerced to the
element's type without errors. This allows Innodb to get type info needed for
creating an index and store index data without errors.
To distinguish a regular field and typed array field a new method 
Field::is_array() is added. 

The typed array field always contains an array, even if it contains only one
element. The array is always sorted and all duplicates are removed. This is
done to simplify code on InnoDB's side.

Within scope of this WL typed array fields are used as virtual generated
columns only, i.e. they are never stored to SE. But this is a limitation of
this WL and introduction of a generally available array type in scope of
another WL could lift this.

Beside type-holding purpose, Field_typed_array also coerces data saved into
the field to the array's type. Coercion rules are as follows:

*) If saved value is a JSON scalar, it's treated as a single-element array
*) If saved value is a JSON array, its elements are coerced to the target type
  one by one and a new result array is produced, and stored. Coercion can
  throw warnings/errors, depending on the severity of case. The result array is
  sorted and duplicate values are removed. This allows to avoid storing
  duplicate index entries pointing to the same record and thus minimize space
  usage and make index lookups faster.
*) Saved JSON objects cause an error, since there's no direct way to convert
  object to an array.

Note on replication
-------------------
Typed array field represents its type as of its element type. For replication
this means that it will clash with scalar fields. To avoid this, explicitly
for replication, was added a new type MYSQL_TYPE_TYPED_ARRAY (numeric value
244). It's used to denote typed arrays during replication. Typed array field's
metadata consists of 4 bytes (for convenience of fitting into 4 byte integer
as a buffer): first byte is array's element type and next 3 bytes is metadata
according to the type - precision and decimals for DECIMAL, length for CHAR,
etc.

Handling of NULL values
-----------------------
There are two kinds of NULL - SQL NULL and JSON null, they are different.
The former is a no-value, the latter is a special value. Since SE is
relational, it can't index a special value that is neither a specific value
(e.g 0) nor lack of it. Due to that, JSON null aren't allowed in indexed
arrays. When it's met, the 'bad value' error is thrown.


Sparse index
------------
This WL will implement only base of support for this feature, the proper
storage of data for indexing. The feature itself will be implemented in a
separate WL.

In total there's 3 cases of data that's being stored in MV index:
1) An array of values, e.g. '[1,2,3]'. In this case typed array field is not
null, its top node is J_ARRAY and elements count > 0. SE indexes the given
array of values as is.
2) A empty array, e.g '[]'. In this case typed array field is not null, but
element count is 0. Here SE doesn't add any records to index.
3) SQL NULL. The typed array field is set to NULL. SE stores single record
containing NULL that points to the data record.

The storage options above are implemented in scope of this WL.

Of the above, case #2 allows user to create sparse index. For example:
CREATE TABLE t1(j JSON, sp_f JSON AS (IFNULL(j->'$.sparse_data','[]')) VIRTUAL, key sp_i((CAST(sp_f AS UNSIGNED ARRAY))));

or, using functional index,

CREATE TABLE t1(j JSON, key sp_i((CAST(IFNULL(j->'$.sparse_data','[]') AS UNSIGNED ARRAY))));

This statement in addition to the data source field j, creates a virtual
generated field sp_f, which contains either data to be indexed, or an empty
array, if data isn't available. When such field indexed using MV index,
empty arrays don't take any space in index, only real data, allowing faster
index lookups due to smaller index size.

Such index can be used just like any functional index, e.g.:

SELECT * FROM t1 WHERE 1 MEMBER OF(sp_f);
or
SELECT * FROM t1 WHERE 1 MEMBER OF (IFNULL(j->'$.sparse_data','[]'))

In order to make optimizer match sp_f to an index, GC substitution code needs
to be extended to allow matching field to a multi-valued index. The rest of
the optimizer code remains unchanged and sparse VM index is handled just like
any other regular functional index. This optimizer change will be implemented
in scope of another WL.

CAST(<expr> AS <type> ARRAY)
----------------------------
This function is one of central design points of this WL. It allows to hide
actual typed array under the hood. This variant of CAST() function is allowed
only as a generating expression of a functional index's key part over a
JSON-returning expression, thus only in CREATE TABLE/ALTER TABLE/CREATE INDEX.
It won't be allowed in an expression elsewhere, including expressions of
regular generated columns. 

The function obtains JSON value of its arg, either by getting values from a
JSON-typed expression or converting an SQL scalar to JSON and returns to the
caller. Since the function can only be a generating expression, its result
is immediately stored in a typed array field and is never seen by the user.
This allows to avoid introducing a generally available array type and thus
significantly reduces scope and complexity of the WL.

Beside passing data to a typed array field, the function acts as the
type-holder. It's returns_array() property returns true, so when such
expression is met, server knows that multi-valued index should be created.
CAST()'s syntax allows user to specify data type of the index.
Server derives underlying field's type from the generation expression, and
InnoDB creates index based on the field's type.

CAST(.. AS .. ARRAY) takes single argument of JSON type as the data source. 
In a future WL that will introduce general array type, it will be possible to
lift this by adding e.g. CAST(<sql row> AS <type> ARRAY). Then, the
function will convert SQL row to a JSON array and the typed array field will
handle the conversion.

Being a CAST() function, it'll cast to only those types which are already
available:
BINARY[(N)]
CHAR[(N)] [charset_info]
DATE
DATETIME
TIME
DECIMAL[(M[,D])]
SIGNED [INTEGER]
UNSIGNED [INTEGER]

This function won't cast JSON to array but generate the error
"Invalid use of <type> and CAST(.. AS .. ARRAY)". This is because InnoDB
doesn't support index of JSON type. Casting to other types could be added 
within scope of another WL.

The JSON native types that doesn't directly match to types above: BLOB, BIT,
OPAQUE, BOOLEAN, NULL will cause warning when casted to one of
supported types and converted to NULL, e.g on INSERT or UPDATE. If index is
defined as NOT NULL, it'll cause error and statement will fail.


JSON_OVERLAPS
-------------
This WL adds a new comparison function named JSON_OVERLAPS() with following 
syntax:

JSON_OVERLAPS(<json doc>, <json doc>) 

Primarily it's aimed to work on arrays and objects - to check whether two
arrays has same elements, or two objects has same key-value pairs. When two
scalars are given, the function behaves like '='.
Elements/values are compared as whole, i.e in case of nested JSON doc there
should be same complete nested doc on the other side to match, partial matches
are treated as no match. For example:

JSON_OVERLAPS(CAST('[[1,2],3,4]' AS JSON), CAST('[5, [1,2], 6]' AS JSON))

returns true, while 

JSON_OVERLAPS(CAST('[[1,2],3,4]' AS JSON), CAST('[1, [2,3], 4]') AS JSON)

returns false.

To widen scope of applicability cross-type comparison is also supported by
means of auto-array-wrapping. It works similar to auto-wrapping in other JSON
functions. When an arg isn't an array (except case when both args are objects)
then it's treated as a value of a single-element array. For example
JSON_OVERLAPS('[1,2,3]','33') and JSON_OVERLAPS('[1,2,3]','[33]') are the same.
Both objects and scalars are subject to auto-wrapping. 

SQL NULLs and JSON nulls are handled in different ways. When one of args is
NULL, then function returns SQL NULL value. But when both arrays has JSON null
value, then function returns true, not NULL, as JSON null is a value.

This function serves as counterpart to JSON_CONTAINS(), which requires all
elements of searched-for array to be present in the searched-in array. Thus,
JSON_CONTAINS() implements AND-ing of search keys, while JSON_OVERLAPS()
implements OR-ing. Note, that index could be used for this function only when
it operates on arrays/scalars. Objects can't be indexed as InnoDB doesn't
support indexing key-value pairs.

Examples:

JSON_OVERLAPS(CAST('[1, 2, 3]' AS JSON), CAST('[3, 4]' AS JSON))
returns true

JSON_OVERLAPS(CAST('[1, 2, 3]' AS JSON), CAST('[NULL, 6]' AS JSON))
returns NULL

JSON_OVERLAPS(CAST('[1,2,3]' AS JSON), CAST('[4, 10]' AS JSON))
returns false

JSON_OVERLAPS(CAST('{"a":1, "b":2}' AS JSON), CAST('{"a":1,"c":3}' AS JSON))
returns true

JSON_OVERLAPS(CAST('{"a":1, "b":2}' AS JSON), CAST('{"a":2,"c":3}' AS JSON));
returns false

JSON_OVERLAPS(CAST('{"a":1, "b":2}' AS JSON), CAST('{"a":null,"c":3}' AS JSON));
returns false 


MEMBER OF
---------
Another addition of this WL is the MEMBER OF function. Syntax is:

<sql scalar> MEMBER OF (<json array>)

It returns true if the scalar on the left is found in the array on the right.
False otherwise (not found/no array on right), and NULL if either of args is
NULL. If two scalars are given on both sides, then it behaves like '='. The
function is similar to IN(), with difference being that IN() operates on sql
row of values, while MEMBER OF() operates on a JSON array. Other difference is
that subselects aren't allowed in neither argument.

This function can be optimized by ref lookups, when exists an index,
generation expression of which matches the expression returning JSON array.

Examples:

1 MEMBER OF("[1, 2, 3]")
returns true

1 MEMBER OF(NULL)
returns NULL

1 MEMBER OF("1")
returns true

2 MEMBER OF('{"a":2}')
return false

'{"a":2}' MEMBER OF('{"a":2}')
return false

CAST('{"a":2}' AS JSON) MEMBER OF('{"a":2}')
return true 

Note last two examples. In the first one the string on the left is treated as
a scalar, while on the left, the string is treated as string-serialized JSON,
and thus as a JSON object in an array. As the result the function returns
false, as string isn't equal to object.
In the last example, the string is explicitly CASTed to JSON, and thus the
result is true.

IN
--
This WL also enables proper comparison of JSON data by IN() function.
When one of args (any of args) is JSON, then JSON comparator is used to compare
all arguments. E.g. in following case 101 IN('f1->"$.id",2)' 101 and 2 will be
compared by JSON comparator, despite neither of them is JSON. This is similar
to as how '=' behaves: when either of args is JSON, then JSON comparator is
used.


Changes to ref/range optimizers
-------------------------------
The only pre-condition for m-v index usage is that the value(s) to lookup
should be coercible to the typed array field's type. In order to play safe,
it's also checked that the values to lookup are const, as we cannot guarantee
type of the non-const JSON. This pre-condition is checked in substitute_gc() 
function. If an expression meets the pre-condition, then it is replaced for
the typed array field.

Following expressions are considered for optimization using m-v index:
*) <scalar> MEMBER OF (<typed array field>)
  ref optimizer can generate a ref lookup for a single value in a m-v index
*) JSON_CONTAINS(<typed array field>, <json>)
  **) if <json> is a single element array with a scalar, then ref lookup is
    generated for the value
  **) if <json> is an array, then range optimizer generates SEL_ARG tree
    representing a list of point ranges, each point representing a single
    element of the <array>. The JSON_CONTAINS() function remains in the WHERE
    condition and passes only those records where <typed array field>
    completely contains <array>.
  **) otherwise the function considered to be non-optimizable
*) JSON_OVERLAPS (<json>, <typed array field>)
   JSON_OVERLAPS (<typed array field>, <json>)
   It's optimized in exactly same way as JSON_CONTAINS(), but the
   <typed array field> have to contain at least one element of <array> for
   record to pass the check.

After ref/SEL_ARG tree is generated, the mutli-valued index is handled in
exactly the same way as a regular index - it goes through cost analysis, and
if it has the lowest cost, then index is picked and used just like any regular
index.
  

Unique record filter
--------------------
Since multi-valued index could contain many record pointing to a single table
record, it's possible that range access could return same table record many
times. For example:
CREATE TABLE t1 (id INT, f1 JSON, KEY((CAST(f1->"$[*]" AS SIGNED ARRAY))));
INSERT INTO t1 VALUES
  (1, CAST('[1,2,3]' AS JSON)), (2, CAST('[1,2,3]' AS JSON));
SELECT * FROM t1 WHERE JSON_CONTAINS(f1->'$[*]','[1,2,3]');

The SELECT above would return 6 records. 3 times the first record, and 3 times
the second, as range access would return same table record for indexed value 1,
2 and 3. Such record 'duplication' could happen only for range access, because
it looks up several keys in the index. Ref access looks up for a single key,
and server ensures that no duplicate keys can point to the same table record.
thus ref would return one record per single table record at most.

In order to return each record only once, it's needed to filter all records
returned by SE and return only those with unique primary key. This is done
automatically in the handler object. Filtering is enabled when an index with
multi-valued key part is activated. Records are checked as they are retrieved.


Behavior of UNIQUE multi-valued indexes
---------------------------------------
Behavior of UNIQUE multi-valued indexes are defined as follows:
CREATE TABLE t1(j JSON, KEY ((CAST(j AS UNSIGNED ARRAY)));

Allowed:
INSERT INTO t1 VALUES('[1,1,2]');
INSERT INTO t1 VALUES('[3,3,3,4,4,4]');

Disallowed:
INSERT INTO t1 VALUES('[1,2]');
INSERT INTO t1 VALUES('[2,3]');

I.e one is allowed specify duplicate values for the same record, but isn't
allowed (duplicate key error will be thrown) to specify duplicate values for
different records. Reason for this is twofold:
1) This mimics behavior or regular SQL UNIQUE index, where it disallows two
different records from having same indexed value.
2) Disallowing users to specify duplicate values for the same record has
next to nothing practical effect. In order to distinguish records users will
have to pre-filter values they insert and MySQL server will waste resources on
ensuring this. This means that users will have to spend more resources just to
achieve current behavior, i.e whole architecture will be less effective and
require more attention.

Overview
--------
Same as in HLS, there're 3 major blocks of changes, typed array field, CAST.
Additionally, there're some supporting changes, which are mostly refactoring,
unification, or re-use of already existing code.

Syntax
------
This WL extends current syntax by allowing user to specify ARRAY keyword as a
part of CAST's type specification:
  CAST(<expr> AS <type> ARRAY)
This ARRAY keyword is specific only to the CAST function and not available
for general type specification.

Changes to Field and Item trees
-------------------------------
Type-wise, the typed arrays in both trees are represented as typed array's
element type, i.e there's no type dedicated to the array. To distinguish
scalar type from typed array two methods are added: Field::is_array() and
Item::returns_array(). When they return true, it means that the field/item
returns typed array, which is stored as JSON and val_json() method should be
used to obtain the value.

Another change is due to reuse of code that converts sql scalar to a JSON one.
This is done by the function val_json_func_field_subselect() and now it's a
templated func, to make it work with both Field and Item classes as the data
source. Following methods are implemented for template-compatibility of
Field and Item trees:

Field:
data_type() returns Field::type()
is_null_value() returns Field::is_null()
update_null_value() returns false
get_collation() return Field::charset()

Item:
is_null_value() returns Item::null_value
get_collation() returns Item::collation.collation

Other calls that used in val_json_func_field_subselect() are used as is.

CAST( .. AS .. ARRAY)
---------------------
The function is implemented as a descendant of Item_func class. Objects of it
are created by create_func_cast() for any cast type when function's argument
is_array is set to true.
The class holds all type info, which later is used to create typed array field
in the table and an index over the field.

Field_typed_array
-----------------
It's a new Field class which implements typed arrays. It's a descendant of the
Field_json, i.e essentially a JSON field, but it reports itself as of type of
array element's type. To distinguish scalar type from a typed array, A new
method Field::is_array() is added. 

In scope of this WL the typed array field isn't used for actually storing
array data in SE. Instead it's used for conversion/type check purpose.

To do data conversion, an internal scalar conversion field is used. It is
created when a table is opened from table share and fields are copied to the
table, and initialized. The field and its buffer are allocated on table's
mem_root, but aren't bound to table's record buffer. 

On INSERT/UPDATE it's used as data converter to ensure that all array elements
correspond to array element's type:
0) Field_typed_array::store_json() sorts and removes duplicates from array
  being saved. If number of unique keys is greater than what's supported by
  SE - an error is thrown.  
1) Field_typed_array::store_json() goes through the given array and for each
  element it calls Field_typed_array::coerce_json_value(<element>). If given
  JSON data is object or a boolean, an error is thrown.
2) Field_typed_array::coerce_json_value() saves given JSON data into the
  conversion field, which does the actual data conversion to the array
  element's type. This is done by calling save_json_to_column() (one of
  JSON_TABLE helper functions, it's renamed in this WL to save_json_to_field()).
  If there were no errors, then the value from the conversion is immediately
  converted to appropriate JSON scalar. The latter will have proper length and
  precision/decimals set. This JSON scalar is appended to the
  Field_typed_array's result array.
3) If the total length of keys is greater than what's supported by SE, an
  error is thrown. Then, the result array is saved as binary JSON in the
  record buffer and becomes available to SE.
4) InnoDB (or any other SE that supports multi-valued index) reads the field,
  parses JSON binary and inserts values into multi-valued index on its own.
  Since the typed array field is always a virtual field, its actual value is
  never stored in SE.


In optimizer it's used by substitute_gc() to check that all values to lookup
in the index could be coerced to the array element's type without errors, in
order to provide consistent result when index is used. For this purpose
Field_typed_array overloads all store() methods that stores sql scalars
(strings, ints, decimals, etc) and routes them to appropriate methods of
internal conversion field. The saved value isn't used, it's only
checked whether there were any error during the conversion. If there was, the
typed array field won't substitute the expression and the multi-valued index
won't be used.

The function that creates fields out of type info - make_field() - now accepts
additional argument, bool is_array, and if it's true then Field_typed_array is
created and its element type is what was given to make_field().

Also, there're several places, in which actions are taken, depending on the
field's type. Where it's relevant to this WL, the type is adjusted to
MYSQL_TYPE_JSON, if the field->is_array() returns true. Same applies to Item
and Item::returns_array() respectively.

Changes to optimizer
--------------------
These changes consist of 3 main parts: changes to generated columns
substitution code, that does pre-check of multi-valued index applicability in
general, changed to ref optimizer and changes to range optimizer.

*) Item_func::gc_subst_transformer() is the work horse function for 
substitute_gc() that checks whether an expression can be replaced with a
generated column in order for an index over that column to be used. In this
WL it's refactored to allow values for lookup to be checked with the field,
prior to substitution. Changes are following:
**) Piece of code that looks up particular expression in the list of generated
  columns and replaces the expr in case of match is extracted to a new helper 
  function called replace_expr_with_field(). Code in
  Item_func::gc_subst_transformer() that handles [in]equalities and IN/BETWEEN
  now uses this helper function. This allowed to reduce code duplication in the
  transformer func.
**) get_gc_for_expr(), the helper function that finds a field that matches the
  given expression, now has new output arg to return found field. When it's
  provided, the function doesn't create Item_field for the found field. Instead
  it returns the field immediately. This is used just to check applicability of
  JSON_CONTAINS/OVERLAPS/MEMBER OF for multi-valued index.
**) Item_func::gc_subst_transformer() now checks applicability of
  JSON_CONTAINS() and OVERLAPS for optimization with multi-valued index. To do
  that, it first ensures that the JSON document actually contains an array of
  values, and it's constant. Then it tries to find a generated column field
  which matches the expression. If so, get_gc_for_expr() will return found
  Field, not Item_field. After that, all values from the array are saved, one
  after another, into the field and the conversion error is checked. If
  there were no errors, then function can be optimized using multi-valued
  index, and the expression is replaced for Item_field based on found field.

*) Ref optimizer in regard of multi-valued index handles two cases:
**) <scalar> MEMBER OF (<typed array field>)
  For this case ref optimizer generates a ref access over typed array field
**) JSON_CONTAINS(<typed array field>, <json array>)
  JSON_OVERLAPS(<typed array field>, <json array>)
  JSON_OVERLAPS(<json array>, <typed array field>)
  Since ref access can't handle more than one lookup value, ref optimizer
  simply marks indexes over typed array field as available for range
  optimizer.

*) Range optimizer is used to lookup multiple values in a multi-valued index
  by generating range access with set of point ranges.  Changes in it consist of
  get_func_mm_tree() detecting now that JSON_CONTAINS()/JSON_OVERLAPS() are
  used and calls a new function get_func_mm_tree_from_json_contains() or
  get_func_mm_tree_from_json_overlaps() respectively. They, in their turn,
  extracts JSON array from function's arguments and creates SEL_ARG tree out
  of it in the same way as it's done for MEMBER OF(). All values to lookup form 
  OR tree, independently of the function. The reason is that SE API (handler)
  needs SEL_ARG tree to be serialized into a consequent list of ranges, and the
  latter can't be created out of SEL_ARG tree of form "fld=X AND fld=Y". When
  SE will return actual records, the expression in WHERE condition would ensure
  that values are properly AND-ed or OR-ed, according to the function being
  optimized.

Changes to Data Dictionary
--------------------------
The only new property introduced in this WL from the DD POV is the is_array
property of the typed array field. It's stored as a boolean column's option,
just like e.g. treat_bit_as_char. No changes are made to the DD's schema. 

JSON_OVERLAPS
-------------
This is a new function added by in this WL and it's implemented by the
Item_func_json_overlaps class. It's a descendant of the Item_bool_func, as its
returns values are true/false/NULL. The implementation is straightforward and
the comparison function itself is contained in the 
Item_func_json_overlaps::val_int().  Main difference from other functions is 
that function's key_item() member function returns either first or second
func's argument, depending on which one contains typed array field.

MEMBER OF
---------
Another new funciton added by this WL. It's implemented by the
Item_func_member_of class. Same as above, it's a descendant of the
Item_bool_func and the implementation it straightforward, and resides in the 
Item_func_member_of::val_int()

Changes to JSON DOM
-------------------
This is minor change, yet it changes the behavior of JSON DOM API. Now when 
Json_wrapper::operator[] is called on a scalar and asked index is 0, then the
operator returns the Json_wrapper itself. This makes scalars to be treated as
single element arrays. Previously, calling [] operator caused failed assert.
Reason for this change is that it allows to noticeably simplify comparison
code. We already treat scalars this way, e.g Json_wrapper::length() returns 1
for scalars, instead of failing assertion. So now the behavior is more
consistent. 

JSON can store strings in two ways: as J_STRING and as J_OPAQUE. This WL makes
them equal in terms of data class and add proper comparison, similar to as
integer and double types compared as numeric.

Added a new method Json_datetime::from_packed_to_key() that converts packed
date/time representations to binary form that can be indexed by InnoDB. This
is a helper method used by SE.

Unique records filter
---------------------
This filter is in implemented in the handler class, because this is the most
close place to the data and it eases filtering itself and allows to shorten
round trip for duplicated records. 

The duplicate filtering per se is done by the new class called Unique_on_insert.
When a record is retrieved, its rowid is stored in Unique_on_insert object and
true/false is returned back indicating duplicate/unique rowid respectively.
To achieve that, Unique_on_insert uses same approach as is used for
semi-joins: it creates in-memory duplicate weedout tmp table. When a value 
(rowid) is inserted, an index is used to check for a duplicate. When the memory
tmp table is full, it's flushed to on-disk duplicate weedout tmp table.

The check for record uniqueness is done in the handler::multi_range_read_next()
method as it is used by the range access method. When a duplicate rowid is
found, the method loops over to retrieve next record, switching the next range
if necessary. 

To allow optimizer to enable unique filter, a new operation for
handler::extra() call is added - HA_EXTRA_UNIQUE_RECORD_FILTER. In this
operation the unique filter is created and initialized, if it's already exists
it is reset - stored rowids are deleted from filter, allowing a new scan. Since
handler::extra() is a virtual function and is overloaded by storage engines,
a new non-virtual method ha_extra() is added to the handler class, and
handler::extra() is made a protected class' member. handler::ha_extra() only
handles the operation to enable unique filter. Other operations are passed to
SE via handler::extra() call. This approach is the same as the one used for
many other handler's methods, e.g ha_index_init() or ha_index_read_map().
The filter is freed in ha_close() function.

In addition, a helper methods is added to the handler class. 
handler::filter_dup_records() retrieves the row id for the current record and
checks whether it's already present in the Unique_on_insert object.