WL#8763: support multi-value functional index for InnoDB

Affects: Server-8.0   —   Status: Complete

This worklog is to support multi-value functional index through InnoDB. Since
this is a functional index, and functional index in MySQL/InnoDB is implemented
through "virtual columns", so the multi-value funaiotnal index is based on
multi-value "virtual columns".

Since virtual column is never materialized in the table Primary Key (or data),
so it can be only materilaized by specific index, in this case the multi-value
functional index.

For example: 
if we try to index address or zip code (index(user_id, zip)) for a user
described following:
{
user: John,
user_id: 1,
 addr: [

      {zip: 94582},
      {zip: 94536}
 ],
}

Since he has multiple homes, it could generate multiple zip code out of a single
document.

we will support create index on ZIP code in this case.

Another simple example is:

mysql>  create table t1(f1 json, f2 int as (cast(`f1` as unsigned array)));
Query OK, 0 rows affected (4.16 sec)

mysql> insert into t1(f1) values   (cast('[1,3]' as json));
Query OK, 1 row affected, 2 warnings (0.04 sec)
mysql> insert into t1(f1) values   (cast('[1,3,9,8]' as json));
Query OK, 1 row affected, 2 warnings (0.05 sec)

mysql> create index idx1 on t1(f2);
Query OK, 0 rows affected (5.17 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> select f2 from t1;
+------+
| f2   |
+------+
|    1 |
|    1 |
|    3 |
|    3 |
|    8 |
|    9 |
+------+
6 rows in set (0.00 sec)

mysql> select * from t1 where f2 = 1;
+--------------+------+
| f1           | f2   |
+--------------+------+
| [1, 3]       |    1 |
| [1, 3, 9, 8] |    1 |
+--------------+------+
2 rows in set, 4 warnings (0.01 sec)

As you can see in above, there could be multiple f2 values corresponding a
single f1 values.
1. Allow create multi-value functional index along with create table. 

2. Allow create multi-value functional index using "alter table add index" and
"create index" clause

3. Only one multi-value column can be in a index

4. DML operation will generate correct number of values and correct values in
index. Index is properly maintained.

5. Check table will recognize such index and will not report table corruption on
mismatch number of index records and data records

6. Replication will properly replicate operations involving the multi-value index

7. Such index cannot be used to support Foreign Key constraint

8. Do not support multi-value prefix index

9. Index can only apply on limited data type:
DECIMAL, INTEGER, DATETIME, VARCHAR/CHAR

10. Do not support online create multi-value index
Terminology
===========

The words used in the design:
'Multi-value index' is short for 'multi-value functional index'.
'Multi-value column' means the column where multi-value index is built on. The 
source of this column is basically a normal stored column, the column itself 
should be a virtual column.


General
=======

Essentially, multi-value index is a secondary index, which requires a virtual 
column on the table, and a multi-value index built on this virtual column. The 
index itself is actually a virtual index. The difference from normal virtual 
column and virtual index is that it used to add virtual column and virtual index 
separately without any mandatory rule, however, to build a mutli-value index, it 
will automatically generate a virtual column on the data and then build the 
virtual index on this virtual column.

As mentioned before, one virtual column entry for multi-value index may be 
mapped to several different entries on the multi-value index, or nothing mapped 
too(Due to no matching data found in the virtual column entry). But please note 
that there won't be duplicate entries on the multi-value index. That means even 
if there are several entries for the same virtual column with the same primary 
key, the multi-value column data should be different for all entries.

Currently, only one multi-value column can be defined for one multi-value index. 
This multi-value column can be in any place of the multi-value index, which 
means that there could be other columns defined for the same multi-value index, 
and there is no restriction of column order.

For example:

CREATE TABLE t1 (
  j JSON DEFAULT (CAST('["HelloWorld"]' AS JSON)),
  KEY mv_idx_binary ((CAST(j->'$[*]' AS BINARY(10) ARRAY))));

CREATE TABLE t2 (
  j4 json ,
  KEY mv_idx_binary ((( CAST(j4->'$[*]' AS BINARY(10) ARRAY))), 
(json_depth(j4)), (json_valid(j4))) USING BTREE);

Here table t1 has one column 'j', and there is a multi-value index defined on 
the table, which will also create a multi-value column(virtual one) for 
'((CAST(j->'$[*]' AS BINARY(10) ARRAY)))'.

Table also has a multi-value index which now includes not only one multi-value 
column, but also some other normal columns. No more than one multi-value column 
can be on the single multi-value index.


Data types
==========

The multi-value column data type could only be array of:
1. DECIMAL
2. SIGNED / UNSIGNED
3. DATETIME
4. CHAR(the underlying storage is actually same with VARCHAR)


Capacity
========

To make server to double check the data validation earlier before diving into 
SE(InnoDB), it is useful that SE which supports multi-value index informs server 
how it supports the multi-value data. The two main important factors are the 
multi-value data length(or key length since it's on the index) and the number of 
maximum supported number of multi-value data(or the maximum size of the data 
array).

Currently, the limits from InnoDB is somehow bounded by undo log size. Since the 
undo log of INSERT should be written in a single undo log page, so basically, 
above two factors are restricted by the maximum undo log page size.

However, it's a not realistic to have a true estimation on the multi-value data 
length and maximum size of the data array before InnoDB knows the real data. 
Because the data length of other dynamic fields and each data array size are not 
known beforehand, so there is no exact estimation for each different row.

One doable solution is to give a rough estimation on above two factors, which 
allows users to insert as many multi-value data as possible, and once it exceeds 
the undo log page limit on later insertion, a different row is too long error 
would be raised. Based on this, it should assume nearly the whole undo page is 
available for one multi-value data, except some hard-coded header content, and 
the maximum length of the array can be calculated against the multi-value field 
with the shortest data length in the table.


Index creation
==============

Multi-value index should be created either along with CREATE TABLE or by CREATE 
INDEX/ALTER TABLE. Both approaches should be atomic.

Create multi-value index during table creation is easy, just create the index 
itself, no data operations. However there would be some differences in behavior 
to create multi-value index on a table with some data.

First difference is that this kind of index creation will create not only the 
multi-value index, but also the virtual column which the index is based.

Second one is to build the index, record from clustered index would be read, 
which may generate more than one multi-value index entries. This 1 to N mapping 
requires some special handling during the merge buffer creation for sorting 
index entries. And one merge sort buffer may not be able to accommodate multi- 
value index entries from single clustered index record.


SELECT on Index
===============

SELECT on multi-value index are supported as normal index. However, currently, 
the only way to leverage multi-value index to speed up scan is query with the 
'MEMBER OF' key word. For example:

CREATE TABLE t1 (
  j4 json ,
  KEY mv_idx_binary ((( CAST(j4->'$[*]' AS BINARY(10) ARRAY))), 
(json_depth(j4)), (json_valid(j4))) USING BTREE);

INSERT INTO t1 VALUES ('["foobar"]');
SELECT * FROM t1 WHERE "foobar" MEMBER OF (j4->'$[*]');

Because of the character of multi-value index, e.g. entries for same clustered 
index record would scatter in the full index, multi-value index can not support 
two kinds of scans:
1. Range scan. It's currently not proper to do range search on the multi-value 
index.
2. Index only scan. The entries on multi-value index is kind of meaningless if 
clustered index record is not fetched.


DMLs on Index
=============

All DMLs on multi-value index should be done similar with normal index. Only 
difference is that there would be more than one insertion or update on the index 
for one clustered index record.

There are actually three status of multi-value column data:
1. NULL
2. No value on index
3. Multiple entries on multi-value index

1 and 3 are easy to understand, 2 is specific for the case that the data array 
is empty or nothing matches, then there is no need to maintain the corresponding 
entry on multi-value index.

One promise from server is that, regardless of the DML type, as long as the 
multi-value column data is parsed by server to form an data array for InnoDB to 
operate on, the values in the data array are already sorted and no duplicate 
values.

For UPDATE, it is free to update data of any status to any different or same 
status. Please note that currently it's useless to build a NULL entry on multi-
value index, since there is no proper SQL syntax to search this. However this 
would be supported in the future, so internally, InnoDB is still building NULL 
entry on multi-value index.

Another important point for UPDATE is that if the UPDATE changes the data from 
above status 3 to a new value of status 3, then it is probably not all multi-
value entries on the index need to be updated. Let's say the old value is 
'["Hello", "HelloWorld"]', and new value is '["HelloMySQL", "Hello"]'. 
Obviously, only the "HellowWorld" has to be updated to "HelloMySQL", the "Hello" 
doesn't require any update. So it's important to keep the entries which are not 
touched at all. Furthermore, even if it's the above case and above multi-value 
column is the only column on the multi-value index, once the primary key of it 
gets changed too, then all entries tied to this primary key have to be updated 
for the multi-value index.

Finally, for all DMLs, it's necessary to handle the lock wait scenario like DMLs 
on normal indexes.


Redo and undo log
=================

There is no need to specially handle the redo log for multi-value column or 
multi-value index, all can be done in normal way, because changes to multi-value 
index is actually done one by one, and every change is similar to the one for 
normal index.

It's necessary to write special undo log for the multi-value column data, since 
it is internally stored in a different way from normal secondary index record. 
Due to this reason, undo log parsing also needs to take care of multi-value 
column data.


Rollback and purge
==================

Basically, there is no special handling for both rollback and purge, they just 
parse the record information from undo logs, and apply them to multi-value 
indexes and other indexes. Only difference is still that for multi-value index, 
more than one entries may have to be checked and rolled back(or purged).


Online DDL
==========

Online creation of multi-value index is not supported, since it includes 
creating a virtual column and a virtual index, which has been prohibited from 
online operation since virtual column story began.

However, online DDL should be done for a table with multi-value index(es). To 
support this, multi-value column data has to be remembered for online DDL logs, 
in the same format as they are remembered for undo log.


Check table and page validation
===============================

Since there are more than one entries on the multi-value index for one single 
entry on clustered index, so the expected 1 to 1 mapping between record from 
clustered index and secondary index is now broken for multi-value index. This is 
the special case for multi-value index, which should be screen out for CHECK 
TABLE.

For page validation, if the entries from multi-value index are buffered into 
change buffer, there could be some cases that values are actually different, 
however the records comparison function for change buffer will think they are 
equal. For example, if two entries are "abc" and "abc ", they are thought to be 
different multi-value data, but change buffer will think they are equal. So this 
case should be filtered out and regarded as correct too.
1. Signify the capability of multi-value index

HA_MULTI_VALUED_INDEX is added to m_int_table_flags to show InnoDB supports
multi-value index, however, it does not support ordered retrieval and key read 
only scan:
+  /* Multi-valued keys don't support ordered retrieval, neither they're
+  suitable for keyread only retrieval. */
+  if (table_share->key_info[key].flags & HA_MULTIVALUED_KEY) {
+    flags &= ~(HA_READ_ORDER | HA_KEYREAD_ONLY);
+  }


2. API to specify the multi-value capacity supported by InnoDB

void ha_innobase::mv_key_capacity(uint *num_keys, size_t *keys_length) const;


3. Indexed field length

The indexed table column is usually of JSON type, however, for the index, these
can be integer type of char type. So when we build the template, we need to
specifically get the actual field length:

+  if (templ->is_virtual && field->is_array()) {
+    templ->mysql_mvidx_len = static_cast<ulint>(field->key_length());
+    templ->is_mv_v = true;
+  } 

This also applies to how it is handled during ALTER TABLE.


3. Parse multi-value out of JSON column

The multi-value from a JSON data column is obtained from 
innobase_get_multi_value(). There, the number of values in the JSON data, and 
individual data value is retrieved. The retrieved data is sorted without any 
duplication.


4. Multi-value flags

There are several flags introduced to indicate if a column, field, index is a 
multi-value one.
4.1. #define DATA_MULTI_VALUE 16384 /* Multi-value Virtual column */
4.2. #define DICT_MULTI_VALUE 512 /* Multi-value index */

Basically, the DATA_MULTI_VALUE is used in prtype, to indicate if a column or 
field, etc. is a multi-value one, while the DICT_MULTI_VALUE is used for 
dict_index_t::type to indicate if an index is a multi-value index.


5. In-memory multi-value data structure

Following structure multi_value_data is used to store those multi-values:
/** Structure to hold number of multiple values */
struct multi_value_data {
 public:
  /** points to different value */
  const void **datap;

  /** each individual value length */
  uint32_t *data_len;

  /** convert buffer if the data is an integer */
  uint64_t *conv_buf;

  /** number of values */
  uint32_t num_v;

  /** number of pointers allocated */
  uint32_t num_alc;

  /** Bitset to indicate which data should be handled for current data
  array. This is mainly used for UPDATE case. UPDATE may not need
  to delete all old values and insert all new values because there could
  be some same values in both old and new data array.
  If current data array is for INSERT and DELETE, this can(should) be
  nullptr since all values in current array should be handled in these
  two cases. */
  Bitset *bitset;

  ...
  /** Some more member functions to handle multi-value related logic */
};

and then this whole structure will be stored/pointed by dfield_t::data.

If there is no value to build on the multi-value index, one flag is introduced 
as the length of the multi-value column field:
/** The following number as the length of a logical field means that no
attribute value for the multi-value index exists in the JSON doc */
#define UNIV_NO_INDEX_VALUE (UINT32_UNDEFINED - 2)

So , as mentioned in "DMLs on Index" from HLS, there are three status of the 
multi-value data:
5.1. For the NULL case, the relevant dfield_t object would be same with normal 
NULL data
5.2. For the no value on multi-value index case, the dfield_t::data would be 
nullptr, however the len field is set with UNIV_NO_INDEX_VALUE.
5.3. For the case there are multiple entries from the multi-value column, 
dfield_t::data would be set to above multi_value_data structure, and the 
dfield_t::len would be set with 0, to indicate this is a multiple value data.


6. SELECT and matching

For SELECT or any matching scenario, InnoDB will fetch the clustered index 
record to check if the multi-value index entry at hand is valid, not deleted, 
etc.

Since the clustered index record will probably actually hold the multi-value 
data array, as mentioned as 5.3, so in this case, InnoDB will iterate over the 
data array of multi_value_data, and check if any value from the array matches 
the secondary index entry.


7. DMLs

7.1. INSERT
InnoDB will loop through each value of the multi_value_data, and then insert 
into multi-value index.

7.2. DELETE
Similar work to INSERT will be done

7.3. UPDATE
This is a bit more complicated. Basically, both old and new values have to be 
parsed via innobase_get_multi_value(), unless either of them is NULL.

If both old and new values are as described in 5.3, then it's quite possible not 
all old values have to be deleted and all new values have to be inserted, 
because there would be some overlap between old and new values, which should be 
kept as is.

So to get rid of above case, and to keep performance, the overlap between old 
and new values would be calculated during parsing both old and new values. This 
leverages the feature of the parsed out data array having been sorted. So a 
merge sort can be done quickly. This is done by 
innobase_get_multi_value_and_diff(). One bitmap is used to remember which value 
is in overlap area or not. This bitmap is defined as multi_value_data::bitset, 
which is only initialized for this scenario, otherwise, this pointer is always 
nullptr.

With this supported, during real updating, only the old values not in the 
overlap area would be removed one by one, and new values outside the overlap 
zone would be inserted one by one.

7.4. Lock wait
Since all DMLs will modify the multi-value index several times for one single 
row, so to handle the lock wait scenario, current working position should be 
remembered, so next time after locking wait, it can continue from a correct 
point to not do redundant work. The position is remembered by some flag of those 
"node" structure.


8. Index building

As mentioned in HLS, to build multi-value index entries for one row, 
row_merge_buf_add() could be called for several times. And merge buffer may be 
allocated more than once for one single row too.


9. Undo logging and online DDL logging

9.1 Generic multi-value logging and parsing
A helper class for multi_value_data is defined to log the data array and then 
parse it from log.

/** Class to log the multi-value data and read it from the log */
class Multi_value_logger {
 public:
  /** Constructor
  @param[in]    mv_data         multi-value data structure to log
  @param[in]    field_len       multi-value data field length */
  Multi_value_logger(const multi_value_data *mv_data, uint32_t field_len)
      : m_mv_data(mv_data), m_field_len(field_len) {}

  /** Get the log length for the multi-value data
  @param[in]    precise true if precise length is needed, false if rough
                estimation is OK
  @return the total log length for the multi-value data */
  uint32_t get_log_len(bool precise) const;

  /** Log the multi-value data to specified memory
  @param[in,out]        ptr     the memory to write
  @return next to the end of the multi-value data log */
  byte *log(byte **ptr);

  /** Read the log length for the multi-value data log starting from ptr
  @param[in]    ptr     log starting from here
  @return the length of this log */
  static uint32_t read_log_len(const byte *ptr);

  /** Read the multi-value data from the ptr
  @param[in]            ptr     log starting from here
  @param[in,out]        field   multi-value data field to store the array
  @param[in,out]        heap    memory heap
  @return next to the end of the multi-value data log */
  static const byte *read(const byte *ptr, dfield_t *field, mem_heap_t *heap);

  ...
};

With this helper class, all content from multi_value_data would be written to 
any specified log. Please note that the multi_value_data::bitset would be logged 
only if it's not nullptr.

9.2 Undo logging
So for undo logging, multi-value fields are logged via APIs defined in above 
class, same apply to log parsing. The actual function for these purpose are 
trx_undo_store_multi_value() and trx_undo_rec_get_multi_value().

9.3 Online DDL logging
This also requires the logging of multi-value fields, so it leverages the same 
APIs from above helper class to do this.


10. Rollback and purge

For either operation, it will parse the multi-value field from undo log, and 
then do the proper operations on the multi-value index. They are same with 
normal DMLs, and will iterate over the multi-value data array and operate on 
every different entries.


11. Multi-value entry builder

To build and operate on a multi-value data entry for any operation on multi-
value index, it will first check which multi-value field is tied to current 
multi-value index; then if it's NULL value, just do it as normal; next, check if 
there is really any value built on the index for current row; finally, if it's a 
data array, then it has to iterate over the array and build different entries 
for each data.

This is a complex flow and which is nearly the same for every operation. More 
important is that building different entries for one single row should be 
prevented since only the multi-value field data needs to be changed, while other 
fields must hold the same data, and this can be done simply by replacing the 
data pointers properly.

So two kinds of multi-value entry builder are defined for this purpose, they 
have a base interface called: class Multi_value_entry_builder. Two sub-classes 
are derived from this base interface, and one of them supports the most 
operations which can provide the node->row and node->ext while another of them 
supports the INSERT specially.


12. Change buffer

As explained in the last section of HLS, change buffer has to know if a field is 
multi-value one or not, to screen out the duplicate key case for page 
validation.

So this flag is remembered in the metadata area of each field. Luckily, there is 
only one free bit left for this purpose, which is in the first byte, and '& 64'. 
This change is done in dtype_new_store_for_order_and_null_size() and check in 
the counterpart function of it.


13. multi-value field duplication

InnoDB used to call dfield_dup to duplicate a normal data. However, if the 
multi-value field has data as described in 5.3, it has to duplicate the full 
data array, rather than a simple value pointed by one data pointer.