WL#8149: B-tree Index Support on non-materialized virtual columns

Status: Complete   —   Priority: Medium

This worklog implements creating secondary indexes on non-materialized virtual
columns (let's simplify it as "virtual index") as well as using such indexes for
fast computed-value retrieving and searching. The non-materialized virtual
column support is described in WL#8114 and WL#411, which design the
virtual column in such way that they are not present in actual InnoDB index
records, but their metadata is registered with InnoDB system tables and
metadata caches.

Virtual column provides flexibility and space saving on the table, more
importantly, adding/droping such columns does not require table rebuild. Making
it a much better choice for storing and processing non-relational data such as
JSON. However, since they are not materialized, the scan/search could be very
slow. With this worklog, the virtual column value will be materialized in the
secondary index, making it much easier for value search and processing.

This worklog implements following major functionalities:

1) Create index on non-materialized virtual columns. User can create secondary
index on one or more virtual columns. And since this is a secondary index, and
also the columns are not materialized, adding/dropping columns and creating
index does not require a table rebuild, which is way better than the
materialized virtual column. (WL#8114 covers ADD/DROP COLUMN for virtual columns.) 

2) User is able to search qualified rows using such "virtual indexes", both in
terms of non-covered and covered scan. When searching such an index, no
computation is needed, because the value is materialized in the index record
and in undo log records. Also this should support all MVCC look ups by following
up cluster index and its "earlier versions".

3) Such "virtual index" gets updated correctly for DMLs, as well as any
transactional commits and rollbacks.

4) Such "virtual index" maintains its coherence during all crash recovery with
sufficient operational undo and redo logging.

5) Such "virtual index" should support all isolation level searches
and locking like using any other "normal" index

6) Such "virtual index" should be able to work with large BLOB/TEXT data in
their "base" column. The index key size is goes with their original limit (767
bytes for Antelope, 3072 for Barracuda format)

In summary, the "virtual index" is an essential means to efficiently use virtual
columns, without levy heavy computation and table scan for value searching and
other DML actions.
1. Can create/drop secondary index on one or more virtual columns

2. Primary index cannot contain any virtual columns

3. Does not support create index on a mixture of virtual columns and non-virtual
columns.

4. Support covered and non-covered scan with such virtual index

5. There is no locking change in this feature. All isolation level search should
be supported. So does fetching correct value with MVCC

6. When there are DMLs on the table, the index gets update correctly

7. Support computed column whose base columns are large BLOB, externally stored
object

8. Support prefix index on large computed column (can be BLOB).

9. Support run time rollback on DMLs with virtual indexes. Virtual column values
are undo and redo logged

10. Support crash recovery on DMLs with virtual indexes

11. Support check table on virtual indexes

12. Support null base column or null virtual column and index on them.

13. Support unique virtual indexes

14. Purge on virtual index should work as expected

15. Does not support spatial index or Fulltext index on Virtual columns (first
phase). This will be blocked in the server level for now.

16. Support change buffering on virtual column index

17. Truncate table should behave the same as it is with normal table/index. No
change is needed.

18. Virtual index does not qualify to be the index for foreign key. This should
be re-evaluated when we support indexing on mixed virtual/non-virtual index.

19. There is no index row format change. The (virtual) index row looks exactly
same as those index on normal columns. So does the undo and redo log format.

20. In this version, we will not support index on virtual column whose base
column could be a foreign constraint cascading update/delete candidate,that is,
anything that with 

"ON DELETE or ON UPDATE" with "CASCADE | SET NULL" clause

This is because the foreign key cascading update is currently performed in
InnoDB without server action, which is different from normal update operation.
For the special index on non-materialized virtual columns, there are following
properties associated with it:

1. The index will be secondary index only (virtual column cannot be part of
primary key anyway). And create index syntax remains unchanged.

2. The index can only index on virtual columns, not a mixture of virtual and
normal columns in this worklog. This feature will be addressed in a future WL
(Remove limitations on indexing virtual columns).

3. Support unique index

4. Fulltext and Spatial index will not be in this worklog's scope. It will be
addressed in a future WL(Remove limitations on indexing virtual columns).

5. As described in WL#8114, metadata/system table, the index metadata will
need to record not only the virtual column, but also its "base" columns (used to
compute the virtual columns). 

The direct impact on this worklog is that if there are any updates on the "base"
columns, the indexed virtual column(s) will need to be updated as well.

6. Metadata change
a) The index metadata looks the same as normal index, except the
SYS_INDEXES.TYPE has the DICT_VIRTUAL bit.

b) A new dict_table_t member, v_col_names is added. It is needed when we create
index and match up the indexed field

7. For Index Search, it will behave the same as normal index search. It also
supports MVCC search. In the case that we return early version virtual column
values, they are constructed from undo logged column info.

8. For DMLs, for all insert, delete and updates, the computed value needs to be
"materialized" and updated in the index, and logged in the undo log. 

We need to emphasize the importance of the function to be deterministic.
Otherwise, the index can be easily corrupted. The determinism needs to be
established in server

9. Logging: as discussed, if there is index on the virtual columns, the
operation on the columns will be logged (in the redo/undo log). The undo logged
value will be used for Purging, run time rollback, MVCC, implicit locking,
and crash recovery.

10. Create index: will not support LOCK=NONE in the first phase, to
keep everything simple. This will be addressed in a future WL (Support online 
create index on Computed/Generated Columns). But it can be created using the
WL#5526 ALGORITHM=INPLACE. Index value will be computed at the
time of the clustered index scan, and along with Primary key, passed to
the sorter. It can still use the bulk copy interface (WL#7277).

11. The same limits on maximum index record length apply to both virtual
and normal indexes. If the computed column is a very large row, the
virtual index can be defined on a prefix of the computed column,
just like we support column prefixes in normal indexes.

12. The callback function to obtain the virtual value could be called by purge
threads which are InnoDB specific background threads (Please see detail in LLD).
InnoDB related change:

1) Base column metadata:
Please refer to WL#8114 on how we register the base columns' info for
virtual columns. 
s
2) Template building:
In order to calculate the Virtual Column value, we need to construct a filled
MySQL row (with base column value) sending to server in a callback function. So
we will need to know the row mapping info between MySQL and InnoDB. To support
that, we added a virtual column template attached to the table share structure,
and accessible by the dict_table_t structure.

/* Structure defines template related to virtual columns and
their base columns */
struct innodb_col_templ_t {
        ulint                   n_col;          /* num of regular col */
        ulint                   n_v_col;        /* num of virtual col */
        mysql_row_templ_t**     vtempl;         /* array of templates for 
                                                virtual col and their 
                                                base col */
        char                    db_name[MAX_DATABASE_NAME_LEN];
                                                /* table's database name */
        char                    tb_name[MAX_TABLE_NAME_LEN];
                                                /* table name */
        ulint                   rec_len;
                                                /* MySQL record length */
        const byte*             default_rec;    /* default rec if it is NULL */
};

The db_name and tb_name is needed for the callback function supplied by
Optimizer to the InnoDB purge subsystem (see section 12 below).

/** InnoDB table share */
typedef struct st_innobase_share {
        ......
        innodb_col_templ_t
                        s_templ;        /*!< table virtual column template 
                                        info */ 
} INNOBASE_SHARE;
        

Following 2 functions are called when the table is opened first time to create
above templates:

innobase_build_v_templ
innobase_vcol_build_templ

The main consumer of this structure is the callback function to compute the
virtual column value, which is described next.

3) A new function (innobase_get_computed_value) is created to call the callback
function (handler::my_eval_gcolumn_expr) provided by server to compute the
virtual values.

This function will use above template info to create a MySQL row
(TABLE::record[0]) with base column values filled and send to server, and server
will fill in the computed column data, which then send back to us. We shall
parse out the virtual column value then.


4) index type of DICT_VIRTUAL

Currently, we only support index on either all virtual columns or all normal
columns. We do not support indexing on mixed columns. For those index on virtual
columns, DICT_VIRTUAL is marked on index->type. Once this limitation is removed, 
this bit will be used to indicate any index contains one or more
virtual fields.

6) For internal structure dtuple_t, we will also need to add additional virtual
fields, so that virtual values can be filled (for example, carry data from 
undolog):

struct dtuple_t {
        ....
 +       ulint           n_v_fields;     /*!< number of virtual fields */
 +       dfield_t*       v_fields;       /*!< fields on virtual column */
        ....
}

Such dtuples are created with additional virtual column info supplied to
dtuple_create_with_vcol().

7) Select
In WL#8114, if there is no virtual index. InnoDB does not need to process any
virtual columns. However, in this worklog, there are a couple of case that
InnoDB fills in the virtual column values:
 
 a) it is a covered scan on virtual index and no need to access the cluster
index (this is the case that in row_sel_store_mysql_rec the index parameter
passed in is the same virtual index in prebuilt)

 b) if cluster index needs to be consulted (due to MVCC etc.), and if the
prebuilt->read_just_key is set, then the virtual column values can be copied
from the undo log records related to earlier versions of the clustered index
record.

This is done due to the fact that server is expecting only virtual column is
returned (and no base columns returned to allow them to do a computation).

In all other cases, the virtual column will be filled by server layer.

8) Undo Logging

Insert:
If there are any virtual columns that are part of index (col->m_col.ord_part),
then it is logged along with any other indexed columns. If it is a large column,
maximum DICT_MAX_FIELD_LEN_BY_FORMAT bytes are logged, which is sufficient for
building an index field. The undo log is generated by
trx_undo_report_insert_virtual().

Update and delete:
In addition to log the column if it is a ord_part, the update also log the
update vector. So that during rollback or crash recovery, the update vector can
be reconstructed with asking for the callbacks.

In all those case, the column pos is added REC_MAX_N_FIELDS(1023) to signify
this is a virtual column.

9) for insertion, the row is already computed by MySQL, so just need to parse
out the input virtual column using row_mysql_convert_row_to_innobase() called by
row_get_prebuilt_insert_row(). The change is logged in
trx_undo_page_report_insert() for undo/recovery purpose.

10) for modification/update, the new row (including the virtual columns) are
supplied by MySQL. So the update vector can be generated for virtual columns can
be calculated in calc_row_difference() along with other columns. Then such
update vector on virtual columns are logged in trx_undo_page_report_modify(). To
avoid calculation during crash recovery, its before and after image are both
logged. In addition, any indexed virtual columns are also logged after the
update vector, for purging purpose, as well some of them could be non-changing
part of a index, so need to be restored during rollback/crash recovery too.

11) for delete, it is logged similarly in trx_undo_page_report_modify(), except
there is no update vector needs to be logged. It will log whatever indexed
columns. So this can be used for purge threads to build the index entries to
purge the record.

12) A special case for purge.
As mentioned above, for deleted row, all of its indexed columns will be logged.
So to sufficient to build the index fields. However, there is a verification
function row_vers_old_has_index_entry() that does the comparison of fields in
purge records with that of current cluster index. In this case, the "current"
virtual columns might need to be computed from the "current" cluster index
(also_curr). Unfortunately, in such case, the table might not be opened. In
addition, there currently is no MySQL THD created for InnoDB purge threads.
So it cannot simply invoke the callback function to compute the virtual value.
This issue is being addressed by Optimizer and runtime.

We will try to reduce the need for callback in such case. So only in certain
"infrequent" scenario, this is needed. To achieve this, we will need to do some
"additional" logging in the undo log for updates. The detail information is as
following:

1. The need for comparing to current clustered index record (PR) in
row_vers_old_has_index_entry() is only needed by purge, when purging a secondary
index key (in this case, the virtual index key).

2. Such need comes only when the Primary Key pointed by secondary index key is
not deleted. So if it is a simple delete (without any following "insert by
modify" that undo the delete-marking of the rows), then this comparison (thus
callback) is not needed. Or if it is update by delete + reinsert, the row could
also be marked as delete, and comparison is not needed. The more frequent
scenario is when there is update by modify (which update the PR key directly on
the row) or delete and later insert by modify (undo delete marking on PR), then
it creates possible scenario that multiple secondary keys (all delete marked
except one) point to a single undelete-marked Primary Key. In such case, when we
purging those delete marked secondary keys, the comparison is needed, thus the
callback.

3. We could mitigate the need for callback even in late case described above.
The key is being primary key has a linked list of its old history (or
specifically, the roll_ptr points to the undo log records). Theoretically, we
could log additional information in undo log to help backtrack the current
virtual column value in PR. To do so, we will log no only the "before image" of
the updated columns (for undo) in undo log, but also the "after image" (the
updated value) for those columns in undo log. With such "after image", we could
in some way to backtrack the current virtual column image if those image are
available.

4. However, we cannot completely eliminate the need for callback. Because above
undo log chain would not be always available. The secondary index could point to
a "brand new" Primary key without any history (when
trx_undo_roll_ptr_is_insert(roll_ptr) is true). This could happen in following
scenario:

1. CREATE TABLE t(a CHAR(1) PRIMARY KEY, b INT, INDEX(b));
2. INSERT INTO t VALUES ('a',1); COMMIT;
-- Clustered index: ('a',1) with insert_undo, no history
-- Secondary index: (1,'a')
3. DELETE FROM t; COMMIT;
-- Clustered index: delete-marked ('a',1) with update_undo history
-- Secondary index: delete-marked (1,'a')
4. INSERT INTO t VALUES ('a',2); COMMIT;
-- Clustered index: ('a',2) that was updated from delete-marked ('a',1)
-- Secondary index: (2,'a') and delete-marked (1,'a') [until it is purged]
5. DELETE FROM t; COMMIT;
-- Clustered index: delete-marked ('a',2)
-- Secondary index: delete-marked (2,'a') and delete-marked (1,'a')
6. Assuming that there are no other transactions in the system,
the purge view (purge_sys->view) is free to advance to the current
maximum transaction.
Purge workers are free to process the records from step 3 and 5.
But, there is no ordering guarantee!
Purge can process the second delete (step 5) first, leading to:
-- Clustered index: [empty]
-- Secondary index: delete-marked (1,'a')
7. INSERT INTO t values('a', 3);
-- Clustered index: ('a',3) with insert_undo, no history
-- Secondary index: (3,'a') and delete-marked (1,'a')
8. Purge processes the first delete (step 3).
We find the delete-marked (1,'a') in the secondary index.
We can also find the non-delete-marked ('a',3) in the clustered index,
but there is no undo history.
If the secondary index were built on a virtual column
(the value b=1 or b=3 is not present in the clustered index record),
we would have to evaluate its value to determine whether the record (1,'a')
can be safely purged from the secondary index.

You can see above scenario is due to there could be multiple delete marked sec
keys points to single PK. And it is possible that purge could clean up the PK
with one of those purge. And if a new PK with same primary key value comes
along, those pending purge sec key would again point to this new PK, who has no
history for you to backtrack the VC value.

So essentially, the callback is still needed, but might in a infrequent use
scenario.

In the above scenario, the history that would be needed at step 8
is discarded at step 6 when the undo records are purged out of order.
We believe that it is not feasible to try to add synchronization between
purge worker threads. Another case where history can go missing in a similar
way is on DELETE;COMMIT;INSERT;ROLLBACK;INSERT;COMMIT;purge when the
first INSERT replaced a purgeable record. In this case,
row_undo_mod_remove_clust_low() would
remove the clustered index record when executing the ROLLBACK, and the purge
of the DELETE operation would only see the result of the new INSERT. It
would have to invoke the callback to determine whether the secondary index
record matches the current clustered index record.


13) Building an earlier version of virtual index field
As discussed, the old version of virtual columns are logged in undo log during
DML, so if an earlier version of the row is needed,
trx_undo_prev_version_build() is called with a new parameter that fills a dtuple
with the virtual column information (note, it cannot be built into rec_t since
the clustered index record does not contain virtual columns). Then this virtual
dtuple_t's info is plugged into a row from row_build() before calling
row_build_index_entry() to build virtual index fields.

14) A new flag ROW_BUILD_NORMAL_W_VCOL is introduced for
row_build_index_entry_low() in the case if we need to build the virtual column
by calling the innobase_get_computed_value() callback. This happens in situation
such as build a secondary index for virtual index. Another flag
ROW_BUILD_FOR_INSERT is introduced to indicate that we are building a cluster
index record for insertion. Note we do not support virtual column in PK. So the
only case we will build virtual columns for cluster index is during insert,
where we will need to write the virtual column values to the undo log.

15) The implicit lock (row_vers_impl_x_locked_low) should also work with slight
modification as well. When look for earlier version of the cluster index,
trx_undo_prev_version_build() will be called to fetch the virtual column
values from the undo log, then build the virtual index entry corresponding
to the clustered index record for comparison with the record no the
virtual index page.