WL#8114: Don't store virtual generated columns in database

Status: Complete   —   Priority: Medium

Currently virtual generated columns still occupy space as if they were stored.
Effectively server is responsible to calculate the value, but storage engine
stores trash in place of those fields. With this WL, virtual columns will no
longer take any space in actual data rows. Thus making such column much more
flexible (add/drop does not need to rebuild table) and space efficient (do not
take space in actual data row any more).

For example, if we have a table with 5 regular columns and 2 virtual columns. On
disk, the data record (cluster index record) will only have those 5 regular
columns (along with any necessary hidden system columns), and those 2 virtual
columns are never stored in those data rows.

The virtual column will still be represented in the InnoDB metadata. It is
treated as a column with prtype includes DATA_VIRTUAL bit. When adding a virtual
column, the column will register to Sys_columns as if for any other columns. And
the number of virtual column is also encoded in the N_COLS field of Sys_tables.
In addition, a new system table, SYS_VIRTUAL, is added to register the base
column information for each virtual columns.

For corresponding in-memory metadata structure, the virtual column info is
separated from that of regular columns. The regular column will still hang off
the dict_table_t::cols, while the new virtual columns will be hang off a new
list with dict_table_t::v_cols. In this way, the repercussion of new virtual
columns on existing access methods is limited.

A new virtual column structure (dict_v_col_t) is created, which has not only
regular column information, but also contains information from SYS_VIRTUAL,
which records the "base columns" information for the virtual column.

When we fill the result through the row_prebuilt_t::mysql_template[], the
virtual column could be skipped, since they do not exist in our data record. So,
the optimizer will fill in the data in TABLE::record

Since the virtual column is never stored in real rows, so alter table add or
drop virtual columns are reduced to a few system table modification, and it is
mostly effective instantly.

So in summary, with this worklog, the virtual column data will never stored in
data rows. But its metadata is recorded in various system tables, to show its
presence. And add/drop such columns are instant without table rebuild
1) Virtual columns are represented correctly in system tables, mainly SYS_COLUMNS.

2) Virtual columns' "base columns" (columns needed to generate virtual columns)
are represented correctly in system table SYS_VIRTUAL.

3) In this release, we do not support virtual column based on virtual columns.
The virtual column will be based only on regular materialized columns.

4) Alter table add/drop virtual columns will be an instant operation, does not
require table rebuild. Since this is an instant operation, we will limit the
alter table add/drop virtual columns as a single operation itself, and not to
mixed with other DDL operations in inplace alter interface, so to reduce the
complexity of testing. This restriction will be removed in another WL (Combine
alter add/drop virtual columns with other inplace alter operations).

5) Virtual columns are not stored on disk rows, so the size of table is as if
there is no such column.

6) Metadata can be loaded successfully after server crash or reboot

7) DMLs runs correctly on the table with virtual column

8) Base columns can be very large BLOB, so does the virtual columns.

9) Base columns can be NULLable, so does the virtual columns.

10) Virtual column can't be used as an AUTO_INCREMENT column.

11) Virtual column can't be used in FOREIGN KEY constraints.

12) Virtual column cannot be indexed (until WL#8149)

13) NO lock changes in this worklog
This worklog would involve change in following areas:

1) Metadata change:

The column will be registered with System tables (Sys_columns, to be specific).
It will look as a normal column in Sys_columns other than it will be marked as
"virtual column" in its PRTYPE (DATA_VIRTUAL), in addition to its inherited type
properties (char, int etc.).

We will not put additional limit on number of virtual columns can be added, but
it is under the overall limitation of number of columns can be supported for a
table, which is 1017
(http://dev.mysql.com/doc/refman/5.6/en/innodb-restrictions.html):

#define REC_MAX_N_FIELDS        (1024 - 1)

#define REC_MAX_N_USER_FIELDS   (REC_MAX_N_FIELDS - DATA_N_SYS_COLS * 2)

As mentioned in HLD, for in memory representation, we will track virtual column
info separately with that of regular columns. So that the existing system
continues to function as if those virtual columns are not present.

2) In Row Storage:

To an InnoDB row, these virtual columns are "non-present". So for in-row record,
only regular materialized columns are present, and virtual columns are not
visible as if they do not exist. This is achieved by completely separate the
virtual and non-virtual column in InnoDB in memory metadata (notablely,
dict_table_t and dict_col_t).

An alternative approach would be stored it similar as a null variable length
column in the row. This will not incur additional space in the data part of the
row, but maybe have 1 or 2 bytes in the row offset. So the extra cost of having
it would be minimum, but we also preserve the column identity in the row. But
this still have extra cost, and it cannot support instant alter table add / drop
column. So this approach is far less ideal than the current one.

Since the non-materialized column does not store any value in row, so it cannot
be part of primary key and its presence will not make any difference in terms of
sort order of primary index.

3) Logging:

Without indexing, the virtual column value will never comes to actual rows, and
storage engine / data row has no record of such values. So there is no need for
any logging.

4) Search and MVCC:
Since the value is not stored in InnoDB, all the virtual column value will be
computed with the "base" columns. And the MVCC would be supported through values
obtained from these "base" columns.

5) Check table will ignore virtual column without index

6) Alter table add/drop virtual column will not require table rebuild. They will
be simple and instant update to system tables. Each operation will require a
update of SYS_TABLES (for num cols), SYS_COLUMNS (adding, dropping cols) and
SYS_VIRTUAL. In the first phase, we will support alter table add/drop virtual
columns all by their own, not to combine with other table altering commands.

7) A new system table SYS_VIRTUAL will be added to track virtual columns' base
columns. Its corresponding Information Schema table innodb_sys_virtual will be
added to view the content of the system table.

The detail column defines will be shown in LLD. But in general, SYS_VIRTUAL
provides necessary information about virtual column mappings (which columns are
needed to compute the value):

CREATE TABLE `t` (
  `a` int(11) DEFAULT NULL,
  `b` int(11) DEFAULT NULL,
  `c` int(11) GENERATED ALWAYS AS (a+b) VIRTUAL,
  `h` varchar(10) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

mysql> select * from information_schema.innodb_sys_virtual;                    

+----------+-------+----------+
| TABLE_ID | POS   | BASE_POS |
+----------+-------+----------+
|       38 | 65538 |        0 |
|       38 | 65538 |        1 |
+----------+-------+----------+
2 rows in set (3.00 sec)

Their corresponding SYS_COLUMNS view is following:

mysql> select * from information_schema.innodb_sys_columns where TABLE_ID in
(select TABLE_ID from information_schema.innodb_sys_tables where name like
"test/t");
+----------+------+-------+-------+--------+-----+
| TABLE_ID | NAME | POS   | MTYPE | PRTYPE | LEN |
+----------+------+-------+-------+--------+-----+
|       38 | a    |     0 |     6 |   1027 |   4 |
|       38 | b    |     1 |     6 |   1027 |   4 |
|       38 | h    |     2 |     1 | 524303 |  10 |
|       38 | c    | 65538 |     6 |   9219 |   4 |
+----------+------+-------+-------+--------+-----+
4 rows in set (0.01 sec)

The POS fields of both tables are a large number is because it encodes 2
positions (the position in original MySQL table and its virtual column sequence
number in InnoDB), detail can see LLD (5).

In above case, column c is a virtual column that generated by normal column 0
(`a`) and
column 1 (`b`) of the table.

If there is a constant virtual column, it does not have any base column, so it
will not enter into SYS_VIRTUAL:

 CREATE TABLE `t1` (
  `a` int(11) DEFAULT NULL,
  `b` int(11) GENERATED ALWAYS AS (a+1) VIRTUAL,
  `c` int(11) GENERATED ALWAYS AS (5) VIRTUAL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;


mysql>  select * from information_schema.innodb_sys_virtual;                   
+----------+-------+----------+
| TABLE_ID | POS   | BASE_POS |
+----------+-------+----------+
|       39 | 65537 |        0 |
+----------+-------+----------+
1 row in set (0.00 sec)

But the constant column will be in SYS_COLUMNS:
mysql> select * from information_schema.innodb_sys_columns where TABLE_ID in
(select TABLE_ID from information_schema.innodb_sys_tables where name like
"test/t1");
+----------+------+--------+-------+--------+-----+
| TABLE_ID | NAME | POS    | MTYPE | PRTYPE | LEN |
+----------+------+--------+-------+--------+-----+
|       39 | a    |      0 |     6 |   1027 |   4 |
|       39 | b    |  65537 |     6 |   9219 |   4 |
|       39 | c    | 131074 |     6 |   9219 |   4 |
+----------+------+--------+-------+--------+-----+
3 rows in set (0.01 sec)

You can see column `c` is also a virtual column, but it does not have any base
columns (it is a constant). So it will have no entry in SYS_VIRTUAL but lives in
SYS_COLUMNS.
As described, the main change to support non-materialized virtual columns is to
add such column info in InnoDB system table and in memory structures. Following
changes are made for this purpose:

1) A new virtual column structure is added:

/** Data structure for a virtual column in a table */
struct dict_v_col_t{
        dict_col_t              m_col;          /* column information */
        dict_col_t**            base_col;       /* array of base column ptr,
                                                pointing to base columns */
        ulint                   num_base;       /* number of base column */
        ulint                   v_pos;          /* original column position */
};

The base column information is filled by function innodb_base_col_setup() when
we create a table with virtual column or alter table add virtual column.

2) In dict_table_t, following members are added to track virtual columns:

struct dict_table_t {

        /** Number of virtual columns defined so far. */
        unsigned                                n_v_def:10;

        /** Number of virtual columns. */
        unsigned                                n_v_cols:10;

        /** Array of virtual column descriptions. */
        dict_v_col_t*                           v_cols;
       ....
}


3) Following prtype is added:

#define DATA_VIRTUAL    8192    /* Virtual column */

4) The number of virtual column is embedded in Sys_tables' N_COLS column with
following formula:

static_cast<ulint>(table->n_def) + (static_cast<ulint>(table->n_v_def) << 16)

We could do so because the max number of columns for MySQL table is 4096, and
1017 for InnoDB(http://dev.mysql.com/doc/refman/5.6/en/innodb-restrictions.html).

5) A virtual column comes with 2 positions info:
 a) Its position in original (MySQL) table definition
 b) Its sequence number of virtual columns in InnoDB (the "nth" virtual column

a) is represented by dict_v_col_t::m_col::ind
b) is represented by dict_v_col_t::v_pos

To make them persistent, they are both encoded in SYS_COLUMN's POS field, by
following formula:

(dict_v_col_t::v_pos + 1) << 16 + dict_v_col_t::m_col::ind

We could do so again due to the number of columns in InnoDB is limited to 1017.
Encoding and decoding functions are created.

An example of innodb_sys_columns output would be following:

CREATE TABLE `t` (
  `a` int(11) DEFAULT NULL,
  `c` char(10) DEFAULT NULL,
  `d` char(20) DEFAULT NULL,
  `e` char(10) GENERATED ALWAYS AS (c) VIRTUAL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

mysql> select * from information_schema.innodb_sys_columns where TABLE_ID in
(select TABLE_ID from information_schema.innodb_sys_tables where name like
"test/t");

+----------+------+-------+-------+--------+-----+
| TABLE_ID | NAME | POS   | MTYPE | PRTYPE | LEN |
+----------+------+-------+-------+--------+-----+
|       35 | a    |     0 |     6 |   1027 |   4 |
|       35 | c    |     1 |     2 | 524542 |  10 |
|       35 | d    |     2 |     2 | 524542 |  20 |
|       35 | e    | 65539 |     2 | 532734 |  10 |
+----------+------+-------+-------+--------+-----+
4 rows in set (0.02 sec)


6) A new system table SYS_VIRTUAL is added. This system will be created if it is
not present when server boots up (see dict_create_or_check_sys_virtual()) .

/* The columns in SYS_VIRTUAL */
enum dict_col_sys_virtual_enum {
        DICT_COL__SYS_VIRTUAL__TABLE_ID         = 0,
        DICT_COL__SYS_VIRTUAL__POS              = 1,
        DICT_COL__SYS_VIRTUAL__BASE_POS         = 2,
        DICT_NUM_COLS__SYS_VIRTUAL              = 3
};
/* The field numbers in the SYS_VIRTUAL clustered index */
enum dict_fld_sys_virtual_enum {
        DICT_FLD__SYS_VIRTUAL__TABLE_ID         = 0,
        DICT_FLD__SYS_VIRTUAL__POS              = 1,
        DICT_FLD__SYS_VIRTUAL__BASE_POS         = 2,
        DICT_FLD__SYS_VIRTUAL__DB_TRX_ID        = 3,
        DICT_FLD__SYS_VIRTUAL__DB_ROLL_PTR      = 4,
        DICT_NUM_FIELDS__SYS_VIRTUAL            = 5
};

Adding a virtual column will involve adding a row in SYS_COLUMNS table, a number
of rows, one for each base column, into SYS_VIRTUAL column. And also update the
SYS_TABLES to mark how many virtual columns in N_COLS field.

Some example of SYS_VIRTUAL data is shown below:

CREATE TABLE `t` (
  `a` int(11) DEFAULT NULL,
  `b` int(11) DEFAULT NULL,
  `c` int(11) GENERATED ALWAYS AS (a+b) VIRTUAL,
  `h` varchar(10) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

mysql> select * from information_schema.innodb_sys_virtual;                    

2 rows in set (0.00 sec)
+----------+-------+----------+
| TABLE_ID | POS   | BASE_POS |
+----------+-------+----------+
|       36 | 65538 |        0 |
|       36 | 65538 |        1 |
+----------+-------+----------+
2 rows in set (2.16 sec)

The POS field matches that of corresponding row in SYS_COLUMNS:
+----------+------+-------+-------+--------+-----+
| TABLE_ID | NAME | POS   | MTYPE | PRTYPE | LEN |
+----------+------+-------+-------+--------+-----+
|       36 | a    |     0 |     6 |   1027 |   4 |
|       36 | b    |     1 |     6 |   1027 |   4 |
|       36 | h    |     2 |     1 | 524303 |  10 |
|       36 | c    | 65538 |     6 |   9219 |   4 |
+----------+------+-------+-------+--------+-----+
4 rows in set (0.00 sec)

65538 equals (0 + 1) << 16 + 2. 
The "(0+ 1) << 16" means it is the first virtual column. 
The "2" means it is the 3rd column in original MySQL table.

If there is a constant virtual column, it does not have any base column, so it
will not enter into SYS_VIRTUAL table:

 CREATE TABLE `t1` (
  `a` int(11) DEFAULT NULL,
  `b` int(11) GENERATED ALWAYS AS (a+1) VIRTUAL,
  `c` int(11) GENERATED ALWAYS AS (5) VIRTUAL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

mysql> select * from information_schema.innodb_sys_virtual;

+----------+-------+----------+
| TABLE_ID | POS   | BASE_POS |
+----------+-------+----------+
|       37 | 65537 |        0 |
+----------+-------+----------+
1 row in set (1.69 sec)

But the constant virtual column is still represented in SYS_COLUMNS:

mysql> select * from information_schema.innodb_sys_columns where TABLE_ID in
(select TABLE_ID from information_schema.innodb_sys_tables where name like
"test/t1");
+----------+------+--------+-------+--------+-----+
| TABLE_ID | NAME | POS    | MTYPE | PRTYPE | LEN |
+----------+------+--------+-------+--------+-----+
|       37 | a    |      0 |     6 |   1027 |   4 |
|       37 | b    |  65537 |     6 |   9219 |   4 |
|       37 | c    | 131074 |     6 |   9219 |   4 |
+----------+------+--------+-------+--------+-----+
3 rows in set (0.01 sec)

Following functions are added to enter a SYS_VIRTUAL row into the table:

dict_build_v_col_def_step
dict_create_sys_virtual_tuple

and during server boot time, the "base column info" is loaded by calling
dict_load_virtual_low() and dict_load_virtual_one_col()