MySQL :: Virtual Columns and Effective Functional Indexes in InnoDB

In April, I wrote a blog post introducing the new InnoDB virtual column and effective functional index work for the JSON lab release. Now the feature is officially in MySQL 5.7.8. It’s worth revisiting the topic again in order to write about what is in the 5.7 Release Candidate, along with providing some additional examples. I also ran some quick performance tests comparing some simple operations on a table with and without virtual columns and indexes, comparing them with those using materialized generated columns.

Key Features

The earlier article already described the design in detail. If you have not already read it, I would recommend that you check it out first. I won’t go into all of the design detail again here, but I will restate some key points:

The virtual columns are no longer materialized in the table. They are truly “virtual”, which means they are NOT stored in InnoDB rows (InnoDB stores the table data in its primary key records—a clustered index—so it means that virtual columns will not be present in InnoDB primary key records), thus decreasing the overall size of the table. This then allows for faster table scans and other large operations.
Since they are truly “virtual”, adding and dropping virtual columns does not require a table rebuild. Rather, it only requires a quick system table update that registers the new metadata. This makes the schema changes simple and fast.
Creating an index on a virtual column (only secondary indexes are allowed) will essentially “materialize” the virtual column in the index records as the computed values are then stored in the secondary index, but the values are not stored in the primary key (clustered index). So the table itself is still small, and you can quickly look up the computed (and stored) virtual column values in the secondary index.
Once an index is created on a virtual column, the value for such a virtual column is MVCC logged so as to avoid unnecessary re-computation of the generated column value later when we have to perform a rollback or purge operation. However, since its purpose is in maintaining the secondary index only, we will only log only up to a certain limited length of the data in order to save space, since our index has a key length limitation of 767 bytes for COMPACT and REDUNDANT for formats, and 3072 bytes for DYNAMIC and COMPRESSED row formats.

Changes Since the Lab Release

There are a few noteworthy and useful changes/additions since the initial Lab release:

A single “functional index” can now be created on a combination of both virtual columns and non-virtual generated columns. That is, you can create a composite index on a mix of virtual and non-virtual generated columns.
Users can create functional indexes ONLINE using the in-place algorithm so that DML statements can still be processed while the index is being created. In order to achieve that, the virtual column values used within the concurrent DML statements are computed and logged while the index is being created, and later replayed on the functional index.
Users can create virtual columns based on other virtual columns, and then index them.
The next improvement is less visible to user, but still worth mentioning. It is about enhancements to the purge related activities on indexed virtual columns. A new callback (WL#8841) provides a server layer function that can be called by InnoDB purge threads to compute virtual column index values. Generally this computation is done from connection threads (or sessions), however, since internal InnoDB purge threads do not correspond to connections/sessions and thus don’t have THDs or access to TABLE objects, this work was necessary in order to provide a server layer callback which enables the purge threads to make the necessary computations.

Limitations

There are still some notable restrictions around the “functional indexes”, some of which will be lifted by later work:

Primary Keys cannot be added on virtual columns.
You cannot create a spatial or fulltext index on virtual columns (this limitation will eventually be lifted).
A virtual index cannot be used as a foreign key.

You cannot create virtual columns on non-repeatable/non-deterministic functions. For example:

mysql> ALTER TABLE `t` ADD p3 DATE GENERATED ALWAYS AS (curtime()) virtual;
ERROR 3102 (HY000): Expression of generated column 'p3' contains a disallowed function.

1 2	mysql> ALTER TABLE `t` ADD p3 DATE GENERATED ALWAYS AS (curtime()) virtual; ERROR 3102 (HY000): Expression of generated column 'p3' contains a disallowed function.

Adding and dropping of virtual columns can be done in-place or online only when done as single operations by themselves, and not when combined with other table alterations. This limit will be removed later, but you can always go around it by keeping the in-place ADD/DROP virtual columns operations contained within a separate DDL statement.

A Few More Examples

In the previous blog post we have an example on how to use a “functional index” in conjunction with some JSON functions. Users can essentially use any functions for virtual columns, except for those that are non-deterministic (such as NOW()). So let’s next walk through some additional examples using some non-JSON functions:

Indexes on XML fields

mysql> create table t(a int, b varchar(100), c varchar(100) generated always as (ExtractValue(b, '//b[1]')) virtual);
Query OK, 0 rows affected (0.22 sec)

mysql> insert into t values (1, '<a><b>X</b><b>Y</b></a>', default);
Query OK, 1 row affected (0.05 sec)

mysql> select * from t;
+------+-------------------------+------+
| a    | b                       | c    |
+------+-------------------------+------+
|    1 | <a><b>X</b><b>Y</b></a> | X    |
+------+-------------------------+------+
1 row in set (0.00 sec)

mysql> create table t(a int, b varchar(100), c varchar(100) generated always as (ExtractValue(b, '//b[1]')) virtual);

Query OK, 0 rows affected (0.22 sec)

mysql> insert into t values (1, '<a>XY</a>', default);

Query OK, 1 row affected (0.05 sec)

mysql> select * from t;

+------+-------------------------+------+

| a | b | c |

+------+-------------------------+------+

| 1 | <a>XY</a> | X |

+------+-------------------------+------+

1 row in set (0.00 sec)

Indexes on Geometry calculations

Here is an example showing how you can quickly add a virtual column that stores the distance (in meters) between two geographic points or (LONGITUDE, LATITUDE) coordinate pairs:

First we’ll create a table with some geography data in it:

mysql> CREATE TABLE t (
id int(11) NOT NULL,
p1 geometry DEFAULT NULL,
p2 geometry DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB;
Query OK, 0 rows affected (0.20 sec)

mysql> insert into t values(1, POINT(-75.341621, 41.061987), POINT(-75.3555043, 41.0515628));
Query OK, 1 row affected (0.04 sec)

mysql> insert into t values(2, POINT(-75.341621, 41.061987), POINT(-75.3215434, 41.0595024));
Query OK, 1 row affected (0.04 sec)

mysql> CREATE TABLE t (

id int(11) NOT NULL,

p1 geometry DEFAULT NULL,

p2 geometry DEFAULT NULL,

PRIMARY KEY (`id`)

) ENGINE=InnoDB;

Query OK, 0 rows affected (0.20 sec)

mysql> insert into t values(1, POINT(-75.341621, 41.061987), POINT(-75.3555043, 41.0515628));

Query OK, 1 row affected (0.04 sec)

mysql> insert into t values(2, POINT(-75.341621, 41.061987), POINT(-75.3215434, 41.0595024));

Query OK, 1 row affected (0.04 sec)

Now you want to measure the distance (in meters) between the two points. You can quickly ADD a virtual column and then index it, all without having to rebuild the table:

mysql> ALTER TABLE t ADD distance double GENERATED ALWAYS AS (st_distance_sphere(p1, p2)) VIRTUAL;
Query OK, 0 rows affected (0.10 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> create index idx on t(distance);
Query OK, 0 rows affected (0.27 sec)
Records: 0 Duplicates: 0 Warnings: 0

mysql> ALTER TABLE t ADD distance double GENERATED ALWAYS AS (st_distance_sphere(p1, p2)) VIRTUAL;

Query OK, 0 rows affected (0.10 sec)

Records: 0 Duplicates: 0 Warnings: 0

mysql> create index idx on t(distance);

Query OK, 0 rows affected (0.27 sec)

Records: 0 Duplicates: 0 Warnings: 0

Now you can query this table easily and quickly using this new virtual column and its index:

mysql> explain select distance from t\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: t
   partitions: NULL
         type: index
possible_keys: NULL
          key: idx
      key_len: 9
          ref: NULL
         rows: 2
     filtered: 100.00
        Extra: Using index
1 row in set, 1 warning (0.00 sec)

mysql> select distance, id from t;
+--------------------+----+
| distance           | id |
+--------------------+----+
| 1642.7497709937588 |  1 |
| 1705.8728579019303 |  2 |
+--------------------+----+
2 rows in set (0.00 sec)

mysql> explain select distance from t\G

*************************** 1. row ***************************

id: 1

select_type: SIMPLE

table: t

partitions: NULL

type: index

possible_keys: NULL

key: idx

key_len: 9

ref: NULL

rows: 2

filtered: 100.00

Extra: Using index

1 row in set, 1 warning (0.00 sec)

mysql> select distance, id from t;

+--------------------+----+

| distance | id |

+--------------------+----+

| 1642.7497709937588 | 1 |

| 1705.8728579019303 | 2 |

+--------------------+----+

2 rows in set (0.00 sec)

String manipulation

You can also add some columns that use any of the string manipulation functions. For example:

mysql> CREATE TABLE `t` (
`a` varchar(100) DEFAULT NULL,
`b` varchar(100) DEFAULT NULL
) ENGINE=InnoDB;
Query OK, 0 rows affected (0.20 sec)

mysql> insert into t values ("this is an experiment", "with string manipulation");
Query OK, 1 row affected (0.03 sec)

mysql> ALTER TABLE t ADD COLUMN count1 int GENERATED ALWAYS AS (char_length(a)) VIRTUAL;
Query OK, 0 rows affected (0.06 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> ALTER TABLE t ADD COLUMN count2 int GENERATED ALWAYS AS (char_length(b)) VIRTUAL;
Query OK, 0 rows affected (0.06 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> select * from t;
+-----------------------+--------------------------+--------+--------+
| a                     | b                        | count1 | count2 |
+-----------------------+--------------------------+--------+--------+
| this is an experiment | with string manipulation |     21 |     24 |
+-----------------------+--------------------------+--------+--------+
1 row in set (0.00 sec)

mysql> ALTER TABLE t ADD COLUMN count3 int GENERATED ALWAYS AS (instr(a, "exp")) VIRTUAL;
Query OK, 0 rows affected (0.07 sec)
Records: 0 Duplicates: 0 Warnings: 0

mysql> select count3 from t;
+--------+
| count3 |
+--------+
|     12 |
+--------+
1 row in set (0.00 sec)

mysql> CREATE TABLE `t` (

`a` varchar(100) DEFAULT NULL,

`b` varchar(100) DEFAULT NULL

) ENGINE=InnoDB;

Query OK, 0 rows affected (0.20 sec)

mysql> insert into t values ("this is an experiment", "with string manipulation");

Query OK, 1 row affected (0.03 sec)

mysql> ALTER TABLE t ADD COLUMN count1 int GENERATED ALWAYS AS (char_length(a)) VIRTUAL;

Query OK, 0 rows affected (0.06 sec)

Records: 0 Duplicates: 0 Warnings: 0

mysql> ALTER TABLE t ADD COLUMN count2 int GENERATED ALWAYS AS (char_length(b)) VIRTUAL;

Query OK, 0 rows affected (0.06 sec)

Records: 0 Duplicates: 0 Warnings: 0

mysql> select * from t;

+-----------------------+--------------------------+--------+--------+

| a | b | count1 | count2 |

+-----------------------+--------------------------+--------+--------+

| this is an experiment | with string manipulation | 21 | 24 |

+-----------------------+--------------------------+--------+--------+

1 row in set (0.00 sec)

mysql> ALTER TABLE t ADD COLUMN count3 int GENERATED ALWAYS AS (instr(a, "exp")) VIRTUAL;

Query OK, 0 rows affected (0.07 sec)

Records: 0 Duplicates: 0 Warnings: 0

mysql> select count3 from t;

+--------+

| count3 |

+--------+

| 12 |

+--------+

1 row in set (0.00 sec)

Lastly, it’s worth noting that you can also create your own custom functions (or UDFs) and use those as the basis for your virtual columns.

Some Quick Performance Benchmarks

As expected, there will be some additional write cost when using an index on a virtual column due to the necessary computation of the virtual columns when they need to be materialized (e.g. INSERT or UPDATE), in other words the costs will be associated with creating and maintaining the index. If the column does not have a functional index, however, then the cost will instead be associated with reads as the value will need to be materialized any time that the row is examined. The added cost is also directly related to the complexity of the computation functions used.

However, even with such additional costs, using virtual columns and “functional indexes” can still be far better than creating the table with such STORED generated columns as the latter materialize the data in the clustered index (primary key), thus resulting in a larger table (both on disk and in memory).

These quick tests are conducted on 3 types of table:

A table with virtual columns:

CREATE TABLE `t` (
`h` INT NOT NULL PRIMARY KEY,
`a` varchar(30),
`b` BLOB,
`v_a_b` BLOB GENERATED ALWAYS AS (CONCAT(a,b)) VIRTUAL,
`v_b` BLOB GENERATED ALWAYS AS (b) VIRTUAL,
`e` int,
`v_h_e` INT(11) GENERATED ALWAYS AS (h + e) VIRTUAL,
`v_e` INT GENERATED ALWAYS AS (e) VIRTUAL,
`v_a` INT GENERATED ALWAYS AS (char_length(a)) VIRTUAL
) ENGINE=InnoDB;

CREATE TABLE `t` (

`h` INT NOT NULL PRIMARY KEY,

`a` varchar(30),

`b` BLOB,

`v_a_b` BLOB GENERATED ALWAYS AS (CONCAT(a,b)) VIRTUAL,

`v_b` BLOB GENERATED ALWAYS AS (b) VIRTUAL,

`e` int,

`v_h_e` INT(11) GENERATED ALWAYS AS (h + e) VIRTUAL,

`v_e` INT GENERATED ALWAYS AS (e) VIRTUAL,

`v_a` INT GENERATED ALWAYS AS (char_length(a)) VIRTUAL

) ENGINE=InnoDB;

This table has 4 normal columns and 5 VIRTUAL columns. We make the computation function used very simple so as to minimize the impact from the function itself.

A “normal” table without a any generated columns at all (neither VIRTUAL or STORED):

MySQL

CREATE TABLE `t_nv` ( `h` INT NOT NULL PRIMARY KEY, `a` VARCHAR(30), `b` BLOB, `e` INT ) ENGINE=InnoDB;

1

2

3

4

5

6

CREATE TABLE `t_nv` (
`h` INT NOT NULL PRIMARY KEY,
`a` VARCHAR(30),
`b` BLOB,
`e` INT
) ENGINE=InnoDB;

A table with materialized or STORED generated columns:

CREATE TABLE `t_m` (
`h` INT NOT NULL PRIMARY KEY,
`a` varchar(30),
`b` BLOB,
`v_a_b` BLOB GENERATED ALWAYS AS (CONCAT(a,b)) STORED,
`v_b` BLOB GENERATED ALWAYS AS (b) STORED,
`e` int,
`v_h_e` INT(11) GENERATED ALWAYS AS (h + e) STORED,
`v_e` INT GENERATED ALWAYS AS (e) STORED,
`v_a` INT GENERATED ALWAYS AS (char_length(a)) STORED
) ENGINE=InnoDB;

CREATE TABLE `t_m` (

`h` INT NOT NULL PRIMARY KEY,

`a` varchar(30),

`b` BLOB,

`v_a_b` BLOB GENERATED ALWAYS AS (CONCAT(a,b)) STORED,

`v_b` BLOB GENERATED ALWAYS AS (b) STORED,

`e` int,

`v_h_e` INT(11) GENERATED ALWAYS AS (h + e) STORED,

`v_e` INT GENERATED ALWAYS AS (e) STORED,

`v_a` INT GENERATED ALWAYS AS (char_length(a)) STORED

) ENGINE=InnoDB;

We then use the following procedure to INSERT rows into each of these tables:

CREATE PROCEDURE insert_values(n1 int, n2 int)
begin
        DECLARE i INT DEFAULT 1;
        WHILE (i+n1 <= 100000+n2) DO
                INSERT INTO t VALUES (n1+i, CAST(n1+i AS CHAR), REPEAT('b', 2000), DEFAULT, DEFAULT, n1+i+10, DEFAULT, DEFAULT, DEFAULT);
                INSERT INTO t_nv VALUES (n1+i, CAST(n1+i AS CHAR),  REPEAT('b', 2000), n1+i+10);
                INSERT INTO t_m VALUES (n1+i, CAST(n1+i AS CHAR), REPEAT('b', 2000), DEFAULT, DEFAULT, n1+i+10, DEFAULT, DEFAULT, DEFAULT);
                SET i = i + 1;
        END WHILE;
END|

CREATE PROCEDURE insert_values(n1 int, n2 int)

begin

DECLARE i INT DEFAULT 1;

WHILE (i+n1 <= 100000+n2) DO

INSERT INTO t VALUES (n1+i, CAST(n1+i AS CHAR), REPEAT('b', 2000), DEFAULT, DEFAULT, n1+i+10, DEFAULT, DEFAULT, DEFAULT);

INSERT INTO t_nv VALUES (n1+i, CAST(n1+i AS CHAR), REPEAT('b', 2000), n1+i+10);

INSERT INTO t_m VALUES (n1+i, CAST(n1+i AS CHAR), REPEAT('b', 2000), DEFAULT, DEFAULT, n1+i+10, DEFAULT, DEFAULT, DEFAULT);

SET i = i + 1;

END WHILE;

END|

All tests were conducted on a 48 core x86_64 GNU/Linux machine, with a 10GB InnoDB buffer pool. All tests were run with a single thread. Each number comes from averages of over 3 runs. Here are the results:

Insertion without an index:

	Insert of 500,000 row	Insert of 1,000,000 row
Table 1. with virtual column (t)	3 min 24.65 sec	6 min 59.91 sec
Table 2. without virtual column (t_nv)	3 min 21.41 sec	6 min 31.82 sec
Table 3. with materialized column	4 min 25.58 sec	8 min 43.66 sec

So for insertion into tables without any secondary index, the time is very similar for a table without any generated columns at all and the one with VIRTUAL columns. The latter does not have those columns materialized, so the amount of data inserted is exactly the same for the first two tables. However, if the columns are materialized/stored (as in table 3), the time will take substantially longer.

One thing to note is that even though the time taken for table 1 and table 2 are very similar, table 1 insertion still takes slightly longer. This is due an issue that some unnecessary computation is still done for a table with virtual columns, this will be fixed soon.

Creating an index

Create index on similar columns for 3 different tables:

	Table with 1,000,000 row
Create index on t(v_e)	2.90 sec
Create index on t_nv(e)	2.40 sec
Create index on t_m(v_e) (Table 3. with materialized column )	3.31 sec

Create the index on table 1’s virtual column v_e is a bit more expensive than its base column in table 2, since indexing virtual column(s) requires computing the value. However, in this case, it is still faster than creating an index on the same column with a STORED generated column, simply because the table with generated columns is so much bigger, a simple scan takes more time.

A few more runs of CREATE INDEX on the table t, just to show the scale of the costs when adding indexes on a virtual column.

	Time to create index on 1,000,000 rows on table 1 (table `t` with virtual columns)
Create index on column `v_e`	2.90 sec
Create index on column `e`	2.47 sec
Create index on column `v_b(3)`	3.26 sec
Create index on column `b(3)`	2.67 sec
Create index on column `v_h_e`	2.97 sec
Create index on column `v_a`	3.06 sec
Create index on column `v_a_b(10)`	4.19 sec

As mentioned, creating an index on virtual columns are a bit more costly than creating an index on normal columns, since the computation needs to be performed on each row.

Adding a new column

ALTER TABLE ... ADD COLUMN would usually require a full table rebuild for normal or STORED generated columns. But if you add a virtual column, it is not required and thus will be almost instant.

	ALTER TABLE ADD COLUMN on table with 1million rows
alter table t_nv add column col1 int;	1 min 20.50 sec
alter table t_nv add column col2 int GENERATED ALWAYS AS (e) stored;	1 min 32.40 sec
alter table t_nv add column col3 int GENERATED ALWAYS AS (e) virtual;	0.10 sec

So if you add a virtual column and then materialize it via CREATE INDEX, it will only take a few seconds (2 to 3 sec for creating the index according to previous experiment). If you do that for a normal column or generated column, it will take 50x to 60x more time (mostly spent in rebuilding the table).

Dropping a column

Similarly, dropping a virtual column is far faster for the same reasons.

	ALTER TABLE DROP COLUMN on table with 1million rows
alter table t_nv drop column col1;	47.02 sec
alter table t_nv drop column col2; (A GENERATED column)	50.41 sec
alter table t_nv add column col3; (A virtual column)	0.10 sec

DMLs with a virtual index or “functional index”:

INSERT

Insert of 500,000 rows

Table 1 with functional index on column v_e 6 min 57.31 sec

Table 2 with index on column e 6 min 33.09 sec

Table 3 with index on column v_e 9 min 5.24 sec

As shown in this example, for a table with indexed virtual columns, ts insertion times are significantly less than table 3, which materializes the value in the clustered index (primary key).

UPDATE

The following UPDATE statements were then performed on the 3 tables, and results were:

mysql> update t set e=e+1;

	Update time on table with 1,000,000 rows
Update on table 1 with index on virtual column `v_e`	1 min 20.39 sec
Update on table 2 with index on column `e`	52.26 sec
Update on table 3 with index on materialized column `v_e`	1 min 2.52 sec

As you can see, UPDATE statements on indexed virtual columns are more expensive. This demonstrates the additional MVCC costs associated with the operation (in addition to any operation associated with column e) because 1) The old value for v_e needs to be computed (for the UNDO log) and 2) The old and new values for v_e will need to be UNDO logged.

DELETE

	Delete of 1,000,000 rows
Delete all row with index on virtual column ‘v_e’	21.52 sec
Delete all row with index on ‘e’	20.54 sec
Delete on table 3 with index on materialized column `v_e`	32.09 sec

The DELETE statements were faster on table with VIRTUAL generated columns than those for table with the STORED generated columns. The table with virtual columns is apparently much smaller than that with materialized column. So the deletion operation is much faster.

The DELETE operation will also require a little extra MVCC work if there are indexes on virtual columns because 1) The old value for v_e needs to be computed (for the UNDO log) and 2) The old and new values for v_e will need to be UNDO logged.

So the DELETE statement is a little bit more expensive than when using a regular column, but much faster than those with STORED generated columns.

SELECT Queries:

Of course, as expected, the table with STORED generated columns is much larger than the one with VIRTUAL generated columns. And this is clearly shown with a quick table scan (after the initial run to bring the data into the buffer pool)

	select count(*) from t
Table 1. with virtual column (t)	0.59 sec
Table 2. without virtual column (t_no_v)	0.60 sec
Table 3. with materialized column	1.30 sec

As shown above, a table scan on the table with STORED generated columns took 3x longer than the scan on the table with VIRTUAL generated columns.

While the table with virtual columns and indexes remains small, it still takes advantage of having a materialized (secondary) index to facilitate efficient queries:

	Query on char_length(a) on table with 1,000,000 rows
Table 1 with virtual column and index on ‘char_length(a)’	0.00 sec
Table 2 without index on ‘char_length(a)’	0.66 sec
Table 3 with stored column and index on ‘char_length(a)’	0.00 sec

Without the “functional index” on the ‘char_length(a)‘ value, table 2 requires a full table scan to get the results.

Summary

The virtual column and functional index work is now officially in 5.7! This is a great feature that allows users to ADD and DROP VIRTUAL generated columns, along with adding optional secondary indexes on those columns, all done using ONLINE operations. As shown in the simple performance study above, the data materialized in such a way keeps the base table small (as it does not have duplicate copies in the InnoDB clustered/primary index) and thus making more efficient use of persistent storage space and memory, while at the same time providing vastly improved query performance!

That’s it for now. As always, THANK YOU for using MySQL!