In April, I wrote a blog post introducing the new InnoDB virtual column and effective functional index work for the JSON lab release. Now the feature is officially in MySQL 5.7.8. It’s worth revisiting the topic again in order to write about what is in the 5.7 Release Candidate, along with providing some additional examples. I also ran some quick performance tests comparing some simple operations on a table with and without virtual columns and indexes, comparing them with those using materialized generated columns.
Key Features
The earlier article already described the design in detail. If you have not already read it, I would recommend that you check it out first. I won’t go into all of the design detail again here, but I will restate some key points:
- The virtual columns are no longer materialized in the table. They are truly “virtual”, which means they are NOT stored in InnoDB rows (InnoDB stores the table data in its primary key records—a clustered index—so it means that virtual columns will not be present in InnoDB primary key records), thus decreasing the overall size of the table. This then allows for faster table scans and other large operations.
- Since they are truly “virtual”, adding and dropping virtual columns does not require a table rebuild. Rather, it only requires a quick system table update that registers the new metadata. This makes the schema changes simple and fast.
- Creating an index on a virtual column (only secondary indexes are allowed) will essentially “materialize” the virtual column in the index records as the computed values are then stored in the secondary index, but the values are not stored in the primary key (clustered index). So the table itself is still small, and you can quickly look up the computed (and stored) virtual column values in the secondary index.
- Once an index is created on a virtual column, the value for such a virtual column is MVCC logged so as to avoid unnecessary re-computation of the generated column value later when we have to perform a rollback or purge operation. However, since its purpose is in maintaining the secondary index only, we will only log only up to a certain limited length of the data in order to save space, since our index has a key length limitation of 767 bytes for COMPACT and REDUNDANT for formats, and 3072 bytes for DYNAMIC and COMPRESSED row formats.
Changes Since the Lab Release
There are a few noteworthy and useful changes/additions since the initial Lab release:
- A single “functional index” can now be created on a combination of both virtual columns and non-virtual generated columns. That is, you can create a composite index on a mix of virtual and non-virtual generated columns.
- Users can create functional indexes ONLINE using the
in-place
algorithm so that DML statements can still be processed while the index is being created. In order to achieve that, the virtual column values used within the concurrent DML statements are computed and logged while the index is being created, and later replayed on the functional index. - Users can create virtual columns based on other virtual columns, and then index them.
- The next improvement is less visible to user, but still worth mentioning. It is about enhancements to the purge related activities on indexed virtual columns. A new callback (WL#8841) provides a server layer function that can be called by InnoDB purge threads to compute virtual column index values. Generally this computation is done from connection threads (or sessions), however, since internal InnoDB purge threads do not correspond to connections/sessions and thus don’t have
THD
s or access toTABLE
objects, this work was necessary in order to provide a server layer callback which enables the purge threads to make the necessary computations.
Limitations
There are still some notable restrictions around the “functional indexes”, some of which will be lifted by later work:
- Primary Keys cannot be added on virtual columns.
- You cannot create a spatial or fulltext index on virtual columns (this limitation will eventually be lifted).
- A virtual index cannot be used as a foreign key.
- You cannot create virtual columns on non-repeatable/non-deterministic functions. For example:
12mysql> ALTER TABLE `t` ADD p3 DATE GENERATED ALWAYS AS (curtime()) virtual;ERROR 3102 (HY000): Expression of generated column 'p3' contains a disallowed function. - Adding and dropping of virtual columns can be done in-place or online only when done as single operations by themselves, and not when combined with other table alterations. This limit will be removed later, but you can always go around it by keeping the in-place ADD/DROP virtual columns operations contained within a separate DDL statement.
A Few More Examples
In the previous blog post we have an example on how to use a “functional index” in conjunction with some JSON functions. Users can essentially use any functions for virtual columns, except for those that are non-deterministic (such as NOW()
). So let’s next walk through some additional examples using some non-JSON functions:
-
Indexes on XML fields
12345678910111213mysql> create table t(a int, b varchar(100), c varchar(100) generated always as (ExtractValue(b, '//b[1]')) virtual);Query OK, 0 rows affected (0.22 sec)mysql> insert into t values (1, '<a><b>X</b><b>Y</b></a>', default);Query OK, 1 row affected (0.05 sec)mysql> select * from t;+------+-------------------------+------+| a | b | c |+------+-------------------------+------+| 1 | <a><b>X</b><b>Y</b></a> | X |+------+-------------------------+------+1 row in set (0.00 sec) -
Indexes on Geometry calculations
Here is an example showing how you can quickly add a virtual column that stores the distance (in meters) between two geographic points or
(LONGITUDE, LATITUDE)
coordinate pairs:First we’ll create a table with some geography data in it:
12345678910111213mysql> CREATE TABLE t (id int(11) NOT NULL,p1 geometry DEFAULT NULL,p2 geometry DEFAULT NULL,PRIMARY KEY (`id`)) ENGINE=InnoDB;Query OK, 0 rows affected (0.20 sec)mysql> insert into t values(1, POINT(-75.341621, 41.061987), POINT(-75.3555043, 41.0515628));Query OK, 1 row affected (0.04 sec)mysql> insert into t values(2, POINT(-75.341621, 41.061987), POINT(-75.3215434, 41.0595024));Query OK, 1 row affected (0.04 sec)Now you want to measure the distance (in meters) between the two points. You can quickly ADD a virtual column and then index it, all without having to rebuild the table:
1234567mysql> ALTER TABLE t ADD distance double GENERATED ALWAYS AS (st_distance_sphere(p1, p2)) VIRTUAL;Query OK, 0 rows affected (0.10 sec)Records: 0 Duplicates: 0 Warnings: 0mysql> create index idx on t(distance);Query OK, 0 rows affected (0.27 sec)Records: 0 Duplicates: 0 Warnings: 0Now you can query this table easily and quickly using this new virtual column and its index:
123456789101112131415161718192021222324mysql> explain select distance from t\G*************************** 1. row ***************************id: 1select_type: SIMPLEtable: tpartitions: NULLtype: indexpossible_keys: NULLkey: idxkey_len: 9ref: NULLrows: 2filtered: 100.00Extra: Using index1 row in set, 1 warning (0.00 sec)mysql> select distance, id from t;+--------------------+----+| distance | id |+--------------------+----+| 1642.7497709937588 | 1 || 1705.8728579019303 | 2 |+--------------------+----+2 rows in set (0.00 sec) -
String manipulation
You can also add some columns that use any of the string manipulation functions. For example:
123456789101112131415161718192021222324252627282930313233343536mysql> CREATE TABLE `t` (`a` varchar(100) DEFAULT NULL,`b` varchar(100) DEFAULT NULL) ENGINE=InnoDB;Query OK, 0 rows affected (0.20 sec)mysql> insert into t values ("this is an experiment", "with string manipulation");Query OK, 1 row affected (0.03 sec)mysql> ALTER TABLE t ADD COLUMN count1 int GENERATED ALWAYS AS (char_length(a)) VIRTUAL;Query OK, 0 rows affected (0.06 sec)Records: 0 Duplicates: 0 Warnings: 0mysql> ALTER TABLE t ADD COLUMN count2 int GENERATED ALWAYS AS (char_length(b)) VIRTUAL;Query OK, 0 rows affected (0.06 sec)Records: 0 Duplicates: 0 Warnings: 0mysql> select * from t;+-----------------------+--------------------------+--------+--------+| a | b | count1 | count2 |+-----------------------+--------------------------+--------+--------+| this is an experiment | with string manipulation | 21 | 24 |+-----------------------+--------------------------+--------+--------+1 row in set (0.00 sec)mysql> ALTER TABLE t ADD COLUMN count3 int GENERATED ALWAYS AS (instr(a, "exp")) VIRTUAL;Query OK, 0 rows affected (0.07 sec)Records: 0 Duplicates: 0 Warnings: 0mysql> select count3 from t;+--------+| count3 |+--------+| 12 |+--------+1 row in set (0.00 sec)
Lastly, it’s worth noting that you can also create your own custom functions (or UDFs) and use those as the basis for your virtual columns.
Some Quick Performance Benchmarks
As expected, there will be some additional write cost when using an index on a virtual column due to the necessary computation of the virtual columns when they need to be materialized (e.g. INSERT
or UPDATE
), in other words the costs will be associated with creating and maintaining the index. If the column does not have a functional index, however, then the cost will instead be associated with reads as the value will need to be materialized any time that the row is examined. The added cost is also directly related to the complexity of the computation functions used.
However, even with such additional costs, using virtual columns and “functional indexes” can still be far better than creating the table with such STORED
generated columns as the latter materialize the data in the clustered index (primary key), thus resulting in a larger table (both on disk and in memory).
These quick tests are conducted on 3 types of table:
- A table with virtual columns:
1234567891011CREATE TABLE `t` (`h` INT NOT NULL PRIMARY KEY,`a` varchar(30),`b` BLOB,`v_a_b` BLOB GENERATED ALWAYS AS (CONCAT(a,b)) VIRTUAL,`v_b` BLOB GENERATED ALWAYS AS (b) VIRTUAL,`e` int,`v_h_e` INT(11) GENERATED ALWAYS AS (h + e) VIRTUAL,`v_e` INT GENERATED ALWAYS AS (e) VIRTUAL,`v_a` INT GENERATED ALWAYS AS (char_length(a)) VIRTUAL) ENGINE=InnoDB;
This table has 4 normal columns and 5VIRTUAL
columns. We make the computation function used very simple so as to minimize the impact from the function itself. - A “normal” table without a any generated columns at all (neither
VIRTUAL
orSTORED
):
123456CREATE TABLE `t_nv` (`h` INT NOT NULL PRIMARY KEY,`a` VARCHAR(30),`b` BLOB,`e` INT) ENGINE=InnoDB; - A table with materialized or STORED generated columns:
1234567891011CREATE TABLE `t_m` (`h` INT NOT NULL PRIMARY KEY,`a` varchar(30),`b` BLOB,`v_a_b` BLOB GENERATED ALWAYS AS (CONCAT(a,b)) STORED,`v_b` BLOB GENERATED ALWAYS AS (b) STORED,`e` int,`v_h_e` INT(11) GENERATED ALWAYS AS (h + e) STORED,`v_e` INT GENERATED ALWAYS AS (e) STORED,`v_a` INT GENERATED ALWAYS AS (char_length(a)) STORED) ENGINE=InnoDB;
We then use the following procedure to INSERT
rows into each of these tables:
1
2
3
4
5
6
7
8
9
10
|
CREATE PROCEDURE insert_values(n1 int, n2 int) begin DECLARE i INT DEFAULT 1; WHILE (i+n1 <= 100000+n2) DO INSERT INTO t VALUES (n1+i, CAST(n1+i AS CHAR), REPEAT('b', 2000), DEFAULT, DEFAULT, n1+i+10, DEFAULT, DEFAULT, DEFAULT); INSERT INTO t_nv VALUES (n1+i, CAST(n1+i AS CHAR), REPEAT('b', 2000), n1+i+10); INSERT INTO t_m VALUES (n1+i, CAST(n1+i AS CHAR), REPEAT('b', 2000), DEFAULT, DEFAULT, n1+i+10, DEFAULT, DEFAULT, DEFAULT); SET i = i + 1; END WHILE; END| |
All tests were conducted on a 48 core x86_64 GNU/Linux machine, with a 10GB InnoDB buffer pool. All tests were run with a single thread. Each number comes from averages of over 3 runs. Here are the results:
-
Insertion without an index:
Insert of 500,000 row Insert of 1,000,000 row Table 1. with virtual column (t) 3 min 24.65 sec 6 min 59.91 sec Table 2. without virtual column (t_nv) 3 min 21.41 sec 6 min 31.82 sec Table 3. with materialized column 4 min 25.58 sec 8 min 43.66 sec So for insertion into tables without any secondary index, the time is very similar for a table without any generated columns at all and the one with
VIRTUAL
columns. The latter does not have those columns materialized, so the amount of data inserted is exactly the same for the first two tables. However, if the columns are materialized/stored (as in table 3), the time will take substantially longer.One thing to note is that even though the time taken for table 1 and table 2 are very similar, table 1 insertion still takes slightly longer. This is due an issue that some unnecessary computation is still done for a table with virtual columns, this will be fixed soon.
-
Creating an index
Create index on similar columns for 3 different tables:
Table with 1,000,000 row Create index on t(v_e) 2.90 sec Create index on t_nv(e) 2.40 sec Create index on t_m(v_e) (Table 3. with materialized column ) 3.31 sec Create the index on table 1’s virtual column
v_e
is a bit more expensive than its base column in table 2, since indexing virtual column(s) requires computing the value. However, in this case, it is still faster than creating an index on the same column with aSTORED
generated column, simply because the table with generated columns is so much bigger, a simple scan takes more time.A few more runs of
CREATE INDEX
on the tablet
, just to show the scale of the costs when adding indexes on a virtual column.Time to create index on 1,000,000 rows on table 1 (table t
with virtual columns)Create index on column v_e
2.90 sec Create index on column e
2.47 sec Create index on column v_b(3)
3.26 sec Create index on column b(3)
2.67 sec Create index on column v_h_e
2.97 sec Create index on column v_a
3.06 sec Create index on column v_a_b(10)
4.19 sec As mentioned, creating an index on virtual columns are a bit more costly than creating an index on normal columns, since the computation needs to be performed on each row.
-
Adding a new column
ALTER TABLE ... ADD COLUMN
would usually require a full table rebuild for normal or STORED generated columns. But if you add a virtual column, it is not required and thus will be almost instant.ALTER TABLE ADD COLUMN on table with 1million rows alter table t_nv add column col1 int; 1 min 20.50 sec alter table t_nv add column col2 int GENERATED ALWAYS AS (e) stored; 1 min 32.40 sec alter table t_nv add column col3 int GENERATED ALWAYS AS (e) virtual; 0.10 sec So if you add a virtual column and then materialize it via CREATE INDEX, it will only take a few seconds (2 to 3 sec for creating the index according to previous experiment). If you do that for a normal column or generated column, it will take 50x to 60x more time (mostly spent in rebuilding the table).
-
Dropping a column
Similarly, dropping a virtual column is far faster for the same reasons.
ALTER TABLE DROP COLUMN on table with 1million rows alter table t_nv drop column col1; 47.02 sec alter table t_nv drop column col2; (A GENERATED column) 50.41 sec alter table t_nv add column col3; (A virtual column) 0.10 sec -
DMLs with a virtual index or “functional index”:
-
INSERT
Insert of 500,000 rows Table 1 with functional index on column v_e
6 min 57.31 sec Table 2 with index on column e
6 min 33.09 sec Table 3 with index on column v_e
9 min 5.24 sec As shown in this example, for a table with indexed virtual columns, ts insertion times are significantly less than table 3, which materializes the value in the clustered index (primary key).
-
UPDATE
The following
UPDATE
statements were then performed on the 3 tables, and results were:mysql> update t set e=e+1;
Update time on table with 1,000,000 rows Update on table 1 with index on virtual column v_e
1 min 20.39 sec Update on table 2 with index on column e
52.26 sec Update on table 3 with index on materialized column v_e
1 min 2.52 sec As you can see,
UPDATE
statements on indexed virtual columns are more expensive. This demonstrates the additional MVCC costs associated with the operation (in addition to any operation associated with columne
) because 1) The old value forv_e
needs to be computed (for the UNDO log) and 2) The old and new values forv_e
will need to be UNDO logged. -
DELETE
Delete of 1,000,000 rows Delete all row with index on virtual column ‘v_e’ 21.52 sec Delete all row with index on ‘e’ 20.54 sec Delete on table 3 with index on materialized column v_e
32.09 sec The
DELETE
statements were faster on table withVIRTUAL
generated columns than those for table with theSTORED
generated columns. The table with virtual columns is apparently much smaller than that with materialized column. So the deletion operation is much faster.The
DELETE
operation will also require a little extra MVCC work if there are indexes on virtual columns because 1) The old value forv_e
needs to be computed (for the UNDO log) and 2) The old and new values forv_e
will need to be UNDO logged.So the
DELETE
statement is a little bit more expensive than when using a regular column, but much faster than those withSTORED
generated columns.
-
INSERT
-
SELECT Queries:
Of course, as expected, the table with
STORED
generated columns is much larger than the one withVIRTUAL
generated columns. And this is clearly shown with a quick table scan (after the initial run to bring the data into the buffer pool)select count(*) from t Table 1. with virtual column (t) 0.59 sec Table 2. without virtual column (t_no_v) 0.60 sec Table 3. with materialized column 1.30 sec As shown above, a table scan on the table with
STORED
generated columns took 3x longer than the scan on the table withVIRTUAL
generated columns.While the table with virtual columns and indexes remains small, it still takes advantage of having a materialized (secondary) index to facilitate efficient queries:
Query on char_length(a) on table with 1,000,000 rows Table 1 with virtual column and index on ‘char_length(a)’ 0.00 sec Table 2 without index on ‘char_length(a)’ 0.66 sec Table 3 with stored column and index on ‘char_length(a)’ 0.00 sec Without the “functional index” on the ‘
char_length(a)
‘ value, table 2 requires a full table scan to get the results.
Summary
The virtual column and functional index work is now officially in 5.7! This is a great feature that allows users to ADD
and DROP VIRTUAL
generated columns, along with adding optional secondary indexes on those columns, all done using ONLINE operations. As shown in the simple performance study above, the data materialized in such a way keeps the base table small (as it does not have duplicate copies in the InnoDB clustered/primary index) and thus making more efficient use of persistent storage space and memory, while at the same time providing vastly improved query performance!
That’s it for now. As always, THANK YOU for using MySQL!