MySQL :: JSON Labs Release: Effective Functional Indexes in InnoDB

In MySQL 5.7.6, we added a new feature called Generated Columns. In the initial work all Generated Columns were materialized, even virtual ones. This not only resulted in unnecessary disk space being used and disk I/O being done, but it also meant that any table alteration required that the full table be rebuilt. In the new MySQL 5.7.7 JSON Lab release, we have resolved all of these issues by implementing new features that not only allow users to create non-materialized virtual columns, but that also allow them to create indexes on them in InnoDB. The virtual column data can also be generated using functions, so to some degree, the “virtual index” can be viewed as a form of functional index or function-based index.

In this blog, we’ll explore some details about the virtual column and virtual index design, so as to give you a good idea of how they are being implemented within InnoDB.

Virtual Columns in InnoDB

Let’s first explore how virtual columns are now represented inside InnoDB. In short, virtual columns are no longer stored at all in the InnoDB user table or its clustered index, but they are still represented in the InnoDB system tables. Let’s look at this in more detail.

The virtual column is not stored within the InnoDB table

The virtual column is now truly “virtual” in InnoDB, which means that InnoDB does not store any data for that column within its clustered index (which is used as basic storage for the table data). Let’s look at an example.

mysql> CREATE TABLE t (a INT, b INT, c INT GENERATED ALWAYS AS(a+b), PRIMARY KEY(a));
Query OK, 0 rows affected (0.01 sec)

mysql> INSERT INTO t VALUES (11, 3, default);
Query OK, 1 row affected (0.00 sec)

mysql> CREATE TABLE t (a INT, b INT, c INT GENERATED ALWAYS AS(a+b), PRIMARY KEY(a));

Query OK, 0 rows affected (0.01 sec)

mysql> INSERT INTO t VALUES (11, 3, default);

Query OK, 1 row affected (0.00 sec)

Here, column ‘c’ is a virtual column. If we look at the physical data layout of this table in InnoDB, we can see that it only has 2 user columns—column ‘a’ and column ‘b’—along with the 2 standard InnoDB hidden/internal columns (DATA_TRX_ID and DATA_ROLL_PTR):

not-deleted PHYSICAL RECORD: n_fields 4; compact format; info bits 0
 0: len 4; hex 8000000b; asc     ;;           /* column 'a' */
 1: len 6; hex 00000000670b; asc     g ;;     /* InnoDB hidden column */
 2: len 7; hex a90000011d0110; asc        ;;  /* InnoDB hidden column */
 3: len 4; hex 80000003; asc     ;;           /* column 'b' */

not-deleted PHYSICAL RECORD: n_fields 4; compact format; info bits 0

0: len 4; hex 8000000b; asc ;; /* column 'a' */

1: len 6; hex 00000000670b; asc g ;; /* InnoDB hidden column */

2: len 7; hex a90000011d0110; asc ;; /* InnoDB hidden column */

3: len 4; hex 80000003; asc ;; /* column 'b' */

So the column ‘c’ is not stored within the InnoDB table and rows, but is instead calculated on the fly when you query the table.

mysql> SELECT * FROM t;
+----+------+------+
| a  | b    | c    |
+----+------+------+
| 11 |    3 |   14 |
+----+------+------+
1 row in set (0.00 sec)

mysql> SELECT * FROM t;

+----+------+------+

| a | b | c |

+----+------+------+

| 11 | 3 | 14 |

+----+------+------+

1 row in set (0.00 sec)

The virtual column’s metadata representation

Even though the virtual column itself is no longer stored within InnoDB, its metadata is. We need to do that in order to support the creation of secondary indexes on such columns.

The virtual column’s metadata information is stored in the InnoDB SYS_COLUMNS system table along with other columns, the only difference being that its ‘PRTYPE’ value has an additional DATA_VIRTUAL (8192) bit set on it.

mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_SYS_COLUMNS 
WHERE TABLE_ID IN (SELECT TABLE_ID FROM INFORMATION_SCHEMA.INNODB_SYS_TABLES WHERE NAME LIKE "t%");

+----------+------+-------+-------+--------+-----+
| TABLE_ID | NAME | POS   | MTYPE | PRTYPE | LEN |
+----------+------+-------+-------+--------+-----+
|       74 | a    |     0 |     6 |   1283 |   4 |
|       74 | b    |     1 |     6 |   1027 |   4 |
|       74 | c    | 65538 |     6 |   9219 |   4 |
+----------+------+-------+-------+--------+-----+
3 rows in set (0.01 sec)

mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_SYS_COLUMNS

WHERE TABLE_ID IN (SELECT TABLE_ID FROM INFORMATION_SCHEMA.INNODB_SYS_TABLES WHERE NAME LIKE "t%");

+----------+------+-------+-------+--------+-----+

+----------+------+-------+-------+--------+-----+

| 74 | a | 0 | 6 | 1283 | 4 |

| 74 | b | 1 | 6 | 1027 | 4 |

| 74 | c | 65538 | 6 | 9219 | 4 |

+----------+------+-------+-------+--------+-----+

3 rows in set (0.01 sec)

Notice that the virtual column ‘c’ is registered in the SYS_COLUMNS system table with its ‘PRTYPE’ having the DATA_VIRTUAL (8192) bit set. The ‘POS’ field is also special, as it encodes both the position in original table (the 3rd column) and its sequence as virtual column (the first virtual column).

In addition to the SYS_COLUMNS system table, we also added a new system table called SYS_VIRTUAL, in order to record whether a virtual column is generated based on some other columns (the base columns) or not. In above example, column ‘c’ is calculated based on column ‘a’ and column ‘b’:

mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_SYS_VIRTUAL;
+----------+-------+----------+
| TABLE_ID | POS   | BASE_POS |
+----------+-------+----------+
|       74 | 65538 |        0 |
|       74 | 65538 |        1 |
+----------+-------+----------+
2 rows in set (0.00 sec)

mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_SYS_VIRTUAL;

+----------+-------+----------+

| TABLE_ID | POS | BASE_POS |

+----------+-------+----------+

| 74 | 65538 | 0 |

| 74 | 65538 | 1 |

+----------+-------+----------+

2 rows in set (0.00 sec)

In the above example, the ‘POS’ column represents the virtual column’s ‘POS’ value (in this case, column ‘c’) in SYS_COLUMNS, and ‘BASE_POS’ represents the base column’s ‘POS’ value in SYS_COLUMNS (0 for column ‘a’ and 1 for column ‘b’). Currently, the “base column” can only consist of standard materialized columns and cannot be other Generated Columns.

Even though the virtual columns are added to system tables similarly to other standard columns, they are represented in a separate domain than normal columns for the in-memory metadata. In this way, little change is needed for InnoDB, as it largely continues to work as if the virtual column does not exist. But at the same time we can get the virtual column info whenever it’s actually needed. For example, the dict_table_t::cols structure still holds information on all of the normal materialized columns, while the new dict_table_t::v_cols structure holds information on all of the virtual or non-materialized columns.

With these designs, virtual columns can be added and dropped easily and without the need for a full table rebuild. This makes related table schema changes very simple and very fast:

mysql> ALTER TABLE t ADD new_col INT GENERATED ALWAYS AS (a - b) VIRTUAL;
Query OK, 0 rows affected (0.06 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> SELECT * FROM t;
+----+------+------+---------+
| a  | b    | c    | new_col |
+----+------+------+---------+
| 11 |    3 |   14 |       8 |
+----+------+------+---------+
1 row in set (0.01 sec)

mysql> ALTER TABLE t ADD new_col INT GENERATED ALWAYS AS (a - b) VIRTUAL;

Query OK, 0 rows affected (0.06 sec)

Records: 0 Duplicates: 0 Warnings: 0

mysql> SELECT * FROM t;

+----+------+------+---------+

| a | b | c | new_col |

+----+------+------+---------+

| 11 | 3 | 14 | 8 |

+----+------+------+---------+

1 row in set (0.01 sec)

Creating Indexes on Non-Materialized Virtual Columns

As we saw in the previous section, a virtual column is very flexible and can easily be added or dropped. However, since it is not stored within InnoDB’s clustered index, in order to query the value we need to fetch the base column data and then do the necessary calculation(s) for each possible qualified row. This makes the query somewhat slow and inefficient. There is a way to make the query as efficient as those of other normal columns though! We can now simply create secondary indexes on the virtual columns!

Once a secondary index has been created on the virtual column, the virtual column data is then essentially materialized and stored in the secondary index records. This means that the virtual column’s value does not need to be calculated when the virtual column is queried. Again, this makes them effectively a functional index or function-based index.

Virtual index creation

The index creation syntax is the same as that used when creating any other secondary index:

mysql> CREATE INDEX idx ON t(c);
Query OK, 1 row affected (0.08 sec)
Records: 1  Duplicates: 0  Warnings: 0

mysql> CREATE INDEX idx ON t(c);

Query OK, 1 row affected (0.08 sec)

Records: 1 Duplicates: 0 Warnings: 0

Now the new ‘idx’ index has been added to the SYS_INDEXES system table, and the virtual column ‘c’ has been added to SYS_FIELDS system table:

mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_SYS_INDEXES WHERE TABLE_ID IN (SELECT TABLE_ID FROM INFORMATION_SCHEMA.INNODB_SYS_TABLES WHERE NAME LIKE "t%");
+----------+---------+----------+------+----------+---------+-------+-----------------+
| INDEX_ID | NAME    | TABLE_ID | TYPE | N_FIELDS | PAGE_NO | SPACE | MERGE_THRESHOLD |
+----------+---------+----------+------+----------+---------+-------+-----------------+
|      123 | PRIMARY |       75 |    3 |        1 |       3 |    63 |              50 |
|      124 | idx     |       75 |  128 |        1 |       4 |    63 |              50 |
+----------+---------+----------+------+----------+---------+-------+-----------------+
2 rows in set (0.01 sec)

mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_SYS_FIELDS WHERE INDEX_ID = 124;
+----------+------+-----+
| INDEX_ID | NAME | POS |
+----------+------+-----+
|      124 | c    |   0 |
+----------+------+-----+
1 row in set (0.03 sec)

mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_SYS_INDEXES WHERE TABLE_ID IN (SELECT TABLE_ID FROM INFORMATION_SCHEMA.INNODB_SYS_TABLES WHERE NAME LIKE "t%");

+----------+---------+----------+------+----------+---------+-------+-----------------+

+----------+---------+----------+------+----------+---------+-------+-----------------+

| 123 | PRIMARY | 75 | 3 | 1 | 3 | 63 | 50 |

| 124 | idx | 75 | 128 | 1 | 4 | 63 | 50 |

+----------+---------+----------+------+----------+---------+-------+-----------------+

2 rows in set (0.01 sec)

mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_SYS_FIELDS WHERE INDEX_ID = 124;

+----------+------+-----+

| INDEX_ID | NAME | POS |

+----------+------+-----+

| 124 | c | 0 |

+----------+------+-----+

1 row in set (0.03 sec)

Since we store the virtual column just as other normal columns in InnoDB’s metadata system tables, virtual index metadata can be represented in the same ways as that of a normal index.

The difference with creating indexes on normal columns is that during the index creation time, if we identify that the indexed column is a virtual column, then its “base column(s)” are fetched and a callback function is used to access the base column(s) value before finally calling the specified generation function. Once we get the calculated value back from this callback function, then the value is piped to a sorter and is later used to instantiate the index record.

DML Statements

Since the virtual column data is now “materialized” through a secondary index, any DML (INSERT, UPDATE, DELETE) could have an effect on the index. The virtual column values are updated similarly as to other indexed column values. However, you cannot INSERT or UPDATE values for any virtual column directly. Instead, the INSERT and UPDATE operations for them are performed indirectly through any changes to the base columns. Let’s continue to use the previous example table we have in order to demonstrate this.

mysql> select * from t;
+----+------+------+---------+
| a  | b    | c    | new_col |
+----+------+------+---------+
| 11 | 3    | 14   | 8       |
+----+------+------+---------+
1 row in set (0.00 sec)

mysql> UPDATE t SET a = 20;
Query OK, 1 row affected (0.01 sec)
Rows matched: 1 Changed: 1 Warnings: 0

mysql> select * from t;
+----+------+------+---------+
| a  | b    | c    | new_col |
+----+------+------+---------+
| 20 | 3    | 23   | 17      |
+----+------+------+---------+
1 row in set (0.01 sec)

mysql> EXPLAIN SELECT c FROM t;
+----+-------------+-------+------------+-------+---------------+------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type  | possible_keys | key  | key_len | ref  | rows | filtered | Extra       |
+----+-------------+-------+------------+-------+---------------+------+---------+------+------+----------+-------------+
| 1  | SIMPLE      | t     | NULL       | index | NULL          | idx  | 5       | NULL | 1    | 100.00   | Using index |
+----+-------------+-------+------------+-------+---------------+------+---------+------+------+----------+-------------+
1 row in set, 1 warning (0.00 sec)

mysql> SELECT c FROM t;
+------+
| c    |
+------+
| 23   |
+------+
1 row in set (0.00 sec)

mysql> select * from t;

+----+------+------+---------+

| a | b | c | new_col |

+----+------+------+---------+

| 11 | 3 | 14 | 8 |

+----+------+------+---------+

1 row in set (0.00 sec)

mysql> UPDATE t SET a = 20;

Query OK, 1 row affected (0.01 sec)

Rows matched: 1 Changed: 1 Warnings: 0

mysql> select * from t;

+----+------+------+---------+

| a | b | c | new_col |

+----+------+------+---------+

| 20 | 3 | 23 | 17 |

+----+------+------+---------+

1 row in set (0.01 sec)

mysql> EXPLAIN SELECT c FROM t;

+----+-------------+-------+------------+-------+---------------+------+---------+------+------+----------+-------------+

+----+-------------+-------+------------+-------+---------------+------+---------+------+------+----------+-------------+

| 1 | SIMPLE | t | NULL | index | NULL | idx | 5 | NULL | 1 | 100.00 | Using index |

+----+-------------+-------+------------+-------+---------------+------+---------+------+------+----------+-------------+

1 row in set, 1 warning (0.00 sec)

mysql> SELECT c FROM t;

+------+

| c |

+------+

| 23 |

+------+

1 row in set (0.00 sec)

As you can see, the index value on column ‘c’ is updated when the base column ‘a’ is updated.

One thing to note here is that we do log virtual column DMLs so that such data changes may not need to be recalculated for MVCC, crash recovery, and/or UNDO operations. Of course, only operations on those indexed virtual columns are logged.

Queries Using A “Functional Index”

With a “functional index” on virtual columns, a user is now able to search qualified rows using both non-covered and covered scans. The “functional index” can be queried, and depending on the scenario (isolation level etc.), the clustered index can be subsequently consulted.

Queries on virtual columns also support MVCC, since we log the updates on the virtual column in the UNDO log. However, there are maximum index size limitations to consider—767 bytes for COMPACT/REDUNDANT ROW formats or 3072 bytes for COMPRESSED/DYNAMIC ROW formats. If the queried object has a longer length, then the value will have to be generated from the base column(s) on the fly.

The query would also support all isolation levels, which means that you can also place GAP locks on the indexed virtual columns in certain situations.

In addition, the “functional index” also supports prefix indexes as well as unique indexes. Here’s a full example that also incorporates the new JSON support:

mysql> create table employees(id bigint not null primary key auto_increment, info JSON);
Query OK, 0 rows affected (0.20 sec)

mysql> insert into employees (info) values ('{ "name": "Matt Lord", "age": 38, "Duties": { "Product Manager": ["stuff", "more stuff"]} }');
Query OK, 1 row affected (0.04 sec)

mysql> select jsn_valid(info) from employees;
+-----------------+
| jsn_valid(info) |
+-----------------+
|               1 |
+-----------------+
1 row in set (0.00 sec)

mysql> select id, jsn_extract(info, '$.name') from employees;
+----+-----------------------------+
| id | jsn_extract(info, '$.name') |
+----+-----------------------------+
|  1 | "Matt Lord"                 |
+----+-----------------------------+
1 row in set (0.00 sec)

mysql> alter table employees add name varchar(100) generated always as (jsn_extract(info, '$.name')) virtual;
Query OK, 0 rows affected (0.07 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> alter table employees add index (name);
Query OK, 1 row affected (0.51 sec)
Records: 1  Duplicates: 0  Warnings: 0

mysql> show create table employees\G
*************************** 1. row ***************************
       Table: employees
Create Table: CREATE TABLE `employees` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `info` json DEFAULT NULL,
  `name` varchar(100) GENERATED ALWAYS AS (jsn_extract(info, '$.name')) VIRTUAL,
  PRIMARY KEY (`id`),
  KEY `name` (`name`)
) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

mysql> explain format=JSON select id, name from employees where name = "Matt Lord"\G
*************************** 1. row ***************************
EXPLAIN: {
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "1.20"
    },
    "table": {
      "table_name": "employees",
      "access_type": "ref",
      "possible_keys": [
        "name"
      ],
      "key": "name",
      "used_key_parts": [
        "name"
      ],
      "key_length": "103",
      "ref": [
        "const"
      ],
      "rows_examined_per_scan": 1,
      "rows_produced_per_join": 1,
      "filtered": "100.00",
      "using_index": true,
      "cost_info": {
        "read_cost": "1.00",
        "eval_cost": "0.20",
        "prefix_cost": "1.20",
        "data_read_per_join": "128"
      },
      "used_columns": [
        "id",
        "info",
        "name"
      ]
    }
  }
}
1 row in set, 1 warning (0.00 sec)

mysql> create table employees(id bigint not null primary key auto_increment, info JSON);

Query OK, 0 rows affected (0.20 sec)

mysql> insert into employees (info) values ('{ "name": "Matt Lord", "age": 38, "Duties": { "Product Manager": ["stuff", "more stuff"]} }');

Query OK, 1 row affected (0.04 sec)

mysql> select jsn_valid(info) from employees;

+-----------------+

| jsn_valid(info) |

+-----------------+

| 1 |

+-----------------+

1 row in set (0.00 sec)

mysql> select id, jsn_extract(info, '$.name') from employees;

+----+-----------------------------+

| id | jsn_extract(info, '$.name') |

+----+-----------------------------+

| 1 | "Matt Lord" |

+----+-----------------------------+

1 row in set (0.00 sec)

mysql> alter table employees add name varchar(100) generated always as (jsn_extract(info, '$.name')) virtual;

Query OK, 0 rows affected (0.07 sec)

Records: 0 Duplicates: 0 Warnings: 0

mysql> alter table employees add index (name);

Query OK, 1 row affected (0.51 sec)

Records: 1 Duplicates: 0 Warnings: 0

mysql> show create table employees\G

*************************** 1. row ***************************

Table: employees

Create Table: CREATE TABLE `employees` (

`id` bigint(20) NOT NULL AUTO_INCREMENT,

`info` json DEFAULT NULL,

`name` varchar(100) GENERATED ALWAYS AS (jsn_extract(info, '$.name')) VIRTUAL,

PRIMARY KEY (`id`),

KEY `name` (`name`)

) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=latin1

1 row in set (0.00 sec)

mysql> explain format=JSON select id, name from employees where name = "Matt Lord"\G

*************************** 1. row ***************************

EXPLAIN: {

"query_block": {

"select_id": 1,

"cost_info": {

"query_cost": "1.20"

"table": {

"table_name": "employees",

"access_type": "ref",

"possible_keys": [

"name"

"key": "name",

"used_key_parts": [

"name"

"key_length": "103",

"ref": [

"const"

"rows_examined_per_scan": 1,

"rows_produced_per_join": 1,

"filtered": "100.00",

"using_index": true,

"cost_info": {

"read_cost": "1.00",

"eval_cost": "0.20",

"prefix_cost": "1.20",

"data_read_per_join": "128"

"used_columns": [

"id",

"info",

"name"

]

}

1 row in set, 1 warning (0.00 sec)

Limitations

There are currently some restrictions around virtual indexes, some of which will be lifted later:

Primary Keys cannot contain any virtual columns
You cannot create an index on a mix of virtual and non-virtual columns
You cannot create a spatial or fulltext index on virtual columns (this limitation will be lifted later)
A virtual index cannot be used as a foreign key

Summary

In summary, the new implementation of virtual columns, virtual indexes, and effective “functional indexes” allow users to add/drop new virtual columns quickly and still enjoy the possibility of efficient queries by allowing secondary indexes on such columns. This makes it an ideal solution for large TEXT/JSON fields and other non-relational data indexing, thus facilitating the efficient storage, manipulation, and querying of such data.

Please let us know what you think of these new features! We’d love to hear your feedback on what else you’d like to see related to Generated Columns and our wider JSON support. If you encounter any problems with the new features, please let us know here in the comments, open a bug report at bugs.mysql.com, or open a support ticket.

Thank you for using MySQL!