MySQL Blog Archive
For the latest blogs go to blogs.oracle.com/mysql
JSON Labs Release: Effective Functional Indexes in InnoDB

In MySQL 5.7.6, we added a new feature called Generated Columns. In the initial work all Generated Columns were materialized, even virtual ones. This not only resulted in unnecessary disk space being used and disk I/O being done, but it also meant that any table alteration required that the full table be rebuilt. In the new MySQL 5.7.7 JSON Lab release, we have resolved all of these issues by implementing new features that not only allow users to create non-materialized virtual columns, but that also allow them to create indexes on them in InnoDB. The virtual column data can also be generated using functions, so to some degree, the “virtual index” can be viewed as a form of functional index or function-based index.

In this blog, we’ll explore some details about the virtual column and virtual index design, so as to give you a good idea of how they are being implemented within InnoDB.

Virtual Columns in InnoDB

Let’s first explore how virtual columns are now represented inside InnoDB. In short, virtual columns are no longer stored at all in the InnoDB user table or its clustered index, but they are still represented in the InnoDB system tables. Let’s look at this in more detail.

  1. The virtual column is not stored within the InnoDB table

    The virtual column is now truly “virtual” in InnoDB, which means that InnoDB does not store any data for that column within its clustered index (which is used as basic storage for the table data). Let’s look at an example.

    Here, column ‘c’ is a virtual column. If we look at the physical data layout of this table in InnoDB, we can see that it only has 2 user columns—column ‘a’ and column ‘b’—along with the 2 standard InnoDB hidden/internal columns (DATA_TRX_ID and DATA_ROLL_PTR):

    So the column ‘c’ is not stored within the InnoDB table and rows, but is instead calculated on the fly when you query the table.

  2. The virtual column’s metadata representation

    Even though the virtual column itself is no longer stored within InnoDB, its metadata is. We need to do that in order to support the creation of secondary indexes on such columns.

    The virtual column’s metadata information is stored in the InnoDB SYS_COLUMNS system table along with other columns, the only difference being that its ‘PRTYPE’ value has an additional DATA_VIRTUAL (8192) bit set on it.

    Notice that the virtual column ‘c’ is registered in the SYS_COLUMNS system table with its ‘PRTYPE’ having the DATA_VIRTUAL (8192) bit set. The ‘POS’ field is also special, as it encodes both the position in original table (the 3rd column) and its sequence as virtual column (the first virtual column).

    In addition to the SYS_COLUMNS system table, we also added a new system table called SYS_VIRTUAL, in order to record whether a virtual column is generated based on some other columns (the base columns) or not. In above example, column ‘c’ is calculated based on column ‘a’ and column ‘b’:

    In the above example, the ‘POS’ column represents the virtual column’s ‘POS’ value (in this case, column ‘c’) in SYS_COLUMNS, and ‘BASE_POS’ represents the base column’s ‘POS’ value in SYS_COLUMNS (0 for column ‘a’ and 1 for column ‘b’). Currently, the “base column” can only consist of standard materialized columns and cannot be other Generated Columns.

    Even though the virtual columns are added to system tables similarly to other standard columns, they are represented in a separate domain than normal columns for the in-memory metadata. In this way, little change is needed for InnoDB, as it largely continues to work as if the virtual column does not exist. But at the same time we can get the virtual column info whenever it’s actually needed. For example, the dict_table_t::cols structure still holds information on all of the normal materialized columns, while the new dict_table_t::v_cols structure holds information on all of the virtual or non-materialized columns.

With these designs, virtual columns can be added and dropped easily and without the need for a full table rebuild. This makes related table schema changes very simple and very fast:

Creating Indexes on Non-Materialized Virtual Columns

As we saw in the previous section, a virtual column is very flexible and can easily be added or dropped. However, since it is not stored within InnoDB’s clustered index, in order to query the value we need to fetch the base column data and then do the necessary calculation(s) for each possible qualified row. This makes the query somewhat slow and inefficient. There is a way to make the query as efficient as those of other normal columns though! We can now simply create secondary indexes on the virtual columns!

Once a secondary index has been created on the virtual column, the virtual column data is then essentially materialized and stored in the secondary index records. This means that the virtual column’s value does not need to be calculated when the virtual column is queried. Again, this makes them effectively a functional index or function-based index.

  1. Virtual index creation

    The index creation syntax is the same as that used when creating any other secondary index:

    Now the new ‘idx’ index has been added to the SYS_INDEXES system table, and the virtual column ‘c’ has been added to SYS_FIELDS system table:

    Since we store the virtual column just as other normal columns in InnoDB’s metadata system tables, virtual index metadata can be represented in the same ways as that of a normal index.

    The difference with creating indexes on normal columns is that during the index creation time, if we identify that the indexed column is a virtual column, then its “base column(s)” are fetched and a callback function is used to access the base column(s) value before finally calling the specified generation function. Once we get the calculated value back from this callback function, then the value is piped to a sorter and is later used to instantiate the index record.

  2. DML Statements

    Since the virtual column data is now “materialized” through a secondary index, any DML (INSERT, UPDATE, DELETE) could have an effect on the index. The virtual column values are updated similarly as to other indexed column values. However, you cannot INSERT or UPDATE values for any virtual column directly. Instead, the INSERT and UPDATE operations for them are performed indirectly through any changes to the base columns. Let’s continue to use the previous example table we have in order to demonstrate this.

    As you can see, the index value on column ‘c’ is updated when the base column ‘a’ is updated.

    One thing to note here is that we do log virtual column DMLs so that such data changes may not need to be recalculated for MVCC, crash recovery, and/or UNDO operations. Of course, only operations on those indexed virtual columns are logged.

Queries Using A “Functional Index”

With a “functional index” on virtual columns, a user is now able to search qualified rows using both non-covered and covered scans. The “functional index” can be queried, and depending on the scenario (isolation level etc.), the clustered index can be subsequently consulted.

Queries on virtual columns also support MVCC, since we log the updates on the virtual column in the UNDO log. However, there are maximum index size limitations to consider—767 bytes for COMPACT/REDUNDANT ROW formats or 3072 bytes for COMPRESSED/DYNAMIC ROW formats. If the queried object has a longer length, then the value will have to be generated from the base column(s) on the fly.

The query would also support all isolation levels, which means that you can also place GAP locks on the indexed virtual columns in certain situations.

In addition, the “functional index” also supports prefix indexes as well as unique indexes. Here’s a full example that also incorporates the new JSON support:

Limitations

There are currently some restrictions around virtual indexes, some of which will be lifted later:

  1. Primary Keys cannot contain any virtual columns
  2. You cannot create an index on a mix of virtual and non-virtual columns
  3. You cannot create a spatial or fulltext index on virtual columns (this limitation will be lifted later)
  4. A virtual index cannot be used as a foreign key

Summary

In summary, the new implementation of virtual columns, virtual indexes, and effective “functional indexes” allow users to add/drop new virtual columns quickly and still enjoy the possibility of efficient queries by allowing secondary indexes on such columns. This makes it an ideal solution for large TEXT/JSON fields and other non-relational data indexing, thus facilitating the efficient storage, manipulation, and querying of such data.

Please let us know what you think of these new features! We’d love to hear your feedback on what else you’d like to see related to Generated Columns and our wider JSON support. If you encounter any problems with the new features, please let us know here in the comments, open a bug report at bugs.mysql.com, or open a support ticket.

Thank you for using MySQL!