MySQL Blog Archive
For the latest blogs go to blogs.oracle.com/mysql
Indexing JSON documents via Virtual Columns

In a previous post I took the new JSON datatype and functions for a test drive, using sample data provided by SF OpenData. Today I will extend that example to include indexing.

Introducing Generated Columns

MySQL 5.7 introduces a new feature called generated columns. To start with a simple example:

Here my_integer_plus_one is what is known as a virtual generated column. It exists in meta data only, and does not require any additional storage in memory or in disk for row data. You can think of it as similar to a view or a macro that is generated on demand.

The advantage of creating virtual columns, is that they support indexes. To extend the above example:

Generated columns allow MySQL to effectively index on an expression or what is known as a functional index. The index itself is now material and it does exist (as all other indexes do).

Indexing JSON

As the MySQL manual notes:

JSON columns cannot be indexed. You can work around this restriction by creating an index on a generated column that extracts a scalar value from the JSON column.

Lets do exactly that by reusing the example from my previous blog post:

This allows me to then repeat this query by using either the virtual column, or the original expression:

It is important to note that as demonstrated in EXPLAIN, both variations of the above query will use an index. That is to say that the optimizer will look for compatible[1] virtual columns with indexes that match the expression in JSON queries.

[1] Note that use of JSON_UNQUOTE is required in the virtual column expression, since the return type of JSON_EXTRACT (and shorthand operator ->) needs to be cast to a string.

Stored Generated Columns

For completeness, it is important to point out that the examples above use the default VIRTUAL variety of generated columns. There also exists a second variety of generated columns that can be STORED in the actual rows of tables:

STORED generated columns are automatically maintained and kept up to date as modifications to tables are made. It is recommended that you use the VIRTUAL variety of virtual columns unless:

  1. You need to index a PRIMARY KEY
  2. You need a FULLTEXT or RTREE index (instead of regular BTREE)
  3. The virtual column is computationally expensive, and you anticipate large amounts of scanning (i.e. not using indexes)

In the first two cases the STORED variety of generated columns is required, and in the third case it is likely to lead to just faster queries.

Conclusion

Hopefully this demonstrates how virtual columns can be used to add indexes to JSON data. Virtual columns can index other data besides JSON, but this is obviously one of the main use cases. Having the optimizer also understand how to use indexes from virtual columns makes the usage quite flexible too!