MySQL :: Indexing JSON documents via Virtual Columns

In a previous post I took the new JSON datatype and functions for a test drive, using sample data provided by SF OpenData. Today I will extend that example to include indexing.

Introducing Generated Columns

MySQL 5.7 introduces a new feature called generated columns. To start with a simple example:

CREATE TABLE t1 (
 id INT NOT NULL PRIMARY KEY auto_increment,
 my_integer INT,
 my_integer_plus_one INT AS (my_integer+1)
);

CREATE TABLE t1 (

id INT NOT NULL PRIMARY KEY auto_increment,

my_integer INT,

my_integer_plus_one INT AS (my_integer+1)

);

Here my_integer_plus_one is what is known as a virtual generated column. It exists in meta data only, and does not require any additional storage in memory or in disk for row data. You can think of it as similar to a view or a macro that is generated on demand.

The advantage of creating virtual columns, is that they support indexes. To extend the above example:

CREATE TABLE t1 (
 id INT NOT NULL PRIMARY KEY auto_increment,
 my_integer INT,
 my_integer_plus_one INT AS (my_integer+1),
 INDEX (my_integer_plus_one)
);

CREATE TABLE t1 (

id INT NOT NULL PRIMARY KEY auto_increment,

my_integer INT,

my_integer_plus_one INT AS (my_integer+1),

INDEX (my_integer_plus_one)

);

Generated columns allow MySQL to effectively index on an expression or what is known as a functional index. The index itself is now material and it does exist (as all other indexes do).

Indexing JSON

As the MySQL manual notes:

JSON columns cannot be indexed. You can work around this restriction by creating an index on a generated column that extracts a scalar value from the JSON column.

Lets do exactly that by reusing the example from my previous blog post:

# My original query
SELECT * FROM features 
WHERE feature->"$.properties.STREET" = 'MARKET'

# Add a virtual column
# This operation is instant, since it only changes meta data
ALTER TABLE features ADD feature_street VARCHAR(30) AS (JSON_UNQUOTE(feature->"$.properties.STREET"));
Query OK, 0 rows affected (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 0

# Add an index
# This is an online operation, but may take some time
# for a large amount of data

ALTER TABLE features ADD INDEX (feature_street);
Query OK, 0 rows affected (9.41 sec)
Records: 0  Duplicates: 0  Warnings: 0

# My original query

SELECT * FROM features

WHERE feature->"$.properties.STREET" = 'MARKET'

# Add a virtual column

# This operation is instant, since it only changes meta data

ALTER TABLE features ADD feature_street VARCHAR(30) AS (JSON_UNQUOTE(feature->"$.properties.STREET"));

Query OK, 0 rows affected (0.01 sec)

Records: 0 Duplicates: 0 Warnings: 0

# Add an index

# This is an online operation, but may take some time

# for a large amount of data

ALTER TABLE features ADD INDEX (feature_street);

Query OK, 0 rows affected (9.41 sec)

Records: 0 Duplicates: 0 Warnings: 0

This allows me to then repeat this query by using either the virtual column, or the original expression:

# Using the JSON extract shorthand operator
EXPLAIN SELECT * FROM features  WHERE feature->"$.properties.STREET" = 'MARKET'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: features
   partitions: NULL
         type: ref
possible_keys: feature_street
          key: feature_street
      key_len: 33
          ref: const
         rows: 808
     filtered: 100.00
        Extra: NULL
1 row in set, 1 warning (0.00 sec)

# Using the virtual column
mysql> EXPLAIN SELECT * FROM features  WHERE feature_street = 'MARKET'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: features
   partitions: NULL
         type: ref
possible_keys: feature_street
          key: feature_street
      key_len: 33
          ref: const
         rows: 808
     filtered: 100.00
        Extra: NULL
1 row in set, 1 warning (0.00 sec)

# Using the JSON extract shorthand operator

EXPLAIN SELECT * FROM features WHERE feature->"$.properties.STREET" = 'MARKET'\G

*************************** 1. row ***************************

id: 1

select_type: SIMPLE

table: features

partitions: NULL

type: ref

possible_keys: feature_street

key: feature_street

key_len: 33

ref: const

rows: 808

filtered: 100.00

Extra: NULL

1 row in set, 1 warning (0.00 sec)

# Using the virtual column

mysql> EXPLAIN SELECT * FROM features WHERE feature_street = 'MARKET'\G

*************************** 1. row ***************************

id: 1

select_type: SIMPLE

table: features

partitions: NULL

type: ref

possible_keys: feature_street

key: feature_street

key_len: 33

ref: const

rows: 808

filtered: 100.00

Extra: NULL

1 row in set, 1 warning (0.00 sec)

It is important to note that as demonstrated in EXPLAIN, both variations of the above query will use an index. That is to say that the optimizer will look for compatible[1] virtual columns with indexes that match the expression in JSON queries.

[1] Note that use of JSON_UNQUOTE is required in the virtual column expression, since the return type of JSON_EXTRACT (and shorthand operator ->) needs to be cast to a string.

Stored Generated Columns

For completeness, it is important to point out that the examples above use the default VIRTUAL variety of generated columns. There also exists a second variety of generated columns that can be STORED in the actual rows of tables:

ALTER TABLE features ADD feature_type VARCHAR(30) 
AS (JSON_UNQUOTE(feature->"$.type")) STORED;

1 2	ALTER TABLE features ADD feature_type VARCHAR(30) AS (JSON_UNQUOTE(feature->"$.type")) STORED;

STORED generated columns are automatically maintained and kept up to date as modifications to tables are made. It is recommended that you use the VIRTUAL variety of virtual columns unless:

You need to index a PRIMARY KEY
You need a FULLTEXT or RTREE index (instead of regular BTREE)
The virtual column is computationally expensive, and you anticipate large amounts of scanning (i.e. not using indexes)

In the first two cases the STORED variety of generated columns is required, and in the third case it is likely to lead to just faster queries.

Conclusion

Hopefully this demonstrates how virtual columns can be used to add indexes to JSON data. Virtual columns can index other data besides JSON, but this is obviously one of the main use cases. Having the optimizer also understand how to use indexes from virtual columns makes the usage quite flexible too!