In a previous post I took the new JSON datatype and functions for a test drive, using sample data provided by SF OpenData. Today I will extend that example to include indexing.
Introducing Generated Columns
MySQL 5.7 introduces a new feature called generated columns. To start with a simple example:
1
2
3
4
5
|
CREATE TABLE t1 ( id INT NOT NULL PRIMARY KEY auto_increment, my_integer INT, my_integer_plus_one INT AS (my_integer+1) ); |
Here my_integer_plus_one is what is known as a virtual generated column. It exists in meta data only, and does not require any additional storage in memory or in disk for row data. You can think of it as similar to a view or a macro that is generated on demand.
The advantage of creating virtual columns, is that they support indexes. To extend the above example:
1
2
3
4
5
6
|
CREATE TABLE t1 ( id INT NOT NULL PRIMARY KEY auto_increment, my_integer INT, my_integer_plus_one INT AS (my_integer+1), INDEX (my_integer_plus_one) ); |
Generated columns allow MySQL to effectively index on an expression or what is known as a functional index. The index itself is now material and it does exist (as all other indexes do).
Indexing JSON
As the MySQL manual notes:
JSON columns cannot be indexed. You can work around this restriction by creating an index on a generated column that extracts a scalar value from the JSON column.
Lets do exactly that by reusing the example from my previous blog post:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
|
# My original query SELECT * FROM features WHERE feature->"$.properties.STREET" = 'MARKET' # Add a virtual column # This operation is instant, since it only changes meta data ALTER TABLE features ADD feature_street VARCHAR(30) AS (JSON_UNQUOTE(feature->"$.properties.STREET")); Query OK, 0 rows affected (0.01 sec) Records: 0 Duplicates: 0 Warnings: 0 # Add an index # This is an online operation, but may take some time # for a large amount of data ALTER TABLE features ADD INDEX (feature_street); Query OK, 0 rows affected (9.41 sec) Records: 0 Duplicates: 0 Warnings: 0 |
This allows me to then repeat this query by using either the virtual column, or the original expression:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
|
# Using the JSON extract shorthand operator EXPLAIN SELECT * FROM features WHERE feature->"$.properties.STREET" = 'MARKET'\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: features partitions: NULL type: ref possible_keys: feature_street key: feature_street key_len: 33 ref: const rows: 808 filtered: 100.00 Extra: NULL 1 row in set, 1 warning (0.00 sec) # Using the virtual column mysql> EXPLAIN SELECT * FROM features WHERE feature_street = 'MARKET'\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: features partitions: NULL type: ref possible_keys: feature_street key: feature_street key_len: 33 ref: const rows: 808 filtered: 100.00 Extra: NULL 1 row in set, 1 warning (0.00 sec) |
It is important to note that as demonstrated in EXPLAIN, both variations of the above query will use an index. That is to say that the optimizer will look for compatible[1] virtual columns with indexes that match the expression in JSON queries.
[1] Note that use of JSON_UNQUOTE is required in the virtual column expression, since the return type of JSON_EXTRACT (and shorthand operator ->) needs to be cast to a string.
Stored Generated Columns
For completeness, it is important to point out that the examples above use the default VIRTUAL variety of generated columns. There also exists a second variety of generated columns that can be STORED in the actual rows of tables:
1
2
|
ALTER TABLE features ADD feature_type VARCHAR(30) AS (JSON_UNQUOTE(feature->"$.type")) STORED; |
STORED generated columns are automatically maintained and kept up to date as modifications to tables are made. It is recommended that you use the VIRTUAL variety of virtual columns unless:
- You need to index a PRIMARY KEY
- You need a FULLTEXT or RTREE index (instead of regular BTREE)
- The virtual column is computationally expensive, and you anticipate large amounts of scanning (i.e. not using indexes)
In the first two cases the STORED variety of generated columns is required, and in the third case it is likely to lead to just faster queries.
Conclusion
Hopefully this demonstrates how virtual columns can be used to add indexes to JSON data. Virtual columns can index other data besides JSON, but this is obviously one of the main use cases. Having the optimizer also understand how to use indexes from virtual columns makes the usage quite flexible too!