WL#1075: Add support for functional indexes

Affects: Server-8.0   —   Status: Complete   —   Priority: Very High

This worklog adds the possibility to create functional indexes in MySQL. The funtionality will be implemented as indexed hidden generated columns, so the worklog will re-use a good amount of existing functionality in the server.

Example usage

 CREATE TABLE t1 (col1 INT, col2 INT, INDEX func_index ((ABS(col1))));
 CREATE INDEX idx1 ON t1 ((col1 + col2));
 CREATE INDEX idx2 ON t1 ((col1 + col2), (col1 - col2), col1);
 ALTER TABLE t1 ADD INDEX ((col1 * 40));

Terminology

Key part: An index/key consists of one or more key parts. For instance, "CREATE INDEX idx1 ON t1 (col1, col2);" creates one index with two key parts. The key part is a reference to a column in the table, and for STRING/BINARY/TEXT/BLOB columns it is possible to only prefix a part of the column by specifying a prefix in parentheses after the column name; "CREATE INDEX idx1 ON t1 (col1(12));"

Functional key part: A key part that is an expression, and not a column. For instance, "CREATE INDEX idx1 ON t1 (col1, (col2 + 2));" creates one index where the first key part is a "normal" key part and the second key part (col2 + 2) is a functional key part. Note that it is not possible to specify a prefix length for functional key parts like it is for regular key parts.

Note 1) Functional indexes can not be defined as primary keys. Primary keys require the generated column to be stored, but we are only creating virtual generated columns in this worklog as it reduces the scope and implications.

Note 2) Each functional key part will take from the total number of columns limit.

F-1: The parser MUST accept the new key part rule "'(' expr ')' [ASC|DESC]" in CREATE INDEX, CREATE TABLE and ALTER TABLE.

F-2: For each functional key part, the server MUST generate a hidden virtual generated column.

F-3: The hidden virtual generated column MUST NOT be visible to the user (including 'SELECT *', 'SHOW CREATE TABLE', 'EXPLAIN', insert without column list etc).

F-4: The hidden virtual generated column SHOULD have a deterministic name for debugging purposes, while keeping the possibilities for name collisions low.

F-5: "SHOW CREATE TABLE" MUST NOT show the generated column, but instead display the functional index specified by the user.

F-6: If a hidden generated column is explicitly specified in a SELECT statement, the server MUST return the error ER_BAD_FIELD_ERROR.

F-7: If a user executes a INSERT without a column list, the server MUST NOT expect a value for any hidden generated column.

F-8: If a user executes a INSERT with a column list and one of the column names matches one of the hidden generated columns, the server MUST return the error ER_BAD_FIELD_ERROR.

F-9: In the extended EXPLAIN output, the server MUST NOT print the name of the hidden generated column, but instead print the expression of the functional key part(s).

F-10: If a user tries to add a column to an existing table where the name of the new column collides with the name of a hidden generated column, the server MUST return the error ER_INTERNAL_ERROR with the message "Internal error: The column name '...' is already in use by a hidden column".

F-11: The system view INFORMATION_SCHEMA.COLUMNS MUST NOT display any of the hidden generated columns to the user.

F-12: The system view INFORMATION_SCHEMA.STATISTICS MUST print out the expression for any functional key part in the new column "EXPRESSION". For all non-functional key parts, this column MUST be SQL NULL.

F-13: The system view INFORMATION_SCHEMA.STATISTICS MUST print out SQL NULL in the column "COLUMN_NAME" for any functional key part.

F-14: "SHOW KEYS FROM tbl" MUST print out the expression for any functional key parts.

F-15: If a user tries to drop a column with the same name as a hidden generated column, the server MUST return the error ER_CANT_DROP_FIELD_OR_KEY.

F-16: A functional index CAN NOT be a primary key.

F-17: When executing "CREATE TABLE t1 LIKE t2;" and t2 contains a functional index, table t1 MUST contain the same functional index.

F-18: A functional index SHOULD support the same "features" as normal indexes (UNIQUE, ASC/DESC etc).

F-19: When renaming a functional index, the server SHOULD rename any hidden virtual generated columns for this functional index.

F-20: Only functions suitable for generated columns MUST be allowed as a key part expression.

F-21: A functional index MUST NOT contain any stored procedures, stored functions or user defined functions (this is a restriction from generated columns).

F-22: If a user tries to define a functional index for which the expression is a column name, the server MUST return the error ER_FUNCTIONAL_INDEX_ON_FIELD. The error SHOULD indicate that a regular index should be used instead.

Replication requirements

F-23: The hidden generated column column SHOULD NOT be included in the binlog for any modes of binlog_row_image.

F-24: When the only unique NOT NULL index is a functional index, the binary log MUST contain a full before-image regardless of binlog_row_image.

F-25: When computing the writeset for a row, it MUST add hashes also for functional unique indexes.

F-26: When slave_rows_search_algorithms=INDEX_SCAN, it SHOULD NOT try to use a functional index.

Contents


Syntax changes

In order to support creating functional indexes, we add a new parser rule key_part_expression:

 key_part_expression:
     key_part
   | '(' expr ')' opt_ordering_direction
   ;

Examples:

 CREATE INDEX idx1 ON employees ((SUBSTRING(name, 1, 50)));
 ALTER TABLE employees ADD INDEX salary_idx (salary, (SQRT(salary)));
 CREATE TABLE tbl1 (col1 INT, INDEX idx (col1, (col1 + col1)));

Having parentheses around the expression is chosen so we avoid changing existing behavior. If we had chosen not to use parentheses, there would be an ambiguity between "col(prefix_length)" and "function(argument)". Expressions aren't supported and thus not allowed in foreign key specifications. Thus a separate rule is introduced.

Note that while regular key parts supports a prefix length specification, this is not possible for functional key parts. The solution is to use CAST or SUBSTRING (see notes further down on CAST). In order for a functional index containing the SUBSTRING function to be used in a query, there WHERE clause must contain SUBSTRING with the same arguments;

 CREATE TABLE tbl (col1 LONGTEXT, INDEX idx1 ((SUBSTRING(col1, 1, 10))));
 SELECT * FROM tbl WHERE SUBSTRING(col1, 1, 9) = "123456789";
 SELECT * FROM tbl WHERE SUBSTRING(col1, 1, 10) = "1234567890";

In the above example, only the last query will be able to use the index since that is the only query where the arguments to SUBSTRING match the index specification.

ASC and DESC specification (ascending/descending) is respected and works for functional key parts just like regular key parts.

Adding a Create_field wrapper

Consider the following statement:

 CREATE TABLE t1 (
   col1 VARCHAR(10),
   col2 VARCHAR(10),
   INDEX idx (CONCAT(col1, col2))
 );

This statement would translate into creating a hidden virtual generated column with the definition "colname VARCHAR(20) AS (CONCAT(col1, col2))", and then adding an index for the generated column. The challenge here is to find the correct data type and length for the generated column. In order to do that, we have to resolve the function "CONCAT". Resolving it would require us to have the fields col1 and col2 available, but during CREATE/ALTER table we do not have access to Fields (only Create_fields).

One way of getting around this, is to create a class that inherits from Field and holds a Create_field internally. Before resolving the functional index expression, we traverse the Item tree and replaces all the field references in Item_field's with this wrapper class. That ensures that when resolving the function "CONCAT", we will get the correct character set, length etc from the Create_field wrapper instead of the non-existing field.

Unification with CREATE TABLE AS SELECT

In order to create the hidden generated column for a functional key part, we need to know what data type the column should have, which length, signedness etc. We already have code that deduces this information from an expression for CREATE TABLE AS SELECT. So in order to be consistent, we refactor and reuse that code for both CREATE TABLE AS SELECT and functional indexes. In particular, we will refactor out parts of "create_table_from_items" in sql_insert.cc into separate function(s). For instance, consider this statement:

 CREATE TABLE t1 (
   col1 INT UNSIGNED,
   INDEX idx (ABS(col1))
 );

We want the definition of the hidden generated column to be exactly the same as the column definition for column "t2.col1" from the following CREATE TABLE:

 CREATE TABLE t1 (
   col1 INT UNSIGNED
 );
 CREATE TABLE t2 AS SELECT ABS(col1) AS col1 FROM t1;

Errors/warnings

Since we are using generated columns to implement functional indexes, any errors regarding the functional index will be thrown by the generated column. Thus, many of the errors will say something like "Data truncated for column 'col_name' at row 42". The column name will be the name of the hidden generated column, which we do not want to expose. Instead, we want the error message to say something like "Data truncated for functional index 'index_name' at row 42".

In order to "convert" these error messages, we insert a error handler "Functional_index_error_handler" that traps the various error messages that may expose the hidden generated column name, and report a more appropriate error instead. We recognize that this approach may be difficult to maintain as new error messages are added, but as of now we do not have any other good solution.

SHOW CREATE TABLE

SHOW CREATE TABLE must not show the hidden generated column that is created for the functional key part, but instead display print the functional key part expression specified by the user. Since the class Generated_column already has a function "print_expr" for this purpose, we will make use of that to print something like this when executing "SHOW CREATE TABLE":

 CREATE TABLE `t1` (
  `col1` int(11) DEFAULT NULL,
  KEY `idx1` ((`col1` + `col1`))
 ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

SELECT

Executing a "SELECT *" against a table that has one or more functional indexes will not show the hidden generated column. If a user specifies a column name in "SELECT" that matches the name of a hidden generated column, the server will return the error ER_BAD_FIELD_ERROR.

INSERT INTO t1 VALUES

If a user executes the command "INSERT INTO t1 VALUES ..." without a column list, the server will not expect a value for any hidden virtual generated column. If a user do specify a column list and one of the column names matches a hidden generated column, the server will report ER_BAD_FIELD_ERROR.

EXPLAIN

EXPLAIN should not print the name of the hidden generated column, but instead print the functional key part expression:

 CREATE TABLE t1 (col1 INT);
 CREATE INDEX idx1 ON t1 (col1 + col1);
 EXPLAIN SELECT * FROM t1 WHERE col1 + col1 = 2;
 SHOW WARNINGS;
 /* select#1 */ select `test`.`t1`.`col1` AS `col1` from `test`.`t1` where ((`test`.`t1`.`col1` + `test`.`t1`.`col1`) = 2)

INFORMATION SCHEMA VIEWS

INFORMATION_SCHEMA.STATISTICS

The system view INFORMATION_SCHEMA.STATISTICS provides information about table indexes. There is one row for each key part, and we will do the following modification for this view:

  1. Add a new column "EXPRESSION" that is set to NULL for non-functional key parts. For functional key parts, we will print the expression.
  2. For functional key parts, the field COLUMN_NAME will be set to null.

INFORMATION_SCHEMA.KEY_COLUMN_USAGE

The system view INFORMATION_SCHEMA.KEY_COLUMN_USAGE provides information about which key columns that have constraints. This view will remain unchanged, and functional indexes will not be listed here. The reasoning is that this table provides information about columns, and a functional index does not represent a column per se.

INFORMATION_SCHEMA.TABLE_CONSTRAINTS

One row will be added to the system view INFORMATION_SCHEMA.TABLE_CONSTRAINTS for each unique functional index. No other modifications is done to this table.

SHOW KEYS and SHOW INDEX

If a user executes the command "SHOW KEYS FROM tbl;" or "SHOW INDEX FROM tbl;", the result set will contain a new column with the name "Expression". This column will contain the indexed expression for any functional key parts. For non-functional key parts, this column will contain SQL NULL. Also, the column "Column_name" will contain SQL NULL for functional key parts.

Naming the hidden virtual generated column

Even though the hidden virtual generated column is auto-generated, it should have a deterministic name. This will make debugging/support easier, and it would minimize any problems regarding replication (different column names on master/slave).

The column name will be a md5 of index name and key part number: md5(index name + key part number)

By using MD5 we will stay within the 64 character naming limit for columns while keeping the probability of name collisions very low.

Replication

As this worklog mainly re-uses existing functionality in the server (indexed generated columns) that works in a replication setup, functional indexes is also expected to work equally well in a replication setup:

  • When a table is created with a functional index on the master, the same index and hidden generated column should exist on the slave with the same name.
  • When a functional index is renamed on a master, the index name and the hidden generated column name should be renamed on both the master and the slave.

Usage of CAST in functional index

One of the main motivations behind this worklog is indexing of JSON values. It seems obvious that this syntax would work well:

 CREATE TABLE employees (data JSON, INDEX ((data->>"$.name")));

However, this fails due to the following reasons;

  1. The operator "->>" translates into JSON_UNQUOTE(JSON_EXTRACT(...))
  2. JSON_UNQUOTE returns a value with data type LONGTEXT, and the hidden generated column will thus get the data type LONGTEXT
  3. MySQL cannot index LONGTEXT columns without specifying a prefix length on the key part.
  4. In the functional index syntax, there is no possibility to specify a prefix length.

The workaround (or, proper solution) is to use the CAST function:

 CREATE TABLE employees (data JSON, INDEX ((CAST(data->>"$.name" AS CHAR(30)))));

This will make the hidden generated column get the data type VARCHAR(30), which can be indexed. But this gives us a new issue when trying to use the index:

  1. CAST(...) returns a string with the collation utf8mb4_0900_ai_ci (server default collation)
  2. JSON_UNQUOTE(...) returns a string with the collation utf8mb4_bin (hard coded)

This gives us a collation mismatch between the indexed expression and the expression in the WHERE clause in the following scenario:

 CREATE TABLE employees (data JSON, INDEX ((CAST(data->>"$.name" AS CHAR(30)))));
 SELECT * FROM employees WHERE data->>"$.name" = "James";

Thus, the index will not be used. It seems pretty obvious that this won't work since the expression in the query and the index is different, but we do support stripping of CAST with indexed generated columns:

 CREATE TABLE employees (
   data JSON
 , generated_col VARCHAR(30) AS (CAST(data->>"$.name" AS CHAR(30)))
 , INDEX (generated_col));
 SELECT * FROM employees WHERE data->>"$.name" = "James";

The last query will use the index, because the server automatically strips away the CAST when looking for a matching expression. This however should not be allowed, as it causes a different result with and without an index (see bug#27337092):

 mysql> CREATE TABLE employees (
     ->   data JSON
     -> , generated_col VARCHAR(30) AS (CAST(data->>"$.name" AS CHAR(30))));
 Query OK, 0 rows affected (0.03 sec)
 mysql> INSERT INTO employees (data) VALUES ('{"name": "james"}'), ('{"name": "James"}');
 Query OK, 2 rows affected (0.01 sec)
 Records: 2  Duplicates: 0  Warnings: 0
 mysql> SELECT * FROM employees WHERE data->>"$.name" = "James";
 +-------------------+---------------+
 | data              | generated_col |
 +-------------------+---------------+
 | {"name": "James"} | James         |
 +-------------------+---------------+
 1 row in set (0.00 sec)
 mysql> ALTER TABLE employees ADD INDEX idx (generated_col);
 Query OK, 0 rows affected (0.03 sec)
 Records: 0  Duplicates: 0  Warnings: 0
 mysql> SELECT * FROM employees WHERE data->>"$.name" = "James";
 +-------------------+---------------+
 | data              | generated_col |
 +-------------------+---------------+
 | {"name": "james"} | james         |
 | {"name": "James"} | James         |
 +-------------------+---------------+
 2 rows in set (0.00 sec)

So for functional indexes, we have decided to automatically strip away the CAST _ONLY_ if the collation of the indexed expression and the query expression matches. So in order for a functional index to be used in the described scenarios, the following two solutions will work:

 # Give the indexed expression the same collation as JSON_UNQUOTE
 CREATE TABLE employees (
   data JSON
 , INDEX idx ((CAST(data->>"$.name" AS CHAR(30)) COLLATE utf8mb4_bin)));
 INSERT INTO employees (data) VALUES ('{"name": "james"}'), ('{"name": "James"}');
 EXPLAIN SELECT * FROM employees WHERE data->>"$.name" = "James";
 # Specify the full expression in the query
 CREATE TABLE employees (
   data JSON
 , INDEX idx ((CAST(data->>"$.name" AS CHAR(30)))));
 INSERT INTO employees (data) VALUES ('{"name": "james"}'), ('{"name": "James"}');
 EXPLAIN SELECT * FROM employees WHERE CAST(data->>"$.name" AS CHAR(30)) = "James";

See the hidden generated columns

For debug builds, we will provide an option to see the hidden generated columns for easier debugging:

 CREATE TABLE t (col1 INT, INDEX ((ABS(col1))));
 SET SESSION debug="+d,show_hidden_columns";
 SHOW CREATE TABLE t;
 CREATE TABLE `t` (
   `col1` int(11) DEFAULT NULL,
   `3bb8c14d415110ac3b3c55ce9108ae2d` int(11) GENERATED ALWAYS AS (abs(`col1`)) VIRTUAL,
   KEY `functional_index` ((abs(`col1`)))
 ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci

This allows us to inspect that the hidden generated column has the correct data type, length, character set etc.

Notes for documentation

  • If a table doesn't contain a PRIMARY KEY, InnoDB automatically promotes the first UNIQUE NOT NULL index to the primary key. This is not supported for functional UNIQUE NOT NULL indexes.
  • If a user wants to remove a column that is referenced by a functional index, the functional index must be removed first. If not, the user will get an error saying "Cannot drop column 'column_name' because it is used by a functional index. In order to drop the column, you must remove the functional index."
  • Normal indexes raises a warning if there are duplicate indexes. Functional indexes does not have this feature.

Contents


Parser

  1. Add a new rule "key_part_with_expression" that accepts an expression enclosed in parantheses and an optional ordering:

     key_part_with_expression:
         key_part
       | '(' expr ')' opt_ordering_direction
       ;
    
  2. All key_list rules are replaced with key_list_with_expression, except for foreign key definition; that will remain the same.
  3. We'll add a new parse tree node PT_key_part_specification that will contain the unresolved key part expression, and both the key_part and key_part_with_expression rule will use this type. The method contextualize() in this new class will take care of itemizing the expression.

Key_part_spec

  1. We add a new private variable "Item *m_expression;" that represents the indexed expression.
  2. The field_name variable is made non-const and ecapsulated in a getter so we can add some asserts to ensure correct usage of the field name. It must be non-const so we can set the hidden generated column it references AFTER we have added the column to the create list.
  3. is_ascending and is_explicit will also be encapsulated in a getter for consistency.
  4. Since this class is used for both normal indexes and foregin key specifications, we add a new variable "bool m_has_expression". This variable will help us determine if we need to resolve the indexed expression or if we can use the field name directly.
  5. A new method "bool resolve_expression(THD *);" is added that basically calls "fix_fields" on the expression.

Key_spec

As a consequence of the above, the Key_spec will be changed to contain an array of non-const Key_part_spec

Create_field

  1. A new private variable "dd::Column::enum_hidden_type hidden;" is added to this class, which will be used when adding the hidden generated columns for functional key parts. Associated methods (like Create_field::init) will get one new argument for the hidden type as well.

Field

The following changes will be done to the Field class:

  1. A new private field "dd::Column::enum_hidden_type m_hidden;". This field is used to distinguish the column between used-defined generated columns and hidden generated columns that are automatically added by the server. It will also tell us if the field should be visible to the user or not, and whether it can be dropped/modified by the user or not.

  2. Two new methods; "bool is_hidden_from_user()" and "bool is_field_for_functional_index()". As of now, these two methods will always return the same result in release builds. However, the may mean different thing in the future if we for instance add support for invisible columns. Also, in debug builds we can set a debug flag to show the hidden generated column in SELECT * etc for debugging purposes.

The method is_hidden_from_user() will be used in the following places/scenarios:

  1. "INSERT INTO tbl;" without a column list
  2. "SELECT * FROM tbl;"
  3. "SHOW CREATE TABLE tbl;" and "SHOW FIELDS FROM tbl;"

The method is_field_for_functional_index() will be used in the following places/scenarios:

  1. "ALTER TABLE tbl DROP COLUMN;", which will cause an error if the column specified is a hidden generated column for a functional key part.
  2. In "Item_field::print", so that we print the expression instead of the column name for functional key parts

Item

The Item class will get one new public method, "virtual bool replace_field_processor(uchar *)". This function will replace all Field pointers with Create_field_wrapper pointers (explained below) for an Item of type Item_field. It will also set some of the type properties like decimals and char_length so the data type for the hidden generated column will be deduced properly.

Create_field_wrapper

We will add a new class named Create_field_wrapper:

 class Create_field_wrapper : public Field {
  private:
   Create_field *m_field;
  public:
   Create_field_wrapper(Create_field *field);
   Item_result result_type() const override;
   enum_field_types type() const override;
   virtual uint32 max_display_length() override;
   virtual const CHARSET_INFO *charset() const override;
 };

All methods that tries to store or retrieve data from this Field will cause an assertion since this isn't a real Field per se. All the methods that retrieve data type information (max_display_length, charset etc) will fetch the data from the Create_field.

sql_table.cc

prepare_create_field()

If a user tries to add a column with the same name as a hidden generated column used for functional index, we raise the error ER_INTERNAL_ERROR - "The column name '...' is already in use by a hidden column"

Assigning name to indexes

If the index name is not explicitly specified by the user, the server auto generates a name. For functional indexes, this is a bit tricky;

  • Adding a functional will add one generated column for each key part. This must be done early in the stage so that we have the correct number of columns to create when calculating field offsets and similar.
  • The column name is a hash of the index name and the key part number, so the index name MUST be available before adding the hidden generated columns.

For this reason, we assign index name for functional indexes much earlier than for regular indexes. More specifically, it is done very early in mysql_prepare_create_table(), before "calculate_field_offsets()" and "prepare_create_field()".

mysql_prepare_create_table()

Early in this function, we go through all the indexes (non foreign keys) that has an expression attached. For each of these, we replace any Field pointers with Create_field_wrappers as explained above. When this is done, we go through each key part and add a new hidden generated column to the create list. Since we potentially are modifying the create list, this MUST be done before both "calculate_field_offsets()" and "prepare_create_field()".

mysql_alter_table()

In mysql_alter_table, there are two things we need to handle; dropping a functional index and renaming a functional index.

  1. When dropping a functional index, we need to remove any hidden generated column that was added for the functional index as well. Right after the table is opened in mysql_alter_table(), we have access to all the table information. If we see that we are dropping a functional index, we add any relevant hidden generated columns to the drop list as well

  2. If we are renaming a functional index, we need to rename the hidden generated column as well (we have a requirement that the column name should be a hash of the index name and the key part number). So for each hidden generated column that is affected, we calculate the new name and add the column to the create list. This is done right after the table is opened in mysql_alter_table(), where we have access to all the table information.

add_functional_index_to_create_list()

We add a new function named "add_functional_index_to_create_list" which will handle adding a hidden generated column to the create list. The function will do the following steps:

  • Assign a name to the index if the user didn't assign a name explicitly. This must be done because the name of the hidden generated column will be based on the index name.
  • Replace all Field references in items with Create_field_wrapper
  • Resolve the expression that the key part contains
  • Resolve the hidden generated columns data type from the key part expression, using the function "generate_create_field". "generate_create_field" is the function that is refactored out of "CREATE TABLE AS SELECT"
  • Add a new Create_field with an expression attached to the create list.