WL#8170: Expression analyzer for GC
Status: Complete
The goal of this WL is to allow our range and ref optimizers to find opportunities to use indexes defined over generated columns (GC). Intended use of this feature is to allow building indexes over JSON documents. Basic use case: CREATE TABLE t1(f1 INT, gc INT AS (f1+1) STORED, KEY gc_idx(gc)); INSERT ... SELECT * FROM t1 WHERE f1 + 1 > 9; Optimizer should be able to use RANGE access for this query. This WL will operate only STORED GCs as indexes over VIRTUAL GCs aren't available at the moment. When they'll be ready, they'll be automatically used (with exception of the primary key case below) as optimizer doesn't do any explicit checks for a GC being stored or virtual. Also, optimizer should take into account the comparison type: result type of GC should match the result type of the expression it's compared to. E.g following query wouldn't use index: SELECT * FROM t1 WHERE f1 + 1 > 9.1; The intended use case for it is following: CREATE TABLE t1 (uuid BINARY(12), data JSON, _gc1 INT AS MYJSON_EXTRACT(data, '$.id'), _gc2 DOUBLE AS MYJSON_EXTRACT(data, '$.id'), KEY _gc1_idx(_gc1), KEY _gc2_idx(_gc2)); Before implementing native JSON indexes (in scope of another WL) result of MYJSON_XX() functions would have to be casted to a particular SQL type in order to allow proper indexing by InnoDB. INSERT ... SELECT ... FROM t1 WHERE MYJSON_EXTRACT(data, '$.id')=1; After this WL is implemented, server should be able to generate ref access for idx index. Another application of this WL is to allow to use an index over GC for sorting and grouping: CREATE TABLE t1(f1 INT, gc INT AS (f1+1) STORED PRIMARY KEY); INSERT ... SELECT * FROM t1 WHERE ... ORDER BY f1 + 1; Optimizer should be able to use index for resolving ORDER BY clause in this case. CREATE TABLE t1(f1 INT, f2 INT, gc INT AS (f1+1) STORED PRIMARY KEY); INSERT ... SELECT f1+1, MAX(f2) FROM t1 WHERE ... GROUP BY f1 + 1; Optimizer should be able to use loose index scan for such query. Note that in both examples above the primary key is used, also the key should be clustered. Since there is only one available primary key per table, optimizer will use whatever type of GC used in it. If type conversion would happen, it would be handled by server automatically, according to current type conversion rules, this WL will not add/change anything in this area.
Functional requirements ======================= 1) Optimizer should be able to substitute parts of WHERE condition with matching GC fields, same for elements of GROUP/ORDER lists 2) Optimizer should be able to used an index defined on a generated column for expressions that match the definition of the generated column. Non-functional requirements =========================== The WL's implementation shouldn't have bugs and lead to performance regressions.
General approach ================ First step is to create list of GC to replace with with. Optimizer will go through all available GCs and add to the list those GCs which are a part of a non-disabled index. On the 2nd step, optimizer will go through the WHERE condition and look for comparison operators that allow index to be used: >, >=, =, <=, <, IN, BETWEEN. Upon finding such operator function optimizer will replace its arguments with generated column when all of the following criteria are met: .) Result type of the func's argument and the GC's expression are the same .) func's arg to be replaced matches GC's expression .) in case of IN, BETWEEN all args, except the one being replaced (arg[0]), have to have same result type. On the 3rd step optimizer tried to transform ORDER/GROUP BY. This step has more restrictions than above ones. Consider the query: CREATE TABLE t1(f1 INT, gc INT AS (f1+1) STORED PRIMARY KEY, KEY gc_idx(gc)) ENGINE=innodb; SELECT f1+1 FROM t1 WHERE f1 > 3 ORDER BY f1+1; t1 has two indexes - primary and secondary. Since optimizer considers non-covering keys to be slower than the table scan + filesort it'll look only for covering indexes. When the SELECT query goes through resolution phase (fix_fields in particular) gc_idx is marked as non-covering as it doesn't include field f1. This means that even if we replace "f1+1" expression in both select and ORDER BY lists with gc, we can't be sure that f1 isn't used in other parts of the query and can't easily ensure that gc_idx is covering. Due to that, the only key that could be safely used for ordering is InnoDB's clustered primary key as it includes all fields. To take this into account, optimizer checks that the primary key is clustered and if so it removes from the list of GCs all GCs that aren't a part of the primary key. Otherwise we skip this step. After trimming the list, optimizer will go through ORDER/GROUP BY lists and replace matching expressions with GCs. Support for multiple types of the same expression ================================================= CREATE TABLE t1 (uuid BINARY(12), data JSON, _gc1 INT AS MYJSON_EXTRACT(data, '$.id'), _gc2 DOUBLE AS MYJSON_EXTRACT(data, '$.id') , KEY _gc1_idx(_gc1), KEY _gc2_idx(_gc2)); INSERT ...; SELECT ... FROM t1 WHERE MYJSON_EXTRACT(data, '$.id')=1; Note that GC expressions in this example are the same, but stored in fields of different types. Due to schema-less nature of JSON, different rows could have values of different type for the same key. Using single index, e.g over INT field, for different constants, e.g. INTs and DOUBLEs most likely would lead to a wrong result. To address that user would have to define several indexes of different types but with same expression, like _gc1_idx and _gc2_idx in the example above. Optimizer will use the type of the SQL constant, INT in the example above, and use it to pick the index with appropriate type. When comparing two JSON typed values, index can't be used. To address that, a user will have to add explicit CAST to one of JSON values. Effectively this would block index usage on the CASTed column. This is a known deficiency and would be addressed when server would support JSON indexes.
The analyzer is represented by a new member function of JOIN class - JOIN::substitute_gc(). It's called before join order optimizer in order to let the latter see all necessary indexes. This function consists of 3 parts: 1) It goes through the list of tables and collects all fields representing GCs that are a part of an index that's available for query (i.e not disabled by hints / ALTER). 2) Uses Item::compile() to transform the query's condition. The analyzer for it (called Item::gc_subst_analyzer()) picks >, >=, =, <=, <, IN, BETWEEN operations. The transformer function (Item::gc_subst_transformer()) checks whether picked function's arguments are matching defined criteria, and if so it creates a new Item_field for the matched GC's field and replaces the matching func's argument with it. For >, >=, =, <=, < functions either of arguments can be replaced, while for IN and BETWEEN only the first (left) one. 3) This step is taken only if the primary key is clustered. If so it removes from the list of GCs all fields that aren't a part of primary key, then it goes through ORDER/GROUP BY lists (if available) and tries to replace lists' elements with matching GCs expressions. This WL also introduces minor supporting change: now all objects implementing CAST function have TYPECAST_FUNC function type.
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.