WL#411: Generated columns
Affects: Server-5.7
—
Status: Complete
The goal of this WL is to add support for Generated Columns (GC). Value of such column, unlike regular field's value, isn't set by the user, but computed from the expression given by user at the time GC is created. For example, if you have FirstName and SecondName columns, you may add a computed column FullName as CONCAT(FirstName," ",SecondName). There are two types of generated columns - stored and virtual. Values for former is computed only once, when a new record is inserted or old one is updated. After computing the value is stored in the table in exactly the same way as a value for a regular field. Values for latter aren't stored at all, but computed each time when server reads a record from the table. User-specified values can't be inserted into GCs. Stored GCs could be a part of index(es), while non-stored - can't. This feature could be used in several ways. The virtual GCs could be used as a way to simplify and unify queries. This way a complicated condition could be saved as a GC and then used as a field in a set of queries to ensure that all of them use exactly the same condition. Stored GCs could be used as a materialized cache for complicated conditions that are costly to calculate on the fly. Another use for stored GCs is a workaround for lack of functional indexes. This way user have to create a stored GC using required functional expression and then define a secondary index over it. Disadvantage of such approach is that values are stored twice - first time in the table as the value of GC and the second time in the index. True functional index would store data only once.
Functional requirements FR1: Syntax requirements ======================= 1) Max length of a generation expression (unreachable) is unlimited. But currently, we still have a bug on this problem. (Ref BUG#73700) 2) GCs may refer only to GCs defined prior to it. 3) Only deterministic built-in functions are allowed in generating expressions. Subqueries, parameters, variables, UDF/SP are not allowed in generated expression. 4) Only stored GCs are allowed to be a part of an index. 5) Attributes of a GC's are limited to: VIRTUAL/STORED, UNIQUE [KEY], [PRIMARY] KEY, NOT NULL, COMMENT 6) No special limitation on the number of GCs created for one table. FR2: GC's type ========================== In this WL, the type for a GC is explicitly specified by user. There is no limitation on GC's type. However the type may be changed because of generation expression. If they aren't the same type coercion will happen according to our regular type coercion rules. A GC(stored or virtual) has the same storage requirement as a regular column of the same type. FR3: INSERT/REPLACE and UPDATE requirements ====================================== INSERT/REPLACE and UPDATE only DEFAULT is allowed as a value of a GC. Otherwise, an error is thrown. Same applies to CREATE TABLE .. AS SELECT and INSERT .. SELECT. For view-definition purposes, a generated column is considered to be "updatable". FR4: Alter GC ====================================== 1) GC may be added/dropped. 2) GC must not be altered into base column. 3) GC's type and expression may be changed. 4) Stored and Virtual GC are not allowed to be transformed for each other. 5) Base column may be altered into Stored GC but not Virtual. 6) When rename or drop a column which has a GC dependency, an error is thrown. The details may be referenced by LLS. FR5: Changes to INFORMATION_SCHEMA ======================================= Pls reference 'Changes to INFORMATION_SCHEMA' in HLS. FR6: Error handling ======================================= ER1: If the generated expression contains an not-allowed expression such as non- deterministic function, an error will be reported: [ERROR] "Expression of generated column '%s' contains a disallowed function." ER2: If an index is created on a virtual generated column, the following error will be reported: [ERROR] "Key/Index cannot be defined on a virtual generated column." ER3: If FK on GC option has any of 'ON DELETE SET NULL', 'ON UPDATE SET NULL', or 'ON UPDATE CASCADE', the following error will be reported: [ERROR] "Cannot define foreign key with %s clause on a virtual generated column." ER4: INSERT/REPLACE/UPDATE a GC with a non-DEFAULT value, the following error will be reported: [ERROR] "The value specified for generated column '%s' in table '%s' is not allowed." ER5: Illegal operation on GC such as alter stored GC into virtual, the following error will be reported: [ERROR] "'%s' is not yet supported for generated columns." ER6: Only the GCs defined prior to the current one can be referenced, otherwise the following error will be reported: [ERROR] "GENERATED column can refer only to generated columns defined prior it." ER7: If a column which has GC's dependency is dropped, the following error will be reported: [ERROR] "Column '%s' has a generated column dependency." Non-Functional requirments ========================== NF1: Any performance regression should not be introduced.
Syntax ------ column_name type [GENERATED ALWAYS] AS ( expression ) [VIRTUAL | STORED] [UNIQUE [KEY]] [COMMENT comment] [[NOT] NULL] [[PRIMARY] KEY] STORED means the GC is stored. VIRTUAL defines GC as virtual. If neither VIRTUAL or STORED is present, VIRTUAL is the default. UNIQUE [KEY] defines a secondary unique key. COMMENT sets a comment. [NOT] NULL defines NULL or Not NULL. [PRIMARY] KEY defines a primary key. VIRTUAL/STORED/UNIQUE/COMMENT/[NOT] NULL/[PRIMARY] KEY are the only column attributes available for generated columns. Other column attributes (DEFAULT, ON UPDATE, AUTO_INC etc) aren't supported and cause error. When GC's expression evaluates to null and is saved to the non-null underlying field exactly the same error is thrown as when saving null into a regular non- null field. Thus the behavior is consistent. As SEs see stored GCs as regular fields, [PRIMARY] KEY is allowed for stored GC but not for virtual GC. The above syntax is valid in CREATE TABLE, in ALTER TABLE ... ADD, in ALTER TABLE ... MODIFY, and in ALTER TABLE ... CHANGE. But ALTER TABLE ADD COLUMN in this case will not be an online operation compared to regular ADD COLUMN because expression must be evaluated in the server. Making it an online operation and generated column indexable will need changes in InnoDB and will be done in a separate WL. Terminology ----------- Consider the statement: CREATE TABLE t (Column_1 CHAR(5) GENERATED ALWAYS AS (PI()+5), Column_2 CHAR(5)); Column_1 is a "generated column". A column which is defined with a GENERATED ALWAYS clause is a generated column. Column_2 is a "base column". Any column which is not a generated column is a base column. The clause PI()+5, inside the parentheses, is a "generation expression". The clause GENERATED ALWAYS AS (PI()+5) is a "generation clause". Virtual vs stored --------------------- The main difference is that values of virtual GCs are calculated on the fly each time table's record is read while values of stored GCs are calculated only when being stored to the SE. Virtual GCs have to exist only on server layer, but due to current design they are stored in SE but server ignores values SE reads. Effectively this means that SE uselessly stores some trash. Values of stored GCs are actually stored in SE, read and written. Due to this they can be a part of [foreign] indexes, partitioned by, etc. GC's type --------- The SQL standard allows generated columns to be created with no specified type, but where the type is derived from the type of the generated expression. This is difficult in MySQL because we can't obtain type of generation expression early enough (this requires fix_fields). So we propose to not support this syntax. In this WL, the type for a GC is explicitly specified by user. The current server design wouldn't allow us to make it optional. However the type might be changed because of generation expression. If they aren't the same type coercion will happen according to our regular type coercion rules. Indexes ------- stored GCs can be a part of a secondary foreign/UNIQUE/FTS/GIS index. virtual GCs can't be a part of a key and an appropriate error is throws on attempt to define such key because its data is on the fly. What can be in the generation expression ---------------------------------------- A generation expression may contain a literal, a built-in function, an operator, or a reference to any base column within the same table. No subqueries. No parameters. No variables (MariaDB allows them). No UDF/SP. A GC could refer to other GCs that are defined prior to it, i.e: CREATE TABLE ... (a int, b INT GENERATED ALWAYS AS (a), c INT GENERATED ALWAYS AS(b)) is allowed, CREATE TABLE ... (a int, b INT GENERATED ALWAYS AS (b), c INT GENERATED ALWAYS AS (b)) CREATE TABLE ... (a int, b INT GENERATED ALWAYS AS (c), c INT GENERATED ALWAYS AS (b)) aren't. This differs from MariaDB, which doesn't allow referring other GCs. Length of generating expression is limited to 64K (unlike 255 bytes in original patch and MariaDB) to allow lengthly expressions for JSON and GIS functions. If the expression contains a function, that function should be scalar and deterministic. That is the standard requirement, and it's reasonable, especially where we allow indexing. This is enforced by cheking function to be defined with DETERMINISTIC and NO SQL. Assignments ----------- For both INSERT/REPLACE and UPDATE only DEFAULT is allowed as a value of a GC. Otherwise, an error is thrown. Same applies to CREATE TABLE .. AS SELECT and INSERT .. SELECT. For view-definition purposes, a generated column is considered to be "updatable", since it is possible to assign to it, although the only thing one can assign is DEFAULT. Altering GCs ------------ Altering generation expression is allowed for both stored and virtual GCs. For stored GCs it can't be done in-place as expressions are evaluated by the server. The standard optional feature F385 "Drop column generation expression clause" allows: ALTER TABLE t ALTER COLUMN generated_column DROP EXPRESSION; This turns the generated column into a base column. This isn't supported. Following statements could be used instead: ALTER TABLE t MODIFY stored_generated_column data_type; Drops generation expression and changed type of the column to data_type. Column's data is left as is. ALTER TABLE t MODIFY stored_generated_column data_type [GENERATED ALWAYS] AS (expression) STORED; Alters generation expression, changes type of the column to data_type, and updates column's data according to specified expression. ALTER TABLE t MODIFY stored_generated_column data_type [GENERATED ALWAYS] AS (expression); Not allowed. ALTER TABLE t MODIFY virtual_generated_column data_type; Not allowed. ALTER TABLE t MODIFY virtual_generated_column data_type [GENERATED ALWAYS] AS (expression) STORED; Not allowed. ALTER TABLE t MODIFY virtual_generated_column data_type [GENERATED ALWAYS] AS (expression); Alters generation expression and changes type of the column to data_type. ALTER TABLE t MODIFY base_column data_type [GENERATED ALWAYS] AS (expression) STORED; Turns base_column into GC and updates column's data according to specified expression. Note: Oracle doesn't allow such an alter: Schema Creation Failed: ORA-54026: Real column cannot have an expression. ALTER TABLE t MODIFY base_column data_type [GENERATED ALWAYS] AS (expression); Not allowed. One might note an inconsistency here - it's possible to turn base column into a stored GC and vice versa, but isn't possible to do the same for virtual GC. This limitation seems to be completely artificial as currently SEs (InnoDB and MyISAM) just skips virtual fields from processing, but effectively storing some garbage instead. So turning base column into a virtual GC is exactly the same as for stored GC. When SE will be able to skip virtual GC from storing it also wouldn't impose any issues as those alterations would effectively mean adding/dropping a column, which we already support. All data changes (including field value and index value etc.) for GCs are done offline (i.e not in-place) after alter operation as generated expressions are only evaluated by the server layer except virtual generated column modification because its' data is on the fly. This might be lifted after implementation of WL#1075 "Support for function indexes". Dropping / altering / renaming objects that the generation expression depends on -------------------------------------------------------------------------------- On attempt to rename/drop a column a GC depends on the error is thrown: ERROR 1054 (42S22): Column 'column' has a generated column dependency' where 'column' is the column being renamed/dropped. Foreign keys ------------ GCs could be in both parent and child tables. For example: CREATE TABLE t1 (.., s1 INT GENERATED ALWAYS AS (..), PRIMARY KEY (s1)); CREATE TABLE t2 (.., s1 INT GENERATED ALWAYS AS (..) REFERENCES t1 (s1)); Since foreign-key checking needs indexes, and since we don't allow indexes for not-stored columns, this works only for stored generated columns. A generated column is allowed to be be in a foreign-key reference. Triggers -------- A trigger can not refer to NEW.column_name or OLD.column_name if column_name is a generated column. If then, an error will be thrown out except it's set to DEFAULT. Partitions ---------- Partitioning by generated columns are allowed for stored GC, for example: CREATE TABLE t (s1 INT, s2 VIRTUAL INT (ABS(s1) STORED) PARTITION BY LIST (s2) (PARTITION p1 VALUES IN (1); As partitioning sees a stored GC as a regular field this can be used to work around partitioning limitation related to functions used for partitioning. See also http://dev.mysql.com/doc/refman/5.1/en/partitioning-limitations-functions.html When is the generation expression evaluated? -------------------------------------------- Generation expressions for stored GCs are evaluated in INSERT/REPLACE/UPDATE only, immediately after a BEFORE trigger is evaluated. Therefore it is evaluated before a constraint check. For example: CREATE TABLE t (s1 INT, s2 VIRTUAL INT AS (CASE WHEN s1 IS NULL THEN 5 ELSE s1 END) STORED); CREATE TRIGGER t_bu BEFORE UPDATE ON t FOR EACH ROW SET @s1 = NULL; CREATE VIEW v AS SELECT s1 FROM t WHERE s2 = 5 WITH CHECK OPTION; INSERT INTO v VALUES (NULL); SELECT * FROM t\G *************************** 1. row *************************** s1: NULL s2: 5 There is no evaluation at CREATE/ALTER time. Generation expressions for virtual GC are evaluated immediately after a record is read. Changes to INFORMATION_SCHEMA ------------------------------ INFORMATION_SCHEMA.COLUMNS has a new column: GENERATION_EXPRESSION VARCHAR NULL or generation expression Beside that, INFORMATION_SCHEMA.COLUMNS.EXTRA prints GENERATED for both stored and virtual GCs. An option: print "VIRTUAL GENERATED" for virtual GCs, "stored GENERATED" for stored GCs. DESCRIBE and SHOW FIELDS behaves as INFORMATION_SCHEMA.COLUMNS.EXTRA. Both commands doesn't print generation expression. New error messages introduced ================================ ER_GENERATED_COLUMN_FUNCTION_IS_NOT_ALLOWED eng "Expression of generated column '%s' contains a disallowed function." ER_KEY_BASED_ON_GENERATED_GENERATED_COLUMN eng "Key/Index cannot be defined on a virtual generated column." ER_WRONG_FK_OPTION_FOR_GENERATED_COLUMN eng "Cannot define foreign key with %s clause on a generated column." ER_NON_DEFAULT_VALUE_FOR_GENERATED_COLUMN eng "The value specified for generated column '%s' in table '%s' is not allowed." ER_UNSUPPORTED_ACTION_ON_GENERATED_COLUMN eng "'%s' is not yet supported for generated columns." ER_GENERATED_COLUMN_NON_PRIOR eng "GENERATED column can refer only to generated columns defined prior it." ER_DEPENDENT_BY_GENERATED_COLUMN eng "Column '%s' has a generated column dependency." ===FUTURE PLAN=== ------------------- Query rewrite ------------- Suppose we create a table thus: CREATE TABLE t (s1 INT, s2 GENERATED ALWAYS AS (s1+5)); We expect that like "SELECT s2 ..." will have exactly the same effect as "SELECT (s1+5) ...". Since s2 is not stored, the query may be rewritten with the simple rule: replace column-name with (generation-expression). However, this isn't and won't be implemented in scope of this WL. CREATE TABLE ... LIKE --------------------- The standard allows: CREATE TABLE t1 LIKE t2 { INCLUDING | EXCLUDING } GENERATED; This isn't supported and CREATE TABLE ... LIKE behaves like INCLUDING GENERATED is given. Appendix A: Syntax in MariaDB --------------------- column_name type GENERATED ALWAYS AS (expression) [VIRTUAL | PERSISTENT] [UNIQUE [KEY]] [COMMENT 'string'] I.e. it's same as Peter's, but [NOT] VIRTUAL is replaced for VIRTUAL | PERSISTENT. The latter is obvious, the former is opposite to it. Appendix B: Peter Gulutzan's original HLS specification There's no immediate need to change the original title "Computed virtual columns as MS [i.e. Microsoft] SQL server has". But we might choose the term "generated column" rather than "computed column", and such columns are not necessarily virtual, and the recommendations in this document assume we will choose standard SQL as a model, rather than SQL Server. Think of the Holy Roman Empire, which wasn't holy, wasn't Roman, and wasn't an empire. Titles aren't specifications. SQL Server ---------- SQL Server 2008 allows:::= column_name AS computed_column_expression [ PERSISTED [ NOT NULL ] ] where PERSISTED means "store calculation results when inserting or updating". Microsoft's example: " CREATE TABLE mytable ( low int, high int, myavg AS (low + high)/2 ) ; " There are some restrictions and additional clauses, see SQL Server 2008 "CREATE TABLE" syntax: http://msdn.microsoft.com/en-us/library/ms174979.aspx Oracle ------ Oracle 11g allows: column [datatype] [GENERATED ALWAYS] AS (column_expression) [VIRTUAL] [ inline_constraint [inline_constraint]... ] For details see Oracle 11g CREATE TABLE statement: http://download.oracle.com/docs/cd/B28359_01/server.111/b28286/statements_7002.h tm DB2 --- DB2 9.1 allows: GENERATED ALWAYS AS (generation-expression) See DB2 UDB 9.1 CREATE TABLE statement: http://www.informatik.uni- bonn.de/~tb/Lehre/ws01/vRDBMS/db2s0/frame3.htm#sqls0621 Standard SQL ------------ SQL:2008 non-core (optional) feature T175 "Generated columns" allows: GENERATED ALWAYS AS (generation expression) For example: CREATE TABLE t (a INT, b INT, c INT GENERATED ALWAYS AS (a + b)); Here the value of the generated column c is derived from a and b. GENERATED and ALWAYS are non-reserved words. It's interesting that if we made GENERATED ALWAYS into [GENERATED ALWAYS], that is, if we made the words optional, as Oracle does, then we'd still be able to parse correctly. Since Oracle and DB2 are both following the standard (with slight variations as usual), and since the standard specification is clearer than vendor manuals, this specification will follow the standard specification wherever practical. Terminology ----------- Consider the statement: CREATE TABLE t (Column_1 CHAR(5) GENERATED ALWAYS AS (PI()+5), Column_2 CHAR(5)); Column_1 is a "generated column". A column which is defined with a GENERATED ALWAYS clause is a generated column. Column_2 is a "base column". Any column which is not a generated column is a base column. The clause (PI()+5), including the parentheses, is a "generation expression". The clause GENERATED ALWAYS AS (PI()+5) is a "generation clause". A "generated column" is a column which is not stored, it must be calculated whenever the row is accessed. VIRTUAL ------- The SQL standard does not have a way to specify whether a generated column is virtual or not. This is common; the standard often ignores storage considerations. We'd be better off if we could distinguish, since there are some advantages to generated not-virtual columns (we can guarantee that you'll get the same value twice even if generation expressions are non-deterministic), and some advantages to generated virtual columns (they require little or no storage space). So we want to add a non-standard bit to the syntax to say "it's [not] virtual". The Oracle syntax "[GENERATED ALWAYS] AS (expression) VIRTUAL" would be dangerous, because in Oracle it doesn't mean anything if you leave out the word VIRTUAL, it still will be virtual. We could try "GENERATED VIRTUALLY AS (expression)" or some such thing, but that appears to deviate too much from the standard. The SQL Server syntax "AS expression PERSISTED" has an unEnglish air. So let's say that the MySQL syntax for a generation clause is GENERATED ALWAYS AS (generation expression) [[NOT] VIRTUAL] and VIRTUAL is the default. That is, you have to say NOT VIRTUAL explicitly if you want the generated values to actually be stored in the database. The functionality is deliverable with triggers, so expect gripes about redundancy. Column definition details ------------------------- In standard SQL, a column definition looks like (omitting irrelevancies): ::= [ ] [ | | ] [ ... ] [ ] In MySQL, a column definition currently looks like this: column_definition: data_type [NOT NULL | NULL] [DEFAULT default_value] [AUTO_INCREMENT] [UNIQUE [KEY] | [PRIMARY] KEY] [COMMENT 'string'] [reference_definition] If we follow the standard rules for analogous MySQL clauses, then: The data type is optional. If it's specified, then the generation expression must have a result that's assignable to the column, but that's okay, MySQL is very loose about what's assignable. If data type is not specified, then the generation expression result decides what the data type is. For example: CREATE TABLE t (s1 GENERATED ALWAYS AS (5)); /* data type is INTEGER */ CREATE TABLE t (s1 BIGINT GENERATED ALWAYS AS (5)); /* conversion happens */ CREATE TABLE t (s1 TINYINT GENERATED ALWAYS AS (555555)); /* uh-oh. */ In the last example, we must be able to assign, which means we must always be able to truncate, so we cannot pay attention to sql_mode strictness. The default clause and identity column specification (in MySQL the analogue is auto_increment) cannot be specified along with generation clause. These are examples of syntax errors: CREATE TABLE t (s1 INT AUTO_INCREMENT GENERATED ALWAYS AS (5), KEY(s1)); CREATE TABLE t (s1 INT GENERATED ALWAYS AS (5) DEFAULT 5); The constraint clauses should follow the generation clause. But we can't move "NOT NULL | NULL" or "reference definition" (which are both constraints) from their established places. It appears then that the best position for a generation clause is after data type, and before NULL | NOT NULL. The collate clause, in MySQL, must be associated with a data type clause. So, in MySQL, a column definition in future, if and only if it includes a generation clause, looks like this: column_definition: [data_type] GENERATED ALWAYS AS (value expression) [[NOT] VIRTUAL] [NOT NULL | NULL] [UNIQUE [KEY] | [PRIMARY] KEY] [COMMENT 'string'] [reference_definition] This column_definition variant is legal in CREATE TABLE, in ALTER TABLE ... ADD, in ALTER TABLE ... MODIFY, and in ALTER TABLE ... CHANGE. Indexes ------- Recommendation: don't allow generated virtual columns in indexes. That's properly a job for the implementor of WL#1075 "Add support for functional indexes". Since PRIMARY and UNIQUE and REFERENCES specifications depend on indexes, it follows that table definitions like this are illegal: CREATE TABLE t (s1 INT GENERATED ALWAYS AS (5) NOT NULL UNIQUE); /* illegal */ On the other hand, a generated not-virtual column does not require a functional index, and so a column defined with "GENERATED ALWAYS AS (generation expression) NOT VIRTUAL" may be in indexes, and in unique- or foreign-key definitions. Sometimes this might mean that it's impractical to ALTER a column from NOT VIRTUAL to VIRTUAL. What can be in the generation expression ---------------------------------------- A generation expression may contain a literal, a built-in function, a function defined with CREATE FUNCTION, an operator, or a reference to any base column within the same table. That's all. No subqueries. No references to any generated column. No parameters. No variables ... well, maybe certain variables, we'll see. If the expression contains a function, that function should be scalar and deterministic. That is the standard requirement, and it's reasonable, especially where we allow indexing. However, we don't really enforce determinism. All we do is check whether a user-defined function is defined with DETERMINISTIC. For replication purposes, we'll need to ensure that generation expressions that contain PI() etc. will have a deterministic PI() value in the binlog. If the expression contains a function defined with CREATE FUNCTION, then the function, and any functions it invokes, may not reference any tables, may not change stored data, must be checked for appropriate privileges at time of column access, Properly speaking the function should be created with CREATE FUNCTION ... DETERMINISTIC NO SQL, but we won't require those keywords. Assignments ----------- For both INSERT and UPDATE, the standard allows assigning DEFAULT. Nothing else. For example: INSERT INTO t (generated_column) VALUES (DEFAULT); /* legal */ UPDATE t SET generated_column = DEFAULT; /* legal */ INSERT INTO t (generated_column, base_column) VALUES (NULL, NULL); /* illegal */ UPDATE t SET generated_column = base_column; /* illegal */ Since REPLACE works like INSERT, we also allow REPLACE ... VALUES (DEFAULT); We will allow the following non-standard syntax: CREATE TABLE t (s1 GENERATED ALWAYS AS (5)) AS SELECT 6; That is, for a situation where we can't avoid an apparent assignment to a generated column, we allow the syntax but ignore the source value. The same consideration applies for INSERT ... SELECT. The order of assignments has to be irrelevant. Therefore, for this situation: CREATE TABLE t (base_column INT, generated_column BLOB GENERATED ALWAYS AS (base_column)); INSERT INTO t VALUES (5, DEFAULT); UPDATE t SET s2 = DEFAULT, s1 = 6; the value of t.s2 will be 6. Hmm, there's a contradiction with what's said elsewhere, eh? For view-definition purposes, consider a generated column to be "updatable", since it is possible to assign to it, although the only thing you can assign is DEFAULT. ALTER TABLE ... DROP EXPRESSION ------------------------------- The standard optional feature F385 "Drop column generation expression clause" allows: ALTER TABLE t ALTER COLUMN generated_column DROP EXPRESSION; This turns the generated column into a base column. We won't implement this. It's sufficient to say ALTER TABLE t MODIFY generated_column data_type; and the generation clause, since it's not explicitly specified, disappears. This sort of ALTER TABLE statement may change a column from virtual to not-virtual, and therefore will be slow and will require more storage. Dropping / altering / renaming objects that the generation expression depends on -------------------------------------------------------------------------------- Before allowing the renaming/altering of a base column, it's now necessary to check: is this column referenced in any generation expression? If so, ALTER is illegal. ALTER ... DROP base_column has a cascade effect, that is, it causes dropping of any generated columns that depend on the base column. Before allowing the dropping of a function, it's now necessary to check: is this function, directly or indirectly, referenced in any generation expression? If so, DROP is illegal. Foreign keys ------------ It is legal to use not-virtual generated columns in parent or child tables. For example: CREATE TABLE t1 (s1 INT GENERATED ALWAYS AS (5), PRIMARY KEY (s1)); CREATE TABLE t2 (s1 INT GENERATED ALWAYS AS (5) REFERENCES t1 (s1)); Since foreign-key checking needs indexes, and since we don't allow indexes for not-virtual columns, this works only for not-virtual generated columns. A generated column may not be in a foreign-key reference that includes ON UPDATE CASCADE, or ON UPDATE|DELETE SET NULL, or ON UPDATE|DELETE SET DEFAULT. Probably we won't support foreign keys with generated columns until we have WL#148 Foreign keys: Implement Foreign Keys (all storage engines). Triggers -------- A trigger may not refer to NEW.column_name or OLD.column_name if column_name is a generated column. This may not be what other DBMSs do. Partitions ---------- Recommendation: allow partitioning by generated columns, for example: CREATE TABLE t (s1 INT, s2 INT GENERATED ALWAYS AS (ABS(s1)) PARTITION BY LIST (s2) (PARTITION p1 VALUES IN (1); This cannot be used to work around any partitioning limitation. For example, the MySQL Reference Manual has a section "Partitioning Limitations Relating to Functions" http://dev.mysql.com/doc/refman/5.1/en/partitioning-limitations-functions.html which does not say you can use UPPER. Therefore this is illegal: CREATE TABLE t (s1 CHAR(1), s2 CHAR(1) GENERATED ALWAYS AS (UPPER(s1))) PARTITION BY KEY(s2); It might be objected that a generated not-virtual column does not have to be subjected to the partitioning limitations relating to functions. When is the generation expression evaluated? -------------------------------------------- A generation expression is evaluated immediately after a BEFORE trigger is evaluated. Therefore it is evaluated before a constraint check. For example: CREATE TABLE t (s1 INT, s2 GENERATED ALWAYS AS (WHEN s1 IS NULL THEN 5 ELSE s1 END)); CREATE TRIGGER t_bu BEFORE UPDATE ON t FOR EACH ROW SET s1 = NULL; CREATE VIEW v AS SELECT s1 FROM t WHERE s2 = 5 WITH CHECK OPTION; INSERT INTO v VALUES (NULL); /* results left as an exercise for the reader */ There is no evaluation at CREATE/ALTER time. Thus some niladic functions, such as CURRENT_USER, will always depend on an evaluation at the time we access the row. This is somewhat different from the way we handle CURRENT_USER for views. Query rewrite ------------- Suppose we create a table thus: CREATE TABLE t (s1 INT, s2 GENERATED ALWAYS AS (s1+5)); We expect that like "SELECT s2 ..." will have exactly the same effect as "SELECT (s1+5) ...". Since s2 is not stored, the query may be rewritten with the simple rule: replace column-name with (generation-expression). However, this should only be exposed when necessary. For example, after CREATE VIEW v AS SELECT s2 FROM t; SHOW CREATE VIEW v; The user will see "s2", not "s1+5", for the view definition. CREATE TABLE ... LIKE --------------------- The standard allows: CREATE TABLE t1 LIKE t2 { INCLUDING | EXCLUDING } GENERATED; The default is EXCLUDING GENERATED. Generated columns are only copied if one specifies INCLUDING GENERATED. INFORMATION_SCHEMA ------------------ INFORMATION_SCHEMA.COLUMNS has two new columns: IS_GENERATED VARCHAR 'NEVER' or 'ALWAYS' GENERATION_EXPRESSION VARCHAR NULL or generation expression We also need a way to indicate our non-standard VIRTUAL usage. There is no proposal here to change DESCRIBE or SHOW FIELDS. Timestamp --------- Ordinarily, with many caveats and special cases, the default value for a MySQL TIMESTAMP column may be the current date and time, as if we'd said "column_name TIMESTAMP GENERATED ALWAYS AS (CURRENT_TIMESTAMP)". Since we long ago passed the maximum number of rules that mortals can keep in their heads about timestamps, we'll just disallow all generation expressions other than (CURRENT_TIMESTAMP). Potential Benefits ------------------ A specification is not the place to tout or propagandize. Maximally, one can point out that, although one can get usually get the same benefits or more by using views, a generated column is in a base table, and therefore: it can be used for partitions and indexes if it's not-virtual, it can be be defined without needing CREATE VIEW privileges, and maybe someday for some storage engines you can get statistics that could be useful with optimizing.
Design overview =============== A generated column is represented by the Item_field/Field classes, as any regular field. Additional info required for handling of GC is stored in new structure called generated_column_info. It's pointed to by Field::gcol_info, when it's non-null it means the columns is a GC. In addition to that a new bool Field::stored_in_db indicates that the field is stored. It should be true for stored GCs and for all regular fields. False only for virtual GCs. The expressions themselves are stored in .frm files. SQL parser has a new set of rules to be able to parse GC expressions. It's used to parse GC expressions when opening table from .frm too. Limits on GC are checked at following points: Max length limit - when writing .frm file Allowed functions and referred columns - on table open Keys limitation - on key creation GCs generation expressions are fix_field'ed in open_table_from_share. Such approach allows to have single point of check, unlike other approaches. From storage engine POV GCs are exactly like regular fields. The difference is that InnoDB is modified to completely ignore virtual fields, i.e it's neither reads them nor writes. MyISAM is left as is. Effectively this means that SEs would reserve space for GCs, but wouldn't use it (i.e. have garbage there). This will be fixed (only for InnoDB) in scope of a separate WL#8114. Support of GC on SE level ========================= There is not much to support, but in order to indicate that SE is aware of GCs and know how to deal with them (e.g. ignore virtual GCs) it should return HA_GENERATED_COLUMNS among its table_flags. New structures ============== The only new data structure is generated_column_info: class generated_column_info: public Sql_alloc { public: Item *expr_item; // Parsed generation expression LEX_STRING expr_str; // Generation expression string Item *item_free_list; // Item free list generated_column_info() : expr_item(0), item_free_list(0), field_type(MYSQL_TYPE_LONG), stored_in_db(FALSE), data_inited(FALSE) { expr_str.str= NULL; expr_str.length= 0; }; ~generated_column_info() {} enum_field_types get_real_type() { DBUG_ASSERT(data_inited); return field_type; } void set_field_type(enum_field_types fld_type) { /* Calling this function can only be done once. */ DBUG_ASSERT(!data_inited); data_inited= TRUE; } bool get_field_stored() { DBUG_ASSERT(data_inited); return stored_in_db; } void set_field_stored(bool stored) { stored_in_db= stored; } private: --8<-- This is a leftover and should be removed from code -- /* The following data is only updated by the parser and read when a Create_field object is created/initialized. */ enum_field_types field_type; /* Real field type*/ --8<-- bool stored_in_db; /* Indication that the field is phisically stored in the database*/ /* This flag is used to prevent other applications from reading and using incorrect data. */ bool data_inited; }; Parsing ======= GCs generation expressions are parsed in two cases: 1) CREATE/ALTER TABLE 2) opening a table In the 1st case parsing is done as usual. LEX::gcol_info is allocated and filled for GCs during parsing. 2nd case is handled as follows: .) open_binary_frm parses .frm file, reads expressions strings and attaches them to appropriate fields. .) open_table_from_share calls new function unpack_gcol_info_from_frm which constructs a special command: PARSE_GCOL_EXPRthen it sets LEX::parse_gcol_expr to true and calls SQL parser. The flag is used to block users from using this command. When it's not set the syntax error is thrown for this command. On success parser allocates and fills LEX::gcol_info and it's attached to the field being processed. .) New function fix_fields_gcol_func is called by unpack_gcol_info_from_frm. It resolves the expression against the table (through fix_fields) and does limits check. In particular, check for forbidden functions is performed by the new check_gcol_func_processor(..) processor. It returns true for functions that aren't allowed to be a part of a GC's expression. For Item_field this processor also treats argument as field index of the GC being checked and if Item_field field's index equals to it or greater than (which means that the GC refers to itself or forward) then it sets *arg to -1 to indicate wrong reference and returns true. Note that for CREATE/ALTER TABLE expressions are parsed twice: 1st time when SQL command being parsed, 2nd time when table in question being opened after creation/altering. Storing/Reading to/from .frm ============================ .frm is read by open_binary_frm and written by pack_header/pack_fields. Thus those functions are modified to read/write additional info for GCs. Storage format for a GC is as follows: Byte Description 1 always 1 to allow further extensions 2,3 expression's length 4 flags,as of now: 0 - no flags 1 - field is stored 5... expression string Beside that, for a GC an additional unireg flag (a new Field::GENERATED_FIELD = 128) is set to distinguish GC from regular fields. Altering ======== The alteration of a GC doesn't differ much from a regular column, but with few restrictions. A GC can't change it's 'stored' status, i.e a virtual GC can't become a stored one and vice versa. But this allows a stored GC to become a regular field, opposite is possible too. The check itself is done by mysql_prepare_alter_table(). As all GCs are evaluated on the server layer InnoDB's in-place altering isn't supported. To block in-place update a new flag called Alter_inplace_info::HA_ALTER_STORED_GCOL is used. It's set by fill_alter_inplace_info and checked by handler::check_if_supported_inplace_alter. It's accompanied by another new flag Alter_info::ALTER_STORED_GCOLUMN which blocks in-place ALTER requests from a user. ALTER_STORED_GCOLUMN Storing/reading data ==================== The main difference between virtual and stored GCs on the server level is how their value is obtained and saved. Value of each virtual GC is calculated "on the fly" for each record read. To do that, each record reading function (e.g. join_read_first or join_read_next) calls new function update_virtual_fields_marked_for_write. In the case of INSERT/UPDATE this function is called by fill_record and fill_record_n_invoke_before_triggers. The latter calls this function 2nd time, after triggers, to take into account changes to regular fields made by triggers. Note that all changes made to GCs by triggers are lost because their values are recalculated. update_virtual_fields_marked_for_write goes through all generated columns and if it's in the write_set of the table and it's stored and shouldn't be ignored then it updates field's value with the value of the expression. When record is read all stored GCs are ignored, and only virtual GCs are updated. When record is written, all GCs are marked for write, so all of them are updated. Such approach allows GCs to refer other GCs that are defined prior them. An example: CREATE TABLE t(f1 int, f2 INT GENERATED ALWAYS AS (f1) VIRTUAL, f3 INT GENERATED ALWAYS AS (f2) STORED) When a value is inserted into the table, all GCs are always processed from left to right in the order of definition. So when the expression of f3 is evaluated, value of f2 is already stored in the field (even if it's not being saved to the SE), so f3 will be evaluated correctly. Same when data is being read: CREATE TABLE t(f1 int, f2 INT GENERATED ALWAYS AS (f1) STORED, f3 INT GENERATED ALWAYS AS (f2) VIRTUAL) update_virtual_fields_marked_for_write is called after the record has been read, thus values of all stored GCs are already correct. Evaluation from left to right allows to get correct values for virtual GCs from both stored and virtual GCs defined prior it. Partition pruning ================= As GCs values are obtained much later than PP happens we can't allow it when GCs are involved. Current implementation blocks PP completely if there is any GC, but it should just exclude GCs from PP process and return only when nothing but GC is specified. Allowed functions ================= All deterministic built-in functions are allowed. SP and UDFs aren't allowed because currently we can't trust their 'deterministic' property. Beside non-deterministic functions, some other functions are blocked, e.g. all XML things. This limitation could be lifted in a separate WL. This is the reason why we can't say that any deterministic function is allowed. Full list of functions TBD. (Check for UDFs should be added to the code)
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.