WL#411: Generated columns

Affects: Server-5.7   —   Status: Complete   —   Priority: High

The goal of this WL is to add support for Generated Columns (GC). Value of such
column, unlike regular field's value, isn't set by the user, but computed
from the expression given by user at the time GC is created.

For example, if you have  FirstName and SecondName columns,   
you may add a computed column FullName as CONCAT(FirstName," ",SecondName).  

There are two types of generated columns - stored and virtual. Values for
former is computed only once, when a new record is inserted or old one is
updated. After computing the value is stored in the table in exactly the
same way as a value for a regular field. Values for latter aren't stored at
all, but computed each time when server reads a record from the table.
User-specified values can't be inserted into GCs. Stored GCs could be a
part of index(es), while non-stored - can't.

This feature could be used in several ways. The virtual GCs could be
used as a way to simplify and unify queries. This way a complicated
condition could be saved as a GC and then used as a field in a set of queries
to ensure that all of them use exactly the same condition.
Stored GCs could be used as a materialized cache for complicated conditions
that are costly to calculate on the fly. Another use for stored GCs is a
workaround for lack of functional indexes. This way user have to create a
stored GC using required functional expression and then define a
secondary index over it. Disadvantage of such approach is that values are stored 
twice - first time in the table as the value of GC and the second time in the 
index. True functional index would store data only once.


Functional requirements

FR1: Syntax requirements
=======================
1) Max length of a generation expression (unreachable) is unlimited. But 
currently, we still have a bug on this problem. (Ref BUG#73700) 
2) GCs may refer only to GCs defined prior to it.
3) Only deterministic built-in functions are allowed in generating expressions.
   Subqueries, parameters, variables, UDF/SP are not allowed in generated 
expression.
4) Only stored GCs are allowed to be a part of an index.
5) Attributes of a GC's are limited to:
  VIRTUAL/STORED, UNIQUE [KEY], [PRIMARY] KEY, NOT NULL, COMMENT
6) No special limitation on the number of GCs created for one table.

FR2: GC's type
==========================
In this WL, the type for a GC is explicitly specified by user. There is no 
limitation on GC's type. However the type may be changed because of generation 
expression. If they aren't the same type coercion will happen according to our 
regular type coercion rules. A GC(stored or virtual) has the same storage 
requirement
as a regular column of the same type.

FR3: INSERT/REPLACE and UPDATE requirements
======================================
INSERT/REPLACE and UPDATE only DEFAULT is allowed as a value of a GC. Otherwise, 
an error is thrown. Same applies to CREATE TABLE .. AS SELECT and INSERT .. 
SELECT. For view-definition purposes, a generated column is considered to be 
"updatable".

FR4: Alter GC
======================================
1) GC may be added/dropped.
2) GC must not be altered into base column.
3) GC's type and expression may be changed.
4) Stored and Virtual GC are not allowed to be transformed for each other. 
5) Base column may be altered into Stored GC but not Virtual.
6) When rename or drop a column which has a GC dependency, an error is thrown.
The details may be referenced by LLS.

FR5: Changes to INFORMATION_SCHEMA
=======================================
Pls reference 'Changes to INFORMATION_SCHEMA' in HLS.

FR6: Error handling
=======================================
ER1: If the generated expression contains an not-allowed expression such as non-
deterministic function, an error will be reported:
  [ERROR] "Expression of generated column '%s' contains a disallowed function."

ER2: If an index is created on a virtual generated column, the following error 
will be reported:
  [ERROR] "Key/Index cannot be defined on a virtual generated column."

ER3: If FK on GC option has any of 'ON DELETE SET NULL', 'ON UPDATE SET NULL', 
or 'ON UPDATE CASCADE', the following error will be reported:
  [ERROR] "Cannot define foreign key with %s clause on a virtual generated 
column."

ER4: INSERT/REPLACE/UPDATE a GC with a non-DEFAULT value, the following error 
will be reported:	
  [ERROR] "The value specified for generated column '%s' in table '%s' is not 
allowed."

ER5: Illegal operation on GC such as alter stored GC into virtual, the following 
error will be reported:
  [ERROR] "'%s' is not yet supported for generated columns."

ER6: Only the GCs defined prior to the current one can be referenced, otherwise 
the following error will be reported:
  [ERROR] "GENERATED column can refer only to generated columns defined prior 
it."

ER7: If a column which has GC's dependency is dropped, the following error will 
be reported:
  [ERROR] "Column '%s' has a generated column dependency."

Non-Functional requirments
==========================
NF1: Any performance regression should not be introduced.
Syntax
------

column_name type [GENERATED ALWAYS] AS ( expression )
  [VIRTUAL | STORED] [UNIQUE [KEY]] [COMMENT comment]
  [[NOT] NULL] [[PRIMARY] KEY]

STORED means the GC is stored.
VIRTUAL defines GC as virtual.
If neither VIRTUAL or STORED is present, VIRTUAL is the default.
UNIQUE [KEY] defines a secondary unique key.
COMMENT sets a comment.
[NOT] NULL defines NULL or Not NULL.
[PRIMARY] KEY defines a primary key.

VIRTUAL/STORED/UNIQUE/COMMENT/[NOT] NULL/[PRIMARY] KEY are the only column 
attributes available for generated columns. Other column attributes (DEFAULT, 
ON UPDATE, AUTO_INC etc) aren't supported and cause error.

When GC's expression evaluates to null and is saved to the non-null underlying 
field exactly the same error is thrown as when saving null into a regular non-
null field. Thus the behavior is consistent.

As SEs see stored GCs as regular fields, [PRIMARY] KEY is allowed for stored GC 
but not for virtual GC.

The above syntax is valid in CREATE TABLE, in ALTER TABLE ... ADD,
in ALTER TABLE ... MODIFY, and in ALTER TABLE ... CHANGE. But ALTER TABLE ADD 
COLUMN in this case will not be an online operation compared to regular ADD 
COLUMN because expression must be evaluated in the server. Making it an online 
operation and generated column indexable will need changes in InnoDB and will be
done in a separate WL.

Terminology
-----------

Consider the statement:
CREATE TABLE t (Column_1 CHAR(5) GENERATED ALWAYS AS (PI()+5), Column_2 
CHAR(5));

Column_1 is a "generated column". A column which is defined with a
GENERATED ALWAYS clause is a generated column.

Column_2 is a "base column". Any column which is not a generated column is a
base column.

The clause PI()+5, inside the parentheses, is a "generation expression".

The clause GENERATED ALWAYS AS (PI()+5) is a "generation clause".

Virtual vs stored
---------------------

The main difference is that values of virtual GCs are calculated on the
fly each time table's record is read while values of stored GCs are
calculated only when being stored to the SE. 

Virtual GCs have to exist only on server layer, but due to current design
they are stored in SE but server ignores values SE reads. Effectively this
means that SE uselessly stores some trash.

Values of stored GCs are actually stored in SE, read and written. Due to
this they can be a part of [foreign] indexes, partitioned by, etc.

GC's type
---------
The SQL standard allows generated columns to be created with no specified type, 
but where the type is derived from the type of the generated expression. This is 
difficult in MySQL because we can't obtain type of generation expression early 
enough (this requires fix_fields). So we propose to not support this syntax.

In this WL, the type for a GC is explicitly specified by user. The current 
server design wouldn't allow us to make it optional. However the type might be 
changed because of generation expression. If they aren't the same type coercion 
will happen according to our regular type coercion rules.

Indexes
-------

stored GCs can be a part of a secondary foreign/UNIQUE/FTS/GIS index.
virtual GCs can't be a part of a key and an appropriate error is
throws on attempt to define such key because its data is on the fly.

What can be in the generation expression
----------------------------------------

A generation expression may contain a literal, a built-in function,
an operator, or a reference to any base column within the same table. No
subqueries. No parameters. No variables (MariaDB allows them).
No UDF/SP. A GC could refer to other GCs that are defined prior to it, i.e:

  CREATE TABLE ... (a int, b INT GENERATED ALWAYS AS (a), c INT GENERATED ALWAYS 
AS(b))

is allowed,

  CREATE TABLE ... (a int, b INT GENERATED ALWAYS AS (b), c INT GENERATED ALWAYS 
 AS (b))
  CREATE TABLE ... (a int, b INT GENERATED ALWAYS AS (c), c INT GENERATED ALWAYS 
AS (b))

aren't. This differs from MariaDB, which doesn't allow referring other GCs.
Length of generating expression is limited to 64K (unlike 255 bytes in
original patch and MariaDB) to allow lengthly expressions for JSON and GIS
functions.

If the expression contains a function, that function should be scalar and
deterministic. That is the standard requirement, and it's reasonable,
especially where we allow indexing. This is enforced by cheking function to
be defined with DETERMINISTIC and NO SQL.

Assignments
-----------

For both INSERT/REPLACE and UPDATE only DEFAULT is allowed as a value of a GC.
Otherwise, an error is thrown. Same applies to CREATE TABLE .. AS SELECT and 
INSERT .. SELECT.

For view-definition purposes, a generated column is considered to be
"updatable", since it is possible to assign to it, although the only thing
one can assign is DEFAULT.

Altering GCs
------------

Altering generation expression is allowed for both stored and
virtual GCs. For stored GCs it can't be done in-place as
expressions are evaluated by the server.

The standard optional feature F385 "Drop column generation expression clause"
allows:

  ALTER TABLE t ALTER COLUMN generated_column DROP EXPRESSION;

This turns the generated column into a base column. This isn't supported.
Following statements could be used instead:

  ALTER TABLE t MODIFY stored_generated_column data_type;
    
    Drops generation expression and changed type of the column to data_type.
    Column's data is left as is.

  ALTER TABLE t MODIFY
    stored_generated_column data_type [GENERATED ALWAYS] AS (expression) STORED;
    
    Alters generation expression, changes type of the column to data_type, and
    updates column's data according to specified expression.

  ALTER TABLE t MODIFY
    stored_generated_column data_type [GENERATED ALWAYS] AS (expression);
    
    Not allowed.

  ALTER TABLE t MODIFY virtual_generated_column data_type;

    Not allowed.

  ALTER TABLE t MODIFY
    virtual_generated_column data_type [GENERATED ALWAYS] AS (expression) 
STORED;
    
    Not allowed.
    
  ALTER TABLE t MODIFY
    virtual_generated_column data_type [GENERATED ALWAYS] AS (expression);
    
    Alters generation expression and changes type of the column to data_type.

  ALTER TABLE t MODIFY
    base_column data_type [GENERATED ALWAYS] AS (expression) STORED;
    
    Turns base_column into GC and updates column's data according to specified
    expression.

    Note: Oracle doesn't allow such an alter: Schema Creation Failed: ORA-54026: 
Real column cannot have an expression.
  ALTER TABLE t MODIFY
    base_column data_type [GENERATED ALWAYS] AS (expression);
    
    Not allowed.

One might note an inconsistency here - it's possible to turn base column into
a stored GC and vice versa, but isn't possible to do the same for
virtual GC. This limitation seems to be completely artificial as
currently SEs (InnoDB and MyISAM) just skips virtual fields from
processing, but effectively storing some garbage instead. So turning base
column into a virtual GC is exactly the same as for stored GC.
When SE will be able to skip virtual GC from storing it also wouldn't
impose any issues as those alterations would effectively mean adding/dropping
a column, which we already support.

All data changes (including field value and index value etc.) for GCs are done 
offline (i.e not in-place) after alter operation as generated expressions are 
only evaluated by the server layer except virtual generated column 
modification because its' data is on the fly. This might be lifted after 
implementation of WL#1075 "Support for function indexes".

Dropping / altering / renaming objects that the generation expression depends on
--------------------------------------------------------------------------------

On attempt to rename/drop a column a GC depends on the error is thrown:

  ERROR 1054 (42S22): Column 'column' has a generated column dependency'

where 'column' is the column being renamed/dropped.

Foreign keys
------------

GCs could be in both parent and child tables. 
For example:
CREATE TABLE t1 (.., s1 INT GENERATED ALWAYS AS (..), PRIMARY KEY (s1));
CREATE TABLE t2 (.., s1 INT GENERATED ALWAYS AS (..) REFERENCES t1 (s1));

Since foreign-key checking needs indexes, and since we don't allow indexes
for not-stored columns, this works only for stored generated columns.

A generated column is allowed to be be in a foreign-key reference.

Triggers
--------

A trigger can not refer to NEW.column_name or OLD.column_name if column_name
is a generated column. If then, an error will be thrown out except it's set to 
DEFAULT.

Partitions
----------

Partitioning by generated columns are allowed for stored GC, for example:
CREATE TABLE t (s1 INT, s2 VIRTUAL INT (ABS(s1) STORED)
PARTITION BY LIST (s2) (PARTITION p1 VALUES IN (1);

As partitioning sees a stored GC as a regular field this can be used to
work around partitioning limitation related to functions used for partitioning.
See also
http://dev.mysql.com/doc/refman/5.1/en/partitioning-limitations-functions.html

When is the generation expression evaluated?
--------------------------------------------

Generation expressions for stored GCs are evaluated in
INSERT/REPLACE/UPDATE only, immediately after a BEFORE trigger is
evaluated. Therefore it is evaluated before a constraint check. For example:

CREATE TABLE t
  (s1 INT, s2 VIRTUAL INT AS (CASE WHEN s1 IS NULL THEN 5 ELSE s1 END) STORED);
CREATE TRIGGER t_bu BEFORE UPDATE ON t FOR EACH ROW SET @s1 = NULL;
CREATE VIEW v AS SELECT s1 FROM t WHERE s2 = 5 WITH CHECK OPTION;
INSERT INTO v VALUES (NULL); 
SELECT * FROM t\G
*************************** 1. row ***************************
s1: NULL
s2: 5

There is no evaluation at CREATE/ALTER time. 

Generation expressions for virtual GC are evaluated immediately after
a record is read.

Changes to INFORMATION_SCHEMA
------------------------------

INFORMATION_SCHEMA.COLUMNS has a new column:
GENERATION_EXPRESSION     VARCHAR        NULL or generation expression

Beside that, INFORMATION_SCHEMA.COLUMNS.EXTRA prints GENERATED for both
stored and virtual GCs.

An option: print "VIRTUAL GENERATED" for virtual GCs, "stored
GENERATED" for stored GCs.

DESCRIBE and SHOW FIELDS behaves as INFORMATION_SCHEMA.COLUMNS.EXTRA. Both
commands doesn't print generation expression.

New error messages introduced
================================
ER_GENERATED_COLUMN_FUNCTION_IS_NOT_ALLOWED
  eng "Expression of generated column '%s' contains a disallowed function."

ER_KEY_BASED_ON_GENERATED_GENERATED_COLUMN
  eng "Key/Index cannot be defined on a virtual generated column."

ER_WRONG_FK_OPTION_FOR_GENERATED_COLUMN
  eng "Cannot define foreign key with %s clause on a generated column."

ER_NON_DEFAULT_VALUE_FOR_GENERATED_COLUMN
  eng "The value specified for generated column '%s' in table '%s' is not 
allowed."

ER_UNSUPPORTED_ACTION_ON_GENERATED_COLUMN
  eng "'%s' is not yet supported for generated columns."

ER_GENERATED_COLUMN_NON_PRIOR
  eng "GENERATED column can refer only to generated columns defined prior it."

ER_DEPENDENT_BY_GENERATED_COLUMN
  eng "Column '%s' has a generated column dependency."

===FUTURE PLAN===
-------------------
Query rewrite
-------------

Suppose we create a table thus:
CREATE TABLE t (s1 INT, s2 GENERATED ALWAYS AS (s1+5));
We expect that like "SELECT s2 ..." will have exactly the same effect as
"SELECT (s1+5) ...". Since s2 is not stored, the query may be rewritten
with the simple rule: replace column-name with (generation-expression).

However, this isn't and won't be implemented in scope of this WL.

CREATE TABLE ... LIKE
---------------------

The standard allows:
CREATE TABLE t1 LIKE t2  { INCLUDING | EXCLUDING } GENERATED;

This isn't supported and CREATE TABLE ... LIKE behaves like INCLUDING GENERATED
is given.

Appendix A: Syntax in MariaDB
---------------------

column_name type GENERATED ALWAYS AS (expression) [VIRTUAL | PERSISTENT]
    [UNIQUE [KEY]] [COMMENT 'string']

I.e. it's same as Peter's, but [NOT] VIRTUAL is replaced for 
VIRTUAL | PERSISTENT. The latter is obvious, the former is opposite to it.

Appendix B: Peter Gulutzan's original HLS specification

There's no immediate need to change the original title
"Computed virtual columns as MS [i.e. Microsoft]
SQL server has". But we might choose the term
"generated column" rather than "computed column",
and such columns are not necessarily virtual,
and the recommendations in this document assume we will
choose standard SQL as a model, rather than SQL Server.

Think of the Holy Roman Empire, which wasn't holy,
wasn't Roman, and wasn't an empire. Titles aren't specifications.

SQL Server
----------

SQL Server 2008 allows:

<computed_column_definition> ::=
column_name AS computed_column_expression 
[ PERSISTED [ NOT NULL ] ]
where PERSISTED means "store calculation results when inserting
or updating".

Microsoft's example:
"
CREATE TABLE mytable 
( low int, high int, myavg AS (low + high)/2 ) ;
"

There are some restrictions and additional clauses, see
SQL Server 2008 "CREATE TABLE" syntax:
http://msdn.microsoft.com/en-us/library/ms174979.aspx

Oracle
------

Oracle 11g allows:
column [datatype] [GENERATED ALWAYS] AS (column_expression)
   [VIRTUAL]
   [ inline_constraint [inline_constraint]... ]

For details see Oracle 11g CREATE TABLE statement:
http://download.oracle.com/docs/cd/B28359_01/server.111/b28286/statements_7002.h
tm

DB2
---

DB2 9.1 allows:
GENERATED ALWAYS AS (generation-expression)

See DB2 UDB 9.1 CREATE TABLE statement:
http://www.informatik.uni-
bonn.de/~tb/Lehre/ws01/vRDBMS/db2s0/frame3.htm#sqls0621

Standard SQL
------------

SQL:2008 non-core (optional) feature T175 "Generated columns" allows:
GENERATED ALWAYS AS (generation expression)
For example:
CREATE TABLE t (a INT, b INT, c INT GENERATED ALWAYS AS (a + b)); 
Here the value of the generated column c is derived from a and b.

GENERATED and ALWAYS are non-reserved words. It's interesting that if we
made GENERATED ALWAYS into [GENERATED ALWAYS], that is, if we made the
words optional, as Oracle does, then we'd still be able to parse correctly.

Since Oracle and DB2 are both following the standard (with slight variations
as usual), and since the standard specification is clearer than vendor manuals,
this specification will follow the standard specification wherever practical.

Terminology
-----------

Consider the statement:
CREATE TABLE t (Column_1 CHAR(5) GENERATED ALWAYS AS (PI()+5), Column_2 
CHAR(5));

Column_1 is a "generated column". A column which is defined with a
GENERATED ALWAYS clause is a generated column.

Column_2 is a "base column". Any column which is not a generated column is a
base column.

The clause (PI()+5), including the parentheses, is a "generation expression".

The clause GENERATED ALWAYS AS (PI()+5) is a "generation clause".

A "generated column" is a column which is not stored, it must be calculated
whenever the row is accessed.

VIRTUAL
-------

The SQL standard does not have a way to specify whether a generated column
is virtual or not. This is common; the standard often ignores storage
considerations. We'd be better off if we could distinguish, since there are
some advantages to generated not-virtual columns (we can guarantee that
you'll get the same value twice even if generation expressions are
non-deterministic), and some advantages to generated virtual columns
(they require little or no storage space).

So we want to add a non-standard bit to the syntax to say "it's [not] virtual".

The Oracle syntax "[GENERATED ALWAYS] AS (expression) VIRTUAL" would be
dangerous, because in Oracle it doesn't mean anything if you leave out
the word VIRTUAL, it still will be virtual.

We could try "GENERATED VIRTUALLY AS (expression)" or some such thing,
but that appears to deviate too much from the standard.

The SQL Server syntax "AS expression PERSISTED" has an unEnglish air.

So let's say that the MySQL syntax for a generation clause is
GENERATED ALWAYS AS (generation expression) [[NOT] VIRTUAL]
and VIRTUAL is the default. That is, you have to say NOT VIRTUAL explicitly
if you want the generated values to actually be stored in the database.

The functionality is deliverable with triggers, so expect gripes about
redundancy.

Column definition details
-------------------------

In standard SQL, a column definition looks like (omitting irrelevancies):

<column definition> ::=
<column name>
[ <data type> ]
[ <default clause> | <identity column specification> | <generation clause> ]
[ <column constraint definition>... ]
[ <collate clause> ]

In MySQL, a column definition currently looks like this:

column_definition:
    data_type [NOT NULL | NULL] [DEFAULT default_value]
      [AUTO_INCREMENT] [UNIQUE [KEY] | [PRIMARY] KEY]
      [COMMENT 'string'] [reference_definition]

If we follow the standard rules for analogous MySQL clauses, then:

The data type is optional. If it's specified, then the generation expression
must have a result that's assignable to the column, but that's okay, MySQL
is very loose about what's assignable. If data type is not specified,
then the generation expression result decides what the data type is.
For example:
CREATE TABLE t (s1 GENERATED ALWAYS AS (5));        /* data type is INTEGER */
CREATE TABLE t (s1 BIGINT GENERATED ALWAYS AS (5)); /* conversion happens */
CREATE TABLE t (s1 TINYINT GENERATED ALWAYS AS (555555)); /* uh-oh. */
In the last example, we must be able to assign, which means we must always
be able to truncate, so we cannot pay attention to sql_mode strictness.

The default clause and identity column specification (in MySQL the analogue
is auto_increment) cannot be specified along with generation clause.
These are examples of syntax errors:
CREATE TABLE t (s1 INT AUTO_INCREMENT GENERATED ALWAYS AS (5), KEY(s1));
CREATE TABLE t (s1 INT GENERATED ALWAYS AS (5) DEFAULT 5);

The constraint clauses should follow the generation clause. But we can't move
"NOT NULL | NULL" or "reference definition" (which are both constraints)
from their established places. It appears then that the best position for
a generation clause is after data type, and before NULL | NOT NULL.

The collate clause, in MySQL, must be associated with a data type clause.

So, in MySQL, a column definition in future, if and only if it includes
a generation clause, looks like this:

column_definition:
    [data_type]
    GENERATED ALWAYS AS (value expression) [[NOT] VIRTUAL]
    [NOT NULL | NULL]
    [UNIQUE [KEY] | [PRIMARY] KEY]
    [COMMENT 'string'] [reference_definition]

This column_definition variant is legal in CREATE TABLE, in ALTER TABLE
... ADD, in ALTER TABLE ... MODIFY, and in ALTER TABLE ... CHANGE.

Indexes
-------

Recommendation: don't allow generated virtual columns in indexes.
That's properly a job for the implementor of WL#1075 "Add support for functional
indexes".

Since PRIMARY and UNIQUE and REFERENCES specifications depend on indexes,
it follows that table definitions like this are illegal:
CREATE TABLE t (s1 INT GENERATED ALWAYS AS (5) NOT NULL UNIQUE); /* illegal */

On the other hand, a generated not-virtual column does not require a
functional index, and so a column defined with "GENERATED ALWAYS AS
(generation expression) NOT VIRTUAL" may be in indexes, and in unique- or
foreign-key definitions.

Sometimes this might mean that it's impractical to ALTER a column from
NOT VIRTUAL to VIRTUAL.

What can be in the generation expression
----------------------------------------

A generation expression may contain a literal, a built-in function, a function
defined with CREATE FUNCTION, an operator, or a reference to any base column
within the same table. That's all. No subqueries. No references to any
generated column. No parameters. No variables ... well, maybe certain
variables, we'll see.

If the expression contains a function, that function should be scalar and
deterministic. That is the standard requirement, and it's reasonable,
especially where we allow indexing. However, we don't really enforce
determinism. All we do is check whether a user-defined function is defined
with DETERMINISTIC. For replication purposes, we'll need to ensure that
generation expressions that contain PI() etc. will have a deterministic
PI() value in the binlog.

If the expression contains a function defined with CREATE FUNCTION, then
the function, and any functions it invokes, may not reference any tables,
may not change stored data, must be checked for appropriate privileges
at time of column access, Properly speaking the function should be created
with CREATE FUNCTION ... DETERMINISTIC NO SQL, but we won't require those 
keywords.

Assignments
-----------

For both INSERT and UPDATE, the standard allows assigning DEFAULT. Nothing else.
For example:
INSERT INTO t (generated_column) VALUES (DEFAULT); /* legal */
UPDATE t SET generated_column = DEFAULT;           /* legal */
INSERT INTO t (generated_column, base_column) VALUES (NULL, NULL); /* illegal */
UPDATE t SET generated_column = base_column;                       /* illegal */

Since REPLACE works like INSERT, we also allow REPLACE ... VALUES (DEFAULT);

We will allow the following non-standard syntax:
CREATE TABLE t (s1 GENERATED ALWAYS AS (5)) AS SELECT 6;
That is, for a situation where we can't avoid an apparent assignment to
a generated column, we allow the syntax but ignore the source value.
The same consideration applies for INSERT ... SELECT.

The order of assignments has to be irrelevant. Therefore, for this situation:
CREATE TABLE t (base_column INT, generated_column BLOB GENERATED ALWAYS AS
(base_column));
INSERT INTO t VALUES (5, DEFAULT);
UPDATE t SET s2 = DEFAULT, s1 = 6;
the value of t.s2 will be 6. Hmm, there's a contradiction with what's said
elsewhere, eh?

For view-definition purposes, consider a generated column to be "updatable",
since it is possible to assign to it, although the only thing you can assign is
DEFAULT.

ALTER TABLE ... DROP EXPRESSION
-------------------------------

The standard optional feature F385 "Drop column generation expression clause"
allows:
ALTER TABLE t ALTER COLUMN generated_column DROP EXPRESSION;
This turns the generated column into a base column.

We won't implement this. It's sufficient to say
ALTER TABLE t MODIFY generated_column data_type;
and the generation clause, since it's not explicitly specified, disappears.

This sort of ALTER TABLE statement may change a column from virtual to
not-virtual, and therefore will be slow and will require more storage.

Dropping / altering / renaming objects that the generation expression depends on
--------------------------------------------------------------------------------

Before allowing the renaming/altering of a base column, it's now
necessary to check: is this column referenced in any generation expression?
If so, ALTER is illegal.

ALTER ... DROP base_column has a cascade effect, that is, it causes dropping
of any generated columns that depend on the base column.

Before allowing the dropping of a function, it's now
necessary to check: is this function, directly or indirectly, referenced
in any generation expression?
If so, DROP is illegal.

Foreign keys
------------

It is legal to use not-virtual generated columns in parent or child tables.
For example:
CREATE TABLE t1 (s1 INT GENERATED ALWAYS AS (5), PRIMARY KEY (s1));
CREATE TABLE t2 (s1 INT GENERATED ALWAYS AS (5) REFERENCES t1 (s1));

Since foreign-key checking needs indexes, and since we don't allow indexes
for not-virtual columns, this works only for not-virtual generated columns.

A generated column may not be in a foreign-key reference that includes
ON UPDATE CASCADE, or ON UPDATE|DELETE SET NULL, or ON UPDATE|DELETE SET 
DEFAULT.

Probably we won't support foreign keys with generated columns until we have
WL#148 Foreign keys: Implement Foreign Keys (all storage engines).

Triggers
--------

A trigger may not refer to NEW.column_name or OLD.column_name if column_name
is a generated column. This may not be what other DBMSs do.

Partitions
----------

Recommendation: allow partitioning by generated columns, for example:
CREATE TABLE t (s1 INT, s2 INT GENERATED ALWAYS AS (ABS(s1))
PARTITION BY LIST (s2) (PARTITION p1 VALUES IN (1);

This cannot be used to work around any partitioning limitation.
For example, the MySQL Reference Manual has a section
"Partitioning Limitations Relating to Functions"
http://dev.mysql.com/doc/refman/5.1/en/partitioning-limitations-functions.html
which does not say you can use UPPER. Therefore this is illegal:
CREATE TABLE t (s1 CHAR(1), s2 CHAR(1) GENERATED ALWAYS AS (UPPER(s1)))
PARTITION BY KEY(s2);

It might be objected that a generated not-virtual column does not have to be
subjected to the partitioning limitations relating to functions.

When is the generation expression evaluated?
--------------------------------------------

A generation expression is evaluated immediately after a BEFORE trigger is
evaluated. Therefore it is evaluated before a constraint check. For example:

CREATE TABLE t (s1 INT,
                s2 GENERATED ALWAYS AS (WHEN s1 IS NULL THEN 5 ELSE s1 END));
CREATE TRIGGER t_bu BEFORE UPDATE ON t FOR EACH ROW SET s1 = NULL;
CREATE VIEW v AS SELECT s1 FROM t WHERE s2 = 5 WITH CHECK OPTION;
INSERT INTO v VALUES (NULL); /* results left as an exercise for the reader */

There is no evaluation at CREATE/ALTER time. Thus some niladic functions,
such as CURRENT_USER, will always depend on an evaluation at the time we
access the row. This is somewhat different from the way we handle
CURRENT_USER for views.

Query rewrite
-------------

Suppose we create a table thus:
CREATE TABLE t (s1 INT, s2 GENERATED ALWAYS AS (s1+5));
We expect that like "SELECT s2 ..." will have exactly the same effect as
"SELECT (s1+5) ...". Since s2 is not stored, the query may be rewritten
with the simple rule: replace column-name with (generation-expression).

However, this should only be exposed when necessary.
For example, after
CREATE VIEW v AS SELECT s2 FROM t;
SHOW CREATE VIEW v;
The user will see "s2", not "s1+5", for the view definition.

CREATE TABLE ... LIKE
---------------------

The standard allows:
CREATE TABLE t1 LIKE t2  { INCLUDING | EXCLUDING } GENERATED;

The default is EXCLUDING GENERATED.

Generated columns are only copied if one specifies INCLUDING GENERATED.

INFORMATION_SCHEMA
------------------

INFORMATION_SCHEMA.COLUMNS has two new columns:
IS_GENERATED              VARCHAR        'NEVER' or 'ALWAYS'
GENERATION_EXPRESSION     VARCHAR        NULL or generation expression

We also need a way to indicate our non-standard VIRTUAL usage.

There is no proposal here to change DESCRIBE or SHOW FIELDS.

Timestamp
---------

Ordinarily, with many caveats and special cases, the default value for
a MySQL TIMESTAMP column may be the current date and time, as if we'd said
"column_name TIMESTAMP GENERATED ALWAYS AS (CURRENT_TIMESTAMP)". Since
we long ago passed the maximum number of rules that mortals can keep
in their heads about timestamps, we'll just disallow all generation
expressions other than (CURRENT_TIMESTAMP).

Potential Benefits
------------------

A specification is not the place to tout or propagandize.
Maximally, one can point out that, although one can get usually get
the same benefits or more by using views, a generated column is in
a base table, and therefore: it can be used for partitions and indexes
if it's not-virtual, it can be be defined without needing CREATE VIEW
privileges, and maybe someday for some storage engines you can get
statistics that could be useful with optimizing.
Design overview
===============
A generated column is represented by the Item_field/Field classes, as any
regular field. Additional info
required for handling of GC is stored in new structure called
generated_column_info. It's pointed to by Field::gcol_info, when it's
non-null it means the columns is a GC. In addition to that a new bool
Field::stored_in_db indicates that the field is stored. It should be true
for stored GCs and for all regular fields. False only for virtual GCs.
The expressions themselves are stored in .frm files.
SQL parser has a new set of rules to be able to parse GC expressions. It's
used to parse GC expressions when opening table from .frm too.
Limits on GC are checked at following points:
  Max length limit - when writing .frm file
  Allowed functions and referred columns - on table open
  Keys limitation - on key creation
GCs generation expressions are fix_field'ed in open_table_from_share. Such
approach allows to have single point of check, unlike other approaches.
From storage engine POV GCs are exactly like regular fields. The difference
is that InnoDB is modified to completely ignore virtual fields, i.e
it's neither reads them nor writes. MyISAM is left as is. Effectively this
means that SEs would reserve space for GCs, but wouldn't use it (i.e. have
garbage there). This will be fixed (only for InnoDB) in scope of a
separate WL#8114. 

Support of GC on SE level
=========================
There is not much to support, but in order to indicate that SE is aware of
GCs and know how to deal with them (e.g. ignore virtual GCs) it should
return HA_GENERATED_COLUMNS among its table_flags.

New structures
==============
The only new data structure is generated_column_info:

class generated_column_info: public Sql_alloc
{
public:
  Item *expr_item;          // Parsed generation expression
  LEX_STRING expr_str;      // Generation expression string
  Item *item_free_list;     // Item free list
  generated_column_info() 
  : expr_item(0), item_free_list(0),
    field_type(MYSQL_TYPE_LONG),
    stored_in_db(FALSE), data_inited(FALSE)
  {
    expr_str.str= NULL;
    expr_str.length= 0;
  };
  ~generated_column_info() {}
  enum_field_types get_real_type()
  {
    DBUG_ASSERT(data_inited);
    return field_type;
  }
  void set_field_type(enum_field_types fld_type)
  {
    /* Calling this function can only be done once. */
    DBUG_ASSERT(!data_inited);
    data_inited= TRUE;
  }
  bool get_field_stored()
  {
    DBUG_ASSERT(data_inited);
    return stored_in_db;
  }
  void set_field_stored(bool stored)
  {
    stored_in_db= stored;
  }
private:
--8<-- This is a leftover and should be removed from code --
  /*
    The following data is only updated by the parser and read
    when a Create_field object is created/initialized.
  */
  enum_field_types field_type;   /* Real field type*/
--8<--

  bool stored_in_db;             /* Indication that the field is 
                                    phisically stored in the database*/
  /*
    This flag is used to prevent other applications from
    reading and using incorrect data.
  */
  bool data_inited; 
};


Parsing
=======
GCs generation expressions are parsed in two cases:
1) CREATE/ALTER TABLE
2) opening a table

In the 1st case parsing is done as usual. LEX::gcol_info is allocated and
filled for GCs during parsing. 

2nd case is handled as follows:
.) open_binary_frm parses .frm file, reads expressions strings and attaches
  them to appropriate fields.
.) open_table_from_share calls new function unpack_gcol_info_from_frm which
  constructs a special command:
    PARSE_GCOL_EXPR <generation expression>
  then it sets LEX::parse_gcol_expr to true and calls SQL parser. The flag
  is used to block users from using this command. When it's not set the
  syntax error is thrown for this command.
  On success parser allocates and fills LEX::gcol_info and it's attached to
  the field being processed.
.) New function fix_fields_gcol_func is called by unpack_gcol_info_from_frm.
  It resolves the expression against the table (through fix_fields) and
  does limits check. In particular, check for forbidden functions is
  performed by the new check_gcol_func_processor(..) processor. It returns
  true for functions that aren't allowed to be a part of a GC's expression.
  For Item_field this processor also treats argument as field index of the
  GC being checked and if Item_field field's index equals to it or greater
  than (which means that the GC refers to itself or forward) then it sets
  *arg to -1 to indicate wrong reference and returns true.

Note that for CREATE/ALTER TABLE expressions are parsed twice:
1st time when SQL command being parsed, 2nd time when table in question
being opened after creation/altering.

Storing/Reading to/from .frm
============================
.frm is read by open_binary_frm and written by pack_header/pack_fields. Thus
those functions are modified to read/write additional info for GCs.
Storage format for a GC is as follows:
Byte    Description
1       always 1 to allow further extensions
2,3     expression's length
4       flags,as of now:
        0 - no flags
        1 - field is stored
5...    expression string

Beside that, for a GC an additional unireg flag (a new Field::GENERATED_FIELD
= 128) is set to distinguish GC from regular fields.

Altering
========
The alteration of a GC doesn't differ much from a regular column, but with
few restrictions. A GC can't change it's 'stored' status, i.e a virtual
GC can't become a stored one and vice versa. But this allows a
stored GC to become a regular field, opposite is possible too. The check
itself is done by mysql_prepare_alter_table().
As all GCs are evaluated on the server layer InnoDB's in-place altering isn't
supported. To block in-place update a new flag called
Alter_inplace_info::HA_ALTER_STORED_GCOL is used. It's set by 
fill_alter_inplace_info and checked by
handler::check_if_supported_inplace_alter. It's accompanied by another new
flag Alter_info::ALTER_STORED_GCOLUMN which blocks in-place ALTER requests
from a user.

ALTER_STORED_GCOLUMN

Storing/reading data
====================
The main difference between virtual and stored GCs on the server level
is how their value is obtained and saved.
Value of each virtual GC is calculated "on the fly" for each record read. To
do that, each record reading function (e.g. join_read_first or join_read_next)
calls new function update_virtual_fields_marked_for_write. In the case of
INSERT/UPDATE this function is called by fill_record and
fill_record_n_invoke_before_triggers. The latter calls this function 2nd
time, after triggers, to take into account changes to regular fields made by
triggers. Note that all changes made to GCs by triggers are lost because
their values are recalculated.

update_virtual_fields_marked_for_write goes through
all generated columns and if it's in the write_set of the table and it's
stored and shouldn't be ignored then it updates field's value with the value
of the expression. When record is read all stored GCs are ignored, and
only virtual GCs are updated. When record is written, all GCs are marked for
write, so all of them are updated.
Such approach allows GCs to refer other GCs that are defined prior them.
An example:

  CREATE TABLE t(f1 int, f2 INT GENERATED ALWAYS AS (f1) VIRTUAL,
                 f3 INT GENERATED ALWAYS AS (f2) STORED)

When a value is inserted into the table, all GCs are always processed from
left to right in the order of definition. So when the expression of f3 is
evaluated, value of f2 is already stored in the field (even if it's not
being saved to the SE), so f3 will be evaluated correctly.
Same when data is being read:

  CREATE TABLE t(f1 int, f2 INT GENERATED ALWAYS  AS (f1) STORED,
                 f3 INT GENERATED ALWAYS AS (f2) VIRTUAL)

update_virtual_fields_marked_for_write is called after the record has been
read, thus values of all stored GCs are already correct. Evaluation from
left to right allows to get correct values for virtual GCs from both
stored and virtual GCs defined prior it.

Partition pruning
=================
As GCs values are obtained much later than PP happens we can't allow it when
GCs are involved. Current implementation blocks PP completely if there is
any GC, but it should just exclude GCs from PP process and return only when
nothing but GC is specified. 

Allowed functions
=================

All deterministic built-in functions are allowed. SP and UDFs aren't
allowed because currently we can't trust their 'deterministic' property.
Beside non-deterministic functions, some other functions are blocked, e.g.
all XML things. This limitation could be lifted in a separate WL. This is
the reason why we can't say that any deterministic function is allowed.
Full list of functions TBD.
(Check for UDFs should be added to the code)