WL#4052: Preserve character set information of stored objects

Affects: Server-5.1   —   Status: Complete

There were a few problems in the server, that caused the following bugs:
  - BUG#11986: Stored routines and triggers can fail if the code
    has a non-ascii symbol
  - BUG#16291: mysqldump corrupts string-constants with non-ascii-chars
  - BUG#19443: INFORMATION_SCHEMA does not support charsets properly
  - BUG#21249: Character set of SP-var can be ignored
  - BUG#25212: Character set of string constant is ignored (stored routines)
  - BUG#25221: Character set of string constant is ignored (triggers)

A fix for these problems is quite large and changes existing behavior.
So, this WL item has been created.

The user-visible problems in short:

  - INFORMATION_SCHEMA may produce wrong results.

  - Database dump/restore may fail because:
    - mysqldump may produce wrong results;
    - restoration may fail, because mysqldump outputs incomplete data.

  - Stored programs and views may generate different (unexpected) results
    from time to time.

Update as of 04-Feb-2008
------------------------

This WL item was implemented in 5.1. However, there are the following
follow-ups of this task:

  - BUG#30217: INFORMATION_SCHEMA output is not stable with regard to
    mysqldump/restore;

    This bug discovered that it is impossible to separate the original
    query and query for INFORMATION_SCHEMA. Anyway, INFORMATION_SCHEMA
    query should be built on the actual view definition query.

  - BUG#32538: View definition picks up character set, but not collation

    This bug discovered another issue with normalized SELECT-query,
    used for view definition. It's expected that we'll be suffering from
    bugs similar to that until we use normalized SELECT-query as a view
    definition.

  - WL#4255: Store original view definition query

    This WL item was created to identify the need to use original query
    as a view definition and summarizes all the info about that topic.
Invalid INFORMATION_SCHEMA data
-------------------------------

According to The SQL Standard, all the data in INFORMATION_SCHEMA must be
in UTF8. The original query however can contain literals in several
encodings by means of character set introducers.

At the moment the problem happens when a query of an object with such a
multi-encoding definition is selected from INFORMATION_SCHEMA --
INFORMATION_SCHEMA output contains the query up to the first character set
introducer.

Example:

  - original query:

    CREATE VIEW v1 AS
      SELECT
        'Text in latin1' AS c1,
        _koi8r 'Text in koi8r' AS c2;

  - select from INFORMATION_SCHEMA will return:
  
    > SELECT view_definition FROM INFORMATION_SCHEMA.VIEWS;
    +-------------------------------------------------------------------------+
    | view_definition                                                         |
    +-------------------------------------------------------------------------+
    | /* ALGORITHM=UNDEFINED */ select _koi8r'Text in latin1' AS `c1`,_koi8r' |
    +-------------------------------------------------------------------------+
    1 row in set, 1 warning (0.01 sec)

    > SHOW WARNINGS;
    +---------+------+--------------------------------------------------------+
    | Level   | Code | Message                                                |
    +---------+------+--------------------------------------------------------+
    | Warning | 1366 | Incorrect string value: '\xC2\xCC\xC1-\xC2\xCC...' for
                       column 'VIEW_DEFINITION' at row 1 |
    +---------+------+--------------------------------------------------------+
    1 row in set (0.00 sec)


The cause of this is the following: we try to convert the definition query
into UTF8 as an ordinary string and this is not the case. In order to
convert the query properly, we must have parsed it and become aware of all
literals in character sets different from the original one.

There are a few possible ways to fix this problem:

  - Translate the query so that it contains literals only in UTF8. That
    could be done by replacing non-UTF8 literals with converted
    UTF8-literals and a CONVERT ... USING function.

    For example:

      Original query:

        CREATE VIEW v1 AS
          SELECT
            'Text in latin1' AS c1,
            _koi8r 'Text in koi8r' AS c2;

      Translated query:

        CREATE VIEW v1 AS
          SELECT
            CONVERT('Converted latin1-text in utf8' USING latin1) AS c1,
            CONVERT('Converted koi8r-test in utf8' USING koi8r) AS c2;

    We decided not to go in this direction. However, in the future,
    we should probably try again to implement this functionality.

  - Introduce a smart converter, that would parse the query and convert
    literals prepended by a character set introducer properly.

    For example:

      Original query:

        CREATE VIEW v1 AS
          SELECT
            'Text in latin1' AS c1,
            _koi8r 'Text in koi8r' AS c2;

      Converted query for INFORMATION_SCHEMA:

        CREATE VIEW v1 AS
          SELECT
            'Converted latin1-text in utf8' AS c1,
            'Converted koi8r-text in utf8' AS c2;
    
    There are two possibilities when doing that: on the fly (on request)
    or in advance. It's proposed to do it in advance, because:

      - we can combine this conversion with the existing parser stage
        (i.e. we can prepare the converted query on parsing), so we will
        not parse the query twice;

      - the time to generate results for an INFORMATION_SCHEMA query is
        more or less constant with regard to the number of objects queried
        (because we do such a slow operation like parsing for each
        request).

    There is however a problem with this approach: the converted query is
    not equal to the original one. I.e. the object created by the converted
    query would differ from the object created by the original query.
    
    So, we will not be able (even theoretically, because as shown above
    it does not work now either) to use INFORMATION_SCHEMA for retrieving
    definition queries for applications like mysqldump any more.

    For example:

      Let's create an object by means of the original query:

        SET NAMES latin1; -- set client character set

        CREATE VIEW v1 AS
          SELECT
            'Text in latin1' AS c1,
            _koi8r 'Text in koi8r' AS c2;

      Let's see what we have in v1:

        SELECT CHARSET(c1), CHARSET(c2) FROM v1;
        +-------------+-------------+
        | charset(c1) | charset(c2) |
        +-------------+-------------+
        | latin1      | koi8r       |
        +-------------+-------------+
        1 row in set (0.00 sec)

      Now, create an object by means of the converted query, retrieved from
      INFORMATION_SCHEMA:

        SET NAMES utf8; -- set client character set

        CREATE VIEW v1 AS
          SELECT
            'Converted latin1-text in utf8' AS c1,
            'Converted koi8r-text in utf8' AS c2;

      And see what we have:

        SELECT CHARSET(c1), CHARSET(c2) FROM v1;
        +-------------+-------------+
        | charset(c1) | charset(c2) |
        +-------------+-------------+
        | utf8        | utf8        |
        +-------------+-------------+
        1 row in set (0.00 sec)

      Note that we have UTF8 even for column c2, which was originally
      explicitly specified as koi8r.

      Theoretically, we could inline built-in functions (like CHARSET()
      and COLLATION()) because they are deterministic, so always return
      the same thing for a constant argument. However there is another
      example, showing that functions can't always be inlined:

        CREATE FUNCTION f1(arg VARCHAR(255)) RETURNS INT
        BEGIN
          IF 'blablabla' = arg THEN # <-- This is the case.
             RETURN 1;
          ELSE
            RETURN 0;
          END IF;
        END

      The problem is with literal comparison in IF statement.

      Generally problems are with: (1) built-in string comparison
      functions that don't use literals (2) user-defined string functions.

    Luckily, we have the SHOW CREATE statement family, which can be used to
    return the original (not converted) query to the client. That would
    perfectly suit mysqldump's requirements.

The proposed solution is to

  - Create a copy of the original query, translate it to UTF8 and store
    it along with the object definition in memory for future use in
    INFORMATION_SCHEMA.

  - Update SHOW CREATE statements, so that they return the original
    (non-modified) query to be used in applications like mysqldump.

Dump/restore problem
--------------------

There are actually three causes of this one general dump/restore problem:

  - For some object types, mysqldump uses INFORMATION_SCHEMA to retrieve
    the definition query. As was shown in the previous section, this
    does not work now, and this will not work if the proposed changes will be
    made.

    mysqldump uses INFORMATION_SCHEMA because there is no SHOW CREATE
    statement for that object type.

  - SHOW CREATE statements have bugs in the server implementation, which
    prevent them from returning the proper query under some circumstances.

  - mysqldump does not generate statements to preserve the environment
    attributes that affect execution of the definition query. That might
    lead to incorrect restoration if the environment has been changed.

    Example:

      Let's suppose, we've created a stored program using the following
      sequence of SQL statements:
      
        SET NAMES latin1;

        CREATE PROCEDURE p1()
            SELECT COLLATION('text');

      If we further execute this function (in the same connection), we'll
      get the following result:

        CALL p1();
        +-------------------+
        | COLLATION('text') |
        +-------------------+
        | latin1_swedish_ci |
        +-------------------+
        1 row in set (0.00 sec)

      If we dump the current database, we'll get the following output:

        DELIMITER ;;
        /*!50003 CREATE*/ /*!50020 DEFINER=`root`@`localhost`*/
          /*!50003 PROCEDURE `p1`() SELECT COLLATION('text') */;;
        DELIMITER ;

      The problem is that this dump does not contain any reference to the
      connection collation that was used at the time the stored procedure
      was created (latin1). So, if the dump will be loaded with another
      connection collation, the stored procedure will be created with
      different behavior.

      Let's suppose, we've loaded this dump into the server with "koi8r"
      default collation. So, we'll have:

        CALL p1();
        +-------------------+
        | COLLATION('text') |
        +-------------------+
        | koi8r_general_ci  |
        +-------------------+
        1 row in set (0.00 sec)

The proposed solution is to:

  - Introduce SHOW CREATE TRIGGER statement (see details below);

  - Dump statements that would switch environment attributes to the saved
    values before dumping the object definition query.

    Example:
      SET @saved_character_set_client = @@character_set_client;
      CREATE VIEW ...
      SET @@character_set_client = @saved_character_set_client;

  - No conversion for output of SHOW CREATE statements.

    Since the object definition query can be in multiple encodings,
    mysqldump should prevent conversion of the query to its
    character_set_results. In order to do that, we should temporarily set
    @@character_set_results in mysqldump's connection to "binary".

mysqldump compatibility modes
-----------------------------

Everything that was said above is applied when the mysqldump is used in
MySQL-native mode (i.e. mysqldump produces the SQL-file, which will be
further used by MySQL). However, mysqldump can be used to generate the
output for other RDBMSs, like Postgres, MS SQL Server or Oracle. In this
case, mysqldump should produce output that is more compatible with
standard SQL. That in particular means the following:
  - No MySQL-specific clauses in CREATE statements;
  - There should be no SET statements, that set MySQL-specific variables;
  - The output SQL-file must be in one character set. That means that the
    @@character_set_results session variable should not be set to binary to
    preserve conversion to the client's character set. That certainly means
    that the query can become unusable (as was described earlier), but if
    character sets are compatible, the user still has a way to get a proper
    SQL-file.

SHOW CREATE TRIGGER
-------------------

Syntax:

  SHOW CREATE TRIGGER 

SHOW CREATE TRIGGER output should have the following columns:

  - "Trigger"

    This is the trigger name.

  - "sql_mode"

    This is the value of sql_mode session variable at the creation
    time.

  - "SQL Original Statement"

    This is the trigger definition query in the original character set
    without any modification. Note that this statement, in a general
    case, can be parsed only after changing the current sql_mode to the
    sql_mode value from the previous column. Some sql modes influence
    lexing rules (e.g. PIPES_AS_CONCAT and ANSI_QUOTES).

    NOTE: the value of this column is subject to change
    (see WL#3995: SHOW CREATE TRIGGER should output a statement with
    version-specific comments).

  - "character_set_client"

    This is the value of character_set_client session variable at
    creation time.

  - "collation_connection"

    This is the value of the collation_connection session variable at
    creation time.

  - "Database Collation"

    This is the trigger's database collation at creation time.

  - Change existing SHOW CREATE statements to return the original query.

    So, SHOW CREATE VIEW | PROCEDURE | FUNCTION | EVENT statements should
    output the query in the original character set.

Object creation context
-----------------------

Objects (views, stored programs) are executed as follows:

  - Views: loaded and created from FRM-file each time view is used.

  - Stored routines (functions and procedures):

    - loaded from mysql.proc table into the connection cache on the first
      use;

    - executed from the connection cache;

  - Triggers:

    - loaded into the server-wide cache when the base table is first used;

  - Events:

    - loaded from mysql.event table each time on execution.

In order to compile a query into execution code, extra data (such as
environment variables or the current database character set) are used. This
data will be further referred to in this document as creation context. The 5.0
server did not preserve creation context but instead used the current
environment. The problem was that this creation context was not preserved.

So, if the creation context had been changed since the object was first
created, the object was different on the next load. Thus, its
behavior/result was different.

Generally, creation context is specific to each object type. For example,
for events, time zone should be preserved; for almost all object types,
sql_mode should be preserved. However, the scope of this WL item is only to
introduce the notion of the creation context, which consists of character
set information only. Moving other attributes (sql_mode, time zone, ...) to
the creation context is the task of another WL item.

View creation context consists of:
  - client character set;
  - connection collation (character set and collation);

Stored program creation context consists of:
  - collation of the owner database;
  - client character set;
  - connection collation (character set and collation);

State of affairs as of 5.1.23-BK
--------------------------------

The main problems reported in the following bugs were already fixed in 5.1:
  - BUG#11986: Stored routines and triggers can fail if the code
    has a non-ascii symbol
  - BUG#16291: mysqldump corrupts string-constants with non-ascii-chars
  - BUG#19443: INFORMATION_SCHEMA does not support charsets properly
  - BUG#21249: Character set of SP-var can be ignored
  - BUG#25212: Character set of string constant is ignored (stored routines)
  - BUG#25221: Character set of string constant is ignored (triggers)

However, a number of issues appeared after that, so this WL item was created
to reflect the whole picture.

New issues
----------

These issues still have to be fixed:

  - After the patch for BUG#10491 it was discovered that mysqldump has
    compatibility modes, and though they are probably already broken, we
    should not break them more.

    I.e.:

      If mysqldump is run with --default-character-set=latin1, SHOW CREATE
      TABLE will return data in UTF8, and that data will be converted to
      latin1.
      
      If data are convertible (identifiers in latin1), everything will work
      and we will have a dump in a single character set (latin1).

      The problem is that at the moment there is no way to have the dump in
      a single character set, even if only a subset of UTF8 is used (latin1,
      for one). In any case, dump will have UTF8 symbols.
    
    If/when this WL is approved, a new bug about broken compatibility modes
    shall be reported.

  - BUG#30217: Views: changes in metadata behaviour between 5.0 and 5.1

    The original bug description actually does not describe the problem,
    which is the following:

      - It was wrongly decided, that UTF8-query is not needed for views,
        because we actually parse a view into the internal tree, and then
        generate the definition query from this tree.
        
      - That auto-generated query is correct, but still it can not be used
        in the INFORMATION_SCHEMA, because it may contain character set
        introducers, thus it may be not solely in UTF8, which is forbidden
        for INFORMATION_SCHEMA.

      - Although that auto-generated query is perfectly correct, it still
        can not be used for SHOW CREATE VIEW (the original query must be
        used), because after dump/restore, we will have different "original
        query", so UTF8-query will be based on a different query, so
        INFORMATION_SCHEMA results will be different.

    So, the solution for these problems (this bug) seems to be the
    following:

      - Get rid of query generation;

      - Use the original definition query for SHOW CREATE VIEW (as for all
        other SHOW CREATE statements);

      - Generate UTF8-query based on the original query;

      - Use that UTF8-query in INFORMATION_SCHEMA.

  - It was discovered that we have problems not only with stored programs
    and views, but also with tables and most probably other objects, that
    can contain literals in their definitions.

    For example:

      > SET NAMES koi8r;
      > CREATE TABLE t1 (c VARBINARY(10) DEFAULT 'text in koi8r');
      > SHOW CREATE TABLE t1;
      +-------+----------------------------------------------+
      | Table | Create Table                                 |
      +-------+----------------------------------------------+
      | t1    | CREATE TABLE `t1` (
        `c` varbinary(10) DEFAULT '?????????'
      ) ENGINE=MyISAM DEFAULT CHARSET=latin1 | 
      +-------+----------------------------------------------+
      1 row in set (0.00 sec)

      > SET NAMES utf8;
      > SHOW CREATE TABLE t1;
      +-------+----------------------------------------------+
      | Table | Create Table                                 |
      +-------+----------------------------------------------+
      | t1    | CREATE TABLE `t1` (
        `c` varbinary(10) DEFAULT '���������'
      ) ENGINE=MyISAM DEFAULT CHARSET=latin1 |
      +-------+----------------------------------------------+
      1 row in set (0.00 sec)

      > SELECT column_default
        FROM information_schema.columns
        WHERE table_name = 't1';
      +----------------+
      | column_default |
      +----------------+
      |                | 
      +----------------+
      1 row in set, 1 warning (0.00 sec)
      > SHOW WARNINGS;
      +---------+------+--------------------------------------+
      | Level   | Code | Message                              |
      +---------+------+--------------------------------------+
      | Warning | 1366 | Incorrect string value:
                         '\xC2\xCC\xC1\xC2\xCC\xC1...' for
                         column 'COLUMN_DEFAULT' at row 1     |
      +---------+------+--------------------------------------+
      1 row in set (0.00 sec)

    At the moment, only TABLE seems to have this kind of problems.

  - Another issue with views has been discovered.

    For example:

      > SET NAMES koi8r;
      > CREATE VIEW v1 AS
          SELECT _binary 'text-1 in koi8r',
                 _binary 'text-2 in koi8r',
                 _binary 'text-3 in koi8r';
      > SHOW CREATE VIEW v1G
      *************************** 1. row ***************************
                      View: v1
               Create View: CREATE ALGORITHM=UNDEFINED
                            DEFINER=`root`@`localhost`
                            SQL SECURITY DEFINER VIEW `v1` AS
                            SELECT _binary'text-1 in koi8r' AS ``,
                                   _binary'text-2 in koi8r' AS `My_exp_`,
                                   _binary'text-3 in koi8r' AS `My_exp_1_`
      character_set_client: koi8r
      collation_connection: koi8r_general_ci
      1 row in set (0.00 sec)

    Note, that the view columns have auto-generated names, the first column
    has an empty name. This is at least not-documented behavior, which should
    be either fixed or documented.
Common changes
--------------

  - Introduce Creation_ctx interface.

  - Add UTF8-query support to Lex_input_stream class.

  - Generate UTF8-query on parsing stage if it is needed.

Stored functions and stored procedures
--------------------------------------

  - Add character_set_client, collation_connection, db_collation columns
    as CHAR(32) COLLATE utf8_bin to mysql.proc table to store
    values of @@character_set_client, @@collation_connection and the
    database collation respectively.


  - Add body_utf8 column as LONGBLOB to mysql.proc table to store
    UTF8-query.

    Although, the UTF8-query is created each time a stored routine is being
    loaded into memory, we do not always load stored routines before use
    them. For one, when we create the output for INFORMATION_SCHEMA.ROUTINES
    view, we simply copy appropriate columns from the mysql.proc table.

  - Store the creation context in these fields when a stored routine is
    being created.

  - Switch client, connection and database character sets each time the
    object definition query is parsed (i.e. stored routine is compiled).

  - Switch client, connection and database character sets each time the
    routine is executed.

  - In order to provide backward compatibility, assume that the object
    definition query is in UTF8 if there is no creation context for a
    stored routine. Also, throw a warning in this case.

  - Add CHARACTER_SET_CLIENT, COLLATION_CONNECTION, DATABASE_COLLATION
    columns in INFORMATION_SCHEMA.ROUTINES pseudo-table to show client,
    connection and database character sets respectively.

  - Add "character_set_client", "collation_connection", "Database
    Collation" columns to SHOW PROCEDURE | FUNCTION STATUS statement to
    show client, connection and database character sets respectively.

  - Add "character_set_client", "collation_connection", "Database
    Collation" columns to SHOW CREATE PROCEDURE | FUNCTION statement to
    show client, connection and database character sets respectively.

  - Change character set of the "Create"-column (the object definition
    query) in SHOW CREATE PROCEDURE | FUNCTION statement to
    client-character-set at the creation time.

Events
------

  - Add character_set_client, collation_connection, db_collation columns
    as CHAR(32) COLLATE utf8_bin to mysql.event table to store
    values of @@character_set_client, @@collation_connection and the
    database collation respectively.

  - Add body_utf8 column as LONGBLOB to mysql.event table to store
    UTF8-query.

    Although, the UTF8-query is being created each time an event is loaded
    into memory, we do not always load events before use them. For one,
    when we create the output for INFORMATION_SCHEMA.EVENTS view, we simply
    copy appropriate columns from the mysql.event table.

  - Store the creation context in these fields when an event is being
    created.

  - Switch client, connection and database character sets each time the
    object definition query is parsed (i.e. event is compiled/executed).

  - In order to provide backward compatibility, assume that the object
    definition query is in UTF8 if there is no creation context for an
    event. Also, throw a warning in this case.

  - Add CHARACTER_SET_CLIENT, COLLATION_CONNECTION, DATABASE_COLLATION
    columns in INFORMATION_SCHEMA.ROUTINES pseudo-table to show client,
    connection and database character sets respectively.

  - Add "character_set_client", "collation_connection", "Database
    Collation" columns to SHOW CREATE EVENT statement to show client,
    connection and database character sets respectively.

  - Change character set of the "Create"-column (the object definition
    query) in SHOW CREATE EVENT statement to client-character-set at the
    creation time.

Triggers
--------

  - Store values of @@character_set_client and @@collation_connection, and
    database character set in TRG-file.

  - Switch client, connection and database character sets each time
    triggers for a table are parsed (loaded and compiled).

  - In order to provide backward compatibility, assume that the query is in
    UTF8 if there is no query definition context for a trigger. Also, throw
    a warning in this case.

  - Add CHARACTER_SET_CLIENT, COLLATION_CONNECTION, DATABASE_COLLATION
    columns in INFORMATION_SCHEMA.TRIGGERS pseudo-table to show client,
    connection and database character sets respectively.

  - Introduce SHOW CREATE TRIGGER statement.

Views
-----

  - Store values of @@character_set_client and @@collation_connection in
    FRM-file.

  - Store view-definition-query in the original character set in FRM-file.

  - Switch @@character_set_client and @@collation_connection each time
    the view definition query is parsed.

  - In order to provide backward compatibility, assume that the view
    definition query is in UTF8 if there is no context for a view. Also,
    throw a warning in this case.

  - Add CHARACTER_SET_CLIENT and COLLATION_CONNECTION columns in
    INFORMATION_SCHEMA.VIEWS pseudo-table to show client, connection and
    database character sets respectively.

  - Add "character_set_client", "collation_connection" columns in SHOW
    CREATE VIEW statement to show client and connection character sets
    respectively.

  - Change character set of "Create view" column in SHOW CREATE VIEW
    statement to client-character-set at the creation time.

mysqldump
---------

  - Use SHOW CREATE TRIGGER instead of SHOW TRIGGERS.

  - Dump of the definition query for each object type should be done as
    follows:

    - Dump statements to save current @@character_set_client,
      @@character_set_results and @@collation_connection;
   - Dump statements to set @@character_set_client,
      @@character_set_results and @@collation_connection to proper values
      (from the creation context);

    - Dump "ALTER DATABASE  COLLATE "
      statement if the database collation has been changed.

    - Dump the object definition query.

    - Dump "ALTER DATABASE  COLLATION "
      statement if the database collation has been changed.

    - Dump statements to restore @@character_set_client,
      @@character_set_results and @@collation_connection to their current
      state.

  - Temporarily switch @@character_set_results of mysqldump's connection to
    "binary" before invoking "SHOW CREATE" statements.