WL#13355: Make CREATE TABLE...SELECT atomic and crash-safe

Affects: Server-8.0   —   Status: Complete

EXECUTIVE SUMMARY
=================

  This WL would implement the DDL 'CREATE TABLE ... AS SELECT'
  (lets call it as DDL_CTAS) as an atomic operation in MySQL in
  mainly replication side, unlike today where it involves two
  transaction for single DDL at applier thread.

  This WL would also enable DDL_CTAS execution in group
  replication as atomic operation.


USER/DEV STORIES
================

Primary user stories of interest are:

 1) The DDL_CTAS operation is an atomic operation at SQL server.
    So, when user executes DDL_CTAS and if server crashes while
    inserting rows into the table, then user would not see a
    empty table being created without the intended data in it.

    However, with binlog_format = ROW, the binlog written by above
    execution leads to use of two transactions.  One for CREATE TABLE
    and another for INSERT.  This WL would make use of single
    transaction when writing binlog events for DDL_CTAS.

 2) A replica mysqld instance (apply thread) executes DDL_CTAS and
    if server crashes while inserting rows into the table, replica
    would be left with a empty table being created without the
    intended data in it.

    After this WL, replica thread should execute the DDL_CTAS as
    single transaction. When DDL_CTAS is replicated and
    binlog_format = ROW, the replica should see a CREATE TABLE
    statement followed by row events as a single binlog
    transaction.

 3) A DDL_CTAS operation is written as two transactions in binlog
    when binlog_format = ROW.  After replica applies the first binlog
    transaction which creates a table, the metadata lock on the
    table is released, before replica applies row events that inserts
    data into it. This does not block concurrent access to the table
    while replica thread is inserting rows in the tables.

    After this WL, concurrent access to table should be block,
    while replica is filling rows into the table.

 4) When GTID's are enabled, the DDL_CTAS operation is
    currently restricted, mainly as DDL_CTAS is not an atomic
    operation.

    After this WL, DDL_CTAS should be executed as a single
    transaction, to lift this restriction.

 5) When user replays the binlog events of DDL_CTAS operation,
    using mysqlbinlog tool, currently there are two events that
    gets applied as two separate transactions if binlog_format = ROW.
    1) a query log event that creates the table. 2) row events that
    populate the table.  If the server crashes after 1), the table
    is left empty.

    After this WL, DDL_CTAS should be executed as a single
    transaction when relaying the events.


SCOPE
=====

  The scope of this WL is to implement the DDL_CTAS as atomic
  operation and there by solving the problems listed in USER/DEV
  STORIES section above.


LIMITATIONS
===========

  This WL would be implemented with a additional restriction on
  the statement that you cannot add foreign keys. Foreign keys
  can be added in later statements.

  The scope of this WL is limited only to storage engines which
  support atomic-DDL and not other storage engines.
FR1)
DDL_CTAS would be atomic operation for supporting SE's that
support atomic DDL's from table existence, its contents and
binlog contents view point.

FR2)
For SE's supporting atomic DDL and FKs, DDL_CTAS command with
request to create foreign key would be disallowed. Users should
be able to add required foreign keys using ALTER TABLE after
DDL_CTAS command completion. See B.1.c1.
     
FR3)
FR1 would be applicable when applying the DDL_CTAS operation on
replica.  This also means that the concurrent connections on replica
would not be able to see empty table, before the DDL_CTAS
operation is committed, similar to the behavior seen on source.
This involves implementation of changes described in HLS section
B.2.c#.

FR4)
DDL_CTAS command execution would be allowed with GTID enabled.
And it would exhibit behavior described in FR1. This involves
implementing changes B.3.c#.

FR5)
FR1 would be applicable when replaying replay binlog on server
using mysqlbinlog tool. This behavior is achieved by implementing
above FR's.

FR6)
For storage engines which do not support atomic DDL's, the
binlogging will work the same way and there will be no change.
And replication of such statements with GTID mode enabled would
be still prohibited.
ABBREVIATION
============
  DDL_CTAS -> CREATE TABLE ... AS SELECT


A. INTRODUCTION
===============

The primary aim of the worklog is to make DDL_CTAS execution as a
atomic operation on replication side. In other words this
involves making DDL_CTAS to fully rollback on failure or crash in
both cases when execution happens on source and replica.

The USER/DEV stories section listed in HLD highlights the
scenario of interest.

  1 ROW BASED BINARY LOGS CAPTURE TWO TRANSACTIONS FOR DDL_CTAS.

    MySQL server executes DDL_CTAS as single transaction, but the
    binary log is written as two separate transaction.  One for
    the DDL part and another with the DML part.  The DDL part is
    logged as non-transaction Query log event. And the DML part
    of is written as transactional Row log event.

  2 BINARY LOG APPLIER EXTERNALIZES TABLE BEFORE INSERTING THE ROWS.

    As a consequence of #1, the binary log applier externalizes
    the table before actually inserting the rows part of the
    DDL_CTAS. This also means that if there is a crash in-between
    the two transactions an empty table is externalized after
    recovery. I.e., the whole thing is not atomic.

  3 BINARY LOG APPLIER APPLIES TWO TRANSACTIONS INSTEAD OF ONE
    FOR DDL_CTAS.

    As a consequence of #1 the applier applies two different
    transactions, which means again that the whole thing is not
    atomic and can lead to replication stream mismatches.

  4 DUE TO #1, #2, #3 WITH GTIDS DDL_CTAS IS BLOCKED

    When GTID's are enabled, the DDL_CTAS fails with an error
    ER_GTID_UNSAFE_CREATE_SELECT. This is because a single
    DDL_CTAS may be split by the secondary applier. This can lead
    the secondary to effectively report having more transactions
    than those on the primary (e.g., primary logs in SBR and
    secondary logs in RBR). Thus it leads to inconsistent GTID
    sets between the two servers for the same input workload.


 This WL aim's to lift above limitations which would enable
 execution of DDL_CTAS operation to be atomic on server,
 replication replica, group replication node and even during
 binlog replay.


B. DESIGN OF ATOMIC DDL_CTAS
============================

 Following are the items which would enable recovery of
 DDL_CTAS operation in replication side:


 1. Atomic DDL_CTAS execution by SQL Server:

  When binlogging is off, the current MySQL server invokes single
  implicit commit after CREATE TABLE and INSERT operation during
  DDL_CTAS execution. There will be no change to this. This
  applies to SE's that support atomic DDLs.

  When binlogging is on, the changes required are discussed in
  section B.2.

  Following change will be made during execution of DDL_CTAS at
  server.

  c1. Disallow creation of foreign keys along with DDL_CTAS.
      Users should be able to add required foreign keys using ALTER
      TABLE later. This is so that the row's inserted into the table
      remains consistent one source and replica, avoiding constraint
      check failures which can lead to inconsistency.

      More specifically the foreign keys in DDL_CTAS will be
      disallowed if,
        - SE supports foreign keys.
        - SE supports atomic DDL.
        - The binlogging is enabled.
        - The binlog format is 'row'.

      Note:
      The main reason for this is complications related to locking,
      updating and invalidation (in case of rollback) of DD cache
      for parent table.
      We can think of disallowing foreign keys in DDL_CTAS when
      binlog format is 'stmt' for more uniformity. However doing
      this may break old compatibility and so we keep it as is.
      Alternatively, we can think of adding code to remove this
      limitation for binlog_format = ROW mode.

 2. Binlog handling for atomic DDL_CTAS:

  Binlog creation in row mode and apply operation needs following
  changes:

  c1. Enable the Query_log_event created for CREATE TABLE part of
      DDL_CTAS to be written to transactional binlog cache and not to
      statement binlog cache. This enable DDL_CTAS operation to be
      logged in a single BEGIN and COMMIT block in the binlog.

      Query_log_event(CREATE TABLE ... START TRANSACTION)
      ... row events (INSERT) ...
      ... row events (INSERT) ...
      ... row events (INSERT) ...
      Query_log_event(COMMIT)

      See LLD N1) for a sample binlog event.

      The new clause 'START TRANSACTION' for CREATE TABLE is written
      in the binlog. 
       ...
          CREATE TABLE [IF NOT EXISTS] tbl_name (create_definition)
            [table_options] [partition_options]

          In this terminology, we propose to add a new
          table_option, like: table_option:
              START TRANSACTION
            | [...]

          Example:
          CREATE TABLE t (a INT) START TRANSACTION;
        ...

      The behavior of the new clause is discussed below in c3 and c4.

   NOTES ON A ALTERNATIVE:
    We have also considered following alternative, to use a
    SET @@ddl_commit variable e.g.,

      SET @@ddl_commit=0;
      BEGIN;
      CREATE TABLE t (a INT); // Won't do commit with @@ddl_commit=0;
      BINLOG ...;
      ...
      BINLOG ...;
      COMMIT;
      SET @@ddl_commit=1;

      +) This looks like a logical extension to existing syntax (in fact
         no new syntax is introduced). In future we might support
         transactional DDL and allow other statements than CREATE TABLE
         inside of BEGIN/COMMIT.
      +) It is possible to execute on earlier versions by surrounding
         SET statements in versioned comments and preferrably turning
         autocommit off.
      -) It might confuse users and made them think that we support
         transactional DDL. So extra care in documentation and possibly
         emitting errors for other DDL will be necessary?
      -) @@ddl_commit is yet another variable which needs to be binlogged.


    As of now, we do not prefer this alternative because,
    Transactional DDL is not in any medium term plans. This
    later might happen rather far away in future.

  c2. Replaying DDL_CTAS binlog event using mysqlbinlog tool.
      This should work same as discussed in below in section c3 & c4.

  c3 CREATE TABLE behavior changes in the following way when it used
     with START TRANSACTION clause (be it executed by SQL thread as
     result of processing Query_log_event(CREATE TABLE ...  START
     TRANSACTION) or in normal connection which executes output from
     mysqlbinlog tool).

      1 Do not invoke commit at the end of CREATE TABLE.

      2 Do not release MDL lock on the table at the end of CREATE
        TABLE.

      3 Applying row events (for INSERT) should be able to
        execute on already locked table name.

      4 Release the table lock during COMMIT operation.


      Note:
      - The behavior that is observed for concurrent operations with a
        regular CREATE TABLE is also observed for DDL_CTAS

      - Like other atomic DDLs, the effects of a DDL_CTAS will only be
        visible (table and table contents) once the DDL transaction
        commits.

  c4. To support processing of mysqlbinlog output we need to
      allow use of START TRANSACTION by general user.

      Note that:  
      - We will allow only applying BINLOG INSERT statement, COMMIT
        and ROLLBACK statements after CREATE TABLE ... START
        TRANSACTION command.

      - START TRANSACTION clause won't require an additional privilege.

      - Use of this syntax on table using engine that does not
        support atomic DDL will result in a error.

  c5. Binlogging CREATE TABLE ... START TRANSACTION.

      When user executes this statement on a normal or
      on a replica connection,

      - A old transaction is committed and a new is started.

      - If binlog is disabled, transaction is not committed and
        user is allowed to execute BINLOG INSERT, COMMIT or
        ROLLBACK command.

      - If binlog is enabled, a Query_log_event will be written
        with user provided query string in it. The transaction is
        left uncommitted and user is allowed to execute BINLOG
        INSERT, COMMIT or ROLLBACK command.

      - If binlog format is 'row' or 'mixed', then the events
        generated are similar to as described in B.1.c2.

      - If binlog format is 'statement' then BINLOG statements
        will fail. So applying CTAS + BINLOG + COMMIT combo
        representing ANY CTAS is impossible. However, we will
        allow CREATE TABLE ... START TRANSACTION, immediately
        followed by COMMIT to be able apply transactions
        representing "CREATE TABLE ... SELECT LIMIT 0".
        This is good for compatibility and should not complicate
        code.

  c6. For storage engines which do not support atomic DDL's, the
      binlogging will work the same way and there will be no change.


 3. GTID's for DDL_CTAS:

  When GTID's are enabled, the following action items are
  required to make DDL_CTAS an atomic commit.

  c1 Allow DDL_CTAS even when --enforce-gtid-consistency=on.

  c2 A single Gtid_log_event should be logged before the events
     written as mentioned in section B.2.c1). E.g.,

      Gtid_log_event
      Query_log_event(CREATE TABLE ... START TRANSACTION)
      ... row events (INSERT) ...
      ... row events (INSERT) ...
      ... row events (INSERT) ...
      Query_log_event(COMMIT)



C. RECOVERY
===========

  This WL primarily deals with make DDL_CTAS a rollbackable
  operation.

  - From section A. we see that the DDL_CTAS operation is atomic at
  server level, with binlog is off. This enables operation to be
  rolled back on a event of a failure. Test t1 and t2 in section E.
  verifies the same.

  - From section B.c1 and c3, we see that the DDL_CTAS is applied
  at replica as a single transaction. This enables operation to be
  rollbackable on a event of failure in replica applier thread. Test
  t3 and t4 in section E. verifies the same. This same behavior
  is applicable during binlog replay.
  


D. BINLOG FORMAT/PROTOCOL
=========================

  This WL will write query log event (for DDL) followed by row
  events (for INSERT's) within BEGIN and COMMIT block, as shown in
  LLD. There are no changes to binlog format or protocol.

  We need to change the transaction boundary parser to
  accommodate the new combination of events: DDL + ROWS + COMMIT


E. TESTING
==========

  Primary test cases to be executed with binlog row format:

  t1 Induce crash after table creation during DDL_CTAS execution,
     before the inserts into the table.  Recover the server and
     see,

     - that the table is dropped.

     - Check that nothing is binlogged. Atomicity has to go all
       the way, engines and binary log.


  t2 Induce crash after all the insert operation during DDL_CTAS
     execution, before the commit operation. Recover the server and
     see,

     - that the table is dropped.

     - Check that nothing is binlogged. Atomicity has to go all
       the way, engines and binary log.


  t3 Induce crash after replica thread creates table when applying the
     binlog for DDL_CTAS. Recover the replica and see,

     - that the table is dropped.

     - Restart replication. Check that it resumes from the point
       where it should. If the replica has binary log, check that no
       "garbage" ended up in the binary log.


  t4 Induce crash after replica thread executes inserts operation when
     applying binlog for DDL_CTAS, before the commit operation.
     Recover the server and see,

     - that the table is dropped.

     - Restart replication. Check that it resumes from the point
       where it should. If the replica has binary log, check that no
       "garbage" ended up in the binary log.


  t5 Repeat t3 and t4 when group replication applies the DDL_CTAS
     operation. Also check that servers are able to get back to
     the group after recovery. No "garbage" in the binary log
     either.


  t6 Check if attempt to access the table gets blocked, while it
     is being used to insert rows when applying DDL_CTAS binlog by
     replica thread or GR node.

  TODO - Develop more test cases.


F. PERFORMANCE
==============

 - There would be no major performance impact (less than 2%) in
   execution of DDL_CTAS either on the server, replica thread and GR
   node.


G. UPGRADE
==========

 OLD -> refers to MySQL server without this WL implemented.
 New -> refers to MySQL server with this WL implemented.

 1. Replication replica is upgraded first, so binlog from OLD
    source would be still applied by NEW replica.

 2. When GTID_MODE = OFF, we can make NEW->OLD work by logging
    CREATE TABLE ... SELECT as:
      CREATE TABLE ... /*! START TRANSACTION */

    This behavior is not feasible to be implemented, mainly because
    the CREATE TABLE would be executed as a separate transaction, keeping
    row inserts as separate operation. Even if the binlog semantics is
    made compatible by adding versioned comment, the replica wouldn't apply
    it atomically. 

 3. When GTID_MODE = ON, we can never make NEW->OLD work, because
    OLD will commit CREATE TABLE before it applies the row events,
    so one transaction on new source becomes two on old replica, which
    is bad. In practice it will generate an error and stop the replica
    executor thread on the first row event. It is better to generate
    this error already on CREATE, so when GTID_MODE = ON it's better
    to *not* put START TRANSACTION inside version comments and make
    the old replica stop on unknown syntax in the CREATE statement.


H. OBSERVABILITY
================

  It would be helpful to enable users to observe the process of
  DDL_CTAS operation. We can extend the performance_schema to show
  two stages of DDL_CTAS operation 1) Creating table and 2)
  Inserting rows.

  1) DDL_CTAS execution run on source:
     Following is output from performance_schema.events_stages_history_long,

    378 48 ... stage/sql/Opening tables  sql_base.cc:5594        ...
    379 48 ... stage/sql/init            sql_select.cc:171       ...
    380 48 ... stage/sql/creating table  sql_table.cc:8399       ...
    381 48 ... stage/sql/After create    sql_table.cc:8472       ...
    382 48 ... stage/sql/System lock     lock.cc:331     ...
    383 48 ... stage/sql/optimizing      sql_optimizer.cc:282    ...
    384 48 ... stage/sql/statistics      sql_optimizer.cc:502    ...
    385 48 ... stage/sql/preparing       sql_optimizer.cc:583    ...
    386 48 ... stage/sql/executing       sql_union.cc:1410       ...
    387 48 ... stage/sql/System lock     lock.cc:331     ...
    388 48 ... stage/sql/waiting for handler commit    handler.cc:1590 ...
    389 48 ... stage/sql/waiting for handler commit    handler.cc:1590 ...
    390 48 ... stage/sql/end             sql_select.cc:222       ...

    1.1) Line 380 and 381 can be used to observer table being created.

    37x 48 ... stage/sql/creating table as select   sql_select.cc:171  ...

    1.2) Line 386 actually means insertion of data into created table.

    Would it help to modify it as following ?

    386 48 ... stage/sql/inserting rows       sql_union.cc:1410  ...


  2) DDL_CTAS execution by applier thread:

    As the applier thread executes CREATE TABLE and BINLOG INSERTs as several
    statement the performance schema would show the stages involved in each
    statement execution, and would be sufficient.


I. FAILURE MODEL
================

- Recovering after a crash after CREATE TABLE
  - No table shall exist in the database.
  - No change logged into the binary log due to DDL_CTAS.
  - No GTID metadata changed. (see mysql.gtid_executed)

- Recovering after a crash in the middle of rows being inserted
  - No table shall exist in the database.
  - No rows shall exist either (obviously).
  - No change logged into the binary log due to DDL_CTAS.
  - No GTID metadata changed. (see mysql.gtid_executed)

- Recovering after a crash just before PREPARING in the engine
  - No table shall exist in the database.
  - No rows shall exist either (obviously).
  - No change logged into the binary log due to DDL_CTAS.
  - No GTID metadata changed. (see mysql.gtid_executed)

- Recovering after a crash during the FLUSH to the binary log.
  (after process_flush_stage_queue() and before update_binlog_end_pos())
  - Table shall not exist in the database.
  - No rows shall exist either (obviously).
  - No change logged into the binary log due to DDL_CTAS.
  - No GTID metadata changed. (see mysql.gtid_executed)

- Recovering after a crash after WRITING to the binary log successfully.
  (after update_binlog_end_pos())
  - Table shall exist in the database.
  - Rows shall exist in the table.
  - Changes are logged into the binary log.
  - Transaction is committed in the engine.
  - GTID metadata updated. (see mysql.gtid_executed)
A) This is DDL_CTAS binlog sample proposed in design above.
   This is result of mysqlbinlog tool on a prototype patch.
==============================================================

..................................................................
# at 1437
#200304 22:31:53 server id 1  end_log_pos 1516 CRC32 0x45e0ba69 	Anonymous_GTID	last_committed=5	sequence_number=6	rbr_only=no	original_committed_timestamp=1583350313298585	immediate_commit_timestamp=1583350313298585	transaction_length=423
# original_commit_timestamp=1583350313298585 (2020-03-04 22:31:53.298585 GMT)
# immediate_commit_timestamp=1583350313298585 (2020-03-04 22:31:53.298585 GMT)
/*!80001 SET @@session.original_commit_timestamp=1583350313298585*//*!*/;
/*!80014 SET @@session.original_server_version=80021*//*!*/;
/*!80014 SET @@session.immediate_server_version=80021*//*!*/;
SET @@SESSION.GTID_NEXT= 'ANONYMOUS'/*!*/;
# at 1516
#200304 22:31:53 server id 1  end_log_pos 1591 CRC32 0xa7ee392c 	Query	thread_id=9	exec_time=0	error_code=0
SET TIMESTAMP=1583350313/*!*/;
BEGIN
/*!*/;
# at 1591
#200304 22:31:53 server id 1  end_log_pos 1726 CRC32 0x36ebf6de 	Query	thread_id=9	exec_time=0	error_code=0
SET TIMESTAMP=1583350313/*!*/;
/*!80013 SET @@session.sql_require_primary_key=0*//*!*/;
CREATE TABLE `t1` (
  `f1` int DEFAULT NULL
) START TRANSACTION
/*!*/;
# at 1726
#200304 22:31:53 server id 1  end_log_pos 1774 CRC32 0xb360e710 	Table_map: `test`.`t1` mapped to number 154
# at 1774
#200304 22:31:53 server id 1  end_log_pos 1829 CRC32 0xc83d4f94 	Write_rows: table id 154 flags: STMT_END_F

BINLOG '
KQJgXhMBAAAAMAAAAO4GAAAAAJoAAAAAAAEABHRlc3QAAnQxAAEDAAEBAQAQ52Cz
KQJgXh4BAAAANwAAACUHAAAAAJoAAAAAAAEAAgAB/wABAAAAAAIAAAAAAwAAAAAEAAAAlE89yA==
'/*!*/;
# at 1829
#200304 22:31:53 server id 1  end_log_pos 1860 CRC32 0x63a11dbe 	Xid = 67
COMMIT/*!*/;
SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/;
DELIMITER ;
# End of log file
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;
..................................................................


B) LOW LEVEL DETAILS IN IMPLEMENTING B.2.C3.
============================================
1. Do not commit the transaction after CREATE TABLE .. START
   TRANSACTION. (Lets say DDL_CTAS_ST)

2. Maintain context of DDL_CTAS_ST execution within the.
   THD->Transactional_ddl_context
   This context will hold following information,

   a. Create THD->m_transactional_ddl_ctx if we are executing
      DDL_CTAS on SE supporting atomic DDL.

   b. Schema and table name of the table being created.

   c. The handlerton pointer on which the table is being created.

   d. Dictionary_client::Auto_releaser() which holds the
      dictionary object that is built to to create new table and which
      is not yet committed. The Table definition cache would
      still hold the TABLE*.

3. Do not release MDL lock on schema.tablename at the end of DDL_CTAS_ST.

4. The rollback operation involves doing following,

   a. removes the TABLE* using tdc_remove_table().
   b. Frees the Auto_releaser from THD->Transactional_ddl_context.
   c. Invoke the handlerton->post_ddl() hook. This also evicts
      the table from the innodb internal dict cache.
   d. Free the Transactional_ddl_context itself.

5. The commit operation involves doing following,

   a. Free the Auto_releaser maintained in Transactional_ddl_context.
   b. Invoke the handlerton->post_ddl() hook.
   c. Free the Transactional_ddl_context itself.

...
class Transactional_ddl_context {
 public:
  ~Transactional_ddl_context() { end(); }
  void init(THD *thd, dd::String_type db, dd::String_type tablename,
            handlerton *hton);

  bool inited() { return m_thd != nullptr; }

  void rollback();

  void end();

 private:
  std::unique_ptr m_auto_releaser;
  THD *m_thd{nullptr};
  handlerton *m_hton{nullptr};
  dd::String_type m_db;
  dd::String_type m_tablename;
};
...