WL#13355: Make CREATE TABLE...SELECT atomic and crash-safe
Affects: Server-8.0
—
Status: Complete
EXECUTIVE SUMMARY
=================
This WL would implement the DDL 'CREATE TABLE ... AS SELECT'
(lets call it as DDL_CTAS) as an atomic operation in MySQL in
mainly replication side, unlike today where it involves two
transaction for single DDL at applier thread.
This WL would also enable DDL_CTAS execution in group
replication as atomic operation.
USER/DEV STORIES
================
Primary user stories of interest are:
1) The DDL_CTAS operation is an atomic operation at SQL server.
So, when user executes DDL_CTAS and if server crashes while
inserting rows into the table, then user would not see a
empty table being created without the intended data in it.
However, with binlog_format = ROW, the binlog written by above
execution leads to use of two transactions. One for CREATE TABLE
and another for INSERT. This WL would make use of single
transaction when writing binlog events for DDL_CTAS.
2) A replica mysqld instance (apply thread) executes DDL_CTAS and
if server crashes while inserting rows into the table, replica
would be left with a empty table being created without the
intended data in it.
After this WL, replica thread should execute the DDL_CTAS as
single transaction. When DDL_CTAS is replicated and
binlog_format = ROW, the replica should see a CREATE TABLE
statement followed by row events as a single binlog
transaction.
3) A DDL_CTAS operation is written as two transactions in binlog
when binlog_format = ROW. After replica applies the first binlog
transaction which creates a table, the metadata lock on the
table is released, before replica applies row events that inserts
data into it. This does not block concurrent access to the table
while replica thread is inserting rows in the tables.
After this WL, concurrent access to table should be block,
while replica is filling rows into the table.
4) When GTID's are enabled, the DDL_CTAS operation is
currently restricted, mainly as DDL_CTAS is not an atomic
operation.
After this WL, DDL_CTAS should be executed as a single
transaction, to lift this restriction.
5) When user replays the binlog events of DDL_CTAS operation,
using mysqlbinlog tool, currently there are two events that
gets applied as two separate transactions if binlog_format = ROW.
1) a query log event that creates the table. 2) row events that
populate the table. If the server crashes after 1), the table
is left empty.
After this WL, DDL_CTAS should be executed as a single
transaction when relaying the events.
SCOPE
=====
The scope of this WL is to implement the DDL_CTAS as atomic
operation and there by solving the problems listed in USER/DEV
STORIES section above.
LIMITATIONS
===========
This WL would be implemented with a additional restriction on
the statement that you cannot add foreign keys. Foreign keys
can be added in later statements.
The scope of this WL is limited only to storage engines which
support atomic-DDL and not other storage engines.
FR1)
DDL_CTAS would be atomic operation for supporting SE's that
support atomic DDL's from table existence, its contents and
binlog contents view point.
FR2)
For SE's supporting atomic DDL and FKs, DDL_CTAS command with
request to create foreign key would be disallowed. Users should
be able to add required foreign keys using ALTER TABLE after
DDL_CTAS command completion. See B.1.c1.
FR3)
FR1 would be applicable when applying the DDL_CTAS operation on
replica. This also means that the concurrent connections on replica
would not be able to see empty table, before the DDL_CTAS
operation is committed, similar to the behavior seen on source.
This involves implementation of changes described in HLS section
B.2.c#.
FR4)
DDL_CTAS command execution would be allowed with GTID enabled.
And it would exhibit behavior described in FR1. This involves
implementing changes B.3.c#.
FR5)
FR1 would be applicable when replaying replay binlog on server
using mysqlbinlog tool. This behavior is achieved by implementing
above FR's.
FR6)
For storage engines which do not support atomic DDL's, the
binlogging will work the same way and there will be no change.
And replication of such statements with GTID mode enabled would
be still prohibited.
ABBREVIATION
============
DDL_CTAS -> CREATE TABLE ... AS SELECT
A. INTRODUCTION
===============
The primary aim of the worklog is to make DDL_CTAS execution as a
atomic operation on replication side. In other words this
involves making DDL_CTAS to fully rollback on failure or crash in
both cases when execution happens on source and replica.
The USER/DEV stories section listed in HLD highlights the
scenario of interest.
1 ROW BASED BINARY LOGS CAPTURE TWO TRANSACTIONS FOR DDL_CTAS.
MySQL server executes DDL_CTAS as single transaction, but the
binary log is written as two separate transaction. One for
the DDL part and another with the DML part. The DDL part is
logged as non-transaction Query log event. And the DML part
of is written as transactional Row log event.
2 BINARY LOG APPLIER EXTERNALIZES TABLE BEFORE INSERTING THE ROWS.
As a consequence of #1, the binary log applier externalizes
the table before actually inserting the rows part of the
DDL_CTAS. This also means that if there is a crash in-between
the two transactions an empty table is externalized after
recovery. I.e., the whole thing is not atomic.
3 BINARY LOG APPLIER APPLIES TWO TRANSACTIONS INSTEAD OF ONE
FOR DDL_CTAS.
As a consequence of #1 the applier applies two different
transactions, which means again that the whole thing is not
atomic and can lead to replication stream mismatches.
4 DUE TO #1, #2, #3 WITH GTIDS DDL_CTAS IS BLOCKED
When GTID's are enabled, the DDL_CTAS fails with an error
ER_GTID_UNSAFE_CREATE_SELECT. This is because a single
DDL_CTAS may be split by the secondary applier. This can lead
the secondary to effectively report having more transactions
than those on the primary (e.g., primary logs in SBR and
secondary logs in RBR). Thus it leads to inconsistent GTID
sets between the two servers for the same input workload.
This WL aim's to lift above limitations which would enable
execution of DDL_CTAS operation to be atomic on server,
replication replica, group replication node and even during
binlog replay.
B. DESIGN OF ATOMIC DDL_CTAS
============================
Following are the items which would enable recovery of
DDL_CTAS operation in replication side:
1. Atomic DDL_CTAS execution by SQL Server:
When binlogging is off, the current MySQL server invokes single
implicit commit after CREATE TABLE and INSERT operation during
DDL_CTAS execution. There will be no change to this. This
applies to SE's that support atomic DDLs.
When binlogging is on, the changes required are discussed in
section B.2.
Following change will be made during execution of DDL_CTAS at
server.
c1. Disallow creation of foreign keys along with DDL_CTAS.
Users should be able to add required foreign keys using ALTER
TABLE later. This is so that the row's inserted into the table
remains consistent one source and replica, avoiding constraint
check failures which can lead to inconsistency.
More specifically the foreign keys in DDL_CTAS will be
disallowed if,
- SE supports foreign keys.
- SE supports atomic DDL.
- The binlogging is enabled.
- The binlog format is 'row'.
Note:
The main reason for this is complications related to locking,
updating and invalidation (in case of rollback) of DD cache
for parent table.
We can think of disallowing foreign keys in DDL_CTAS when
binlog format is 'stmt' for more uniformity. However doing
this may break old compatibility and so we keep it as is.
Alternatively, we can think of adding code to remove this
limitation for binlog_format = ROW mode.
2. Binlog handling for atomic DDL_CTAS:
Binlog creation in row mode and apply operation needs following
changes:
c1. Enable the Query_log_event created for CREATE TABLE part of
DDL_CTAS to be written to transactional binlog cache and not to
statement binlog cache. This enable DDL_CTAS operation to be
logged in a single BEGIN and COMMIT block in the binlog.
Query_log_event(CREATE TABLE ... START TRANSACTION)
... row events (INSERT) ...
... row events (INSERT) ...
... row events (INSERT) ...
Query_log_event(COMMIT)
See LLD N1) for a sample binlog event.
The new clause 'START TRANSACTION' for CREATE TABLE is written
in the binlog.
...
CREATE TABLE [IF NOT EXISTS] tbl_name (create_definition)
[table_options] [partition_options]
In this terminology, we propose to add a new
table_option, like: table_option:
START TRANSACTION
| [...]
Example:
CREATE TABLE t (a INT) START TRANSACTION;
...
The behavior of the new clause is discussed below in c3 and c4.
NOTES ON A ALTERNATIVE:
We have also considered following alternative, to use a
SET @@ddl_commit variable e.g.,
SET @@ddl_commit=0;
BEGIN;
CREATE TABLE t (a INT); // Won't do commit with @@ddl_commit=0;
BINLOG ...;
...
BINLOG ...;
COMMIT;
SET @@ddl_commit=1;
+) This looks like a logical extension to existing syntax (in fact
no new syntax is introduced). In future we might support
transactional DDL and allow other statements than CREATE TABLE
inside of BEGIN/COMMIT.
+) It is possible to execute on earlier versions by surrounding
SET statements in versioned comments and preferrably turning
autocommit off.
-) It might confuse users and made them think that we support
transactional DDL. So extra care in documentation and possibly
emitting errors for other DDL will be necessary?
-) @@ddl_commit is yet another variable which needs to be binlogged.
As of now, we do not prefer this alternative because,
Transactional DDL is not in any medium term plans. This
later might happen rather far away in future.
c2. Replaying DDL_CTAS binlog event using mysqlbinlog tool.
This should work same as discussed in below in section c3 & c4.
c3 CREATE TABLE behavior changes in the following way when it used
with START TRANSACTION clause (be it executed by SQL thread as
result of processing Query_log_event(CREATE TABLE ... START
TRANSACTION) or in normal connection which executes output from
mysqlbinlog tool).
1 Do not invoke commit at the end of CREATE TABLE.
2 Do not release MDL lock on the table at the end of CREATE
TABLE.
3 Applying row events (for INSERT) should be able to
execute on already locked table name.
4 Release the table lock during COMMIT operation.
Note:
- The behavior that is observed for concurrent operations with a
regular CREATE TABLE is also observed for DDL_CTAS
- Like other atomic DDLs, the effects of a DDL_CTAS will only be
visible (table and table contents) once the DDL transaction
commits.
c4. To support processing of mysqlbinlog output we need to
allow use of START TRANSACTION by general user.
Note that:
- We will allow only applying BINLOG INSERT statement, COMMIT
and ROLLBACK statements after CREATE TABLE ... START
TRANSACTION command.
- START TRANSACTION clause won't require an additional privilege.
- Use of this syntax on table using engine that does not
support atomic DDL will result in a error.
c5. Binlogging CREATE TABLE ... START TRANSACTION.
When user executes this statement on a normal or
on a replica connection,
- A old transaction is committed and a new is started.
- If binlog is disabled, transaction is not committed and
user is allowed to execute BINLOG INSERT, COMMIT or
ROLLBACK command.
- If binlog is enabled, a Query_log_event will be written
with user provided query string in it. The transaction is
left uncommitted and user is allowed to execute BINLOG
INSERT, COMMIT or ROLLBACK command.
- If binlog format is 'row' or 'mixed', then the events
generated are similar to as described in B.1.c2.
- If binlog format is 'statement' then BINLOG statements
will fail. So applying CTAS + BINLOG + COMMIT combo
representing ANY CTAS is impossible. However, we will
allow CREATE TABLE ... START TRANSACTION, immediately
followed by COMMIT to be able apply transactions
representing "CREATE TABLE ... SELECT LIMIT 0".
This is good for compatibility and should not complicate
code.
c6. For storage engines which do not support atomic DDL's, the
binlogging will work the same way and there will be no change.
3. GTID's for DDL_CTAS:
When GTID's are enabled, the following action items are
required to make DDL_CTAS an atomic commit.
c1 Allow DDL_CTAS even when --enforce-gtid-consistency=on.
c2 A single Gtid_log_event should be logged before the events
written as mentioned in section B.2.c1). E.g.,
Gtid_log_event
Query_log_event(CREATE TABLE ... START TRANSACTION)
... row events (INSERT) ...
... row events (INSERT) ...
... row events (INSERT) ...
Query_log_event(COMMIT)
C. RECOVERY
===========
This WL primarily deals with make DDL_CTAS a rollbackable
operation.
- From section A. we see that the DDL_CTAS operation is atomic at
server level, with binlog is off. This enables operation to be
rolled back on a event of a failure. Test t1 and t2 in section E.
verifies the same.
- From section B.c1 and c3, we see that the DDL_CTAS is applied
at replica as a single transaction. This enables operation to be
rollbackable on a event of failure in replica applier thread. Test
t3 and t4 in section E. verifies the same. This same behavior
is applicable during binlog replay.
D. BINLOG FORMAT/PROTOCOL
=========================
This WL will write query log event (for DDL) followed by row
events (for INSERT's) within BEGIN and COMMIT block, as shown in
LLD. There are no changes to binlog format or protocol.
We need to change the transaction boundary parser to
accommodate the new combination of events: DDL + ROWS + COMMIT
E. TESTING
==========
Primary test cases to be executed with binlog row format:
t1 Induce crash after table creation during DDL_CTAS execution,
before the inserts into the table. Recover the server and
see,
- that the table is dropped.
- Check that nothing is binlogged. Atomicity has to go all
the way, engines and binary log.
t2 Induce crash after all the insert operation during DDL_CTAS
execution, before the commit operation. Recover the server and
see,
- that the table is dropped.
- Check that nothing is binlogged. Atomicity has to go all
the way, engines and binary log.
t3 Induce crash after replica thread creates table when applying the
binlog for DDL_CTAS. Recover the replica and see,
- that the table is dropped.
- Restart replication. Check that it resumes from the point
where it should. If the replica has binary log, check that no
"garbage" ended up in the binary log.
t4 Induce crash after replica thread executes inserts operation when
applying binlog for DDL_CTAS, before the commit operation.
Recover the server and see,
- that the table is dropped.
- Restart replication. Check that it resumes from the point
where it should. If the replica has binary log, check that no
"garbage" ended up in the binary log.
t5 Repeat t3 and t4 when group replication applies the DDL_CTAS
operation. Also check that servers are able to get back to
the group after recovery. No "garbage" in the binary log
either.
t6 Check if attempt to access the table gets blocked, while it
is being used to insert rows when applying DDL_CTAS binlog by
replica thread or GR node.
TODO - Develop more test cases.
F. PERFORMANCE
==============
- There would be no major performance impact (less than 2%) in
execution of DDL_CTAS either on the server, replica thread and GR
node.
G. UPGRADE
==========
OLD -> refers to MySQL server without this WL implemented.
New -> refers to MySQL server with this WL implemented.
1. Replication replica is upgraded first, so binlog from OLD
source would be still applied by NEW replica.
2. When GTID_MODE = OFF, we can make NEW->OLD work by logging
CREATE TABLE ... SELECT as:
CREATE TABLE ... /*! START TRANSACTION */
This behavior is not feasible to be implemented, mainly because
the CREATE TABLE would be executed as a separate transaction, keeping
row inserts as separate operation. Even if the binlog semantics is
made compatible by adding versioned comment, the replica wouldn't apply
it atomically.
3. When GTID_MODE = ON, we can never make NEW->OLD work, because
OLD will commit CREATE TABLE before it applies the row events,
so one transaction on new source becomes two on old replica, which
is bad. In practice it will generate an error and stop the replica
executor thread on the first row event. It is better to generate
this error already on CREATE, so when GTID_MODE = ON it's better
to *not* put START TRANSACTION inside version comments and make
the old replica stop on unknown syntax in the CREATE statement.
H. OBSERVABILITY
================
It would be helpful to enable users to observe the process of
DDL_CTAS operation. We can extend the performance_schema to show
two stages of DDL_CTAS operation 1) Creating table and 2)
Inserting rows.
1) DDL_CTAS execution run on source:
Following is output from performance_schema.events_stages_history_long,
378 48 ... stage/sql/Opening tables sql_base.cc:5594 ...
379 48 ... stage/sql/init sql_select.cc:171 ...
380 48 ... stage/sql/creating table sql_table.cc:8399 ...
381 48 ... stage/sql/After create sql_table.cc:8472 ...
382 48 ... stage/sql/System lock lock.cc:331 ...
383 48 ... stage/sql/optimizing sql_optimizer.cc:282 ...
384 48 ... stage/sql/statistics sql_optimizer.cc:502 ...
385 48 ... stage/sql/preparing sql_optimizer.cc:583 ...
386 48 ... stage/sql/executing sql_union.cc:1410 ...
387 48 ... stage/sql/System lock lock.cc:331 ...
388 48 ... stage/sql/waiting for handler commit handler.cc:1590 ...
389 48 ... stage/sql/waiting for handler commit handler.cc:1590 ...
390 48 ... stage/sql/end sql_select.cc:222 ...
1.1) Line 380 and 381 can be used to observer table being created.
37x 48 ... stage/sql/creating table as select sql_select.cc:171 ...
1.2) Line 386 actually means insertion of data into created table.
Would it help to modify it as following ?
386 48 ... stage/sql/inserting rows sql_union.cc:1410 ...
2) DDL_CTAS execution by applier thread:
As the applier thread executes CREATE TABLE and BINLOG INSERTs as several
statement the performance schema would show the stages involved in each
statement execution, and would be sufficient.
I. FAILURE MODEL
================
- Recovering after a crash after CREATE TABLE
- No table shall exist in the database.
- No change logged into the binary log due to DDL_CTAS.
- No GTID metadata changed. (see mysql.gtid_executed)
- Recovering after a crash in the middle of rows being inserted
- No table shall exist in the database.
- No rows shall exist either (obviously).
- No change logged into the binary log due to DDL_CTAS.
- No GTID metadata changed. (see mysql.gtid_executed)
- Recovering after a crash just before PREPARING in the engine
- No table shall exist in the database.
- No rows shall exist either (obviously).
- No change logged into the binary log due to DDL_CTAS.
- No GTID metadata changed. (see mysql.gtid_executed)
- Recovering after a crash during the FLUSH to the binary log.
(after process_flush_stage_queue() and before update_binlog_end_pos())
- Table shall not exist in the database.
- No rows shall exist either (obviously).
- No change logged into the binary log due to DDL_CTAS.
- No GTID metadata changed. (see mysql.gtid_executed)
- Recovering after a crash after WRITING to the binary log successfully.
(after update_binlog_end_pos())
- Table shall exist in the database.
- Rows shall exist in the table.
- Changes are logged into the binary log.
- Transaction is committed in the engine.
- GTID metadata updated. (see mysql.gtid_executed)
A) This is DDL_CTAS binlog sample proposed in design above.
This is result of mysqlbinlog tool on a prototype patch.
==============================================================
..................................................................
# at 1437
#200304 22:31:53 server id 1 end_log_pos 1516 CRC32 0x45e0ba69 Anonymous_GTID last_committed=5 sequence_number=6 rbr_only=no original_committed_timestamp=1583350313298585 immediate_commit_timestamp=1583350313298585 transaction_length=423
# original_commit_timestamp=1583350313298585 (2020-03-04 22:31:53.298585 GMT)
# immediate_commit_timestamp=1583350313298585 (2020-03-04 22:31:53.298585 GMT)
/*!80001 SET @@session.original_commit_timestamp=1583350313298585*//*!*/;
/*!80014 SET @@session.original_server_version=80021*//*!*/;
/*!80014 SET @@session.immediate_server_version=80021*//*!*/;
SET @@SESSION.GTID_NEXT= 'ANONYMOUS'/*!*/;
# at 1516
#200304 22:31:53 server id 1 end_log_pos 1591 CRC32 0xa7ee392c Query thread_id=9 exec_time=0 error_code=0
SET TIMESTAMP=1583350313/*!*/;
BEGIN
/*!*/;
# at 1591
#200304 22:31:53 server id 1 end_log_pos 1726 CRC32 0x36ebf6de Query thread_id=9 exec_time=0 error_code=0
SET TIMESTAMP=1583350313/*!*/;
/*!80013 SET @@session.sql_require_primary_key=0*//*!*/;
CREATE TABLE `t1` (
`f1` int DEFAULT NULL
) START TRANSACTION
/*!*/;
# at 1726
#200304 22:31:53 server id 1 end_log_pos 1774 CRC32 0xb360e710 Table_map: `test`.`t1` mapped to number 154
# at 1774
#200304 22:31:53 server id 1 end_log_pos 1829 CRC32 0xc83d4f94 Write_rows: table id 154 flags: STMT_END_F
BINLOG '
KQJgXhMBAAAAMAAAAO4GAAAAAJoAAAAAAAEABHRlc3QAAnQxAAEDAAEBAQAQ52Cz
KQJgXh4BAAAANwAAACUHAAAAAJoAAAAAAAEAAgAB/wABAAAAAAIAAAAAAwAAAAAEAAAAlE89yA==
'/*!*/;
# at 1829
#200304 22:31:53 server id 1 end_log_pos 1860 CRC32 0x63a11dbe Xid = 67
COMMIT/*!*/;
SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/;
DELIMITER ;
# End of log file
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;
..................................................................
B) LOW LEVEL DETAILS IN IMPLEMENTING B.2.C3.
============================================
1. Do not commit the transaction after CREATE TABLE .. START
TRANSACTION. (Lets say DDL_CTAS_ST)
2. Maintain context of DDL_CTAS_ST execution within the.
THD->Transactional_ddl_context
This context will hold following information,
a. Create THD->m_transactional_ddl_ctx if we are executing
DDL_CTAS on SE supporting atomic DDL.
b. Schema and table name of the table being created.
c. The handlerton pointer on which the table is being created.
d. Dictionary_client::Auto_releaser() which holds the
dictionary object that is built to to create new table and which
is not yet committed. The Table definition cache would
still hold the TABLE*.
3. Do not release MDL lock on schema.tablename at the end of DDL_CTAS_ST.
4. The rollback operation involves doing following,
a. removes the TABLE* using tdc_remove_table().
b. Frees the Auto_releaser from THD->Transactional_ddl_context.
c. Invoke the handlerton->post_ddl() hook. This also evicts
the table from the innodb internal dict cache.
d. Free the Transactional_ddl_context itself.
5. The commit operation involves doing following,
a. Free the Auto_releaser maintained in Transactional_ddl_context.
b. Invoke the handlerton->post_ddl() hook.
c. Free the Transactional_ddl_context itself.
...
class Transactional_ddl_context {
public:
~Transactional_ddl_context() { end(); }
void init(THD *thd, dd::String_type db, dd::String_type tablename,
handlerton *hton);
bool inited() { return m_thd != nullptr; }
void rollback();
void end();
private:
std::unique_ptr m_auto_releaser;
THD *m_thd{nullptr};
handlerton *m_hton{nullptr};
dd::String_type m_db;
dd::String_type m_tablename;
};
...
Copyright (c) 2000, 2025, Oracle Corporation and/or its affiliates. All rights reserved.