WL#7592: GTIDs: generate Gtid_log_event and Previous_gtids_log_event always
When GTID_MODE = ON, the binary log contains two events that do not currently exist when GTID_MODE = OFF. In this worklog we will make it so that similar events are generated also when GTID_MODE = OFF:
- When GTID_MODE = ON, every transaction is preceded by an event with type code GTID_LOG_EVENT. No such event exists when GTID_MODE = OFF.
However, after GTID_LOG_EVENT was introduced, it has gradually turned into an event with a more generic purpose, as fields not related to GTIDs were added to the event. These fields include logical timestamps used for applying transactions in parallel on the slave, as well as physical timestamps used for monitoring. It is also likely that more per-transaction fields will be needed in the future.
Therefore, we need to generate a per-transaction event also when GTID_MODE = OFF; this is needed e.g. for WL#7083 and WL#7165.
The event type that we need exists in the code already, although it is not used: it has the type code ANONYMOUS_GTID_LOG_EVENT. This type code is different from GTID_LOG_EVENT, but internally, the two event types share the same class, called Gtid_log_event.
- When GTID_MODE = ON, every binary log has a Previous_gtids_log_event in the header (just after the Format_description_log_event). When GTID_MODE = OFF, there is no such per-binlog event.
This has the problem that if gtid_mode is changed from ON to OFF and then back to ON again, the values of GTID_EXECUTED and GTID_PURGED may get lost.
Therefore, this worklog ensures that Previous_gtids_log_event is written to every binary log, regardless of gtid_mode.
This has the additional benefit that it will allow the recovery of GTID_PURGED and GTID_EXECUTED to be optimized.
An additional benefit from generating both Gtid_log_event and Previous_gtids_log_event unconditionally is that binary logs will have similar structures when GTID_MODE = ON and GTID_MODE = OFF. This allows us to simplify both code and test cases, as otherwise we would need special code to handle the two different cases.
This work was previously part of WL#7083 but will be more practical to fix separately.
REFERENCES
- BUG#69097 - MYSQLD SCANS ALL BINARY LOGS ON CRASH RECOVERY
- The bug can later be fixed using the refactoring in this worklog.
Functional requirements:
- FR1.
- When GTID_MODE = OFF, every transaction shall be preceded by an Anonymous_gtid_log_event in the binary log.
- FR2.
- When GTID_MODE = OFF, every binary log shall start with a Previous_gtids_log_event.
- FR3.
- Cross-version replication OLD->NEW shall not be harmed by this worklog.
- FR4.
- Cross-version replication NEW->OLD shall work as follows:
- FR4.1. NEW->5.6.21 shall work if gtid_mode=ON. It will not work if gtid_mode=OFF.
- FR4.2. NEW->5.6.22 shall work regardless of gtid_mode.
- Footnote: the reason that 5.6.21 and older cannot work is BUG#74683. Once the bug is fixed, cross-version replication shall work.
Non-functional requirements:
- NFR1.
- The code shall be structured such that it does not assume all Gtid_log_events and Anonymous_gtid_log_events have the same size.
- This requirement makes the code more maintainable.
- NFR2.
- This shall not cause more than 3% performance degradation.
- NFR3.
- The restrictions imposed by GTID_MODE=ON (e.g., enforce-gtid-consistency, and disallowing sql_slave_skip_counter) shall not apply when GTID_MODE=OFF, even when we generate events with type code ANONYMOUS_GTID_LOG_EVENT.
Note. Analysis of storage requirements
- A Gtid_log_event takes 57 bytes. If Gtid_log_event is not included in the binary log, Query_log_event grows by 8 bytes, since the logical timestamp used for MTS is then stored in the Query_log_event instead of in the Gtid_log_event. So the real overhead is 57-8=49 bytes.
- A Query_log_event takes 79 bytes or more, plus the length of the query, plus the length of the database name.
- A Xid_log_event takes 31 bytes.
- A table_map event takes 45 bytes or more.
- A row event takes 40 bytes or more.
So a *minimal* DML transaction without Gtid_log_event takes:
79 + 1 + 5 (query_log_event(use 'x'; BEGIN)) + 79 + 1 + 21 (query_log_event(use 'x'; INSERT INTO t SET a=1)) + 31 (Xid_log_event) = 217 bytes.
Thus, the overhead of the Gtid_log_event is at most 49/217, or about 22.6%. But this is a theoretical worst case. A real transaction would have significantly larger queries, and often a larger number of queries, which would make the relative overhead of the Gtid_log_event smaller.
Moreover, the query_log_event(BEGIN) is redundant and the XID of the Xid_log_event could be stored in the Gtid_log_event. This worklog will pave the way for the future work of removing both these events, and then it will unconditionally be an improvement even compared to a binary log without Gtid_log_event: it will save 79+1+5+31-49=67 bytes per transaction.
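The arithmetic in this note can be re-checked with a short script. This is only a sketch: the byte counts are the ones quoted above, and the function and variable names are made up for illustration.

```python
# Event sizes in bytes, as quoted in the storage analysis above.
GTID_EVENT = 57
QUERY_EVENT_BASE = 79        # Query_log_event without query text and database name
XID_EVENT = 31
MTS_TIMESTAMP_IN_QUERY = 8   # extra bytes in Query_log_event when there is no Gtid_log_event


def query_event(db, query):
    """Size of a Query_log_event for the given database name and query text."""
    return QUERY_EVENT_BASE + len(db) + len(query)


# Minimal DML transaction without Gtid_log_event.
begin = query_event("x", "BEGIN")
insert = query_event("x", "INSERT INTO t SET a=1")
minimal_trx = begin + insert + XID_EVENT

# Net cost of the Gtid_log_event, and its share of the minimal transaction.
gtid_overhead = GTID_EVENT - MTS_TIMESTAMP_IN_QUERY
relative_overhead = gtid_overhead / minimal_trx

# Bytes saved per transaction if BEGIN and Xid_log_event are later folded
# into the Gtid_log_event.
future_saving = begin + XID_EVENT - gtid_overhead

print(minimal_trx, gtid_overhead, round(100 * relative_overhead, 1), future_saving)
```

The script divides by the 217-byte total of the minimal transaction, which gives a worst-case overhead of roughly 22.6%.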
Summary
- When GTID_MODE = OFF, an Anonymous_gtid_log_event shall be generated and every transaction or DDL statement in the binary log shall be preceded by it.
- When GTID_MODE = OFF, a Previous_gtids_log_event shall be generated at the beginning of every binary log and every relay log.
- When GTID_MODE = ON, there is no user-visible change.
Cross-version compatibility
OLD->NEW: This will always work. The server can handle old events and there is no problem.
NEW->OLD[5.5]: This will never work. Even if ANONYMOUS_GTID_LOG_EVENT has LOG_EVENT_IGNORABLE_F set, this flag was only introduced in 5.6, so the 5.5 slave does not know about the flag and will fail.
NEW->OLD[5.6 or 5.7]:
- If GTID_MODE=ON, there is no problem, since nothing changed.
- If GTID_MODE=OFF, there are two cases:
- If slave is 5.6.21 or older, or 5.7.5 or older, replication will stop with an error, due to BUG#74683.
- If we fix BUG#74683 in 5.6.22, and slave is 5.6.22 or later, then replication will work fine, since PREVIOUS_GTIDS_LOG_EVENT is skipped by the slave receiver thread and ANONYMOUS_GTID_LOG_EVENT has LOG_EVENT_IGNORABLE_F set.
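The NEW->OLD rules can be summarized as a small decision function. This is a sketch under two assumptions: that BUG#74683 is fixed in 5.6.22 as described, and that the first 5.7 release after 5.7.5 also carries the fix (the text only says that 5.7.5 and older fail). The function name is hypothetical.

```python
def new_to_old_works(slave_version, gtid_mode_on):
    """Return True if replication from a NEW master to an older slave works.

    slave_version is a (major, minor, patch) tuple, e.g. (5, 6, 22).
    """
    series = slave_version[:2]
    if series == (5, 5):
        # 5.5 does not know LOG_EVENT_IGNORABLE_F, so it always fails.
        return False
    if series not in ((5, 6), (5, 7)):
        raise ValueError("only 5.5, 5.6 and 5.7 slaves are covered here")
    if gtid_mode_on:
        # Nothing changed in the binary log when GTID_MODE=ON.
        return True
    if series == (5, 6):
        return slave_version >= (5, 6, 22)
    # Assumption: 5.7 releases after 5.7.5 contain the BUG#74683 fix.
    return slave_version > (5, 7, 5)
```

For example, `new_to_old_works((5, 6, 21), False)` is False while `new_to_old_works((5, 6, 22), False)` is True.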
DROP DATABASE
The behavior has changed for DROP DATABASE statements that fail after deleting some tables.
This error can happen e.g. because of extra files in the database directory. When this happens, the server generates an error, but it leaves the tables deleted. To log this, the server generates a DROP TABLE statement listing all the tables. If there are many tables, the DROP TABLE statement becomes very long. In this case, the server generates multiple DROP TABLE statements. This is very strange but it is how the server works before this patch.
If the error happens on a master, there is no problem from GTID perspective: it will generate one Anonymous_gtid_log_event per DROP if GTID_MODE=OFF, and one Gtid_log_event per DROP if GTID_MODE=ON.
If the error happens on a slave, and GTID_MODE=OFF, there is also no big problem: it generates an Anonymous_gtid_log_event per DROP.
If the error happens on a slave, and GTID_MODE=ON, we are in a worse situation. The GTID must be preserved, but the statement needs to be logged as multiple transactions, each having its own GTID. The error cannot be detected until after the statement has completed and it cannot be rolled back. Our solution is to generate an error and not log anything at all. We introduce a new error message for this:
- "DROP DATABASE failed on slave; some tables may have been dropped but the database directory remains. The GTID has not been added to GTID_EXECUTED and the statement was not written to the binary log. Fix this as follows: (1) remove all files from the database directory %-.192s; (2) SET GTID_NEXT='%-.192s'; (3) DROP DATABASE `%-.192s`."
CREATE TABLE...SELECT with BINLOG_FORMAT=ROW
CREATE...SELECT is allowed when GTID_MODE=OFF. Prior to this worklog, this was executed as one transaction. In statement format it was logged as a single statement. In row format, it was logged as:
BEGIN; CREATE TABLE without SELECT; ...row events... COMMIT;
This worklog changes the logging of CREATE...SELECT in row format, so now it is:
Anonymous_Gtid; CREATE TABLE without SELECT; Anonymous_Gtid; BEGIN; ...row events... COMMIT;
This also means that there is a storage engine commit between the CREATE and the insertion of rows.
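The two logging layouts can be written down as event lists, which makes the extra transaction boundary visible. This is an illustrative sketch only; the function names are hypothetical and the strings are the labels used in this section, not real event contents.

```python
def create_select_row_events(after_worklog):
    """Events logged for CREATE TABLE ... SELECT with BINLOG_FORMAT=ROW
    and GTID_MODE=OFF, before and after this worklog."""
    create = "CREATE TABLE without SELECT"
    rows = "...row events..."
    if not after_worklog:
        return ["BEGIN", create, rows, "COMMIT"]
    return ["Anonymous_Gtid", create,
            "Anonymous_Gtid", "BEGIN", rows, "COMMIT"]


def count_transactions(events):
    # Each Anonymous_Gtid event starts a new transaction in the binary log.
    return events.count("Anonymous_Gtid")
```

After the worklog, `count_transactions(create_select_row_events(True))` is 2: the CREATE and the row insertion are logged as separate transactions.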
Stricter checks for missing Gtid_log_event in slave applier thread
When slave has GTID_MODE=ON, it must not accept any transactions that are missing a Gtid_log_event. If the slave applier thread sees an event that is part of a transaction that does not have a Gtid_log_event, then the slave stops with an error.
Prior to this worklog, the check was omitted for BEGIN and COMMIT events. Now, the check is stricter and is done also for BEGIN and COMMIT. This should not make any difference to users, but required modifying a few test cases that did strange seeking in the binary log in order to simulate errors.
PERFORMANCE_SCHEMA
There was a bug in how GTIDs were displayed in PERFORMANCE_SCHEMA. This is not directly related to the scope of this worklog, but had to be fixed to make existing tests pass.
Background: The GTID is changed during the lifetime of a transaction as follows:
- On a master, the transaction is "AUTOMATIC" while executing. Only when it commits is it assigned a GTID of the form UUID:NUMBER.
- On a slave, the transaction is assigned a GTID before it starts to execute.
- When GTID_MODE = OFF, transactions use the special GTID "ANONYMOUS" rather than "UUID:NUMBER".
This was not reflected correctly in the GTID column of the performance_schema tables events_transactions_current and events_transactions_history. Among other things, the history table could contain 'AUTOMATIC'. This should never happen since it makes it impossible to identify the transaction.
mysqlbinlog
The output from mysqlbinlog changes a little bit:
- Previous_gtids_log_event and Anonymous_Gtid_log_event are present even if gtid_mode=off.
- For Query_log_event, WL#7165 introduced a commented-out text that shows the logical timestamps used for MTS. Now, these timestamps do not occur at all for Query_log_event, so they have been removed from the output of mysqlbinlog.
- For Gtid_log_event, there was a commented-out text saying 'commit=yes'. This once had a meaning in a development tree of the GTID feature, but the part of the GTID feature that this refers to was never pushed to the main trees. So the text is completely meaningless. In this worklog we remove the text.
Notes
- To implement FR1, we need to generate an Anonymous_gtid_log_event even when GTID_MODE = OFF.
- To implement NFR1, we need to change the life cycle of Gtid_log_event and Anonymous_gtid_log_event.
Background:
Currently, Gtid_log_event is generated the first time something is written to the cache. This means that the size has to be determined before the transaction executes, which violates NFR1.
Fix:
To conform to NFR1, we move the Gtid and anonymous events out of the transaction and statement caches. Instead of generating the events at the beginning of the transaction, we generate them just before flushing the transaction to the binary log. I.e., we generate them in (a function called from) binlog_cache_data::flush(). (Nothing prevents us from generating the GTID and/or Gtid_log_event in any other places, if that is needed in the future.)
In do_write_cache(), we keep the event outside the cache and update end_log_pos and compute the checksum separately, before processing the actual cache.
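The changed life cycle can be illustrated with a toy model. This is a sketch under stated assumptions, not the server code: the class and function names are hypothetical, the event encoding is invented, and zlib.crc32 merely stands in for the binary log checksum.

```python
import zlib


class TransactionCache:
    """Toy stand-in for the transaction cache: holds only regular events."""

    def __init__(self):
        self.events = []

    def write(self, payload):
        self.events.append(payload)


def make_gtid_event(gtid):
    # Hypothetical encoding. The size depends on the argument, so nothing
    # may assume that all such events have the same size (NFR1).
    return b"GTID:" + gtid.encode()


def flush(cache, binlog, gtid):
    """Write one transaction to binlog (a bytearray).

    Returns one (end_log_pos, checksum) pair per event written.
    """
    written = []
    # The GTID event is generated here, at flush time, and kept outside
    # the cache. It is processed first, then the cached events follow.
    out_of_cache = make_gtid_event(gtid)
    for payload in [out_of_cache] + cache.events:
        binlog.extend(payload)
        end_log_pos = len(binlog)
        checksum = zlib.crc32(payload)
        written.append((end_log_pos, checksum))
    cache.events.clear()
    return written
```

Because the event is built at flush time, its size does not have to be known when the first statement of the transaction writes to the cache.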
Plan
We will implement this worklog as multiple patches:
1. Small simplifications.
While debugging the feature, a few small things had to be fixed, e.g. more DBUG output, etc. This patch collects all such simplifications, so that they don't distract from the rest of the worklog.
2. Remove dead code related to empty GTID transactions.
While developing the feature, we found that parts of the GTID code were not used at all. Since this is code related to the feature, we fix it in this worklog by removing the dead code.
3. Clean up binlog.cc:gtid_empty_group_log_and_cleanup.
This feature affects binlog.cc:gtid_empty_group_log_and_cleanup. This function was unnecessarily complex, and a few things were done the same way by all callers of the function. We refactor this function to simplify the code and to get a clearer view of how GTIDs work in the server.
4. Refactor @@session.gtid_executed implementation.
Before this patch, the implementation of the session variable @@session.gtid_executed relies on Gtid_log_event being stored in the transaction cache or statement cache. Since we will move Gtid_log_event out of the caches, we change the implementation of @@session.gtid_executed so that it does not rely on the caches.
5. Refactor owned_gtid.
In order to make MTS work correctly in the presence of Anonymous_gtid_log_events, we must make Relay_log_info::is_in_group understand that an anonymous transaction has started after an Anonymous_gtid_log_event, before the BEGIN is executed.
Currently, is_in_group looks at the value of THD::owned_gtid.sidno to determine if any GTID is owned. To make this work for Anonymous_gtid_log_events, we make SET GTID_NEXT='ANONYMOUS' set THD::owned_gtid.sidno = -2, and make any committing statement reset THD::owned_gtid even in this case. We introduce the symbolic constant OWNED_SIDNO_ANONYMOUS for -2.
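A minimal model of the ownership state may help here. Only the constant OWNED_SIDNO_ANONYMOUS = -2 comes from this worklog; the class, the method names and the use of 0 for "nothing owned" are assumptions made for the sketch.

```python
OWNED_SIDNO_ANONYMOUS = -2   # symbolic constant introduced by this worklog
NOTHING_OWNED = 0            # assumption: 0 means no GTID is owned


class Thd:
    """Toy model of the owned_gtid.sidno field of a session."""

    def __init__(self):
        self.owned_sidno = NOTHING_OWNED

    def set_gtid_next_anonymous(self):
        # SET GTID_NEXT='ANONYMOUS'
        self.owned_sidno = OWNED_SIDNO_ANONYMOUS

    def set_gtid_next(self, sidno):
        # SET GTID_NEXT='UUID:NUMBER'; sidno is a positive number here.
        self.owned_sidno = sidno

    def commit(self):
        # Any committing statement releases ownership, also for ANONYMOUS.
        self.owned_sidno = NOTHING_OWNED


def is_in_group(thd):
    """True once a GTID or anonymous event was seen, even before BEGIN."""
    return thd.owned_sidno > 0 or thd.owned_sidno == OWNED_SIDNO_ANONYMOUS
```

With this model, an applier that has processed an Anonymous_gtid_log_event but not yet the BEGIN already reports that it is inside a group.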
6. Allow writing Log_event header to memory.
This is a small refactoring to allow writing Gtid_log_event to a memory buffer rather than to an IO_CACHE.
When we write Gtid_log_event always, the Gtid_log_event will live outside the transaction cache. Before the event is flushed to the binary log, it has to be stored in a memory buffer. However, there was no function for writing log events to a memory buffer.
This patch adds the function Log_event::write_header_to_memory that writes the 19 byte Log_event header to a memory buffer, and Gtid_log_event::write_to_memory that writes an entire Gtid_log_event to a memory buffer.
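As an illustration of what writing the header to memory involves, the sketch below packs a 19-byte header into a buffer. It assumes the usual binary log common header layout (4-byte timestamp, 1-byte type code, 4-byte server_id, 4-byte event length, 4-byte end_log_pos, 2-byte flags, little-endian). The Python functions only borrow the names from the patch; they are not the server implementation.

```python
import struct

LOG_EVENT_HEADER_LEN = 19
# timestamp, type code, server_id, event length, end_log_pos, flags
HEADER_FORMAT = "<IBIIIH"


def write_header_to_memory(buf, timestamp, type_code, server_id,
                           event_len, end_log_pos, flags):
    """Pack the common header into the first 19 bytes of buf."""
    struct.pack_into(HEADER_FORMAT, buf, 0, timestamp, type_code,
                     server_id, event_len, end_log_pos, flags)
    return LOG_EVENT_HEADER_LEN


def write_to_memory(timestamp, type_code, server_id, end_log_pos, flags, body):
    """Build a whole event (header plus body) in a memory buffer."""
    buf = bytearray(LOG_EVENT_HEADER_LEN + len(body))
    offset = write_header_to_memory(buf, timestamp, type_code, server_id,
                                    len(buf), end_log_pos, flags)
    buf[offset:] = body
    return bytes(buf)
```

Since the event lives in a plain buffer, fields such as end_log_pos can be filled in late, just before the event is copied to the binary log.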
7. Refactor MYSQL_BIN_LOG::do_write_cache.
This patch rewrites do_write_cache so that it becomes more maintainable and so that we can write out-of-cache events to the binary log. There are two reasons for this.
Rationale 1:
Before this patch, do_write_cache mixes two tasks:
(1) It reads from the statement or transaction cache and assembles pieces of events that are split over multiple pages in the IO_CACHE.
(2) It processes the event and writes it to the binary log.
In this worklog, we need to store the Gtid_log_event outside the statement and transaction caches. The event needs the processing in (2), but not that in (1). Hence, we need to decouple these two tasks. Task (2) is implemented by a new class and task (1) remains in do_write_cache.
Rationale 2:
Before this patch, do_write_cache was very difficult to understand, as it kept lots of state in variables that were updated throughout the function. The reason was that the logic was organized in an impractical manner: the outermost loop iterated over pages of the IO_CACHE and tried to keep various pieces of state of half-written events in local variables. This state information was updated all over the function, which made it difficult to understand what the loop invariants were.
This patch changes this so that the outermost loop iterates over events. This makes the state information much more short-lived. For instance, state related to the beginning of an event is only needed at the beginning of the iteration and can be stored in a very short-lived variable. So there is less state kept between iterations, and this makes the loop invariants much simpler.
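The restructured loop can be sketched as follows. The reader class hides page boundaries (task 1) and the outer loop handles one whole event per iteration, passing it to a callback (task 2). All names are hypothetical, and the sketch assumes a 19-byte header with the event length stored as a little-endian 4-byte integer at offset 9.

```python
import struct

HEADER_LEN = 19
EVENT_LEN_OFFSET = 9   # after 4-byte timestamp, 1-byte type, 4-byte server_id


class PagedReader:
    """Reads bytes from a list of pages, hiding page boundaries."""

    def __init__(self, pages):
        self.pages = list(pages)
        self.page_index = 0
        self.offset = 0

    def read(self, n):
        out = bytearray()
        while len(out) < n and self.page_index < len(self.pages):
            page = self.pages[self.page_index]
            chunk = page[self.offset:self.offset + n - len(out)]
            out += chunk
            self.offset += len(chunk)
            if self.offset == len(page):
                # Page exhausted: continue with the next one.
                self.page_index += 1
                self.offset = 0
        return bytes(out)


def write_cache(pages, process_event):
    """Iterate over events, not pages; call process_event once per event."""
    reader = PagedReader(pages)
    while True:
        header = reader.read(HEADER_LEN)
        if not header:
            break
        if len(header) < HEADER_LEN:
            raise ValueError("truncated event header")
        (event_len,) = struct.unpack_from("<I", header, EVENT_LEN_OFFSET)
        body = reader.read(event_len - HEADER_LEN)
        if len(body) != event_len - HEADER_LEN:
            raise ValueError("truncated or malformed event")
        process_event(header + body)
```

The only state that survives between iterations is the reader position, which is the kind of simplification of loop invariants that the rationale describes.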
8. Generate Gtid_log_event always.
After all the previous preparation steps, we can now generate Gtid_log_event always. This does change server behavior. Moreover, it breaks many existing test cases that assume there is no Gtid_log_event. However, since we have another big change to make (generate Previous_gtids_log_event always), we do not touch tests in this patch. Later patches will fix the tests.
This patch has three primary goals:
8.1. Remove Gtid_log_event from the statement/transaction cache.
8.2. Generate Gtid_log_event just before flushing the statement/transaction cache. Store the event in a separate buffer.
8.3. Generate Gtid_log_event even when GTID_MODE=OFF.
In order to make this work, we have to do the following additional tasks:
8.4. Previously, the commit sequence number was generated when flushing the cache. Since Gtid_log_event is not part of the cache now, we generate it in Gtid_log_event.
8.5. Since Anonymous_gtid_log_events now exist, a case must be added to the MTS code so that it reads the commit_seq_no from the ANONYMOUS_GTID_LOG_EVENT just like it reads from the GTID_LOG_EVENT.
In addition we do the following cleanup tasks:
8.6. Remove binlog.cc:gtid_before_write_cache. This functionality is now in MYSQL_BIN_LOG::write_gtid. In addition, we move the part of this function that deals with generating the GTID into the new function Gtid_state::generate_automatic_gtid.
8.7. Avoid taking a lock in the Gtid_log_event constructor.
8.8. Remove the '[commit=yes/no]' output when mysqlbinlog prints a Gtid_log_event. This was only relevant for a pre-GA version of the GTID feature.
9. Generate Previous_gtids_log_event always.
This patch generates Previous_gtids_log_event even if gtid_mode=off. This changes server behavior. It also breaks a lot of existing test cases that assume there is no Previous_gtids_log_event. We address the test cases in the following patch.
10. Fix failing tests.
This patch fixes all test cases that fail due to the previous two patches. In addition it fixes two code bugs that were exposed/introduced due to the previous two patches, and which caused some of the test failures.
- Fix bug in START SLAVE UNTIL MASTER_LOG_POS logic.
The problem was: some Rotate_log_events in the relay log are generated on the slave, not on the master. Thus, their end_log_pos field is relative to the slave relay log. Since MASTER_LOG_POS is relative to the master binary log, we must not evaluate the MASTER_LOG_POS condition for such slave-generated Rotate_log_events. But the logic to skip the until check for slave-generated events was missing, and this caused tests to fail. The fix is to avoid evaluating the until condition for slave-generated events. This is easy: such events are easily distinguishable since their server_id is zero. So we check if server_id==0, and in that case we don't evaluate the until condition. This did not cause any tests to fail before this worklog, because the events appeared so early in the relay log that their positions would be smaller than the position specified by MASTER_LOG_POS. However, after this patch, the events appear after Previous_gtids_log_event, which moves the position forward so much that it causes the slave thread to stop before the rotate event, which causes the test to fail.
- Fix bug in sql_slave_skip_counter with GTIDs.
sql_slave_skip_counter did not compute transaction boundaries correctly in the presence of Gtid_log_events. This did not cause any problems before this patch since sql_slave_skip_counter is not allowed when gtid_mode=on.
sql_slave_skip_counter is supposed to decrease for every event processed, except it should not decrease down to 0 in the middle of a group. This ensures that the applier thread does not stop in the middle of a transaction. However, the applier thread did not consider Gtid_log_event to be part of a group, and therefore it could stop after the Gtid_log_event. The problem was that Gtid_log_event did not implement a specialized do_shall_skip function. This caused it to decrease the counter down to zero. The fix is to implement Gtid_log_event::do_shall_skip and make it call continue_group.
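The effect of the fix can be shown with a toy model of the skip counter. This is a simplified sketch, not the applier code: events are plain strings, the function name is hypothetical, and "continuing a group" is modeled as skipping the event without decrementing the counter when the counter is 1.

```python
def skipped_events(events, counter, gtid_is_part_of_group):
    """Return the events skipped for a given sql_slave_skip_counter value.

    events is a list of strings such as "GTID", "BEGIN", "ROWS", "COMMIT".
    gtid_is_part_of_group=False models the behavior before the fix.
    """
    skipped = []
    for event in events:
        if counter == 0:
            break
        skipped.append(event)
        continues_group = event in ("BEGIN", "ROWS") or (
            event == "GTID" and gtid_is_part_of_group)
        if continues_group and counter == 1:
            # Never let the counter reach 0 in the middle of a group.
            continue
        counter -= 1
    return skipped


binlog = ["GTID", "BEGIN", "ROWS", "COMMIT", "GTID", "BEGIN", "ROWS", "COMMIT"]
print(skipped_events(binlog, 1, gtid_is_part_of_group=False))
print(skipped_events(binlog, 1, gtid_is_part_of_group=True))
```

In this model, a counter of 1 skips only the first GTID event before the fix, leaving the applier in the middle of a transaction, whereas after the fix the whole first transaction is skipped.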
- Fix simplified-binlog-recovery.
Writing Previous_gtids_log_event always broke the logic for simplified-binlog-recovery.
Background:
Before this patch, simplified-binlog-recovery would avoid iterating over multiple binary logs only in the case that the binlog lacks a Previous_gtids_log_event.
Problem:
Since we now generate Previous_gtids_log_event always, recovery would iterate over all binary logs even when simplified-binlog-recovery was enabled.
Fix:
Make it so that simplified-binlog-recovery skips the rest of the binary logs also in the case that the first binary log contains a Previous_gtids_log_event and no Gtid_log_event.
11. Remove dead code.
This patch removes some code that became dead after this worklog:
- Remove code to store logical timestamps in Query_log_event. This is now done only in Gtid_log_event.
- Remove class Group_cache. This is not used now since the Gtid_log_event is not written to the transaction or statement cache.
- Remove the G_COMMIT_TS field. This was a single byte that had the constant value 1, and was stored in the binary log for Gtid_log_events. It did not serve a purpose, so we remove it.
- Remove IO_CACHE::commit_seq_no and IO_CACHE::commit_seq_offset. These were clearly misplaced in the first place, and are now not needed any more since the timestamp is generated in the Gtid_log_event constructor.
- Remove enumeration value INVALID_GROUP from enum_group_type. This was used in two places. The first one was Group_cache, which is now removed. The second was Gtid_specification::get_type. But get_type was only used in Gtid_specification::is_valid. We can merge get_type into is_valid (this is a simplification), and the result is that we don't need INVALID_GROUP.
12. Fix GTIDs in P_S tables.
In this patch we correct the GTID shown in the GTID columns of performance_schema.events_transactions_current and performance_schema.events_transactions_history.
Background:
The GTID is changed during the lifetime of a transaction as follows:
- On a master, the transaction is "AUTOMATIC" while executing. Only when it commits is it assigned a GTID of the form UUID:NUMBER.
- On a slave, the transaction is assigned a GTID before it starts to execute.
- When GTID_MODE = OFF, transactions use the special GTID "ANONYMOUS" rather than "UUID:NUMBER".
This was not reflected correctly in the GTID column of these performance_schema tables. Among other things, the history table could contain 'AUTOMATIC'. This should never happen since it makes it impossible to identify the transaction. This issue is not directly related to WL#7592. However, it showed up as a test failure in perfschema.transaction after a refactoring that was part of WL#7592. Therefore, we fix it in order to make tests pass for WL#7592.
13. Fix binlogging of strange SQL statements.
A few SQL statements have strange semantics that cause trouble for GTIDs. This includes DROP TABLE with multiple tables, CREATE TEMPORARY/DROP TEMPORARY, DROP DATABASE, OPTIMIZE/REPAIR/ANALYZE/CHECKSUM TABLE, CREATE TABLE ... SELECT, and DROP TEMPORARY generated by client disconnect.
1. Background: When DROP TABLE is used with multiple tables, and the tables are of different types (transactional/non-transactional or temporary/non-temporary), tables of the same type get grouped together and each group is logged as a separate statement. For example: DROP TABLE temporary, non_temporary gets logged as DROP TABLE temporary; DROP TABLE non_temporary.
When GTID_MODE = ON, each such statement is assigned its own GTID. In order to generate the GTID, mysql_rm_table_no_locks must call mysql_bin_log.commit for each group of tables.
Problem: mysql_bin_log.commit is only called when gtid_mode != OFF. When gtid_mode == OFF, all the statements are written to the binary log in one operation. So after WL#7592 there is only one Anonymous_gtid_log_event, instead of one for each DROP TABLE.
Fix: Call mysql_bin_log.commit unconditionally. Note: if we are inside a transaction and only temporary tables are dropped, we should not call mysql_bin_log.commit, since the transactional context must remain open in this case.
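The intended result of the fix can be sketched as follows: tables are grouped by type, and each group is logged as its own statement preceded by its own per-transaction event. The function name and the way table types are labeled are assumptions, and the sketch ignores the special case of temporary tables dropped inside a transaction.

```python
def log_drop_table(tables, gtid_mode_on):
    """Return the events written for a multi-table DROP TABLE.

    tables is a list of (name, kind) pairs, where kind is any label such
    as "temporary" or "non_temporary".
    """
    groups = {}
    for name, kind in tables:
        # Dictionaries keep insertion order, so groups appear in the
        # order in which their first table was listed.
        groups.setdefault(kind, []).append(name)
    gtid_event = ("Gtid_log_event" if gtid_mode_on
                  else "Anonymous_gtid_log_event")
    events = []
    for names in groups.values():
        # One commit per group, so every statement gets its own event.
        events.append(gtid_event)
        events.append("DROP TABLE " + ", ".join(names))
    return events
```

For the example above, `log_drop_table([("temporary", "temporary"), ("non_temporary", "non_temporary")], False)` yields two DROP TABLE statements, each preceded by an Anonymous_gtid_log_event.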
2. Background: Prior to this patch, when binlog_format=row, CREATE...SELECT gets written to the binary log as: BEGIN; CREATE; row events; COMMIT. CREATE...SELECT is not allowed when gtid_mode=on (in fact, not when enforce_gtid_consistency=1).
Problem: Although CREATE without SELECT has an implicit commit, it appears in the middle of a transaction on the slave. Thus, after this worklog and prior to this patch, it gets logged as: Anonymous_gtid_log_event; BEGIN; CREATE; row events; COMMIT. This causes problems on an MTS slave.
Fix: Call mysql_bin_log.commit after writing the CREATE statement.
3. Background: If DROP DATABASE fails after dropping some tables (e.g., if there are extra files in the database directory), then it writes a DROP TABLE statement that lists all the tables that it dropped. If there are many tables, this statement gets long. In this case, the server splits the statement into multiple DROP TABLE statements.
Problem: If this happens when GTID_NEXT='UUID:NUMBER', then there is no way to log this correctly. So we must generate an error and log nothing.
Fix: Introduce a new error code, ER_CANNOT_LOG_FAILED_DROP_DATABASE_WITH_MULTIPLE_STATEMENTS, and generate the error if GTID_NEXT='UUID:NUMBER' and DROP DATABASE needs to be logged as multiple DROP statements.
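The decision can be sketched as below. The exception class stands in for the new error code, and the statement length limit is a made-up parameter, since the text does not state when the server splits the statement.

```python
class CannotLogFailedDropDatabase(Exception):
    """Stands in for ER_CANNOT_LOG_FAILED_DROP_DATABASE_WITH_MULTIPLE_STATEMENTS."""


def drop_statements(dropped_tables, max_len):
    """Split the dropped tables over DROP TABLE statements.

    max_len is a hypothetical limit on the statement length.
    """
    statements, current = [], []
    for table in dropped_tables:
        candidate = "DROP TABLE " + ", ".join(current + [table])
        if current and len(candidate) > max_len:
            # Adding this table would make the statement too long.
            statements.append("DROP TABLE " + ", ".join(current))
            current = [table]
        else:
            current.append(table)
    if current:
        statements.append("DROP TABLE " + ", ".join(current))
    return statements


def log_failed_drop_database(dropped_tables, gtid_next, max_len=64):
    """Return the statements to log, or raise if they cannot be logged."""
    statements = drop_statements(dropped_tables, max_len)
    if len(statements) > 1 and gtid_next not in ("AUTOMATIC", "ANONYMOUS"):
        # GTID_NEXT='UUID:NUMBER': one GTID cannot cover several statements.
        raise CannotLogFailedDropDatabase(gtid_next)
    return statements
```

With GTID_NEXT set to 'AUTOMATIC' or 'ANONYMOUS', each statement can get its own per-transaction event, so no error is needed in those cases.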
4. Background: OPTIMIZE/REPAIR/ANALYZE/CHECKSUM TABLE are written to the binary log even if they fail, after having called trans_rollback.
Problem: trans_rollback calls gtid_state::update_on_rollback, which normally releases GTID ownership. But we must not release ownership before writing to the binary log.
Fix: This was already fixed for the case gtid_mode=on; for that case we set a special flag in the THD object which tells gtid_state::update_on_rollback to not release ownership. Now we need to fix the case gtid_mode=off, so we set the flag in this case too.
5. Background: CREATE TEMPORARY and DROP TEMPORARY behave very strangely. If executed outside transactional context, they behave as DDL: they get logged without BEGIN...COMMIT and cannot be rolled back. If executed in transactional context, they behave as non-transactional DML: they get logged inside BEGIN...COMMIT, leave the transactional context open, but cannot be rolled back. Before this patch, CREATE TEMPORARY and DROP TEMPORARY call gtid_end_transaction unconditionally.
Problem: gtid_end_transaction ends the transactional context and releases ownership. This was not a problem before WL#7592 since gtid_end_transaction could only be called when gtid_mode=on, and when gtid_mode=on we disallow CREATE TEMPORARY and DROP TEMPORARY inside transactional context. However, after WL#7592, we call gtid_end_transaction also when gtid_mode=off, and gtid_end_transaction releases anonymous ownership.
Fix: Do not call gtid_end_transaction for CREATE TEMPORARY and DROP TEMPORARY inside transactional context.
6. Background: When a client that has open temporary tables disconnects, the temporary tables are dropped and DROP TEMPORARY is written to the binary log.
Problem: After WL#7592 and before this patch, if a client disconnects when GTID_NEXT='ANONYMOUS', the client would not hold anonymous ownership when writing to the binary log, which would trigger an assertion in write_gtid.
There was no problem when GTID_NEXT='UUID:NUMBER', since this case was taken care of already before WL#7592. In this case, we set GTID_NEXT='AUTOMATIC' before dropping any tables.
Fix: Set GTID_NEXT='AUTOMATIC' regardless of GTID_MODE.