WL#11123: Group Replication: hold reads and writes when the new primary has replication backlog to apply

Affects: Server-8.0 — Status: Complete

Executive Summary

This worklog implements a fencing mechanism for the promotion of a new primary in Group Replication (GR). The fence holds connections that try to read from or write to the new primary until it has applied the entire backlog of changes received from the old primary. As a result, applications do not read stale data; in exchange, their reads and writes are delayed for a short period of time (during the new primary promotion).

Motivation

There is an expectation that if an application reads from the primary, it will always read its own writes. This expectation can even be subconsciously reinforced when GR is deployed in single-primary mode.

However, on primary fail-over, the new primary will not reject or hold reads after being promoted. Instead, it will turn on conflict detection while the backlog is being applied. This means that if the application wrote A to the previous primary and then reads A from the new primary, it may see a stale value (i.e., miss its previous write, because it is still in the backlog to be applied).

For example:

session1 @ server1 (primary) : W(a) C
server1 fails, server2 is promoted and is now the new primary
server2 is still applying the backlog (hence W(a) has not been applied yet to server2)
router fails over application to server2 (new primary)
session1 @ server2 (new primary) : R(a) C
Application reads an old version of "a", even though it was always connected to the primary (!)

A GR user does not expect this behavior: the primary member shall always have the most recent data from the group.

NOTES:
- In async master-slave replication there is no provision in the system to prevent anything like this, meaning that MySQL users have always faced this issue and they choose how to deal with it. The same techniques that users use today in async master-slave promotion can be applied to GR single-primary promotion;
- The current GR behavior is already better than the case in async master-slave promotions;
- But GR is raising the bar, and expectations are higher when it comes to synchrony and consistency;
- The user/middleware/router can hold incoming reads/writes to the new primary until it finishes applying the backlog, or we can do it in the server itself.

BIG FAT NOTE: This is not about implementing read your writes consistency, but rather making sure that the following simple and straightforward assumption holds: if one reads from the primary, at any point in time, one will always read one's own writes.

Potential Solutions

There are three approaches to fix this (though two of them are pretty much the same):

1. Fix it in the router, so that it fences the new primary until the backlog is flushed;
2. Fix it in GR, so that the plugin rejects incoming traffic until its backlog is flushed;
3. Fix it in GR, so that the plugin holds incoming traffic until its backlog is flushed.

NOTES:
- Something similar to #3 will have to be implemented for "WL#10379: Group Replication: consistent reads";
- #1 is valid only for InnoDB Cluster setups, i.e., when MySQL Router is deployed;
- #2 is just a variation of #3. #1 vs #2,#3 is really what the decision boils down to.

Chosen Solution

Incoming traffic to a member that is being elected as primary will be put on hold until the member has applied all of its backlog. After that, the held statements resume and retrieve the most recent data from the group.

The option can be enabled at GLOBAL or SESSION scope. The goal is to allow creating a session that can perform operations without affecting other sessions that are on hold until the backlog is applied.

Functional Requirements

FR1: In single-primary mode, if an application reads from the primary, it must always read the most up-to-date snapshot.
FR2: Any write statement, listed in the HLD, SHALL be held on the primary when: (a) the primary is being promoted; (b) the option to wait for the backlog to be installed is set; and (c) there is replication backlog to apply.
FR3: Any read statement, listed in the HLD, SHALL be held on the primary when: (a) the primary is being promoted; (b) the option to wait for the backlog to be installed is set; and (c) there is replication backlog to apply.
FR4: Any write statement that is on hold waiting for the backlog to be applied SHALL be released, and thus allowed to resume the execution, once the replication backlog has been applied.
FR5: Any read statement that is on hold waiting for the backlog to be applied SHALL be released, and thus allowed to resume the execution, once the replication backlog has been applied.
FR6: There will be a server option to specify the consistency of primary failover, which can be set at GLOBAL or SESSION scope.
FR7: A group action executed to elect a new primary shall hold all reads and writes while the backlog is being applied.
FR8: If a server that is being promoted and is applying its backlog is stopped, either by STOP GROUP_REPLICATION or by server shutdown, the statements on hold will resume and return error ER_GR_KILLED to the client session.
FR9: The system variable wait_timeout sets the maximum time a statement can be on hold. When the time on hold exceeds wait_timeout, the statement will abort and return error ER_GR_HOLD_WAIT_TIMEOUT to the client session.
FR10: If a server that is being elected as the group primary member faces an issue and moves to ERROR state while applying its replication backlog, then the statements listed in the HLD SHALL be aborted with ER_GR_HOLD_MEMBER_STATUS_ERROR. The hook will return an error for all connections until STOP GROUP_REPLICATION is executed.
FR11: Setting group_replication_consistency to BEFORE_ON_PRIMARY_FAILOVER has no impact in multi-primary mode.
FR12: This WL has no impact on the SECONDARIES when group_replication_consistency=BEFORE_ON_PRIMARY_FAILOVER.
FR13: A group of statements that belong to the same transaction shall run the hook only on the first statement, preventing multiple calls to the plugin for the same transaction.
FR14: While a statement is on hold, the stage shown in the process list for its connection shall be "Executing hook on transaction begin".

Note: Even when there is no backlog to apply, the check is still performed and transactions are held until it completes. In practice this means that an empty backlog is still treated as a valid backlog for the functional requirements above.

Non-Functional Requirements

NFR1: This WL must have no impact on transaction execution performance when no single primary election is being executed.

Summary of the approach

A hook was added on the server side that is triggered when a statement begins.

On the server side we validate whether the command is eligible to be put on hold; if so, the hook is called and, consequently, its observers.

If the GR plugin observer is installed, the statement is intercepted and put on hold if a primary election is ongoing. When no observer is registered, no action is taken, which minimizes the performance impact.

The server won't fence the following statements:

- Statements that don't modify data, i.e., the command doesn't use stored routines and is one of:
  - SHOW
  - SET OPTION
  - DO
  - EMPTY
  - USE
- SELECT commands on the performance_schema database
- SELECT commands on the PROCESSLIST table of the information_schema database
- SELECT commands on the sys database
- SELECT commands that don't use tables
- SELECT commands that don't execute user-defined functions
- STOP GROUP_REPLICATION command
- SHUTDOWN command
- RESET PERSIST command

After the server validation, if the command needs to be fenced, the hook is called.
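The server-side filter described above can be sketched roughly as follows. This is a minimal illustration, not the code in sql/rpl_handler.cc: the types `Command`, `TableRef`, `Statement` and the function `needs_fencing_hook` are hypothetical names invented here. The sketch also assumes that a SELECT is exempt only when it calls no user-defined function and either uses no tables or uses only exempt tables; the list above does not spell out how the SELECT rules combine.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified command kinds; the server's real enum is larger.
enum class Command {
  Show, SetOption, Do, Empty, Use, Select,
  StopGroupReplication, Shutdown, ResetPersist, Other
};

struct TableRef {
  std::string schema;
  std::string table;
};

// Hypothetical summary of a parsed statement.
struct Statement {
  Command command = Command::Other;
  bool uses_stored_routines = false;
  bool calls_udf = false;
  std::vector<TableRef> tables;
};

// Tables whose SELECTs are exempt (exact, case-sensitive match for brevity).
static bool is_exempt_table(const TableRef &t) {
  if (t.schema == "performance_schema" || t.schema == "sys") return true;
  return t.schema == "information_schema" && t.table == "PROCESSLIST";
}

// Returns true when the statement is a candidate for fencing, i.e. when the
// hook should be called.
bool needs_fencing_hook(const Statement &s) {
  switch (s.command) {
    case Command::Show:
    case Command::SetOption:
    case Command::Do:
    case Command::Empty:
    case Command::Use:
      // Exempt only when no stored routine is involved.
      return s.uses_stored_routines;
    case Command::StopGroupReplication:
    case Command::Shutdown:
    case Command::ResetPersist:
      return false;
    case Command::Select:
      // Assumption: a SELECT that calls a UDF is always a candidate.
      if (s.calls_udf) return true;
      for (const TableRef &t : s.tables)
        if (!is_exempt_table(t)) return true;
      return false;  // no tables, or only exempt tables
    default:
      return true;  // everything else may read or modify user data
  }
}

// Usage example: a plain SHOW is never handed to the hook.
void example_filter() {
  Statement show;
  show.command = Command::Show;
  assert(!needs_fencing_hook(show));
}
```

Keeping this check on the server side means the plugin is consulted only for statements that could observe or change user data, which fits the goal of minimizing the cost of the hook.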

The Group Replication observer will verify that:

group_replication_single_primary_mode is set
the server is the primary
group_replication_consistency is set with BEFORE_ON_PRIMARY_FAILOVER option
the transaction did not arrive through the group_replication_applier or group_replication_recovery channels; those transactions are never held, since they are already under group control
backlog is being applied
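As a rough illustration, the observer's checks can be read as a single predicate. The sketch below is not the plugin's code: `HoldContext`, `Consistency`, `is_gr_channel` and `should_hold` are hypothetical names, and the two channel names are Group Replication's applier and recovery channels.

```cpp
#include <cassert>
#include <string>

// Hypothetical mirror of the group_replication_consistency values.
enum class Consistency { Eventual, BeforeOnPrimaryFailover };

// Hypothetical snapshot of the state the observer inspects.
struct HoldContext {
  bool single_primary_mode;    // group_replication_single_primary_mode
  bool is_primary;             // this member is the (new) primary
  Consistency consistency;     // effective value for the session
  std::string channel;         // empty for an ordinary client session
  bool backlog_being_applied;  // election backlog not yet applied
};

// Transactions on these channels are already under group control.
bool is_gr_channel(const std::string &channel) {
  return channel == "group_replication_applier" ||
         channel == "group_replication_recovery";
}

// Returns true when the statement must be put on hold.
bool should_hold(const HoldContext &ctx) {
  if (!ctx.single_primary_mode) return false;
  if (!ctx.is_primary) return false;
  if (ctx.consistency != Consistency::BeforeOnPrimaryFailover) return false;
  if (is_gr_channel(ctx.channel)) return false;
  return ctx.backlog_being_applied;
}

// Usage example: a client session on the new primary is held.
void example_observer() {
  HoldContext ctx{true, true, Consistency::BeforeOnPrimaryFailover, "", true};
  assert(should_hold(ctx));
}
```

Every condition must hold for a statement to be fenced, so a session running with EVENTUAL, or any statement on a secondary, passes straight through.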

Once the new primary has applied the whole backlog, it signals all statements on hold, unblocking the transactions.

If the primary being elected is stopped while applying the backlog, the held statements are released by returning an error to the client session.

A statement on hold for longer than the wait_timeout [1] value will abort and return an error to the client. The timeout prevents client statements from being on hold indefinitely.

[1] - https://dev.mysql.com/doc/refman/8.0/en/server-system-variables.html#sysvar_wait_timeout

Security context

From the point of view of a malicious attack on the group: if someone is able to prevent a secondary from applying its backlog while it is being promoted, statements on hold will be resumed with a timeout error until the attack is discovered.

Upgrade/downgrade and cross-version replication

The worklog doesn't have any impact on upgrades or cross-version replication.

This mechanism depends solely on whether the new primary server has this feature; for that reason, no cross-version issues exist.

User interface

This worklog adds a server option that specifies which data consistency guarantees the user wants when a primary failover happens.

name: group_replication_consistency
values: { EVENTUAL, BEFORE_ON_PRIMARY_FAILOVER }
default: EVENTUAL
scope: global, session
dynamic: yes

BEFORE_ON_PRIMARY_FAILOVER means that new queries to a newly elected primary that is still applying backlog from the old primary will be held before execution until the backlog is applied.

EVENTUAL allows reads on the new primary to execute even when the backlog has not been applied yet, so the client may be returned old values. Writes to the new primary will fail because super read only mode is enabled.

Observability

The state "Executing hook on transaction begin" is displayed in INFORMATION_SCHEMA.PROCESSLIST so that the user can see whether a transaction is on hold.

Deployment and installation

The worklog won't introduce any modification to plugin deployment or installation.

Protocol

This worklog doesn't introduce any change to protocols.

Failure model specification

When a group is electing a new primary and the BEFORE_ON_PRIMARY_FAILOVER option is enabled, held statements resume only after the new primary has applied all of its backlog.

A statement on hold can be aborted, returning an error to the client, before the backlog has been fully applied in two cases:

- the Group Replication plugin is stopped (including on server restart and shutdown)
- the statement has been on hold for longer than the wait_timeout system variable

If the member changes its state to ERROR during the election, all commands will be aborted until STOP GROUP_REPLICATION is invoked.

Summary of Changes

Server core changes

Add a new hook call on the server side whenever a statement starts, more precisely when it is being parsed.
Create a new 'begin' hook function in rpl_handler.cc that validates the SQL command and its tables.

Group Replication changes

Add a new server variable, group_replication_consistency, to select the behavior when the primary is being promoted.
Create a new transaction observer hook that calls the observers registered in the group transaction observation manager.
Add plugin functions that subscribe and unsubscribe the observation manager observer when a primary election occurs or the plugin is stopping.
Create a class to hold statements while the backlog has not been applied during a primary election.

Sequence Diagram

When a statement is called, it will be intercepted:

sql/sql_parse.cc
  mysql_execute_command()
    |- server code
    ...
    |- launch_hook_trans_begin() @ sql/rpl_handler.cc
    `- server code

rpl_handler validates the SQL command and the tables:

sql/rpl_handler.cc
  launch_hook_trans_begin()
    `- if (server validations)
         group_replication_trans_begin() @ plugin/group_replication/src/observer_trans.cc

The function group_replication_trans_begin validates if group replication is running and if there are group transaction observation manager observers registered.

plugin/group_replication/src/observer_trans.cc
  `- group_replication_trans_begin()
       if (plugin is running && observers registered)
         before_transaction_begin() @ plugin/group_replication/src/plugin_handlers/primary_election_handler.cc

When a primary election is occurring the observer is registered and the hold of transactions is enabled.

plugin/group_replication/src/plugin_handlers/primary_election_invocation_handler.cc
  `- internal_primary_election()
       hold_transactions->enable();
       register_transaction_observer();

When a primary election finishes, we resume the held transactions and unregister the transaction observer. The observer can also be unregistered when STOP GROUP_REPLICATION is executed.

plugin/group_replication/src/plugin_handlers/primary_election_primary_process.cc
  `- primary_election_process_handler()
       hold_transactions->disable();
       unregister_transaction_observer();

The hold mechanism was implemented in a new class, Hold_transactions, in plugin/group_replication/src/hold_transaction.cc.

The methods are:

enable: start holding transactions
disable: stop holding transactions and signal the held transactions to resume
wait_until_primary_failover_complete: if the hold mechanism is enabled, the transaction is held until the backlog is applied or an error condition occurs.
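The three methods map naturally onto a mutex and a condition variable. The sketch below is one possible shape under stated assumptions, not the plugin's Hold_transactions class: the class name `HoldTransactionsSketch`, the `HoldResult` enum, the `abort_waiters` method (standing in for a plugin stop) and the timeout parameter (standing in for wait_timeout) are all hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical outcome of a wait; the real plugin reports error codes.
enum class HoldResult { Released, Timeout, Killed };

class HoldTransactionsSketch {
 public:
  // Start holding transactions (a primary election has begun).
  void enable() {
    std::lock_guard<std::mutex> guard(mutex_);
    holding_ = true;
  }

  // Stop holding and wake every waiter (the backlog has been applied).
  void disable() {
    {
      std::lock_guard<std::mutex> guard(mutex_);
      holding_ = false;
    }
    cond_.notify_all();
  }

  // Hypothetical: release waiters with an error, e.g. when the plugin stops.
  void abort_waiters() {
    {
      std::lock_guard<std::mutex> guard(mutex_);
      killed_ = true;
    }
    cond_.notify_all();
  }

  // Block while the hold is enabled, up to `timeout`.
  HoldResult wait_until_primary_failover_complete(
      std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lock(mutex_);
    // wait_for returns the predicate's value once it stops waiting.
    bool done = cond_.wait_for(lock, timeout,
                               [this] { return !holding_ || killed_; });
    if (killed_) return HoldResult::Killed;
    if (!done) return HoldResult::Timeout;
    return HoldResult::Released;
  }

 private:
  std::mutex mutex_;
  std::condition_variable cond_;
  bool holding_ = false;
  bool killed_ = false;
};

// Usage example: once the hold is disabled, waiters return immediately.
void example_hold() {
  HoldTransactionsSketch hold;
  hold.enable();
  hold.disable();
  assert(hold.wait_until_primary_failover_complete(
             std::chrono::milliseconds(10)) == HoldResult::Released);
}
```

Using a predicate with the timed wait guards against spurious wakeups, and a single notify_all on disable releases every held statement at once.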