WL#11123: Group Replication: hold reads and writes when the new primary has replication backlog to apply

Affects: Server-8.0   —   Status: Complete   —   Priority: Medium

Executive Summary
=================

This worklog implements a fencing mechanism for when a new primary is being
promoted in Group Replication (GR). The fencing holds connections that try to
write to *or* read from the new primary until it has applied all the pending
backlog of changes that came from the old primary.
As a result, for a short period of time (during the new primary promotion),
applications can neither read stale data nor perform writes.

Motivation
==========

There is an expectation that if an application reads from the primary, it
will always read its own writes. This expectation is reinforced, even if only
implicitly, when GR is deployed in single-primary mode.

However, on primary fail-over, the new primary will not reject or hold
reads after being promoted. Instead, it will turn on conflict detection while
the backlog is being applied. This means that if the application wrote
A to the previous primary and then reads A from the new primary, it may
see a stale value (i.e., miss its previous write, because it is still in
the backlog to be applied).

For example:

1. session1 @ server1 (primary) : W(a) C
2. server1 fails, server2 is promoted and is now the new primary
3. server2 is still applying the backlog (hence W(a) has not been applied
   yet to server2)
4. router fails over application to server2 (new primary)
5. session1 @ server2 (new primary) : R(a) C
6. Application reads an old version of "a", even though it was always
   connected to the primary (!)

GR users do not expect this behavior: the primary member should always have
the most recent data from the group.

NOTES
- In async master-slave replication there is no provision in the system
  to prevent anything like this -- meaning that MySQL users have always
  faced this issue and they choose how to deal with it. The same techniques
  that users use today in async master-slave promotion can be applied
  to GR single primary promotion;
- And the current GR behavior is already better than the case in async
  master-slave promotions;
- But, GR is raising the bar and expectations are higher when it comes
  to synchrony and consistency;
- The user/middleware/router can hold incoming read/writes to the new
  primary until it finishes applying the backlog or we can do it in the
  server itself.

BIG FAT NOTE: This is not about implementing read your writes consistency,
but rather making sure that the following simple and straightforward
assumption holds: if one reads from the primary, at any point in time, one
will always read one's own writes.

Potential Solutions
===================

There are three approaches to fix this (though 2 of them are pretty
much the same):

1. Fix it in the router, so that it fences the new primary until the backlog
   is flushed;
2. Fix it in GR, so that the plugin **rejects** incoming traffic, until its
   backlog is flushed;
3. Fix it in GR, so that the plugin **holds** incoming traffic, until its
   backlog is flushed.

NOTES:
- Something similar to #3 will have to be implemented for "WL#10379: Group
  Replication: consistent reads";
- #1 is valid only for InnoDB Cluster setups, i.e., when MySQL Router is
  deployed;
- #2 is just a variation of #3. #1 vs #2,#3 is really what the decision boils
  down to.

Chosen Solution
===============

Incoming traffic to a member that is being elected as primary will be put on
hold until the member applies all of its backlog. After that, the traffic
resumes and retrieves the most recent data from the group.

The option can be enabled with GLOBAL or SESSION scope. The goal is to allow
the creation of a session that can perform operations without affecting other
sessions that are on hold until the backlog is applied.


Functional Requirements
=======================

- FR1: In single-primary mode, if an application reads from the primary, it must
  always read the most up to date snapshot.

- FR2: Any write statement, listed in the HLD, SHALL be held on the primary
  when: (a) the primary is being promoted; (b) the option to wait for the
  backlog to be installed is set; and (c) there is replication backlog to
  apply.

- FR3: Any read statement, listed in the HLD, SHALL be held on the primary
  when: (a) the primary is being promoted; (b) the option to wait for the
  backlog to be installed is set; and (c) there is replication backlog to
  apply.

- FR4: Any write statement that is on hold waiting for the backlog
  to be applied SHALL be released, and thus allowed to resume
  the execution, once the replication backlog has been applied.

- FR5: Any read statement that is on hold waiting for the backlog
  to be applied SHALL be released, and thus allowed to resume
  the execution, once the replication backlog has been applied.

- FR6: There will be a server option to specify the consistency of primary
  failover, which can be set with GLOBAL or SESSION scope.

- FR7: A group action executed to elect a new primary shall hold all reads and
  writes while the backlog is being applied.

- FR8: If a server that is being promoted and is applying backlog is stopped,
  either by STOP GROUP_REPLICATION or by server shutdown, the statements on
  hold will be released and will throw error ER_GR_KILLED to the client
  session.

- FR9: The system variable wait_timeout will be used to limit the time a
  statement can be on hold. When the time on hold exceeds wait_timeout, the
  statement will abort and throw error ER_GR_HOLD_WAIT_TIMEOUT to the client
  session.

- FR10: If a server that is being elected as the group primary member faces an
  issue and moves to ERROR state while applying its replication backlog, then
  the statements listed in the HLD SHALL be aborted with
  ER_GR_HOLD_MEMBER_STATUS_ERROR. The hook will return an error to all
  connections until STOP GROUP_REPLICATION is executed.

- FR11: There is no impact of setting group_replication_consistency as
  BEFORE_ON_PRIMARY_FAILOVER in Multi Primary Mode

- FR12: There is no impact due to this WL on the SECONDARIES when
  group_replication_consistency=BEFORE_ON_PRIMARY_FAILOVER

- FR13: A group of statements that belong to the same transaction shall run the
  hook only on the first statement, preventing multiple calls to the plugin for
  the same transaction.

- FR14: While a statement is on hold, the stage shown in the process list for
  its connection shall be "Executing hook on transaction begin".

Note:
Even when there is no backlog to apply, the check is still performed and
transactions are held until it completes.
In practice this means that an empty backlog is still considered a valid
backlog for the above functional requirements.

Non-Functional Requirements
===========================

- NFR1: This WL must have no impact on transaction execution performance when
  no single primary election is being executed.

Summary of the approach
=======================

A hook was added on the server side, triggered when a statement begins.

On the server side we validate whether the command is eligible to be put on
hold; if so, the hook is called and, as a consequence, so are its observers.

If the GR plugin observer is installed, the statement is intercepted and put on
hold if a primary election is ongoing.
When no observer is registered, no action is taken, which minimizes the
performance impact.
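
The observer dispatch described above can be sketched as follows. This is only
an illustration under stated assumptions: the class name, its methods and the
integer handles are hypothetical and do not reproduce the real interface in
sql/rpl_handler.cc. It shows the fast path taken when no observer is
registered, and that an observer error stops the statement.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <utility>

// Hypothetical stand-in for the server-side hook. Not thread-safe; a real
// implementation would need to protect the observer list.
class TransBeginHookSketch {
 public:
  using Observer = std::function<int()>;  // returns 0 on success

  // Registers an observer and returns a handle used to unregister it.
  int register_observer(Observer observer) {
    const int id = next_id_++;
    observers_[id] = std::move(observer);
    return id;
  }

  void unregister_observer(int id) { observers_.erase(id); }

  // Called when a statement begins. Returns 0 when the statement may proceed.
  int run() const {
    if (observers_.empty()) return 0;  // fast path: nothing to call
    for (const auto &entry : observers_) {
      const int error = entry.second();
      if (error != 0) return error;  // first failing observer aborts
    }
    return 0;
  }

 private:
  std::map<int, Observer> observers_;
  int next_id_ = 0;
};
```

In this sketch the plugin would register its observer when a primary election
starts and unregister it when the election finishes or the plugin stops.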

The server won't fence the following statements:

* commands that don't modify data, that is, commands that don't use stored
  routines and are one of:
   - SHOW
   - SET OPTION
   - DO
   - EMPTY
   - USE
* SELECT commands on the performance_schema database
* SELECT commands on the PROCESSLIST table of the INFORMATION_SCHEMA database
* SELECT commands on the sys database
* SELECT commands that don't use tables
* SELECT commands that don't execute user defined functions
* the STOP GROUP_REPLICATION command
* the SHUTDOWN command
* RESET PERSIST

After the server validation, if the command needs to be fenced, the hook is
called.
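
The server-side validation can be pictured with the sketch below. All types
and names are hypothetical and exist only to mirror the exemption list. Two
interpretations are assumptions, not statements of the worklog: a SELECT is
exempt through its tables only when every referenced table is exempt, and the
"no tables" and "no user defined functions" items are treated as one combined
condition.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical command classification; not the server's enum_sql_command.
enum class Command {
  kSelect, kShow, kSetOption, kDo, kEmpty, kUse,
  kStopGroupReplication, kShutdown, kResetPersist,
  kOther  // any other statement, for example INSERT or UPDATE
};

struct TableRef {
  std::string schema;  // assumed to be normalized to lower case
  std::string table;   // assumed to be normalized to lower case
};

struct Statement {
  Command command;
  bool uses_stored_routines;
  bool calls_udf;
  std::vector<TableRef> tables;
};

bool is_exempt_table(const TableRef &t) {
  if (t.schema == "performance_schema" || t.schema == "sys") return true;
  return t.schema == "information_schema" && t.table == "processlist";
}

// Returns true when the statement has to go through the hook.
bool needs_fencing(const Statement &s) {
  switch (s.command) {
    case Command::kShow:
    case Command::kSetOption:
    case Command::kDo:
    case Command::kEmpty:
    case Command::kUse:
      return s.uses_stored_routines;  // exempt only without stored routines
    case Command::kStopGroupReplication:
    case Command::kShutdown:
    case Command::kResetPersist:
      return false;
    case Command::kSelect:
      // Assumption: no tables and no UDF call means nothing stale to read.
      if (s.tables.empty()) return s.calls_udf;
      return !std::all_of(s.tables.begin(), s.tables.end(), is_exempt_table);
    case Command::kOther:
      return true;
  }
  return true;  // not reached; keeps compilers quiet
}
```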

The Group Replication observer will verify that:

- group_replication_single_primary_mode is set
- the server is the primary
- group_replication_consistency is set to BEFORE_ON_PRIMARY_FAILOVER
- the transaction did not arrive through the group_replication_applier or
  group_replication_recovery channels; those transactions are never held,
  since they are already under group control
- backlog is being applied
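
The checks above amount to a single predicate, sketched here. The struct and
function names are hypothetical. The sketch also assumes that the two internal
channels are group_replication_applier and group_replication_recovery, and
that an ordinary client session carries an empty channel name.

```cpp
#include <cassert>
#include <string>

enum class Consistency { kEventual, kBeforeOnPrimaryFailover };

// Hypothetical snapshot of the member state seen by the observer.
struct MemberState {
  bool single_primary_mode;
  bool is_primary;
  bool backlog_being_applied;
};

// Hypothetical per-session data; channel is empty for client sessions.
struct SessionInfo {
  Consistency consistency;
  std::string channel;
};

bool is_group_channel(const std::string &channel) {
  return channel == "group_replication_applier" ||
         channel == "group_replication_recovery";
}

// Returns true only when every condition in the list holds.
bool should_hold(const MemberState &member, const SessionInfo &session) {
  if (!member.single_primary_mode) return false;
  if (!member.is_primary) return false;
  if (session.consistency != Consistency::kBeforeOnPrimaryFailover) return false;
  if (is_group_channel(session.channel)) return false;  // already under group control
  return member.backlog_being_applied;
}
```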

Once the new primary has applied all of the backlog, it signals all statements
on hold, releasing the transactions.

If the primary being elected is stopped while applying the backlog, the held
statements are released by throwing an error to the client session.

A statement on hold for longer than the wait_timeout [1] value will abort and
return an error to the client. The timeout prevents a client statement from
being on hold indefinitely.

[1] - https://dev.mysql.com/doc/refman/8.0/en/server-system-variables.html#sysvar_wait_timeout

Security context
================

From the point of view of a malicious attack on the group: if someone is able
to prevent a secondary from applying its backlog while it is being promoted,
held statements will be released with a timeout error until the attack is
discovered.

Upgrade/downgrade and cross-version replication
===============================================

The worklog does not have any impact on upgrades or cross-version replication.

This mechanism depends solely on whether the new primary server has this
feature, and for that reason no cross-version issues exist.

User interface
==============

This worklog adds a server option that specifies which data consistency
guarantees the user wants when a primary failover happens.

 - name: group_replication_consistency
 - values: { EVENTUAL, BEFORE_ON_PRIMARY_FAILOVER }
 - default: EVENTUAL
 - scope: global, session
 - dynamic: yes

BEFORE_ON_PRIMARY_FAILOVER means that new queries to a newly elected primary
that is still applying backlog from the old primary will be held before
execution until the backlog is applied.

EVENTUAL allows reads to execute on the new primary even when the backlog has
not been applied yet, so the client may get old values. Writes to the new
primary will fail because super read only mode is enabled.
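
The difference between the two values, for a statement that reaches the new
primary while backlog is still pending, can be summarized by the small sketch
below. The enum and function names are hypothetical and only restate the two
descriptions above.

```cpp
#include <cassert>

enum class Consistency { kEventual, kBeforeOnPrimaryFailover };
enum class Access { kRead, kWrite };

// Hypothetical outcome labels for a statement that arrives during promotion.
enum class Outcome {
  kHeldUntilBacklogApplied,  // statement waits, then runs on fresh data
  kExecutesMayReadOldData,   // statement runs now and may see old values
  kFailsSuperReadOnly        // statement is rejected while read only
};

Outcome outcome_while_backlog_pending(Consistency consistency, Access access) {
  if (consistency == Consistency::kBeforeOnPrimaryFailover) {
    return Outcome::kHeldUntilBacklogApplied;  // reads and writes alike
  }
  // EVENTUAL: reads go through, writes hit super read only mode.
  return access == Access::kRead ? Outcome::kExecutesMayReadOldData
                                 : Outcome::kFailsSuperReadOnly;
}
```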

Observability
=============

The state "Executing hook on transaction begin" is displayed in
INFORMATION_SCHEMA.PROCESSLIST so that the user can see whether a transaction
is on hold.

Deployment and installation
===========================

The worklog won't introduce any modification on plugin deployment or installation.

Protocol
========

This worklog does not introduce any change to protocols.

Failure model specification
===========================

When a group is electing a new primary, if the option
BEFORE_ON_PRIMARY_FAILOVER is enabled, held statements are resumed only after
all the backlog has been applied.

A statement on hold can be aborted, throwing an error to the client, before
all the backlog is applied in two cases:

- the group replication plugin is stopped (including on server restart and
  shutdown)
- the time the statement has been on hold exceeds the wait_timeout system
  variable

If the member changes to ERROR state during the election, all commands will
be aborted until STOP GROUP_REPLICATION is invoked.

Summary of Changes
==================

Server core changes
-------------------

1. Add a new hook call on the server side whenever a statement starts, more
   precisely when it is being parsed.

2. Create a new 'begin' hook function in rpl_handler.cc that validates the SQL
   command and the tables.

Group Replication changes
-------------------------

1. Add a new server variable, group_replication_consistency, to select the
   behavior when the primary is being promoted.

2. Create a new transaction observer hook that calls the observers registered
   in the group transaction observation manager.

3. Add functions to the plugin that subscribe and unsubscribe the observation
   manager observer when a primary election occurs or the plugin is stopping.

4. Create a class to hold the statements while the backlog has not yet been
   applied during a primary election.

Sequence Diagram
----------------

When a statement is called, it will be intercepted:

```
sql/sql_parse.cc
    mysql_execute_command()
      |- server code ...
      |- launch_hook_trans_begin() @ sql/rpl_handler.cc
      `- server code
```

In rpl_handler, the SQL command and the tables are validated:

```
sql/rpl_handler.cc
    launch_hook_trans_begin()
      `- if (server validations)
            group_replication_trans_begin() @ plugin/group_replication/src/observer_trans.cc
```

The function group_replication_trans_begin validates if group replication is
running and if there are group transaction observation manager observers
registered.

```
plugin/group_replication/src/observer_trans.cc
    `- group_replication_trans_begin()
        if ( plugin is running && observers registered)
           before_transaction_begin() @ plugin/group_replication/src/plugin_handlers/primary_election_handler.cc
```

When a primary election is occurring the observer is registered and the hold of
transactions is enabled.

```
plugin/group_replication/src/plugin_handlers/primary_election_invocation_handler.cc
    `- internal_primary_election()
          hold_transactions->enable();
          register_transaction_observer();
```

When a primary election finishes, we resume the held transactions and
unregister the transaction observer. The observer can also be unregistered
when STOP GROUP_REPLICATION is executed.

```
plugin/group_replication/src/plugin_handlers/primary_election_primary_process.cc
    `- primary_election_process_handler()
          hold_transactions->disable();
          unregister_transaction_observer();
```

The hold mechanism was implemented in a new class, Hold_transactions, at
`plugin/group_replication/src/hold_transaction.cc`.

The methods are:

- enable : start holding transactions

- disable : stop holding transactions and send a signal to resume held
  transactions

- wait_until_primary_failover_complete : if the hold mechanism is enabled, the
  transaction will be held until the backlog is applied or an error condition
  occurs.
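
A minimal sketch of such a class, built on a mutex and a condition variable,
is shown below. It is not the code in hold_transaction.cc. The class name, the
result enum, the abort_waiters method and the explicit timeout parameter are
assumptions made for illustration; in the server the limit comes from
wait_timeout and failures are reported through the ER_GR_* errors named in the
requirements.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical result codes standing in for success, ER_GR_KILLED,
// ER_GR_HOLD_WAIT_TIMEOUT and ER_GR_HOLD_MEMBER_STATUS_ERROR.
enum class HoldResult { kReleased, kKilled, kTimeout, kMemberError };

class HoldTransactionsSketch {
 public:
  // Start holding transactions (primary election begins).
  void enable() {
    std::lock_guard<std::mutex> guard(mutex_);
    holding_ = true;
    abort_reason_ = HoldResult::kReleased;
  }

  // Stop holding and wake every waiter (backlog applied).
  void disable() {
    {
      std::lock_guard<std::mutex> guard(mutex_);
      holding_ = false;
    }
    released_.notify_all();
  }

  // Wake every waiter with an error (plugin stopped or member in ERROR).
  void abort_waiters(HoldResult reason) {
    {
      std::lock_guard<std::mutex> guard(mutex_);
      abort_reason_ = reason;
    }
    released_.notify_all();
  }

  // Blocks while holding is enabled, up to the given timeout.
  HoldResult wait_until_primary_failover_complete(
      std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lock(mutex_);
    const bool woken = released_.wait_for(lock, timeout, [this] {
      return !holding_ || abort_reason_ != HoldResult::kReleased;
    });
    if (!woken) return HoldResult::kTimeout;
    return abort_reason_;  // kReleased unless an abort reason was set
  }

 private:
  std::mutex mutex_;
  std::condition_variable released_;
  bool holding_ = false;
  HoldResult abort_reason_ = HoldResult::kReleased;
};
```

Waiting with a predicate guards against spurious wakeups, and checking the
abort reason before the hold flag lets an error take precedence over a normal
release in this sketch.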