WL#12895: Group Replication: Add OFFLINE_MODE to group_replication_exit_state_action

Affects: Server-8.0   —   Status: Complete

EXECUTIVE SUMMARY
=================

The option group_replication_exit_state_action configures how Group
Replication behaves when a server leaves the group unintentionally:

 1. after encountering an applier error;
 2. after encountering an recovery error;
 3. in the case of a loss of majority;
 4. when another member of the group expels it due to a suspicion
    timing out;
 5. after an error on coordinated group changes[1];
 6. after a primary election error;
 7. after a clone provisioning error;
 8. when automatic rejoin is enable, after its attempts are
    exhausted unsuccessfully.

This behavior can be:

 * READ_ONLY    disable writes on the server, by enabling
                server super_read_only[2] option;
 * ABORT_SERVER shutdown the server.

After this worklog is implemented, a new behavior is added:

 * OFFLINE_MODE close all connections and disallow new ones from
                users who do not have the CONNECTION_ADMIN or SUPER
                privilege, by enabling server offline_mode[3] option.
                OFFLINE_MODE includes the behaviour of READ_ONLY.

This automatic connection closure will avoid user sessions to:
1) read stale data and 2) write data to orphans servers, without the
need to shutdown the server.

Proxy tools, like MySQL Router, will immediately detect when a
connection is closed and update its pool of connections avoiding
directing users request to failed servers.

[1]
https://dev.mysql.com/doc/refman/8.0/en/group-replication-configuring-online-group.html
[2]
https://dev.mysql.com/doc/refman/8.0/en/server-system-variables.html#sysvar_super_read_only
[3]
https://dev.mysql.com/doc/refman/8.0/en/server-system-variables.html#sysvar_offline_mode


USER STORIES
============

- As a system developer using MySQL I want to always read data from
  a MySQL server that is connected through replication so I minimize
  my chances of reading stale data.

- As a MySQL DBA I want my servers to automatically close users
  connections if they (the servers) involuntarily drop out of
  replication, so that other components in my system do not engage
  stale servers.

- As a system builder, I want my system to react (e.g., close
  connections, remove server from the pool of "good" servers, etc)
  whenever a server goes involuntarily offline w.r.t. replication, so
  that I avoid to pro-actively polling the system to figure that out.

- As a proxy tool routing connections to a MySQL server, I want to
  get my connections to stale servers automatically closed, so that
  these connections are evicted from my routing cache automatically.

Functional requirements
=======================
FR1: In the case of applier error, the member will change to ERROR
     state and leave the group, consequently if
     group_replication_exit_state_action= OFFLINE_MODE then the
     plugin will set server offline_mode= ON [1].

FR2: In the case of recovery error, the member will change to ERROR
     state and leave the group, consequently if
     group_replication_exit_state_action= OFFLINE_MODE then the
     plugin will set server offline_mode= ON [1].

FR3: In the case of member expel from the group, the member will
     change to ERROR state, consequently if
     group_replication_exit_state_action= OFFLINE_MODE then the
     plugin will set server offline_mode= ON [1].

FR4: In the case of member is unable to contact a majority of the
     group members, after the timeout specified on
     group_replication_unreachable_majority_timeout[2] option, if
     group_replication_exit_state_action= OFFLINE_MODE then the
     plugin will set server offline_mode= ON [1].

FR5: In the case of error during a coordinated group change, like
     switch from single-primary to multi-primary mode, the member
     will change to ERROR state and leave the group, consequently
     if group_replication_exit_state_action= OFFLINE_MODE then the
     plugin will set server offline_mode= ON [1].

FR6: In the case of error during a primary election, the member
     will change to ERROR state and leave the group, consequently
     if group_replication_exit_state_action= OFFLINE_MODE then the
     plugin will set server offline_mode= ON [1].

FR7: In the case of error during cloning provisioning, the member
     will change to ERROR state and leave the group, consequently
     if group_replication_exit_state_action= OFFLINE_MODE then the
     plugin will set server offline_mode= ON [1].

FR8: When automatic rejoin is enable[3], after its attempts are
     exhausted unsuccessfully, the member will change to ERROR
     state and leave the group, consequently if
     group_replication_exit_state_action= OFFLINE_MODE then the
     plugin will set server offline_mode= ON [1].

FR9: group_replication_exit_state_action= OFFLINE_MODE includes the
     behaviour of group_replication_exit_state_action= READ_ONLY.

[1] https://dev.mysql.com/doc/refman/8.0/en/server-system-variables.html#sysvar_offline_mode
[2] https://dev.mysql.com/doc/refman/8.0/en/group-replication-options.html#sysvar_group_replication_unreachable_majority_timeout
[3] https://dev.mysql.com/doc/refman/8.0/en/group-replication-options.html#sysvar_group_replication_autorejoin_tries


Non-functional requirements
===========================
  None.
DEFINITIONS
===========

super_read_only: server super_read_only option
https://dev.mysql.com/doc/refman/8.0/en/server-system-variables.html#sysvar_super_read_only

offline_mode: server offline_mode option
https://dev.mysql.com/doc/refman/8.0/en/server-system-variables.html#sysvar_offline_mode

READ_ONLY: value of group_replication_exit_state_action option
https://dev.mysql.com/doc/refman/8.0/en/group-replication-options.html#sysvar_group_replication_exit_state_action

OFFLINE_MODE: value of group_replication_exit_state_action option
https://dev.mysql.com/doc/refman/8.0/en/group-replication-options.html#sysvar_group_replication_exit_state_action


SUMMARY OF THE APPROACH
=======================
The DBA can configure how Group Replication behaves when a server
leaves the group unintentionally.
A server leaves the group unintentionally:

 1. after encountering an applier error;
 2. after encountering an recovery error;
 3. in the case of a loss of majority;
 4. when another member of the group expels it due to a suspicion
    timing out;
 5. after an error on coordinated group changes[1];
 6. after a primary election error;
 7. after a clone provisioning error;
 8. when automatic rejoin is enable, after its attempts are
    exhausted unsuccessfully.

The behavior taken depends on the value of the option
group_replication_exit_state_action:

- READ_ONLY
  The default value, switches the server to super read only mode.
  RW (read-write) transactions are disallowed, RO (read-only) are
  allowed.

- ABORT_SERVER
  (WL#11568) Shutdowns the server, thence closing all user
  connections.

- OFFLINE_MODE
  Behavior added on this worklog, closes all connections and
  disallows new ones from users who do not have the
  CONNECTION_ADMIN or SUPER privilege.
  This is done by setting server offline_mode option to ON.
  OFFLINE_MODE includes the behaviour of READ_ONLY, that is,
  it also disallows RW transactions, even from users with
  CONNECTION_ADMIN or SUPER privilege.

The full description of offline_mode can be read at
https://dev.mysql.com/doc/refman/8.0/en/server-system-variables.html#sysvar_offline_mode

[1]
https://dev.mysql.com/doc/refman/8.0/en/group-replication-configuring-online-group.html


USER INTERFACE
==============
The DBA can specify the behaviour by setting the option:

 - name: group_replication_exit_state_action
 - values: { READ_ONLY, ABORT_SERVER, OFFLINE_MODE }
 - default: READ_ONLY
 - scope: global
 - dynamic: yes
 - replicated: no
 - persistable: PERSIST, PERSIST_ONLY
 - credentials: SYSTEM_VARIABLES_ADMIN
 - description: The action that is taken when the server
                leaves the group unintentionally.
                Possible values are READ_ONLY,
                ABORT_SERVER and OFFLINE_MODE.


SECURITY CONTEXT
================
SYSTEM_VARIABLES_ADMIN privilege is required to change the option
group_replication_exit_state_action.

The mysql.session user used internally to access the SQL API
requires SUPER or CONNECTION_ADMIN privilege to avoid having its
connections closed when server offline_mode= ON.

Currently the mysql.session has the SUPER privilege, this worklog
will add the CONNECTION_ADMIN one.


UPGRADE/DOWNGRADE AND CROSS-VERSION REPLICATION
===============================================
There are no repercussions on upgrade scenarios.


OBSERVABILITY
=============
The member state after leaving is ERROR (no change on this worklog),
which can be check on performance_schema.replication_group_members
table:

  SELECT * FROM performance_schema.replication_group_members;

The offline_mode state can be check with:

  SELECT @@GLOBAL.offline_mode;

Existing connections which users do not have CONNECTION_ADMIN or
SUPER privilege will be close and unable to establish new
connections.

DEPLOYMENT AND INSTALLATION
===========================
There are no repercussions.

PROTOCOL
========
There are no repercussions.

FAILURE MODEL SPECIFICATION
===========================
The failure model remains unchanged, this worklog introduces a new
behaviour on how to handle the involuntary leaves.

SUMMARY OF CHANGES
==================

Server core changes
-------------------
- Add CONNECTION_ADMIN privilege to mysql.session user.

Group Replication changes
-------------------------
- Extend the option group_replication_exit_state_action with one
  more value: OFFLINE_MODE.

- When a server leaves involuntary and
  group_replication_exit_state_action=OFFLINE_MODE:
   - set server super_read_only[1] to ON through the SQL API;
   - set server offline_mode[2] to ON through the SQL API;

[1] https://dev.mysql.com/doc/refman/8.0/en/server-system-variables.html#sysvar_super_read_only
[2] https://dev.mysql.com/doc/refman/8.0/en/server-system-variables.html#sysvar_offline_mode