WL#11568: Group Replication: option to shutdown server when dropping out of the group

Affects: Server-8.0 — Status: Complete

EXECUTIVE SUMMARY
=================

This worklog implements an option that lets the user define how a
server behaves once it drops out of the group. The option specifies
whether the server voluntarily shuts itself down or switches itself
to super read only mode instead (the current behavior).

USER STORIES
============

- As a developer using MySQL I want to always read data from a MySQL
  server that is connected to replication so I minimize my chances of
  reading stale data.

- As a MySQL DBA I want my servers to automatically shoot themselves
  in the head if they involuntarily drop out of replication, so that
  other components in my system do not engage stale servers, or
  automatically drop their connections to stale servers, or both.

- As a system builder, I want my system to react (e.g., close
  connections, remove the server from the pool of "good" servers,
  etc.) whenever a server goes involuntarily offline w.r.t.
  replication, so that I do not have to proactively poll the system
  to figure that out.

- As a proxy tool routing connections to a MySQL server, I want to
  get my connections to stale servers automatically closed, so that
  these connections are evicted from my routing cache automatically.

PROBLEMS
========

Issues:

- When a server drops out of the group, it sets itself as
  super-read-only, thus still allowing stale reads.
- When a server is stuck on a minority partition, it is still
  readable, thus still allowing stale reads!
- There is no notification emitted to other parts of the
  infrastructure, not even automatic connection closing when a
  server drops outside of replication.

Users want:

- MySQL Router to kill all open connections to a server that leaves
  the group (setting instance to RO is not sufficient).
- MySQL Router to kill all open connections to a server that is
  stuck on a minority.
- Those who do not use MySQL Router want a more autonomic system where
  the server automatically restricts access if it runs into an
  unrecoverable local error (has gone out of sync).

What automatic shutdown does (side-effects):

- Closes all open connections and prevents client apps from doing
  stale reads or failed writes.

- Allows systemd or a watchdog tool to restart the server (and
  thus rejoin the group automatically).

Functional requirements
=======================
FR1: In the case of applier error, the member will change to ERROR
     state and leave the group, consequently if
     group_replication_exit_state_action= READ_ONLY then the plugin
     will set super_read_only=ON and disallow write operations.

FR2: In the case of applier error, the member will change to ERROR
     state and leave the group, consequently if
     group_replication_exit_state_action= ABORT_SERVER then the
     server process must abort.

FR3: In the case of member expulsion from the group, the member will
     change to ERROR state, consequently if
     group_replication_exit_state_action= READ_ONLY then the plugin
     will set super_read_only=ON and disallow write operations.

FR4: In the case of member expulsion from the group, the member will
     change to ERROR state, consequently if
     group_replication_exit_state_action= ABORT_SERVER then the
     server process must abort.

FR5: In the case that the member is unable to contact a majority of
     the group members, after the timeout specified by the
     group_replication_unreachable_majority_timeout[1] option, if
     group_replication_exit_state_action= READ_ONLY then the plugin
     will set super_read_only=ON and disallow write operations.

FR6: In the case that the member is unable to contact a majority of
     the group members, after the timeout specified by the
     group_replication_unreachable_majority_timeout[1] option, if
     group_replication_exit_state_action= ABORT_SERVER then the
     server process must abort.

FR7: In the case of recovery error, the member will change to ERROR
     state and leave the group, consequently if
     group_replication_exit_state_action= READ_ONLY then the plugin
     will set super_read_only=ON and disallow write operations.

FR8: In the case of recovery error, the member will change to ERROR
     state and leave the group, consequently if
     group_replication_exit_state_action= ABORT_SERVER then the
     server process must abort.

[1]
https://dev.mysql.com/doc/refman/8.0/en/group-replication-options.html#sysvar_group_replication_unreachable_majority_timeout
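
FR1 to FR8 pair each of four triggers with the two option values, and
within each pair only the configured action changes the outcome. A
minimal sketch of that decision table follows. Exit_trigger,
Exit_action, Exit_outcome and resolve_exit are hypothetical names
chosen for illustration, not plugin code; only the values READ_ONLY
and ABORT_SERVER come from this worklog.

```cpp
#include <cassert>

// All names below are hypothetical; only the option values READ_ONLY and
// ABORT_SERVER come from this worklog.
enum class Exit_trigger {
  APPLIER_ERROR,         // FR1, FR2
  MEMBER_EXPELLED,       // FR3, FR4
  MAJORITY_UNREACHABLE,  // FR5, FR6 (after the configured timeout)
  RECOVERY_ERROR         // FR7, FR8
};

enum class Exit_action { READ_ONLY, ABORT_SERVER };

struct Exit_outcome {
  bool super_read_only;  // plugin sets super_read_only=ON, writes disallowed
  bool process_aborts;   // server process must abort
};

// The trigger does not influence the result: every FR pair differs only in
// the configured action.
Exit_outcome resolve_exit(Exit_trigger trigger, Exit_action action) {
  (void)trigger;
  Exit_outcome outcome;
  outcome.process_aborts = (action == Exit_action::ABORT_SERVER);
  outcome.super_read_only = !outcome.process_aborts;
  // Exactly one of the two outcomes applies.
  assert(outcome.super_read_only != outcome.process_aborts);
  return outcome;
}
```

Keeping the trigger as a parameter, even though it is unused here,
leaves room for a per-trigger policy without changing callers.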


Non-functional requirements
===========================
  None.

While we do not have a complete solution, we can
provide a quick solution that would introduce the option:
 - name: group_replication_exit_state_action
 - values: { READ_ONLY, ABORT_SERVER }
 - default: ABORT_SERVER
 - scope: global
 - dynamic: yes

This is similar to binlog_error_action[1]: in MySQL 5.7.7 and later,
that variable defaults to ABORT_SERVER, which makes the server halt
logging and shut down whenever it encounters an error with the binary
log, such as being unable to write to it.

Applying this approach to GR, we will pair
group_replication_exit_state_action with
group_replication_unreachable_majority_timeout[2]
-------------------------------------------------
This would provide the following behaviour:
 group_replication_exit_state_action= ABORT_SERVER
 group_replication_unreachable_majority_timeout= 10 (seconds)

 In the case of applier error, the member must change to ERROR state
 and that would trigger ABORT_SERVER.

 In the case of member expel, the member must change to ERROR state
 and that would trigger ABORT_SERVER.

 In the case of unreachable majority, after the 10 seconds, the member
 must change to ERROR state and that would trigger ABORT_SERVER.
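
The unreachable-majority case above can be sketched as a pure
predicate. This is an illustration under stated assumptions, not the
plugin's logic: has_majority and exit_action_due are hypothetical
helpers, a majority is assumed to mean strictly more than half of the
group counting the member itself, and the comparison against the
timeout is assumed to be inclusive. A timeout of 0 is treated as "wait
forever", which is how the reference manual describes that setting.

```cpp
#include <cassert>

// Hypothetical helper: a member holds a majority when it can reach strictly
// more than half of the group, counting itself.
bool has_majority(unsigned reachable_members, unsigned group_size) {
  assert(reachable_members <= group_size);
  return 2 * reachable_members > group_size;
}

// Hypothetical helper: returns true when the member should move to ERROR
// state and engage the configured exit action.
// timeout_seconds == 0 means "wait forever", so the action never fires.
bool exit_action_due(unsigned reachable_members, unsigned group_size,
                     unsigned seconds_without_majority,
                     unsigned timeout_seconds) {
  if (has_majority(reachable_members, group_size)) return false;
  if (timeout_seconds == 0) return false;
  // Assumption: the action fires once the full timeout has elapsed.
  return seconds_without_majority >= timeout_seconds;
}
```

For the 3-member example with a 10 second timeout, an isolated member
reaches only itself, so exit_action_due(1, 3, 10, 10) is true, while
the two members that still see each other never trigger the action.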

We must not confuse this new fencing ability with the
group_replication_defer_expel option, which only kicks in when a
majority does exist.

Example timeline:

0) Set up a group of 3 members: M1, M2, M3. All members are online at
this point.
  -----------------
  View from member1:
  SELECT MEMBER_ID, MEMBER_STATE FROM
    performance_schema.replication_group_members ORDER BY MEMBER_PORT ASC;
  MEMBER_ID       f05a00b8-1583-11e8-a2e7-0010e0734796
  MEMBER_STATE    ONLINE
  MEMBER_ID       f0539a9c-1583-11e8-aa19-0010e0734796
  MEMBER_STATE    ONLINE
  MEMBER_ID       f0614c0d-1583-11e8-aa49-0010e0734796
  MEMBER_STATE    ONLINE

  -----------------
  View from member2:
  SELECT MEMBER_ID, MEMBER_STATE FROM
    performance_schema.replication_group_members ORDER BY MEMBER_PORT ASC;
  MEMBER_ID       f05a00b8-1583-11e8-a2e7-0010e0734796
  MEMBER_STATE    ONLINE
  MEMBER_ID       f0539a9c-1583-11e8-aa19-0010e0734796
  MEMBER_STATE    ONLINE
  MEMBER_ID       f0614c0d-1583-11e8-aa49-0010e0734796
  MEMBER_STATE    ONLINE

  -----------------
  View from member3:
  SELECT MEMBER_ID, MEMBER_STATE FROM
    performance_schema.replication_group_members ORDER BY MEMBER_PORT ASC;
  MEMBER_ID       f05a00b8-1583-11e8-a2e7-0010e0734796
  MEMBER_STATE    ONLINE
  MEMBER_ID       f0539a9c-1583-11e8-aa19-0010e0734796
  MEMBER_STATE    ONLINE
  MEMBER_ID       f0614c0d-1583-11e8-aa49-0010e0734796
  MEMBER_STATE    ONLINE

1) Due to network issues, M1 becomes unreachable to M2 and M3.
  -----------------
  View from member1:
  SELECT MEMBER_ID, MEMBER_STATE FROM
    performance_schema.replication_group_members ORDER BY MEMBER_PORT ASC;
  MEMBER_ID       f05a00b8-1583-11e8-a2e7-0010e0734796
  MEMBER_STATE    ONLINE
  MEMBER_ID       f0539a9c-1583-11e8-aa19-0010e0734796
  MEMBER_STATE    UNREACHABLE
  MEMBER_ID       f0614c0d-1583-11e8-aa49-0010e0734796
  MEMBER_STATE    UNREACHABLE

  -----------------
  View from member2:
  SELECT MEMBER_ID, MEMBER_STATE FROM
    performance_schema.replication_group_members ORDER BY MEMBER_PORT ASC;
  MEMBER_ID       f05a00b8-1583-11e8-a2e7-0010e0734796
  MEMBER_STATE    UNREACHABLE
  MEMBER_ID       f0539a9c-1583-11e8-aa19-0010e0734796
  MEMBER_STATE    ONLINE
  MEMBER_ID       f0614c0d-1583-11e8-aa49-0010e0734796
  MEMBER_STATE    ONLINE

  -----------------
  View from member3:
  SELECT MEMBER_ID, MEMBER_STATE FROM
    performance_schema.replication_group_members ORDER BY MEMBER_PORT ASC;
  MEMBER_ID       f05a00b8-1583-11e8-a2e7-0010e0734796
  MEMBER_STATE    UNREACHABLE
  MEMBER_ID       f0539a9c-1583-11e8-aa19-0010e0734796
  MEMBER_STATE    ONLINE
  MEMBER_ID       f0614c0d-1583-11e8-aa49-0010e0734796
  MEMBER_STATE    ONLINE


2) M2 and M3 hold a majority. They will wait
   group_replication_defer_expel seconds for M1 to return. M1 does not
   return, so M2 and M3 expel M1 from the group.
  -----------------
  View from member1:
  None, since the process abort()'ed. If the server is automatically restarted
  by some watchdog tool (systemd, for example), it will not be able to rejoin
  the group right away, because its address is still seen as UNREACHABLE by
  the group (i.e. it still belongs to the group, but the group thinks it is
  unreachable). Eventually its membership is dropped and the server will be
  able to join the group again.

  -----------------
  View from member2:
  SELECT MEMBER_ID, MEMBER_STATE FROM
    performance_schema.replication_group_members ORDER BY MEMBER_PORT ASC;
  MEMBER_ID       f0539a9c-1583-11e8-aa19-0010e0734796
  MEMBER_STATE    ONLINE
  MEMBER_ID       f0614c0d-1583-11e8-aa49-0010e0734796
  MEMBER_STATE    ONLINE

  -----------------
  View from member3:
  SELECT MEMBER_ID, MEMBER_STATE FROM
    performance_schema.replication_group_members ORDER BY MEMBER_PORT ASC;
  MEMBER_ID       f0539a9c-1583-11e8-aa19-0010e0734796
  MEMBER_STATE    ONLINE
  MEMBER_ID       f0614c0d-1583-11e8-aa49-0010e0734796
  MEMBER_STATE    ONLINE

group_replication_exit_state_action operates on the M1 side, that is,
on the member that does not hold a majority. Such a member will never
propose or decide to expel any other member.

Since the group_replication_exit_state_action default is ABORT_SERVER, in
the unlikely scenario of a split-brain with
group_replication_unreachable_majority_timeout > 0, all the servers will
abort, which will consequently cause a full group shutdown.
The DBA/operator will then need to set group_replication_bootstrap_group=ON
on one server and bootstrap the group; only then can the other servers join.

Later, when WL#11648 is available, we will change the Group Replication
plugin to notify the server that a critical failure happened and that the
server's configured failure mode must be engaged.

[1]
https://dev.mysql.com/doc/refman/8.0/en/replication-options-binary-log.html#sysvar_binlog_error_action
[2]
https://dev.mysql.com/doc/refman/8.0/en/group-replication-options.html#sysvar_group_replication_unreachable_majority_timeout

SUMMARY OF CHANGES
==================

--------------------------------------------
1. New sysvar
--------------------------------------------
A new plugin sysvar will be added. It will be settable via the command line,
the SET statement or a configuration file. This sysvar shall be named
'group_replication_exit_state_action' and shall be a string whose value
shall be one of { READ_ONLY, ABORT_SERVER } (case-insensitive). The string
value of the variable shall be mapped to the following enum:
enum enum_exit_state_action {
    EXIT_STATE_ACTION_READ_ONLY = 0,
    EXIT_STATE_ACTION_ABORT
};
The default value of the sysvar will be EXIT_STATE_ACTION_ABORT. This changes
the default behaviour completely from previous versions, so we must be careful
when configuring new instances of MySQL GR from now on. Further on, for
testing purposes on MTR, the variable must be set at my.cnf level to
EXIT_STATE_ACTION_READ_ONLY so we don't break the rest of the tests.
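
The case-insensitive mapping from the string value to the enum can be
sketched as below. This is a standalone illustration only:
to_upper_ascii and parse_exit_state_action are hypothetical helpers,
and the plugin may well rely on the server's own sysvar machinery for
this instead of a hand-written parser.

```cpp
#include <cassert>
#include <cctype>
#include <string>

enum enum_exit_state_action {
  EXIT_STATE_ACTION_READ_ONLY = 0,
  EXIT_STATE_ACTION_ABORT
};

// Hypothetical helper: returns an ASCII upper-case copy of the input.
static std::string to_upper_ascii(const std::string &value) {
  std::string upper(value);
  for (char &c : upper)
    c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
  return upper;
}

// Hypothetical parser. On success stores the mapped action and returns true;
// for any unrecognised value returns false and leaves *action untouched.
bool parse_exit_state_action(const std::string &value,
                             enum_exit_state_action *action) {
  assert(action != nullptr);
  const std::string upper = to_upper_ascii(value);
  if (upper == "READ_ONLY") {
    *action = EXIT_STATE_ACTION_READ_ONLY;
    return true;
  }
  if (upper == "ABORT_SERVER") {
    *action = EXIT_STATE_ACTION_ABORT;
    return true;
  }
  return false;
}
```

Rejecting unknown strings, rather than falling back to the default,
keeps a typo in a configuration file from silently selecting
ABORT_SERVER.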

--------------------------------------------
2. Aborting the server when it involuntarily leaves the group
--------------------------------------------
The value of this new variable shall be stored in the global namespace under
the name 'exit_state_action_var' so that all the relevant components can
consume it. These components are Applier_module and Group_partition_handling.

    2.1 Aborting upon member expel or applier error
These scenarios can be handled by checking exit_state_action_var in
Applier_module::kill_pending_transactions(), right after unblocking all pending
transactions on the server that left the group. The unblocking is done first
so that local transactions still waiting for certification can roll back.
If the state is EXIT_STATE_ACTION_ABORT we call abort() right there. Otherwise
we will set the read mode to super_read_only.

    2.2 Aborting upon majority loss
This scenario is handled by checking exit_state_action_var in
Group_partition_handling::kill_transactions_and_leave(), again right after
unblocking all pending transactions on the server that left the group. We do
this for the same reasons as described in scenario 2.1.
If the state is EXIT_STATE_ACTION_ABORT we call abort() right there. Otherwise
we will set the read mode to super_read_only.
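
The ordering shared by 2.1 and 2.2 (unblock pending transactions
first, then either abort or switch to super_read_only) can be sketched
as follows. Exit_hooks and leave_group_on_error are hypothetical names,
and the callbacks stand in for plugin and server internals so that the
flow can be exercised without actually calling abort().

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

enum enum_exit_state_action {
  EXIT_STATE_ACTION_READ_ONLY = 0,
  EXIT_STATE_ACTION_ABORT
};

// Hypothetical hooks standing in for plugin and server internals.
struct Exit_hooks {
  std::function<void()> unblock_pending_transactions;
  std::function<void()> enable_super_read_only;
  std::function<void()> abort_server;  // would be abort() in the server
};

// Hypothetical common path for both call sites described in 2.1 and 2.2.
void leave_group_on_error(enum_exit_state_action action,
                          const Exit_hooks &hooks) {
  assert(hooks.unblock_pending_transactions && hooks.enable_super_read_only &&
         hooks.abort_server);
  // Unblock first, so local transactions waiting for certification can
  // roll back before anything else happens.
  hooks.unblock_pending_transactions();
  if (action == EXIT_STATE_ACTION_ABORT)
    hooks.abort_server();
  else
    hooks.enable_super_read_only();
}
```

Injecting the abort step as a callback is a testing convenience of
this sketch; the worklog itself calls abort() directly at that point.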