WL#12827: Group Replication: Clone plugin integration on distributed recovery

Affects: Server-8.0   —   Status: Complete

EXECUTIVE SUMMARY

When setting up a new server to add it to an existing group, one of the main pain points is provisioning that new server. To improve this experience, this worklog makes use of the MySQL clone plugin, integrating it with Group Replication to make the process less cumbersome.

The end goal of this process is that the user can start the group replication process on a new server and it will automatically clone the data from a donor and get up to speed without further intervention.

User stories

As a MySQL DBA I want a server to quickly join a group without needing previous manual server provisioning.

As a MySQL DBA I want the recovery process to work even when the binlogs were purged on the other group members.

Functional requirements

  • FR1: If the difference in GTIDs between the donor and the joiner is bigger than the defined value for group_replication_clone_threshold, the server shall be cloned when joining.

  • FR2: Even if below the threshold stated in FR1, if there is no server that can provide the missing transactions for the new joiner, because some binlogs were purged, the server shall be cloned when joining.

  • FR3: If cloning is not possible, because the other members do not support it or some error occurred, the server shall fall back to distributed recovery.

  • FR4: If neither cloning nor distributed recovery is possible the member shall leave the group and go into error state executing the configured exit action.

  • FR5: To be picked as a donor, a group member must be online and its version must support cloning (8.0.17 or higher).

  • FR6: According to clone plugin restrictions, when cloning, the donor shall have the same version and operating system as the joining member.

  • FR7: The recovery user will be used for cloning, and so it now also requires the BACKUP_ADMIN privilege on top of the already needed REPLICATION SLAVE.

  • FR8: The authentication parameters for the group replication recovery channel present in the donor shall be preserved in the clone receiver so it can join the group automatically. As such, these authentication parameters must be valid in the joining member for this to work.

  • FR9: When the user stops the group replication process during clone, the clone process shall be terminated with a KILL statement.

  • FR10: A joiner with more data than the group will not be allowed to join (clone will not override joiner data).

  • FR11: If a clone donor leaves the group while cloning, the process will be interrupted and the clone process will restart with a new donor if possible.

  • FR12: If the server is not managed by a supervisor process that can restart it after clone, the server will shut down.

  • FR13: Group replication cannot be started while clone is running.

  • FR14: Clone commands cannot be executed while group replication is running.

  • FR15: When all cloning attempts error out and falling back to distributed recovery is not possible (e.g. the donors do not have the needed data), the member shall leave the group and go into error state, executing the configured exit action.

Non-functional requirements:

1. Basic concepts

  • Group Replication: The group replication plugin, often referred to as GR in this document. When we say GR starts, it means a START GROUP_REPLICATION command was issued and the plugin tried to join the group, and so on. The same applies to other commands.

  • Clone plugin: InnoDB plugin that clones data from one server to another. All previous data in the server that receives the clone is destroyed; all settings not stored in tables are however maintained. Under clone we distinguish:

    • Donor: The member that provides the data to the cloned member.

    • Receiver: The member that receives the cloned data.

  • Distributed recovery: The process where a joiner in a Group Replication group establishes a connection to a donor to get missing data and certification information. This certification information comes in a View Change Log Event (VCLE) associated with the moment the member joined the group.

  • The session (SQL) API: The service that allows plugins to open client connections in the server. When it is said that GR is executing a command, in fact it means the plugin is using this service to open a connection to the server and execute a SQL query.

2. The base execution idea

The main idea of the clone integration is that if you start Group Replication on a new server and the data to transfer is too big, the member will be cloned from another group member. The user shall never need to execute any command besides the initial START GROUP_REPLICATION.

In terms of steps the plugin shall then:

  1. Join the group
  2. Collect GTID information from group members
  3. If the gap in data is too big, or the binlogs for the needed GTIDs are not present, execute the clone process. If not, move to step 6.
    • 3.1. The GR plugin picks a donor from the online valid members
    • 3.2. The GR plugin executes a command to add the donor to the clone plugin valid donor list.
    • 3.3. The GR plugin executes a clone command (SQL) using the recovery credentials.
    • 3.4. The clone plugin clones data from the donor and restarts the server
  4. The plugin starts on boot if configured to do so.
  5. Go back to 1.
  6. Execute the distributed recovery process.
  7. Go to state ONLINE when the new member catches up to the group
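
For illustration, assuming the clone plugin is installed and the usual Group Replication options are already configured, the only command the user runs on the joiner is:

-- On the new server; everything else (clone or distributed recovery,
-- restart, rejoin) happens automatically:
START GROUP_REPLICATION;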

3. When to trigger a clone

The first question when using the clone plugin is: when do we invoke it? The cloning process should be avoided when possible as it implies:

  • The loss of all previous data in the receiver member
  • A server restart
  • The transfer of a possibly large volume of data over the network.

So, when joining, the server will use the following evaluation criteria.

How much data is the server missing

When a member joins the group we have information about the transactions executed in the group members: the servers' GTID executed sets. If we combine that with the information the joining member already contains, we get a picture of the number of missing transactions that need to be transferred.

This is a crude metric, as N transactions can represent 1 megabyte or 1 gigabyte of data, but it is still usually a good estimate of the workload that distributed recovery would be subjected to.

So, given the number of missing transactions, what is the threshold above which we shall invoke the clone process?

That threshold shall be defined by the new variable

name: group_replication_clone_threshold
minimum value: 1
maximum value: MAX_GNO (LLONG_MAX = 9223372036854775807 = 2^63 - 1)
default: MAX_GNO
scope: global
dynamic: yes
replicated: no
persistable: PERSIST, PERSIST_ONLY
credentials: SYSTEM_VARIABLES_ADMIN
description: The transaction number gap between servers above which the
             clone plugin is chosen for recovery.

The default value means that clone will be avoided unless the user configures the threshold otherwise. The main reason is, as said above, the cost of cloning. Given that the maximum value virtually disables the clone process, we do not allow the value 0, as it could lead to confusion (do not clone versus always clone).
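
For example, to make a gap of 10000 or more missing transactions trigger a clone (the value is illustrative) and keep the setting across the restart that clone implies:

SET PERSIST group_replication_clone_threshold = 10000;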

The algorithm can then be defined as:

 On view:
  group_gtid_executed = empty_set
  joiner_gtid_executed = get_local_executed_gtids()

  for server in all_servers
    group_gtid_executed.add(server.get_executed_set())

  total_missing_gtids = group_gtid_executed.subtract(joiner_gtid_executed)

  if (total_missing_gtids.count() >= group_replication_clone_threshold)
    do_clone()

Purged binlogs

Another problematic point in the traditional recovery system is that even if the load to transmit is small, it requires all the data to be present in the donors' binlog files. When this is not true, manual intervention is needed to join a new member.

So, the other point for clone invocation is: can the group really provide the transactions the server lacks?

For this to be calculated, new data shall be added to the state exchange algorithm: basically, every server will broadcast its gtid_purged set.

So, the algorithm becomes:

 On view:
  valid_donors = 0
  group_gtid_executed = empty_set
  joiner_gtid_executed = get_local_executed_gtids()

  for server in all_servers
    group_gtid_executed.add(server.get_executed_set())
    if (server not ONLINE)
      continue; // skip as a possible donor
    missing_gtids = server.get_executed_set().subtract(joiner_gtid_executed)
    if (!missing_gtids.is_intersection_nonempty(server.get_purged_set()))
      valid_donors++;

  total_missing_gtids = group_gtid_executed.subtract(joiner_gtid_executed)

  if (total_missing_gtids.count() >= group_replication_clone_threshold ||
      valid_donors == 0)
    do_clone()
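
For reference, the same kind of subtraction can be reproduced manually with the GTID set functions; a minimal illustration, where the first argument stands in for the group's executed set, is:

-- On the joiner (the group GTID set below is made up):
SELECT GTID_SUBTRACT('aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:1-1000',
                     @@GLOBAL.gtid_executed) AS missing_gtids;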

4. How to trigger a clone

The base command to trigger the clone plugin is a variation of something like

 CLONE INSTANCE FROM "user"@donor_host:9999 IDENTIFIED BY "password" REQUIRE NO SSL;

Variants with SSL also exist; we use this one for simplicity.

The questions that are raised here are:

What is the host and port of the donor

As we do on recovery, the donor port and host will be randomly selected among all the ONLINE members that support clone.

One detail here is that the donor must be added to the list of servers we are allowed to clone from, so we also need to execute:

SET GLOBAL clone_valid_donor_list = "donor_host:donor_port";

What are the credentials

The credentials to be used for the clone are the same as for distributed recovery. This means the recovery user needs the REPLICATION SLAVE and BACKUP_ADMIN privileges. To add them, just run:

GRANT REPLICATION SLAVE ON *.* TO "recovery_user";
GRANT BACKUP_ADMIN ON *.* TO "recovery_user";
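
For reference, these are the same credentials configured on the joiner for the recovery channel; the user name and password below are illustrative:

CHANGE MASTER TO MASTER_USER = 'recovery_user', MASTER_PASSWORD = 'recovery_password'
  FOR CHANNEL 'group_replication_recovery';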

How do we use SSL

If the user defines recovery to use SSL then clone will also use it. It is then controlled by

SET GLOBAL group_replication_recovery_use_ssl=1;

The SSL clone options will then be set to their recovery counterparts.

clone_ssl_ca = group_replication_recovery_ssl_ca
clone_ssl_cert = group_replication_recovery_ssl_cert
clone_ssl_key = group_replication_recovery_ssl_key
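
A minimal sketch of this mapping as SQL, equivalent to what the plugin does through its internal session (the values are whatever the user configured for recovery):

SET GLOBAL clone_ssl_ca   = @@GLOBAL.group_replication_recovery_ssl_ca;
SET GLOBAL clone_ssl_cert = @@GLOBAL.group_replication_recovery_ssl_cert;
SET GLOBAL clone_ssl_key  = @@GLOBAL.group_replication_recovery_ssl_key;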

The executed command then becomes

CLONE INSTANCE FROM "user"@donor_host:9999 IDENTIFIED BY "password" REQUIRE SSL;

What is the attempt/failure process

An attempt will be made for each online server that supports cloning. If none of the attempts is successful, a warning is logged and we fall back to distributed recovery if possible. Some cases, such as a failure to restart the server as part of the clone, will not cause a distributed recovery attempt.

It is assumed that any clone failure is persistent, so no retries will be made to the same donor.

What if a member leaves

If the member joining the group leaves, it will stop the cloning process, meaning the clone query will be killed.

If a possible donor leaves while the process is not yet started, that member will be removed from the list of suitable donors.

If a possible donor leaves while the process is ongoing, the cloning process will be aborted, as data external to the group could be cloned as well.

How do we stop the clone process

If we are joining a new server and the clone process is ongoing, the process should be aborted if a user stops the Group Replication process.

As instructed by the clone plugin's official process, GR issues a kill to the clone query using the KILL QUERY statement.
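
A manual equivalent of what the plugin does, assuming the clone statement shows up in the process list, would be:

SELECT PROCESSLIST_ID FROM performance_schema.threads
 WHERE PROCESSLIST_INFO LIKE 'CLONE INSTANCE%';

KILL QUERY 42;  -- use the id returned by the query above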

5. Settings and their persistence

One major question in this worklog is what happens to settings, both on the joining server and on the donor.

The first distinction to be made here is between settings persisted in tables and settings that are on configuration files or persisted system variables.

Table persisted variables

Data is data, and for the clone plugin all data is equal, so every table-stored setting present on the donor will be cloned to the joiner. This applies to firewall and audit log settings and also to channel settings. Many of these settings don't belong to this worklog's scope, but the channels that belong to Group Replication will be handled by this task.

So, the basis is that, as the group_replication_applier relay logs are not cloned, all information pertaining to that channel is useless and will be removed.

The group_replication_recovery channel is different, however, given that it contains authentication credentials, and we preserve those. When cloned, if we want the joiner to rejoin the group, it will need these authentication credentials to do so.

A remark here: this means we do not preserve the original recovery authentication settings the user configured in the joiner, but use the donor's instead. This means the user shall configure the same recovery user and password in all members. The settings that pertain to SSL are preserved, so those can be unique to the joiner.

Variables in general

All settings outside tables are preserved in the joiner, so things like local addresses, IP whitelists and peer addresses remain as configured on the new joining member.

Still, there is a remark to be made here: all options necessary for a successful rejoin on server boot must survive the restart. Whether they are in a configuration file or persisted using SET PERSIST, settings such as the group name, the group contact settings and the start-on-boot parameter must be there for the member to join the group after the clone.
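
For illustration, one way to guarantee this when the options are not in the configuration file is to persist them before starting Group Replication (all values below are examples):

SET PERSIST group_replication_group_name    = 'aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa';
SET PERSIST group_replication_local_address = 'joiner_host:33061';
SET PERSIST group_replication_group_seeds   = 'donor_host:33061';
SET PERSIST group_replication_start_on_boot = ON;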

6. Observability

In terms of observability, no new structures will be added in this worklog. The only exception is the status of the group recovery clone thread, which can be checked in the performance_schema.threads table. Minor stage information can also be checked, but it only reports the clone attempt number.

The user can rely on the existing replication status tables to know the member is in recovery. Then either the process of distributed recovery or the clone process will begin and both have their own observation tools.

More about clone status can be seen in WL#9212: InnoDB: Monitor Clone status.
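
As an illustration, a DBA can follow the whole process with queries like the following (clone_status is one of the tables from WL#9212):

-- Member state as reported by Group Replication:
SELECT MEMBER_HOST, MEMBER_STATE FROM performance_schema.replication_group_members;

-- Clone progress on the joiner, once the clone plugin starts working:
SELECT STATE, ERROR_NO, ERROR_MESSAGE FROM performance_schema.clone_status;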

7. Upgrade / Downgrade

To use this feature, all newer versions must have the clone plugin installed. If no donors with clone capabilities are present in the group, the member will fall back to the standard process of distributed recovery.

In practice this means no special task is needed when upgrading the group.
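
The only prerequisite on those newer members is that the clone plugin is loaded, for example:

INSTALL PLUGIN clone SONAME 'mysql_clone.so';  -- library suffix depends on the platform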

8. Security

All the donor data is transmitted in the clone process, but the clone plugin supports SSL connections so all transmissions can be made secure.

Also in terms of privileges, this worklog means the configured recovery users will now need the BACKUP_ADMIN privilege.

The mysql.session user used internally by the plugin now needs both the BACKUP_ADMIN and CLONE_ADMIN privileges.

9. User interface

No new commands or changes to the plugin user interface are introduced in this worklog.

We do add a new variable to control the threshold for clone invocation.

name: group_replication_clone_threshold
minimum value: 1
maximum value: MAX_GNO (LLONG_MAX = 9223372036854775807 = 2^63 - 1)
default: MAX_GNO
scope: global
dynamic: yes
replicated: no
persistable: PERSIST, PERSIST_ONLY
credentials: SYSTEM_VARIABLES_ADMIN
description: The transaction number gap between servers above which the
             clone plugin is chosen for recovery.

In terms of user process, the only change is that the DBA must now grant a new privilege to the MySQL user that is configured for recovery and cloning.

GRANT BACKUP_ADMIN ON *.* TO "recovery_user";

The cloning class

This class shall contain the methods to check if cloning is wanted or needed and also the utilities to execute it.

Class structure

  • check_clone_preconditions()

This method shall check if we want or need to clone the server. In the algorithm, the threshold is group_replication_clone_threshold. Summarizing, its code does:

  • 1: Get the local member GTID_executed set.
  • 2: For each server S in all servers, excluding the local one, do:
  • 2.1: Check if S is ONLINE, supports cloning (version >= 8.0.17) and is of the same version. If so, increment valid_clone_donors.
  • 2.2: Check which GTIDs are missing in relation to S and whether S has purged any of them. If S can provide the data, increment valid_recovery_donors.
  • 2.3: Calculate the total group data by adding server S's GTID_executed data to group_gtid_set.
  • 3: Subtract the local set from group_gtid_set and get the number of missing GTIDs: missing_gtids.

If missing_gtids is above or equal to the threshold and valid_clone_donors is bigger than 0 return DO_CLONE.

If missing_gtids is above or equal to the threshold but valid_clone_donors is 0 and valid_recovery_donors is bigger than 0 return DO_RECOVERY.

If missing_gtids is below the threshold and valid_recovery_donors is bigger than 0 return DO_RECOVERY.

If missing_gtids is below the threshold and valid_recovery_donors is 0 but valid_clone_donors is bigger than 0 return DO_CLONE.

If valid_recovery_donors and valid_clone_donors are 0 return ERROR.

  • get_clone_donors(donor_list)

This method shall add to donor_list all the members that are ONLINE and support cloning.

  • do_clone()

This method shall execute all the steps needed for the clone process.

  • 1: Get all available donors using get_clone_donors()
  • 2: Open a connection session using the Session API
  • 3: Reset the read mode in the server
  • 4: Configure clone SSL settings using the open connection session
  • 5: Pick one at random from the list
  • 6: Add that member to the clone_valid_donor_list
  • 7: Construct the clone command given the recovery options: user, use SSL, etc.
  • 8: Execute the command; if an error occurs, go back to 5 with one less member.

This must be done in a spawned thread to not block the GCS process.
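
A sketch of the statements issued through the session API for a single attempt, with illustrative host, port and credentials:

SET GLOBAL clone_valid_donor_list = 'donor_host:3306';
CLONE INSTANCE FROM 'recovery_user'@'donor_host':3306
  IDENTIFIED BY 'recovery_password' REQUIRE SSL;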

  • kill_clone_process()

This method shall fetch the session API thread ID, if running, and issue a KILL QUERY if needed.

Invocation

The code for cloning will be executed when a view is received.

In the file

+ plugin/group_replication/src/gcs_event_handlers.cc

In handle_joining_members we keep the code for suspending the applier module and logging the view change log event. Where we previously started recovery directly, we now choose:

do_clone = clone_handler->check_clone_preconditions();
if (do_clone == DO_CLONE)
  error = clone_handler->do_clone();
if (do_clone == DO_RECOVERY || error)
  recovery_module->start_recovery(...);
if (do_clone == ERROR)
  leave on error

Invocation Threshold

In the plugin file

+ plugin/group_replication/src/plugin.cc

We add a new variable

name: group_replication_clone_threshold
minimum value: 1
maximum value: MAX_GNO (LLONG_MAX = 9223372036854775807 = 2^63 - 1)
default: MAX_GNO
scope: global
dynamic: yes
replicated: no
persistable: PERSIST, PERSIST_ONLY
credentials: SYSTEM_VARIABLES_ADMIN
description: The transaction gap between servers above which the
             clone plugin is chosen for recovery.

Channel credentials

During the cloning process we want to use the recovery user to contact the donor server. Some of the options, like the SSL ones, live in the plugin, but the user and password belong to the group_replication_recovery channel.

For this reason we need to add new methods to fetch these. A new method needs to be added to

+ /sql/rpl_channel_service_interface.cc

The method is then

  • int channel_get_credentials(channel, user, pass)

with a similar method in the plugin on

+ plugin/group_replication/src/replication_threads_api.cc

Channels on restart after clone

When we clone, the receiver will contain data about the donor's channels. Clone does not copy relay logs, however, so there is some cleanup to do.

The first point is to know whether we are in a server restart after a clone. For that, a new method is added to

+ include/mysql/group_replication_priv.h

We call this method

is_server_restarting_after_clone()

and it just returns the value of the server variable clone_startup.

Then on

+ plugin/group_replication/src/plugin.cc

When we initialize group replication we do

bool is_restart_after_clone = is_server_restarting_after_clone();
if (is_restart_after_clone) {
  applier_channel.purge_logs(true /*reset all*/);
  recovery_channel.purge_logs(false /*don't reset settings*/);
  recovery_channel.initialize(/*default values preserving user/pass*/);
}

GTID purged and state exchange

To check if distributed recovery is possible when there are no binlogs for the missing data, we now need information about GTID_purged.

For this we need a new field on

+ plugin/group_replication/include/member_info.h (.cc)

This new field will contain the value of GTID_purged. This also means new code for encoding/decoding.

In the event handlers for the communication file

+ plugin/group_replication/src/gcs_event_handlers.cc

we now need to add the GTID_purged extraction and broadcast to

Plugin_gcs_events_handler::get_exchangeable_data()

Privileges to the mysql.session user

In order for the plugin to execute the clone command automatically, the mysql.session user needs to be updated. This means an update to

+ sql/rpl_group_replication.h
+ scripts/mysql_system_users.sql

To add the privileges BACKUP_ADMIN, CLONE_ADMIN and SHUTDOWN.
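
An illustrative SQL equivalent of that change is:

GRANT BACKUP_ADMIN ON *.* TO 'mysql.session'@'localhost';
GRANT CLONE_ADMIN  ON *.* TO 'mysql.session'@'localhost';
GRANT SHUTDOWN     ON *.* TO 'mysql.session'@'localhost';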

Other details

The clone plugin needs to execute a RESET MASTER command while the Group Replication plugin is running, something that is forbidden by the current code. For this reason, the limitation check is changed to

if (is_group_replication_running_not_cloning()) {

This new method now belongs to the plugin interface in

+ include/mysql/group_replication_priv.h