WL#11284: Group Replication: auto-rejoin member to group after an expel

Affects: Server-8.0   —   Status: Complete

EXECUTIVE SUMMARY

When in a group, a node can be expelled if the other members cannot detect it for some time. XCom establishes that the node is suspected dead by the whole group, and an internal algorithm determines which node removes the failed node on behalf of the whole group.

In real-world scenarios, this might happen quite frequently, for instance during network splits, in which nodes belonging to a partition that does not hold a majority are expelled by the partition that does. The system then proceeds to work with a newly reconfigured group.

In such a scenario, if the network is reestablished quickly, the expelled nodes will be able to reconnect to the old group, receive the view that expels them, and shoot themselves in the head. This prevents the inconsistency issues of rejoining indirectly via the Communication Layer, since the partition that proceeded might have, for instance, more transactions than the nodes that resided in the split partition.

The goal of this WL is to make them safely rejoin the group if they are out of it for a reasonably short time.

USER STORIES

  • As a MySQL DBA, I want nodes previously expelled due to quick network glitches to rejoin the group automatically so that I can avoid unnecessary maintenance tasks.

PROBLEMS

Issues:

  • Flaky network connections can split a group fairly often, thus requiring too much manual intervention.

Users/Customers want:

  • To minimize unnecessary intervention, which implies automatically rejoining members that were expelled due to unstable connections.

Side-effects:

  • Under really unstable connections a member can enter an expel->rejoin loop, possibly stressing other members with constant view updates.
  • Auto-rejoin is the antithesis of WL#11568[1] when using ABORT_SERVER as the fencing mode.

NOTES

  • N1. This behaviour can act as an annulment of WL#11568 if we use the ABORT_SERVER fencing mode.

REFERENCES

  • [1] WL#11568: Group Replication: option to shutdown server when dropping out of the group

FUNCTIONAL REQUIREMENTS

  • F-1: The user MUST be able to enable and disable auto-rejoin of expelled members per individual member.
  • F-1.1: If auto-rejoin is enabled on a given member M, upon expel from its group, member M MUST attempt to rejoin the group.
  • F-1.2: If auto-rejoin is enabled on a given member M and if M is removed from the group due to the group_replication_unreachable_majority_timeout option value, M MUST attempt to rejoin the group.
  • F-2: The user MUST be able to configure the number of retries used for auto-rejoin.
  • F-3: The auto-rejoin procedure MUST be monitorable using the appropriate infrastructure (thread stage events).
  • F-4: The member MUST stay in ERROR state while it is rejoining. If the rejoin is successful, it should go to RECOVERING and then to ONLINE (if the distributed recovery is successful).
  • F-5: The member MUST be in super_read_only mode during the whole of the auto-rejoin procedure. If auto-rejoin fails (i.e. all retry attempts are exhausted), the member MUST then stay in super_read_only mode and proceed according to its group_replication_exit_state_action.
  • F-6: The DBA MUST be allowed to issue a STOP GROUP_REPLICATION command or restart/shutdown the server while the auto-rejoin procedure is running (even if a rejoining member is waiting for the next retry).
  • F-7: The auto-rejoin procedure will always use the latest values of the plugin options, i.e. the same values that were active while the rejoining member was still in the group (same group_seeds, etc.).
  • F-8: The maximum time to wait between each rejoin attempt SHALL be 5 minutes.

HIGH-LEVEL SPECIFICATION

In order to allow the user to enable and disable the auto-rejoin procedure at will, a new sysvar will be added to the Group Replication plugin. Furthermore, to achieve F-3, we will introduce a new thread stage event which will enable the user to monitor the current auto-rejoin procedure as well as all the auto-rejoin procedures that have been triggered since the server started.

INTERFACE SPECIFICATION

Configuration/sysvar changes:

  • group_replication_autorejoin_tries (NEW)

    New sysvar that allows the user to configure the number of retries that a member will go through in the auto-rejoin procedure. This accomplishes requirement F-2. It will be an unsigned integer with a valid range of 0 to 2016. Upon expel, or upon being unable to contact a majority of the group, the member will attempt to join the group again up to group_replication_autorejoin_tries times (the timeout for each join operation will be 5 minutes). If every attempt fails, it gives up and proceeds as normal for an expel, i.e. the member will enter the ERROR state and will react according to the group_replication_exit_state_action option. During the whole of the auto-rejoin procedure, the member must stay in super_read_only mode. This accomplishes requirements F-1.1, F-1.2, F-4 and F-5. When a value of 0 is used, auto-rejoin shall be disabled for that member, thus accomplishing requirement F-1.
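
    As a rough sanity check on the range above, the worst-case duration of a single auto-rejoin procedure can be worked out at compile time. This is only a sketch: it assumes that every attempt accounts for the full 5 minutes and that no attempt succeeds early, and the constant and function names are hypothetical, not part of the plugin.

```cpp
#include <cstdint>

// Illustrative values taken from the text above; the names are hypothetical.
constexpr std::uint32_t kMaxAutorejoinTries = 2016;  // upper bound of the sysvar
constexpr std::uint32_t kMinutesPerAttempt = 5;      // assumed cost of one attempt

// Worst-case minutes spent in one auto-rejoin procedure for a given setting,
// assuming no attempt succeeds before its 5 minutes are up.
constexpr std::uint32_t worst_case_minutes(std::uint32_t tries) {
  return tries * kMinutesPerAttempt;
}

// 2016 tries * 5 min = 10080 min = 168 h = 7 days.
static_assert(worst_case_minutes(kMaxAutorejoinTries) == 10080,
              "2016 tries take 10080 minutes");
static_assert(worst_case_minutes(kMaxAutorejoinTries) / 60 / 24 == 7,
              "which is 7 days");
// A value of 0 disables auto-rejoin, so no time is spent retrying.
static_assert(worst_case_minutes(0) == 0, "0 disables auto-rejoin");
```

    Under those assumptions, the upper bound of 2016 corresponds to one week of retrying.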

Performance schema changes:

Monitoring and observability are required so that users, DBAs and management tools/scripts can be informed of what is happening to a member between being expelled and joining or leaving the group. The MySQL server provides infrastructure for the concept of stage events. These represent steps in a procedure or workflow, and their main goal is to inform the user of the progress of work (by providing the work done as well as the estimated work needed to complete the task). For more details on this matter, please refer to the reference manual.

We must be able to observe and capture the following metrics and status:

  • Whether or not an auto-rejoin procedure is under way, the number of retries that have already elapsed and the time remaining until the next retry

  • Example (for knowing if the auto-rejoin procedure is running):


SELECT COUNT(*) FROM performance_schema.events_stages_current                                                                         
  WHERE EVENT_NAME LIKE '%auto-rejoin%';

COUNT(*)
1

  • Example (for viewing the number of retries in an ongoing auto-rejoin procedure):

SELECT WORK_COMPLETED FROM performance_schema.events_stages_current WHERE
  EVENT_NAME LIKE '%auto-rejoin%';                            

WORK_COMPLETED
1

  • Example (for estimating the remaining time until the next retry):

SELECT (300.0 - ((TIMER_WAIT * 10e-13) - 300.0 * (CAST(WORK_COMPLETED AS SIGNED) - 1))) AS time_remaining FROM
  performance_schema.events_stages_current WHERE EVENT_NAME LIKE '%auto-rejoin%';

time_remaining
30.0

  • The number of times the auto-rejoin procedure was triggered since the server started

  • Example:


SELECT COUNT_STAR FROM
performance_schema.events_stages_summary_global_by_event_name
WHERE EVENT_NAME LIKE '%auto-rejoin%';           

COUNT_STAR
1

  • The timestamp of the last time the auto-rejoin procedure was triggered (since the server started)

  • Example:


SELECT (UNIX_TIMESTAMP() - (
  (SELECT VARIABLE_VALUE FROM performance_schema.global_status WHERE VARIABLE_NAME='UPTIME') - TIMER_END * 10e-13)) AS timestamp FROM
performance_schema.events_stages_history_long WHERE EVENT_NAME LIKE '%auto-rejoin%'
ORDER BY TIMER_END DESC
LIMIT 1;       

timestamp
1548344279.6238284

To fulfill these requirements we will use the performance_schema.events_stages_current table, which contains the stage event occurring on each thread.

In this regard, the EVENT_NAME shall be 'Undergoing auto-rejoin procedure', the WORK_COMPLETED shall be the number of retries attempted for a running auto-rejoin procedure and, between retries, the time until the next retry will be 5 minutes minus TIMER_WAIT (keep in mind that TIMER_WAIT is in picoseconds and must be scaled first).

Example: SELECT EVENT_NAME, WORK_COMPLETED, TIMER_START, TIMER_WAIT FROM performance_schema.events_stages_current WHERE EVENT_NAME LIKE '%auto-rejoin%';


  EVENT_NAME                                         WORK_COMPLETED   TIMER_START   TIMER_WAIT
  stage/group_rpl/Undergoing auto-rejoin procedure   3                1000000000    60000000000000
  

In this example, the auto-rejoin procedure was started and 3 retries have been attempted. The time remaining until the next retry can be calculated in seconds as 300 - (TIMER_WAIT * 10^-12), which is 240 seconds, i.e. 4 minutes (we scale TIMER_WAIT to seconds: 60000000000000 ps is 60 s, or 1 min). Please note that this is an estimated time window, because a successful join() can occur before the 5 minute timeout.

As for keeping track of previous auto-rejoin instances, it is possible to use the performance_schema.events_stages_summary_global_by_event_name table (which aggregates summaries for all thread stage events). Since the event we are introducing will only occur in its own thread (the auto-rejoin thread) and since this thread is unique within the server process, the user can use the table with all the events because our proposed thread stage event will only appear once. The field COUNT_STAR will contain the number of times the event was triggered (therefore, the number of times the server initiated the auto-rejoin procedure).

Example: SELECT EVENT_NAME, COUNT_STAR FROM performance_schema.events_stages_summary_global_by_event_name WHERE EVENT_NAME LIKE '%auto-rejoin%';


  EVENT_NAME                                         COUNT_STAR
  stage/group_rpl/Undergoing auto-rejoin procedure   10
  

In this example, the auto-rejoin procedure has been triggered 10 times. Note that this means the procedure itself was run 10 times, not that it has attempted 10 retries. For individual retry accounting, one has to look at performance_schema.events_stages_current when the auto-rejoin procedure is running.

To infer the last time the event was triggered, the performance_schema.events_stages_history_long table can be used, by filtering on the auto-rejoin event and sorting in descending order by the TIMER_END field. The TIMER_END field of the first row of the resulting query contains the timestamp of the last auto-rejoin triggered. Note that this history table has a fixed size; for more information please refer to the reference manual.

Example: SELECT EVENT_NAME, TIMER_END FROM performance_schema.events_stages_history_long WHERE EVENT_NAME LIKE '%auto-rejoin%' ORDER BY TIMER_END DESC LIMIT 1;


  EVENT_NAME                                            TIMER_END
  stage/group_rpl/Undergoing auto-rejoin procedure  1530025140000000000000 
  

In this example, to obtain the time since the last auto-rejoin was run, we subtract TIMER_END, scaled to seconds, from the current timestamp and convert the result to minutes, like so: (UNIX_TIMESTAMP() - TIMER_END * 10^-12) / 60. This will yield the number of minutes that have passed since the last auto-rejoin procedure.

The following SQL functions can help the DBA extract this information more easily and serve as a foundation for monitoring tooling. DISCLAIMER: These are neither UDFs nor officially supported by MySQL; they were developed solely to ease the testing of the feature in MTR.


    -- IS_AUTOREJOIN_RUNNING()
    --  + Returns a flag indicating whether or not an auto-rejoin process is
    --  currently running.
    CREATE FUNCTION IS_AUTOREJOIN_RUNNING() RETURNS BOOLEAN
    BEGIN
      DECLARE autorejoin_running INTEGER;
      DECLARE ret BOOL;
      SET autorejoin_running = -1;
      SET ret = FALSE;

      -- There is only one auto-rejoin thread. If there is a thread stage event
      -- with the auto-rejoin name, then an auto-rejoin process is being run.
      SELECT COUNT(*) FROM performance_schema.events_stages_current
        WHERE EVENT_NAME LIKE '%auto-rejoin%'
        INTO autorejoin_running;

      IF autorejoin_running = 1 THEN
        SET ret = TRUE;
      END IF;

      RETURN ret;
    END
  


    --  GET_NUMBER_RETRIES()
    --    + Returns the number of retries that have been attempted in a running
    --      auto-rejoin process. If no auto-rejoin process is running, it
    --      returns -1.
    CREATE FUNCTION GET_NUMBER_RETRIES()
      RETURNS INT
    BEGIN
      DECLARE ret INT;
      DECLARE autorejoin_running BOOL;
      SET ret = -1;
      SET autorejoin_running = FALSE;

      SELECT IS_AUTOREJOIN_RUNNING() INTO autorejoin_running;

      -- The WORK_COMPLETED field of the events_stages_current performance schema
      -- table contains the number of attempts within a running auto-rejoin process.
      IF autorejoin_running = TRUE THEN
        SELECT WORK_COMPLETED FROM performance_schema.events_stages_current
          WHERE EVENT_NAME LIKE '%auto-rejoin%'
          INTO ret;
      END IF;

      RETURN ret;
    END
  


    --  GET_TIME_UNTIL_NEXT_RETRY()
    --    + Returns the time remaining until the next retry in seconds, for a
    --      running auto-rejoin process. If no auto-rejoin process is running, it
    --      returns -1.
    CREATE FUNCTION GET_TIME_UNTIL_NEXT_RETRY(wait_interval INTEGER)
      RETURNS DOUBLE
    BEGIN
      DECLARE ret DOUBLE;
      DECLARE autorejoin_running BOOL;
      DECLARE time_in_secs DOUBLE;
      DECLARE num_retries INT;
      DECLARE sleep_interval INT;
      SET ret = -1;
      SET autorejoin_running = FALSE;
      SET time_in_secs = 0;
      SET num_retries = 0;
      SET sleep_interval = wait_interval;

      SELECT IS_AUTOREJOIN_RUNNING() INTO autorejoin_running;

      IF autorejoin_running = TRUE THEN
        -- If -1 is passed as interval, we default to 5 mins (300 secs)
        IF sleep_interval = -1 THEN
          SET sleep_interval = 300;
        END IF;

        -- We obtain the number of retries so far, so we can calculate the time
        -- that has been spent up until the last retry.
        SELECT GET_NUMBER_RETRIES() INTO num_retries;
        SET num_retries = num_retries - 1;

        -- We then obtain the time spent up until now and scale it to seconds.
        SELECT TIMER_WAIT FROM performance_schema.events_stages_current
          WHERE EVENT_NAME LIKE '%auto-rejoin%'
          INTO time_in_secs;
        SET time_in_secs = time_in_secs * 10e-13;

        -- We then estimate the remaining time in seconds. Note: 300sec = 5 mins.
        SET ret = sleep_interval - (time_in_secs - sleep_interval * num_retries);
      END IF;

      RETURN ret;
    END
  


    --  GET_LAST_AUTOREJOIN()
    --    + Returns the UNIX timestamp of the last auto-rejoin process that was
    --      run. If no auto-rejoin process has been attempted yet, returns -1.
    CREATE FUNCTION GET_LAST_AUTOREJOIN()
      RETURNS INT
    BEGIN
      DECLARE ret INT;
      DECLARE num_autorejoins INT;
      SET ret = -1;
      SET num_autorejoins = 0;

      -- We first verify if there have been any auto-rejoin up until now.
      SELECT GET_COUNT_AUTOREJOIN() INTO num_autorejoins;
      IF num_autorejoins > 0 THEN
        SELECT
          UNIX_TIMESTAMP() - ((SELECT VARIABLE_VALUE FROM performance_schema.global_status WHERE VARIABLE_NAME='UPTIME') - TIMER_END * 10e-13)
          FROM performance_schema.events_stages_history_long
          WHERE EVENT_NAME LIKE '%auto-rejoin%'
          ORDER BY TIMER_END DESC
          LIMIT 1
          INTO ret;
      END IF;

      RETURN ret;
    END
  


    --  GET_COUNT_AUTOREJOIN()
    --    + Returns the number of times the auto-rejoin process has been run.
    CREATE FUNCTION GET_COUNT_AUTOREJOIN()
      RETURNS INT
    BEGIN
      DECLARE ret INT;
      SET ret = 0;

      SELECT COUNT_STAR
        FROM performance_schema.events_stages_summary_global_by_event_name
        WHERE EVENT_NAME LIKE '%auto-rejoin%'
        INTO ret;

      RETURN ret;
    END
  

The combined use of these three tables achieves F-3.
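
For tooling that reads the stage event outside of SQL, the arithmetic of GET_TIME_UNTIL_NEXT_RETRY() above could be mirrored roughly as follows. This is a minimal sketch and not part of the plugin: the function name, parameter names and the picosecond constant are assumptions introduced for illustration, and the caller is assumed to have already fetched TIMER_WAIT and WORK_COMPLETED for the auto-rejoin event.

```cpp
#include <cstdint>

// TIMER_WAIT is reported in picoseconds.
constexpr double kPicosecondsPerSecond = 1e12;

// Estimated seconds until the next retry, following the same formula as the
// GET_TIME_UNTIL_NEXT_RETRY() SQL helper:
//   interval - (elapsed - interval * (attempts - 1))
//   timer_wait_ps  : TIMER_WAIT of the auto-rejoin stage event
//   work_completed : WORK_COMPLETED, i.e. the attempts made so far
//   interval_secs  : wait between attempts (300 s in this worklog)
double seconds_until_next_retry(std::uint64_t timer_wait_ps,
                                std::int64_t work_completed,
                                double interval_secs = 300.0) {
  // Total time the stage has been running, scaled to seconds.
  const double elapsed_secs =
      static_cast<double>(timer_wait_ps) / kPicosecondsPerSecond;
  // Time assumed to have been consumed by the attempts before the current one.
  const double before_current =
      interval_secs * static_cast<double>(work_completed - 1);
  return interval_secs - (elapsed_secs - before_current);
}
```

For instance, with two attempts made and 570 seconds elapsed in total, the estimate is 300 - (570 - 300) = 30 seconds. As with the SQL helper, this is only an estimate, because a join can succeed before the wait is over.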

SUMMARY OF CHANGES

1. group_replication_autorejoin_tries (new sysvar)

A new sysvar shall be added that allows the user to configure the number of retries that a member will go through in the auto-rejoin procedure. Its type shall be ulong, its default 0 and its acceptable range of values [0, 2016]. When set to 0, the auto-rejoin procedure is disabled. Between retries, if the join is not successful, the member will sleep 5 minutes (this sleep must be interruptible in order to achieve F-6). If the join is successful, the auto-rejoin procedure stops.
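
The retry behaviour described above (bounded attempts, an interruptible wait between them, stop on the first success) can be sketched as below. This is an illustration under stated assumptions, not the plugin's Autorejoin_thread: the class AutorejoinSketch, its methods and the attempt_join callback are hypothetical, and the sketch assumes no wait is needed after the final failed attempt.

```cpp
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>

// Hypothetical helper illustrating the retry loop; not the plugin's code.
class AutorejoinSketch {
 public:
  // Runs up to `tries` join attempts, waiting `wait` after each failure.
  // Returns true if an attempt succeeded, false if all attempts were used up,
  // if `tries` is 0 (auto-rejoin disabled) or if abort() was called.
  bool run(unsigned long tries, std::chrono::milliseconds wait,
           const std::function<bool()> &attempt_join) {
    for (unsigned long attempt = 1; attempt <= tries; ++attempt) {
      {
        std::lock_guard<std::mutex> guard(lock_);
        if (aborted_) return false;
        attempts_made_ = attempt;
      }
      if (attempt_join()) return true;  // rejoined: stop retrying
      if (attempt == tries) break;      // assumption: no wait after last try
      // Interruptible sleep: returns early (with true) if abort() is called.
      std::unique_lock<std::mutex> guard(lock_);
      if (cond_.wait_for(guard, wait, [this] { return aborted_; })) {
        return false;
      }
    }
    return false;
  }

  // Requests termination, e.g. on STOP GROUP_REPLICATION or server shutdown.
  void abort() {
    {
      std::lock_guard<std::mutex> guard(lock_);
      aborted_ = true;
    }
    cond_.notify_all();
  }

  // Number of attempts started so far (what WORK_COMPLETED would report).
  unsigned long attempts_made() const {
    std::lock_guard<std::mutex> guard(lock_);
    return attempts_made_;
  }

 private:
  mutable std::mutex lock_;
  std::condition_variable cond_;
  bool aborted_ = false;
  unsigned long attempts_made_ = 0;
};
```

Waiting on a condition variable rather than calling a plain sleep is one way to let another thread cut the 5 minute wait short, which is what F-6 asks for.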

2. New thread stage event

A new thread stage event shall be added that represents an ongoing auto-rejoin procedure. The summary and history stage event tables shall provide the functionality to monitor past auto-rejoin procedures. To this end, the following thread stage event PSI key shall be registered:


PSI_stage_info info_GR_STAGE_autorejoin = {
    0, "Undergoing auto-rejoin procedure",
    PSI_FLAG_STAGE_PROGRESS, PSI_DOCUMENT_ME};

Furthermore, a new entity shall be added that will handle all the logic related to interacting with the thread stage events tables. This code will be adapted from the Plugin_stage_monitor_handler that is implemented in WL#10378.

3. General flow

The entry point for the auto-rejoin procedure is either the on_view_changed or the on_suspicions method of Plugin_gcs_events_handler, depending on whether the member was expelled or lost contact with a majority of the group. From then on the flow is similar in both instances. The auto-rejoin procedure will only begin if the number of attempts to try is at least one. So that we don't block the GCS event handler thread, and given that the GCS callbacks are all const, we delegate the actual rejoin procedure to another thread, called Autorejoin_thread. Since a subsequent call to Gcs_operations::join() doesn't tear down the GCS infrastructure, we must explicitly leave the group (by invoking Gcs_operations::leave()) and, once the member has left, join again. So, starting from the Plugin_gcs_events_handler::was_member_expelled() method and using the scenario where the member was expelled as an example (note that the only difference between that scenario and the one where the member loses contact with a majority of the group is the entry point), the flow is the following:


+-----+              +---------------------------+                                   +-------------------+                                         +---------------------------+ +-----------------+                                +-------------------------------+
| GCS |              | Plugin_gcs_events_handler |                                   | Autorejoin_thread |                                         | Group_member_info_manager | | Gcs_operations  |                                | Plugin_stage_monitor_handler  |
+-----+              +---------------------------+                                   +-------------------+                                         +---------------------------+ +-----------------+                                +-------------------------------+
   |                               |                                                           |                                                                 |                        |                                                         |
   | on_view_changed               |                                                           |                                                                 |                        |                                                         |
   |------------------------------>|                                                           |                                                                 |                        |                                                         |
   |                               |                                                           |                                                                 |                        |                                                         |
   |                               | was_member_expelled                                       |                                                                 |                        |                                                         |
   |                               |--------------------                                       |                                                                 |                        |                                                         |
   |                               |                   |                                       |                                                                 |                        |                                                         |
   |                               |<-------------------                                       |                                                                 |                        |                                                         |
   |                               | ------------------------------------------------\         |                                                                 |                        |                                                         |
   |                               |-| if group_replication_autorejoin_tries_var > 0 |         |                                                                 |                        |                                                         |
   |                               | |-----------------------------------------------|         |                                                                 |                        |                                                         |
   |                               |                                                           |                                                                 |                        |                                                         |
   |                               | init                                                      |                                                                 |                        |                                                         |
   |                               |---------------------------------------------------------->|                                                                 |                        |                                                         |
   |                               |                                                           |                                                                 |                        |                                                         |
   |                               |                                                           | set_stage('autorejoin')                                         |                        |                                                         |
   |                               |                                                           |--------------------------------------------------------------------------------------------------------------------------------------------------->|
   |                               |                                                           |                                                                 |                        |                                                         |
   |                               | start_rejoin                                              |                                                                 |                        |                                                         |
   |                               |---------------------------------------------------------->|                                                                 |                        |                                                         |
   |                               |                                                           |                                                                 |                        |                                                         |
   |                               |                                                           | rejoin                                                          |                        |                                                         |
   |                               |                                                           |----------------------------------------------------------------------------------------->|                                                         |
   |                               |                                                           |                                                                 |                        |                                                         |
   |                               |                                                           |                                                                 |                        | leave                                                   |
   |                               |                                                           |                                                                 |                        |------                                                   |
   |                               |                                                           |                                                                 |                        |     |                                                   |
   |                               |                                                           |                                                                 |                        |<-----                                                   |
   |                               |                                                           |                                                                 |                        |                                                         |
   |                               |                                                           |                                                                 |                        | wait_for_view_change                                    |
   |                               |                                                           |                                                                 |                        |---------------------                                    |
   |                               |                                                           |                                                                 |                        |                    |                                    |
   |                               |                                                           |                                                                 |                        |<--------------------                                    |
   |                               |                                                           |                                                                 |                        |                                                         |
   |                               |                                                           |                                                                 |                        | teardown_membership_dependent_resources                 |
   |                               |                                                           |                                                                 |                        |----------------------------------------                 |
   |                               |                                                           |                                                                 |                        |                                       |                 |
   |                               |                                                           |                                                                 |                        |<---------------------------------------                 |
   |                               |                                                           |                                                                 |                        |                                                         |
   |                               |                                                           |                                                                 |                        | join                                                    |
   |                               |                                                           |                                                                 |                        |-----                                                    |
   |                               |                                                           |                                                                 |                        |    |                                                    |
   |                               |                                                           |                                                                 |                        |<----                                                    |
   |                               |                                                           |                                                                 |                        |                                                         |
   |                               |                                                           |                                                                 |                        | wait_for_view_change                                    |
   |                               |                                                           |                                                                 |                        |---------------------                                    |
   |                               |                                                           |                                                                 |                        |                    |                                    |
   |                               |                                                           |                                                                 |                        |<--------------------                                    |
   |                               |                                                           |                                                                 |                        |                                                         |
   |                               |                                                           | set_estimated_work(retries++, time())                           |                        |                                                         |
   |                               |                                                           |--------------------------------------                           |                        |                                                         |
   |                               |                                                           |                                     |                           |                        |                                                         |
   |                               |                                                           |<-------------------------------------                           |                        |                                                         |
   |                               |                                                           |                                                                 |                        |                                                         |
   |                               |                                                           | my_sleep(5mins)                                                 |                        |                                                         |
   |                               |                                                           |----------------                                                 |                        |                                                         |
   |                               |                                                           |               |                                                 |                        |                                                         |
   |                               |                                                           |<---------------                                                 |                        |                                                         |
   |                               |                                                           | -------------------------------------------------\              |                        |                                                         |
   |                               |                                                           |-| the sequence will be the same until            |              |                        |                                                         |
   |                               |                                                           | | we successfully rejoin or exhaust all attempts |              |                        |                                                         |
   |                               |                                                           | --------------------------------------------------\             |                        |                                                         |
   |                               |                                                           |-| if all retries are depleted, then               |             |                        |                                                         |
   |                               |                                                           | | group_replication_exit_state_action is executed |             |                        |                                                         |
   |                               |                                                           | |-------------------------------------------------|             |                        |                                                         |
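
The retry sequence in the diagram can be sketched as a plain loop. This is a minimal illustration only, not the plugin code: Rejoin_hooks, Rejoin_result and run_autorejoin are hypothetical names, and the callables merely stand in for the rejoin attempt, set_estimated_work, my_sleep and group_replication_exit_state_action. Skipping the sleep after the last failed attempt is an assumption; the diagram does not say either way.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Outcome of the whole auto-rejoin procedure (hypothetical type).
enum class Rejoin_result { REJOINED, EXHAUSTED };

// Hypothetical hooks standing in for the GR internals named in the diagram.
struct Rejoin_hooks {
  std::function<bool()> attempt_rejoin;  // true once the member is back
  std::function<void(unsigned long)> report_progress;  // set_estimated_work
  std::function<void(std::chrono::seconds)> sleep_for;  // my_sleep
  std::function<void()> exit_state_action;  // run when retries are depleted
};

Rejoin_result run_autorejoin(unsigned long max_attempts,
                             std::chrono::seconds wait,
                             const Rejoin_hooks &hooks) {
  assert(hooks.attempt_rejoin && hooks.report_progress && hooks.sleep_for &&
         hooks.exit_state_action);

  for (unsigned long attempt = 1; attempt <= max_attempts; ++attempt) {
    if (hooks.attempt_rejoin()) return Rejoin_result::REJOINED;

    // Publish how many attempts have been made so far.
    hooks.report_progress(attempt);

    // Assumption: no sleep is needed once the last attempt has failed.
    if (attempt < max_attempts) hooks.sleep_for(wait);
  }

  // All retries are depleted: hand over to the configured exit action.
  hooks.exit_state_action();
  return Rejoin_result::EXHAUSTED;
}
```

Injecting the sleep as a callable keeps the loop testable without waiting for the real interval.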

4. Rejoin operation on Gcs_operations

A new method, rejoin, shall be added to the GCS wrapper (Gcs_operations). It will use the previously configured listeners instead of configuring them again.


enum enum_gcs_error Gcs_operations::rejoin() {
  DBUG_ENTER("Gcs_operations::rejoin");
  enum enum_gcs_error ret = GCS_NOK;
  std::string group_name(group_name_var);
  Gcs_group_identifier group_id(group_name);
  Gcs_control_interface *gcs_control = NULL;
  Gcs_communication_interface *gcs_comm = NULL;
  enum_leave_state state = ERROR_WHEN_LEAVING;
  gcs_operations_lock->wrlock();

  // Sanity check on the Gcs_interface
  if (gcs_interface == NULL || !gcs_interface->is_initialized()) goto end;

  gcs_control = gcs_interface->get_control_session(group_id);
  gcs_comm = gcs_interface->get_communication_session(group_id);
  DBUG_ASSERT(gcs_control != NULL);
  DBUG_ASSERT(gcs_comm != NULL);

  /*
    The first step is to issue a GCS leave() operation. This is done because
    the join() operation will assume that the GCS layer is not initiated and
    will try to reinitialize everything. Thus, we will simply teardown and setup
    both the GCS layer and the group membership dependent components on the GR
    side between each retry.
  */
  view_change_notifier->start_view_modification();
  state = gcs_module->leave();
  if (state == Gcs_operations::ERROR_WHEN_LEAVING)
    LogPluginErr(ERROR_LEVEL, ER_GRP_RPL_FAILED_TO_CONFIRM_IF_SERVER_LEFT_GRP);
  if (view_change_notifier->wait_for_view_modification()) {
    LogPluginErr(WARNING_LEVEL, ER_GRP_RPL_TIMEOUT_RECEIVED_VC_ON_REJOIN);
    ret = GCS_NOK;
    goto end;
  }
  gcs_module->finalize();

  /*
    We must explicitly destroy/terminate all group membership dependent
    resources because the join() operation will try to create and/or initialize
    them.

    Firstly we remove the event listeners from the communication and control
    interfaces, since they are readded on the join() operation.
  */
  gcs_control->remove_event_listener(m_ctrl_event_listener_handle);
  gcs_comm->remove_event_listener(m_comm_event_listener_handle);

  // Then we terminate the GR layer components
  if (terminate_plugin_modules(REJOIN_MODULES_BITMASK, NULL)) goto end;

  /*
    If the member was the boot node, bootstrapping must be disabled for the
    rejoin, because the join operation will try to boot the group if the
    joining member is the boot node. Only the GCS parameter is overridden;
    bootstrap_group_var keeps its configured value, so members that are not
    the boot node are unaffected and a failed attempt leaves nothing to undo.
  */
  gcs_module_parameters.add_parameter("bootstrap_group", "false");
  ret = gcs_module->configure(gcs_module_parameters);
  if (ret != GCS_OK) {
    LogPluginErr(ERROR_LEVEL, ER_GRP_RPL_UNABLE_TO_INIT_COMMUNICATION_ENGINE);
    goto end;
  }

  /*
    The final part of the GR component reset is to try to initialize everything
    again.
  */
  if (initialize_plugin_modules(REJOIN_MODULES_BITMASK)) {
    // ret still holds GCS_OK from configure(), so report the failure
    ret = GCS_NOK;
    goto end;
  }

  // Inform GCS layer we want to join the group
  gcs_control = gcs_interface->get_control_session(group_id);
  gcs_module->initialize();
  // Arm the notifier before joining, as was done before the leave()
  view_change_notifier->start_view_modification();
  ret = gcs_control->join();
  if (ret == GCS_OK) {
    // Restore the configured bootstrap mode in the GCS parameters
    gcs_module_parameters.add_parameter(
        "bootstrap_group", (bootstrap_group_var ? "true" : "false"));
    ret = gcs_module->configure(gcs_module_parameters);
    if (ret != GCS_OK) {
      LogPluginErr(ERROR_LEVEL, ER_GRP_RPL_UNABLE_TO_INIT_COMMUNICATION_ENGINE);
      goto end;
    }

    if (view_change_notifier->wait_for_view_modification()) {
      LogPluginErr(WARNING_LEVEL, ER_GRP_RPL_TIMEOUT_RECEIVED_VC_ON_REJOIN);
      ret = GCS_NOK;
    }
  }

end:
  gcs_operations_lock->unlock();
  DBUG_RETURN(ret);
}
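
The start_view_modification / wait_for_view_modification pair used by rejoin() can be illustrated with a small standard-library analogue. The class below is a hypothetical sketch, not the plugin's notifier: its name, the explicit timeout argument and the end_view_modification method are assumptions made for the example. It keeps the convention seen in rejoin(), where a true return value from the wait means the view was not received in time.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical stand-in for the view change notifier used by rejoin().
class View_change_notifier_sketch {
 public:
  // Arms the notifier; call this before issuing leave() or join().
  void start_view_modification() {
    std::lock_guard<std::mutex> guard(m_lock);
    m_view_changing = true;
  }

  // Called by the GCS event listener once the new view is delivered.
  void end_view_modification() {
    {
      std::lock_guard<std::mutex> guard(m_lock);
      m_view_changing = false;
    }
    m_cond.notify_all();
  }

  // Returns true on timeout and false when the view arrived in time.
  bool wait_for_view_modification(std::chrono::milliseconds timeout) {
    assert(timeout.count() >= 0);
    std::unique_lock<std::mutex> guard(m_lock);
    // wait_for yields the predicate value, i.e. false only on timeout.
    return !m_cond.wait_for(guard, timeout,
                            [this] { return !m_view_changing; });
  }

 private:
  std::mutex m_lock;
  std::condition_variable m_cond;
  bool m_view_changing = false;
};
```

In this sketch a wait on a notifier that was never armed returns immediately without reporting a timeout, which is why each asynchronous operation needs its own start_view_modification() call before it is issued.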

5. Autorejoin_thread

So that we don't block the GCS event handler thread, we shall delegate the actual busy-wait loop of the auto-rejoin procedure to its own thread. To accomplish this, we introduce a new thread to GR, implemented by the Autorejoin_thread class:


/**
  Represents and encapsulates the logic responsible for handling the auto-rejoin
  procedure within Group Replication. The auto-rejoin procedure kicks in when a
  member is expelled from the group and the user configured a non-zero number
  of retries to join the group again.

  This thread will do a busy-wait loop for group_replication_autorejoin_tries
  number of attempts, sleeping 5 minutes between each attempt.

  Since the join operation of the GCS layer is asynchronous, the call itself
  does not tell us whether the server managed to join the group or not. As
  such, we wait on a signal sent by an entity that is registered as a GCS
  event listener.

  @sa Plugin_gcs_events_handler
*/
class Autorejoin_thread {
 public:
  /**
    Deleted default ctor.
  */
  Autorejoin_thread() = delete;

  /**
    Deleted copy ctor.
  */
  Autorejoin_thread(const Autorejoin_thread &) = delete;

  /**
    Deleted move ctor.
  */
  Autorejoin_thread(Autorejoin_thread &&) = delete;

  /**
    Deleted copy assignment operator.
  */
  Autorejoin_thread &operator=(const Autorejoin_thread &) = delete;

  /**
    The ctor for the thread will initialize the instrumented mutex and
    cond_var.

    @param gcs_module the GCS layer wrapper to use.
  */
  Autorejoin_thread(Gcs_operations *gcs_module);

  /**
    The dtor for the thread will destroy the mutex and cond_var.
  */
  ~Autorejoin_thread();

  /**
    Creates the actual thread and waits on the cond_var for work.

    @return An integer representing an error code (returned from
    mysql_thread_create).

    @sa mysql_thread_create
  */
  int init();

  /**
    Aborts the thread's main loop, effectively killing the thread.
  */
  void abort();

  /**
    Starts the auto-rejoin procedure, i.e. wakes up the auto-rejoin thread,
    which then executes a busy-wait loop for up to the given number of attempts.

    @param attempts the number of attempts we will try to rejoin
  */
  void start_autorejoin(ulong attempts);

  /**
    Returns a flag indicating whether or not the auto-rejoin procedure is ongoing
    on this thread.

    @return the state of the rejoin procedure.
  */
  bool is_autorejoin_ongoing() const;

 private:
  /**
    The thread callback passed onto mysql_thread_create.

    @param arg a pointer to an Autorejoin_thread instance.

    @return NULL, since the return value is not used.
  */
  static void *launch_thread(void *arg);

  /**
    The main loop of the thread.
  */
  void work();

  /**
    Handles the busy-wait retry loop.
  */
  void execute_rejoin_process();

  static const int REJOIN_TIMEOUT = 300;  // 5 mins, in seconds
  /** the THD handle. */
  THD *m_thd;
  /** the state of the thread. */
  thread_state m_autorejoin_thd_state;
  /** the thread handle. */
  my_thread_handle m_handle;
  /** the mutex for controlling access to the thread itself. */
  mysql_mutex_t m_run_lock;
  /** the cond_var used to signal the thread. */
  mysql_cond_t m_run_cond;
  /** the mutex that guards the wait on m_work_cond between retries. */
  mysql_mutex_t m_work_lock;
  /** the cond_var to signal the thread to attempt a retry. */
  mysql_cond_t m_work_cond;
  /** flag to indicate whether or not the thread is to be aborted. */
  bool m_abort;
  /** the number of attempts for the rejoin. */
  ulong m_attempts;
  /** flag to indicate whether or not an auto-rejoin procedure is ongoing. */
  bool m_rejoin_ongoing;
  /** the interface to the GCS layer. */
  Gcs_operations *m_gcs_module;
};
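
To show how the pieces of this interface could fit together, here is a minimal sketch built on the C++ standard library instead of the instrumented mysql_mutex_t / mysql_cond_t primitives. Everything in it is an assumption made for illustration: the class name Autorejoin_worker_sketch, the injected attempt_rejoin callable, and the single mutex and condition variable, which replace the run/work pairs above. The point of interest is that the sleep between attempts is a timed wait on the condition variable, so abort() can interrupt it at once.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical, simplified analogue of Autorejoin_thread.
// start_autorejoin() and abort() are assumed to be called from one thread.
class Autorejoin_worker_sketch {
 public:
  Autorejoin_worker_sketch(std::function<bool()> attempt_rejoin,
                           std::chrono::milliseconds wait)
      : m_attempt_rejoin(std::move(attempt_rejoin)), m_wait(wait) {
    assert(m_attempt_rejoin);
  }
  Autorejoin_worker_sketch(const Autorejoin_worker_sketch &) = delete;
  Autorejoin_worker_sketch &operator=(const Autorejoin_worker_sketch &) =
      delete;
  ~Autorejoin_worker_sketch() { abort(); }

  // Launches the retry loop; returns false if one is already running.
  bool start_autorejoin(unsigned long attempts) {
    std::lock_guard<std::mutex> guard(m_lock);
    if (m_rejoin_ongoing) return false;
    if (m_thread.joinable()) m_thread.join();  // reap a finished run
    m_abort = false;
    m_rejoin_ongoing = true;
    m_thread = std::thread(&Autorejoin_worker_sketch::work, this, attempts);
    return true;
  }

  // Interrupts any pending sleep and waits for the thread to finish.
  void abort() {
    {
      std::lock_guard<std::mutex> guard(m_lock);
      m_abort = true;
    }
    m_cond.notify_all();
    if (m_thread.joinable()) m_thread.join();
  }

  bool is_autorejoin_ongoing() const {
    std::lock_guard<std::mutex> guard(m_lock);
    return m_rejoin_ongoing;
  }

 private:
  void work(unsigned long attempts) {
    for (unsigned long i = 0; i < attempts; ++i) {
      if (m_attempt_rejoin()) break;
      if (i + 1 == attempts) break;  // no sleep after the last attempt
      std::unique_lock<std::mutex> guard(m_lock);
      // Timed wait instead of a plain sleep: returns true when aborted.
      if (m_cond.wait_for(guard, m_wait, [this] { return m_abort; })) break;
    }
    std::lock_guard<std::mutex> guard(m_lock);
    m_rejoin_ongoing = false;
  }

  std::function<bool()> m_attempt_rejoin;
  std::chrono::milliseconds m_wait;
  mutable std::mutex m_lock;
  std::condition_variable m_cond;
  std::thread m_thread;
  bool m_abort = false;
  bool m_rejoin_ongoing = false;
};
```

With a five-minute interval, a plain sleep would delay plugin shutdown by up to that long; the timed wait avoids this at the cost of one extra flag checked under the mutex.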