WL#8446: Group Replication: Implement error detection and sleep routines on donor selection

Affects: Server-5.7   —   Status: Complete

Under the Group Replication project, automatic distributed recovery is
a key element in giving the user a good experience when using the
plugin. The ability to automatically select a donor, and to fail over
to a proper server when an error occurs, in a transparent and
configurable way, is crucial for this purpose.

With the purpose of improving distributed recovery, this worklog
intends to enhance donor selection and to better support errors during
event application, introducing new sleep routines for when something
fails.

FR1: Recovery should detect any failure in the process and either
     recover from it or error out.

FR2: Recovery should execute sleep routines after a donor is not found
     in the group and before a new round of connection attempts
     begins.

FR3: Donor selection in Recovery should be less static, improved by a
     random selection of the next donor.

This worklog aims to improve Group Replication Recovery, which from
here on we shall refer to simply as Recovery, since no other kind of
server/slave recovery is mentioned in this worklog.

Let's start by stating what we don't intend to do here:
1) Improve the selection algorithm with some kind of given metric
   (latency, executed GTIDs, etc)
2) Improve the selection algorithm to ignore, a priori, servers that we
   know will fail as donors (for example, servers whose purged GTIDs
   are needed by the joiner)
3) Ignore possible donors that are already transmitting data to
   another joiner

This worklog will focus on two tasks:

 1) Implement error detection when we are receiving data from a
    donor. As of now, the joiner only detects connection errors
    (nonexistent users, wrong passwords, etc.), but whenever there is
    an error after this point, recovery gets stuck waiting for the
    process to end, not knowing that something bad happened.

 2) Implement proper sleep routines in the donor selection mechanism.
    As of now, only the number of retries is configurable, and whenever
    a connection error occurs we just switch to another donor, never
    sleeping between attempts.

There is, even so, room in the planned changes to the recovery code to
also make the donor selection algorithm less static.

 3) Since we are changing the selection structures we intend to make
    the selection random. This avoids the continuous selection
    of the first member in the group as the group donor.

We will now go through each one of these tasks.

=== Error detection ===

As said above, as of now, whenever a connection to a Donor is
attempted, we wait for it to be established, detecting if an error
occurred in the process. The same does not apply, however, to errors
when applying/receiving events past this phase.
An error can occur when:

 a) There are purged GTIDs in the Donor that the Joiner doesn't have.

 b) For some reason, there is data that the Joiner and the Donor both
 share, but it is logged with different GTIDs. This can happen, for
 example, when some test data still remains in both servers.

In any of these situations, since recovery is not informed of these
failures, the process goes on waiting forever for the final view change
event to come (see WL#6837 for a general view of how recovery works).
Being stuck, recovery also cannot fail over to another donor that
could, in case a), have the needed GTIDs or, in case b), lack the extra
conflicting data, allowing the process to conclude successfully.

To address this challenge, we propose the use of the server applier
hook "applier_stop" to detect these errors.
The algorithm shall be:

Algorithm
=========

*Note: Although the hook that fires whenever the receiver stops is
named "thread_stop", we call it "receiver_stop" here for better
readability.

Recovery running:
   Select donor
   Establish connection to donor (failover on connection errors)
   Register an observer for the "applier_stop" and "receiver_stop" hooks.
   While waiting for the last view change to come:
     If the (applier_stop / receiver_stop) hook fires:
       Something bad happened => donor_channel_thread_error= true
       Recovery loops again and connects to another donor.
     If view change event arrives:
        Un-register the "applier/receiver stop" hooks
        Terminate donor connection.
        Continue with the recovery process.

This way, whenever the recovery applier/receiver thread stops, we have
access to this information and the process can react accordingly.
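
For illustration only, here is a self-contained C++ sketch of the
flag-and-wakeup pattern this algorithm relies on. Standard library
primitives are used purely as stand-ins for the server's own
synchronization objects, and all names are illustrative:

  #include <condition_variable>
  #include <mutex>

  // Stand-in synchronization state for the recovery thread.
  std::mutex recovery_lock;
  std::condition_variable recovery_condition;
  bool donor_channel_thread_error= false;
  bool donor_transfer_finished= false;

  // Invoked from the "applier_stop"/"receiver_stop" hooks.
  void on_channel_thread_stop()
  {
    std::lock_guard<std::mutex> guard(recovery_lock);
    donor_channel_thread_error= true;  // something bad happened
    recovery_condition.notify_all();   // wake the waiting recovery thread
  }

  // Recovery thread: wait until the transfer ends or the channel dies.
  void wait_for_donor_transfer()
  {
    std::unique_lock<std::mutex> lock(recovery_lock);
    recovery_condition.wait(lock, []
    {
      return donor_transfer_finished || donor_channel_thread_error;
    });
    // On donor_channel_thread_error, recovery loops and selects another
    // donor; otherwise it proceeds with the normal flow.
  }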

Note: This hook mechanism can also be used to report errors in the
applier module that are currently ignored, the only visible symptom
being the channel thread shown as OFF in performance schema.


=== Sleep routines ===

The other focus of this worklog is to develop proper sleep routines for
whenever a failure happens and failover kicks in. Currently, the
failover process lacks any kind of wait between connection attempts,
unlike its asynchronous slave connection counterpart. We also address
the retry count here.

The questions to answer on the design are:

1) How many times do we try to connect to a donor?

As of now, the default number of attempts made by a joining member to
find a suitable donor is equal to the number of online members when it
joined the group. The reasoning is that if we tried all possible donors
and none was suitable, it would be pointless to continue.

With this worklog, the intention is to set the default value to a more
familiar number, as the old default was too small when compared to the
similar CHANGE MASTER parameter: MASTER_RETRY_COUNT (86400 tries by
default).

After implementation, this value will still come from the existing
group_replication_recovery_retry_count variable; only its default will
be updated.

Note:
In recovery, the process of failing over to another server is not
comparable to a simple slave re-connection, as there is a pool of
servers available for connection. As such, we can distinguish between
individual connection attempts and full group rotations. The decision
here, even so, is to count every donor connection attempt and not full
group rotations.
We also now count the first connection as an attempt, contrary to the
past where we counted only the re-connections. This means that
retries=1 now implies that recovery only tries to connect once and
never changes donor, which is the current slave behavior. For example,
with three online members and a retry count of 4, recovery may try each
member once, sleep, and then make one final attempt before giving up.


2) How long do we sleep?

For this second question, we again look into the server, where the
slave counterpart is a CHANGE MASTER option: MASTER_CONNECT_RETRY.

The difference for the recovery process, though, is that we don't have
an explicit CHANGE MASTER command, so we must fetch this value from a
new plugin variable:

  group_replication_recovery_reconnect_interval

This variable, with a default of 60 seconds (the same as on CHANGE
MASTER), will then be used to determine how long recovery should sleep
between attempts.
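
As a design detail, this sleep should be interruptible so that aborting
recovery does not block for the whole interval. A self-contained sketch
of that idea, again using standard library primitives as stand-ins for
the server's condition variables (all names are illustrative):

  #include <chrono>
  #include <condition_variable>
  #include <mutex>

  std::mutex recovery_lock;
  std::condition_variable recovery_condition;
  bool recovery_aborted= false;

  // Sleep between rounds of donor connection attempts, waking up early
  // if recovery is aborted in the meantime. reconnect_interval holds
  // the value of the new plugin variable (default: 60 seconds).
  void sleep_before_next_round(unsigned long reconnect_interval)
  {
    std::unique_lock<std::mutex> lock(recovery_lock);
    recovery_condition.wait_for(lock,
                                std::chrono::seconds(reconnect_interval),
                                [] { return recovery_aborted; });
  }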


3) When do we sleep?

Here, two options were possible:
 a) Sleep between all connection attempts.
 b) Sleep only when we have attempted to connect to all group members
    and failed.

We went with b), because there is no reason to wait between
connections: the next donor may not have the problem that affected the
previous, problematic one. This is inherently different from a slave
connection scenario, where we connect over and over to the same server
and should therefore wait for the underlying problem to be resolved.
Here, the reasoning of "the DBA will solve it" only makes sense when
all group options are exhausted, so we only wait after that.


Algorithm
=========

A generic view of the recovery algorithm and the sleep routines can be
expressed as:

 recovery_retry_count = get_server_retry_count_value()
 Recovery running:
   connected= false;
   error= false
   suitable_donors= group.get_online_members();

 connection loop:

   while (!connected && !error)
     if (recovery_retry_count == 0)
       throw "No more tries!"
       return error
     if (suitable_donors.is_empty())
       suitable_donors= group.get_online_members();
       if (suitable_donors.is_empty())
         throw "No available donors"
         return error
       else
         sleep(recovery_reconnect_interval)

     possible_donor= suitable_donors.pop()
     recovery_retry_count--
     connected= attempt_to_connect(possible_donor)


=== Random donor selection ===

The only change needed to guarantee a random selection of the donor
server is to ensure a random order in the suitable donor list.
This can be done with a random shuffle when we pick the group members
and select the viable ones that are then copied into another structure.

This change, however, implies a change to every MTR test that makes
assumptions about the donor selection order. This is addressed below in
the low level design.

# For error detection #
=======================

New File:
channel_state_observer.cc/h
===========================

:: New class ::

  Channel_observation_manager

A class where we can register observers for channel state events.

 Methods ::

/*
  Initialize the class with plugin info for hook registration
*/
initialize(MYSQL_PLUGIN plugin_info)

/*
  A method to register observers to the events that come from the
  server.
*/
register_channel_observer(Channel_state_observer* observer)

/*
  A method to remove observers.
*/
unregister_channel_observer(Channel_state_observer* observer)
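
As an illustration only, a rough usage sketch of this manager from a
client module such as recovery. The pointer-based signature and the
observer subclass (Recovery_channel_state_observer, defined below) are
taken from the rest of this worklog; everything else is illustrative:

  // Hypothetical usage of the Channel_observation_manager API above.
  Channel_observation_manager manager;
  manager.initialize(plugin_info);  // plugin_info: the MYSQL_PLUGIN handle

  Channel_state_observer* observer= new Recovery_channel_state_observer();
  manager.register_channel_observer(observer);

  // ... the donor connection runs and hooks fire on the observer ...

  manager.unregister_channel_observer(observer);
  delete observer;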


:: New class ::

  Channel_state_observer

This is a class with abstract methods that recovery and the applier,
for example, can implement to listen to events on the server channels.
It can listen to the Binlog_relay_IO_observer events:

  thread_start
  thread_stop
  applier_stop
  before_request_transmit
  after_read_event
  after_queue_event
  after_reset_slave

 Methods ::

virtual int thread_start();

...
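
A possible C++ shape for this interface, derived directly from the
event list above, is sketched next. The int return type follows the
existing thread_start() declaration; making the methods pure virtual
and omitting the hook parameters (channel information, event buffers,
etc.) are simplifications of this sketch:

  // One abstract method per Binlog_relay_IO_observer event listed above.
  class Channel_state_observer
  {
  public:
    virtual ~Channel_state_observer() {}

    virtual int thread_start()= 0;
    virtual int thread_stop()= 0;
    virtual int applier_stop()= 0;
    virtual int before_request_transmit()= 0;
    virtual int after_read_event()= 0;
    virtual int after_queue_event()= 0;
    virtual int after_reset_slave()= 0;
  };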


Updated File:
recovery.cc/h
===========================

:: New fields ::

donor_channel_thread_error - reports if the recovery applier/receiver thread is dead

:: New class ::

  Recovery_channel_state_observer: public Channel_state_observer

This class will implement the recovery reaction to applier stops

 Class fields ::

thread_id - the thread id of the established channel

 Methods ::

int applier_stop()
{
  donor_channel_thread_error= true;
  terminate_recovery_slave_threads();
  awake the recovery_condition condition
}

:: Updated logic ::

On establish_donor_connection():

  start_recovery_donor_threads
  + Recovery_channel_state_observer* observer= new
  +    Recovery_channel_state_observer()
  + Channel_observation_manager.register_channel_observer(observer);

On terminate_recovery_slave_threads():

  + Channel_observation_manager.unregister_channel_observer(observer);
  donor_connection_interface.stop_threads

on recovery_thread_handle()
   //wait until..
   while (!donor_transfer_finished &&   // the transfer is over
          !recovery_aborted &&          // recovery was aborted
          !on_failover &&               // switching to another donor
         +!donor_channel_thread_error)  // connection applier/receiver died
   {
      //Wait for completion
      condition_wait(recovery_condition)
   }



# For sleep routines #
======================

Updated File:
plugin.cc/h
============

:: New Sys Var ::

  group_replication_recovery_reconnect_interval

Variable created to allow the user to define the sleep time between
rounds of donor connection attempts.
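
A hedged sketch of how this option could be declared in plugin.cc,
assuming the standard MYSQL_SYSVAR_ULONG plugin macro; the storage
variable name, flags and upper bound shown here are illustrative, not
final values:

  #include <mysql/plugin.h>

  /* Storage for the new option; the name is illustrative. */
  static unsigned long recovery_reconnect_interval_var= 60;

  static MYSQL_SYSVAR_ULONG(
    recovery_reconnect_interval,      /* exposed with the plugin prefix */
    recovery_reconnect_interval_var,  /* variable holding the value */
    PLUGIN_VAR_OPCMDARG,              /* value is optional on command line */
    "Sleep time, in seconds, between rounds of donor connection attempts",
    NULL,                             /* check function */
    NULL,                             /* update function */
    60,                               /* default: 60 seconds */
    0,                                /* minimum */
    31536000,                         /* maximum (illustrative: one year) */
    0);                               /* block size */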


Updated File:
recovery.cc/h
===========================

:: New fields ::

recovery_reconnect_interval - the time to wait after attempting all
                              members

suitable_donors             - queue of suitable group members that is
                              emptied as we try to connect

:: Removed field ::

rejected_donors             - replaced by suitable_donors.


:: Updated logic ::

On start_recovery and update_recovery:

+ suitable_donors= group.get_online_members();

On select_donor(), implement the algorithm presented above:

   while (!connected && !error)
     if (recovery_retry_count == 0)
       throw "No more tries!"
       return error
     if (suitable_donors.is_empty())
       suitable_donors= group.get_online_members();
       if (suitable_donors.is_empty())
         throw "No available donors"
         return error
       else
         sleep(recovery_reconnect_interval)

     possible_donor= suitable_donors.pop()
     recovery_retry_count--
     connected= attempt_to_connect(possible_donor)
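
A possible C++ shape for select_donor(), sketched under the assumption
that suitable_donors is kept as a vector of member references. The
member names mirror the fields listed in this worklog; the logging
calls and the Group_member_info type are illustrative stand-ins:

  bool Recovery_module::select_donor()
  {
    bool connected= false;
    bool error= false;

    while (!connected && !error)
    {
      if (recovery_retry_count == 0)
      {
        log_error("No more donor connection attempts left");  // stand-in
        error= true;
        break;
      }
      if (suitable_donors.empty())
      {
        // A full round over the group failed: refresh the list and, if
        // there still are online members, sleep before the next round.
        suitable_donors= group.get_online_members();
        if (suitable_donors.empty())
        {
          log_error("No suitable donors available in the group");
          error= true;
          break;
        }
        sleep_before_next_round(recovery_reconnect_interval);
      }

      Group_member_info* possible_donor= suitable_donors.back();
      suitable_donors.pop_back();
      recovery_retry_count--;
      connected= establish_donor_connection(possible_donor);
    }
    return error;
  }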


# Random donor selection #
==========================

Updated File:
recovery.cc/h
==============

:: Updated logic ::

On start_recovery and update_recovery:

Implement this in a way that guarantees a random member order.
+ suitable_donors= group.get_online_members();

std::random_shuffle (or, preferably, std::shuffle with an explicit
random engine) is an option to consider when implementing this.
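
A minimal sketch of that shuffling step, assuming the donor list is
kept in a std::vector; the Group_member_info type and the
get_online_members() call are stand-ins for whatever the plugin
actually uses:

  #include <algorithm>
  #include <random>
  #include <vector>

  class Group_member_info;  // stand-in forward declaration

  // Copy the online members into the suitable donor list and randomize
  // their order, so the first group member is not always picked as donor.
  std::vector<Group_member_info*>
  build_suitable_donors(const std::vector<Group_member_info*>& online_members)
  {
    std::vector<Group_member_info*> suitable_donors(online_members);
    std::random_device seed;
    std::mt19937 generator(seed());
    std::shuffle(suitable_donors.begin(), suitable_donors.end(), generator);
    return suitable_donors;
  }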

Updated File:
group_replication_recovery_donor_failover.test
==============================================

This test starts with a group containing server 1 and server 2.
When server 3 joins, the test assumed that server 1 would be selected
as donor, killing it to prove that recovery still goes smoothly. With
random selection, server 2 can now be the donor as well.

Decision: Use the recovery channel to determine the donor uuid:
SELECT source_uuid FROM performance_schema.replication_connection_status
  WHERE channel_name='group_replication_recovery' AND service_state='ON'

With this uuid we compare against server 1's and server 2's uuids to
see which one is the donor.

Proto algorithm:

  --connect server2
  server2_uuid = get_uuid

  donor_uuid= SELECT source_uuid...

  #default case
  donor_id= 1

  if(server2_uuid == donor_uuid)
    donor_id= 2

  --connect server$donor_id
  stop group replication
  ....

Other test changes:
==============

Every test that is based on the joiner failing to connect to the donor
and then waiting for the joiner to become OFFLINE will now need changes
to set the number of retries to a low value that makes this process
fast.

Code refactor:
recovery_state_transfer.cc/h
============================

Much of the code that relates to the state transfer process can be
moved to a separate file, isolating this complex component from the
global process.