WL#9050: Group Replication: MySQL GCS majority loss handling integration

Affects: Server-5.7   —   Status: Complete

MySQL Group Replication provides multi-master, update-everywhere
replication for MySQL.
Clients can connect to any group server and, after conflict
detection, write changes that are propagated to all group
members.

The multi-master behaviour is enabled by a group communication
system, XCom, which requires consensus among group members in
order to agree on which messages are delivered to all group
members and in which order.

Group communication consensus requires a majority of group members
to agree on a given decision in order to fulfil its correctness and
liveness properties.
When that majority of group members is lost, the group is unable to
progress and blocks, to avoid breaking the correctness property.

The group blocks in scenarios such as the following:
 1) In a 5-member group, 3 members crash.
 2) A split brain: a 6-member group is split into two groups of 3
    members each due to a network partition.
In both scenarios the following rule no longer holds - the number of
online members (O) must be higher than half of the group members (M):
    O > M/2
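
The rule can be written as a small predicate. This is only an
illustrative sketch and not part of the patch; the function name
has_majority is hypothetical. It compares 2*O with M instead of
dividing, so that integer rounding cannot hide a tie.

```cpp
#include <cassert>
#include <cstddef>

// Returns true when `online` members form a strict majority of a group
// configured with `total` members, that is, when O > M/2.
// The comparison is written as 2*O > M to avoid integer division.
bool has_majority(std::size_t online, std::size_t total)
{
  assert(online <= total);  // cannot have more online members than configured
  return 2 * online > total;
}
```

With scenario 1 (2 of 5 online) and scenario 2 (3 of 6 online in each
partition) the predicate is false, so the group blocks.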

When a situation like this happens, the group is unable to heal
itself and requires manual intervention to decide whether the group
should be reconfigured to only consider the online members (1) or
which partition should be considered the primary one (2).
A DBA, through a new option group_replication_force_peer_addresses,
will be able to reconfigure a blocked group to a subset of its
members.

FR-1: One must be able to unblock a group that loses majority,
      through the group_replication_force_peer_addresses option
      on one ONLINE member.
FR-2: One must be able to set group members in a way that a
      majority is not needed.
FR-3: The DBA will provide a complete list of the members that
      she/he wants to belong to the reconfigured group.
FR-4: The DBA will need to manually shut down the members not
      included in the new group membership.
FR-5: The members included in the new group membership must
      be fully functional.
FR-6: The members included in the new membership that are in the
      RECOVERING state will fail over to another donor if the
      current donor is not present in the new membership.
FR-7: It must be possible to set an empty value on the
      group_replication_force_peer_addresses option, to clear
      its value.
FR-8: Setting the group_replication_force_peer_addresses option
      to any non-empty value on a member that is not ONLINE will
      return an error.
FR-9: When the group_replication_force_peer_addresses option is
      already set before Group Replication starts, for example
      when set in the configuration file, the start will error
      out.

1. Context

In the context of Group Communication, the technology under the hood
of Group Replication, a typical scenario that might happen is the
loss of majority in a group. This is highly relevant, since a
typical Group Communication System (GCS) runs a consensus
algorithm to make decisions among the group members. Without a
majority, no decisions can be made and the group will block for any
operation.

This can have different causes:
 * Catastrophic failure of several members;
 * A split brain scenario in which none of the partitions retains
   a majority to proceed.

Like any other GCS toolkit, and all the more so because it can be
backed by heterogeneous types of GCSs underneath, MySQL GCS must
provide a means to unblock this situation should it happen. In the
particular case of the default implementation, XCom, its Paxos
nature makes it block naturally if a majority is not reached for a
proposed message.

This action path must be enabled at the Group Replication level,
that is, the DBA can force a blocked group to be reconfigured to a
subset of its members, by setting the
group_replication_force_peer_addresses option, on one of the alive
members of the group, to the new list of members.
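
The option value is a comma-separated list of host:port addresses.
As a rough sketch of how a tool could assemble that value from the
members chosen to survive, consider the helper below. The function
name build_force_peer_addresses is hypothetical and is not part of
the plugin.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Joins the "host:port" addresses of the surviving members into the
// comma-separated form that the option takes.
std::string build_force_peer_addresses(
    const std::vector<std::string>& survivors)
{
  std::string value;
  for (const std::string& address : survivors)
  {
    assert(!address.empty());  // every member needs a host:port address
    if (!value.empty())
      value += ",";
    value += address;
  }
  return value;
}
```

For survivors 192.168.0.1:10000 and 192.168.0.2:10000 this yields
"192.168.0.1:10000,192.168.0.2:10000".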

Example:
  Group with 5 members:
    192.168.0.1:10000,192.168.0.2:10000,192.168.0.3:10000,
    192.168.0.4:10000,192.168.0.5:10000

  Three members crash:
    192.168.0.3:10000,192.168.0.4:10000,192.168.0.5:10000

  The DBA can force the group to shrink to:
    192.168.0.1:10000,192.168.0.2:10000
  and unblock it.

  SQL command:
    SET GLOBAL group_replication_force_peer_addresses=
      "192.168.0.1:10000,192.168.0.2:10000";

2. Building blocks

Let us consider a system that is configured with 5 members, of which
3 crash in a catastrophic way, leaving the system:
 * Able to detect the crash, but;
 * Without means to propose a new configuration, since it does not
   hold a majority of the configured members.

2.1 New functionality

Currently, any client of the MySQL GCS API, like Group
Replication, only has access to the dynamic part of a group, the
Control interface, through which one can join, leave and receive
notifications about the status of a Group. One lacks the notion of a
Configuration, meaning what was planned for a certain Group. This
would allow a clear distinction between the Plan, e.g., "I want a
group with 6 members", and the State, e.g., "my group has 3 nodes
alive and 3 nodes down".

MySQL GCS makes available to its clients a new interface, called
Configuration Management. For now it has a single operation, which
allows group reconfiguration, to be used in this type of scenario.

This new functionality will be accessible in Group Replication
through the group_replication_force_peer_addresses option.

2.2 Behaviour/Use Case

This functionality is meant to be used when the group is already
stalled, unable to accomplish anything. One must go to one of the
members in the part of the group that should stay alive and set the
group_replication_force_peer_addresses option, injecting into the
group the configuration that we want it to have, in the form of a
list of the group members to be considered alive. This is a way of
creating an implicit Primary Partition, allowing us to unblock the
situation.

The list of members that should belong to the unblocked version of the
group is entirely the responsibility of the DBA. This means that
intervention is needed to determine which members are best suited to
proceed with group operations. It is not an automated operation.

When this operation is performed, one must make sure that no other
group related operation, such as a join or a leave, is running at the
same time in:
 * The member from which the reconfiguration operation is triggered;
 * Any member that we decide to include in the new configuration.

A restriction that must apply is that, when one reconfigures the
group, it shall not be possible to add new members. The operation
must be done only using nodes that were present in the configuration
that stalled.
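
The restriction amounts to a subset check. The sketch below is only
an illustration of that check; the function name is hypothetical and
the patch in this worklog does not show such a validation at the
plugin level. It also treats an empty proposal as invalid, on the
assumption that a forced membership needs at least one member.

```cpp
#include <cassert>
#include <set>
#include <string>

// Returns true when every member of `proposed` was already part of
// the configuration that stalled, so that no new member is added.
bool is_subset_of_stalled_configuration(
    const std::set<std::string>& stalled,
    const std::set<std::string>& proposed)
{
  assert(!stalled.empty());  // a stalled group had at least one member
  if (proposed.empty())
    return false;
  for (const std::string& member : proposed)
  {
    if (stalled.count(member) == 0)
      return false;  // `member` was not in the stalled configuration
  }
  return true;
}
```

For a stalled configuration {A,B,C,D,E,F}, the proposal {D,E,F} is
accepted and {E,F,G} is rejected because G is a new member.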

Whoever triggers this must also bear in mind that the members left
out of the configuration need to be acted upon, by taking them down
or shutting down the service. In a scenario in which the members
were dropped because of a split, they might come back and attempt
operations that they should not perform, like joining a group or
sending old data through the network.

Members that are not present in the new configuration will remain
blocked and will not be able to proceed with either sending messages
or regular configuration modifications.

2.3 Crash Scenarios

Several crash scenarios can be handled by this reconfiguration
procedure. The following sections describe some of them and how one
should proceed.

2.3.1 Crash-and-recover scenario

The simplest scenario is an actual crash. Consider members A to F.
A, B and C crash. One must go to either D, E or F and inject a new
configuration containing D, E and F.

2.3.2 Split Brain Scenario

Consider the same set of members A to F. At some point in time, a
network partition occurs: {A,B,C} and {D,E,F}. The DBA decides that
{D,E,F} are the eligible members to continue. She/he proceeds the
same way as in the previous scenario and injects {D,E,F} in
any of those members.

Above all, she/he must make sure that all members that belonged to
the old configuration but were left out of the new one are indeed
offline and cannot reach their old counterparts. But she/he can't
just call STOP GROUP_REPLICATION since the member is blocked.
Instead she/he needs to stop the process as a whole, using one of
the following alternatives:
 A1: 1. Kill all client connections to that member;
     2. Remove clients from that member;
     3. Shut down the server.
 A2: 1. Crash the server.

2.3.3 Partial Group Definition

Consider the same set of members A to F. At some point in time, a
network partition occurs or a crash happens: {A,B,C} are down somehow
and {D,E,F} are deemed the best candidates to proceed. But the DBA,
for some reason, wants to exclude D and proceed with E and F.

One must inject the new configuration in either E or F and, like in
the previous scenario, make sure that A, B, C and D are not
reachable.
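
The members that the DBA has to take down are the difference between
the stalled configuration and the injected one. A minimal sketch of
that computation follows; the function name members_to_shut_down is
hypothetical and nothing like it exists in the patch.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <string>

// Returns the members of the stalled configuration that are left out
// of the forced one. These are the members the DBA must shut down.
std::set<std::string> members_to_shut_down(
    const std::set<std::string>& stalled,
    const std::set<std::string>& forced)
{
  // The forced membership may only contain previously known members.
  assert(std::includes(stalled.begin(), stalled.end(),
                       forced.begin(), forced.end()));

  std::set<std::string> excluded;
  std::set_difference(stalled.begin(), stalled.end(),
                      forced.begin(), forced.end(),
                      std::inserter(excluded, excluded.end()));
  return excluded;
}
```

For the stalled configuration {A,B,C,D,E,F} and the forced membership
{E,F}, the result is {A,B,C,D}.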

3. Future Work

3.1 Future work in MySQL GCS and Group Replication

In the near future, one must consider a full-fledged solution for
Group Management, in which a client, like Group Replication,
provides a group, representing a static plan of what we envision
the members to be, and the View would be the current status
of that planned group.

In case of failures or split brain, and working together with a
Failure Detector provided by the MySQL GCS implementation, we could
then augment this unblocking feature by just going to the partition
that we want to promote and stating "unblock the members that are
online in this partition", instead of enumerating all members that
we want to be active.

4. Side effects

Since the new group membership is injected through the
group_replication_force_peer_addresses option, this action has side
effects both on blocked and on fully operational groups.
In both situations, setting group_replication_force_peer_addresses
will change the group membership. The members not included in the
new group membership will not receive a new view and will be
blocked, that is, they will not receive or send any data from/to
the new group.
The DBA will need to kill the excluded servers.

5. Documentation heads up

The manual must be clear that the members excluded from a new group
membership injected through the group_replication_force_peer_addresses
option must be killed manually by the DBA.
The excluded members will be blocked, but a future group
membership reconfiguration may allow them into the group again,
which is unsafe since their data is outdated.

SUMMARY OF CHANGES
==================

 1. New group_replication_force_peer_addresses option added.
    When it is set on an ONLINE member, its value is injected
    into the group communication layer as its new membership.
    When it is set on an OFFLINE or RECOVERING member, an error
    is returned.
    An empty value is always accepted and only clears the option
    value.
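
The three rules above can be summarised as a small decision function.
This is a simplified model for illustration only: the enum and
function names are hypothetical, and the actual check function in
plugin.cc additionally logs distinct error messages and waits for the
injected view to be delivered.

```cpp
#include <cassert>
#include <string>

// Simplified member states, named after the states used in this
// document.
enum class Member_state { MEMBER_OFFLINE, MEMBER_RECOVERING, MEMBER_ONLINE };

// What setting the option results in.
enum class Check_outcome { CLEAR_VALUE, INJECT_MEMBERSHIP, REJECT };

// Models the outcome of setting the option to `value` on a member
// that is currently in `state`.
Check_outcome classify_force_peer_addresses(Member_state state,
                                            const std::string& value)
{
  // An empty value is always accepted and only clears the option.
  if (value.empty())
    return Check_outcome::CLEAR_VALUE;

  // A non-empty value is injected only on an ONLINE member.
  if (state == Member_state::MEMBER_ONLINE)
    return Check_outcome::INJECT_MEMBERSHIP;

  // OFFLINE and RECOVERING members return an error.
  return Check_outcome::REJECT;
}
```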


SUMMARY OF CODE CHANGES
=======================
--- a/rapid/plugin/group_replication/include/gcs_event_handlers.h
+++ b/rapid/plugin/group_replication/include/gcs_event_handlers.h
@@ -71,6 +71,21 @@ public:
   void start_view_modification();
 
   /**
+    Signals that a injected view modification, like unblock a
+    group that did lost majority, is about to start.
+   */
+  void start_injected_view_modification();
+
+  /**
+    Checks if the view modification is a injected one.
+
+    @return
+      @retval true  if the current view modification is a injected one
+      @retval false otherwise
+   */
+  bool is_injected_view_modification();
+
+  /**
     Signals that a view modification has ended
   */
   void end_view_modification();
@@ -96,6 +111,7 @@ public:
 private:
   bool view_changing;
   bool cancelled_view_change;
+  bool injected_view_modification;
 
   mysql_cond_t wait_for_view_cond;
   mysql_mutex_t wait_for_view_mutex;


--- a/rapid/plugin/group_replication/src/gcs_event_handlers.cc
+++ b/rapid/plugin/group_replication/src/gcs_event_handlers.cc
@@ -190,6 +190,10 @@ Plugin_gcs_events_handler::on_view_changed(const Gcs_view& new_view,
 
   //Handle joining members
   this->handle_joining_members(new_view, is_joining, is_leaving);
+
+  //Signal that the injected view was delivered
+  if (view_change_notifier->is_injected_view_modification())
+    view_change_notifier->end_view_modification();
 }
 
 void Plugin_gcs_events_handler::update_group_info_manager(const Gcs_view& new_view,
@@ -618,7 +622,8 @@ Plugin_gcs_events_handler::check_compatibility_with_group() const
 }
 
 Plugin_gcs_view_modification_notifier::Plugin_gcs_view_modification_notifier()
-  :view_changing(false), cancelled_view_change(false)
+  :view_changing(false), cancelled_view_change(false),
+   injected_view_modification(false)
 {
 
 #ifdef HAVE_PSI_INTERFACE
@@ -652,10 +657,30 @@ Plugin_gcs_view_modification_notifier::start_view_modification()
   mysql_mutex_lock(&wait_for_view_mutex);
   view_changing= true;
   cancelled_view_change= false;
+  injected_view_modification= false;
   mysql_mutex_unlock(&wait_for_view_mutex);
 }
 
 void
+Plugin_gcs_view_modification_notifier::start_injected_view_modification()
+{
+  mysql_mutex_lock(&wait_for_view_mutex);
+  view_changing= true;
+  cancelled_view_change= false;
+  injected_view_modification= true;
+  mysql_mutex_unlock(&wait_for_view_mutex);
+}
+
+bool
+Plugin_gcs_view_modification_notifier::is_injected_view_modification()
+{
+  mysql_mutex_lock(&wait_for_view_mutex);
+  bool result= injected_view_modification;
+  mysql_mutex_unlock(&wait_for_view_mutex);
+  return result;
+}
+
+void
 Plugin_gcs_view_modification_notifier::end_view_modification()
 {
   mysql_mutex_lock(&wait_for_view_mutex);


--- a/rapid/plugin/group_replication/src/plugin.cc
+++ b/rapid/plugin/group_replication/src/plugin.cc
@@ -54,6 +54,7 @@ Read_mode_handler *read_mode_handler= NULL;
 char *gcs_engine_var;
 char *local_address_var;
 char *peer_addresses_var;
+char *force_peer_addresses_var;
 my_bool bootstrap_group_var= false;
 
 //The plugin auto increment handler
@@ -1434,6 +1435,92 @@ static void update_auto_increment_increment(MYSQL_THD thd, SYS_VAR *var,
   DBUG_VOID_RETURN;
 }
 
+//Communication layer options.
+
+static int check_force_peer_addresses(MYSQL_THD thd, SYS_VAR *var,
+                                      void* save,
+                                      struct st_mysql_value *value)
+{
+  DBUG_ENTER("check_force_peer_addresses");
+
+  char buff[STRING_BUFFER_USUAL_SIZE];
+  const char *str= NULL;
+  (*(const char **) save)= NULL;
+
+  int length= sizeof(buff);
+  if ((str= value->val_str(value, buff, &length)))
+    str= thd->strmake(str, length);
+  else
+    DBUG_RETURN(1);
+
+  // If option value is empty string, just update its value.
+  if (length == 0)
+    goto update_value;
+
+  if (gcs_module == NULL || !gcs_module->is_initialized())
+  {
+    log_message(MY_ERROR_LEVEL,
+                "Member is OFFLINE, it is not possible to force a "
+                "new group membership");
+    DBUG_RETURN(1);
+  }
+
+  if (local_member_info->get_recovery_status() == Group_member_info::MEMBER_ONLINE)
+  {
+    string group_id_str(group_name_var);
+    Gcs_group_identifier group_id(group_id_str);
+    Gcs_group_management_interface* gcs_management=
+        gcs_module->get_management_session(group_id);
+
+    if (gcs_management == NULL)
+    {
+      log_message(MY_ERROR_LEVEL,
+                  "Error calling group communication interfaces");
+      DBUG_RETURN(1);
+    }
+
+    view_change_notifier->start_injected_view_modification();
+
+    Gcs_interface_parameters gcs_module_parameters;
+    gcs_module_parameters.add_parameter("peer_nodes",
+                                        std::string(str));
+    enum_gcs_error result=
+        gcs_management->modify_configuration(gcs_module_parameters);
+    if (result != GCS_OK)
+    {
+      log_message(MY_ERROR_LEVEL,
+                  "Error setting group_replication_force_peer_addresses "
+                  "value '%s' on group communication interfaces", str);
+      DBUG_RETURN(1);
+    }
+    log_message(MY_INFORMATION_LEVEL,
+                "Set group_replication_force_peer_addresses value '%s' "
+                "into group communication interfaces", str);
+     Wait_for_view_modification_result wait_for_view_modification_result=
+       view_change_notifier->wait_for_view_modification(VIEW_MODIFICATION_TIMEOUT);
+    if (wait_for_view_modification_result.first)
+    {
+      log_message(MY_ERROR_LEVEL,
+                  "Timeout on wait for view after setting "
+                  "group_replication_force_peer_addresses value '%s' "
+                  "into group communication interfaces", str);
+      DBUG_RETURN(1);
+    }
+  }
+  else
+  {
+    log_message(MY_ERROR_LEVEL,
+                "Member is not ONLINE, it is not possible to force a "
+                "new group membership");
+    DBUG_RETURN(1);
+  }
+
+update_value:
+  *(const char**)save= str;
+
+  DBUG_RETURN(0);
+}
+
 //Base plugin variables
 
 static MYSQL_SYSVAR_STR(
@@ -1483,6 +1570,18 @@ static MYSQL_SYSVAR_STR(
   NULL,                                       /* update func*/
   "");                                        /* default*/
 
+static MYSQL_SYSVAR_STR(
+  force_peer_addresses,                       /* name */
+  force_peer_addresses_var,                   /* var */
+  PLUGIN_VAR_OPCMDARG | PLUGIN_VAR_MEMALLOC,  /* optional var | malloc string*/
+  "The list of peer addresses, comma separated. E.g., host1:port1,host2:port2. "
+  "This option is used to inject a new group membership, being the excluded "
+  "members from the new membership expelled from the group. The expelled "
+  "members will be blocked and do need to be shutdown by the DBA.",
+  check_force_peer_addresses,                 /* check func*/
+  NULL,                                       /* update func*/
+  "");                                        /* default*/
+
 static MYSQL_SYSVAR_BOOL(
   bootstrap_group,                            /* name */
   bootstrap_group_var,                        /* var */
@@ -1694,6 +1793,7 @@ static SYS_VAR* group_replication_system_vars[]= {
   MYSQL_SYSVAR(gcs_engine),
   MYSQL_SYSVAR(local_address),
   MYSQL_SYSVAR(peer_addresses),
+  MYSQL_SYSVAR(force_peer_addresses),
   MYSQL_SYSVAR(bootstrap_group),
   MYSQL_SYSVAR(recovery_retry_count),
   MYSQL_SYSVAR(recovery_use_ssl),