WL#9426: Group Replication: Single Primary Mode

Affects: Server-5.7   —   Status: Complete

Executive Summary
=================

This worklog implements a configuration mode for group replication
that makes a single server act as a writable master (PRIAMRY) and
the rest of the servers act as hot-standbys (SECONDARY). The group
itself coordinates automatically to figure out which is the server
that will act as the PRIMARY, through a simple leader election
mechanism (e.g., grabbing the alive server with the lowest ID, in
lexicographycal order, whenever there is a view change).

NOTE:

  This worklog does NOT cover:
  - facility for the user to trigger a new primary election
  - facility for the user to specify that a server is now the new
    primary

  This should be covered in another worklog, since operations-wise, it
  would be good if the user is able to override the system and state
  which is the primary node.

User Stories
============

- As a MySQL DBA I want to set the Replication group in a mode where only
  one server can accept updates so that I prevent multi-site concurrency
  issues and thus on-commit aborts.

- As a MySQL DBA I want to set the Replication group in a mode where only
  one server can accept updates so that I am able to make use cascading
  foreign keys with more than one level and serializable isolation mode.

- As a MySQL DBA I want to the Replication group in a mode where only
  one server can accept updates so that I am able to run DDL through the
  group in a more controlled way, i.e., without having to orchestrate
Functional Requirements
=======================

FR1. The user SHALL be able to turn ON/OFF update everywhere extra
     checks, but only when the Group Replication plugin is stopped.

FR2. When the update_everywhere_extra_checks is disabled, the user
     SHALL be able to run transactions with isolation level
     SERIALIZABLE.

FR3. The user SHALL be able to turn ON/OFF single_primary_mode, but
     only when the Group Repication plugin is stopped.

FR4. When the single_primary_mode is set, the first server to join
     the group SHALL be the PRIMARY server. All other servers that
     join afterwards SHALL be SECONDARIES.

FR5. All SECONDARIES SHALL have their super_read_only variable set
     to ON after joining the group and being set as SECONDARY.

FR6. SHOULD the user set super_read_only to OFF when the plugin sets
     it to ON (i.e., after classifying it as SECONDARY) then the
     user is overriding the plugin and SHALL not be guaranteed that
     read only properties will hold. Moreover, at a later stage,
     the plugin may reset the super_read_only back to ON.

FR7. SHOULD a network partition happen in which the primary is put
     on the partition without majority, then servers cannot agree
     on whether the server is still primary or not. However, since
     there is no quorum the system will stop for that partition.

     The partition with majority SHALL elect a new primary.

FR8. Whenever a server is promoted to PRIMARY, its super_read_only
     variable SHALL be set to OFF automatically and unconditinally.

FR9. Whenever a server is set as SECONDARY, its super_read_only
     variable SHALL be set to ON automatically and unconditinally.

Non-functional Requirements
===========================

NFR1. The fact that with many servers serving requests shows us that
      we have some scalability improvement, suggests that a single
      server SHALL be a scalability bottleneck, even if just RW
      transactions are routed to it.

NFR2. There SHALL not be any performance impact on replication
      throughput by enabling/disabling
High Level Specification
========================

Motivation
----------

The MySQL Group Replication plugin brings multi-master update
everywhere replication support for MySQL. This means that a
transaction first executes optimistically in a member of the group and
on commit it is synchronized with the group to verify that it does not
conflict with other transactions that had been executing concurrently
on other members.  If it conflicts, then the transaction is rolled
back.

The problem with the MySQL Server is that it has a lot of shortcomings:

- DDL is not transactional, thence it cannot be certified or rolled
  back;
- Architecture of the server puts the storage engine too far away from
  the execution and replication layers, meaning that the SE does
  things that upper layers may be oblivious to (e.g., foreign keys
  cascading, or different lock sets, ...);
- Hierarchical locking (MDL and storage engine locking).

These shortcomings undermine the ability to do completely safe
multi-master update everywhere, unless the user is told a set of
guidelines to cope with some restrictions. This works fine for a lot
of users. However, some users don't really care about the update
everywhere feature, but are interested in the split brain protection,
commit quorum and membership framework.

Running the write workload through a single server helps mitigating
the above issues, since execution and coordination is performed on a
single server. It will not help with the fact that DDL is not
transactional, but helps with the fact that one confines to a single
server the inconsistencies and hidden interactions between storage
engine and server layers. This makes things slightly safer, since
transactions are synchronized on a single server and thus there will
not be cross-server execution conflicts to detect and deal with.

In addition to what was outlined above, other internal project wants
to use Group Replication in this single master mode. And this is our
main motivation driver.

Approach
--------

The design decision is that we implement this at the Group Replication
layer only in this worklog. This is the minimum one can do to
implement this functionality. In the optimal case, in a single leader
mode, we could bypass parts of the replication pipeline and even
optimize XCom to have a single proposer instead of multiple
proposers. However, that is huge surgery and needs to be done in a
step-wise approach. Again, this worklog is just to have a single
leader mode in group replication from the get go.

Moreover, we are not implementing any new option for the user to
trigger an election in this worklog or specifying which is the primary
directly since this would bloat the work. Also, if we are thinking
big, we should have election policies, that could be modified and then
a way to trigger an election. This means that we need a proper
metadata model in the group -- perhaps with persistence. This is
already quite a bit of work to do and that would present high
risk. There is nothing preventing us from doing it at a later stage
though.

Now, the details! To implement a single master mode, the approach is
the following:

SERVER ROLES. We will start by defining a role for servers in the group.
We start by classifying servers as PRIMARY, SECONDARY. At any logical
point in time, there is only one PRIMARY and zero or more SECONDARIES.

CONFIGURATION. We will add a couple of options to the group replication
plugin. One that enables single master mode and another that disables
multi-master update everywhere checks. These options need to be set on
all servers in the group. It enables/disables automatic functionality
such as leader election.

LEADER ELECTION. When there is a view change, if the PRIMARY goes
away, a new PRIMARY needs to be elected from the SECONDARIES pool. Only
SECONDARIES that are in the ONLINE recovery state have a chance to be
elected. If there are no valid SECONDARY candidates no candidate is
elected.

READONLY STATE. When the single master option is turned on, then a
single server, the PRIMARY, will have super_read_only disabled. All
other servers the SECONDARIES will have super_read_only enabled. Once
a SECONDARY is promoted to PRIMARY, then it disables super_read_only.

User Interface
--------------

There will be a couple of new system variable:
- group_replication_single_primary_mode
- group_replication_enforce_update_everywhere_checks
and a new status variable** (group_replication_member_role).

SYS_VAR: group_replication_single_primary_mode
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

These are the proposed system variables:

  NAME: single_primary_mode
  TYPE: BOOL
  DEFAULT: TRUE
  DYNAMIC: YES, but can only be changed while GR is stopped.
  SCOPE: GLOBAL
  DESCRIPTION: Activates the automatic election. This means that there
               will only be a primary server that takes incoming updates
               and all others in the group will be secondaries. On
               primary failures, the group will elect the new primary
               from the pool of secondaries. Default: TRUE.

SYS_VAR: group_replication_enforce_update_everywhere_checks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  NAME: enforce_update_everywhere_checks
  TYPE: BOOL
  DEFAULT: FALSE
  DYNAMIC: YES, but can only be changed while GR is stopped.
  SCOPE: GLOBAL
  DESCRIPTION: Enable/Disable strict consistency checks for multi-master
  update everywhere. Default: FALSE.

System Variables Behavior
~~~~~~~~~~~~~~~~~~~~~~~~~

These variable options can be combined in the following way:

E - enforce_update_everywhere_checks
S - single_primary_mode

 |--------+--------------+--------------+--------------+--------------|
 |        | S: OFF -> ON | S: ON -> OFF | E: OFF -> ON | E: ON -> OFF |
 |--------+--------------+--------------+--------------+--------------|
 | E: ON  | ERROR        | IMPOSSIBLE   | ----         | ----         |
 | E: OFF | SUCCESS      | SUCCESS      | ----         | ----         |
 |--------+--------------+--------------+--------------+--------------|
 | S: ON  | ----         | ----         | ERROR        | IMPOSSIBLE   |
 | S: OFF | ----         | ----         | SUCCESS      | SUCCESS      |
 |--------+--------------+--------------+--------------+--------------|

In other words:
- single primary mode cannot be set while enforcing update every where
  checks.
- update everywhere checks cannot be turned on while in single primary
  mode.

In addition, these options cannot be set while GR is running. If the
user wants to reset them, he needs to stop GR, reset them and the
restart GR.

Finally, all servers in the group need to have been configured the
same way, otherwise, a server that has a different configuration will
not be allowed to join the group.
 
The motivation for having a separate enforce_update_everywhere_checks
option is to allow DevOps the flexibility to choose whether he wants
to implement automatic primary promotion, or rely on existing
infrastructure that is maintained by himself to do it. He may want to
have all servers as PRIMARY but always ensure that he writes to a
single PRIMARY at the application level. In other words, he may wish
that some other layer takes the responsibility for picking the PRIMARY
server, while at the same time not being held back by the strictness
of an update everywhere mode and its checks.

STATUS_VAR: group_replication_primary_member
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This is the proposed status variable:

  NAME: group_replication_primary_member
  TYPE: String
  DEFAULT: ""
  SCOPE: GLOBAL
  DESCRIPTION: Shows the UUID of the primary member in the current
               view.

  Special cases:
    If member is on OFFLINE state or
    group_replication_single_primary_mode= FALSE, its value will be
    "" (empty).
    If member is on ERROR state, its value will be UNDEFINED.
  

Primary Election
================

On view changes, a server is selected as primary for a given view when:
1. It was already the primary on the previous view,
2. Or it is the first ONLINE member on the new view after ordering
   all servers in lexicographycal order by using their UUIDs. This is
   performed using the std::sort function.

Considerations
==============

Security
--------

- There are no security concerns we can think of.

Upgrade/Downgrade
-----------------

There are no update/downgrade concerns other than the fact that on
downgrades, if the version one is downgrading to does not have this
functionality, then the user cannot expect that this functionality
works.

Interfaces
----------

Interfacing with this new functionality is done through the regular
system variables and/or status variable frameworks that we have in
place. Logs go through the error log.

** Regarding the status variable, we can of course add a new column
to the P_S table replication_group_members. In that column we can add
text in json format with meta information. This information would
initially contain the role of the server but in the future could
contain more information. For example:

--- a/scripts/mysql_system_tables.sql
+++ b/scripts/mysql_system_tables.sql
@@ -2251,7 +2251,8 @@ SET @cmd="CREATE TABLE
performance_schema.replication_group_members("
   "MEMBER_ID CHAR(36) collate utf8_bin not null,"
   "MEMBER_HOST CHAR(60) collate utf8_bin not null,"
   "MEMBER_PORT INTEGER,"
-  "MEMBER_STATE CHAR(64) collate utf8_bin not null"
+  "MEMBER_STATE CHAR(64) collate utf8_bin not null,"
+  "MEMBER_META TEXT not null"
   ") ENGINE=PERFORMANCE_SCHEMA;";

Then exporting the role would be a simple SELECT on this P_S table
instead of selecting the status variable from the global_status P_S
table. Moreover, this would display the roles of the other servers
in the group as well instead of just the one we are querying.

If not implemented as part of this worklog, this should be a follow-up
work.

DevOps Friendly and Non-intrusiveness
-------------------------------------

This makes the system more autonomic, which can be a good thing or
a bad thing. Elections and super_read_only modes are set automatically
and some users prefer that this is manual or controlled by their
infrastructure, such as their monitoring tools, etc.

See the user interface section for an explanation on how the update
everywhere checks can be lifted with
group_replication_update_everywhere_extra_checks system variable.

Integration with other Tools/Components
---------------------------------------

Other tools or components can just check the system variable
super_read_only and group_replication_member_current_role to figure
out what is the role that a server is playing at any given time.
Low Level Design
================

On the low level design we have the following components that need to
be changed:

1. Introduce two new plugin variables

@@ -2338,6 +2343,25 @@ static MYSQL_SYSVAR_STR(
   NULL,                                       /* update func*/
   IP_WHITELIST_DEFAULT);                      /* default*/

+static MYSQL_SYSVAR_BOOL(
+  single_primary_mode,                        /* name */
+  single_primary_mode_var,                    /* var */
+  PLUGIN_VAR_OPCMDARG,                        /* optional var */
+  "Whether the group should allow only a single leader or not.",
+  NULL,                                       /* check func*/
+  NULL,                                       /* update func*/
+  0);                                         /* default*/
+
+static MYSQL_SYSVAR_BOOL(
+  update_everywhere_extra_checks,             /* name */
+  update_everywhere_extra_checks_var,         /* var */
+  PLUGIN_VAR_OPCMDARG,                        /* optional var */
+  "Enable/Disable strict consistency checks for multi-master "
+  "update everywhere. Default: ON",
+  NULL,                                       /* check func*/
+  NULL,                                       /* update func*/
+  1);                                         /* default*/
+
 static SYS_VAR* group_replication_system_vars[]= {
   MYSQL_SYSVAR(group_name),
   MYSQL_SYSVAR(start_on_boot),
@@ -2366,9 +2390,46 @@ static SYS_VAR* group_replication_system_vars[]= {
   MYSQL_SYSVAR(gtid_assignment_block_size),
   MYSQL_SYSVAR(ssl_mode),
   MYSQL_SYSVAR(ip_whitelist),
+  MYSQL_SYSVAR(single_primary_mode),
+  MYSQL_SYSVAR(update_everywhere_extra_checks),
   NULL,
 };


2. Introduce a plugin status variable

+
+static int
+show_member_role(MYSQL_THD thd, SHOW_VAR *var, char *buff)
+{
+  var->type= SHOW_CHAR;
+  var->value= NULL;
+
+  if (group_member_mgr && local_member_info)
+  {
+    std::string uuid= local_member_info->get_uuid();
+    Group_member_info *info=
+      group_member_mgr->get_group_member_info(uuid);
+    Group_member_info::Group_member_role role=
+      info ? info->get_role() :
+      Group_member_info::MEMBER_ROLE_END;
+    const char* str=
+      Group_member_info::get_member_role_string(role);
+    memcpy(buff, str, strlen(str));
+    buff[strlen(str)]= '\0';
+    var->value= buff;
+    delete info;
+  }
+
+  return 0;
+}
+
+static SHOW_VAR group_replication_status_vars[]=
+{
+  {"group_replication_member_current_role",
+   (char*) &show_member_role,
+   SHOW_FUNC, SHOW_SCOPE_GLOBAL},
+  {NULL, NULL, SHOW_LONG, SHOW_SCOPE_GLOBAL},
+};
+
+
 mysql_declare_plugin(group_replication_plugin)
 {
   MYSQL_GROUP_REPLICATION_PLUGIN,
@@ -2380,7 +2441,7 @@ mysql_declare_plugin(group_replication_plugin)
   plugin_group_replication_init,    /* Plugin Init */
   plugin_group_replication_deinit,  /* Plugin Deinit */
   0x0008,                           /* Plugin Version: major.minor */
-  NULL,                             /* status variables */
+  group_replication_status_vars,    /* status variables */
   group_replication_system_vars,    /* system variables */
   NULL,                             /* config options */
   0,                                /* flags */

3. Extend member information that is exchanged with this additional
   metadata.

   3.1 Deploy an enumeration capturing the member roles to the
       Group_member_info class

+  /*
+    @enum Group_member_role
+
+    This enumeration describes all the roles a server can have.
+  */
+  typedef enum
+  {
+    MEMBER_ROLE_PRIMARY= 1,
+    MEMBER_ROLE_SECONDARY,
+    MEMBER_ROLE_END
+  } Group_member_role;
+

   3.2 Add a new field to the Group_member_info class that holds the
       value of the current member role. Add getters and setters.

@@ -165,6 +182,11 @@ public:
   Group_member_status get_recovery_status();

   /**
+    @return the member role type code.
+   */
+  Group_member_role get_role();
+
+  /**
     @return the member plugin version
    */
   const Member_version& get_member_version();
@@ -206,6 +228,13 @@ public:
                         std::string& retrieve_gtids);

   /**
+    Updates this object member role.
+
+    @param[in] new_role the role to set.
+   */
+  void set_role(Group_member_role new_role);
+
+  /**
     Returns a textual representation of this object

     @return an std::string with the representation
@@ -217,6 +246,12 @@ public:
    */
   static const char* get_member_status_string(Group_member_status status);

+
+  /**
+    @return the member role as string
+   */
+  static const char* get_member_role_string(Group_member_role role);
+
   /**
    Redefinition of operate == and <. They operate upon the uuid
    */
@@ -239,6 +274,7 @@ private:
   std::string retrieved_gtid_set;
   uint write_set_extraction_algorithm;
   ulonglong gtid_assignment_block_size;
+  Group_member_role m_role;
 };

   3.3 Add a new field to the enum_payload_item_type Group_member_info
       class

--- a/rapid/plugin/group_replication/include/member_info.h
+++ b/rapid/plugin/group_replication/include/member_info.h
@@ -79,8 +79,11 @@ public:
     // Length of the payload item: 8 bytes
     PIT_GTID_ASSIGNMENT_BLOCK_SIZE= 10,

+    // length of the role item: 1 byte
+    PIT_MEMBER_ROLE= 11,
+
     // No valid type codes can appear after this one.
-    PIT_MAX= 11
+    PIT_MAX= 12
   };


   3.4 Add an update member role the the Group_member_info manager class.

@@ -327,6 +363,15 @@ public:
   virtual void update_gtid_sets(const std::string& uuid,
                                 std::string& gtid_executed,
                                 std::string& gtid_retrieved)= 0;
+  /**
+    Updates the role of a single member
+
+    @param[in] uuid        member uuid
+    @param[in] new_role  role to change to
+   */
+  virtual void
+  update_member_role(std::string& uuid,
+                     Group_member_info::Group_member_role new_role)= 0;

   3.5 Add encoding/decoding support for this new field in the
       Group_member_info message:

@@ -113,6 +116,9 @@ Group_member_info::encode_payload(std::vector* buffer)
   encode_payload_item_int8(buffer, PIT_GTID_ASSIGNMENT_BLOCK_SIZE,
                            gtid_assignment_block_size);

+  char role_aux= (uchar)m_role;
+  encode_payload_item_char(buffer, PIT_MEMBER_ROLE, role_aux);
+
   DBUG_VOID_RETURN;
 }

@@ -182,6 +188,12 @@ Group_member_info::decode_payload(const unsigned char* buffer,
                            &payload_item_type,
                            >id_assignment_block_size);

+  unsigned char role_aux= 0;
+  decode_payload_item_char(&slider,
+                           &payload_item_type,
+                           &role_aux);
+  m_role= (Group_member_role)role_aux;
+
   DBUG_VOID_RETURN;
 }

4. Create a deterministic leader election function that promotes a
   server to PRIMARY on view changes (or when recovery finishes).

   The reason to run the leader election when a member states that it
   has finished recovery successfully, is because a leader election round
   may have been waiting for a server to become online to elect it as
   a PRIMARY (for example, there are no other serves with recovery state
   set to ONLINE).

   The function shall be put in Plugin_gcs_event_handler class, since
   then it can be called from the recovery or view change handlers.

void Plugin_gcs_events_handler::handle_leader_election_if_needed() const
{
...
}

   This function needs to be deterministic, i.e., its output be the same
   regardless of the member. All members will run this function at some
   point in time to pick the next leader. An alternative is to rely on
   the members uuids and always sort it in lexicographical order before
   picking the first in the list.

+static bool comparator_group_member_info(Group_member_info *m1,
+                                         Group_member_info *m2)
+{
+  std::string uuid1= m1->get_uuid();
+  std::string uuid2= m2->get_uuid();
+
+  return uuid1.compare(uuid2) < 0;
+}

   Then on the handle_leader_election_if_needed:

+      group_member_mgr->get_all_members();
+
+    std::vector::iterator it;
+    std::sort(all_members_info->begin(), all_members_info->end(),
+              comparator_group_member_info);
+

   Also, this function shall not be complex, all it needs to do is:
   1. check if the current member is leaving
      In this case do nothing.

   2. iterate over the set of active members in the group and check
      if there is a primary server already and it is ONLINE

   3. If there was no primary server
      - pick one from the list that is ONLINE

        if the next primary is ourselves, update role to PRIMARY and
        set super_read_only to OFF.

   4. If there was a primary server
      - set super_read_only to ON.
      - Update role for that server in the group_member_info to PRIMARY

5. Deploy the call to the handle_leader_election_if_needed calls in the
   recovery and view change handlers:

@@ -196,6 +198,10 @@ Plugin_gcs_events_handler::handle_recovery_message(const
Gcs_message& message) c
       recovery_module->update_recovery_process(false, false);
     }
   }
+
+  // check if we were waiting for some server to recover to
+  // elect a new leader
+  this->handle_leader_election_if_needed();
 }

 void
@@ -225,11 +231,159 @@ Plugin_gcs_events_handler::on_view_changed(const
Gcs_view& new_view,
   //Handle joining members
   this->handle_joining_members(new_view, is_joining, is_leaving);

+  // Handle leader election if needed
+  this->handle_leader_election_if_needed();
+
   //Signal that the injected view was delivered
   if (view_change_notifier->is_injected_view_modification())
     view_change_notifier->end_view_modification();
 }

6. Add functions to force setting and unsetting of super_read_only mode
   in the read_mode_handler. These should be called from the leader
   election function. Note that that function runs at the end of recovery,
   thence, when a member joins the group and recovery ends, it shall set
   itself to read_only if it was not selected as the PRIMARY.

--- a/rapid/plugin/group_replication/src/read_mode_handler.cc
+++ b/rapid/plugin/group_replication/src/read_mode_handler.cc
@@ -99,3 +99,37 @@ reset_super_read_only_mode(Sql_service_command
*command_interface)

   DBUG_RETURN(error);
 }
+
+long Read_mode_handler::force_set_super_read_only_mode(Sql_service_command
*command_interface)
+{
+  DBUG_ENTER("force_set_super_read_only_mode");
+
+  long error =0;
+  long server_super_read_only_query=
+    command_interface->get_server_super_read_only();
+
+  if (server_super_read_only_query == 0)
+    error = command_interface->set_super_read_only();
+
+  DBUG_RETURN(error);
+}
+
+long Read_mode_handler::force_unset_super_read_only_mode(Sql_service_command
*command_interface)
+{
+  DBUG_ENTER("force_unset_super_read_only_mode");
+
+  long error =0;
+  long server_super_read_only_query=
+    command_interface->get_server_super_read_only();
+
+  /*
+    If the server had no read mode active we set the read_only to 0,
+    this resets both read_only and super_read_only.
+    If the server had the read mode active we reset the super_read_only to 0.
+    If super_read_only was active, nothing happens.
+  */
+  if (server_super_read_only_query > 0)
+    error = command_interface->reset_super_read_only();
+
+  DBUG_RETURN(error);
+}

7. Add MTR test cases for leader election and for the new options.