WL#10378: Group Replication: group single/multi primary mode change and primary election

Affects: Server-8.0 — Status: Complete

Executive Summary

This worklog implements a framework for group-wide configuration changes. Once this worklog is done, the user will be able to change single_primary_mode without having to stop group replication, and will also be able to trigger the election of a specific member as the new primary.

Background

Group Replication can be configured to run in multi primary (multi-primary) or single primary mode. Both modes have their use cases, but situations may occur where the user wants to go from one to the other with no downtime, and currently such a change requires a rolling shutdown of the group members.

This worklog aims to make the change from multi-primary to single primary, and vice-versa, possible with a simple invocation of a function. The group should coordinate to elect a primary, enable or disable the read only modes on the correct members, and execute any other step that is deemed necessary.

This worklog should also provide the end user with the much requested option to force the election of a new primary member of their choice. Until now the primary would be the first member of the group or, when it died/stopped, the member with the greatest member weight, with the lowest UUID deciding between equal weights.

This last feature is included in this worklog as it is a derivation of the coordination process that must be in place for the above live changes from multi-primary to single-primary mode and vice-versa.

User Stories

As a MySQL DBA I want to change the primary from member A to member B without having to remove member A from the group, so I can demote member A to secondary.
As a MySQL DBA I want to change from single primary mode to multi primary mode without stopping group replication, so I can from now on write to multiple servers.
As a MySQL DBA I want to change from multi primary mode to single primary, so I can adjust the deployment mode now that I have found that multi primary does not really fit my use case.

Functional requirements

FR1: To execute a coordinated group configuration change the member must be in ONLINE state and belong to a reachable group majority.
FR2: To execute a coordinated group configuration change, the user must have GROUP_REPLICATION_ADMIN privileges.
FR3: If the coordinated group configuration change is invoked with a server UUID, that value shall be valid and must belong to a member of the group. An error shall be returned otherwise.
FR4: When the group is in multi-primary mode and the user causes a change to single-primary mode the group must:
- FR 4.1: Elect a primary. If one is selected by the user, that member must be chosen.
- FR 4.2: The primary shall be writable after processing the local backlog.
- FR 4.3: Secondaries shall enable the server super_read_only mode.
- FR 4.4: Update everywhere checks is set to False but only after all transactions from the old primary are applied.
FR5: When the group is in single-primary mode and the user causes a change to multi-primary mode the group must:
- FR 5.1: Update everywhere checks must be set to True on all members.
- FR 5.2: All members must be writable, so read_only mode should be False.
- FR 5.3: When all members are writable, any transactional conflict must abort.
FR6: When a primary member is proposed:
- FR 6.1: A new election shall happen in all members appointing the proposed member as the new primary.
- FR 6.2: Until all updates from the old primary that precede the election are applied, the new primary must stay in read mode.
- FR 6.3: While updates from the old and new primaries are in the group, any transactional conflict between them must abort.
FR7: When changing to multi primary mode the auto increment values of the server shall change to the plugin automatic values according to the group_replication_auto_increment_increment variable if no user set value is present.
FR8: When changing to single primary mode the auto increment values of the server shall return to the base values if no user set value is present.
FR9: No members can join the group while a coordinated group configuration change is occurring.
FR10: No coordinated group configuration change can happen if one of the members is in recovery mode.
FR11: No more than one coordinated group configuration change can happen at the same time.
FR12: No coordinated group configuration changes are allowed if the group contains a member of a previous version that does not support it.
FR13: When electing a primary server, P, if any other member than P contains running slave channels, the configuration change shall abort.
FR14: When changing to single primary mode, if more than one member contains running slave channels, the configuration change shall abort.
FR15: When changing to single primary mode with no appointed primary, if a single member exists with running slave channels, that member shall be the elected primary.
FR16: When a coordinated group configuration change involving primary election is running, no slave channels can be started in the group members.
FR17: Any change to multi-primary when already in multi-primary is a no-op.
FR18: Any change to single-primary when already in single-primary is a no-op.
FR19: An attempt to elect a primary member when in multi primary is not a valid operation.
An error saying to use the primary switch command is issued.
FR20: An attempt to elect a member as primary that is already the group primary member is a no-op.
FR21: Coordinated group configuration changes can be invoked in any member despite its primary or secondary role.
FR22: All changes to the primary mode shall be recorded with SET PERSIST, meaning they will remain in effect even after a member restart.
FR23: When changing to single primary mode with no appointed primary, and no restrictions with slave channels exist, the new primary member shall be elected using weights or lexicographic order when all weights are equal.
FR24: When a coordinated group configuration change is accepted, even if the invoking member leaves or fails under a majority, the action will be executed in all online members.
FR25: Primary elections or change to multi-primary will be delayed until all transactions forbidden by enforce_update_everywhere_checks terminate.
FR26: When switching to a primary server or changing mode to single primary with an appointed primary, P, if P leaves or fails under a majority, before the election starts, the configuration change must abort.
FR27: When changing mode to single primary with an appointed primary, P, if P leaves or fails under a majority, when the primary election has begun but is not yet over, the change will not abort; it adapts to the newly elected primary and throws a warning.
FR28: When switching to a primary server, P, if P leaves or fails under a majority, when the primary election has begun but is not yet over, the configuration change will abort and the old primary will be elected if available. If not, another member will be elected.
FR29: When switching to a primary server or changing mode to single primary with an appointed primary, P, if P leaves or fails under a majority, after the election finalizes, the change terminates and the group elects a new primary. A warning is thrown to the user.
FR30: When electing a primary server, P, if any other server S leaves or fails under a majority, the procedure shall not be affected and will proceed.
FR31: Any member exit or failure under a majority shall not affect the process of changing to multi primary mode.
FR32: After a coordinated group configuration change returns successfully to the user in the invoking member, its effects should be visible in all members.
FR33: If the user kills the query thread then the action and query threads shall be terminated.
FR34: If the group change coordination thread is killed but the distributed execution has already gone beyond a point where all servers agreed (cannot be canceled) then the action will complete.
A warning shall be returned by the executing query stating the kill had no effect.
FR35: If the group change coordination thread is killed but the configuration process still has major tasks to complete the member shall leave the group and go into ERROR mode or abort.
FR36: When the plugin is stopped or leaves in error, while changing from single primary mode to multi primary mode, if the member did not set the single primary mode flag to false, then update everywhere checks shall remain false.
FR37: When the plugin is stopped or leaves in error, while changing from single primary mode to multi primary mode, if the member did already set the single primary mode flag to false, then update everywhere checks shall be true after stop.
FR38: When the plugin is stopped or leaves in error, while changing from multi primary mode to single primary mode, if the member did not set the single primary mode flag to true, then update everywhere checks shall remain true.
FR39: When the plugin is stopped or leaves in error, while changing from multi primary mode to single primary mode, if the member did already set the single primary mode flag to true, then update everywhere checks shall be false after stop.
FR40: When the plugin is stopped or leaves in error, the plugin configuration in place when the configuration change terminates must be valid, even if not persisted with SET PERSIST.
FR41: All coordinated group configuration changes shall allow the DBA to check its progress.
FR42: Functions to execute coordinated group configuration changes are only present when the plugin is installed.
FR43: Any local failure in a coordinated group configuration change that prevents its progress shall make the server leave the group as its configuration may have deviated from the group.
FR44: Errors in the election process that prevent its progress shall make the server leave the group or abort, as its configuration may have deviated from the group.
FR45: Any failure to enable the read mode in the server for data protection shall result in a server abort.
FR46: Outside the scope of coordinated group configuration changes, if a primary member fails the new primary will not be writable until it executes all the transactions from the old primary.
FR47: Member weights for primary election cannot be changed when a coordinated group configuration change is occurring.
FR48: When a primary election is running, no coordinated group configuration change can be executed in the group.
FR49: The coordinated group configuration changes proposed in this worklog cannot be executed when there is an active table lock in the session.

Non-functional requirements

NFR1: This WL must have no impact on transaction execution performance when no coordinated group configuration changes are being executed.
NFR2: This WL must only have a minor overhead during transaction commit when executing coordinated group configuration changes that depend on transaction monitoring.

1. Some definitions and considerations

Single primary mode: When only one member in the group accepts writes and all other members are in read mode.
As the primary is the source of truth in the group, certification information is updated but not used for commit decisions.
Restrictions around foreign keys and other multi-primary limitations described below do not apply to single primary mode.
Primary: The writable member in the single primary mode. If it dies or exits the group a new primary is elected.
Secondary(ies): The non writable members in the group. These members receive transactions executed in the primary from the group.
Multi primary mode: Also called multi-primary in the text, is when all members are writable.
All transactions are certified and can roll back if they are concurrent and update the same data as another transaction committed in the group.
Multi primary mode is subject to some restrictions described below.
Update everywhere checks: Controlled by a plugin variable: enforce_update_everywhere_checks.
With this variable, group replication can prevent the execution of transactions that cause updates to cascading foreign keys or use the SERIALIZABLE isolation level.
Election: The process where a member is appointed as the new group primary.
Note that under this worklog we call it election even if there is no algorithmic selection of a member and the member is directly chosen externally.
In either case, the distributed appointment of a member, the changes to the read modes and the certification related tasks make up what we call an election.
Certification: Certification is the process where the group decides which concurrently executed transactions, at different servers, are conflicting.
If the output is negative, the transaction will rollback in all members.
Throughout the worklog we mention points where we say certification is enabled or disabled, so a clarification here:
Certification keeps collecting information about transactions whether it is enabled or disabled.
Being ON or OFF refers only to whether the certification output is considered during each transaction's commit process.
Coordinated Group Configuration Change: the group of coordinated steps needed to execute a change to the group. These are often referred to simply as group actions or coordinated group actions throughout the worklog.
The action coordinator: The central block that coordinates the execution of actions in the group. It guarantees that only one action can execute at a time.
UDF: These are User Defined Functions, a mechanism that allows us to add functionality to the plugin without coding new parser commands.
Installed by us at plugin install, they allow the user to invoke new actions coded by us in the plugin.
Group Replication and slave channels:
As described, when in single primary mode it is assumed that only one member in the group is the source of all updates to the group's data.
This has some implications for the election process described in this worklog.
If one member has an active slave channel receiving data from an external source, this member must be the primary in the group, and no primary switches are allowed.
Such switches would mean that two different sources of updates would now exist in the group.

Note: All plugin variables in this worklog are often referred to in the text without the prefix group_replication_ to decrease verbosity.
Example: enforce_update_everywhere_checks
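
The distinction drawn in the definitions above, between collecting certification information and actually using the certification output, can be modelled as follows. This is a minimal Python sketch and not plugin code: the Certifier class, the row-key write sets and the sequence counter are all assumptions made for illustration.

```python
class Certifier:
    """Toy certifier: always records write sets, only votes when enabled."""

    def __init__(self):
        self.enabled = False    # is the certification output considered?
        self.last_writer = {}   # row key -> sequence number of last committed writer
        self.sequence = 0       # sequence number of the last committed transaction

    def certify(self, snapshot, write_set):
        """Return True if the transaction may commit.

        snapshot  -- sequence number the transaction executed against
        write_set -- row keys modified by the transaction
        """
        conflict = any(self.last_writer.get(key, 0) > snapshot
                       for key in write_set)
        if self.enabled and conflict:
            return False        # negative output: the transaction rolls back
        # Information is collected whether certification is ON or OFF.
        self.sequence += 1
        for key in write_set:
            self.last_writer[key] = self.sequence
        return True
```

With certification OFF, two transactions that executed against the same snapshot and touch the same row both commit, but the row history is still recorded, so the first conflicting transaction after certification is turned ON is rejected.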

2. Coordinated Group Configuration Changes - the basics

Changes such as the ones proposed here, where one dynamically alters the single/multi primary mode, are operations that require the coordination of all group members in the execution of a set of steps to achieve the desired result.

So what this worklog intends to implement is a configuration module for task coordination but also a set of well defined operations that are used to achieve the wanted configuration changes.

The idea is that this coordinator and these operations could be reused in the implementation of new requests in the long run.

Coordinated Group Configuration Changes: The Coordinator

The coordinator shall work based on 3 phases:

1) Coordinate the start of the action.
2) Execute the action.
3) Return the status of the execution to the user and declare the end of the action.

1) and 3) are coordinator steps, common to all actions. 2) is specific to each action.

On the first phase, the coordinator shall send a message to the group stating that action A is to be executed.
If a concurrent action B is already taking place, then A is aborted.
The order of execution is established by the total order delivery guarantee of the GCS (Paxos).

This first phase also makes use of this total order delivery guarantee to run some general validations, such as: there is no member of an older version, there is no member in recovery, etc.

Much like the first phase, the third phase also needs to be coordinated between all members by means of sending a message to the group.

Otherwise, since the operations are asynchronous, members that are still executing action A would refuse a new action B while others that had finished A already would accept and start B.

Validations and execution vary with each action, though, so each action has its own implementation.
We shall call these group actions, described below.

In summary, the coordinator shall coordinate the start and finish of an action, invoking the corresponding action block and only returning when that action is finished.
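
The first and third phases described above can be sketched as a small state machine that every member runs against the same totally ordered message stream. This is an illustrative Python model under stated assumptions: the ActionCoordinator class, its method names and the deliver() helper are hypothetical and do not mirror the plugin's code.

```python
class ActionCoordinator:
    """Toy coordinator: every member runs one and sees the same message order."""

    def __init__(self, members):
        self.members = set(members)  # members expected to report termination
        self.running = None          # name of the action in progress, if any
        self.pending_end = set()     # members that have not sent an end message

    def on_start_message(self, action_name):
        """Phase 1: accept the action unless another one is already running."""
        if self.running is not None:
            return False             # concurrent action: the later one is aborted
        self.running = action_name
        self.pending_end = set(self.members)
        return True

    def on_end_message(self, member):
        """Phase 3: the action ends only when every member reported its end."""
        self.pending_end.discard(member)
        if self.running is not None and not self.pending_end:
            self.running = None
            return True              # action declared finished
        return False


def deliver(coordinators, start_messages):
    """Deliver the same ordered list of start messages to every member."""
    return [[c.on_start_message(m) for m in start_messages]
            for c in coordinators]
```

Because all members process start messages in the same order, they all reach the same verdict (A accepted, B rejected) without further communication. The same holds for the end of the action: a member keeps refusing new actions until it has seen the end message of every member.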

Coordinated Group Configuration Changes: Group Actions

A group action shall then contain the basic method for execution.
This method shall be implemented for all actions.

execute_action()

Group Actions will also contain two methods:

get_action_message(Message)

and

process_action_message(Message)

The idea here is that each action will encode its own parameters and decode them.
It is up to the coordinator to get this message when the action is invoked and give it to all members when it is accepted.

Stopping an action is also a key operation, needed for failure and plugin stop situations.

stop_action_execution()

Also, for debug and identification purposes these classes should expose their names.

get_action_name()

For now the coordinator will handle 2 actions:

Multi primary mode migration
Single Primary election

The first shall handle changes from single-primary mode to multi-primary setups.

The second should handle the inverse conversion but also handle the primary election of a specific member.
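
The group action contract described above can be sketched as an abstract interface plus one concrete action. This is a hedged Python illustration only: the method names come from the text, but the class names, the JSON message encoding and the return values are assumptions (in this sketch get_action_message returns the encoded message rather than filling a Message argument).

```python
import json
from abc import ABC, abstractmethod


class GroupAction(ABC):
    """Interface every coordinated group action implements (illustrative)."""

    @abstractmethod
    def execute_action(self):
        """Run the action on this member; return True if it completed."""

    @abstractmethod
    def get_action_message(self):
        """Encode this action's parameters for sending to the group."""

    @abstractmethod
    def process_action_message(self, message):
        """Decode the parameters received from the invoking member."""

    @abstractmethod
    def stop_action_execution(self):
        """Ask the action to stop (failure or plugin stop situations)."""

    @abstractmethod
    def get_action_name(self):
        """Name used for debug and identification purposes."""


class PrimaryElectionAction(GroupAction):
    """Hypothetical action carrying an optional appointed primary UUID."""

    def __init__(self, appointed_uuid=None):
        self.appointed_uuid = appointed_uuid
        self.stop_requested = False

    def execute_action(self):
        # Placeholder for the real steps: report whether we ran to completion.
        return not self.stop_requested

    def get_action_message(self):
        # JSON is used purely for readability in this sketch.
        return json.dumps({"appointed_uuid": self.appointed_uuid}).encode()

    def process_action_message(self, message):
        self.appointed_uuid = json.loads(message.decode())["appointed_uuid"]

    def stop_action_execution(self):
        self.stop_requested = True

    def get_action_name(self):
        return "Single Primary election"
```

The invoking member builds the action from the user's parameters and hands its message to the coordinator; every other member creates an empty instance of the same class and fills it in through process_action_message.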

Coordinated Group Configuration Changes: Actions invocation

First, note that the option group_replication_single_primary_mode is still in effect and the DBA can still configure the member to start in one mode or the other.

Changes from one mode to the other in a live group do not depend on variables, though, but on newly introduced user defined functions.
This way the change is made through a function that denotes an explicit action and not through a variable change.

These functions are:

Changes from multi-primary to single primary

Base command:

 SELECT group_replication_switch_to_single_primary_mode();

The above command shall be invoked by the user to change to single primary mode, with the election controlled by the configured election weights.
If the user wants to appoint a primary in the process, they execute:

 SELECT group_replication_switch_to_single_primary_mode(server_uuid);

Any invocation of these functions in a group already in this mode will cause no visible changes.

Changes from single-primary to multi-primary

Base command:

 SELECT group_replication_switch_to_multi_primary_mode();

This function has no parameters.
Any invocation in a group already in multi-primary mode will cause no changes.

Election of a new primary

Base command:

 SELECT group_replication_set_as_primary(server_uuid);

This function will not switch the group to single primary mode if the group is running in multi-primary mode; as stated in FR19, the call is not valid in that case and an error is returned.
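
The invocation rules above, together with FR3 and FR17 to FR20, can be summarised as a pre-check that classifies each call before any group coordination starts. The Python sketch below is only a model of those rules: check_invocation and its parameters are hypothetical names, and the real functions are invoked through SQL as shown above.

```python
def check_invocation(function, single_primary_mode, primary_uuid,
                     member_uuids, server_uuid=None):
    """Classify a call as 'run', 'no-op' or 'error' (illustrative model)."""
    if server_uuid is not None and server_uuid not in member_uuids:
        return "error"      # FR3: the UUID must belong to a group member
    if function == "group_replication_switch_to_multi_primary_mode":
        return "run" if single_primary_mode else "no-op"     # FR17
    if function == "group_replication_switch_to_single_primary_mode":
        return "no-op" if single_primary_mode else "run"     # FR18
    if function == "group_replication_set_as_primary":
        if server_uuid is None or not single_primary_mode:
            return "error"  # FR19: not valid in multi primary mode
        return "no-op" if server_uuid == primary_uuid else "run"  # FR20
    raise ValueError("unknown function: " + function)
```

For example, appointing the member that is already the primary is classified as a no-op, while appointing any member of a multi-primary group is an error.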

Configuration changes: Algorithm components

To switch from the single primary to multi-primary mode and vice-versa the following steps/code units are necessary.

A. primary election: invoke primary election in a member

B. Disable/Enable certification

C. primary validation: check if the selected member is valid. This may include checking whether:
- the old primary has running slave channels
- the user is selecting a member with version N+1 in a group with members of version N.

D. Wait for execution of the current set of local transactions

E. Set/Get plugin vars. This includes:
- single_primary_mode
- enforce_update_everywhere_checks

F. Wait for the execution of current relay log transactions

G. Message sending / reception

H. Enable/Disable the super read only mode

From this list:

A) needs to be refactored in terms of code and message flow for safety reasons.
B) needs minor refactoring in order to be reusable.
C) and D) are new utilities that we need to build from scratch.
E) requires a new code module as we want to use SET PERSIST for these variables.
F) is enhanced with the ability to wait for the consumption of the group replication applier module queue before waiting for the execution of the transactions.
The rest can be used out of the box or by using current plugin methods. We do add the option to kill read mode queries in some situations though.
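
The enhancement to component F described above, waiting first for the applier module queue to drain and only then for the queued transactions to execute, can be sketched as a two-stage wait. This is a simplified Python model: wait_for_backlog and its three callables are hypothetical stand-ins for plugin internals, and GTIDs are represented as plain strings.

```python
import time


def wait_for_backlog(queue_size, retrieved_gtids, executed_gtids,
                     timeout=5.0, poll=0.01):
    """Return True once the backlog is applied, False on timeout.

    queue_size      -- callable: packets still in the applier module queue
    retrieved_gtids -- callable: GTIDs queued in the applier relay log
    executed_gtids  -- callable: GTIDs already applied on this member
    """
    deadline = time.monotonic() + timeout

    # Stage 1: wait for the applier module queue to be consumed, so that
    # every pending transaction has reached the relay log.
    while queue_size() > 0:
        if time.monotonic() > deadline:
            return False
        time.sleep(poll)

    # Stage 2: snapshot what is queued now and wait for it to be executed.
    target = set(retrieved_gtids())
    while not target <= set(executed_gtids()):
        if time.monotonic() > deadline:
            return False
        time.sleep(poll)
    return True
```

Taking the GTID snapshot only after the queue is empty is the point of the enhancement: a snapshot taken earlier could miss transactions that were still sitting in the applier module queue.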

Configuration changes: How it works - a summary

To close this section, here is a summary of how it all comes together.

The user triggers an action in the plugin (using UDFs in this worklog).
The plugin parses the parameters and creates the corresponding Group Action instance.
The group action is submitted to the coordinator.
The coordinator gets the action message from the group action class and sends it to all members.
If accepted, all members (except the invoking one) instantiate the same group action class.
The action message is given to the group action object for parsing.
All members execute the action
All members send a termination message when over.
All members declare the action as finished when everyone terminates.
The invoking member returns the result to the client.

Configuration changes: How it works - message Diagram

+-------------------+                  +-------------------------+                  +--------------------+
| ..member 1 (m1).. |                  | .....member 2 (m2)..... |                  | ...member3 (m3)... |
|                   |                  |                         |                  |                    |
|                   |                  | UDF function execution  |                  |                    |
|                   |   Group action   |     new Group_Action    |   Group action   |                    |
|                   |  start message   |           .             |  start message   |                    |
| new Group_action  | <--------------- |    send start message   | ---------------> |  new Group_action  |
| execution         |             \--> |       execution         |                  |     execution      |
|        +          |                  |           *             |                  |          +         |
|        +          |                  |           +             |                  |          +         |
|        +          |                  |           +             |                  |          +         |
|        +          |  Group action    |           +             |   Group action   |          +         |
| send end message  | end message (m1) |           +             | end message (m1) |          +         |
|        .          | ---------------> |           +             | ---------------> |          +         |
|        .          | <--/             |           +             |                  |          +         |
|        .          |                  |           +             |                  |          +         |
|        .          |                  |           +             |                  |          +         |
|        .          |   Group action   |           +             |   Group action   |          +         |
|        .          | end message (m3) |           +             |  end message(m3) |          +         |
|        .          | <--------------  |           +             | <--------------- |  send end message  |
|        .          |                  |           +             |           \----> |          .         |
|        .          |                  |           +             |                  |          .         |
|        .          |   Group action   |           +             |   Group action   |          .         |
|        .          | end message(m2)  |           +             |  end message(m2) |          .         |
|   declare action  | <--------------  |    send end message     | -------------->  |   declare action   |
|     finished      |             \--> | declare action finished |                  |     finished       |
|                   |                  |      UDF returns        |                  |                    |
|                   |                  |                         |                  |                    |

3. From single to multi primary

The first on the list is the change from single primary mode to multi-primary.
We start with this one as the inverse change is more complex.

In this change, all members become writable, but there are also restrictions on the transactions allowed in the group that must be enforced.

The HLD for this operation is then:

1) A message is sent to all members, including the invoking member, starting the configuration change.
2) All members set enforce_update_everywhere_checks to true.
3) The primary waits for all transactions currently running to be processed by GR.
   These transactions can have updates to tables with cascading FKs, for example, something that can cause issues in a multi-primary environment.
4) A message is sent to all members meaning: "I executed all running transactions; from now on, all transactions are safe."
5) When members receive this message they queue a packet in the plugin pipeline that will activate certification.
6) In the secondaries we extract the current GTIDs queued in the applier relay log and wait for their application.
7) Every member can set single_primary_mode to false.
   Members invoke a SET PERSIST instruction to make the option persistent.
   The enforce_update_everywhere_checks value is also made persistent here.
8) All members change the auto increment settings to the automatic values to avoid transaction collisions.
   Previous values are cached.
9) All secondaries can disable the read only mode when they complete step 6.
10) All members send a message when the action terminates.
    When N messages are received, the action terminates.
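
The auto increment handling in the steps above (and in FR7 and FR8) amounts to caching the current values, applying the plugin's automatic values only when the user has not set their own, and restoring the cached values on the way back. The Python sketch below models that behaviour under two labelled assumptions: a user set value is detected by comparing against the server default of 1, and the automatic offset is taken from the server id. The AutoIncrementHandler class is hypothetical.

```python
class AutoIncrementHandler:
    """Toy model of the auto increment switch between primary modes."""

    DEFAULT = (1, 1)    # assumed server defaults: (increment, offset)

    def __init__(self, increment=1, offset=1):
        self.increment = increment
        self.offset = offset
        self.cached = None      # values saved before switching to multi primary

    def set_multi_primary_values(self, group_increment, server_id):
        """Apply automatic values unless a user set value is present."""
        if (self.increment, self.offset) != self.DEFAULT:
            return False        # user set value present: leave it untouched
        self.cached = (self.increment, self.offset)
        self.increment, self.offset = group_increment, server_id
        return True

    def restore_single_primary_values(self):
        """Return to the cached base values, if the plugin changed them."""
        if self.cached is None:
            return False
        self.increment, self.offset = self.cached
        self.cached = None
        return True
```

With distinct offsets and a shared increment, members writing concurrently in multi primary mode generate disjoint auto increment sequences, which is what avoids the collisions mentioned above.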

4. From multi-primary to primary / primary election

In this change, there is an election of a new primary, either selected by the internal election algorithm or appointed by the user.
So, whether the user switches the group to single primary mode or appoints a new primary, the same set of tasks will be executed in the group.

In this change, only one member becomes writable and the transaction limitations that are enforced on multi-primary are no longer needed.

The HLD for this operation is then:

1) A message is sent to all members, including the invoking member, starting the configuration change.
2) A validation phase is executed:
   - The candidate must be of the lowest version present in the group.
   - The action also errors out for invalid UUIDs passed as an argument, or if the member is no longer in the group.
   - Everyone sends a message stating the existence of slave channels.
   - If more than one member has slave channels: error out.
   - If slave channels exist in a member that is not the selected primary: error out.
   - If no primary is appointed and a sole member exists with slave channels, force that member to be the new primary.
3) [Extra step] If a primary already exists:
   All members set enforce_update_everywhere_checks to true. The primary waits for all transactions currently running to be processed by GR.
   This means the old primary, if present, is the one that sends the primary election request message.
4) Run primary election on all members, using either the present election algorithm or choosing the user appointed member.
5) Member roles change as a result of the selection process. Under the new election algorithm (section 5) the newly appointed primary will wait for messages from the old primary (if existent) up to this point.
6) The new primary will send a message to all members stating that the election can continue, and members also update the read mode status at this point.
   Note that setting read mode to true will wait for executing transactions, meaning this must be done in a spawned thread.
7) Everyone queues a message in the applier pipeline to re-enable the certifier (when migrating from multi-primary it is already enabled).
8) The new primary waits for a message from all members stating that they are in read mode. The primary also states that it set its read mode to false.
9) When all members receive N messages they can set single_primary_mode to true. SET PERSIST is used here to make the mode persistent.
   Secondaries can set enforce_update_everywhere_checks to false.
   This step can be skipped if already on single primary mode.
   When the primary receives N messages it moves to step 10.
10) The primary shall wait for the whole applier relay log to be consumed.
    It sends a packet informing about this change.
    Only then can it set enforce_update_everywhere_checks to false, to avoid concurrency between local and remote transactions.
11) If changing the mode, the primary should restore the auto increment values to the cached user values.
    These were stored before the multi primary values were set to avoid collisions.
12) All members disable the certifier when they receive the packet from the primary. Secondaries also change the auto increment settings for easy future primary failovers.
13) All members send a message when the action terminates. When N messages are received, the action terminates.

Note: Steps like 6), the first part of step 10) and 12) are already part of the current primary election process. They are only placed here for clarity.
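
The slave channel rules of the validation phase (also stated as FR13 to FR15) can be sketched as a single function that either rejects the change, forces a primary, or leaves the choice to the election algorithm. This is an illustrative Python model: validate_slave_channels, ValidationError and the dictionary input are hypothetical, standing in for the information each member reports in its validation message.

```python
class ValidationError(Exception):
    """Raised when the configuration change must error out."""


def validate_slave_channels(running_channels, appointed=None):
    """Return the member that must become primary, or None if unconstrained.

    running_channels -- dict: member UUID -> True if it has running
                        slave channels
    appointed        -- UUID appointed by the user, if any
    """
    with_channels = sorted(uuid for uuid, running in running_channels.items()
                           if running)
    if len(with_channels) > 1:
        raise ValidationError("more than one member has running slave channels")
    if appointed is not None:
        if with_channels and with_channels[0] != appointed:
            raise ValidationError(
                "slave channels run on a member that is not the selected primary")
        return appointed
    # No appointed primary: a sole member with channels is forced as primary,
    # otherwise the election algorithm is free to choose.
    return with_channels[0] if with_channels else None
```

A None result means the validation places no restriction and the regular election (weights, then UUID order) decides the primary.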

5. Changes to Primary election

In order to solve an old safety issue surrounding primary election and the algorithms presented here we propose a change to the election mechanism.

Delving into it, the base of the issue is that single primary mode allows the execution of transactions that could lead to data divergence in multi primary setups.
One example is transactions on tables with foreign keys and cascading side effects, which could lead to different execution results in different members.
In theory such transactions are safe because the primary is the only source of truth in the group at all times.

Until now, when changing from one primary to the other, the new primary would accept new transactions the moment it was elected.
This meant that, for a window of time, it was possible for such transactions coming from both the old and the new primary to be executed concurrently.
This breaks the assumption that a sole source of truth exists at each moment in time.

Another point of concern is that we must wait for old primaries to be in read mode (not applicable to crash situations).
This is something that can take time, and certification must be ON during this period.

For these reasons this algorithm will now change and elections will have 5 stages whose invocation depends on the context of the election.

1) When the old primary dies or there is a change of the primary member, every member does an election and chooses the new primary.
   The algorithm, or the appointed server parameter, makes the election have the same outcome on all members.
2) The newly appointed primary waits for its relay log queue to be consumed totally or in part.
   If the old primary failed we wait for the relay log to be fully consumed as no more messages will arrive.
   If this is a switch from the old to the new primary then we should wait for the transactions of all messages up to this point.
   This prevents the local vs remote conflicts that would lead to data divergence.
3) When these transactions are consumed the elected member will send a message to all members and, when it is received, they all change their read modes according to their roles.
   Enabling the read mode on members must be done in a spawned thread, as it would otherwise deadlock with ongoing transaction messages in the GCS layer.
   When receiving this message the members might or might not enable certification.
4) If an old primary (or primaries) is still present, then the new primary must wait for all members to send a message stating they are in read mode.
5) When it receives these messages it will wait for its current relay log backlog to be executed, instructing the members to disable certification afterwards.
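
The requirement that the election has the same outcome on all members can be illustrated with a deterministic selection function: given the same membership information, every member computes the same primary without exchanging further messages. The Python sketch below assumes the rule stated in FR23 (highest weight first, lowest UUID in lexicographic order as the tie-breaker) and leaves out the member version check; elect_primary and its inputs are hypothetical.

```python
def elect_primary(weights, appointed=None):
    """Pick the new primary deterministically.

    weights   -- dict: member UUID -> member weight
    appointed -- UUID appointed by the user, if any
    """
    if appointed is not None:
        if appointed not in weights:
            raise ValueError("appointed member is not in the group")
        return appointed
    # Highest weight wins; equal weights fall back to the lowest UUID.
    return min(weights, key=lambda uuid: (-weights[uuid], uuid))
```

The result depends only on the set of members and their weights, never on the order in which a member happens to store them, so members evaluating this on the same view agree on the outcome.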

Note that this is a choice of safety over availability in failure scenarios, so it may result in write downtime for the end user.
On the other hand, during live primary switches we have the option to restrict and monitor user transactions to preserve availability.

In terms of version coexistence, if a member running version 5.X, or any version that predates this WL, is present in the group, the old primary election algorithm will be the one executed.

Primary election: Brief look into the primary election scenarios.

If the old primary dies:

On failure cases the old primary dies and when the new one is elected there could still be some old transactions being applied on this member.
For this reason we need to wait for the execution of the transactions from the old primary before declaring the new primary writable.

So in this situation steps 1, 2 and 3 are executed to ensure safety.

One particularity of this case is that there is only one source of truth at a time, i.e., we ensure a member does not accept writes when applying updates from another member.
In practice this means there is no wait for the old primary to be on read mode and no certification activation is needed.

| ....member 1 (m1).... |             | .....member 2 (m2)..... |                          | .....member3 (m3)..... |
|     (Old Primary)     |             |       (Secondary)       |                          |     (New Primary)      |
|  (read mode is OFF)   |             |     (read mode is ON)   |                          |   (read mode is ON)    |
|                       |             |                         |                          |                        |
|       Failure         | View change |                         |       View change        |                        |
|                       | ########### |       View change       | ######################## |      View change       |
|                       |             |     elect a primary     |                          |    elect a primary     |
|                       |             |       m3 elected        |                          |      m3 elected        |
|                       |             |                         |                          |   wait for queue = 0   |
|                       |             |                         |                          |  wait for transaction  |
|                       |             |                         |                          |       execution        |
|                       |             |                         |                          |          +             |
|                       |             |                         | Primary election message |          +             |
|                       |             |                         | primary is ready         |          +             |
|                       |             |                         | <----------------------- |    backlog executed    |
|                       |             |     Read mode = ON      |                  \       |          .             |
|                       |             |                         |                   \----> |    Read mode = OFF     |
|                       |             |                         |                          |                        |
|                       |             |                         |                          |                        |

If we switch from one primary to another:

This scenario is the most complex one, as we want to preserve not only safety but also availability, something we cannot do in the above case.

Since we are focusing on the primary election part, let's recall that under the full primary change algorithm there is a first phase where the old primary enables enforce_update_everywhere_checks.
So when election is invoked step 2 is executed to ensure all transactions executed before changing this variable are processed in the new primary.

Note that the old primary is still accepting requests until step 3 is invoked.
Hence, when the read mode is set, it must be done outside the GCS framework or else it could deadlock against running transactions waiting for certification messages.

It is also for this reason, that when the switch happens there are updates being executed from both the new and the old primary so we need to execute step 4.

So, now the new primary will wait for all members to be in read mode.
When that happens it will wait on its backlog and, when finished, instruct all the members that certification can be turned off.

| ....member 1 (m1).... | ........................ | .....member 2 (m2)..... | ........................ | .....member3 (m3)..... |
|     (Old Primary)     |                          |       (Secondary)       |                          |     (New Primary)      |
|  (read mode is OFF)   |                          |     (read mode is ON)   |                          |   (read mode is ON)    |
|                       |                          |                         |                          |                        |
|   Action Invocation   |                          |                         |                          |                        |
|      validations      |                          |                         |                          |                        |
| update checks = true  |                          |                         |                          |                        |
|   wait for ongoing    |                          |                         |                          |                        |
|    transactions       |                          |                         |                          |                        |
|         +             |                          |                         |                          |                        |
|         +             |                          |                         |                          |                        |
|   primary election    |                          |                         |                          |                        |
|                       |                          |                         |                          |                        |
/////////////////////////////////////////////////////// Primary Election /////////////////////////////////////////////////////////
|                       |                          |                         |                          |                        |
| Invoke an election    | Primary election message |                         | Primary election message |                        |
|    Send message       |  elect a new member (m3) |                         |  elect a new member (m3) |                        |
|                       | -----------------------> |                         | -----------------------> |                        |
|   elect a primary     | <----/                   |     elect a primary     |                          |     elect a primary    |
|     m3 elected        |                          |       m3 elected        |                          |       m3 elected       |
|                       |                          |                         |                          |  [wait for queue = 0]  |
|                       |                          |                         |                          | [wait for transaction  |
|                       |                          |                         |                          |      execution]        |
|                       |                          |                         |                          |          +             |
|                       |                          |                         |                          |          +             |
|                       | Primary election message |                         | Primary election message |          +             |
|                       |     primary is ready     |                         |     primary is ready     |          +             |
|                       | <----------------------- |                         | <----------------------- |   backlog executed     |
|  enable certification |                          |   enable certification  |                  \       |                        |
|  [Set Read mode = ON] |                          |   [Set Read mode = ON]  |                   \----> |  enable certification  |
|          +            |                          |            +            |                          | [Set Read mode = OFF]  |
|          +            | Primary election message |    Read mode = true     | Primary election message |                        |
|          +            | member in read mode (m2) |                         | member in read mode (m2) |                        |
|          +            | <----------------------- |                         | -----------------------> |                        |
|          +            |                          |                         | <----/                   |                        |
|          +            |                          |                         |                          |                        |
|          +            |                          |                         |                          |                        |
|          +            |                          |                         |                          |                        |
|   Read mode = ON      | Primary election message |                         | Primary election message |                        |
|                       | member in read mode (m1) |                         | member in read mode (m1) |                        |
|                       | -----------------------> |                         | -----------------------> |   [Wait for backlog]   |
|                       | <----/                   |                         |                          |           +            |
|                       |                          |                         |                          |           +            |
|                       | Single primary message   |                         | Single primary message   |           +            |
| disable certification | <----------------------- | disable certification   | <----------------------- |    backlog executed    |
|                       |                          |                         |                   \----> |  disable certification |
|                       |                          |                         |                          |                        |
|                       |                          |                         |                          |                        |

Operations surrounded by "[]" are executed in a spawned thread and not on the GCS stack.

If we switch from multi primary to a single primary:

When electing a primary coming from a multi primary group, one thing to keep in mind is that enforce_update_everywhere_checks was already true before the election.
In practice this means that there is no need to wait for the execution of transactions from the old primary.
So the above described step 2 is skipped here.
Apart from that, the algorithm remains the same.

| ....member 1 (m1).... | ........................ | .....member 2 (m2)..... | ........................ | .....member3 (m3)..... |
|   (Multi Primary)     |                          |     (Multi Primary)     |                          |   (Appointed primary)  |
|  (read mode is ON)    |                          |    (read mode is ON)    |                          |    (read mode is ON)   |
|                       |                          |                         |                          |                        |
|   Action Invocation   |                          |                         |                          |                        |
|      validations      |                          |                         |                          |                        |
|   primary election    |                          |                         |                          |                        |
|                       |                          |                         |                          |                        |
/////////////////////////////////////////////////////// Primary Election /////////////////////////////////////////////////////////
|                       |                          |                         |                          |                        |
| Invoke an election    | Primary election message |                         | Primary election message |                        |
|    Send message       |  elect a new member (m3) |                         |  elect a new member (m3) |                        |
|                       | -----------------------> |                         | -----------------------> |                        |
|   elect a primary     | <----/                   |     elect a primary     |                          |     elect a primary    |
|     m3 elected        |                          |       m3 elected        |                          |       m3 elected       |
|                       |                          |                         |                          |         /              |
|                       | Primary election message |                         | Primary election message |        /               |
|                       |     primary is ready     |                         |     primary is ready     |       /                |
|                       | <----------------------- |                         | <----------------------- | ------                 |
|  enable certification |                          |   enable certification  |                  \       |                        |
|  [Set Read mode = ON] |                          |   [Set Read mode = ON]  |                   \----> |  enable certification  |
|          +            |                          |            +            |                          | [Set Read mode = OFF]  |
|          +            | Primary election message |    Read mode = true     | Primary election message |                        |
|          +            | member in read mode (m2) |                         | member in read mode (m2) |                        |
|          +            | <----------------------- |                         | -----------------------> |                        |
|          +            |                          |                         | <----/                   |                        |
|          +            |                          |                         |                          |                        |
|          +            |                          |                         |                          |                        |
|          +            |                          |                         |                          |                        |
|   Read mode = ON      | Primary election message |                         | Primary election message |                        |
|                       | member in read mode (m1) |                         | member in read mode (m1) |                        |
|                       | -----------------------> |                         | -----------------------> |   [Wait for backlog]   |
|                       | <----/                   |                         |                          |           +            |
|                       |                          |                         |                          |           +            |
|                       | Single primary message   |                         | Single primary message   |           +            |
| disable certification | <----------------------- | disable certification   | <----------------------- |    backlog executed    |
|                       |                          |                         |                   \----> |  disable certification |
|                       |                          |                         |                          |                        |
|                       |                          |                         |                          |                        |
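
The three scenarios above differ only in which of the five election stages run. As a rough summary, the sketch below maps each scenario to its stages. It assumes the stage numbering follows the order in which the stages were introduced; the scenario names and the `stages_for` helper are hypothetical.

```python
from enum import Enum


class Scenario(Enum):
    PRIMARY_FAILURE = "old primary dies"
    PRIMARY_SWITCH = "switch from one primary to another"
    MULTI_TO_SINGLE = "switch from multi primary to single primary"


# Stage numbers, in the order the text introduces them:
#   1 elect a primary
#   2 new primary waits for its queued transactions
#   3 "primary is ready" message, members set their read modes
#   4 new primary waits for all members to be in read mode
#   5 new primary consumes its backlog, certification is disabled
STAGES = {
    Scenario.PRIMARY_FAILURE: (1, 2, 3),
    Scenario.PRIMARY_SWITCH: (1, 2, 3, 4, 5),
    Scenario.MULTI_TO_SINGLE: (1, 3, 4, 5),
}


def stages_for(scenario: Scenario) -> tuple:
    """Return the election stages executed in the given scenario."""
    return STAGES[scenario]
```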

6. Facing member failures or stops

All is nice if there are no problems while the process is running.
One issue that might happen is that some member can leave during the process.
This can be an intentional leave as the DBA stopped the member or a server/machine/network failure that made the group expel the member.
In this section we handle exits under a majority. For partitions, see section 7.

How is it handled: Failures at the coordination level

The invoking member fails: The way this WL is structured, the invoking member only plays a key role on the start and end part of the action when the result is returned to the user.

In practice what this means is that once an action is accepted by the group any failure on the invoking member will not stop the action progress on the group.
The group action will continue its work on other members until all declare its end.

If the coordinator dies before sending the action then nothing happens.

Any member fails:

When a member leaves or fails, all the running action coordinators in the other members need to wait for one fewer member to declare the action as terminated.
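
One way to picture this bookkeeping is a set of members that still have to report. The class below is a hypothetical sketch, not the coordinator's real interface: a leave is treated the same way as a termination message, so the wait shrinks by one member.

```python
class ActionCompletionTracker:
    """Tracks which members still have to declare a group action terminated."""

    def __init__(self, member_uuids):
        self._pending = set(member_uuids)

    def on_member_finished(self, uuid):
        # The member sent its "action terminated" message.
        self._pending.discard(uuid)

    def on_member_left(self, uuid):
        # A member that left or failed will never report; stop waiting for it.
        self._pending.discard(uuid)

    def pending_count(self):
        return len(self._pending)

    def is_complete(self):
        return not self._pending
```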

How it is handled: Single primary -> Multi-primary

If primary fails:
We have to break the wait on the secondaries if they are waiting.
Note that this means no more transactions will come from the old primary so all transactions from this point on are safe.

There is however a question here of what to do with concurrent primary elections.
We have 2 options:
- A: We elect a new primary, causing a secondary to be writable and activating certification before applying all pending transactions from the old primary.
The upside here is that the group write downtime is smaller, the downside is that there is a window for transaction divergence.

- B: We don't elect a new primary and the process will wait for the secondaries to be up to date.
This option, while safer, may mean the group won't be writable for a period of time.

For now we go with B for safety, as explained in section 5.

If secondary fails:
If a secondary member fails the algorithm will not be affected and no action is needed in the remaining members.

How it is handled: Multi-primary -> Single primary / Primary election

If the old primary fails: the process must break any waits for the old primary.
This means that if we are waiting for the old primary to be safe we can invoke a new primary election at this point.
If we are still validating the parameters and the action execution, then the process must select a new member to invoke the primary election.
On the other hand, if the process is already waiting for this member to be in read mode, then we can skip waiting for it.

If the new primary fails: the process must abort if the member is not yet elected.
If the primary was elected already, then this failure does not affect the group action and is handled as a traditional primary failure.
If the election is still ongoing, the outcome depends on the action: primary member changes will abort and try to elect the old primary again, while changes to single primary mode will output a warning but won't fail.

If a secondary fails: the waiting members should be adjusted if waiting for messages.
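
The rules for a failing new primary can be read as a small decision table. The sketch below restates them in Python; the enum names and the returned descriptions are hypothetical, chosen only to mirror the cases listed above.

```python
from enum import Enum


class ElectionState(Enum):
    NOT_STARTED = 1
    ONGOING = 2
    DONE = 3


class Action(Enum):
    PRIMARY_CHANGE = 1
    SWITCH_TO_SINGLE_PRIMARY = 2


def on_appointed_primary_failure(state: ElectionState, action: Action) -> str:
    """Describe the reaction when the member meant to be primary fails."""
    if state is ElectionState.NOT_STARTED:
        return "abort the action"
    if state is ElectionState.DONE:
        return "handle as a traditional primary failure"
    # Election still ongoing: the reaction depends on the action type.
    if action is Action.PRIMARY_CHANGE:
        return "abort and try to elect the old primary again"
    return "output a warning and continue"
```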

How we register leaves and messages

For handling events such as the member exits and primary elections mentioned above, two options existed:

1) Methods placed at each point that are directed to the coordinator class and then directed to each executing group action.
2) Observers at each point that can be used by actions if needed.

We went with 2): while it is a bit more complex than 1), it allows for more versatility.

Another reason we went for 2) is that we used the same pattern for messages, as we needed to add new behaviors to old messages.

This way new plugin components can emerge reading events from the group and messages without changes to the gcs_event_handler code.
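
To illustrate the observer approach, here is a minimal Python sketch. All class and method names are hypothetical and do not correspond to the plugin's real classes; the point is only that an action registers itself and overrides the hooks it cares about, while the code that dispatches events stays unchanged.

```python
class GroupEventObserver:
    """Base observer; actions override only the hooks they need."""

    def after_view_change(self, joining, leaving):
        pass

    def after_primary_election(self, primary_uuid):
        pass


class GroupEventsRegistry:
    """Dispatches group events to whoever registered interest."""

    def __init__(self):
        self._observers = []

    def register(self, observer):
        self._observers.append(observer)

    def unregister(self, observer):
        self._observers.remove(observer)

    def notify_view_change(self, joining, leaving):
        # Iterate over a copy so observers may unregister while notified.
        for observer in list(self._observers):
            observer.after_view_change(joining, leaving)

    def notify_primary_election(self, primary_uuid):
        for observer in list(self._observers):
            observer.after_primary_election(primary_uuid)


class WaitForReadModeAction(GroupEventObserver):
    """Hypothetical action waiting for members to report read mode."""

    def __init__(self, members):
        self.waiting_on = set(members)

    def after_view_change(self, joining, leaving):
        # Members that left will never answer, so stop waiting for them.
        self.waiting_on -= set(leaving)
```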

7. Other scenarios - partitions and joins

Partitions

Partitions are an orthogonal issue in this WL.
Being a distributed algorithm, it is normal that actions will block as any other transaction message will on a minority partition.
The question remains of how the DBA can handle it, and how the system will react.

Let's reason about the two different types of partitions:

Asymmetric partitions

In this case the group still has a majority and some members will be expelled.
To the majority, the way it handles these exits falls under section 6.

In the minority, in the absence of a network connection, the executing group actions will block.
On these members the DBA may stop group replication in the member and the group action process will terminate on that member.

If a value for group_replication_unreachable_majority_timeout is defined then eventually these members in a minority partition will error out.
The members can also eventually be expelled by the group and error out when they are again able to contact it.

When this happens the action is stopped by the plugin error handling code.

Symmetric partitions

On symmetric partitions or multi group partitions where no sub-group holds a majority the DBA options are:

1) The DBA manages to restore the network, and if so the process should continue normally.
2) The DBA forces a new group membership. See:
https://dev.mysql.com/doc/refman/8.0/en/group-replication-network-partitioning.html

If the DBA goes with option 2) then a new majority will be formed where the group action will unblock.
Again, how this new majority handles the leaving members is described in section 6.

Example:
Is the old primary among the unreachable members while you were waiting for a message from it? When a majority is formed and the old primary is expelled, the wait will unblock.

On expelled members, the view will mark them as leaving members, making them terminate their actions locally.

A new member joins

The way this WL was designed, no concurrent joins are allowed during a group action as it would make them even more complex.

The frontier is the reception of the action starting message.
At this point if there are members in recovery the action will abort.
If there are no joiners, then after this point all joins will fail, meaning the joiner will leave the group when it sees a running action.

Note that joins, recovery state changes, and action starts all rely on GCS events and are for that reason not concurrent.
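
Since these events are delivered in a single order, the join rules can be sketched as a simple state object that processes one event at a time. The `GroupState` class and its method names are assumptions made for this example only.

```python
class GroupState:
    """Processes joins and action starts in the order the group delivers them."""

    def __init__(self):
        self.members_in_recovery = set()
        self.action_running = False

    def on_join(self, uuid):
        """Return True if the joiner may stay, False if it must leave."""
        if self.action_running:
            return False
        self.members_in_recovery.add(uuid)
        return True

    def on_recovery_end(self, uuid):
        self.members_in_recovery.discard(uuid)

    def on_action_start(self):
        """Return True if the action may run, False if it must abort."""
        if self.members_in_recovery:
            return False
        self.action_running = True
        return True

    def on_action_end(self):
        self.action_running = False
```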

8. Facing process failures

When an unexpected failure happens while executing a step, the process must also behave accordingly.
Many of the execution steps are simple though, like setting a plugin variable, so they should not fail.
Others are critical and threaten the consistency of the group, like failing to disallow writes on a secondary member.

So the strategy may vary, and the process should handle the following cases:

- Message sending failures: Message retry is already handled inside GCS.
  This means that sending errors are serious and not easy to handle.
  When a group action fails to send messages, how can it tell the group to abort?
  If the action is already running, the only option is to leave the group (and enable read mode).
- Enabling the read mode: all failures enabling the read mode will leave the member in an undesired state, so the server process shall abort. For this, a service shall be implemented, as described in the low level design.
- Disabling the read mode: Failures when disabling the read mode can be handled by the DBA, so a logged error should be sufficient.
- Failures in assessing the number of transactions running: When an operation needs to know the number of transactions running to ensure safety, any failure should make the process abort.
- Failures on SET PERSIST or other critical failures: Whenever there is a failure that prevents the algorithm from progressing, the member should exclude itself from the group (and enable read mode).
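
The strategy above can be summarized as a lookup from failure type to reaction. The following Python sketch uses hypothetical names; in particular, reading "the process abort" for the transaction counting case as aborting the group action (rather than the server) is an assumption.

```python
from enum import Enum


class Failure(Enum):
    MESSAGE_SEND = "message sending"
    ENABLE_READ_MODE = "enabling the read mode"
    DISABLE_READ_MODE = "disabling the read mode"
    COUNT_TRANSACTIONS = "assessing the number of running transactions"
    CRITICAL = "SET PERSIST or other critical step"


class Reaction(Enum):
    LEAVE_GROUP = "leave the group and enable read mode"
    ABORT_SERVER = "abort the server process"
    LOG_ERROR = "log an error for the DBA"
    ABORT_ACTION = "abort the group action"  # assumed interpretation


REACTIONS = {
    Failure.MESSAGE_SEND: Reaction.LEAVE_GROUP,
    Failure.ENABLE_READ_MODE: Reaction.ABORT_SERVER,
    Failure.DISABLE_READ_MODE: Reaction.LOG_ERROR,
    Failure.COUNT_TRANSACTIONS: Reaction.ABORT_ACTION,
    Failure.CRITICAL: Reaction.LEAVE_GROUP,
}


def reaction_for(failure: Failure) -> Reaction:
    """Return the reaction this sketch associates with a failure type."""
    return REACTIONS[failure]
```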

9. Monitoring and error reporting

One point not mentioned until now is how the DBA can check the progress and status of coordinated configuration changes.

First of all, we must answer the question: what does the user want to know?

A. What is the action running.
B. What is the progress on that action.
C. Is there an error, and if so, what was it.

Let's start with A. and B. To avoid overloading our current performance schema tables with fields that have an unknown lifetime and evolution, we went for an alternative.

Thus, monitoring will be based on the current stage event table:

performance_schema.events_stages_current

This works as follows: under

performance_schema.setup_instruments

we have new instruments for monitoring.

Stages

The planned stages cover the steps where the algorithm will probably wait and where we can give some progress information.
Hence we skip singular steps such as action acceptance or primary election invocation.

Multi primary switch stages

stage/group_rpl/Multi-primary Switch: waiting for pending transactions to finish.

The old primary updated the enforce_update_everywhere_checks variable and collected the set of currently ongoing transactions.
In this stage we show the progress of how many transactions are left for execution.

stage/group_rpl/Multi-primary Switch: waiting on another member step completion

While the above step runs, the old secondaries are in wait state.
This stage shows we are waiting on a message from the old primary member.

stage/group_rpl/Multi-primary Switch: applying buffered transactions.

The old primary executed all the above transactions.
The secondaries must also wait for them to be executed locally.
This stage reports when the process is over.

stage/group_rpl/Multi-primary Switch: waiting for operation to complete on all members.

Due to the asynchronous nature of the algorithm, completion time in different members can differ.
This stage reports how many members have finished vs the ones that are still missing.

Single primary switch stages

stage/group_rpl/Single-primary Switch: checking group pre-conditions.

Check if the group has running channel slaves or members of an invalid version.
This stage reports the completion of the several verification steps.

stage/group_rpl/Single-primary Switch: executing Primary election

Primary election is invoked and runs in all members. This stage means the algorithm is waiting on the election.

stage/group_rpl/Single-primary Switch: waiting for operation to complete on all members.

Due to the asynchronous nature of the algorithm, completion time in different members can differ.
This stage reports how many members have finished vs the ones that are still missing.

Primary switch stages

stage/group_rpl/Primary Switch: checking current primary pre-conditions.

Check if the old primary has running channel slaves. This stage reports the completion of the several verification steps.

stage/group_rpl/Primary Switch: waiting for pending transactions to finish.

stage/group_rpl/Primary Switch: waiting on another member step completion

While the above step runs, the old secondaries are in wait state.
This stage shows we are waiting for the execution of the above transactions that will lead to the primary election phase.

stage/group_rpl/Primary Switch: executing Primary election

Primary election is invoked and runs in all members. This stage means the algorithm is waiting on the election outcome.

stage/group_rpl/Primary Switch: waiting for operation to complete on all members.

Due to the asynchronous nature of the algorithm, completion time in different members can differ.
This stage reports how many members have finished vs the ones that are still missing.

Primary election stages

stage/group_rpl/Primary Election: applying buffered transactions.

The old primary executed all the above transactions.
The secondaries must also wait for them to be executed locally.
This stage reports the progress of that execution.

stage/group_rpl/Primary Election: waiting on current primary transaction execution

This stage is shown when the secondaries are waiting for the primary to end the above stage.

stage/group_rpl/Primary Election: waiting for members to turn on super_read_only

When primary election is invoked all members shall wait for the other members to be in read mode. This stage reports how many servers are not yet in read mode.

stage/group_rpl/Primary Election: stabilizing transactions from former primaries.

Once all servers are in read mode the new primary shall consume its backlog and then declare that certification can be disabled on all members.
This stage shows how many transactions are left to apply.

How to use it

So when an action is running, the DBA can check the stages table for something like:

 SELECT event_name, source, work_completed, work_estimated FROM performance_schema.events_stages_current WHERE event_name LIKE "%stage/group_rpl%";
 EVENT_NAME                                                                         SOURCE              WORK_COMPLETED  WORK_ESTIMATED
 stage/group_rpl/Multi-primary Switch: waiting for pending transactions to finish.  stage_monitor.h:73      3               10

When no action is running the query returns an empty set.
So this solves A and B.

This information can be checked on all members running an action; the progress reported is the local one.
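
A monitoring script could turn the two work counters into a readable progress line. The sketch below assumes the row has already been fetched from performance_schema.events_stages_current by some client; the `StageRow` type and both helper functions are hypothetical.

```python
from typing import NamedTuple, Optional


class StageRow(NamedTuple):
    event_name: str
    work_completed: int
    work_estimated: int


def progress_percent(row: StageRow) -> Optional[float]:
    """Return the local progress as a percentage, or None if no estimate."""
    if row.work_estimated <= 0:
        return None
    return 100.0 * row.work_completed / row.work_estimated


def describe(row: StageRow) -> str:
    """Format a stage row as 'stage name (done/total, percent)'."""
    stage = row.event_name.replace("stage/group_rpl/", "", 1)
    percent = progress_percent(row)
    if percent is None:
        return f"{stage} (progress unknown)"
    return f"{stage} ({row.work_completed}/{row.work_estimated}, {percent:.0f}%)"
```

With the sample row shown above (3 of 10 units completed), `describe` reports 30% for the "waiting for pending transactions to finish" stage.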

For C. we rely on the server error primitives of the UDF mechanism.
If a function is executed with invalid parameters the UDF API will return an error with our custom error message like:

 ERROR 1123 (HY000): Can't initialize function 'group_replication_set_as_primary'; Member is in multi-primary mode.

If the function is validated and there is an error during execution then we return a custom error ourselves.
For that we will add the error:

 ER_GRP_RPL_UDF_ERROR

As there is currently no number associated with this error, we use NNNN as a placeholder:

 ERROR NNNN (HY000): The function 'group_replication_set_as_primary' failed. There is a member joining the group.

10. User interface - a summary

To DBAs in general, here is the summary part.

Scenario A:

If you are in multi-primary mode and want to change to single primary, just execute

 SELECT group_replication_switch_to_single_primary_mode();

And if you have a primary in mind you can opt to

 SELECT group_replication_switch_to_single_primary_mode(primary_uuid);
 Mode switched to single-primary successfully

While the action runs, you can check its progress with

 SELECT event_name, work_completed, work_estimated FROM performance_schema.events_stages_current WHERE event_name LIKE "%stage/group_rpl%";
 EVENT_NAME                                                           WORK_COMPLETED  WORK_ESTIMATED
 stage/group_rpl/Single-primary Switch: checking group pre-conditions       0               1

Scenario B:

If you are in single-primary mode and want to change to multi-primary, just execute

 SELECT group_replication_switch_to_multi_primary_mode();
 Mode switched to multi-primary successfully

While the action runs, you can check its progress with

 SELECT event_name, work_completed, work_estimated FROM performance_schema.events_stages_current WHERE event_name LIKE "%stage/group_rpl%";
 EVENT_NAME                                                                        WORK_COMPLETED  WORK_ESTIMATED
 stage/group_rpl/Multi-primary Switch: waiting for pending transactions to finish.      2               10

Scenario C:

If you are in single-primary mode and want to change the primary, just execute

 SELECT group_replication_set_as_primary(server_uuid);
 Primary server switched to: UUID

While the action runs, you can check its progress with

 SELECT event_name, work_completed, work_estimated FROM performance_schema.events_stages_current WHERE event_name LIKE "%stage/group_rpl%";
 EVENT_NAME                                                                  WORK_COMPLETED  WORK_ESTIMATED
 stage/group_rpl/Primary Switch: waiting on another member step completion        0               1

Note that when the primary election algorithm kicks in, you can also monitor that in another stage:

 SELECT event_name, work_completed, work_estimated FROM performance_schema.events_stages_current WHERE event_name LIKE "%stage/group_rpl%";
 EVENT_NAME                                                                         WORK_COMPLETED  WORK_ESTIMATED
 stage/group_rpl/Primary Election: Waiting for members to turn on super_read_only        3               6

Error cases 1: State changes under the same mode

You are in single-primary / multi primary and you execute a migration to the mode the system is already in.

 SELECT group_replication_switch_to_multi_primary_mode();
 The system is already on multi-primary mode

The function just returns a string stating that. No error.

Error cases 2: Primary switches in multi-primary mode

You are in multi primary and you execute a primary election

 SELECT group_replication_set_as_primary(server_uuid);
 ERROR HY000: Can't initialize function 'group_replication_set_as_primary'; In multi-primary mode. Use group_replication_switch_to_single_primary_mode.

Error cases 3: Generic Validation errors

You execute a function with an improper argument, with no argument where one is required, or in some other invalid way. Some of the errors are:

SELECT group_replication_switch_to_single_primary_mode(____)
ERROR HY000: Can't initialize function 'group_replication_switch_to_single_primary_mode'; Wrong arguments: This function either takes no arguments or a single server uuid
ERROR HY000: Can't initialize function 'group_replication_switch_to_single_primary_mode'; Wrong arguments: The server uuid is not valid.
ERROR HY000: Can't initialize function 'group_replication_switch_to_single_primary_mode'; The requested uuid is not a member of the group.

SELECT group_replication_set_as_primary(____);
ERROR HY000: Can't initialize function 'group_replication_set_as_primary'; Wrong arguments: You need to specify a server uuid.
ERROR HY000: Can't initialize function 'group_replication_set_as_primary'; Wrong arguments: The server uuid is not valid.
ERROR HY000: Can't initialize function 'group_replication_set_as_primary'; The requested uuid is not a member of the group.

SELECT group_replication_switch_to_multi_primary_mode(____);
ERROR HY000: Can't initialize function 'group_replication_switch_to_multi_primary_mode'; Wrong arguments: This function takes no arguments.
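The validation rules behind these messages can be sketched as a small check. This is only an illustration: the helper names (looks_like_uuid, validate_primary_argument), the Arg_check result codes and the plain list of member uuids are assumptions made for the example, not the UDF initialization code of the plugin.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical result codes; the worklog only defines the error texts.
enum class Arg_check { OK, WRONG_ARG_COUNT, INVALID_UUID, NOT_A_MEMBER };

// Returns true if text has the canonical 8-4-4-4-12 hexadecimal uuid layout.
inline bool looks_like_uuid(const std::string &text) {
  if (text.size() != 36) return false;
  for (std::size_t i = 0; i < text.size(); ++i) {
    const bool dash_position = (i == 8 || i == 13 || i == 18 || i == 23);
    if (dash_position) {
      if (text[i] != '-') return false;
    } else if (!std::isxdigit(static_cast<unsigned char>(text[i]))) {
      return false;
    }
  }
  return true;
}

// Validates the arguments of a function that takes zero or one server uuid.
// uuid_required models group_replication_set_as_primary, where the uuid is
// mandatory (assumption: switch_to_single_primary_mode passes false).
inline Arg_check validate_primary_argument(
    const std::vector<std::string> &args,
    const std::vector<std::string> &group_member_uuids, bool uuid_required) {
  if (args.size() > 1 || (uuid_required && args.empty()))
    return Arg_check::WRONG_ARG_COUNT;
  if (args.empty()) return Arg_check::OK;
  if (!looks_like_uuid(args[0])) return Arg_check::INVALID_UUID;
  const bool is_member =
      std::find(group_member_uuids.begin(), group_member_uuids.end(),
                args[0]) != group_member_uuids.end();
  return is_member ? Arg_check::OK : Arg_check::NOT_A_MEMBER;
}
```

Each non-OK code would map to one of the "Can't initialize function" messages listed above; the order of the checks (count, format, membership) is a design choice of this sketch.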

Error cases 4: The DBA doesn't have privileges

You try to execute an action with no privileges

 SELECT group_replication_set_as_primary(server_uuid);
 ERROR 1123 (HY000): Can't initialize function 'group_replication_set_as_primary';  User 'group_rpl_user'@'%'. needs SUPER or GROUP_REPLICATION_ADMIN privileges.

Error cases 5: Member is not in a valid state

The member is in error state or unreachable.

 SELECT group_replication_set_as_primary(server_uuid);
 ERROR 1123 (HY000): Can't initialize function 'group_replication_set_as_primary'; The member needs to be ONLINE and in a reachable partition.

Error cases 6: An action is already running

There is already a group action running.
This is a runtime error, not validation error:

 SELECT group_replication_set_as_primary(server_uuid);
 ERROR NNNN (HY000): The function 'group_replication_set_as_primary' failed; There is already a configuration action being executed. Wait for it to finish.

Error cases 7: A member is joining

There is a member joining.
This is a runtime error, not validation error:

 SELECT group_replication_set_as_primary(server_uuid);
 ERROR NNNN (HY000): The function 'group_replication_set_as_primary' failed; A member is joining the group, wait for it to be ONLINE.

Error cases 8: A member of a lower version is present

A member that has a lower version and cannot execute these actions is present in the group.
This is a runtime error, not validation error:

 SELECT group_replication_set_as_primary(server_uuid);
 ERROR NNNN (HY000): The function 'group_replication_set_as_primary' failed; The group has a member with a version that does not support group coordinated operations.

Error cases 9: The primary fails before election

We are electing a primary or changing to single primary mode with an appointed primary and it fails.
This is a runtime error, not validation error:

 SELECT group_replication_set_as_primary(server_uuid);
 ERROR NNNN (HY000): The function 'group_replication_set_as_primary' failed; The appointed primary for election left the group, this operation will be aborted. No primary election was invoked under this operation.

Error cases 10: The primary fails during election - Change to Single Primary mode

We are changing to single primary mode with an appointed primary and it fails during election.
The operation completes but there is a warning.

 SELECT group_replication_switch_to_single_primary_mode(server_uuid);
 Mode switched to single-primary with reported warnings: The appointed primary being elected exited the group. Check the group member list to see who is the primary
 Warnings:
 Warning    NNNN    The appointed primary being elected exited the group. Check the group member list to see who is the primary. There were warnings detected also on other members, check their logs.

Error cases 11: The primary fails during election - Change of Primary Member

We are changing the primary member to an appointed primary and it fails during election.
This is a runtime error, not validation error:

 SELECT group_replication_set_as_primary(server_uuid);
 ERROR NNNN (HY000): The function 'group_replication_set_as_primary' failed; Primary assigned for election left the group, this operation will be aborted and if present the old primary member will be re-elected. Check the group member list to see who is the primary.

Error cases 12: The primary fails after election

We are electing a primary or changing to single primary mode with an appointed primary and it fails when the member was already elected.
Only a warning is thrown.

 SELECT group_replication_switch_to_single_primary_mode("MEMBER1_UUID");
 Mode switched to single-primary with reported warnings: The appointed primary left the group as the operation is terminating. Check the group member list to see who is the primary
 Warnings:
 Warning NNNN   The appointed primary left the group as the operation is terminating. Check the group member list to see who is the primary

Error cases 13: A slave channel prevents the operation

When you execute a function but a channel presence prevents it

 SELECT group_replication_switch_to_single_primary_mode("MEMBER2_UUID");
 ERROR HY000: The function 'group_replication_switch_to_single_primary_mode' failed. The requested primary is not valid as a slave channel is running on member MEMBER1_UUID

 SELECT group_replication_switch_to_single_primary_mode();
 ERROR HY000: The function 'group_replication_switch_to_single_primary_mode' failed. There is more than a member in the group with running slave channels so no primary can be elected.

 SELECT group_replication_set_as_primary("MEMBER2_UUID");
 ERROR HY000: The function 'group_replication_set_as_primary' failed. There is a slave channel running in the group's current primary member.

Error cases 14: There is a member in group from an older version

When you execute a function but there is a member that does not have this feature.

 SELECT group_replication_switch_to_single_primary_mode();
 ERROR HY000: The function 'group_replication_switch_to_single_primary_mode' failed. The group has a member with a version that does not support group coordinated operations.

Error cases 15: When you kill a coordinated change

When you execute a coordinated change and you kill it.
Note that, depending on the progress of the action, the messages can differ.

 SELECT group_replication_switch_to_single_primary_mode();
 ERROR HY000: The function 'group_replication_switch_to_single_primary_mode' failed. This operation was locally killed and for that reason terminated. The member will now leave the group.

 SELECT group_replication_switch_to_single_primary_mode();
 ERROR HY000: The function 'group_replication_switch_to_single_primary_mode' failed. This operation was locally killed and for that reason terminated. However the member is already configured to run in single primary mode, but the configuration was not persisted. The member will now leave the group.

Error cases 16: Critical failures

When you execute a function and some critical error occurs

 SELECT group_replication_set_as_primary(server_uuid);
 ERROR HY000: The function 'group_replication_set_as_primary' failed. A critical error occurred during the local execution of this action. The member will now leave the group.

Error cases 17: Other failures

We are electing a primary or executing a mode migration and something fails.

 SELECT group_replication_set_as_primary(server_uuid);
 ERROR NNNN (HY000): The function 'group_replication_set_as_primary' failed; Error message here

11. Other points

Upgrade/Downgrade

This WL does not depend on any upgrade, nor does it bring restrictions to downgrades.

There are however some behaviors that are associated to upgrade/downgrade processes where the group is formed by mixed versions.

A mix of 5.7 and 8.0 members

In such a group:

If an action is invoked on 8.0 members the action shall error out when it checks that 5.7 version members are present.

There is no issue regarding 5.7 members, as actions cannot be invoked from such members.
One point about 5.7 members is that the old election algorithm shall be used whenever a member of this version is present.

A mix of 8.0 and 9.0 members

This applies to any version difference between members above 8.0.

In a mixed group where actions are supported by all members, the primary election shall only allow the selection of a primary among the members running the lowest version in the group.
Selecting a higher version as primary could lead to the dissemination of unknown messages to the lower version members.
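
The lowest-version rule can be illustrated with a short check. This is a sketch under assumptions: Member_info, can_be_appointed_primary and the numeric version_id encoding (for example 80013 for 8.0.13) are hypothetical names chosen for the example, not the plugin's member metadata.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical member description; only the fields needed for the check.
struct Member_info {
  std::string uuid;
  unsigned long version_id;  // assumed numeric encoding, e.g. 80013
};

// Returns true when candidate_uuid belongs to a member that runs the lowest
// version present in the group, which is the only kind of member the rule
// above allows to be appointed as primary.
inline bool can_be_appointed_primary(const std::vector<Member_info> &members,
                                     const std::string &candidate_uuid) {
  if (members.empty()) return false;
  const auto lowest = std::min_element(
      members.begin(), members.end(),
      [](const Member_info &a, const Member_info &b) {
        return a.version_id < b.version_id;
      });
  const auto candidate = std::find_if(
      members.begin(), members.end(),
      [&](const Member_info &m) { return m.uuid == candidate_uuid; });
  return candidate != members.end() &&
         candidate->version_id == lowest->version_id;
}
```

With members m1 and m2 on the lower version and m3 on a higher one, m1 and m2 are valid candidates while m3 is rejected.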

Security

Users must have GROUP_REPLICATION_ADMIN privileges as for START and STOP commands.

Users that have this privilege can also stop group replication in all members interrupting service so no new angle of attack is introduced.

In terms of operation abuse, no two commands can run at the same time so malicious DBAs could only cause sequential elections or changes of mode.
These can cause some delay in the cluster, but overall there is no effect in terms of availability.

Query life cycle and killing behaviors

The way query execution is envisioned in this WL is that a DBA executes a query in the form of a UDF, and that query will block until an instantiated group action is concluded.
The query process will remain waiting for a result while a newly spawned thread executes the action process.

The question here is how this relates to query termination by the DBA.

In terms of kill semantics, we should start with the basics of how query kills work in MySQL.
The basic rule is that the process in execution should check at different stages whether its thread was given a kill signal.

The same principle applies here: actions should regularly check whether their thread was killed and, if so, abort.
The notion of not committing or rolling back locally doesn't apply here, though.
This has 2 major consequences:

Killing an accepted group coordinated change means this member is diverging in configuration from the group.
Hence, the member shall leave the group and move to error mode.
There is a point in time after which the kill signal may mean nothing, as only trivial operations remain in the process.
So, the DBA may kill a group action only to find out that it completed successfully.

In practical terms this can be tracked through stages.
When the final stage kicks in, it means the action is now completed locally and is only waiting for other members to finish. This means that any kill request after this will not cancel the operation.

Another point here is that the DBA may kill the stuck query when it is in fact the underlying action execution thread that is taking too much time.
This design also takes that into account, guaranteeing that, upon detecting it was killed, the query process shall kill the local action process.

Deployment / Install

This worklog brings no changes in terms of install and deploy.
Just some notes:

Performance schema needs to be enabled for monitoring.
Also, some of the above setup instruments may have to be enabled if needed.
UDFs are installed automatically, so no user action is needed here.
The use of SET PERSIST means some user defined settings in the configuration files may be ignored as newer settings take precedence over them.

12. Points not considered in the Worklog

Cancellation:
We do not handle cancellation of requests in this worklog due to its complexity.

1. Coordinator

Coordinator - Code Skeleton

//The coordinator class where actions are submitted
class Group_action_coordinator
{
public:
  //Proposes a group action
  int coordinate_action_execution(Group_action *action);

  /*
    Asks the coordinator to stop any ongoing action
    @param coordinator_stop is the coordinator terminating
  */
  int stop_coordinator_process(bool coordinator_stop);

  //Handle incoming action message (start or stop)
  int handle_action_message(Group_action_message *msg);

  //Queue notification (primary change, message received, ...)
  int queue_notification(Action_notification notification);

  //Returns if there is a group action running
  bool is_group_action_running();

  /*
    Adapts the coordinator to members leaving the group
    @param number_members   the current number of members
    @param is_leaving       is this member leaving?
  */
  void handle_leaving_members(int number_members, bool is_leaving);

private:

  //Handle incoming start action message
  int handle_action_start_message(Group_action_message *msg);

  //Handle incoming stop action message
  int handle_action_stop_message(Group_action_message *msg);

  //Declare this action as terminated to other members
  // @param message_type for the sent message
  int signal_action_terminated(enum_action_message_type message_type);

  //Leave the group and change state to error
  int leave_on_action_error();

  //Handle the termination of current action
  void terminate_action();

  //The id defined for each action that is currently running
  enum_group_action_type current_action_id;

  //The action that is currently being executed
  Group_action *executing_action;

  //The notifications queued for the current action
  Queue<Action_notification> notifications;

  //The uuids of the members known for the current action
  list<uuid> known_messages_uuids;

  //The lock to coordinate start and stop requests
  lock coordinator_process_lock;

  //The flag to avoid concurrent action start requests
  bool action_ongoing;

  //The flag to avoid action starts post stop
  bool coordinator_terminating;

  //Is the action terminating
  bool action_terminating;

  //The handler where actions can report progress through stages
  Plugin_stage_monitor_handler *monitoring_stage_handler;
};

Coordinator - Method logic

General idea

->user action

Group_action *action_X= new Custom_action(parameters_from_user);
error= group_action_coordinator.coordinate_action_execution(action_X);
return error;

coordinate_action_execution(Group_action action)

Lock coordinator_process_lock
[action_ongoing == true || coordinator_terminating == true]
Then abort (fail early)
Set action_ongoing to true
Get action message with Group_action::get_action_message
Send the message to all members
If it fails to send, return error to the user
Unlock coordinator_process_lock
Create a Plugin_stage_monitor_handler instance and set a stage;
Wait for response from action execution (execution is on another thread).
Set action_ongoing to false.
[If return value is either GROUP_ACTION_KILLED or GROUP_ACTION_ERROR]
Execute leave_on_action_error()
End the stage on the Plugin_stage_monitor_handler instance.
Check the response and return either success or error to the client

See the Killing queries section below for the code path taken when the query is killed.
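
The fail-early admission check at the start of these steps can be sketched as follows. This is a reduced illustration and not the coordinator itself: the class name Action_admission and its three methods are hypothetical, and message sending, stages and the wait for the action result are left out.

```cpp
#include <mutex>

// Hypothetical, reduced coordinator: only the admission check is modelled.
class Action_admission {
 public:
  // Returns true if the caller may propose an action; false means the
  // request must fail early (another action is ongoing or we are stopping).
  bool try_begin_action() {
    std::lock_guard<std::mutex> guard(coordinator_process_lock);
    if (action_ongoing || coordinator_terminating) return false;
    action_ongoing = true;
    return true;
  }

  // Called when the proposed action returned, with success or error.
  void end_action() {
    std::lock_guard<std::mutex> guard(coordinator_process_lock);
    action_ongoing = false;
  }

  // Called when the coordinator stops: no action is admitted afterwards.
  void begin_termination() {
    std::lock_guard<std::mutex> guard(coordinator_process_lock);
    coordinator_terminating = true;
  }

 private:
  std::mutex coordinator_process_lock;
  bool action_ongoing = false;
  bool coordinator_terminating = false;
};
```

Because both flags are read and written under the same lock, two concurrent proposals cannot both be admitted, and a proposal that races with a stop either runs before the stop or is rejected.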

stop_coordinator_process(bool coordinator_stop)

[Is an action running]

Lock coordinator_process_lock
Set coordinator_terminating= coordinator_stop
Invoke Group_action::stop_action_execution(false).
Wait for the thread executing the action to finish.
Invoke Group_action_coordinator::terminate_action()
Unlock coordinator_process_lock

handle_action_message(Group_action_message *msg)

[coordinator_terminating == true]
Return
[Is it a start message]
Invoke: Group_action_coordinator::handle_start_action_message
If there is an error on handling, return
If local, awake the action coordination so it aborts.
[Is it a stop message]
Invoke: Group_action_coordinator::handle_stop_action_message
[Is it an abort message]
Invoke Group_action_coordinator::stop_coordinator_process(false)

is_group_action_running()

Return true if there is a running action.
The current idea is to test if there is a defined action id.
Using action_ongoing can also be an option but this is set before the action is accepted.
So using action_ongoing can cause unnecessary join failures from new members.

handle_leaving_members(int number_members, bool is_leaving)

[is_leaving == true]
Invoke Group_action_coordinator::stop_coordinator_process(true)
Return
Update known_messages_uuids.
[Are the termination messages == known_messages_uuids]
Invoke Group_action_coordinator::terminate_action();

handle_start_action_message(Group_action_message *msg)

1) [Is an action already/still running]
Then abort
If local, awake the action coordination so it aborts.
[No action running]
Go to 2)
2) [Are there any members of the group in recovery]
Then abort
If local, awake the action coordination so it aborts.
[No members in recovery]
Set known_messages_uuids (this is a logically consistent moment)
Go to 3)
3) Get the local action coordinator.
Set the current action id.
[If it is the sender]
Get the Group_action object for this message
[If remote]
Instantiate a new Group_action object.
Set action_terminating to false.
Give the message to the Group_action object for processing:
Group_action::process_action_message(Group_action_message msg)
Instantiate a new Plugin_stage_monitor_handler and set monitoring_stage_handler
Invoke the execution method of the Group_action class in a newly spawned thread:
Group_action::execute_action
Return
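
The two pre-checks that guard the acceptance of a start message can be sketched as below. The names Start_message_gate, Start_decision and Member_state are assumptions made for this example; the sketch models only the checks and the snapshot of known members, not the creation of the action or its thread.

```cpp
#include <string>
#include <vector>

// Hypothetical outcome of handling a start message.
enum class Start_decision {
  ACCEPTED,
  REJECTED_ACTION_RUNNING,
  REJECTED_MEMBER_IN_RECOVERY
};

// Hypothetical view of a group member at message delivery time.
struct Member_state {
  std::string uuid;
  bool in_recovery;
};

class Start_message_gate {
 public:
  // Applies the checks in order: running action first, then recovery.
  // On acceptance the current member list becomes the known member set.
  Start_decision on_start_message(const std::vector<Member_state> &members) {
    if (action_running) return Start_decision::REJECTED_ACTION_RUNNING;
    for (const Member_state &member : members)
      if (member.in_recovery)
        return Start_decision::REJECTED_MEMBER_IN_RECOVERY;
    known_messages_uuids.clear();
    for (const Member_state &member : members)
      known_messages_uuids.push_back(member.uuid);
    action_running = true;
    return Start_decision::ACCEPTED;
  }

  // Models terminate_action(): the next start message can be accepted.
  void on_action_terminated() {
    action_running = false;
    known_messages_uuids.clear();
  }

  bool is_group_action_running() const { return action_running; }
  const std::vector<std::string> &known_members() const {
    return known_messages_uuids;
  }

 private:
  bool action_running = false;
  std::vector<std::string> known_messages_uuids;
};
```

Since messages are delivered to every member in the same order, each member reaches the same decision for the same start message, which is what makes the first received start message the one that is executed.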

handle_start_action_message - Execution thread

Execute Group_action::execute_action()
If the method returns a GROUP_ACTION_RESTART signal, re-execute the action.
When the thread job finishes, execute Group_action_coordinator::signal_action_terminated()
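
The restart behaviour of the execution thread can be sketched as a simple loop. This is an illustration under assumptions: GROUP_ACTION_OK is a hypothetical success code added for the example, the callbacks stand in for Group_action::execute_action() and Group_action_coordinator::signal_action_terminated(), and the termination callback receives the result instead of a message type for brevity.

```cpp
#include <functional>

// Result codes named after the ones used in this worklog; GROUP_ACTION_OK
// is an assumption of this sketch.
enum Group_action_result {
  GROUP_ACTION_OK,
  GROUP_ACTION_RESTART,
  GROUP_ACTION_ERROR,
  GROUP_ACTION_KILLED
};

// Runs the action, re-executing it while it asks for a restart, and
// signals termination exactly once when the final result is known.
inline Group_action_result run_action_thread(
    const std::function<Group_action_result()> &execute_action,
    const std::function<void(Group_action_result)> &signal_action_terminated) {
  Group_action_result result;
  do {
    result = execute_action();
  } while (result == GROUP_ACTION_RESTART);
  signal_action_terminated(result);
  return result;
}
```

An action that returns GROUP_ACTION_RESTART twice and then finishes is executed three times, but the other members only see a single termination message.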

handle_stop_action_message(Group_action_message *msg)

Update the completed work on monitoring_stage_handler
[are the termination messages uuids == known_messages_uuids]
Then declare the action as terminated, go to 3)
Update known_messages_uuids to remove the received member uuid.
Use the end stage method on monitoring_stage_handler
Invoke Group_action_coordinator::terminate_action()
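
One possible reading of the bookkeeping shared by handle_stop_action_message and handle_leaving_members is sketched below. The class Termination_tracker and its methods are hypothetical; the sketch assumes that the action can be declared terminated once every known member has either sent its stop message or left the group, and that the stage progress is reported as completed versus estimated work.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <utility>

// Hypothetical tracker for the members that still owe a stop message.
class Termination_tracker {
 public:
  explicit Termination_tracker(std::set<std::string> known_members)
      : pending(std::move(known_members)), total(pending.size()) {}

  // Handles a stop message from sender_uuid. Returns true when no member
  // is pending, so the action can be declared terminated.
  bool on_stop_message(const std::string &sender_uuid) {
    pending.erase(sender_uuid);
    return pending.empty();
  }

  // Handles a member leaving the group. A member that left without
  // answering is no longer expected, so the estimated work shrinks.
  bool on_member_left(const std::string &uuid) {
    if (pending.erase(uuid) > 0) --total;
    return pending.empty();
  }

  // Values that could feed WORK_ESTIMATED and WORK_COMPLETED.
  std::size_t work_estimated() const { return total; }
  std::size_t work_completed() const { return total - pending.size(); }

 private:
  std::set<std::string> pending;  // members that have not answered yet
  std::size_t total;              // members expected to answer
};
```

With three known members, a stop message from one and the departure of another leave a single pending member; its stop message then completes the action.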

signal_action_terminated(enum_action_message_type)

Set action_terminating to true;
Use get_termination_key() from the Group_Action class.
Set the stage on the monitoring_stage_handler
Set the estimated work to the number of known members, and the completed to the number of received messages.
Instantiate message of type Group_action_message.
Use the given message type and use ACTION_END_PHASE.
Send the message.

terminate_action()

Delete any notification not used by the current action.
Awake coordinate_action_execution method.
Unset the current_action_id.

leave_on_action_error()

Change member state to error
Leave the group
Cancel pending transactions
Set read mode to true.

Coordinator - Code related changes

Plugin_gcs_events_handler::check_group_compatibility()

As it is stipulated in the requirements, new members cannot join during a group action.

To accomplish this we need to change this method and add a check for:
Group_action_coordinator::is_group_action_running()

Plugin_gcs_events_handler::handle_joining_members()

In a continuation of the above requirement, but with the intent of improving user experience, we propose to add the same check here.

The idea is that the joining member will error out, but the other members should also print the reason the member was expelled.

So, on the code branch

else if (number_of_joining_members > 0 ||
        (number_of_joining_members == 0 && number_of_leaving_members == 0))
{

We shall add a check and a print to the error log if the member joined while an action was ongoing.

Plugin_gcs_events_handler::handle_leaving_members()

Similar to the recovery call for update, we also invoke here:
Group_action_coordinator::handle_leaving_members(int number_members, is_leaving)

Coordinator - Concurrency notes

Start vs Start scenarios

The coordinator prevents several start messages from being sent at the same time.
Due to the coordinator_process_lock, another action can only be sent once the current one returns.
Regarding concurrency between requests from other members, we rely on the sequential nature of GCS.
The first start action message received is the one executed; the others fail.

Start vs Stop scenarios

Stop happens on plugin.cc method terminate_plugin_modules().
This means that it happens when the member has left the group, so in theory there are no more messages starting actions.

There are however possible requests for actions in parallel.
Due to the coordinator locks, either:
The stop goes first and sets coordinator_terminating, so all requests will fail.
The start goes first, but stop ends it.

Coordinator - Life cycle

Initialization

This class is initialized in plugin.cc, in plugin_group_replication_start.
Since it does not depend on any server service it does not rely on the delayed thread class.

Termination

This class is terminated and deleted on terminate_plugin_modules().
The method Group_action_coordinator::stop_coordinator_process(true) is invoked

Invocation

See section 8, UDF functions

Coordinator - Killing queries

As described in the functional requirements and high level design, this WL shall be implemented with a responsive behavior to DBA kill requests.

One key point is that query kill signals shall also kill the underlying action process.

So in coordinate_action_execution(Group_action action), it is assumed that step 7 (the wait for the response from the action execution) is not a hard wait but a timed one that periodically checks if the request was killed.

If killed, the process shall:

Invoke Group_action::stop_action_execution(true).
Wait for the thread executing the action to finish.
Invoke Group_action_coordinator::terminate_action()
[Group Action result is GROUP_ACTION_NOT_KILLED]
Send a warning stating the action finished in spite of the kill
[Group Action result is GROUP_ACTION_KILLED]
Send an error stating that the action was killed and the member will leave the group.
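
The timed wait mentioned above could look roughly like the sketch below. It is a minimal illustration that assumes a hypothetical Action_waiter class and an atomic kill flag supplied by the caller; how the server exposes the killed state of the query thread is not modelled.

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical waiter: blocks until the action finishes, while polling a
// kill flag so that the wait is never a hard one.
class Action_waiter {
 public:
  // Called by the action execution thread when the action is done.
  void notify_finished() {
    {
      std::lock_guard<std::mutex> guard(lock);
      finished = true;
    }
    cond.notify_all();
  }

  // Returns true if the action finished, or false if the wait was
  // abandoned because the query was killed.
  bool wait_for_action(const std::atomic<bool> &query_killed,
                       std::chrono::milliseconds poll_interval) {
    std::unique_lock<std::mutex> guard(lock);
    while (!finished) {
      if (query_killed.load()) return false;
      // Wakes up on notification or after poll_interval, then re-checks.
      cond.wait_for(guard, poll_interval);
    }
    return true;
  }

 private:
  std::mutex lock;
  std::condition_variable cond;
  bool finished = false;
};
```

When wait_for_action returns false, the caller would continue with the kill steps listed above: stop the action execution, wait for its thread, and terminate the action.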

2. Group Actions : Parent class

The parent class for all actions