No Ping Will Tear Us Apart - Enabling member auto-rejoin in Group Replication

With the release of MySQL 8.0.16, we’ve added some features to the Group Replication (GR) plugin to enhance its high-availability capabilities. One of these features is the ability for a member that has left the group to automatically rejoin it, under certain circumstances, without user intervention.

In order to understand the benefits of this feature and how to use it, we’ll quickly look at the concepts behind it and what motivates its existence in the first place.

Introduction

As you are probably aware by now, the GR plugin allows a MySQL user to effortlessly manage highly available groups, complete with all the trappings needed to guarantee high-availability in a system, like for instance fault-tolerance or failure detection.

One of the base guarantees that is provided in GR is that the group is presented to the user as one indivisible entity, meaning that once a member joins or leaves the group, that change is immediately seen by the other members. The data itself is eventually consistent (within the group) by default, though this can be modified. To implement this guarantee, GR makes use of a group membership service, as well as detecting conflicting transactions through a consensus algorithm and aborting them. This latter aspect of GR is beyond the scope of this blog post and not entirely related to the new auto-rejoin feature.

With that in mind, what does a group membership service entail? Well, exactly what it says in the name – it allows members to join and manages the group. Adding a member to an existing group is simple enough, but the joining member must still conform to a few criteria. Once the member is accepted in the group, it undergoes a process called “distributed recovery”, where the joining member catches up to the group in terms of transactions processed so far (it does this by choosing one of the members in the group to stream the already processed transactions to it, which is referred to in GR as “donating”). Finally, the new member is declared online, as long as no errors are encountered during this distributed recovery process.

As for managing the group, GR relies on a group communication layer known as the Group Communication System (GCS). This layer implements the consensus algorithm used to resolve conflicting transactions and enforces communication properties that are crucial for achieving the indivisible view of the group mentioned earlier, such as total order of messages, safe delivery and view synchrony (i.e. all members exchange views of the group, meaning who is in the group at a given point in time, so that they all agree on the same group structure).

In order to achieve these properties, GCS needs to be able to detect failures. What does this mean in practice? It means it needs to detect which members of the group are failing, or appear to be failing. The action taken once these members are detected depends on a number of circumstances, but in any case, once GCS decides that a member has failed, that member is removed from the group so that the group can stay healthy. For this, GCS introduces a failure detector in each member that analyzes the messages exchanged within the group. If it doesn’t receive a message from a given member for some amount of time, the failure detector generates a “suspicion” for that member, meaning that GCS believes the member may have failed. The time between a member being “suspected” and something actually happening (for instance, being kicked out of the group) is configurable.

The need to rejoin a member

Now that you know the guarantees that a group must provide, in order to be highly available, and how it accomplishes those guarantees, consider the following example scenario:

You have a group of three members and one of those members has a flaky network connection, i.e. it sporadically experiences glitches that cause it to lose packets, have its connections severed, or hit some other error that leaves it unable to communicate with its peers.

Consider also that these errors last long enough to exceed the value of group_replication_member_expel_timeout (the time interval between a member being suspected and actually being kicked out of the group). During one of these glitches, the rest of the group will decide to kick out the member. The problem is that once this member reestablishes the severed connection, it will receive a view of the group in which it has been expelled, and it will be out of the group. At that point manual intervention is needed by, for example, a DBA (you can actually circumvent this manual intervention by using the exit state action feature to abort the server when it is expelled from the group and having a monitor daemon that relaunches the server).
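
As a rough sketch, the two settings mentioned above could be adjusted as follows. The exit state action is exposed by the GR plugin as the group_replication_exit_state_action system variable; the value of 30 seconds is purely illustrative and not a recommendation.

```sql
-- Give a suspected member extra time (in seconds) before it is expelled.
-- The value 30 is an arbitrary example.
SET GLOBAL group_replication_member_expel_timeout = 30;

-- Alternatively, have the server shut itself down when it leaves the group
-- involuntarily, so that an external monitor can restart it.
SET GLOBAL group_replication_exit_state_action = 'ABORT_SERVER';

-- Check what is currently configured.
SELECT @@GLOBAL.group_replication_member_expel_timeout,
       @@GLOBAL.group_replication_exit_state_action;
```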

A similar scenario arises if the member somehow loses contact with a majority of the group, due to similar network glitches. If that member has an unreachable majority timeout different from 0, it will wait that amount of time before leaving the group (setting the timeout to 0 means it will wait forever). If the connection is only reestablished after that timeout has elapsed, the member won’t be able to rejoin the old group, once again requiring manual intervention.
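
For reference, the unreachable majority timeout is controlled by the group_replication_unreachable_majority_timeout system variable. A minimal sketch, with an arbitrary value of 60 seconds:

```sql
-- Leave the group if a majority stays unreachable for 60 seconds (example value).
-- The default of 0 makes the member wait indefinitely.
SET GLOBAL group_replication_unreachable_majority_timeout = 60;
```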

Given this, there is a clear need to remove the need for manual intervention when there are network glitches. Enter auto-rejoin.

With 8.0.16 we introduce an auto-rejoin procedure: once a member is expelled from the group or loses contact with a majority of the group, it automatically tries to rejoin the group up to a given number of times, waiting at least 5 minutes between each retry.

How do I enable auto-rejoin?

You can use auto-rejoin by setting the group_replication_autorejoin_tries system variable to the number of retries you want.

SET GLOBAL group_replication_autorejoin_tries = 3;

The default value is 0, which means that the server won’t run the auto-rejoin procedure. We chose this default so we don’t change the previous behavior.
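
If you want the setting to survive a server restart, one option (a sketch, assuming the GR plugin is already installed and your account has the privileges needed to persist system variables) is SET PERSIST, followed by a quick check of the value in effect:

```sql
-- Persist the number of auto-rejoin attempts across restarts.
SET PERSIST group_replication_autorejoin_tries = 3;

-- Confirm the value currently in effect.
SELECT @@GLOBAL.group_replication_autorejoin_tries;
```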

And how can I know that an auto-rejoin is happening?

As with many features in MySQL, monitoring and observability are provided for the auto-rejoin procedure, capturing important details that can be analyzed by a DBA, a monitoring system, a DevOps engineer, etc.

The observability of auto-rejoin relies on the performance schema infrastructure to collect data on the procedure, in the form of stage events. Stage events are a performance schema concept used to represent the work or steps done in a given procedure.

They maintain information about:

- Which thread the event occurred in (THREAD_ID)
- The event name (EVENT_NAME)
- Timestamps for the point in time where it began, where it ended (if applicable) and the total duration or the duration of the event so far (TIMER_START, TIMER_END and TIMER_WAIT)
- The work units that have been done in this event and the estimated work units needed until the event stops (WORK_COMPLETED, WORK_ESTIMATED)

So when an auto-rejoin procedure begins, it registers a stage event in the performance schema with the name “stage/group_rpl/Undergoing auto-rejoin procedure”. Using the tables performance_schema.events_stages_current, performance_schema.events_stages_summary_global_by_event_name and performance_schema.events_stages_history_long we can observe the following:

- Whether or not an auto-rejoin procedure is ongoing
- The number of retries attempted so far
- The estimated time remaining until the next retry
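
The stage event only shows up if the corresponding instrument and consumers are enabled in the performance schema. The queries below are a sketch of how one might check that and then look at past auto-rejoin runs; the column selection is just an example.

```sql
-- Is the GR stage instrumentation enabled?
SELECT NAME, ENABLED, TIMED
FROM performance_schema.setup_instruments
WHERE NAME LIKE 'stage/group_rpl/%';

-- Are the stage event consumers enabled?
SELECT NAME, ENABLED
FROM performance_schema.setup_consumers
WHERE NAME LIKE 'events_stages%';

-- Past auto-rejoin procedures, with their duration converted
-- from picoseconds to seconds.
SELECT THREAD_ID, EVENT_NAME, WORK_COMPLETED,
       TIMER_WAIT * 1e-12 AS duration_seconds
FROM performance_schema.events_stages_history_long
WHERE EVENT_NAME LIKE '%auto-rejoin%';

-- Aggregate view: number of recorded auto-rejoin stage events
-- and the total time spent in them.
SELECT EVENT_NAME, COUNT_STAR, SUM_TIMER_WAIT * 1e-12 AS total_seconds
FROM performance_schema.events_stages_summary_global_by_event_name
WHERE EVENT_NAME LIKE '%auto-rejoin%';
```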

Auto-rejoin procedure status

The auto-rejoin procedure status, i.e. whether it is running or not, can be looked up simply by filtering for active stage events that contain the “auto-rejoin” string:

 SELECT COUNT(*) FROM performance_schema.events_stages_current
WHERE EVENT_NAME LIKE '%auto-rejoin%';

COUNT(*)
1

In this case there was an auto-rejoin procedure running on the server.

Number of retries so far

If there is an auto-rejoin procedure under way, we can check the number of retries attempted so far by selecting the number of work units completed on the stage event:

 SELECT WORK_COMPLETED FROM performance_schema.events_stages_current WHERE
EVENT_NAME LIKE '%auto-rejoin%';

WORK_COMPLETED
1

In this example there was only one attempt so far.

Estimated time remaining until next retry

Between each rejoin attempt the server will be in an interruptible sleep of 5 minutes. The time a rejoin attempt takes from start until it either succeeds or fails is variable and cannot be estimated. So, to get a rough estimate of the time remaining until the next retry, we can multiply the number of retries attempted so far by 5 minutes and subtract the time taken by the stage event so far:

-- 300 seconds per retry attempted so far, minus the elapsed time of the
-- stage event (TIMER_WAIT is in picoseconds, hence the 1e-12 factor).
SELECT (300.0 * WORK_COMPLETED) - (TIMER_WAIT * 1e-12) AS time_remaining
FROM performance_schema.events_stages_current
WHERE EVENT_NAME LIKE '%auto-rejoin%';

time_remaining
30.0

So in this example there are still 30 seconds until the next rejoin attempt. Keep in mind that all time accounting in the performance schema tables is kept with picosecond precision, so we scaled TIMER_WAIT to seconds.
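
To make the arithmetic concrete, here is the same calculation with made-up numbers: one retry attempted so far and a stage event that has been running for 270 seconds (270,000,000,000,000 picoseconds). Both values are illustrative, not taken from a real server.

```sql
-- 1 retry * 300 seconds, minus 270 seconds already elapsed:
-- roughly 30 seconds left until the next attempt.
SELECT (300.0 * 1) - (270000000000000 * 1e-12) AS time_remaining;
```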

Trade-offs of using auto-rejoin vs. the expel timeout

So far in this post we’ve only looked at auto-rejoin. In truth, there are two different ways to get a member back into the group automatically after a network glitch: by auto-rejoin or by setting the expel timeout. The latter configures a timeout between the member being suspected (meaning that some member in the group sees that another member is not responding in a timely fashion and suspects that it is dead) and actually being expelled from the group. This effectively delays the removal of the member from the group and, if set to a high enough interval, increases the chances that the connection is reestablished and the member is able to interact with the group again.

Though these two features achieve the same goal, the way they work is different and there are trade-offs to using them. By using the expel timeout you keep the suspected member in the group (it appears to the other members as UNREACHABLE), with the downside that you can’t add or remove members or elect a new primary. On the other hand, by using auto-rejoin the member won’t be part of the group any longer and will stay in super_read_only mode until it manages to rejoin the group. During this time, however, the likelihood of stale reads on the rejoining member increases. You can also monitor the auto-rejoin procedure, whereas the expel timeout cannot really be monitored.
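
While either mechanism is at work, a quick way to see what the group currently looks like, and whether the local server is still read-only, is to query the group membership table and the super_read_only variable. This is a sketch; which columns you select is up to you.

```sql
-- Run on a member that is still in the group:
-- a suspected member shows up with MEMBER_STATE = 'UNREACHABLE'.
SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE
FROM performance_schema.replication_group_members;

-- Run on the rejoining member: 1 means it is still in super_read_only mode.
SELECT @@GLOBAL.super_read_only;
```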

So, to summarize:

Pros of expel timeout
- The member is inside the group the whole time
- Might be more suitable for small enough network glitches
Cons of expel timeout
- You can’t add/remove members on the group while a member is suspected
- You can’t elect a new primary while a member is suspected
- You can’t monitor this process
Pros of auto-rejoin
- The group will function without the rejoining member and you can add/remove members and elect new primaries
- You can monitor the process
Cons of auto-rejoin
- You increase the likelihood of stale reads on the rejoining member
- Possibly not as suitable for small enough network glitches

All in all, what do I get from enabling auto-rejoin?

By enabling auto-rejoin you can diminish the need for manual intervention on
the MySQL instance, which in and of itself is a win in many systems. Your system
will become more resilient to transient network failures, all while keeping
your fault-tolerance and high-availability guarantees.

Summary

We introduced a new system variable called group_replication_autorejoin_tries
that allows the user to set the number of times a GR member tries to rejoin the
group after it has been expelled or lost contact with a majority of the group.
This auto-rejoin procedure is off by default. It essentially helps the user to
avoid manual intervention on a GR member when facing transient network issues.