You can use the
group_replication_member_expel_timeout
system variable to allow additional time between the creation of
a suspicion and the expulsion of the suspect member. A suspicion
is created when one server does not receive messages from
another server, as explained in
Section 20.1.4.2, “Failure Detection”.
There is an initial 5-second detection period before a Group
Replication group member creates a suspicion of another member
(or of itself). A group member is then expelled when another
member's suspicion of it (or its own suspicion of itself) times
out. A further short period of time might elapse after that
before the expelling mechanism detects and implements the
expulsion.
group_replication_member_expel_timeout
specifies the period of time in seconds, called the expel
timeout, that a group member waits between creating a suspicion,
and expelling the suspected member. Suspect members are listed
as UNREACHABLE
during this waiting period,
but are not removed from the group's membership list.
If a suspect member becomes active again before the suspicion times out at the end of the waiting period, the member applies all the messages that were buffered by the remaining group members in XCom's message cache and enters
ONLINE
state, without operator intervention. In this situation, the member is considered by the group as the same incarnation.If a suspect member becomes active only after the suspicion times out and is able to resume communications, it receives a view where it is expelled and at that point realises it was expelled. You can use
group_replication_autorejoin_tries
to make the member try to rejoin the group automatically at this point. This feature is active by default in MySQL 8.4, and the member makes three auto-rejoin attempts. If the auto-rejoin procedure does not succeed or is not attempted, the expelled member then follows the exit action specified bygroup_replication_exit_state_action
.
The waiting period before expelling a member only applies to members that have previously been active in the group. Non-members that were never active in the group do not get this waiting period and are removed after the initial detection period because they took too long to join.
If
group_replication_member_expel_timeout
is set to 0, there is no waiting period, and a suspected member
is liable for expulsion immediately after the initial 5-second
detection period ends. The default is 5, meaning that a
suspected member is liable for expulsion 5 seconds after the
5-second detection period has expired. It is not mandatory for
all members of a group to have the same setting for
group_replication_member_expel_timeout
,
but it is recommended in order to avoid unexpected expulsions.
Any member can create a suspicion of any other member, including
itself, so the effective expel timeout is that of the member
with the lowest setting.
Consider increasing the value of
group_replication_member_expel_timeout
from the default in the following scenarios:
The network is slow and the default 5 or 10 seconds before expulsion is not long enough for group members to always exchange at least one message.
The network sometimes has transient outages and you want to avoid unnecessary expulsions and primary member changes at these times.
The network is not under your direct control and you want to minimize the need for operator intervention.
A temporary network outage is expected and you do not want some or all of the members to be expelled due to this.
An individual machine is experiencing a slowdown and you do not want it to be expelled from the group.
You can specify an expel timeout up to a maximum of 3600 seconds
(1 hour). It is important to ensure that XCom's message cache is
sufficiently large to contain the expected volume of messages in
your specified time period, plus the initial 5-second detection
period, otherwise members cannot reconnect. You can adjust the
cache size limit using the
group_replication_message_cache_size
system variable. For more information, see
Section 20.7.6, “XCom Cache Management”.
If any members in a group are currently under suspicion, the
group membership cannot be reconfigured (by adding or removing
members or electing a new leader). If group membership changes
need to be implemented while one or more members are under
suspicion, and you want the suspect members to remain in the
group, take any actions required to make the members active
again, if that is possible. If you cannot make the members
active again and you want them to be expelled from the group,
you can force the suspicions to time out immediately. Do this by
changing the value of
group_replication_member_expel_timeout
on any active members to a value lower than the time that has
already elapsed since the suspicions were created. The suspect
members then become liable for expulsion immediately.
If a replication group member stops unexpectedly and is
immediately restarted (for example, because it was started with
mysqld_safe
), it automatically attempts to
rejoin the group if
group_replication_start_on_boot=on
is set. In this situation, it is possible for the restart and
rejoin attempt to take place before the member's previous
incarnation has been expelled from the group, in which case the
member cannot rejoin. Group Replication automatically uses a
Group Communication System (GCS) feature to retry the rejoin
attempt for the member 10 times, with a 5-second interval
between each retry. This should cover most cases and allow
enough time for the previous incarnation to be expelled from the
group, letting the member rejoin. Note that if the
group_replication_member_expel_timeout
system variable is set to specify a longer waiting period before
the member is expelled, the automatic rejoin attempts might
still not succeed.
For alternative mitigation strategies to avoid unnecessary
expulsions where the
group_replication_member_expel_timeout
system variable is not available, see
Section 20.3.2, “Group Replication Limitations”.