Group Replication’s failure detection mechanism is a distributed service which is able to identify that a server in the group is not communicating with the others, and is therefore suspected of being out of service. If the group’s consensus is that the suspicion is probably true, the group takes a coordinated decision to expel the member. Expelling a member that is not communicating is necessary because the group needs a majority of its members to agree on a transaction or view change. If a member is not participating in these decisions, the group must remove it to increase the chance that the group contains a majority of correctly working members, and can therefore continue to process transactions.
In a replication group, each member has a point-to-point
communication channel to each other member, creating a fully
connected graph. These connections are managed by the group
communication engine (XCom, a Paxos variant) and use TCP/IP
sockets. One channel is used to send messages to the member and
the other channel is used to receive messages from the member.
If a member does not receive messages from another member for 5
seconds, it suspects that the member has failed, and lists the
status of that member as UNREACHABLE
in its
own Performance Schema table
replication_group_members
. Usually,
two members will suspect each other of having failed because
they are each not communicating with the other. It is possible,
though less likely, that member A suspects member B of having
failed but member B does not suspect member A of having failed -
perhaps due to a routing or firewall issue. A member can also
create a suspicion of itself. A member that is isolated from the
rest of the group suspects that all the others have failed.
If a suspicion lasts for more than 10 seconds, the suspecting
member tries to propagate its view that the suspect member is
faulty to the other members of the group. A suspecting member
only does this if it is a notifier, as calculated from its
internal XCom node number. If a member is actually isolated from
the rest of the group, it might attempt to propagate its view,
but that will have no consequences as it cannot secure a quorum
of the other members to agree on it. A suspicion only has
consequences if a member is a notifier, and its suspicion lasts
long enough to be propagated to the other members of the group,
and the other members agree on it. In that case, the suspect
member is marked for expulsion from the group in a coordinated
decision, and is expelled after the waiting period set by the
group_replication_member_expel_timeout
system variable expires and the expelling mechanism detects and
implements the expulsion.
Where the network is unstable and members frequently lose and regain connection to each other in different combinations, it is theoretically possible for a group to end up marking all its members for expulsion, after which the group would cease to exist and have to be set up again. To counter this possibility, the Group Replication Group Communication System (GCS) tracks the group members that have been marked for expulsion, and treats them as if they were in the group of suspected members when deciding if there is a majority. This ensures at least one member remains in the group and the group can continue to exist. When an expelled member has actually been removed from the group, GCS removes its record of having marked the member for expulsion, so that the member can rejoin the group if it is able to do so.
For information on the Group Replication system variables that you can configure to specify the responses of working group members to failure situations, and the actions taken by group members that are suspected of having failed, see Section 20.7.7, “Responses to Failure Detection and Network Partitioning”.