Group Replication's failure detection mechanism is designed to identify group members that are no longer communicating with the group, and expel them as and when it seems likely that they have failed. Having a failure detection mechanism increases the chance that the group contains a majority of correctly working members, and that requests from clients are therefore processed correctly.
Normally, all group members regularly exchange messages with all other group members. If a group member does not receive any messages from a particular fellow member for 5 seconds, when this detection period ends, it creates a suspicion of the fellow member. When a suspicion times out, the suspected member is assumed to have failed, and is expelled from the group. An expelled member is removed from the membership list seen by the other members, but it does not know that it has been expelled from the group, so it sees itself as online and the other members as unreachable. If the member has not in fact failed (for example, because it was just disconnected due to a temporary network issue) and it is able to resume communication with the other members, it receives a view containing the information that it has been expelled from the group.
The responses of group members, including the failed member itself, to these situations can be configured at a number of points in the process. By default, the following behaviors happen if a member is suspected of having failed:
In MySQL 9.1, when a suspicion is created, a waiting period of 5 seconds is added before the suspicion times out and the suspected member is liable for expulsion.
If an expelled member resumes communication and realises that it was expelled, it automatically makes three attempts to rejoin the group (with 5 minutes between each attempt); if this auto-rejoin procedure does not work, it then stops trying to rejoin the group.
When an expelled member is not trying to rejoin the group, it switches to super-read-only mode and awaits operator attention.
You can use the Group Replication configuration options described in this section to change these behaviors either permanently or temporarily, to suit your system's requirements and your priorities. If you are experiencing unnecessary expulsions caused by slower networks or machines, networks with a high rate of unexpected transient outages, or planned network outages, consider increasing the expel timeout and auto-rejoin attempts. While a member is undergoing any of the default behaviors described previously, although it does not accept writes, reads can still be performed if the member is still communicating with clients, with an increasing likelihood of stale reads over time. If avoiding stale reads is a higher priority for you than avoiding operator intervention, consider reducing the expel timeout and auto-rejoin attempts or setting them to zero.
Members that have not failed might lose contact with part, but not all, of the replication group due to a network partition. For example, in a group of 5 servers (S1, S2, S3, S4, S5), if there is a disconnection between (S1, S2) and (S3, S4, S5) there is a network partition. The first group (S1, S2) is now in a minority because it cannot contact more than half of the group. Any transactions that are processed by the members in the minority group are blocked, because the majority of the group is unreachable, therefore the group cannot achieve quorum. For a detailed description of this scenario, see Section 20.7.8, “Handling a Network Partition and Loss of Quorum”. In this situation, the default behavior is for the members in both the minority and the majority to remain in the group, continue to accept transactions (although they are blocked on the members in the minority), and wait for operator intervention. This behavior is also configurable.
Note that where group members are at an older MySQL Server release
that does not support a relevant setting, or at a release with a
different default, they act towards themselves and other group
members according to the default behaviors stated above. For
example, a member that does not support the
group_replication_member_expel_timeout
system variable expels other members as soon as an expired
suspicion is detected, and this expulsion is accepted by other
members even if they support the system variable and have a longer
timeout set.