Documentation Home
MySQL 9.1 Reference Manual
Related Documentation Download this Manual
PDF (US Ltr) - 40.3Mb
PDF (A4) - 40.4Mb
Man Pages (TGZ) - 259.3Kb
Man Pages (Zip) - 366.4Kb
Info (Gzip) - 4.0Mb
Info (Zip) - 4.0Mb


MySQL 9.1 Reference Manual  /  ...  /  Responses to Failure Detection and Network Partitioning

20.7.7 Responses to Failure Detection and Network Partitioning

Group Replication's failure detection mechanism is designed to identify group members that are no longer communicating with the group, and expel them as and when it seems likely that they have failed. Having a failure detection mechanism increases the chance that the group contains a majority of correctly working members, and that requests from clients are therefore processed correctly.

Normally, all group members regularly exchange messages with all other group members. If a group member does not receive any messages from a particular fellow member for 5 seconds, when this detection period ends, it creates a suspicion of the fellow member. When a suspicion times out, the suspected member is assumed to have failed, and is expelled from the group. An expelled member is removed from the membership list seen by the other members, but it does not know that it has been expelled from the group, so it sees itself as online and the other members as unreachable. If the member has not in fact failed (for example, because it was just disconnected due to a temporary network issue) and it is able to resume communication with the other members, it receives a view containing the information that it has been expelled from the group.

The responses of group members, including the failed member itself, to these situations can be configured at a number of points in the process. By default, the following behaviors happen if a member is suspected of having failed:

  1. In MySQL 9.1, when a suspicion is created, a waiting period of 5 seconds is added before the suspicion times out and the suspected member is liable for expulsion.

  2. If an expelled member resumes communication and realises that it was expelled, it automatically makes three attempts to rejoin the group (with 5 minutes between each attempt); if this auto-rejoin procedure does not work, it then stops trying to rejoin the group.

  3. When an expelled member is not trying to rejoin the group, it switches to super-read-only mode and awaits operator attention.

You can use the Group Replication configuration options described in this section to change these behaviors either permanently or temporarily, to suit your system's requirements and your priorities. If you are experiencing unnecessary expulsions caused by slower networks or machines, networks with a high rate of unexpected transient outages, or planned network outages, consider increasing the expel timeout and auto-rejoin attempts. While a member is undergoing any of the default behaviors described previously, although it does not accept writes, reads can still be performed if the member is still communicating with clients, with an increasing likelihood of stale reads over time. If avoiding stale reads is a higher priority for you than avoiding operator intervention, consider reducing the expel timeout and auto-rejoin attempts or setting them to zero.

Members that have not failed might lose contact with part, but not all, of the replication group due to a network partition. For example, in a group of 5 servers (S1, S2, S3, S4, S5), if there is a disconnection between (S1, S2) and (S3, S4, S5) there is a network partition. The first group (S1, S2) is now in a minority because it cannot contact more than half of the group. Any transactions that are processed by the members in the minority group are blocked, because the majority of the group is unreachable, therefore the group cannot achieve quorum. For a detailed description of this scenario, see Section 20.7.8, “Handling a Network Partition and Loss of Quorum”. In this situation, the default behavior is for the members in both the minority and the majority to remain in the group, continue to accept transactions (although they are blocked on the members in the minority), and wait for operator intervention. This behavior is also configurable.

Note that where group members are at an older MySQL Server release that does not support a relevant setting, or at a release with a different default, they act towards themselves and other group members according to the default behaviors stated above. For example, a member that does not support the group_replication_member_expel_timeout system variable expels other members as soon as an expired suspicion is detected, and this expulsion is accepted by other members even if they support the system variable and have a longer timeout set.