MySQL Group Replication allows you to create a highly available replication group of servers with minimum effort. It provides automated mechanisms to detect and respond to failures in the members of the group. The response depends on the characteristics of each failure and is configurable. To decrease the need for manual user intervention whenever there is a temporary network partition or a server slowdown, in MySQL 8.0.21 we have changed the default values of two system variables:
- group_replication_member_expel_timeout
- group_replication_autorejoin_tries
In this blog post we will review MySQL Group Replication’s mechanisms to detect and respond to failures, how they can be configured and why we changed the default configuration in MySQL 8.0.21.
Introduction
MySQL Group Replication provides consistent database replication across a group of servers. It is continuously available and can tolerate failures as long as the majority of the members are working properly and are able to communicate between themselves. It has an automated failure detection mechanism to identify group members that are no longer communicating and expel them when it is likely that they have failed.
Here is a brief overview of the failure detection mechanism. Each member analyzes the exchanged messages with other members. If it doesn’t receive any message from another member for some time it creates a suspicion. If the majority of the group members (more than half) agree with this suspicion the member is expelled and, as a result, excluded from the group.
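As a rough illustration of the majority rule described above (not the actual implementation inside Group Replication), the following shell sketch checks whether a suspicion is shared by more than half of the group. The function name majority_agrees is hypothetical.

```shell
# Hypothetical helper: succeeds when more than half of the group members
# share the suspicion. $1 = members that agree, $2 = total group size.
majority_agrees() {
  [ $(( $1 * 2 )) -gt "$2" ]
}

# Group of 3: two members suspect the third, so it can be expelled.
if majority_agrees 2 3; then
  echo "2 of 3: suspicion confirmed, member is expelled"
fi

# Group of 3: an isolated member suspecting the others is only 1 of 3.
if ! majority_agrees 1 3; then
  echo "1 of 3: no majority, nobody is expelled"
fi
```

This is also why a member cut off on its own cannot expel the rest of the group: its suspicions never reach a majority.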
In an asynchronous distributed system it is impossible to build a perfect failure detector [Chandra&Toueg(1996)], since there is no bound on the transmission delay of messages or on the processing time of code. It is thus possible that the server of the member suspected of failure is still alive and working correctly but, due to a transient machine slowdown or temporary network failure, takes longer than expected to respond to the other members.
Group Replication provides two features to increase the likelihood of keeping the member in the group during a transient communication failure:
- defer the expulsion of the member
- enable member auto-rejoin.
In MySQL 8.0.21 we have changed the default configuration of these features so that the member will stay longer in the group and in case it leaves it will try to rejoin automatically.
The group waits longer before expelling unreachable members
In MySQL 8.0.13 we have introduced the system variable group_replication_member_expel_timeout. It configures the additional time interval in seconds between the creation of the suspicion of member failure and its expulsion from the group. It was described in detail in the Coping with unreliable failure detection blog post.
Up to and including MySQL 8.0.20, the value of this system variable defaults to 0. Since there is a 5-second waiting period before the suspicion is created, the member is actually expelled 5 seconds after the communication with the group is interrupted. In MySQL 8.0.21 we have increased the default value to 5 seconds, which means that the group will wait 10 seconds before expelling the unreachable member. The member could potentially survive for a further few seconds after this timeout because the check for expired suspicions is carried out periodically.
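The arithmetic behind those figures can be sketched in a few lines of shell. This is only a worked example of the numbers given above; the helper name min_expel_delay is hypothetical, and the result is a lower bound because expired suspicions are checked periodically.

```shell
# Waiting period (seconds) before a suspicion is created, as stated in the text.
SUSPICION_DELAY=5

# Hypothetical helper: minimum seconds between loss of communication and
# expulsion, for a given group_replication_member_expel_timeout value ($1).
min_expel_delay() {
  echo $(( SUSPICION_DELAY + $1 ))
}

echo "expel timeout 0 (default up to 8.0.20): $(min_expel_delay 0) seconds"
echo "expel timeout 5 (default in 8.0.21): $(min_expel_delay 5) seconds"
```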
Let’s observe what happens in a replication group with 3 members after the network connection between the third node and the rest of the group is dropped.
    16:45:29 $ docker network disconnect groupnet node3
The network partition is kept for 10 seconds. Since the duration is not longer than 10 seconds the third member is not expelled from the group. As expected each partition sees the rest of the group as UNREACHABLE:
    16:45:39 $ for N in 1 2 3
    do docker exec -it node$N bin/mysql -uroot --socket=/tmp/mysql.0.sock \
      -e "SHOW VARIABLES WHERE Variable_name = 'hostname';" \
      -e "SELECT MEMBER_HOST, MEMBER_STATE, MEMBER_ROLE, MEMBER_VERSION " \
      -e "FROM performance_schema.replication_group_members;";
    done
    +---------------+-------+
    | Variable_name | Value |
    +---------------+-------+
    | hostname      | node1 |
    +---------------+-------+
    +-------------+--------------+-------------+----------------+
    | MEMBER_HOST | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
    +-------------+--------------+-------------+----------------+
    | node1       | ONLINE       | PRIMARY     | 8.0.21         |
    | node2       | ONLINE       | PRIMARY     | 8.0.21         |
    | node3       | UNREACHABLE  | PRIMARY     | 8.0.21         |
    +-------------+--------------+-------------+----------------+
    +---------------+-------+
    | Variable_name | Value |
    +---------------+-------+
    | hostname      | node2 |
    +---------------+-------+
    +-------------+--------------+-------------+----------------+
    | MEMBER_HOST | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
    +-------------+--------------+-------------+----------------+
    | node1       | ONLINE       | PRIMARY     | 8.0.21         |
    | node2       | ONLINE       | PRIMARY     | 8.0.21         |
    | node3       | UNREACHABLE  | PRIMARY     | 8.0.21         |
    +-------------+--------------+-------------+----------------+
    +---------------+-------+
    | Variable_name | Value |
    +---------------+-------+
    | hostname      | node3 |
    +---------------+-------+
    +-------------+--------------+-------------+----------------+
    | MEMBER_HOST | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
    +-------------+--------------+-------------+----------------+
    | node1       | UNREACHABLE  | PRIMARY     | 8.0.21         |
    | node2       | UNREACHABLE  | PRIMARY     | 8.0.21         |
    | node3       | ONLINE       | PRIMARY     | 8.0.21         |
    +-------------+--------------+-------------+----------------+
Then the network connectivity is restored:
    16:45:39 $ docker network connect groupnet node3
Shortly after the third member becomes ONLINE:
    16:45:40 $ for N in 1 2 3
    do docker exec -it node$N bin/mysql -uroot --socket=/tmp/mysql.0.sock \
      -e "SHOW VARIABLES WHERE Variable_name = 'hostname';" \
      -e "SELECT MEMBER_HOST, MEMBER_STATE, MEMBER_ROLE, MEMBER_VERSION " \
      -e "FROM performance_schema.replication_group_members;";
    done
    +---------------+-------+
    | Variable_name | Value |
    +---------------+-------+
    | hostname      | node1 |
    +---------------+-------+
    +-------------+--------------+-------------+----------------+
    | MEMBER_HOST | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
    +-------------+--------------+-------------+----------------+
    | node1       | ONLINE       | PRIMARY     | 8.0.21         |
    | node2       | ONLINE       | PRIMARY     | 8.0.21         |
    | node3       | ONLINE       | PRIMARY     | 8.0.21         |
    +-------------+--------------+-------------+----------------+
    +---------------+-------+
    | Variable_name | Value |
    +---------------+-------+
    | hostname      | node2 |
    +---------------+-------+
    +-------------+--------------+-------------+----------------+
    | MEMBER_HOST | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
    +-------------+--------------+-------------+----------------+
    | node1       | ONLINE       | PRIMARY     | 8.0.21         |
    | node2       | ONLINE       | PRIMARY     | 8.0.21         |
    | node3       | ONLINE       | PRIMARY     | 8.0.21         |
    +-------------+--------------+-------------+----------------+
    +---------------+-------+
    | Variable_name | Value |
    +---------------+-------+
    | hostname      | node3 |
    +---------------+-------+
    +-------------+--------------+-------------+----------------+
    | MEMBER_HOST | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
    +-------------+--------------+-------------+----------------+
    | node1       | ONLINE       | PRIMARY     | 8.0.21         |
    | node2       | ONLINE       | PRIMARY     | 8.0.21         |
    | node3       | ONLINE       | PRIMARY     | 8.0.21         |
    +-------------+--------------+-------------+----------------+
A member still waits forever to reach the majority of the group
Whenever there is a network partition, the group can decide to expel an unreachable member. A member can also decide by itself to voluntarily leave the group if it cannot contact the majority of its members. The duration of the waiting period before leaving the group is set using the system variable group_replication_unreachable_majority_timeout. By default this variable is set to 0, which means that members that find themselves in a minority due to a network partition wait forever rather than leaving the group. This value has not been changed in MySQL 8.0.21. You should keep the default value to avoid having to manually rejoin members that left the group due to a timeout during these network partitions.
An expelled member rejoins automatically
Once a member has either been expelled or has given up waiting to reach the group, if it is able to restore communication with the group there is a way to rejoin it automatically without user intervention: the auto-rejoin feature. This feature was introduced in MySQL 8.0.16 and is configurable with the system variable group_replication_autorejoin_tries. See the Enabling member auto-rejoin in group replication blog post for details.
Up to and including MySQL 8.0.20, its value defaults to 0, that is, the member will not try to rejoin the group. In MySQL 8.0.21 we have changed the default value to 3, meaning that the member makes three automatic attempts to rejoin the group, with a 5-minute interval between each attempt.
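To get a feel for how long a member may keep trying, here is a small worked example in shell. It assumes that every attempt fails and only counts the 5-minute intervals between attempts as described above; the time spent inside each attempt is not included, and the helper name rejoin_wait_minutes is hypothetical.

```shell
# Interval between auto-rejoin attempts in minutes, as stated in the text.
REJOIN_INTERVAL_MIN=5

# Hypothetical helper: minutes spent waiting between attempts when every
# attempt fails, for a given group_replication_autorejoin_tries value ($1).
# Time spent inside the attempts themselves is not counted.
rejoin_wait_minutes() {
  if [ "$1" -le 1 ]; then
    echo 0
  else
    echo $(( ($1 - 1) * REJOIN_INTERVAL_MIN ))
  fi
}

echo "3 tries: at least $(rejoin_wait_minutes 3) minutes of waiting between attempts"
```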
It’s important to note that in the presence of a network partition auto-rejoin will kick in as soon as the network partition is resolved, regardless of the duration of the partition. If a member is expelled, it needs to resume communication with the group in order to learn that it was expelled. Only after the network partition is resolved does it enter the ERROR state and try to rejoin automatically. And since the default value of group_replication_unreachable_majority_timeout is 0, a member waits forever to reach the group majority.
Let’s observe what happens in a replication group with 3 members when the network connection between the third node and the rest of the group is dropped for a duration longer than 10 seconds.
    17:17:21 $ docker network disconnect groupnet node3
The network partition is kept for 30 seconds. Since the elapsed time is longer than 10 seconds the third member is expelled by the rest of the group. But since it doesn’t receive any message from the group it still considers itself as ONLINE.
    17:17:51 $ for N in 1 2 3
    do docker exec -it node$N bin/mysql -uroot --socket=/tmp/mysql.0.sock \
      -e "SHOW VARIABLES WHERE Variable_name = 'hostname';" \
      -e "SELECT MEMBER_HOST, MEMBER_STATE, MEMBER_ROLE, MEMBER_VERSION " \
      -e "FROM performance_schema.replication_group_members;";
    done
    +---------------+-------+
    | Variable_name | Value |
    +---------------+-------+
    | hostname      | node1 |
    +---------------+-------+
    +-------------+--------------+-------------+----------------+
    | MEMBER_HOST | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
    +-------------+--------------+-------------+----------------+
    | node1       | ONLINE       | PRIMARY     | 8.0.21         |
    | node2       | ONLINE       | PRIMARY     | 8.0.21         |
    +-------------+--------------+-------------+----------------+
    +---------------+-------+
    | Variable_name | Value |
    +---------------+-------+
    | hostname      | node2 |
    +---------------+-------+
    +-------------+--------------+-------------+----------------+
    | MEMBER_HOST | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
    +-------------+--------------+-------------+----------------+
    | node1       | ONLINE       | PRIMARY     | 8.0.21         |
    | node2       | ONLINE       | PRIMARY     | 8.0.21         |
    +-------------+--------------+-------------+----------------+
    +---------------+-------+
    | Variable_name | Value |
    +---------------+-------+
    | hostname      | node3 |
    +---------------+-------+
    +-------------+--------------+-------------+----------------+
    | MEMBER_HOST | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
    +-------------+--------------+-------------+----------------+
    | node1       | UNREACHABLE  | PRIMARY     | 8.0.21         |
    | node2       | UNREACHABLE  | PRIMARY     | 8.0.21         |
    | node3       | ONLINE       | PRIMARY     | 8.0.21         |
    +-------------+--------------+-------------+----------------+
Let’s restore the network connection between the third member and the rest of the group:
    17:17:52 $ docker network connect groupnet node3
A few seconds later the third member receives a message with the expulsion and enters the ERROR state. Since auto-rejoin is enabled it will try to rejoin the group automatically.
    17:17:58 $ for N in 1 2 3
    do docker exec -it node$N bin/mysql -uroot --socket=/tmp/mysql.0.sock \
      -e "SHOW VARIABLES WHERE Variable_name = 'hostname';" \
      -e "SELECT MEMBER_HOST, MEMBER_STATE, MEMBER_ROLE, MEMBER_VERSION " \
      -e "FROM performance_schema.replication_group_members;";
    done
    +---------------+-------+
    | Variable_name | Value |
    +---------------+-------+
    | hostname      | node1 |
    +---------------+-------+
    +-------------+--------------+-------------+----------------+
    | MEMBER_HOST | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
    +-------------+--------------+-------------+----------------+
    | node1       | ONLINE       | PRIMARY     | 8.0.21         |
    | node2       | ONLINE       | PRIMARY     | 8.0.21         |
    +-------------+--------------+-------------+----------------+
    +---------------+-------+
    | Variable_name | Value |
    +---------------+-------+
    | hostname      | node2 |
    +---------------+-------+
    +-------------+--------------+-------------+----------------+
    | MEMBER_HOST | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
    +-------------+--------------+-------------+----------------+
    | node1       | ONLINE       | PRIMARY     | 8.0.21         |
    | node2       | ONLINE       | PRIMARY     | 8.0.21         |
    +-------------+--------------+-------------+----------------+
    +---------------+-------+
    | Variable_name | Value |
    +---------------+-------+
    | hostname      | node3 |
    +---------------+-------+
    +-------------+--------------+-------------+----------------+
    | MEMBER_HOST | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
    +-------------+--------------+-------------+----------------+
    | node3       | ERROR        |             | 8.0.21         |
    +-------------+--------------+-------------+----------------+
After a few seconds the third member successfully rejoins the group and becomes ONLINE.
    17:18:04 $ for N in 1 2 3
    do docker exec -it node$N bin/mysql -uroot --socket=/tmp/mysql.0.sock \
      -e "SHOW VARIABLES WHERE Variable_name = 'hostname';" \
      -e "SELECT MEMBER_HOST, MEMBER_STATE, MEMBER_ROLE, MEMBER_VERSION " \
      -e "FROM performance_schema.replication_group_members;";
    done
    +---------------+-------+
    | Variable_name | Value |
    +---------------+-------+
    | hostname      | node1 |
    +---------------+-------+
    +-------------+--------------+-------------+----------------+
    | MEMBER_HOST | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
    +-------------+--------------+-------------+----------------+
    | node1       | ONLINE       | PRIMARY     | 8.0.21         |
    | node2       | ONLINE       | PRIMARY     | 8.0.21         |
    | node3       | ONLINE       | PRIMARY     | 8.0.21         |
    +-------------+--------------+-------------+----------------+
    +---------------+-------+
    | Variable_name | Value |
    +---------------+-------+
    | hostname      | node2 |
    +---------------+-------+
    +-------------+--------------+-------------+----------------+
    | MEMBER_HOST | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
    +-------------+--------------+-------------+----------------+
    | node1       | ONLINE       | PRIMARY     | 8.0.21         |
    | node2       | ONLINE       | PRIMARY     | 8.0.21         |
    | node3       | ONLINE       | PRIMARY     | 8.0.21         |
    +-------------+--------------+-------------+----------------+
    +---------------+-------+
    | Variable_name | Value |
    +---------------+-------+
    | hostname      | node3 |
    +---------------+-------+
    +-------------+--------------+-------------+----------------+
    | MEMBER_HOST | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
    +-------------+--------------+-------------+----------------+
    | node1       | ONLINE       | PRIMARY     | 8.0.21         |
    | node2       | ONLINE       | PRIMARY     | 8.0.21         |
    | node3       | ONLINE       | PRIMARY     | 8.0.21         |
    +-------------+--------------+-------------+----------------+
If after 3 attempts the member cannot rejoin the group it proceeds to the action specified by the variable group_replication_exit_state_action. The behavior of this variable was explained in the Fail early fail fast preventing stale reads in group replication blog post, so it is not discussed here. At this point, in order to rejoin a member you need to add it manually or have a script to do it automatically.
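A script for that manual step could look roughly like the sketch below. It assumes the same Docker setup and client invocation used in the examples above, that the exit state action left the server running (for example in read-only mode), and that the recovery channel credentials are already configured on the member. The function names rejoin_sql and rejoin_member are hypothetical.

```shell
# Statements that restart Group Replication on a member stuck in ERROR state.
rejoin_sql() {
  printf '%s\n' "STOP GROUP_REPLICATION;" "START GROUP_REPLICATION;"
}

# Hypothetical wrapper: run the statements on one container ($1), reusing the
# client invocation shown earlier in this post. Defined here but not called.
rejoin_member() {
  docker exec -i "$1" bin/mysql -uroot --socket=/tmp/mysql.0.sock -e "$(rejoin_sql)"
}

# Print the statements without touching any server.
rejoin_sql
```

In this setup, calling rejoin_member node3 would restart Group Replication on the third member. Keeping the SQL in its own function makes it easy to review what will run before pointing it at a real server.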
Trade-offs of these changes
Increasing the value of group_replication_member_expel_timeout has its trade-offs, since a member might stay longer in the UNREACHABLE state. While there is at least one unreachable member, the group has the following limitations:
- Group membership re-configurations aren’t allowed (members cannot be added or removed)
- For single-primary mode a new group primary cannot be elected
- Although the unreachable member does not accept writes, reads can still be made as long as it is able to communicate with its clients, which increases the likelihood of stale reads
- If MySQL Group Replication consistency level is equal to AFTER or BEFORE_AND_AFTER then a write transaction must wait for the unreachable member to become online and apply it
If you cannot afford these limitations and want to prioritize data consistency over minimum user intervention then set the group_replication_member_expel_timeout system variable to 0.
    SET PERSIST group_replication_member_expel_timeout = 0;
If an unreachable member resumes communication with the group, it will use the XCom cache to retrieve missed messages exchanged by the group while it was away. The cache size limit is set using the system variable group_replication_message_cache_size. If you previously tuned the size of the XCom message cache with reference to the expected volume of messages during the previous default waiting period before member expulsion (5 seconds), you might start to see warning messages from GCS on active group members stating that a message that is likely to be needed for recovery by an unreachable member has been removed from the message cache. If this situation occurs, you should increase your group_replication_message_cache_size setting to account for the new expel timeout.
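As a starting point for that adjustment, one could scale the cache size in proportion to the longer waiting period. The linear scaling below is only an assumption for illustration, since the real requirement depends on your message volume; the helper name scaled_cache_size is hypothetical and the 1 GiB figure is just an example input.

```shell
# Hypothetical helper: scale a cache size tuned for one waiting period to a
# new waiting period. $1 = old size in bytes, $2 = old wait (s), $3 = new wait (s).
scaled_cache_size() {
  echo $(( $1 * $3 / $2 ))
}

# Example: a cache tuned to 1 GiB for the old 5-second wait, now 10 seconds.
echo "SET PERSIST group_replication_message_cache_size = $(scaled_cache_size 1073741824 5 10);"
```

The printed statement can then be reviewed and run on each member. The sketch relies on 64-bit shell arithmetic.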
If the member is expelled and resumes communication with the group, auto-rejoin will start. While auto-rejoin is happening the group can function normally: members can be added and removed and a new primary can be elected. The constraints on transaction concurrency when the consistency level is AFTER or BEFORE_AND_AFTER are also absent. However, clients can still connect to the expelled member and read stale data. So if you want to avoid stale reads for any period of time, set the group_replication_autorejoin_tries system variable to 0.
    SET PERSIST group_replication_autorejoin_tries = 0;
Summary
To summarize, in MySQL 8.0.21 we have changed the default response of Group Replication to temporary network partitions and server slowdowns so that:
- the replication group waits at least 10 seconds before expelling a member to which it cannot connect
- a member which left the group makes 3 attempts to automatically rejoin
With this change you won’t need to manually rejoin members to the group which are expelled due to transient failures.
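To check which values a given member is actually running with, a query along the following lines can help. This sketch assumes the Docker setup and client invocation from the examples above and that the Group Replication plugin is loaded; the function names settings_sql and show_settings are hypothetical.

```shell
# Query listing the settings discussed in this post.
settings_sql() {
  printf '%s\n' \
    "SELECT @@GLOBAL.group_replication_member_expel_timeout AS expel_timeout," \
    "       @@GLOBAL.group_replication_autorejoin_tries AS autorejoin_tries," \
    "       @@GLOBAL.group_replication_unreachable_majority_timeout AS unreachable_majority_timeout;"
}

# Hypothetical wrapper: run the query on one container ($1). Defined here but not called.
show_settings() {
  docker exec -i "$1" bin/mysql -uroot --socket=/tmp/mysql.0.sock -e "$(settings_sql)"
}

# Print the query without touching any server.
settings_sql
```

In this setup, show_settings node1 would print the three values for the first member.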