More Robust Network Partition Handling in Group Replication

As Group Replication (GR) matures and is deployed in a myriad of different systems, we witness specific network conditions that we must be able to cope with in order to make Group Replication more tolerant and robust to failures.

This post describes the enhancements that were made to handle these scenarios and guarantee that GR continues to provide service.

Gradual Group Degradation support

The presence of an unstable network can cause nodes to constantly drop the network connections of XCom’s fully connected mesh. Unstable networks can be seen in cloud environments, in internal networks with slowly degrading network switches, or in congested networks. This degradation accumulates over time and can lead to a full group partition.

Let us consider:

  • A group {A,B,C};
  • A flaky network among group members;
  • An expel timeout configured to a large value;

When the network degrades, nodes can form partial majorities, for instance {A,B} and {B,C}, and raise inconsistent suspicions about each other. Those suspicions are stored and become active internally in the Group Communication System (GCS).

If the group loses its majority and becomes fully partitioned, {A}{B}{C}, the internal suspicions will fire and issue expel commands that will be stuck in XCom waiting for a majority. If the group recovers to {A,B,C}, those messages will unblock, but they are no longer valid in the current context. This can lead to a full group disband.

XCom is our Paxos implementation, and GCS is a layer on top of it that implements services such as keeping track of the dynamic membership and acting on suspicions. XCom itself does not deal with healing the group. GCS is the component that keeps some memory of the group, and therefore GCS decides whether a node should be expelled or not, based on the messages arriving from the Paxos stream.

The solution we came up with is for GCS to keep track, in its internal memory, of the members whose expulsion it has already ordered. GCS treats those members as if they no longer belong to the current view when considering a majority to expel another node. This means that the considered quorum, i.e. the minimum number of votes needed to have a majority in the group to issue an expel request, is computed over the currently active group members minus the members already deemed to be expelled. If we cannot reach that majority, the expel request is not issued, and the problem described above is solved.

Eventually, we have to clean up our memory of the members whose expulsion we have already ordered. We do so once the expulsion has taken place, i.e. once the member is no longer part of the group.
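To make this concrete, here is a minimal sketch of that bookkeeping. It is not the actual GCS code; the Expel_tracker class, the Member_id alias, and the method names are hypothetical, and only illustrate the quorum rule (active members minus members already ordered to be expelled) and the clean-up once an expelled member leaves the view.

```cpp
// Hypothetical sketch of the expel bookkeeping described above.
#include <cstddef>
#include <set>
#include <string>

using Member_id = std::string;  // hypothetical identifier for a group member

class Expel_tracker {
 public:
  // Remember that we have already ordered the expulsion of this member.
  void record_expel_ordered(const Member_id &member) {
    m_expels_ordered.insert(member);
  }

  // Forget the member once the expulsion took effect, i.e. it left the view.
  void on_member_left_view(const Member_id &member) {
    m_expels_ordered.erase(member);
  }

  // Decide whether we still have a majority to issue a new expel request.
  // Members whose expulsion we already ordered do not count towards the view
  // size, so a partitioned node cannot end up disbanding the whole group.
  bool have_majority_to_expel(const std::set<Member_id> &current_view,
                              const std::set<Member_id> &reachable) const {
    std::size_t effective_view_size = 0;
    std::size_t votes = 0;
    for (const Member_id &m : current_view) {
      if (m_expels_ordered.count(m) != 0) continue;  // treated as gone
      ++effective_view_size;
      if (reachable.count(m) != 0) ++votes;
    }
    return votes > effective_view_size / 2;  // strict majority required
  }

 private:
  std::set<Member_id> m_expels_ordered;
};
```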

Improved node-to-node connection management

In XCom, there were two ways to state that a server was active from any other server's point of view:

a) we actually send data to server S via socket primitives, or
b) we intend to send data to server S, by pushing a message to its outgoing queue.

The problem that we observed happened in a group {node1,node2,node3}. node1 is either network partitioned from the group or crashes. The group expels node1, and the group is now {node2,node3}.
node1's network connectivity is reestablished and a new node1 incarnation asks to join the group. node1 is added back to the XCom low-level configuration, but times out in the full GCS join process, i.e. the state exchange. GCS tries to join again, but the request is denied because node1 is already in the XCom configuration.

This happens because both node2 and node3 never broke the original connection to node1's first incarnation. node1 was constantly being marked as active solely because of the intention to send a message to it, i.e. scenario b) above. The socket writes of scenario a) were failing, which should have marked node1 as inactive, but scenario b) kept contradicting that.

The solution was to remove scenario b), meaning that we only consider a server to be active when we are actually able to physically write to its connection. If we are not able to send a message because the connection is broken, or because the socket send buffers are depleted and poll() returns an error, the node is marked as non-active and the connection is forcefully shut down.
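The following is a minimal sketch of this rule, not XCom's actual code: the Peer struct and the shutdown_peer and send_to_peer helpers are hypothetical, and only illustrate that a node stays active only while we can physically write to its socket, and that a failed write or poll() error tears the connection down.

```cpp
// Hypothetical sketch (Linux sockets): only a successful physical write
// keeps a node marked as active; queuing a message no longer counts.
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#include <cerrno>
#include <cstddef>

struct Peer {
  int fd = -1;          // socket to the remote node
  bool active = false;  // is the node considered active?
};

// Mark the peer as non-active and forcefully shut the connection down.
static void shutdown_peer(Peer &peer) {
  peer.active = false;
  if (peer.fd >= 0) {
    ::shutdown(peer.fd, SHUT_RDWR);
    ::close(peer.fd);
    peer.fd = -1;
  }
}

// Try to physically write a message to the peer.
static bool send_to_peer(Peer &peer, const char *buf, std::size_t len) {
  if (peer.fd < 0) return false;

  ssize_t sent = ::send(peer.fd, buf, len, MSG_NOSIGNAL);
  if (sent < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
    // Send buffer is full: wait briefly for the socket to become writable
    // and retry once. A poll() error or timeout counts as a failure.
    pollfd pfd{peer.fd, POLLOUT, 0};
    if (::poll(&pfd, 1, /*timeout ms=*/1000) > 0 && (pfd.revents & POLLOUT))
      sent = ::send(peer.fd, buf, len, MSG_NOSIGNAL);
  }

  if (sent == static_cast<ssize_t>(len)) {
    peer.active = true;  // scenario a): data actually went out on the wire
    return true;
  }

  // Broken connection, depleted buffers, or poll() error: mark the node as
  // non-active and close the socket (partial writes are treated as failures
  // to keep the sketch short).
  shutdown_peer(peer);
  return false;
}
```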

Consistent Connection Status across the group

As you all know, GCS/XCom keeps a full mesh of connections among its group members. This means that in a group with 2 members, {node1,node2}, we have a node1 -> node2 connection and a node2 -> node1 connection. Due to either misconfiguration or real network issues, we could end up in a situation where one direction of the connection is broken.

Let us consider that the node1 -> node2 connection is broken. node1 won't be able to send messages to node2, but node2 will be able to send messages to node1. node2 will mark node1 as UNREACHABLE, while node1 will see both node1 and node2 as ONLINE until its send buffer is depleted and the connection breaks, which can take some time. This leads to an inconsistent state in the group's P_S tables.

To solve this, we rely on an existing mechanism, which sends are_you_alive messages to nodes that we think have died. In the previous example, node2 will start sending those messages to node1. Since node1 sees node2 as healthy, it will start a counter. If it receives a certain number of are_you_alive messages within a certain time interval, it will cut off the connections, and you will see the following message in node1's log:

"Shutting down an outgoing connection. This happens because something might be wrong on a bi-directional connection to node {node_address}:{node_port}.Please check the connection status to this member"

The P_S table status in node1 regarding node2 will also change to UNREACHABLE.
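A minimal sketch of this counter follows, under the assumption of hypothetical names (Alive_probe_monitor, kMaxProbes, kWindow) and thresholds, since the real values and plumbing live inside GCS: if a member we consider healthy keeps probing us with are_you_alive messages, we assume the other direction of the connection is broken, close our outgoing connection, and flag that member as UNREACHABLE.

```cpp
// Hypothetical sketch of the are_you_alive counter described above.
#include <chrono>
#include <cstdio>
#include <map>
#include <string>

using Clock = std::chrono::steady_clock;

class Alive_probe_monitor {
 public:
  // Hypothetical thresholds: the real values live inside GCS.
  static constexpr int kMaxProbes = 3;
  static constexpr std::chrono::seconds kWindow{10};

  // Called whenever an are_you_alive probe arrives from a member that we
  // currently see as healthy (ONLINE).
  void on_are_you_alive(const std::string &member_address) {
    const auto now = Clock::now();
    auto &state = m_probes[member_address];

    if (now - state.window_start > kWindow) {
      // The time window expired: start counting again.
      state.window_start = now;
      state.count = 0;
    }

    if (++state.count >= kMaxProbes) {
      // Too many probes in the window: the other side cannot hear us.
      std::printf(
          "Something might be wrong on the bi-directional connection to "
          "node %s; shutting down the outgoing connection.\n",
          member_address.c_str());
      // ... close the outgoing connection and mark the member UNREACHABLE
      // in the P_S tables (omitted in this sketch).
      m_probes.erase(member_address);
    }
  }

 private:
  struct Probe_state {
    Clock::time_point window_start{};
    int count = 0;
  };
  std::map<std::string, Probe_state> m_probes;
};
```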

Conclusion

This blog post presented three major improvements that make Group Replication better at handling transient network failures: gradual group degradation support, improved node-to-node connection management, and consistent connection status across the group. If your setup happens to hit the problematic scenarios, give MySQL 8.0.21 a try and let us know your feedback.