Detailed Description

Interface that defines the operations that state exchange will provide.

The following describes how the state exchange algorithm works and the view change process it belongs to, which is an essential part of our system.

The view change process consists of two major parts:

Adding or removing a node from the system, which is accomplished in the XCom/Paxos layer following "The SMART Way to Migrate Replicated Stateful Services".
A state exchange phase in which all members distribute data among themselves.

Whenever a node wants to join or leave the group, or when a healthy member expels a faulty node from the group after a failure, a reconfiguration request is sent in the form of an add_node or remove_node message.

Once the request succeeds, XCom sends a global view message to all non-faulty members. The message contains information on all nodes and tags each one as alive or faulty. MySQL GCS looks at this information and computes who has joined and who has left the group. The computation is simple: it compares the set of nodes received in the current view with the set of nodes in the previous view:

. left nodes = (alive_members in old_set) - (alive_members in new_set)

. joined nodes = (alive_members in new_set) - (alive_members in old_set)
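The two set differences above can be sketched in C++ as follows. This is a minimal illustration, not the code in gcs_xcom_state_exchange.h: the names MemberSet, ViewDelta, difference and compute_view_delta are hypothetical, and members are represented as plain strings for simplicity.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <string>

// Hypothetical type: identifiers of the members tagged as alive in one view.
using MemberSet = std::set<std::string>;

// Nodes that left and nodes that joined between two consecutive views.
struct ViewDelta {
  MemberSet left;
  MemberSet joined;
};

// Returns the elements of `a` that are not in `b` (set difference a - b).
MemberSet difference(const MemberSet &a, const MemberSet &b) {
  MemberSet out;
  std::set_difference(a.begin(), a.end(), b.begin(), b.end(),
                      std::inserter(out, out.end()));
  return out;
}

// left = old - new, joined = new - old, as in the formulas above.
ViewDelta compute_view_delta(const MemberSet &old_alive,
                             const MemberSet &new_alive) {
  ViewDelta delta{difference(old_alive, new_alive),
                  difference(new_alive, old_alive)};
  // Sanity check: both sides count the members present in both views.
  assert(old_alive.size() - delta.left.size() ==
         new_alive.size() - delta.joined.size());
  return delta;
}
```

For example, with an old view of {a, b, c} and a new view of {b, c, d}, the sketch reports a as having left and d as having joined.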

However, the new view is delivered to the upper layer only after all members have exchanged what we call a state message. While the view is being processed and the state exchange is ongoing, incoming data messages are not delivered to the application; they are put into a buffer instead. Once state messages have been received from all members, the view change is delivered to the upper layer along with the content of the state messages, and any buffered messages are delivered afterwards.
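The ordering described here (buffer data messages, wait for one state message per member, deliver the view, then flush the buffer) can be sketched as a small class. This is only an illustration under simplifying assumptions: the class StateExchangeSketch, its method names and the two callbacks are hypothetical and do not reflect the actual interface, and payloads are plain strings.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <map>
#include <set>
#include <string>
#include <utility>

class StateExchangeSketch {
 public:
  using Member = std::string;
  using Payload = std::string;
  using States = std::map<Member, Payload>;

  // on_view: receives the collected state messages when the view is delivered.
  // on_data: receives each data message delivered to the application.
  StateExchangeSketch(std::function<void(const States &)> on_view,
                      std::function<void(const Payload &)> on_data)
      : on_view_(std::move(on_view)), on_data_(std::move(on_data)) {}

  // A new view arrived: expect one state message from every member.
  void begin(const std::set<Member> &members) {
    assert(!members.empty());
    pending_ = members;
    states_.clear();
    exchanging_ = true;
  }

  // Data messages are buffered while the exchange is ongoing.
  void on_data_message(const Payload &msg) {
    if (exchanging_) {
      buffered_.push_back(msg);
    } else {
      on_data_(msg);
    }
  }

  // Collects a state message; the last one triggers view delivery.
  void on_state_message(const Member &from, const Payload &state) {
    if (!exchanging_ || pending_.erase(from) == 0) return;  // not expected
    states_[from] = state;
    if (!pending_.empty()) return;  // still waiting for other members
    exchanging_ = false;
    on_view_(states_);  // the view change goes first...
    while (!buffered_.empty()) {  // ...then the buffered data messages
      on_data_(buffered_.front());
      buffered_.pop_front();
    }
  }

 private:
  std::function<void(const States &)> on_view_;
  std::function<void(const Payload &)> on_data_;
  std::set<Member> pending_;
  States states_;
  std::deque<Payload> buffered_;
  bool exchanging_ = false;
};
```

In this sketch, a data message that arrives between begin() and the last expected state message reaches the application only after the view callback has run, which mirrors the delivery order described above.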

Why block the delivery of data messages, and why exchange these state messages?

Recall that all messages are atomically delivered. Because all new data messages are buffered while the state messages are being exchanged, we can guarantee that all nodes have the same state, which encompasses messages (e.g. transactions) in queues and in storage (e.g. the binary log).

Blocking the delivery of new data messages gives us a synchronization point.

But if all nodes have the same state, why gather a state message from all members?

The power of choice. Let us use MySQL Group Replication as a concrete example of an upper layer to understand why. Having information on all members allows the new node to choose, as its donor in the recovery phase, a member that is not lagging behind (i.e. has a small queue). In addition, the state message carries the IP addresses and ports used to access the MySQL instances. This information is necessary to start the recovery, which runs asynchronously and transfers the missing data from a donor.
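As an illustration of how an upper layer might use the collected state messages, the sketch below picks the member with the smallest queue as donor. Everything in it is an assumption made for the example: the MemberState fields, choose_donor and donor_address are hypothetical, and the real donor selection in MySQL Group Replication may apply different or additional criteria.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical decoded content of one state message.
struct MemberState {
  std::string member_id;
  std::string host;        // address used to reach the MySQL instance
  unsigned int port;       // port used to reach the MySQL instance
  std::size_t queue_size;  // pending work; smaller means less lag
};

// Returns the member with the smallest queue, skipping the joining node
// itself, or nullptr when no other member is available.
const MemberState *choose_donor(const std::vector<MemberState> &states,
                                const std::string &self_id) {
  const MemberState *best = nullptr;
  for (const MemberState &s : states) {
    if (s.member_id == self_id) continue;
    if (best == nullptr || s.queue_size < best->queue_size) best = &s;
  }
  return best;
}

// Formats the endpoint that recovery would connect to.
std::string donor_address(const MemberState &donor) {
  assert(!donor.host.empty());
  return donor.host + ":" + std::to_string(donor.port);
}
```

The returned pointer refers to an element of the input vector, so it stays valid only as long as that vector is neither modified nor destroyed.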

Note that the content of the state message is opaque to the MySQL GCS layer, which only provides a synchronization point.

The documentation for this class was generated from the following file:

plugin/group_replication/libmysqlgcs/src/bindings/xcom/gcs_xcom_state_exchange.h