We are very excited about the next release of MySQL Group Replication 0.9.0 in MySQL 5.7.15 and the great work that has been done to improve its stability. Release after release, MySQL Group Replication has become more stable and more user-friendly, and it has reached a level of maturity that led us to declare 0.9.0 a release candidate. One of the major steps towards a better MySQL Group Replication was the adoption of a homegrown Paxos-based protocol for the communication system. Although this milestone was already announced when versions 0.6.0 and 0.8.0 were released, we still owe you more technical details on the matter.
In a series of blog posts, we will fill this gap. In this post, we describe why we decided to introduce a homegrown Paxos-based implementation and to remove Corosync as the communication system. For those who do not know, we initially used Corosync as our communication system and, for a while, supported both Corosync and our homegrown solution. We will also briefly introduce the technology that allowed us to support different communication systems with almost no difficulty.
1. Why a homegrown Communication System?
1.1. Why not Corosync?
Corosync is a great tool that has been developed for years. It has been successfully deployed by many different projects and we have used it extensively to develop, test and improve MySQL Group Replication. However, we could not continue to use Corosync for Group Replication for three main reasons:
- it does not run on Windows and limiting the number of platforms supported by MySQL with Group Replication was basically not an option as it would be a huge drawback to our users;
- it only provides symmetric encryption (all MySQL Server instances would have to agree on a common key somehow) but security is a major concern for us and we wanted a robust solution, one that was at least as safe as MySQL is today;
- it is not cloud-friendly, since multicast or broadcast over the User Datagram Protocol (UDP) may be a liability and may not be widely supported; we wanted a solution that could be deployed in any public or private cloud, where a multi-tenant environment with shared resources is the default, so we wanted support for the Transmission Control Protocol (TCP).
So, although fine for prototyping, Corosync’s architecture was not a good fit for our ultimate goals, and to address all of our requirements we would eventually have had to rewrite it.
1.2. Why Paxos?
We also wanted a Paxos-based protocol for the communication system, one with a flexible design that could be customized for a variety of scenarios. See Good Leaders are game changers: Raft & Paxos for a brief introduction to Paxos.
In Paxos, processes (i.e. MySQL Server instances) participating in the protocol may assume one or more of the following roles: proposer, acceptor and learner. We believe that the ability to assign different roles to each process will allow us to tune our product to different use cases, thus providing the best possible solutions to our users. Our first goal, however, was simply to overcome the single-leader configuration (i.e. a single proposer) that is quite common in Paxos-based solutions, because the leader may easily become a bottleneck that harms the performance and scalability of the overall solution.
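To make the idea of combinable roles concrete, here is a minimal sketch in Python. It is only an illustration: the `Role` flags, the server names and the `members_with` helper are hypothetical and are not part of Group Replication or its communication system.

```python
from enum import Flag, auto


class Role(Flag):
    """Paxos roles; a process may hold any combination of them."""
    PROPOSER = auto()
    ACCEPTOR = auto()
    LEARNER = auto()


# Hypothetical assignment: two servers hold every role (so there is more
# than one proposer), while a third one only learns decided values.
roles = {
    "server1": Role.PROPOSER | Role.ACCEPTOR | Role.LEARNER,
    "server2": Role.PROPOSER | Role.ACCEPTOR | Role.LEARNER,
    "server3": Role.LEARNER,
}


def members_with(role, assignment):
    """Return the sorted names of the processes that hold the given role."""
    return sorted(name for name, held in assignment.items() if role in held)
```

With this assignment, `members_with(Role.PROPOSER, roles)` lists two proposers, which is the kind of multi-proposer setup that avoids a single leader.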
1.3. Our homegrown solution
Our solution:
- Runs on all platforms supported by MySQL Server;
- Is based on a variant of a Paxos protocol (Mencius) and supports multiple leaders;
- May use Secure Sockets Layer (SSL) as an encryption mechanism;
- Is cloud friendly in the sense that it supports TCP and is elastic.
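As a rough illustration of how a Mencius-style protocol supports multiple leaders, the sketch below follows the published description of Mencius: consensus slots are assigned to the members in round-robin order, and a member with nothing to propose fills its own slot with a no-op so that the others are not blocked. This is a simplification under those assumptions and does not show how our implementation works internally. The names `slot_owner`, `fill_slots` and `NO_OP` are hypothetical.

```python
NO_OP = "no-op"


def slot_owner(slot, members):
    """Each slot has exactly one leader, chosen in round-robin order."""
    return members[slot % len(members)]


def fill_slots(pending, members, n_slots):
    """Build the first n_slots entries of the totally ordered log.

    pending maps a member name to the list of messages it wants to send.
    A member with nothing pending skips its turn with a no-op.
    """
    # Work on copies so the caller's lists are left untouched.
    queues = {m: list(pending.get(m, [])) for m in members}
    log = []
    for slot in range(n_slots):
        owner = slot_owner(slot, members)
        payload = queues[owner].pop(0) if queues[owner] else NO_OP
        log.append((slot, owner, payload))
    return log
```

Because every member leads its own slots, no single server has to sequence all of the traffic.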
In a future blog post, we will dive into our solution, named eXtended COMmunications, or simply XCOM, and describe how it works. Note that it is an internal-only component, not yet a product meant for external usage per se, and its name may change in the future.
2. GCS Interface
From the beginning we did not want to tie Group Replication to any particular group communication system. Our goal was to have a flexible solution that could support different communication systems and that could foster the development of other ones by the vibrant research community in this area.
So we had to define a minimum but still powerful interface along with the expected semantics to support Group Replication. Thankfully, we did not have to start from scratch and we borrowed lots of ideas from Towards a Generic Group Communication Service.
In order to achieve this goal, we have defined a Group Communication System Interface, or simply GCS, which consists of an abstract interface and a different concrete implementation per Communication System. MySQL Group Replication uses the interface to configure the Communication System, get information on the MySQL Server instances in the group, reconfigure the group member set, and send messages to and receive messages from the group. Communication from the GCS to Group Replication happens through the observer pattern: Group Replication is notified when the membership changes (i.e. nodes have joined or left the group), when a message has been delivered, and so on.
The only drawback is that, as one would expect, each Communication System requires its own implementation of this abstract interface. This extra effort, however, soon paid off when we figured out that Corosync would not be a good fit for us and that we needed another solution. Note that this does not imply a big development effort, as the implementation usually consists of translating the information from the Communication System into a format that can be consumed by Group Replication. Note also that this is a thin layer, and for that reason any impact on performance should be negligible.
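The shape of such an interface can be sketched as follows. This is a toy illustration in Python, not the real GCS API: every class and method name here (`GroupListener`, `GroupCommunication`, `LoopbackGroup`, `RecordingListener`) is hypothetical, and the "communication system" is just an in-process dictionary.

```python
from abc import ABC, abstractmethod


class GroupListener(ABC):
    """Observer through which the upper layer is notified of events."""

    @abstractmethod
    def on_view_changed(self, members): ...

    @abstractmethod
    def on_message(self, sender, payload): ...


class GroupCommunication(ABC):
    """Abstract interface; one concrete class per communication system."""

    @abstractmethod
    def join(self, member, listener): ...

    @abstractmethod
    def leave(self, member): ...

    @abstractmethod
    def send(self, sender, payload): ...

    @abstractmethod
    def members(self): ...


class LoopbackGroup(GroupCommunication):
    """Toy single-process implementation, for illustration only."""

    def __init__(self):
        self._listeners = {}

    def join(self, member, listener):
        self._listeners[member] = listener
        self._notify_view()

    def leave(self, member):
        self._listeners.pop(member, None)
        self._notify_view()

    def send(self, sender, payload):
        # Only current group members are allowed to send.
        if sender not in self._listeners:
            raise PermissionError(f"{sender} is not a group member")
        for listener in list(self._listeners.values()):
            listener.on_message(sender, payload)

    def members(self):
        return sorted(self._listeners)

    def _notify_view(self):
        view = self.members()
        for listener in list(self._listeners.values()):
            listener.on_view_changed(view)


class RecordingListener(GroupListener):
    """Listener that simply records the notifications it receives."""

    def __init__(self):
        self.events = []

    def on_view_changed(self, members):
        self.events.append(("view", tuple(members)))

    def on_message(self, sender, payload):
        self.events.append(("msg", sender, payload))
```

Code written against `GroupCommunication` and `GroupListener` does not need to change when `LoopbackGroup` is swapped for another implementation, which is the property the abstract interface is meant to provide.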
3. GCS Semantics
The Communication System along with the Group Communication System Interface must guarantee the following properties:
[Closed Group] – A closed group where only group members can send messages to the group. Although there may be use cases that could benefit from an open group, we believe that shortcomings with this approach may be circumvented by carefully configuring the roles (proposer, acceptor and learner) assumed by the different processes.
[Total Order] – Messages delivered to an application must be totally ordered with respect to each other. This is key to implementing Group Replication, which builds a database state machine where the database state depends exclusively on its current state and the sequence of messages (i.e. transactions) that are processed. For that reason, all MySQL Server instances must process the same sequence of transactions in the same order.
[Safe Delivery] – Messages delivered to an application by a sender (i.e. the member where the message originated) must eventually be delivered by all non-faulty members. This means that the sender cannot deliver a message to its application before knowing for sure that the message has been received by a majority of members. This avoids data loss when the sender delivers a message and then suddenly fails, a feature that Corosync did not support.
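A tiny sketch shows why order matters for a state machine. Here a "transaction" is simplified to a key/value write, which is an assumption made purely for illustration; `apply_all` is a hypothetical helper.

```python
def apply_all(state, transactions):
    """Apply (key, value) writes in order and return the resulting state."""
    state = dict(state)  # do not modify the caller's state
    for key, value in transactions:
        state[key] = value
    return state


transactions = [("x", 1), ("x", 2)]

# Two replicas that start from the same state and process the same
# sequence end up in the same state.
replica_a = apply_all({}, transactions)
replica_b = apply_all({}, transactions)

# A replica that processes the same writes in a different order diverges.
replica_c = apply_all({}, list(reversed(transactions)))
```

Here `replica_a` and `replica_b` are identical, while `replica_c` ends with a different value for `x`.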
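The majority condition can be sketched as a simple check. The helper names (`majority`, `can_deliver`) and the notion of an explicit set of acknowledgements are assumptions made for illustration; they do not describe how the communication system tracks receipt internally.

```python
def majority(group_size):
    """Smallest number of members that forms a majority of the group."""
    return group_size // 2 + 1


def can_deliver(acks, members):
    """True once a majority of the current members has the message.

    acks is the collection of names known to have received the message;
    acknowledgements from processes outside the group are ignored.
    """
    return len(set(acks) & set(members)) >= majority(len(members))
```

In a group of three, for example, the sender has to know that at least one other member holds the message before delivering it locally, so the message survives the sender's failure.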
[View Synchrony] – All non-faulty servers must deliver the same sequence of messages before installing a new configuration (i.e. set of members or new group view). This makes application recovery easier as it provides a synchronization point. Note though that Group Replication does not require that a message sent in one group view must be delivered in the same group view.
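One way to read this property is as a check over per-member event logs. The sketch below is hypothetical: the log format (a list of `("msg", id)` and `("view", id)` entries) and the function names are assumptions made for illustration only.

```python
def delivered_before_view(log, view_id):
    """Messages a member delivered before installing view_id.

    Returns None if the member never installed that view (for example,
    because it failed beforehand).
    """
    seen = []
    for kind, value in log:
        if kind == "view" and value == view_id:
            return seen
        if kind == "msg":
            seen.append(value)
    return None


def view_synchronous(logs, view_id):
    """True if every member that installed view_id delivered the same
    sequence of messages before installing it."""
    prefixes = [
        prefix
        for prefix in (delivered_before_view(log, view_id) for log in logs.values())
        if prefix is not None
    ]
    return all(prefix == prefixes[0] for prefix in prefixes)
```

A member that recovers or joins at the new view can therefore treat the view installation as a point at which all surviving members agree on what has been delivered so far.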
[Ownership transfer] – The GCS, along with the Communication System, is responsible for deallocating any object or memory that Group Replication creates in order to send a message. Conversely, any object or memory created by either the GCS or the Communication System to deliver notifications to Group Replication must be deallocated by Group Replication.
4. Conclusion
You can find MySQL with Group Replication for download at MySQL Labs and instructions on how to set up an instance at MySQL Group Replication: A Quick Start Guide. Please take it for a spin, and if you encounter any issues, file a bug report under the ‘MySQL Server: Group Replication’ category so that we can investigate further.
Note though that this software has not been declared production-ready yet and is for testing purposes only.