WL#9053: Group Replication: Push Group Replication Plugin
Affects: Server-5.7
Status: Complete
EXECUTIVE SUMMARY
=================

This worklog captures the work required to push the Group Replication plugin to mysql-trunk. It also tracks some of the user stories and the high level requirements for MySQL Group Replication.

Background
==========

MySQL Group Replication is a MySQL plugin that brings multi-master update everywhere to MySQL. This plugin ties together concepts and technologies from distributed systems, such as group communication, with traditional database replication. The ultimate result is a seamlessly distributed and replicated database over a set of MySQL servers cooperating together to keep the replicated state strongly consistent.

This document lays out the user stories and the high level requirements for the project. High level requirements are then mapped into worklogs and test cases.

Definitions
===========

- Cluster - Throughout the following text, the word cluster means a group of Group Replication servers.

- Site - Site or server are used interchangeably throughout the text below and refer to the same thing, a server.

User Stories
============

- As a MySQL DBA I want to deploy a cluster of MySQL servers that automatically withstands server failures and rearranges the cluster configuration, so that reconfiguration is done automatically and without human intervention.

- As a MySQL DBA I want to be able to log into a single server in a cluster of MySQL servers, which is responsible for a given partition of the data, so that I can discover the cluster membership and member status.

- As a MySQL user I want to execute transactions on a MySQL cluster where the failure of the server where transactions execute shall not result in data loss.

- As a MySQL DBA I want to deploy a cluster of MySQL servers with a lossless Primary-Backups replication setup, so that any transaction that is executed against the primary server is never lost in the event of primary failure.

- As a MySQL DBA I want to deploy a cluster of MySQL servers with a lossless multi-master replication setup, so I don't have to deal with primary fail-over on the middleware in the event of primary server failures.

- As a MySQL DBA I want to deploy a cluster of MySQL servers with a lossless multi-master replication setup, so I can load balance my write load across different servers in the cluster.

- As a MySQL DBA I want to deploy a cluster of MySQL servers into which I can add and remove servers dynamically while the service is running, so I don't have to deal with server synchronization.

- As a MySQL DBA I want to connect two clusters across a wide-area network, so I can deploy multiple datacenters to address disaster recovery requirements.

References
==========

http://mysqlhighavailability.com/mysql-group-replication-hello-world/
http://mysqlhighavailability.com/category/replication/group-replication/
- Initialization

This section presents the high level requirements for the initialization of the Group Replication plugin, namely in what form Group Replication is loaded into the server, and when and how it can be loaded and started or unloaded and stopped.

I1. Group Replication SHALL be a MySQL plugin that the user can load/unload while the MySQL server is online.

I2. Group Replication SHALL be a MySQL plugin that the user can start and stop while the MySQL server is online.

I3. The Group Replication plugin SHALL be able to start during the MySQL server boot and be available when the first user session is opened.

- Deployment

These are the requirements for deploying a Group Replication cluster.

- General

D1. Group Replication SHALL support up to 9 servers in the group. This is obviously dependent on the workload: write intensive workloads will REQUIRE high bandwidth networks.

D2. A group SHALL be identified by a UUID. This is the group_name.

- Provisioning

D3. A server SHALL be provisioned according to the current procedure in place to provision, for instance, a slave server.

- Bootstrapping

D4. When starting the group, one server in the group SHALL be responsible for bootstrapping the group. The user SHALL configure which server is responsible for bootstrapping it.

D5. Should N servers in the group be marked as servers that bootstrap the group, then N groups SHALL be formed.

- Adding a New Server to the Group

D6. A server MUST be provisioned before it is added to the group.

D7. Servers SHALL be added to the group one by one and not all in one go.

- Distributed Recovery

D8. Once added to the group, a server SHALL fetch any missing state from a donor before starting to process new transactions from the user and the group itself - this is the distributed recovery.

D9. The recovery donor SHALL be picked randomly until a good donor is found. Then recovery MUST succeed.

D10. The newly added server SHALL start in RECOVERING state and then move to ONLINE after distributed recovery terminates.

D11. A server SHALL remain in read-only mode until it is declared ONLINE.

D12. A server SHALL remain in read-only mode if recovery fails.
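To make the initialization and deployment requirements above more concrete, the following is a minimal, illustrative sketch of how one member could be configured and started using the plugin's statements and system variables. Host names, ports, the group UUID and the replication credentials are placeholders, not values mandated by this worklog.

  -- I1: load the plugin while the server is online (I3 would instead rely on
  -- loading it at boot together with group_replication_start_on_boot).
  INSTALL PLUGIN group_replication SONAME 'group_replication.so';

  -- D2: the group is identified by a UUID, the group_name.
  SET GLOBAL group_replication_group_name    = 'aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee';
  SET GLOBAL group_replication_local_address = 'host1:33061';
  SET GLOBAL group_replication_group_seeds   = 'host1:33061,host2:33061,host3:33061';

  -- D8/S4: recovery credentials for distributed recovery go through the
  -- regular replication channel.
  CHANGE MASTER TO MASTER_USER = 'rpl_user', MASTER_PASSWORD = 'rpl_pass'
    FOR CHANNEL 'group_replication_recovery';

  -- D4/D5: exactly one member bootstraps the group.
  SET GLOBAL group_replication_bootstrap_group = ON;
  START GROUP_REPLICATION;
  SET GLOBAL group_replication_bootstrap_group = OFF;

  -- D7: the remaining members join one by one with a plain START GROUP_REPLICATION.
  -- I2: the plugin can be stopped and started again while the server stays online.
  STOP GROUP_REPLICATION;
  START GROUP_REPLICATION;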
- Conflicts: Detection and Resolution

These are the high level requirements for Group Replication regarding conflict detection and resolution. Note though that MySQL is not fully transactional (e.g., one cannot roll back DDL), thence some special rules apply to DDL.

In Group Replication, transactions executing at different servers execute and modify their local view of the data, i.e., they execute on their own snapshot. Before committing, they need to synchronize with all the other servers in the group to validate that their changes do not conflict with changes that happened on other snapshots/servers concurrently. One can think of this as a distributed multi-version concurrency control system with optimistic locking, where transactions execute optimistically and without a priori synchronization at each server and then, on commit, establish the global ordering of their operations by consulting the group. Then, if there are two transactions writing to the same row concurrently, the first one to be totally ordered will be the one committing successfully; the second one will roll back/abort. We can see this as an extended definition of snapshot isolation concurrency control; in fact, this type of approach has been named generalized snapshot isolation.

C1. Conflicts SHALL be resolved by aborting one of the transactions that conflict.

C2. Any two DML transactions, executing at different servers and updating the same data row, SHALL conflict, and an error SHALL be returned to the user on one of them (which error depends on the stage of execution the transaction is at when the conflict occurs):

  - ER_ERROR_DURING_COMMIT - when a local transaction is aborted due to a high-priority transaction during the commit stage.
  - ER_LOCK_DEADLOCK - when a local transaction is aborted due to a high-priority transaction during the update stage.
  - ER_TRANSACTION_ROLLBACK_DURING_COMMIT - when a conflict is detected on certification.

C3. Any two operations, one DML and one DDL, operating on different objects (e.g., table t1 and table t2), SHALL execute concurrently and correctly at different servers in the group.

C4. Any two operations, one DML and one DDL, operating on the same object (e.g., table t1), SHALL execute correctly at different servers in the group, provided that they do NOT execute concurrently.

C5. Any two operations, one DML and one DDL, operating on the same object (e.g., table t1), SHALL lead to data inconsistencies if they execute concurrently on different servers in the group.

C6. Any two operations, one DML and one DDL, operating on the same object (e.g., table t1), SHALL execute correctly if they both execute on the same server in the group.
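As an illustration of C1/C2, consider the sketch below of two members updating the same row at roughly the same time. Table t1, the values used and which member wins are arbitrary choices made only for the example, and the exact error text may differ from what is shown in the comments.

  -- Executed concurrently on member A and on member B:
  BEGIN;
  UPDATE t1 SET val = val + 1 WHERE pk = 10;   -- same row on both members

  -- Both sessions now commit. The first transaction to be totally ordered
  -- by the group commits successfully:
  COMMIT;   -- member A: Query OK

  -- The other transaction fails certification and is rolled back (C1/C2):
  COMMIT;   -- member B: ERROR: ER_TRANSACTION_ROLLBACK_DURING_COMMIT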
- Network Partitioning

At its core, Group Replication relies on a variant of Paxos to provide distributed agreement before a transaction commits. The algorithm implemented by XCom is very close to Mencius. This means that the system will operate correctly and remain in a synchronized state as long as the servers in the group can reach a quorum among themselves. Obviously, if the quorum is lost, then the system will not allow modifications, since a majority is unable to decide on the fate of a given operation.

In distributed systems, when a server becomes unreachable, it is impossible to say whether the server/process has crashed or whether it is simply unreachable from the given origin. The only thing one knows is that the server is not reachable within a specific time boundary. Thence, when we refer to a failed server, we mean a server that is unreachable.

N1. A quorum requires a majority of nodes, thence to tolerate f failures a Group Replication system MUST have 2f+1 servers in the group. Assuming f=1 (3 servers deployed):

N1.1 Should 1 server crash, then the system SHALL remain available for reads and writes.

N1.2 Should 2 servers crash at once, then the system SHALL remain available only for reads. Write operations SHALL block.

N1.3 Should 2 servers crash consecutively, i.e., the group notices that one server has been removed and then the other, then the cluster SHALL reconfigure the group and allow the group to shrink automatically.

- View Changes, Failure Detection and Reconfiguration

This section presents the requirements on failure detection and reconfigurations of the group. A view is the status of the membership at a given logical moment in time. Servers keep track of the membership, and a distributed failure detection mechanism is in place to find which servers are alive. If the failure detector suspects that a server is no longer reachable, then a reconfiguration of the membership may be triggered automatically. Conversely, a new server is added to the group by having a delegate intercede for it, i.e., by having a delegate trigger a reconfiguration and thus a view change. Servers need to agree on the new view after a suspicion and/or a reconfiguration is triggered.

V1. Servers in the group SHALL automatically suspect other servers every N seconds.

V2. Should a server in the group crash or become unreachable, then the failure detection will hint that the server has failed. A server SHALL then trigger a reconfiguration automatically, ultimately resulting in installing a new view.

V3. Should a server be added to the group, then a new view SHALL be installed.

- Observability

This section presents the high level requirements for the observability of the dynamic Group Replication state and statistics.

O1. The user SHALL be able to inspect the state of members in the group by checking performance schema tables:

O1.1 replication_group_member_stats - This table SHALL show statistics on the certification procedure, including the view identifier of the given group.

O1.2 replication_group_members - This table SHALL present the membership of the group at a given point in time and the status of the members in the group.

O1.3 replication_connection_status - This table SHALL present the status of the cluster connection.

O1.4 replication_applier_status - This table SHALL present the status of the group replication applier.

- Performance

This section provides the high level requirements for Group Replication performance.

P1. The performance of Group Replication deployed with just a single master writing and several backups copying the changes SHALL be comparable/competitive to Galera in single leader mode.

P2. The performance of Group Replication when deployed in multi-master without hotspots does not have a boundary or a baseline to compare to on the first release, thence the requirement is that it performs well enough. However, the stretch goal is that it is competitive with Galera in multi-master mode.

P3. The performance of Group Replication when deployed in multi-master with hotspots (i.e., different servers updating the same rows) SHALL be very LOW.

P4. Enabling SSL WILL present an overhead on the throughput.

- Interoperability

These are the high level requirements for interoperability of Group Replication with other MySQL components, in particular with MySQL asynchronous replication.

IO1. It SHALL be possible to replicate between a server in a group and a standalone server using regular replication.

IO2. It SHALL be possible to replicate from a regular MySQL master to a server in a group.

IO3. Group Replication SHALL use the regular replication infrastructure - relay logs, binary logs, applier threads, receiver thread.

- Security

These are the high level requirements regarding security in Group Replication.

S1. It SHALL be possible to interconnect servers in the group using SSL, thus using secure channels.

S2. If two servers are not using the same certificates, then they SHALL fail to communicate.

S3. Recovery credentials SHALL NOT be exposed in any log or command line option.

S4. Group Replication SHALL use the same approach as regular replication to set up recovery credentials (CHANGE MASTER TO ...).
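As a sketch of how S1-S4 are typically exercised, the statements below use the plugin's SSL and recovery options. The certificate paths and the replication account are placeholders, and the exact option set may vary by version.

  -- S1/S2: group communication over SSL; members need compatible certificate
  -- configuration for the connections to succeed.
  SET GLOBAL group_replication_ssl_mode = 'REQUIRED';

  -- S3/S4: recovery credentials are configured through CHANGE MASTER TO on the
  -- dedicated channel, not through command line options or logs.
  CHANGE MASTER TO MASTER_USER = 'rpl_user', MASTER_PASSWORD = 'rpl_pass'
    FOR CHANNEL 'group_replication_recovery';

  -- Distributed recovery connections can also be forced onto SSL.
  SET GLOBAL group_replication_recovery_use_ssl  = ON;
  SET GLOBAL group_replication_recovery_ssl_ca   = '/path/to/ca.pem';
  SET GLOBAL group_replication_recovery_ssl_cert = '/path/to/client-cert.pem';
  SET GLOBAL group_replication_recovery_ssl_key  = '/path/to/client-key.pem';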
WL#6829: Group Replication: Generic Interface for accessing communication infrastructure
WL#6833: Group Replication: Read-set free Certification Module (DBSM Snapshot Isolation)
WL#6837: Group Replication: Online distributed recovery
WL#6842: Group Replication: Transaction Applier Module
WL#7060: Group Replication: Plugin
WL#7332: Group Replication: cluster constistent view-id and node attributes propagation
WL#7334: Group Replication: Read-set free Certification Module - Part 2
WL#7732: GCS Replication: GCS API
WL#8443: Group Replication: Plugin version handshake on member join
WL#8445: Group Replication: Auto-increment configuration/handling
WL#8446: Group Replication: Implement error detection and sleep routines on donor selection
WL#8793: Group Replication: Integration with libmysql-gcs and XCom
WL#9017: Group Replication: Secure distributed recovery credentials
WL#9050: Group Replication: MySQL GCS majority loss handling integration
By Design Behaviour and Requirements
====================================

- Row-based Replication. Why? Write-sets are calculated at the time record extraction for RBR is performed.

- InnoDB tables only. Why? Operations need to be transactional so that the group decides whether to commit or rollback an operation at commit time. Moreover, transactions that are executing during the optimistic stage can be aborted because a certified transaction is invalidating them, thence there is no need to continue execution of the optimistic one.

- Primary Key on each table. Why? An item in the write set relies on the hash of a primary key, thence unique across the cluster.

- On conflicts, abort on the commit operation. Why? Because Group Replication implements a replication protocol which is based on the concept of generalized snapshot isolation.

Known Limitations
=================

Multi-master
------------

- DML/DDL update everywhere is not possible.
  - DDL and DML for the object being altered need to go through the same server.
  - MySQL does not have transactional nor even atomic DDL. Server limitation.
- Transaction savepoints are not supported.
  - The write-set extraction procedure does not consider savepoints, thence there is no way to deal with them at the moment.
- Binlog event checksums are not supported.
- Reduced performance on large transactions.
  - Transactions are cached and sent in one message atomically to all servers in the group.
- GR can generate inconsistencies if there are multiple FK dependencies.
- GR does not take into account key-range locking in the certification.
- GR does not take into account table locks in the certification.

Single Master
-------------

- Transaction savepoints are not supported.
- Binlog event checksums are not supported.
- Reduced performance on large transactions.
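As an illustration of the by-design requirements above (row-based replication, InnoDB only, a primary key on every table), the sketch below shows a table layout that satisfies them. The schema is an example only, not something mandated by this worklog.

  -- Row-based replication is required: binlog_format = ROW (typically set in my.cnf).

  CREATE TABLE t1 (
    pk  INT PRIMARY KEY,      -- mandatory: write-set items hash the primary key
    val VARCHAR(64)
  ) ENGINE = InnoDB;          -- mandatory: only transactional InnoDB tables

  -- A table without a primary key, or one using a non-transactional engine,
  -- violates these requirements.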