WL#7648: Decoupling provisioning from updates to state store

Status: In-Documentation

GOAL
====
The aim of this work is to decouple operations that update the state store from 
those that set up servers, such as provisioning them or configuring replication 
or any other high-availability solution. This decoupling is key to having an 
easy-to-use product that can be readily adopted and deployed in different 
environments.

In other words, this work is a stepping stone towards supporting different 
provisioning mechanisms (e.g. MySQL Enterprise Backup, AWS Support, etc.) and
different high-availability solutions (e.g. MySQL Cluster, DRBD, etc.). It will 
also allow users who are willing to use Fabric but want to rely on their own 
scripts to continue doing so.

CONTEXT
=======
Fabric provides three main sub-systems: High-availability, Sharding and 
Provisioning. Fabric organizes servers into high-availability groups to provide 
resilience to failures, and shards are assigned to groups in order to take 
advantage of their high-availability features. Currently, however, only standard 
MySQL Replication is supported.

Appropriate interfaces are provided so that groups and shards can be set up and 
provisioned by a database administrator. Connectors and, in particular, users' 
applications, on the other hand, do not need to manage groups and shards but do 
need to fetch information on them to find out which servers are responsible for 
a group or shard. So Fabric must provide interfaces to:

    . fetch information - Retrieve information on high-availability groups and
      shards stored in the state store, etc

    . update information - Add/remove servers into/from a group or define/move/
      split/remove shards, etc. In other words, update the state store.

    . provision the system - Install new servers; take backups and restore
      them; configure replication among servers, etc.

Usually, when information on groups or shards is updated, provisioning steps are 
executed as well. For example, moving a shard from one group to another may 
require restoring a backup of the source group to the destination group before 
updating any information that maps a shard to a group. However, in some cases, 
users may want to execute their own provisioning steps because:

    . Fabric provisioning steps are not a good fit to their environment;

    . Their provisioning steps provide better performance;

    . Integrating their provisioning steps into Fabric requires time, etc.

PROPOSAL
========
Any command that may execute a provisioning step must provide the --update_only 
option so that users may choose whether to execute the provisioning steps or 
skip them. By default, the option is false, meaning that the provisioning steps 
are executed. If the option is set to true, the command only updates the state 
store.
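
As an illustration only, the effect of the option can be sketched in Python as
follows. StateStore, configure_replication and add_server are hypothetical names
used for this sketch and are not taken from Fabric's code; the relative order of
the two steps is also an assumption.

```python
# Hypothetical sketch: none of these names come from Fabric itself.

class StateStore:
    """In-memory stand-in for the state store."""

    def __init__(self):
        self.groups = {}

    def add_server(self, group_id, server):
        self.groups.setdefault(group_id, []).append(server)


def configure_replication(group_id, server, log):
    """Placeholder for a provisioning step; it only records what it would do."""
    log.append("configure replication for %s in %s" % (server, group_id))


def add_server(store, group_id, server, update_only=False, log=None):
    """Add a server to a group; skip provisioning when update_only is True."""
    log = [] if log is None else log
    # The state store is always updated.
    store.add_server(group_id, server)
    # Provisioning runs only when update_only is False (the default).
    if not update_only:
        configure_replication(group_id, server, log)
    return log
```

With update_only=True the returned log stays empty, while the state store is
updated in both cases.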

In the future, any provisioning method shall be available as a command as well. 
This way, users will be able to choose the appropriate provisioning method while 
setting up their environment.

Currently, Fabric only supports a provisioning method that is deeply rooted in 
the sharding.move/split routines. In addition, group.add and server.set_status 
configure replication, which can also be considered a provisioning method. Note, 
though, that we will not extract these provisioning methods as commands in the 
context of this WL.


REMARKS
=======
. We will not provide support for different provisioning solutions in the
  context of this work.

. We will not provide support for different high-availability solutions in the
  context of this work. See WL#7392 for further details.

. This work depends on WL#7528, which provides the means to use/access optional
  parameters in a command.

. We also took the opportunity to make the following changes in this patch:
  
  . Removed group.import_topology(...). It required servers to be started
    with the following options: --report-host and --report-port. This was
    not very useful.
  
  . Renamed group.check_group_availability(...) to group.health(...) and
    removed "is_master" from the returned value. Users can check the
    same information in the "status" value.

CHANGED COMMANDS
================
We shall add the --update_only parameter to the following commands:

. group.add(server, ..., update_only=False, ...)

  - This command adds a server into a group: updates the state store and
    configures replication.

. group.promote(group_id, slave_uuid=None, update_only=False, ...)

  - This command demotes the current master, if there is one, and promotes a
    new server to master. Only secondaries are automatically chosen to become 
    primary. If users want to promote a spare to master, the --slave_uuid 
    parameter must be provided.

. group.demote(group_id, slave_uuid=None, update_only=False, ...)

  - This command demotes the current master, if there is one.

. server.set_status(server, ..., update_only=False, ...)

  - This command changes a server's status: updates the state store and
    configures replication.

. sharding.move(shard_id, ..., update_only=False, ...)

  - This command moves a shard from one group to another: updates the state
    store, takes a backup of the source group, restores it to the destination
    group and synchronizes the source and destination groups.

. sharding.split(shard_id, ..., update_only=False, ...)

  - This command splits a shard between two groups: updates the state store,
    takes a backup of the source group, restores it to the destination group
    and synchronizes the source and destination groups.
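
The candidate selection rule described for group.promote above can be sketched
in Python as follows. This is only an illustration: choose_candidate and the
dictionary layout are hypothetical, and picking the first secondary in sorted
order is an assumption made to keep the sketch deterministic, not a description
of how Fabric ranks candidates.

```python
# Hypothetical sketch; function name and data layout are assumptions.
PRIMARY, SECONDARY, SPARE, FAULTY = "PRIMARY", "SECONDARY", "SPARE", "FAULTY"


def choose_candidate(servers, slave_uuid=None):
    """Return the UUID of the server to promote.

    servers maps a server UUID to its current status.
    """
    if slave_uuid is not None:
        # An explicitly named server may be a secondary or a spare.
        status = servers.get(slave_uuid)
        if status not in (SECONDARY, SPARE):
            raise ValueError("server %s cannot be promoted" % slave_uuid)
        return slave_uuid
    # Without slave_uuid, only secondaries are considered automatically.
    for uuid in sorted(servers):
        if servers[uuid] == SECONDARY:
            return uuid
    raise ValueError("no secondary available for promotion")
```

A spare is therefore never returned unless its UUID is passed explicitly through
slave_uuid.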


STATUS TRANSITIONS
==================

These are the possible server statuses:

. Primary denotes that a server may accept write transactions
  and that secondaries connect to it to fetch updates.

. Secondary denotes that a server accepts read-only transactions
  and connects to a primary to fetch updates.

. Spare is a secondary that is not automatically elected to
  become a primary if there is a need to do so.

. Faulty is a server that is not behaving as expected or
  is unreachable.

+-----------+---------+-----------+-------+--------+
|  FROM/TO  | PRIMARY | SECONDARY | SPARE | FAULTY |
|-----------|---------|-----------|-------|--------|
| PRIMARY   |         |     *     |       |   +    |
|-----------|---------|-----------|-------|--------|
| SECONDARY |    *    |           |   x   |   +    |
|-----------|---------|-----------|-------|--------|
| SPARE     |    *    |     x     |       |   +    |
|-----------|---------|-----------|-------|--------|
| FAULTY    |         |           |   x   |        |
+-----------+---------+-----------+-------+--------+

The operations that can change a server's status are the following:

. group.promote() and group.demote() denoted with *.

. threat.report_faulty() and threat.report_error() denoted
  with +.

. server.set_status() denoted with x.
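
The table above can also be encoded as a lookup structure, for instance to check
whether a requested transition is allowed. The following Python sketch is an
illustration only; TRANSITIONS, MARKERS and operation_for are hypothetical names
and are not part of Fabric.

```python
# Hypothetical encoding of the transition table; names are assumptions.
MARKERS = {
    "*": "group.promote() / group.demote()",
    "+": "threat.report_faulty() / threat.report_error()",
    "x": "server.set_status()",
}

# TRANSITIONS[current][new] holds the marker found in the table cell.
TRANSITIONS = {
    "PRIMARY":   {"SECONDARY": "*", "FAULTY": "+"},
    "SECONDARY": {"PRIMARY": "*", "SPARE": "x", "FAULTY": "+"},
    "SPARE":     {"PRIMARY": "*", "SECONDARY": "x", "FAULTY": "+"},
    "FAULTY":    {"SPARE": "x"},
}


def operation_for(current, new):
    """Return the operation(s) that may move a server from current to new,
    or None when the table has no entry for that transition."""
    marker = TRANSITIONS.get(current, {}).get(new)
    return MARKERS.get(marker)
```

For example, operation_for("FAULTY", "PRIMARY") returns None because a faulty
server can only be moved to SPARE, through server.set_status().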