WL#7388: Failover/Switchover based on unreliable failure detectors

Status: In-Documentation

GOAL
====
Fabric provides some means to report server failures. Although very useful,
the current interface is not suitable for reporting transient failures, as
servers are permanently marked as faulty and made unusable. For example,

mysqlfabric server set_status 85d2fbc6-0fb3-4439-801b-330068d8f043 FAULTY

In this case, the server identified by '85d2fbc6-0fb3-4439-801b-330068d8f043' is
marked as faulty. Should the server be a master, a new one is elected, provided
there is a slave that can become the new master.

This task aims to improve the current interface and allow external entities to
report transient server failures. Fabric will keep track of reported issues and,
only after receiving N reports from S sources within an interval T, the server's
status will be set to faulty.
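
For example, with the default configuration presented later in this worklog
(notifications = 300, notification_clients = 50, notification_interval = 60),
a server's status is only set to faulty after Fabric receives at least 300
error reports coming from at least 50 different sources within a 60-second
window.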

REMARKS
=======
. See WL#7455 for information on how we are planning to handle security issues.
. Trying to get post-mortem information on a server in order to avoid losing
  transactions is out of scope.
. See WL#7391 for details on fence mechanisms and reliable failure detectors.

REPORTING ON UNSTABLE SERVERS
=============================
In this task, we are planning to extend Fabric so that users can report
transient errors that cause a server to be marked as faulty only after receiving
N reports from S sources within an interval T. Users willing to exploit this
new behavior must report server issues as follows:

mysqlfabric threat report_error 85d2fbc6-0fb3-4439-801b-330068d8f043
127.0.0.1 ER_CON_COUNT_ERROR

The command reports that the "127.0.0.1" source got the "ER_CON_COUNT_ERROR"
error while accessing the server identified by "85d2fbc6-0fb3-4439-801b-
330068d8f043".
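
For illustration, an external entity (e.g. a monitoring script) could issue the
same report through Fabric's XML-RPC interface rather than the command line. The
address, credentials and the exact method path below are assumptions of this
sketch:

    # Sketch only: report an error for a server through Fabric's XML-RPC
    # interface. Address, credentials and method path are placeholders.
    import xmlrpclib  # Python 2, as used by Fabric

    fabric = xmlrpclib.ServerProxy("http://admin:secret@localhost:32274")

    # Same arguments as the command line: server UUID, reporting source and
    # the error seen by that source.
    fabric.threat.report_error(
        "85d2fbc6-0fb3-4439-801b-330068d8f043",
        "127.0.0.1",
        "ER_CON_COUNT_ERROR"
    )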

Internally, Fabric executes the following code:

  Constants:
      _NOTIFICATION_INTERVAL # Interval T in seconds.
      _NOTIFICATION_CLIENTS # Number of different sources (i.e. S sources).
      _NOTIFICATIONS # Number of notifications (i.e. N reports).
      _FAILOVER_INTERVAL # Minimum interval in seconds between automatic
       failovers within a group.

  Classes:
      Group - Contains information on a group.
      ErrorLog - Stores and retrieves errors reported.
      Server - Contains information on a server.

  Parameters:
      uuid - Server to which the error belongs.
      reporter - Source which has seen the error.
      error - Error code, description or message.

  Code:

  01: def report_error(uuid, reporter, error):
  02:    now = get_time()
  03:    server = get_server(uuid)
  04:
  05:    ErrorLog.append_error(server, reporter, error, now)
  06:    errors = ErrorLog.fetch(server, _NOTIFICATION_INTERVAL, now)
  07:
  08:    if errors.is_unstable(_NOTIFICATIONS, _NOTIFICATION_CLIENTS):
  09:        group = Group.fetch(server.group_id)
  10:        if can_set_server_faulty(server, group, now):
  11:            set_status_faulty(server)
  12:
  13: def set_status_faulty(server):
  14:    server.status = FAULTY
  15:
  16:    group = Group.fetch(server.group_id)
  17:    trigger_within_procedure(
  18:        "SERVER_LOST", group.group_id, str(server.uuid)
  19:    )
  20:
  21:    if group.master == server.uuid:
  22:        trigger_within_procedure("FAIL_OVER", group.group_id)
  23:
  24: def can_set_server_faulty(server, group, now):
  25:    diff = now - group.master_defined
  26:    if (group.master == server.uuid and diff > _FAILOVER_INTERVAL) \
  27:            or group.master != server.uuid:
  28:        return True
  29:    return False

The first step consists in storing the error information into an error log (05)
for future use. Then (06) it reads the set of errors reported within a time
interval (i.e. from now - _NOTIFICATION_INTERVAL to now) in order to check
whether the server is unstable or not.

The server is considered unstable if there are (_NOTIFICATIONS) reports from
(_NOTIFICATION_CLIENTS) different clients (08). If this is the case and the
server's status can be set to faulty (10), the "set_status_faulty" function is
called (11). Should the server be the current master, a new master is elected
(21).
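
For illustration, the instability check boils down to the following minimal
sketch, assuming the entries fetched from the error log carry the reporting
source; the class and attribute names are illustrative, not the actual ErrorLog
implementation:

    # Sketch of the instability criterion used above: enough reports, from
    # enough distinct sources, within the analyzed interval. Illustrative only.
    class ErrorLogSketch(object):
        def __init__(self, entries):
            # entries: list of (reporter, error, when) tuples already filtered
            # to the interval [now - _NOTIFICATION_INTERVAL, now].
            self.entries = entries

        def is_unstable(self, notifications, notification_clients):
            reporters = set(reporter for reporter, _, _ in self.entries)
            return (len(self.entries) >= notifications and
                    len(reporters) >= notification_clients)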

Any unstable slave can have its status automatically set to faulty. However, in
order to avoid making the system unstable, the number of failover operations 
triggered within a time period is limited (26). The algorithm checks whether 
enough time has elapsed since the last time the Group.master property has been 
updated (i.e. now - Group.master_defined > _FAILOVER_INTERVAL). If enough time 
has elapsed, the failover operation is triggered by calling the "FAIL_OVER" 
event. Otherwise, nothing is done.

The Group.master property contains a reference to the master in the group and is 
updated after demoting the current master, promoting a new server to master or 
failing over to a new healthy server.
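
For example, with the default failover_interval = 0 the throttling is
effectively disabled and automatic failovers are never delayed, whereas
failover_interval = 300 would allow at most one automatic failover per group
every five minutes.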

UNSTABLE SYSTEM
---------------
In order to avoid making the system unstable, the number of failover operations
triggered by unstable servers that may happen within a time period is limited.
Note though that this does not prevent the system from setting a slave's status
to faulty.

Users can also manually demote the current master or promote a server to master.

REPORTING ON FAULTY SERVERS
===========================
If one knows for sure that a server (i.e. a MySQL instance) is really
unreachable or dead, so that gathering reports on the server's instability is
not necessary before setting its status to faulty, the following command can be
executed:

mysqlfabric threat report_failure 85d2fbc6-0fb3-4439-801b-330068d8f043
127.0.0.1 "Error Description"

The command reports that the "127.0.0.1" source considered the server identified
by "85d2fbc6-0fb3-4439-801b-330068d8f043" to be faulty.

Internally, Fabric executes the following code:

  Classes:
      Group - Contains information on a group.
      ErrorLog - Stores and retrieves errors reported.
      Server - Contains information on a server.

  Parameters:
      uuid - Server that is considered to be faulty.
      reporter - Source which has seen the server's issues.
      error - Error code, description or message.

  Code:

  01: def report_failure(uuid, reporter, error):
  02:    now = get_time()
  03:    server = get_server(uuid)
  04:
  05:    ErrorLog.append_error(server, reporter, error, now)
  06:    set_status_faulty(server)
  07:
  08: def set_status_faulty(server):
  09:    server.status = FAULTY
  10:
  11:    group = Group.fetch(server.group_id)
  12:    trigger_within_procedure(
  13:        "SERVER_LOST", group.group_id, str(server.uuid)
  14:    )
  15:
  16:    if group.master == server.uuid:
  17:        trigger_within_procedure("FAIL_OVER", group.group_id)


The first step consists in storing the error information into an error log
(05) for future use. Then (06) it sets the server's status to faulty by
calling the "set_status_faulty" function.

FENCING OFF SERVERS
===================
In the "set_status_faulty" function outlined above, the "SERVER_LOST" event
is also triggered. This event must be used to fence off unstable and faulty
servers after setting its status to faulty. This is executed to avoid the
situation where a connector can access and update the old master but Fabric
has no access to it and a new master is being promoted or has been promoted.

The event handler is not implemented because this would require prior knowledge
of the user's environment. Users are therefore responsible for extending the
system and providing the appropriate fencing mechanism: kill the MySQL process,
shut down its server, disconnect the server from the network, etc.
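
As an illustration only, a user-provided handler for "SERVER_LOST" could look
like the sketch below; how the handler is registered with Fabric's events
module and how the server is reached (SSH here) are assumptions that depend on
the deployment:

    # Sketch of a user-provided fencing handler for the SERVER_LOST event.
    # lookup_address() is a hypothetical helper and the SSH-based fence is
    # only one possible mechanism.
    import subprocess

    def fence_server(group_id, server_uuid):
        """Fence off the lost server so stale clients cannot keep writing."""
        address = lookup_address(server_uuid)  # hypothetical helper
        # One possible fence: stop the MySQL process on the affected host.
        subprocess.call(["ssh", address, "sudo", "service", "mysql", "stop"])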

FAIL OVER/SWITCH OVER
=====================
The promote operation encapsulates both the failover and switchover operations.
It uses the master's status to decide which one shall be executed. If the
master's status is different from FAULTY, a switchover is executed. Otherwise,
a failover is executed.

Both operations must not try to access any server whose status is faulty in
order to avoid blocking their execution. An attempt to access a faulty server
may block because Connector/Python provides a synchronous interface, which
means that it follows the TCP/IP timeout configuration and may take hours until
the timeout expires.

In other words, faulty servers are not picked as candidates and we do not try
to fetch any information from them. In the future though, we will try to dig
into the master's binary logs to avoid losing transactions. However, this is
not in the scope of this work.
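
A condensed sketch of the decision described above, assuming a Group object
that exposes the current master; the helper names are illustrative:

    # Sketch: promote() runs a switchover while the master is healthy and a
    # failover once it has been marked FAULTY. do_switchover(), do_failover()
    # and get_server() are illustrative helpers.
    def promote(group):
        master = get_server(group.master) if group.master else None
        if master is not None and master.status != FAULTY:
            # Healthy master: planned switchover, the old master is demoted.
            do_switchover(group)
        else:
            # Faulty or missing master: fail over to a healthy slave without
            # touching servers whose status is FAULTY.
            do_failover(group)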

TOOLS
=====
In the context of WL#7387, we are planning to extend connectors to report
errors that are usually not caught by regular failure detectors. Users can also
extend the current implementation and build their own failure detection
solutions, either as procedures within Fabric or as external tools. If Fabric
monitors a large number of groups or servers, though, it is preferable to build
external tools due to possible performance issues.

BUG#71372 will introduce an example on how to build an external failure 
detector.
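
As a rough illustration (and not the BUG#71372 example), an external detector
could periodically probe each server with Connector/Python and forward failures
to Fabric; addresses, credentials and the XML-RPC method path are assumptions
of this sketch:

    # Sketch of an external failure detector: probe servers periodically and
    # report connection failures to Fabric, which aggregates the reports and
    # decides when a server becomes faulty. Connection details below are
    # placeholders.
    import time
    import xmlrpclib

    import mysql.connector  # Connector/Python

    SERVERS = {
        "85d2fbc6-0fb3-4439-801b-330068d8f043": ("server-1.example.com", 3306),
    }
    FABRIC = xmlrpclib.ServerProxy("http://admin:secret@localhost:32274")

    while True:
        for uuid, (host, port) in SERVERS.items():
            try:
                conn = mysql.connector.connect(
                    host=host, port=port, user="probe", password="probe",
                    connection_timeout=1
                )
                conn.close()
            except mysql.connector.Error as error:
                FABRIC.threat.report_error(uuid, "external-detector", str(error))
        time.sleep(5)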

Off-the-shelf Failure Detector
------------------------------
For convenience, Fabric provides an off-the-shelf and ready-to-use failure
detector that can be enabled per group in small setups as follows:

mysqlfabric group activate group_id

To disable it, do the following:

mysqlfabric group deactivate group_id

When the failure detector is enabled, it executes the following code:

  Constants:
      _DETECTION_INTERVAL # Interval in seconds within which consecutive
       failures are considered.
      _DETECTIONS # Number of consecutive failed attempts to access a server
       before considering it unstable.
      _DETECTION_TIMEOUT # Number of seconds to wait before giving up on
       connecting to a server.

  Classes:
      Group - Contains information on a group.
      Server - Contains information on a server.

  Parameters:
      group_id - Group that is monitored by the failure detector.

  Code:

  01: quarantine = {}
  02:
  03: while enabled:
  04:     try:
  05:         unreachable = set()
  06:         group = Group.fetch(self.__group_id)
  07:         if group is not None:
  08:             for server in group.servers():
  09:                 if server.status != FAULTY and \
  10:                     server.is_alive(_DETECTION_TIMEOUT):
  11:                     continue
  12:
  13:                 unreachable.add(server.uuid)
  14:
  15:                 unstable = False
  16:                 failed_attempts = 0
  17:                 if server.uuid not in quarantine:
  18:                     quarantine[server.uuid] = failed_attempts = 1
  19:                 else:
  20:                     failed_attempts = quarantine[server.uuid] + 1
  21:                     quarantine[server.uuid] = failed_attempts
  22:                 if failed_attempts >= _DETECTIONS:
  23:                     unstable = True
  24:
  25:                 can_set_faulty = group.can_set_server_faulty(
  26:                     server, get_time()
  27:                 )
  28:                 if unstable and can_set_faulty:
  29:                     procedure = trigger("REPORT_FAILURE", ...)
  30:                     wait_for_procedure(procedure)
  31:
  32:         for uuid in list(quarantine.keys()):
  33:             if uuid not in unreachable:
  34:                 del quarantine[uuid]
  35:     except Exception:
  36:         pass # Keep the detector running despite transient errors.
  37:
  38:     sleep(_DETECTION_INTERVAL / _DETECTIONS)

For each server in the group, it checks whether the server is alive or not by
trying to connect to it. The _DETECTION_TIMEOUT parameter determines the
maximum amount of time that the function waits until a connection is
established. Note that it may return with an error earlier if, for example, the
server is unreachable (10). If the server is faulty or is alive, it goes to the
next server.

If it does not manage to connect to a server (_DETECTIONS) consecutive times,
the server is considered unstable (23). If it is unstable and can be set to
faulty, the report_failure command previously described is triggered (29).

CONCURRENCY CONTROL
===================
Read-write procedures in Fabric are serialized through a concurrency control
mechanism. It is used in the context of this work to guarantee that the 
report_error, report_failure and fail_over commands are serialized per group.
This makes it easier to design a failover procedure that is resilient to
failures, but we will remove this limitation in the near future.
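
Conceptually, this serialization amounts to a per-group critical section around
those procedures. The toy sketch below only illustrates the idea and is not
Fabric's actual executor or locking implementation:

    # Toy illustration of per-group serialization: at most one report_error,
    # report_failure or fail_over procedure runs at a time for a given group.
    import threading
    from collections import defaultdict

    _GROUP_LOCKS = defaultdict(threading.Lock)

    def run_serialized(group_id, procedure, *args):
        # Procedures touching different groups still run concurrently.
        with _GROUP_LOCKS[group_id]:
            return procedure(*args)
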
TABLES
======
The following table is used to store information on errors:

    CREATE_SERVER_ERROR_LOG = (
        "CREATE TABLE error_log "
        "(server_uuid VARCHAR(40) NOT NULL, "
        "reported TIMESTAMP(6) NULL NOT NULL, "
        "reporter VARCHAR(64), "
        "error TEXT, "
        "INDEX key_server_uuid_reported (server_uuid, reported), "
        "INDEX key_reporter (reporter))"
    )

    CREATE_EVENT_ERROR_LOG = (
        "CREATE EVENT prune_error_log "
        "ON SCHEDULE EVERY % SECOND "
        "DO DELETE FROM error_log WHERE "
        "TIMEDIFF(UTC_TIMESTAMP(), reported) > MAKETIME(%s,0,0)"
    )

    ADD_FOREIGN_KEY_CONSTRAINT_SERVER_UUID = (
        "ALTER TABLE error_log ADD CONSTRAINT "
        "fk_server_uuid_error_log FOREIGN KEY(server_uuid) "
        "REFERENCES servers(server_uuid)"
    )

Entries in the table that are older than "prune_time" are automatically removed
by an event that runs every "prune_time" seconds. By default, "prune_time" is
3600 seconds.

In order to avoid frequently and automatically failing over to a different
server, which would make the system unstable, we also keep track of when the
master has changed in a group:

   CREATE TABLE groups
     (group_id VARCHAR(64) NOT NULL,
     description VARCHAR(256),
     master_uuid VARCHAR(40),
     master_time DOUBLE(16, 6) NULL,
     status BIT(1) NOT NULL,
     CONSTRAINT pk_group_id PRIMARY KEY (group_id))

CONFIGURATION
=============

data/fabric.cfg.in:
-------------------

[failure_tracking]

notifications = 300
notification_clients = 50
notification_interval = 60
failover_interval = 0
detections = 3
detection_interval = 6
detection_timeout = 1
prune_time = 3600
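
With these defaults, the built-in failure detector sleeps
detection_interval / detections = 6 / 3 = 2 seconds between rounds, waits at
most detection_timeout = 1 second per connection attempt, and considers a
server unstable after detections = 3 consecutive failed attempts, i.e. after
roughly six seconds of unreachability. Reported errors are removed from the
error log after prune_time = 3600 seconds.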

OBJECTS
=======

ErrorLog:
---------
# Remove entries older than _PRUNE_TIME in seconds.
_MIN_PRUNE_TIME = 60
_PRUNE_TIME = _DEFAULT_PRUNE_TIME = 3600

def configure(config):
    try:
        prune_time = int(config.get("failure_tracking", "prune_time"))
        if prune_time < ErrorLog._MIN_PRUNE_TIME:
            _LOGGER.warning(
                "Prune entries be lower than %s.",
                ErrorLog._MIN_PRUNE_TIME
            )
            prune_time = ErrorLog._MIN_PRUNE_TIME
        ErrorLog._PRUNE_TIME = int(prune_time)
    except (_config.NoOptionError, _config.NoSectionError, ValueError):
        pass

FailureDetector:
----------------
# Interval in seconds within which consecutive failures are considered.
_MIN_DETECTION_INTERVAL = 2.0
_DETECTION_INTERVAL = _DEFAULT_DETECTION_INTERVAL = 5.0

# Number of consecutive failures to consider a server faulty.
_MIN_DETECTIONS = 1
_DETECTIONS = _DEFAULT_DETECTIONS = 2

# Number of seconds that it should wait before giving up to connect to a server.
_MIN_DETECTION_TIMEOUT = 1
_DETECTION_TIMEOUT = _DEFAULT_DETECTION_TIMEOUT = 1

def configure(config):
    try:
        detection_interval = \
            float(config.get("failure_tracking", "detection_interval"))
        if detection_interval < FailureDetector._MIN_DETECTION_INTERVAL:
            _LOGGER.warning(
                "Detection interval cannot be lower than %s.",
                FailureDetector._MIN_DETECTION_INTERVAL
            )
            detection_interval = FailureDetector._MIN_DETECTION_INTERVAL
        FailureDetector._DETECTION_INTERVAL = float(detection_interval)
    except (_config.NoOptionError, _config.NoSectionError, ValueError):
        pass

    try:
        detections = int(config.get("failure_tracking", "detections"))
        if detections < FailureDetector._MIN_DETECTIONS:
            _LOGGER.warning(
                "Detections cannot be lower than %s.",
                FailureDetector._MIN_DETECTIONS
            )
            detections = FailureDetector._MIN_DETECTIONS
        FailureDetector._DETECTIONS = int(detections)
    except (_config.NoOptionError, _config.NoSectionError, ValueError):
        pass

    try:
        detection_timeout = \
            int(config.get("failure_tracking", "detection_timeout"))
        if detection_timeout < FailureDetector._MIN_DETECTION_TIMEOUT:
            _LOGGER.warning(
                "Detection timeout cannot be lower than %s.",
                FailureDetector._MIN_DETECTION_TIMEOUT
            )
            detection_timeout = FailureDetector._MIN_DETECTION_TIMEOUT
        FailureDetector._DETECTION_TIMEOUT = int(detection_timeout)
    except (_config.NoOptionError, _config.NoSectionError, ValueError):
        pass


ReportError:
------------
# Number of reports to consider a server unstable.
_MIN_NOTIFICATIONS = 1
_NOTIFICATIONS = _DEFAULT_NOTIFICATIONS = 300

# Number of different sources that must report an error to consider a
# server unstable.
_MIN_NOTIFICATION_CLIENTS = 1
_NOTIFICATION_CLIENTS = _DEFAULT_NOTIFICATION_CLIENTS = 50

# Interval in seconds to analyze errors to consider a server unstable.
_MIN_NOTIFICATION_INTERVAL = 1
_MAX_NOTIFICATION_INTERVAL = 3600
_NOTIFICATION_INTERVAL = _DEFAULT_NOTIFICATION_INTERVAL = 60

def configure(config):
    try:
        notifications = int(config.get("failure_tracking", "notifications"))
        if notifications < ReportError._MIN_NOTIFICATIONS:
            _LOGGER.warning(
                "Notifications cannot be lower than %s.",
                ReportError._MIN_NOTIFICATIONS
            )
            notifications = ReportError._MIN_NOTIFICATIONS
        ReportError._NOTIFICATIONS = int(notifications)
    except (_config.NoOptionError, _config.NoSectionError, ValueError):
        pass

    try:
        notification_clients = \
            int(config.get("failure_tracking", "notification_clients"))
        if notification_clients < ReportError._MIN_NOTIFICATION_CLIENTS:
            _LOGGER.warning(
                "Notification_clients cannot be lower than %s.",
                ReportError._MIN_NOTIFICATION_CLIENTS
            )
            notification_clients = ReportError._MIN_NOTIFICATION_CLIENTS
        ReportError._NOTIFICATION_CLIENTS = int(notification_clients)
    except (_config.NoOptionError, _config.NoSectionError, ValueError):
        pass

    try:
        notification_interval = \
            int(config.get("failure_tracking", "notification_interval"))
        if notification_interval > _error_log.ErrorLog._PRUNE_TIME:
            _LOGGER.warning(
                "Notification interval cannot be greater than prune "
                "interval %s", _error_log.ErrorLog._PRUNE_TIME
            )
            notification_interval = _error_log.ErrorLog._PRUNE_TIME
        if notification_interval > ReportError._MAX_NOTIFICATION_INTERVAL:
            _LOGGER.warning(
                "Notification interval cannot be greater than %s.",
                ReportError._MAX_NOTIFICATION_INTERVAL
            )
            notification_interval = ReportError._MAX_NOTIFICATION_INTERVAL
        if notification_interval < ReportError._MIN_NOTIFICATION_INTERVAL:
            _LOGGER.warning(
                "Notification interval cannot be lower than %s.",
                ReportError._MIN_NOTIFICATION_INTERVAL
            )
            notification_interval = ReportError._MIN_NOTIFICATION_INTERVAL
        ReportError._NOTIFICATION_INTERVAL = int(notification_interval)
    except (_config.NoOptionError, _config.NoSectionError, ValueError):
        pass


ReportFailure:
--------------
There are no configuration options. It sets the server's status to faulty and
triggers the FAIL_OVER event as previously explained.