WL#7388: Failover/Switchover based on unreliable failure detectors
Status: In-Documentation
GOAL
====

Fabric provides some means to report server failures. Although very useful,
the current interface is not suitable for reporting transient failures
because servers are permanently marked as faulty and made unusable. For
example:

  mysqlfabric server set_status 85d2fbc6-0fb3-4439-801b-330068d8f043 FAULTY

In this case, the server identified by '85d2fbc6-0fb3-4439-801b-330068d8f043'
is marked as faulty. Should the server be a master, a new one is elected,
provided there is a slave that can become the new master.

This task aims to improve the current interface and allow external entities
to report transient server failures. Fabric will keep track of reported
issues and, only after receiving N reports from S sources within an interval
T, set the server's status to faulty.

REMARKS
=======

. See WL#7455 for information on how we are planning to handle security
  issues.

. Trying to get post-mortem information on a server in order to avoid losing
  transactions is out of the scope of this task.

. See WL#7391 for details on fence mechanisms and reliable failure detectors.
REPORTING ON UNSTABLE SERVERS
=============================

In this task, we are planning to extend Fabric so that users can report
transient errors, and a server is marked as faulty only after N reports from
S sources are received within an interval T. Users willing to exploit this
new behavior must report server issues as follows:

  mysqlfabric threat report_error 85d2fbc6-0fb3-4439-801b-330068d8f043 127.0.0.1 ER_CON_COUNT_ERROR

The command reports that the "127.0.0.1" source got the "ER_CON_COUNT_ERROR"
error while accessing the server identified by
"85d2fbc6-0fb3-4439-801b-330068d8f043". Internally, Fabric executes the
following code:

Constants:

  _NOTIFICATION_INTERVAL  # Interval T in seconds.
  _NOTIFICATION_CLIENTS   # Number of different sources (i.e. S sources).
  _NOTIFICATIONS          # Number of notifications (i.e. N reports).

Classes:

  Group    - Contains information on a group.
  ErrorLog - Stores and retrieves reported errors.
  Server   - Contains information on a server.

Parameters:

  uuid     - Server the error belongs to.
  reporter - Source which has seen the error.
  error    - Error code, description, message, etc.

Code:

01: def report_error(uuid, reporter, error):
02:     now = get_time()
03:     server = get_server(uuid)
04:
05:     ErrorLog.append_error(server, reporter, error, now)
06:     errors = ErrorLog.fetch(server, _NOTIFICATION_INTERVAL, now)
07:
08:     if errors.is_unstable(_NOTIFICATIONS, _NOTIFICATION_CLIENTS):
09:         group = Group.fetch(server.group_id)
10:         if can_set_server_faulty(server, group, now):
11:             set_status_faulty(server)
12:
13: def set_status_faulty(server):
14:     server.status = FAULTY
15:
16:     group = Group.fetch(server.group_id)
17:     trigger_within_procedure(
18:         "SERVER_LOST", group.group_id, str(server.uuid)
19:     )
20:
21:     if group.master == server.uuid:
22:         trigger_within_procedure("FAIL_OVER", group.group_id)
23:
24: def can_set_server_faulty(server, group, now):
25:     diff = now - group.master_defined
26:     if (group.master == server.uuid and diff > _FAILOVER_INTERVAL) \
27:             or group.master != server.uuid:
28:         return True
29:     return False

The first step consists in storing the error information in an error log (05)
for future use. Then (06) the set of errors reported within a time interval
(i.e. now - _NOTIFICATION_INTERVAL) is read in order to check whether the
server is unstable or not. The server is considered unstable if there are
(_NOTIFICATIONS) reports from (_NOTIFICATION_CLIENTS) different clients (08),
as sketched further below. If this is the case and the server's status can be
set to faulty (10), the "set_status_faulty" function is called (11). Should
the server be the current master, a new master is elected (21).

Any unstable slave can have its status automatically set to faulty. However,
in order to avoid making the system unstable, the number of failover
operations triggered within a time period is limited (26). The algorithm
checks whether enough time has elapsed since the last time the Group.master
property was updated (i.e. now - Group.master_defined > _FAILOVER_INTERVAL).
If enough time has elapsed, the failover operation is triggered by calling
the "FAIL_OVER" event. Otherwise, nothing is done. The Group.master property
contains a reference to the master in the group and is updated after demoting
the current master, promoting a new server to master or failing over to a new
healthy server.
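The "ErrorLog.is_unstable" check referenced in (08) is not spelled out above.
The following is a minimal sketch of such a check, assuming the fetched
errors simply carry the list of sources ("reporters") seen within the
notification interval; the function and parameter names are used here for
illustration only and are not the actual Fabric implementation:

  # Minimal sketch of the instability criterion: N reports in total and
  # S distinct sources are both required within the notification interval.
  def is_unstable(reporters, notifications, notification_clients):
      """Return True if the server should be considered unstable.

      :param reporters: one entry per report received within the interval,
                        identifying the source that reported the error.
      :param notifications: minimum number of reports (N).
      :param notification_clients: minimum number of distinct sources (S).
      """
      return (len(reporters) >= notifications and
              len(set(reporters)) >= notification_clients)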
UNSTABLE SYSTEM
---------------

In order to avoid making the system unstable, the number of failover
operations triggered by unstable servers within a time period is limited.
Note though that this does not forbid the system from setting a slave's
status to faulty. Users can also manually demote the current master or
promote a server to master.

REPORTING ON FAULTY SERVERS
===========================

If one knows for sure that a server (i.e. MySQL) is really unreachable or
dead, so that gathering reports on the server's instability is not necessary
before having its status set to faulty, the following command can be
executed:

  mysqlfabric threat report_failure 85d2fbc6-0fb3-4439-801b-330068d8f043 127.0.0.1 "Error Description"

The command reports that the "127.0.0.1" source considered the server
identified by "85d2fbc6-0fb3-4439-801b-330068d8f043" to be faulty.
Internally, Fabric executes the following code:

Classes:

  Group    - Contains information on a group.
  ErrorLog - Stores and retrieves reported errors.
  Server   - Contains information on a server.

Parameters:

  uuid     - Server that is considered to be faulty.
  reporter - Source which has seen the server's issues.
  error    - Error code, description, message, etc.

Code:

01: def report_failure(uuid, reporter, error):
02:     now = get_time()
03:     server = get_server(uuid)
04:
05:     ErrorLog.append_error(server, reporter, error, now)
06:
07:     group = Group.fetch(server.group_id)
08:     set_status_faulty(server)
09:
10: def set_status_faulty(server):
11:     server.status = FAULTY
12:
13:     group = Group.fetch(server.group_id)
14:     trigger_within_procedure(
15:         "SERVER_LOST", group.group_id, str(server.uuid)
16:     )
17:
18:     if group.master == server.uuid:
19:         trigger_within_procedure("FAIL_OVER", group.group_id)

The first step consists in storing the error information in an error log (05)
for future use. Then (08) the server's status is set to faulty by calling the
"set_status_faulty" function.

FENCING OFF SERVERS
===================

In the "set_status_faulty" function outlined above, the "SERVER_LOST" event
is also triggered. This event must be used to fence off unstable and faulty
servers after setting their status to faulty. This is done to avoid the
situation where a connector can still access and update the old master while
Fabric has no access to it and a new master is being promoted or has been
promoted. The event is not implemented because this would require previous
knowledge of the user's environment. So users are responsible for extending
the system and providing the appropriate fence mechanism: kill the MySQL
process, shut down its host, disconnect the server from the network, etc.
A sketch of such an extension is shown below.
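As an illustration only, the following is a minimal sketch of what a
user-provided fence handler might look like. The registration of the handler
with the "SERVER_LOST" event is not shown, and the fencing command is a
hypothetical site-specific script; both are placeholders rather than part of
this design:

  import subprocess

  # Hypothetical handler to be wired to the "SERVER_LOST" event. How it is
  # registered with Fabric's event system is environment-specific.
  def fence_off_server(group_id, server_uuid):
      """Fence off a lost server so that no connector keeps writing to it.

      The command below is only a placeholder: shutting down the MySQL
      instance, powering off the host, or blocking it at the network level
      are all valid fencing strategies.
      """
      subprocess.call(["/usr/local/bin/fence-server", group_id, server_uuid])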
FAIL OVER/SWITCH OVER
=====================

The promote operation encapsulates both the failover and switchover
operations. It uses the master's status to decide which one shall be
executed: if the master's status is different from FAULTY, a switchover is
executed; otherwise, a failover is executed.

Neither operation must try to access any server whose status is faulty, in
order to avoid blocking its execution. An attempt to access a faulty server
may block because Connector/Python provides a synchronous interface, which
means that it follows the TCP/IP timeout configuration and may take hours
until the timeout expires. In other words, faulty servers are not picked as
candidates and we do not try to fetch any information from them. In the
future, we will try to dig into the master's binary logs to avoid losing
transactions. However, this is not in the scope of this work.

TOOLS
=====

In the context of WL#7387, we are planning to extend connectors to report
errors that are usually not caught by regular failure detectors. Users can
also extend the current implementation and build their own failure detection
solutions, either as procedures within Fabric or as external tools. If Fabric
monitors a large number of groups or servers, though, it is preferable to
build external tools due to possible performance issues. BUG#71372 will
introduce an example of how to build an external failure detector; a minimal
sketch of such a tool is also shown after the off-the-shelf detector below.

Off-the-shelf Failure Detector
------------------------------

For convenience, Fabric provides an off-the-shelf, ready-to-use failure
detector that can be enabled per group in small setups as follows:

  mysqlfabric group activate

To disable it, do the following:

  mysqlfabric group deactivate

When the failure detector is enabled, it executes the following code:

Constants:

  _DETECTION_INTERVAL  # Interval in seconds.
  _DETECTIONS          # Number of failed attempts to access a server.
  _DETECTION_TIMEOUT   # Number of seconds to wait before giving up.

Classes:

  Group  - Contains information on a group.
  Server - Contains information on a server.

Parameters:

  uuid     - Server that is considered to be faulty.
  reporter - Source which has seen the server's issues.
  error    - Error code, description, message, etc.

Code:

01: quarantine = {}
02: while enabled:
03:     try:
04:         unreachable = set()
05:         group = Group.fetch(self.__group_id)
06:         if group is not None:
07:             for server in group.servers():
08:                 if server.status != FAULTY and \
09:                         server.is_alive(_DETECTION_TIMEOUT):
10:                     continue
11:
12:                 unreachable.add(server.uuid)
13:
14:                 unstable = False
15:                 failed_attempts = 0
16:                 if server.uuid not in quarantine:
17:                     quarantine[server.uuid] = failed_attempts = 1
18:                 else:
19:                     failed_attempts = quarantine[server.uuid] + 1
20:                     quarantine[server.uuid] = failed_attempts
21:                 if failed_attempts >= _DETECTIONS:
22:                     unstable = True
23:
24:                 can_set_faulty = group.can_set_server_faulty(
25:                     server, get_time()
26:                 )
27:                 if unstable and can_set_faulty:
28:                     procedure = trigger("REPORT_FAILURE", ...)
29:                     wait_for_procedure(procedure)
30:
31:         for uuid in quarantine.keys():
32:             if uuid not in unreachable:
33:                 del quarantine[uuid]
34:     except Exception as error:
35:         _LOGGER.exception(error)
36:
37:     sleep(_DETECTION_INTERVAL / _DETECTIONS)

For each server in the group, the detector checks whether the server is alive
by trying to connect to it. The _DETECTION_TIMEOUT parameter determines the
maximum amount of time the check waits until a connection is established.
Note that the check may return with an error earlier, for example if the
server is unreachable (09). If the server is already faulty or is alive, the
detector moves on to the next server. If it does not manage to connect to a
server (_DETECTIONS) consecutive times, the server is considered unstable
(22). If it is unstable and its status can be set to faulty, the
report_failure command previously described is triggered (28).
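For comparison with the built-in detector above, the following is a minimal
sketch of the kind of external failure detector mentioned earlier (see
BUG#71372). The addresses, UUID, interval, and the plain TCP reachability
check are illustrative assumptions; the sketch simply probes a server
periodically and reports problems through the documented threat interface:

  import socket
  import subprocess
  import time

  # Illustrative values only; a real tool would take these from its own
  # configuration.
  SERVER_UUID = "85d2fbc6-0fb3-4439-801b-330068d8f043"
  SERVER_ADDRESS = ("127.0.0.1", 3306)
  REPORTER = "127.0.0.1"
  CHECK_INTERVAL = 5  # Seconds between checks.

  def is_reachable(address, timeout=1):
      """Return True if a TCP connection to the server can be established."""
      try:
          sock = socket.create_connection(address, timeout)
          sock.close()
          return True
      except socket.error:
          return False

  while True:
      if not is_reachable(SERVER_ADDRESS):
          # Report the problem through the command-line interface described
          # in this worklog; report_failure could be used instead when the
          # tool is certain the server is dead.
          subprocess.call([
              "mysqlfabric", "threat", "report_error",
              SERVER_UUID, REPORTER, "Server unreachable"
          ])
      time.sleep(CHECK_INTERVAL)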
CONCURRENCY CONTROL
===================

Read-write procedures in Fabric are serialized through a concurrency control
mechanism. In the context of this work, it is used to guarantee that the
report_error, report_failure and fail_over commands are serialized per group.
This makes it easier to design a failover procedure that is resilient to
failures, but we will remove this limitation in the near future.

TABLES
======

The following table is used to store information on errors:

  CREATE_SERVER_ERROR_LOG = (
      "CREATE TABLE error_log "
      "(server_uuid VARCHAR(40) NOT NULL, "
      "reported TIMESTAMP(6) NOT NULL, "
      "reporter VARCHAR(64), "
      "error TEXT, "
      "INDEX key_server_uuid_reported (server_uuid, reported), "
      "INDEX key_reporter (reporter))"
  )

  CREATE_EVENT_ERROR_LOG = (
      "CREATE EVENT prune_error_log "
      "ON SCHEDULE EVERY %s SECOND "
      "DO DELETE FROM error_log WHERE "
      "TIMEDIFF(UTC_TIMESTAMP(), reported) > MAKETIME(%s,0,0)"
  )

  ADD_FOREIGN_KEY_CONSTRAINT_SERVER_UUID = (
      "ALTER TABLE error_log ADD CONSTRAINT "
      "fk_server_uuid_error_log FOREIGN KEY(server_uuid) "
      "REFERENCES servers(server_uuid)"
  )

Entries older than "prune_time" are automatically removed from the table by
an event that runs every "prune_time" seconds. By default, "prune_time" is
3600 seconds.

In order to avoid frequently and automatically failing over to a different
server, which would make the system unstable, we also keep track of when the
master was last changed in a group:

  CREATE TABLE groups
      (group_id VARCHAR(64) NOT NULL,
       description VARCHAR(256),
       master_uuid VARCHAR(40),
       master_time DOUBLE(16, 6) NULL,
       status BIT(1) NOT NULL,
       CONSTRAINT pk_group_id PRIMARY KEY (group_id))

CONFIGURATION
=============

data/fabric.cfg.in:
-------------------

  [failure_tracking]
  notifications = 300
  notification_clients = 50
  notification_interval = 60
  failover_interval = 0
  detections = 3
  detection_interval = 6
  detection_timeout = 1
  prune_time = 3600

OBJECTS
=======

ErrorLog:
---------

  # Remove entries older than _PRUNE_TIME in seconds.
  _MIN_PRUNE_TIME = 60
  _PRUNE_TIME = _DEFAULT_PRUNE_TIME = 3600

  def configure(config):
      try:
          prune_time = int(config.get("failure_tracking", "prune_time"))
          if prune_time < ErrorLog._MIN_PRUNE_TIME:
              _LOGGER.warning(
                  "Prune time cannot be lower than %s.",
                  ErrorLog._MIN_PRUNE_TIME
              )
              prune_time = ErrorLog._MIN_PRUNE_TIME
          ErrorLog._PRUNE_TIME = int(prune_time)
      except (_config.NoOptionError, _config.NoSectionError, ValueError):
          pass
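To make the relationship between the ErrorLog object and the error_log table
concrete, the following is a minimal sketch of how append_error and fetch
could sit on top of the schema above. The exec_stmt helper and its interface
are assumptions used for illustration; Fabric's actual persistence layer is
not shown here:

  import datetime

  # Sketch only: exec_stmt(statement, params) is a hypothetical DB-API style
  # helper that runs a statement against the backing store and returns rows.
  def append_error(server, reporter, error, when):
      """Store a reported error in the error_log table."""
      exec_stmt(
          "INSERT INTO error_log (server_uuid, reported, reporter, error) "
          "VALUES (%s, %s, %s, %s)",
          (str(server.uuid), when, reporter, str(error))
      )

  def fetch(server, interval, now):
      """Return the reporters seen for a server within the last interval."""
      start = now - datetime.timedelta(seconds=interval)
      rows = exec_stmt(
          "SELECT reporter FROM error_log "
          "WHERE server_uuid = %s AND reported >= %s",
          (str(server.uuid), start)
      )
      return [row[0] for row in rows]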
FailureDetector:
----------------

  # Interval in seconds within which consecutive failures are considered.
  _MIN_DETECTION_INTERVAL = 2.0
  _DETECTION_INTERVAL = _DEFAULT_DETECTION_INTERVAL = 5.0

  # Number of consecutive failures to consider a server faulty.
  _MIN_DETECTIONS = 1
  _DETECTIONS = _DEFAULT_DETECTIONS = 2

  # Number of seconds to wait before giving up on connecting to a server.
  _MIN_DETECTION_TIMEOUT = 1
  _DETECTION_TIMEOUT = _DEFAULT_DETECTION_TIMEOUT = 1

  def configure(config):
      try:
          detection_interval = \
              float(config.get("failure_tracking", "detection_interval"))
          if detection_interval < FailureDetector._MIN_DETECTION_INTERVAL:
              _LOGGER.warning(
                  "Detection interval cannot be lower than %s.",
                  FailureDetector._MIN_DETECTION_INTERVAL
              )
              detection_interval = FailureDetector._MIN_DETECTION_INTERVAL
          FailureDetector._DETECTION_INTERVAL = float(detection_interval)
      except (_config.NoOptionError, _config.NoSectionError, ValueError):
          pass

      try:
          detections = int(config.get("failure_tracking", "detections"))
          if detections < FailureDetector._MIN_DETECTIONS:
              _LOGGER.warning(
                  "Detections cannot be lower than %s.",
                  FailureDetector._MIN_DETECTIONS
              )
              detections = FailureDetector._MIN_DETECTIONS
          FailureDetector._DETECTIONS = int(detections)
      except (_config.NoOptionError, _config.NoSectionError, ValueError):
          pass

      try:
          detection_timeout = \
              int(config.get("failure_tracking", "detection_timeout"))
          if detection_timeout < FailureDetector._MIN_DETECTION_TIMEOUT:
              _LOGGER.warning(
                  "Detection timeout cannot be lower than %s.",
                  FailureDetector._MIN_DETECTION_TIMEOUT
              )
              detection_timeout = FailureDetector._MIN_DETECTION_TIMEOUT
          FailureDetector._DETECTION_TIMEOUT = int(detection_timeout)
      except (_config.NoOptionError, _config.NoSectionError, ValueError):
          pass

ReportError:
------------

  # Number of reports to consider a server unstable.
  _MIN_NOTIFICATIONS = 1
  _NOTIFICATIONS = _DEFAULT_NOTIFICATIONS = 300

  # Number of different sources that must report an error to consider a
  # server unstable.
  _MIN_NOTIFICATION_CLIENTS = 1
  _NOTIFICATION_CLIENTS = _DEFAULT_NOTIFICATION_CLIENTS = 50

  # Interval in seconds to analyze errors to consider a server unstable.
  _MIN_NOTIFICATION_INTERVAL = 1
  _MAX_NOTIFICATION_INTERVAL = 3600
  _NOTIFICATION_INTERVAL = _DEFAULT_NOTIFICATION_INTERVAL = 60

  def configure(config):
      try:
          notifications = \
              int(config.get("failure_tracking", "notifications"))
          if notifications < ReportError._MIN_NOTIFICATIONS:
              _LOGGER.warning(
                  "Notifications cannot be lower than %s.",
                  ReportError._MIN_NOTIFICATIONS
              )
              notifications = ReportError._MIN_NOTIFICATIONS
          ReportError._NOTIFICATIONS = int(notifications)
      except (_config.NoOptionError, _config.NoSectionError, ValueError):
          pass

      try:
          notification_clients = \
              int(config.get("failure_tracking", "notification_clients"))
          if notification_clients < ReportError._MIN_NOTIFICATION_CLIENTS:
              _LOGGER.warning(
                  "Notification clients cannot be lower than %s.",
                  ReportError._MIN_NOTIFICATION_CLIENTS
              )
              notification_clients = ReportError._MIN_NOTIFICATION_CLIENTS
          ReportError._NOTIFICATION_CLIENTS = int(notification_clients)
      except (_config.NoOptionError, _config.NoSectionError, ValueError):
          pass

      try:
          notification_interval = \
              int(config.get("failure_tracking", "notification_interval"))
          if notification_interval > _error_log.ErrorLog._PRUNE_TIME:
              _LOGGER.warning(
                  "Notification interval cannot be greater than the prune "
                  "interval %s.", _error_log.ErrorLog._PRUNE_TIME
              )
              notification_interval = _error_log.ErrorLog._PRUNE_TIME
          if notification_interval > ReportError._MAX_NOTIFICATION_INTERVAL:
              _LOGGER.warning(
                  "Notification interval cannot be greater than %s.",
                  ReportError._MAX_NOTIFICATION_INTERVAL
              )
              notification_interval = ReportError._MAX_NOTIFICATION_INTERVAL
          if notification_interval < ReportError._MIN_NOTIFICATION_INTERVAL:
              _LOGGER.warning(
                  "Notification interval cannot be lower than %s.",
                  ReportError._MIN_NOTIFICATION_INTERVAL
              )
              notification_interval = ReportError._MIN_NOTIFICATION_INTERVAL
          ReportError._NOTIFICATION_INTERVAL = int(notification_interval)
      except (_config.NoOptionError, _config.NoSectionError, ValueError):
          pass

ReportFailure:
--------------

There is no configuration; it calls the FAIL_OVER event as previously
explained.
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.