WL#12649: Automatic connection failover for Async Replication Channels - Step I: Automatic Connection Failover

Affects: Server-8.0   —   Status: Complete

EXECUTIVE SUMMARY

This worklog implements a mechanism in asynchronous replication that makes the receiver automatically try to re-establish an asynchronous replication connection to another sender, in case the current connection gets interrupted due to the failure of the current sender.

TERMINOLOGY

Let us first define the terminologies which will be used in this worklog.

Asynchronous Replication terminologies:

The different roles that servers are playing while engaging in asynchronous replication activities.

  • sender: endpoint that sends data.
  • receiver: endpoint that receives data from the source.

  • sender list: a list of multiple senders in which each item contains sender connection details and a priority. The user would add this sender list so that receiver can connect to new sender in case existing sender fails. The sender with higher priority would be choosen to connect next.

sender list                                sender list
 [sender1] -----> [receiver]               [sender1] ---x--> [receiver]
                                                                ^
                             --------->                         |
 [sender2]                                 [sender2] -----------|


          sender1 fails, receiver connects to  sender2
          --------------------------------------------

  • weight (priority): When more then one sender are present for the same channel, then next sender would be selected with highest weight (priority) for the same channel. The weight is a number between 1 and 100.

MOTIVATION

The main driver of this worklog is to make deployment fault-tolerant of sender failure by automating the process of re-establishment of an asynchronous replication connection to another sender of sender list. The sender server can be deployed across wide-area in different location, to improve the Disaster Recovery (DR) to even failures of whole data center. This will also be true for Innodb cluster as the sender can also be a member of an Innodb cluster. When Innodb Cluster was introduced it created a fault-tolerant system where redundant components could be removed automatically and the system continued to operate as expected. This work will further improve fault-tolerance of Innodb cluster to failures of whole data center, by creating fault-tolerant replica cluster across wide-area in different location.

USER STORIES

  • As a MySQL DBA I want to setup asynchronous replication between two servers, S1 (sender) to R1 (receiver), and in case of failure i.e. S1 goes down, I want the R1 to automatically reconnect to the another sender with higher priority S2 (sender), so that I do not have to build monitoring and automation myself to make the connection across wide-area in different location reliable.

sender list                             sender list
[S1:90] ---> [R1]                         [S1:90] ---x--> [R1]
                                                            ^
                       ------->                             |
[S2:80]                                   [S2:80] ----------|

[S3:70]                                   [S3:70]


       S1 (sender) fails, R1 (receiver) connects to S2
       ------------------------------------------------

  • As a MySQL DBA I want to configure a MySQL receiver server so that I am able to list the sender servers it should connect to in case of current sender dies.

FUNCTIONAL REQUIREMENTS:

FR1. The receiver thread(IO thread) stops due to sender crash/stop or due to network failure, receiver MUST establish asynchronous channel between new sender and existing receiver.

FR2. The receiver MUST choose next sender to establish asynchronous channel based on priority of sender. The sender with highest priority for the same channel MUST be choosen to connect next.

FR3. If a sender with higher priority for the same channel becomes available, the receiver MUST reconnect only if current asynchronous replication connection fails due to sender crash/stop or due to network failure.

FR4. The Asynchronous replication credentials SHALL be provided through CHANGE MASTER command and not through START SLAVE parameters, so that replication credentials persist even after mysql restart, and can connect to next sender with those replication credentials.

FR5. The receiver IO thread MUST fail with ER_RPL_ASYNC_RECONNECT_FAIL_NO_SOURCE error when current Asynchronous Replication connection fails and there are no senders to connect.

FR6. It uses global transaction identifiers to track exactly which transactions have been already committed on receiver, so that it knows from where receiver I/O thread SHALL begin reading from new sender. The receiver MUST have Global Transaction Identifiers ON (gtid_mode=ON), to enable Asynchronous Replication Connection Failover feature. If GTID mode is disabled it MUST fail with error ER_RPL_ASYNC_RECONNECT_GTID_MODE_OFF. While if Asynchronous Replication Connection Failover feature is enabled i.e. SOURCE_CONNECTION_AUTO_FAILOVER=ON for any channel setting gtid_mode!=ON fails with ER_DISABLE_AUTO_POSITION_REQUIRES_ASYNC_RECONNECT_OFF error.

FR7. The receiver MUST also have MASTER_AUTO_POSITION option enabled in the CHANGE MASTER TO, so that receiver can get replication data stream using GTID-based auto-positioning rather than file-based positioning. If MASTER_AUTO_POSITION option is disabled while enabling Asynchronous Replication Connection Failover feature it MUST fail with ER_RPL_ASYNC_RECONNECT_AUTO_POSITION_OFF. While if Asynchronous Replication Connection Failover feature is enabled i.e. SOURCE_CONNECTION_AUTO_FAILOVER=ON for any channel disabling MASTER_AUTO_POSITION fails with ER_DISABLE_AUTO_POSITION_REQUIRES_ASYNC_RECONNECT_OFF error.

FR8. When the receiver IO thread has exhausted the sender list i.e. it has tried connecting all the senders and none of the connection was successful, it resets the sender list and again starts trying connecting each sender from the sender list starting from the topmost sender according to highest priority. This keeps going in an endless loop till the IO thread is stopped using STOP SLAVE (IO_THREAD).

FR9. If the receiver is cloned using CLONE PLUGIN, the new cloned server MUST have same value for SOURCE_CONNECTION_AUTO_FAILOVER option of CHANGE MASTER and same sender list connection details in mysql.replication_asynchronous_connection_failover.

NON-FUNCTIONAL REQUIREMENTS:

NFR1. The same connection credentials SHALL be used for all the senders of the same channel, so that on failure of existing asynchronous replication channel it can connect to the next sender with the same connection credentials but different host().

NFR2. The mysql.replication_asynchronous_connection_failover table added in this worklog to store sender configuration details SHALL only be updated through UDFs provided with this worklog. The update or insert rows in the mysql.replication_asynchronous_connection_failover table manually can cause undefined behavior, and is not supported.

NFR3. The next source is selected according to descending order of weight in case of failure of existing one. But if multiple source are added with the same weight then there is no other mechanism to deterministically select the next source other than the weights, and so will be selected randomly.

NON REQUIREMENTS:

NR1. The receiver crashes, there is no recovery path from this failure.

1. INTRODUCTION

The worklog introduces a new replication asynchronous connection failover feature whose purpose is it to keep servers in sync using asynchronous replication channel and in case of failure re-establishes asynchronous replication connection with another sender which is selected according to descending order of weight.

sender list                          sender list
[sender1:90] ---> [receiver]         [sender1:90] ---x--> [receiver]
                                                            ^
                            ------->                        |
[sender2:80]                         [sender2:80] ----------|

[sender3:70]                         [sender3:70]

       sender1 fails, receiver connects to sender2
       -------------------------------------------

On sender1 failure, the next sender with higher weight is choosen by receiver to connect, and that would be sender2 as it has weight of 80 higher then sender3 which is 70.

The Replication asynchronous connection failover feature is divided into two parts:

  1. The list of potential senders on the receiver, which further in this worklog will be named as Asynchronous Connection Sender List.

  2. The mechanism which would react and reconnect on asynchronous replication channel failure, which further in this worklog will be named as Asynchronous Connection Failover Mechanism.

2. REPLICATION ASYNCHRONOUS CONNECTION SENDER LIST

The Sender List will provide functionality through which user can change sender configuration details and influence next sender selection.

The Sender List detail would contain: - asynchronous replication channel name - sender connection details i.e. host, port and network namespace - weight based on which next sender would be selected.

2.1. SENDER LIST STORAGE

A new replication_asynchronous_connection_failover table has been added to mysql database to store sender configuration details provided by DBA/User. The schema of the table is as follows:

CREATE TABLE IF NOT EXISTS 
mysql.replication_asynchronous_connection_failover (
  Channel_name CHAR(64) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL
  COMMENT 'The replication channel name that connects source and replica.',
  Host CHAR(255) CHARACTER SET ASCII NOT NULL COMMENT 'The source hostname
  that the replica will attempt to switch over the replication connection
  to in case of a failure.',
  Port INTEGER UNSIGNED NOT NULL COMMENT 'The source port that the replica
  will attempt to switch over the replication connection to in case of a
  failure.',
  Network_namespace VARCHAR(64) COMMENT 'The source network namespace that
  the replica will attempt to switch over the replication connection to in
  case of a failure. If its value is empty, connections use the default
  (global) namespace.',
  Weight TINYINT UNSIGNED NOT NULL COMMENT 'The order in which the replica 
  shall try to switch the connection over to when there are failures.
  Weights can be set to a number between 1 and 100, where 100 is the
  highest weigth and 1 the lowest.',
CONSTRAINT CHECK (Weight BETWEEN 1 AND 100),
PRIMARY KEY(Channel_name, Host, Port, Network_namespace))
DEFAULT CHARSET=utf8 STATS_PERSISTENT=0
COMMENT 'The source configuration details'

2.2. SENDER LIST CONFIGURATION

Two new UDF's has been provided with this worklog to add/delete sender configuration details. Please find details in section 9.1 and 9.2.

2.3. CREDENTIALS MANAGEMENT AND ASSUMPTIONS

For asynchronous replication connection, configuration details like username and password (or certification based authentication) are also needed which DBA/User needs to provide through CHANGE MASTER or START SLAVE command. As mentioned in FR4 also, if the DBA/User wants the asynchronous replication credentials to be available even after mysql restart, he should use CHANGE MASTER command instead of START SLAVE to provide replication credentials.

The DBA/User shall configure the credentials for the channel as usual:

CHANGE MASTER TO MASTER_USER='rpl_user', MASTER_PASSWORD='password' FOR CHANNEL 'rpl_async';

The DBA/User should not specify credentials per sender. The credentials are specified for channel and as such, for all the senders configured for a specific channel, they have to have the same user configured, so that the receiver can connect to them.

3. REPLICATION ASYNCHRONOUS CONNECTION FAILOVER MECHANISM:

The Asynchronous Connection Failover Mechanism re-establishes asynchronous replication channel to a new sender on sender crash/stop or due to network failure.

3.1. ASYNCHRONOUS CONNECTION FAILOVER MECHANISM STATE:

In this worklog a new column 'SOURCE_CONNECTION_AUTO_FAILOVER' will be added to slave_master_info table:

SOURCE_CONNECTION_AUTO_FAILOVER BOOLEAN NOT NULL COMMENT 'Indicates whether the channel is monitored on failures.'

It can be changed through CHANGE MASTER command. The SOURCE_CONNECTION_AUTO_FAILOVER column will have one of the following two values: {0|1}

Example:

enabled: CHANGE MASTER TO SOURCE_CONNECTION_AUTO_FAILOVER = 0 FOR CHANNEL '...';

disabled: CHANGE MASTER TO SOURCE_CONNECTION_AUTO_FAILOVER = 1 FOR CHANNEL '...';

3.2. ASYNCHRONOUS CONNECTION FAILOVER MECHANISM MANAGEMENT:

The Asynchronous Connection Failover Mechanism is enabled or disabled through CHANGE MASTER, but it starts only when channel is failed. When asynchronous replication channel IO thread fails due to sender crash/stop or due to network failure, it checks whether Failover Mechanism is enabled for the channel i.e. checks for SOURCE_CONNECTION_AUTO_FAILOVER value for channel. And if its enabled (SOURCE_CONNECTION_AUTO_FAILOVER = 1) then IO thread would get next sender configuration details having highest weight and would connect with this new sender.

The failed asynchronous replication is retried MASTER_RETRY_COUNT (set by DBA/User) before connecting to a new sender by Asynchronous Connection Failover Mechanism.

4. SECURITY CONTEXT

4.1. REPLICATION CREDENTIALS ON ALL POTENTIAL SENDERS

To use asynchronous replication mechanism feature (and according to NFR1), the DBA/User can add credentials for potential senders for any channel in either of the following way:

  1. The same username and password are configured on the receiver for all potential senders for a given channel.

  2. Certification based authentication, where receiver has its own certificate and key to authenticate against the sender and same user for all the senders of the given channel.

Although case 1 (the same credentials for all senders of the same channel) is simpler, but it strikes as the less robust approach to handling credentials for the entire cluster. So a certificate based authentication (case 2) for replication is recommended.

5. CROSS-VERSION REPLICATION

There is no impact on cross-version replication.

6. UPGRADE/DOWNGRADE

The worklog will introduce a new column in the slave_master_info, but that is not expected to cause problems.

7. PROTOCOL

There are no changes in any protocol.

8. FAILURE MODEL SPECIFICATION

8.1. The single sender and it crashes.

On failure of exisiting asynchronous replication channel, the IO thread checks for another sender. And if it finds that no senders for the channel i.e. no senders configuration details added in mysql.replication_asynchronous_connection_failover table by the DBA/User for the channel, then the asynchronous connection failover mechanism would fail with ER_RPL_ASYNC_RECONNECT_FAIL_NO_SOURCE error.

8.3. UDF Errors

8.3.1. Add sender configuration details validation checks

The UDF asynchronous_connection_failover_add_source would return ER_UDF_ERROR error on following failures:

i. Check that the parameter count is greater than 3.

ERROR 3200 (HY000): asynchronous_connection_failover_add_source UDF failed; Wrong arguments: You must specify all arguments.

ii. Check that the channel name is provided. The channel_name can be empty.

ERROR 3200 (HY000): asynchronous_connection_failover_add_source UDF failed; Wrong arguments: You must specify channel name.

iii. Check host parameters values is provided and not empty.

ERROR 3200 (HY000): asynchronous_connection_failover_add_source UDF failed; Wrong arguments: You must specify hostname.

iv. Check port parameters values is provided.

ERROR 3200 (HY000): asynchronous_connection_failover_add_source UDF failed; Wrong arguments: You must specify value for port.

v. Check the weight parameter value is between 1-100.

ERROR 3200 (HY000): asynchronous_connection_failover_add_source UDF failed; Wrong argument: The weight argument value must be between 1-100.

8.3.2. Delete source configuration details validation checks

The UDF asynchronous_connection_failover_delete_source would return ER_UDF_ERROR error on following failures:

i. Check that the parameter count is greater than 3.

ERROR 3200 (HY000): asynchronous_connection_failover_add_source UDF failed; Wrong arguments: You must specify all arguments.

ii. Check that the channel name is provided. The channel_name can be empty.

ERROR 3200 (HY000): asynchronous_connection_failover_add_source UDF failed; Wrong arguments: You must specify channel name.

iii. Check host parameters values is provided and not empty.

ERROR 3200 (HY000): asynchronous_connection_failover_add_source UDF failed; Wrong arguments: You must specify hostname.

iv. Check port parameters values is provided.

ERROR 3200 (HY000): asynchronous_connection_failover_add_source UDF failed; Wrong arguments: You must specify value for port.

8.3.3. Errors while registering/unregistering asynchronous connection failover UDFs

When mysql server initializes, it will also register asynchronous_connection_failover_add_source() and asynchronous_connection_failover_delete_source() UDFs, so that User/DBA can use them to add/delete source credentials details. And when mysql stops it does unregisters them.

The following errors can occur while registering/unregistering these UDFs:

i. Error acquire registry service. If it fails to acquire registry service which is used for UDF registration, then it fails with ER_UDF_REGISTER_SERVICE_ERROR error.

ii. Error registering asynchronous connection failover UDFs. After acquiring register service UDFs are registered, and if it fails to register these UDFs, then it fails with ER_UDF_REGISTER_ERROR error.

iii. Error unregistering asynchronous connection failover UDFs. And when mysql server stop it also unregisters asynchronous_connection_failover_add_source() and asynchronous_connection_failover_delete_source() UDFs, and if there is error in unregistering it fails with ER_UDF_UNREGISTER_ERROR.

9. USER INTERFACE

9.1. Add sender configuration details

The asynchronous_connection_failover_add_source() UDF can be used to add sender configuration details:

SELECT asynchronous_connection_failover_add_source(channel, host, port, network_namespace, weight);

SELECT asynchronous_connection_failover_add_source('channel_rpl_async', '127.0.0.1', 3310, 'blue', 80);
+-----------------------------------------------------------------------------------------------------+
| asynchronous_connection_failover_add_source('channel_rpl_async', '127.0.0.1', 3310, 'blue', 80)     |
+-----------------------------------------------------------------------------------------------------+
| Source configuration details successfully inserted.                                                 |
+-----------------------------------------------------------------------------------------------------+
1 row in set (0,01 sec)

9.2. Delete source configurations

The asynchronous_connection_failover_delete_source() UDF can be used to delete sender configuration details:

SELECT asynchronous_connection_failover_delete_source(channel, host, port, network_namespace);

SELECT asynchronous_connection_failover_delete_source(channel_rpl_async', '127.0.0.1', 3310, 'blue');
+---------------------------------------------------------------------------------------------------+
| asynchronous_connection_failover_delete_source(channel_rpl_async', '127.0.0.1', 3310, 'blue')     |
+---------------------------------------------------------------------------------------------------+
| Source configuration details successfully deleted.                                                |
+---------------------------------------------------------------------------------------------------+
1 row in set (0,01 sec)

Arguments:

The UDF's arguments used in above section 9.1 and 9.2 has following meaning: channel The replication channel name that connects source and replica.

host             The source hostname that the replica will attempt to
                 switch over the replication connection to in case of a
                 failure.

port             The source port that the replica will attempt to
                 switch over the replication connection to in case of a
                 failure.

network_namespace  The source network namespace that the replica will
                   attempt to switch over the replication connection to in
                   case of a failure. If its value is empty, connections use
                   the default (global) namespace.

weight           The order in which the replica shall try to switch the
                 connection over to when there are failures. Weights can be
                 set to a number between 1 and 100, where 100 is the highest
                 weigth and 1 the lowest.
                 The default value of 50 is used, if weight is not provided
                 for the asynchronous_connection_failover_add_source() UDF.

Return value: A string containing the result of the operation, for example whether it was successful or not.

9.3. Create a reliable channel and add replication credentials

The user needs to provide replication credentials so that asynchronous channel can be setup between sender and receiver and set SOURCE_CONNECTION_AUTO_FAILOVER=1 (to enable asynchronous connection failover mechanism) through CHANGE MASTER. A CHANGE MASTER TO statement employing the SOURCE_CONNECTION_AUTO_FAILOVER option can be executed on a running replica. But the Asynchronous Connection Failover Mechanism can't be enabled i.e. SOURCE_CONNECTION_AUTO_FAILOVER=1 for group replication recovery channel.

CHANGE MASTER TO SOURCE_CONNECTION_AUTO_FAILOVER=1 FOR CHANNEL 'channel_rpl_async';

10. OBSERVABILITY

10.1. ASYNCHRONOUS CONNECTION FAILOVER MECHANISM STATUS

To know asynchronous connection failover mechanism status for the channel, new field/column 'SOURCE_CONNECTION_AUTO_FAILOVER' would be added to slave_master_info and replication_connection_configuration table.

The SOURCE_CONNECTION_AUTO_FAILOVER column/field will have one of the following two states:

1            It is in running state for the asynchronous replication
             channel and would monitor channel failures.
0 (Default)  It is in stopped state and won't monitor failures for the
             asynchronous replication channel.

Example:

SELECT CHANNEL_NAME, SOURCE_CONNECTION_AUTO_FAILOVER FROM replication_connection_configuration FOR CHANNEL_NAME = 'channel_rpl_async';
+--------------------------+----------------------------------+
|       CHANNEL_NAME       | SOURCE_CONNECTION_AUTO_FAILOVER  |
+--------------------------+----------------------------------+
| channel_rpl_async        |                                1 |
+--------------------------+----------------------------------+
1 rows in set (0.04 sec)

10.2. ASYNCHRONOUS CONNECTION FAILOVER SENDER DETAILS

To know the sender configuration details a new performance_schema replication_asynchronous_connection_failover table has been added with this worklog:

CREATE TABLE
performance_schema.replication_asynchronous_connection_failover (
  Channel_name CHAR(64) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL
  COMMENT 'The replication channel name that connects source and replica.',
  Host CHAR(255) CHARACTER SET ASCII NOT NULL COMMENT 'The source hostname 
  that the replica will attempt to switch over the replication connection 
  to in case of a failure.',
  Port INTEGER UNSIGNED NOT NULL COMMENT 'The source port that the replica
  will attempt to switch over the replication connection to in case of a
  failure.',
  Network_namespace VARCHAR(64) COMMENT 'The source network namespace that 
  the replica will attempt to switch over the replication connection to in 
  case of a failure. If its value is empty, connections use the default 
  (global) namespace.',
  Weight TINYINT UNSIGNED NOT NULL COMMENT 'The order in which the replica
  shall try to switch the connection over to when there are failures.
  Weights can be set to a number between 1 and 100, where 100 is the highest
  weigth and 1 the lowest.',
CONSTRAINT CHECK (Weight BETWEEN 1 AND 100),
PRIMARY KEY(Channel_name, Host, Port, Network_namespace))
DEFAULT CHARSET=utf8 STATS_PERSISTENT=0
COMMENT 'The source configuration details';

Interface Specification

Events that have severity higher than information/note, shall be written to the error log as well. The following errors shall be written:

I.1. ER_RPL_ASYNC_RECONNECT_FAIL_NO_SOURCE [SYSTEM]

eng "Failed to automatically re-connect to a different source, for channel '%s', because %s. To remove the error %s."

The arguments for error ER_RPL_ASYNC_RECONNECT_FAIL_NO_SOURCE for Failure Model Specification '8.1. The single sender and it crashes', would be:

  1. channel name
  2. no alternative source is specified
  3. add new source details for the channel

The error message would be as below: Failed to automatically re-connect to a different source, for channel 'channel_name', because no alternative source is specified. To remove the error add new source details for the channel.

I.2. ER_UDF_REGISTER_SERVICE_ERROR [ERROR]

eng "Could not execute the installation of UDF functions. Check for other errors in the log"

I.3. ER_UDF_REGISTER_ERROR [ERROR]

eng "Could not execute the installation of UDF function: %s. Check if the function is already present, if so, try to remove it."

I.4. ER_UDF_UNREGISTER_ERROR [ERROR]

eng "Could not uninstall UDF functions. Try to remove them manually if present."

I.5. ER_RPL_ASYNC_RECONNECT_GTID_MODE_OFF [ERROR]

eng "Failed to enable Asynchronous Replication Connection Failover feature. The Gtid mode must be ON to enable it."

I.6. ER_RPL_ASYNC_RECONNECT_AUTO_POSITION_OFF [ERROR]

eng "Failed to enable Asynchronous Replication Connection Failover feature. The MASTER_AUTO_POSITION option of CHANGE MASTER TO command must be ON to enable it."

11. REPLICATION ASYNCHRONOUS CONNECTION FAILOVER LIFECYCLE

The Asynchronous Connection IO thread goes through various stages of its lifecycle, but we would show stages only which are are relevant to Asynchronous Connection Failover Mechanism.

      start      |----------->[RUNNING]-------------|
      IO thread  |                                  |
      successful |                                  |
                 |    start IO thread fail          V
     ------->[START]-------------------------->[FAILURE]
   IO thread     ^                                  |
                 |      Asynchronous Connection     | SOURCE_CONNECTION_AUTO_FAILOVER=1?
                 |        Failover mechanism        | (11.1 step 4)
 Monitor start   |                                  |
 IO is successful|-----------[NEW SENDER]<-----------
 (11.1. step 5-8)


                  Fig: IO thread lifecycle

The lower half of this IO thread lifecycle diagram where Asynchronous Connection Failover mechanism is invoked on FAILURE, and once initiated it keeps monitoring till Asynchronous connection is successful.

START:  The IO thread is started using `START SLAVE` command by DBA/User or
        by Asynchronous Connection Failover mechanism after existing
        asynchronous connection fails and it tries connecting to new sender.

RUNNING: The IO thread is successfully connected to a sender.

FAILURE: The Asynchronous Connection IO thread fails due to sender
         stop/crash or due to network failure.

NEW SENDER: After failure of existing Asynchronous Connection, the IO
            thread checks if Asynchronous Connection Failover mechanism is
            enabled (SOURCE_CONNECTION_AUTO_FAILOVER=1) and if enabled gets
            a new sender from the Sender List added by DBA/User to
            mysql.replication_asynchronous_connection_failover. The new
            sender will be used for asynchronous connection.

11.1. Asynchronous Replication IO thread failure

On Asynchronous Replication IO thread failure following would happen:

1 The Replication asynchronous connection IO thread fails due to sender stop/crash or due to network failure and not due to executing 'STOP SLAVE' command by DBA/User.

2 The IO thread on failure checks SOURCE_CONNECTION_AUTO_FAILOVER column value for the channel in replication_connection_configuration.

SELECT SOURCE_CONNECTION_AUTO_FAILOVER FROM performance_schema.replication_connection_configuration FOR CHANNEL_NAME = 'ch1';

3 If SOURCE_CONNECTION_AUTO_FAILOVER column value is 0 in replication_connection_configuration for the channel, the IO thread skips asynchronous connection failover mechanism.

4 If SOURCE_CONNECTION_AUTO_FAILOVER column value is 1 in replication_connection_configuration for the channel, then it executes below query to get next sender configuration details:

SELECT host, port, network_namespace, weight
FROM mysql.replication_asynchronous_connection_failover
WHERE channel = 'ch1'
ORDER BY FIELD(host, current_failed_host),
         FIELD(port, current_failed_port),
         FIELD(network_namespace, current_failed_network_namespace),
         weight DESC;

Note: The current_failed_host, current_failed_port and current_failed_network_namespace are connection details of sender for which IO thread failed.

5 Then it picks top result from the query it executed in step 4 and updates host, port and network_namespace details in slave_master_info for the same channel which failed.

6 And starts replica again i.e. executes START SLAVE to reconnect to a new sender.

7 The IO thread monitors if START SLAVE executed in step 6. is successful in connecting to sender.

8 If not successful, it re-executes step 5 with next top result it got from the query it executed in step 4. It continues re-executing step 5 with next top result in an endless loop. When the list of sender is exhausted i.e. it has tried connecting all the senders and none of the connection was successful, it resets the senders list and again starts trying the topmost senders in the sender list. This keeps going in an endless loop till the IO thread is stopped using STOP SLAVE (IO_THREAD).

11.2. On START SLAVE:

When DBA/User executes START SLAVE command, the IO thread does not invoke asynchronous connection failover mechanism, it is only invoked on its failure. The IO thread does monitoring if START SLAVE is initiated by failover mechanism itself, so that on failure next sender can be tried. To know when IO thread starts replica by itself, please check section 11.1. step 5 to step 8.

If START SLAVE is not successful, it will pick next result and update host, port and network_namespace details in slave_master_info and would try to connect with this new sender. It will keep on trying with new host, port and network_namespace till either all senders are tried or asynchronous replication is connected.

11.3. On STOP SLAVE:

When DBA/User executes STOP SLAVE command, the IO thread does not invoke asynchronous connection failover mechanism. The failover mechanism is only invoked when it is failed due to sender crash/stop or due to network failure, and not when IO thread is stopped delibrately by DBA/User.

STOP SLAVE FOR CHANNEL 'ch1';
STOP SLAVE IO_THREAD FOR CHANNEL 'ch1';