MySQL :: Automatic Asynchronous Replication Connection Failover

MySQL 8.0.22 introduces Automatic Asynchronous Replication connection failover. This feature makes MySQL asynchronous replication fault-tolerant to source connection failure, by automating the process of re-establishment of a replication connection to another available source. If there are source servers in multiple data centers, the user can also configure the replica to fail over between them, to provide Disaster Recovery from failures of an entire data center. The availability of a database can be maintained by keeping an extra copy of the data in a standby replica that receives it continuously from the source using replication. But, as always, there could be cases where the connection to the source fails and the stream of replicated data is interrupted.

In order to explain how the source failure would be handled with this feature, let’s consider the following scenario: there are two sites – New York and London, as shown in Fig 1 below, connected through asynchronous replication channel C2. The servers (S1, S2) in New York site are also connected through replication channel C1 (asynchronous or Group Replication for automatic fail-over), which means that S2 has a copy of the data on S1.

The server (R1) in London site is configured with S1 and S2’s source connection details, so when connection C2 fails (Fig 2) between S1 and R1, R1 establishes a new connection C3 between S2 and R1, as shown in Fig 3.

In the next section, we will walk you through the configuration and usage of the feature.

Usage

Prerequisites

Before enabling the asynchronous connection failover feature, all servers need to use gtid_mode=ON, and the replica that uses asynchronous connection failover must have configured MASTER_AUTO_POSITION=1 with CHANGE MASTER TO.

1. Create a replication channel

On the replica server, create a replication channel ‘ch1’ with the replication credentials of the source servers.

replica> CHANGE MASTER TO MASTER_USER='rpl_user', MASTER_PASSWORD='password', MASTER_HOST='node1', MASTER_PORT=3306, MASTER_RETRY_COUNT=2 FOR CHANNEL 'ch1';

1	replica> CHANGE MASTER TO MASTER_USER='rpl_user', MASTER_PASSWORD='password', MASTER_HOST='node1', MASTER_PORT=3306, MASTER_RETRY_COUNT=2 FOR CHANNEL 'ch1';

Note: The default values of MASTER_RETRY_COUNT (86400 times), and MASTER_CONNECT_RETRY (60 seconds), were designed for a connection to a single source, and will make the replica retry the same connection for 60 days. When these attempts are exhausted, the asynchronous connection failover mechanism takes over. So to ensure that the asynchronous connection failover mechanism can be activated promptly, set MASTER_RETRY_COUNT to a minimal number that just allows a few retry attempts with the same source, in case the connection failure is caused by a transient network outage.

2. Add the alternate source configuration details for the asynchronous replication channel

The feature is controlled on the replica. The UDF asynchronous_connection_failover_add_source() can be used to add connection configuration for an alternate source.

SELECT asynchronous_connection_failover_add_source(channel, host, port, network_namespace, weight);

1	SELECT asynchronous_connection_failover_add_source(channel, host, port, network_namespace, weight);

The arguments for this UDF have the following meanings:

channel/host/port/network_namespace: The replication channel connection attributes specifying how to connect to this alternate source.
weight: When multiple alternate sources are specified, the failover operation will try to connect to source having highest weight. If weight is not specified, it defaults to 50.

On replica, configure two sources:

replica> SELECT asynchronous_connection_failover_add_source('ch1', 'node1', 3306, '', 80);
+-----------------------------------------------------------------------------------------------------+
| asynchronous_connection_failover_add_source('ch1', 'node1', 3306, '', 80)                   |
+-----------------------------------------------------------------------------------------------------+
| Source configuration details successfully inserted.                                                 |
+-----------------------------------------------------------------------------------------------------+
1 row in set (0,01 sec)

replica> SELECT asynchronous_connection_failover_add_source('ch1', 'node2', 3306, '', 70);
+-----------------------------------------------------------------------------------------------------+
| asynchronous_connection_failover_add_source('ch1', 'node2', 3306, '', 70)                   |
+-----------------------------------------------------------------------------------------------------+
| Source configuration details successfully inserted.                                                 |
+-----------------------------------------------------------------------------------------------------+
1 row in set (0,01 sec)

replica> SELECT asynchronous_connection_failover_add_source('ch1', 'node1', 3306, '', 80);

+-----------------------------------------------------------------------------------------------------+

| asynchronous_connection_failover_add_source('ch1', 'node1', 3306, '', 80) |

+-----------------------------------------------------------------------------------------------------+

| Source configuration details successfully inserted. |

+-----------------------------------------------------------------------------------------------------+

1 row in set (0,01 sec)

replica> SELECT asynchronous_connection_failover_add_source('ch1', 'node2', 3306, '', 70);

+-----------------------------------------------------------------------------------------------------+

| asynchronous_connection_failover_add_source('ch1', 'node2', 3306, '', 70) |

+-----------------------------------------------------------------------------------------------------+

| Source configuration details successfully inserted. |

+-----------------------------------------------------------------------------------------------------+

1 row in set (0,01 sec)

The values inserted can be verified in the replication_asynchronous_connection_failover table of the performance_schema database.

replica> SELECT * FROM performance_schema.replication_asynchronous_connection_failover;
+--------------+-----------+------+-------------------+--------+
| CHANNEL_NAME | HOST      | PORT | NETWORK_NAMESPACE | WEIGHT |
+--------------+-----------+------+-------------------+--------+
| ch1          | node1     | 3306 |                   |     80 |
| ch1          | node2     | 3306 |                   |     70 |
+--------------+-----------+------+-------------------+--------+
2 rows in set (0.00 sec)

replica> SELECT * FROM performance_schema.replication_asynchronous_connection_failover;

+--------------+-----------+------+-------------------+--------+

+--------------+-----------+------+-------------------+--------+

| ch1 | node1 | 3306 | | 80 |

| ch1 | node2 | 3306 | | 70 |

+--------------+-----------+------+-------------------+--------+

2 rows in set (0.00 sec)

3. Enable the feature

Enable the feature for the replication channel ‘ch1’ using CHANGE MASTER TO SOURCE_CONNECTION_AUTO_FAILOVER=1. This statement can be used when the replication threads are running.

replica> CHANGE MASTER TO SOURCE_CONNECTION_AUTO_FAILOVER=1 FOR CHANNEL 'ch1';
Query OK, 0 rows affected (0.12 sec)

replica> SELECT CHANNEL_NAME, SOURCE_CONNECTION_AUTO_FAILOVER FROM performance_schema.replication_connection_configuration WHERE CHANNEL_NAME = 'ch1';
+--------------------------+----------------------------------+
|       CHANNEL_NAME       | SOURCE_CONNECTION_AUTO_FAILOVER  |
+--------------------------+----------------------------------+
| ch1                      |                                1 |
+--------------------------+----------------------------------+
1 row in set (0.04 sec)

replica> CHANGE MASTER TO SOURCE_CONNECTION_AUTO_FAILOVER=1 FOR CHANNEL 'ch1';

Query OK, 0 rows affected (0.12 sec)

replica> SELECT CHANNEL_NAME, SOURCE_CONNECTION_AUTO_FAILOVER FROM performance_schema.replication_connection_configuration WHERE CHANNEL_NAME = 'ch1';

+--------------------------+----------------------------------+

| CHANNEL_NAME | SOURCE_CONNECTION_AUTO_FAILOVER |

+--------------------------+----------------------------------+

| ch1 | 1 |

+--------------------------+----------------------------------+

1 row in set (0.04 sec)

4. Start replica channel

Start the replication channel ‘ch1’ on the replica server, and check its I/O thread status:

replica> START REPLICA FOR CHANNEL 'ch1';
Query OK, 0 rows affected, 1 warning (0.03 sec)

replica> select CHANNEL_NAME, SOURCE_UUID, SERVICE_STATE  from performance_schema.replication_connection_status\G
*************************** 1. row ***************************
 CHANNEL_NAME: ch1
  SOURCE_UUID: 14d730cf-200e-11eb-9d08-0010e0734796
SERVICE_STATE: ON
1 row in set (0.00 sec)

replica> START REPLICA FOR CHANNEL 'ch1';

Query OK, 0 rows affected, 1 warning (0.03 sec)

replica> select CHANNEL_NAME, SOURCE_UUID, SERVICE_STATE from performance_schema.replication_connection_status\G

*************************** 1. row ***************************

CHANNEL_NAME: ch1

SOURCE_UUID: 14d730cf-200e-11eb-9d08-0010e0734796

SERVICE_STATE: ON

1 row in set (0.00 sec)

5. Stop mysqld on source1 server

To see the asynchronous connection failure mechanism working, log in to the host machine of source1 (node1 in this example), and stop the source server.

node1$ systemctl stop mysqld.service

1	node1$ systemctl stop mysqld.service

6. Verify that the replication channel has automatically failed over, that is, connected to source2

replica> select CHANNEL_NAME, SOURCE_UUID, SERVICE_STATE from performance_schema.replication_connection_status\G
*************************** 1. row ***************************
CHANNEL_NAME: ch1
SOURCE_UUID: 6c39cb5b-2000-11eb-9783-0010e0734796
SERVICE_STATE: ON
1 row in set (0.00 sec)

replica> select CHANNEL_NAME, SOURCE_UUID, SERVICE_STATE from performance_schema.replication_connection_status\G

*************************** 1. row ***************************

CHANNEL_NAME: ch1

SOURCE_UUID: 6c39cb5b-2000-11eb-9783-0010e0734796

SERVICE_STATE: ON

1 row in set (0.00 sec)

Before failing over to the next configured alternate source, the receiver thread on the replica will attempt to connect to the current source MASTER_RETRY_COUNT times (which is 2 in this case), as configured by CHANGE MASTER TO command in the first step. See the error log below.

2020-11-06T08:57:47.078756Z 37 [ERROR] [MY-010584] [Repl] Slave I/O for channel '': error connecting to master 'root@node1:3306' - retry-time: 1 retries: 1 message: Can't connect to MySQL server on 'node1' (111), Error_code: MY-002003
2020-11-06T08:57:48.079089Z 37 [ERROR] [MY-010584] [Repl] Slave I/O for channel '': error connecting to master 'root@node1:3306' - retry-time: 1 retries: 2 message: Can't connect to MySQL server on 'node1' (111), Error_code: MY-002003
2020-11-06T08:57:48.079145Z 37 [Note] [MY-010563] [Repl] Slave I/O thread for channel '' killed while connecting to master
2020-11-06T08:57:49.081452Z 37 [Warning] [MY-010897] [Repl] Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.
2020-11-06T08:57:49.083093Z 37 [System] [MY-010562] [Repl] Slave I/O thread for channel '': connected to master 'root@node1:3306',replication started in log 'FIRST' at position 642
2020-11-06T08:57:49.086941Z 37 [Warning] [MY-010549] [Repl] The master's UUID has changed, although this should not happen unless you have changed it manually. The old UUID was 14d730cf-200e-11eb-9d08-0010e0734796.

2020-11-06T08:57:47.078756Z 37 [ERROR] [MY-010584] [Repl] Slave I/O for channel '': error connecting to master 'root@node1:3306' - retry-time: 1 retries: 1 message: Can't connect to MySQL server on 'node1' (111), Error_code: MY-002003

2020-11-06T08:57:48.079089Z 37 [ERROR] [MY-010584] [Repl] Slave I/O for channel '': error connecting to master 'root@node1:3306' - retry-time: 1 retries: 2 message: Can't connect to MySQL server on 'node1' (111), Error_code: MY-002003

2020-11-06T08:57:48.079145Z 37 [Note] [MY-010563] [Repl] Slave I/O thread for channel '' killed while connecting to master

2020-11-06T08:57:49.081452Z 37 [Warning] [MY-010897] [Repl] Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.

2020-11-06T08:57:49.083093Z 37 [System] [MY-010562] [Repl] Slave I/O thread for channel '': connected to master 'root@node1:3306',replication started in log 'FIRST' at position 642

2020-11-06T08:57:49.086941Z 37 [Warning] [MY-010549] [Repl] The master's UUID has changed, although this should not happen unless you have changed it manually. The old UUID was 14d730cf-200e-11eb-9d08-0010e0734796.

Note: Asynchronous connection failover kicks in only when a replication connection fails. So even if an old source which has a higher weightage comes back online, the replication connection is not re-connected to the old source. Similarly in our example, even if source1 comes back online (mysqld is started) after some time, the asynchronous replication connection will stay connected to source2 till the connection fails or the user stops it.

7. Delete source configurations

If you later need to delete one or more alternate sources, use the asynchronous_connection_failover_delete_source() UDF on the replica.

SELECT asynchronous_connection_failover_delete_source(channel, host, port, network_namespace);

1	SELECT asynchronous_connection_failover_delete_source(channel, host, port, network_namespace);

Example:

mysql> SELECT asynchronous_connection_failover_delete_source('ch1', 'node1', 3306, '');
+---------------------------------------------------------------------------------------------------+
| asynchronous_connection_failover_delete_source('ch1', 'node1', 3306, '')                  |
+---------------------------------------------------------------------------------------------------+
| Source configuration details successfully deleted.                                                |
+---------------------------------------------------------------------------------------------------+
1 row in set (0,01 sec)

mysql> SELECT asynchronous_connection_failover_delete_source('ch1', 'node1', 3306, '');

+---------------------------------------------------------------------------------------------------+

| asynchronous_connection_failover_delete_source('ch1', 'node1', 3306, '') |

+---------------------------------------------------------------------------------------------------+

| Source configuration details successfully deleted. |

+---------------------------------------------------------------------------------------------------+

1 row in set (0,01 sec)

MySQL Router

Please note that this feature does not fully replace the MySQL Router
when it acts as the replication source endpoint in front of a MySQL
InnoDB Cluster. Most notably, this feature has the following
limitations:

it does not automatically learn about new members or members
that are removed
it does not follow the primary role, it stays connected to whatever
host it was connected to
it does not follow the majority network partition
it does not care if a host is not part of the group any longer, as
long as it can connect, it will

So, please, be aware that this feature does not replace the router
in those use cases scenarios that must observe such requirements.

Conclusion

We showed how to configure replicas to enable asynchronous connection fail over between multiple source servers. When a source or the connection to it fails, this feature allows the replica, even in different data center, to automatically connect to a new source seamlessly. Hope you will enjoy this feature, and that it provides improved fault tolerance for your cross-datacenter replication channels.

See documentation for more details on how to configure and use the feature, and see worklog for the technical specification.