MySQL 8.0.22 introduces Automatic Asynchronous Replication connection failover. This feature makes MySQL asynchronous replication fault-tolerant to source connection failure, by automating the process of re-establishment of a replication connection to another available source. If there are source servers in multiple data centers, the user can also configure the replica to fail over between them, to provide Disaster Recovery from failures of an entire data center. The availability of a database can be maintained by keeping an extra copy of the data in a standby replica that receives it continuously from the source using replication. But, as always, there could be cases where the connection to the source fails and the stream of replicated data is interrupted.
In order to explain how the source failure would be handled with this feature, let’s consider the following scenario: there are two sites – New York and London, as shown in Fig 1 below, connected through asynchronous replication channel C2. The servers (S1, S2) in New York site are also connected through replication channel C1 (asynchronous or Group Replication for automatic fail-over), which means that S2 has a copy of the data on S1.
The server (R1) in London site is configured with S1 and S2’s source connection details, so when connection C2 fails (Fig 2) between S1 and R1, R1 establishes a new connection C3 between S2 and R1, as shown in Fig 3.
In the next section, we will walk you through the configuration and usage of the feature.
Usage
Prerequisites
Before enabling the asynchronous connection failover feature, all servers need to use gtid_mode=ON, and the replica that uses asynchronous connection failover must have configured MASTER_AUTO_POSITION=1 with CHANGE MASTER TO.
1. Create a replication channel
On the replica server, create a replication channel ‘ch1’ with the replication credentials of the source servers.
1 |
replica> CHANGE MASTER TO MASTER_USER='rpl_user', MASTER_PASSWORD='password', MASTER_HOST='node1', MASTER_PORT=3306, MASTER_RETRY_COUNT=2 FOR CHANNEL 'ch1'; |
Note: The default values of MASTER_RETRY_COUNT (86400 times), and MASTER_CONNECT_RETRY (60 seconds), were designed for a connection to a single source, and will make the replica retry the same connection for 60 days. When these attempts are exhausted, the asynchronous connection failover mechanism takes over. So to ensure that the asynchronous connection failover mechanism can be activated promptly, set MASTER_RETRY_COUNT to a minimal number that just allows a few retry attempts with the same source, in case the connection failure is caused by a transient network outage.
2. Add the alternate source configuration details for the asynchronous replication channel
The feature is controlled on the replica. The UDF asynchronous_connection_failover_add_source() can be used to add connection configuration for an alternate source.
1 |
SELECT asynchronous_connection_failover_add_source(channel, host, port, network_namespace, weight); |
The arguments for this UDF have the following meanings:
- channel/host/port/network_namespace: The replication channel connection attributes specifying how to connect to this alternate source.
- weight: When multiple alternate sources are specified, the failover operation will try to connect to source having highest weight. If weight is not specified, it defaults to 50.
On replica, configure two sources:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
|
replica> SELECT asynchronous_connection_failover_add_source('ch1', 'node1', 3306, '', 80); +-----------------------------------------------------------------------------------------------------+ | asynchronous_connection_failover_add_source('ch1', 'node1', 3306, '', 80) | +-----------------------------------------------------------------------------------------------------+ | Source configuration details successfully inserted. | +-----------------------------------------------------------------------------------------------------+ 1 row in set (0,01 sec) replica> SELECT asynchronous_connection_failover_add_source('ch1', 'node2', 3306, '', 70); +-----------------------------------------------------------------------------------------------------+ | asynchronous_connection_failover_add_source('ch1', 'node2', 3306, '', 70) | +-----------------------------------------------------------------------------------------------------+ | Source configuration details successfully inserted. | +-----------------------------------------------------------------------------------------------------+ 1 row in set (0,01 sec) |
The values inserted can be verified in the replication_asynchronous_connection_failover table of the performance_schema database.
1
2
3
4
5
6
7
8
|
replica> SELECT * FROM performance_schema.replication_asynchronous_connection_failover; +--------------+-----------+------+-------------------+--------+ | CHANNEL_NAME | HOST | PORT | NETWORK_NAMESPACE | WEIGHT | +--------------+-----------+------+-------------------+--------+ | ch1 | node1 | 3306 | | 80 | | ch1 | node2 | 3306 | | 70 | +--------------+-----------+------+-------------------+--------+ 2 rows in set (0.00 sec) |
3. Enable the feature
Enable the feature for the replication channel ‘ch1’ using CHANGE MASTER TO SOURCE_CONNECTION_AUTO_FAILOVER=1. This statement can be used when the replication threads are running.
1
2
3
4
5
6
7
8
9
10
|
replica> CHANGE MASTER TO SOURCE_CONNECTION_AUTO_FAILOVER=1 FOR CHANNEL 'ch1'; Query OK, 0 rows affected (0.12 sec) replica> SELECT CHANNEL_NAME, SOURCE_CONNECTION_AUTO_FAILOVER FROM performance_schema.replication_connection_configuration WHERE CHANNEL_NAME = 'ch1'; +--------------------------+----------------------------------+ | CHANNEL_NAME | SOURCE_CONNECTION_AUTO_FAILOVER | +--------------------------+----------------------------------+ | ch1 | 1 | +--------------------------+----------------------------------+ 1 row in set (0.04 sec) |
4. Start replica channel
Start the replication channel ‘ch1’ on the replica server, and check its I/O thread status:
1
2
3
4
5
6
7
8
9
10
|
replica> START REPLICA FOR CHANNEL 'ch1'; Query OK, 0 rows affected, 1 warning (0.03 sec) replica> select CHANNEL_NAME, SOURCE_UUID, SERVICE_STATE from performance_schema.replication_connection_status\G *************************** 1. row *************************** CHANNEL_NAME: ch1 SOURCE_UUID: 14d730cf-200e-11eb-9d08-0010e0734796 SERVICE_STATE: ON 1 row in set (0.00 sec) |
5. Stop mysqld on source1 server
To see the asynchronous connection failure mechanism working, log in to the host machine of source1 (node1 in this example), and stop the source server.
1 |
node1$ systemctl stop mysqld.service |
6. Verify that the replication channel has automatically failed over, that is, connected to source2
1
2
3
4
5
6
|
replica> select CHANNEL_NAME, SOURCE_UUID, SERVICE_STATE from performance_schema.replication_connection_status\G *************************** 1. row *************************** CHANNEL_NAME: ch1 SOURCE_UUID: 6c39cb5b-2000-11eb-9783-0010e0734796 SERVICE_STATE: ON 1 row in set (0.00 sec) |
Before failing over to the next configured alternate source, the receiver thread on the replica will attempt to connect to the current source MASTER_RETRY_COUNT times (which is 2 in this case), as configured by CHANGE MASTER TO command in the first step. See the error log below.
1
2
3
4
5
6
|
2020-11-06T08:57:47.078756Z 37 [ERROR] [MY-010584] [Repl] Slave I/O for channel '': error connecting to master 'root@node1:3306' - retry-time: 1 retries: 1 message: Can't connect to MySQL server on 'node1' (111), Error_code: MY-002003 2020-11-06T08:57:48.079089Z 37 [ERROR] [MY-010584] [Repl] Slave I/O for channel '': error connecting to master 'root@node1:3306' - retry-time: 1 retries: 2 message: Can't connect to MySQL server on 'node1' (111), Error_code: MY-002003 2020-11-06T08:57:48.079145Z 37 [Note] [MY-010563] [Repl] Slave I/O thread for channel '' killed while connecting to master 2020-11-06T08:57:49.081452Z 37 [Warning] [MY-010897] [Repl] Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information. 2020-11-06T08:57:49.083093Z 37 [System] [MY-010562] [Repl] Slave I/O thread for channel '': connected to master 'root@node1:3306',replication started in log 'FIRST' at position 642 2020-11-06T08:57:49.086941Z 37 [Warning] [MY-010549] [Repl] The master's UUID has changed, although this should not happen unless you have changed it manually. The old UUID was 14d730cf-200e-11eb-9d08-0010e0734796. |
Note: Asynchronous connection failover kicks in only when a replication connection fails. So even if an old source which has a higher weightage comes back online, the replication connection is not re-connected to the old source. Similarly in our example, even if source1 comes back online (mysqld is started) after some time, the asynchronous replication connection will stay connected to source2 till the connection fails or the user stops it.
7. Delete source configurations
If you later need to delete one or more alternate sources, use the asynchronous_connection_failover_delete_source() UDF on the replica.
1 |
SELECT asynchronous_connection_failover_delete_source(channel, host, port, network_namespace); |
Example:
1
2
3
4
5
6
7
|
mysql> SELECT asynchronous_connection_failover_delete_source('ch1', 'node1', 3306, ''); +---------------------------------------------------------------------------------------------------+ | asynchronous_connection_failover_delete_source('ch1', 'node1', 3306, '') | +---------------------------------------------------------------------------------------------------+ | Source configuration details successfully deleted. | +---------------------------------------------------------------------------------------------------+ 1 row in set (0,01 sec) |
MySQL Router
Please note that this feature does not fully replace the MySQL Router
when it acts as the replication source endpoint in front of a MySQL
InnoDB Cluster. Most notably, this feature has the following
limitations:
- it does not automatically learn about new members or members
that are removed - it does not follow the primary role, it stays connected to whatever
host it was connected to - it does not follow the majority network partition
- it does not care if a host is not part of the group any longer, as
long as it can connect, it will
So, please, be aware that this feature does not replace the router
in those use cases scenarios that must observe such requirements.
Conclusion
We showed how to configure replicas to enable asynchronous connection fail over between multiple source servers. When a source or the connection to it fails, this feature allows the replica, even in different data center, to automatically connect to a new source seamlessly. Hope you will enjoy this feature, and that it provides improved fault tolerance for your cross-datacenter replication channels.
See documentation for more details on how to configure and use the feature, and see worklog for the technical specification.