Once the replication system is set up and running, it is common to want to verify that the data is being replicated correctly on the slaves. Under normal circumstances, the same data is expected on the master and its slaves (excluding the use of filtering rules). Nevertheless, faults at the data level can introduce inconsistent changes on servers without raising any kind of error. These data inconsistencies can result from bugs, hardware malfunction, human error, or unauthorized access.
It is desirable to detect these issues in order to fix them and, ultimately, prevent them from happening again. Determining the cause of such issues might not be easy, since they can result from byzantine failures at distinct levels. However, the first big step toward a solution to this kind of problem is being able to detect data inconsistencies and make sure that the data among the replication servers is synchronized.
The goal is to execute the mysqlrplsync utility to detect data consistency issues on an active replication system making sure that the master and its slaves are synchronized.
Executing this task manually on an active system is difficult and sometimes tedious, since changes may be continuously happening on all servers (in an asynchronous way) and the same data needs to be compared between servers. Moreover, it can have an undesirable and uncontrolled impact on system performance if you lock tables or stop replication.
Fortunately, the mysqlrplsync utility allows us to perform this task in an easy and optimized way with a controlled impact on the running system (limiting the execution time of all operations).
Let's assume that a replication topology with one master (server1:3310) and two slaves (server2:3311, server3:3312) was previously set up and is running without errors.
Start the mysqlrplsync utility, specifying the servers you want to check.
mysqlrplsync --master=user:pass@localhost:3310 \
             --slaves=rpl:pass@localhost:3311,rpl:pass@localhost:3312
#
# GTID differences between Master and Slaves:
# - Slave 'localhost@3311' is 15 transactions behind Master.
# - Slave 'localhost@3312' is 12 transactions behind Master.
#
# Checking data consistency.
#
# Using Master 'localhost@3310' as base server for comparison.
# Checking 'test_rplsync_db' database...
# - Checking 't0' table data...
#   [OK] `test_rplsync_db`.`t0` checksum for server 'localhost@3311'.
#   [OK] `test_rplsync_db`.`t0` checksum for server 'localhost@3312'.
# - Checking 't1' table data...
#   [OK] `test_rplsync_db`.`t1` checksum for server 'localhost@3311'.
#   [OK] `test_rplsync_db`.`t1` checksum for server 'localhost@3312'.
# Checking 'test_db' database...
# - Checking 't0' table data...
#   [OK] `test_db`.`t0` checksum for server 'localhost@3311'.
#   [OK] `test_db`.`t0` checksum for server 'localhost@3312'.
# - Checking 't1' table data...
#   [OK] `test_db`.`t1` checksum for server 'localhost@3311'.
#   [OK] `test_db`.`t1` checksum for server 'localhost@3312'.
#
# ...done.
#
# SUMMARY: No data consistency issue found.
#
The above example illustrates how to start the mysqlrplsync utility to check if all data on the specified replication topology is synchronized.
To do this, we simply need to specify the master's connection with the --master option, and the list of slaves with the --slaves option. As an alternative to the --slaves option, one can use the --discover-slaves-login option, specifying a user and password (or login-path) to connect to the slaves; the utility then attempts to discover all of the slaves connected to the master using the specified login. For example, '--discover-slaves-login=root:secret' is used to discover all of the slaves and log in to each using the 'root' user id and the password 'secret'. Slave discovery can be very handy, especially if there is a huge number of slaves in the topology, but bear in mind that the explicit specification of slaves is safer and that discovery can fail to find some servers. In particular, it is important to note that in order for slaves to be discovered, they must be started with the '--report-host' and '--report-port' options set to appropriate values, and they must be correctly connected to the master (IO thread running); otherwise, discovery may fail to identify the slave.
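As a sketch, assuming the same local topology as in the earlier example and that a 'root' account with password 'secret' exists on every slave, a discovery-based invocation looks like this:

```shell
# Check consistency without listing the slaves explicitly; the utility
# discovers the slaves registered on the master (they must have been
# started with --report-host and --report-port, IO thread running).
mysqlrplsync --master=user:pass@localhost:3310 \
             --discover-slaves-login=root:secret
```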
In the above example, no data consistency issues were found. If any data differences are found, each is clearly identified by the '[DIFF]' prefix, followed by concise information about where the difference is and what it consists of. Additionally, at the end, the utility displays a summary of the number of issues found.
The utility also allows users to check consistency on the slaves without specifying the master. However, be advised that only checking the slaves does not guarantee that there is no data consistency issue between the master and the slaves. Also keep in mind that the results provided by the utility are valid at the time the checks are actually performed for each table: in an active system with data continuously changing, inconsistency issues might be introduced in the instant after a check is completed.
The user for the master must have permissions to lock tables, perform the checksum, and get information about the master status. Specifically, the user used to connect to the master requires the following privileges: SUPER or REPLICATION CLIENT, LOCK TABLES, and SELECT.
The user for the slaves must have permissions to start/stop the slave, perform the checksum, and get information about the slave status. More specifically, the login user to connect to slaves requires the following privileges: SUPER and SELECT.
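For illustration, minimal GRANT statements covering these privileges might look like the following. The account names 'user'@'localhost' and 'rpl'@'localhost' are assumptions matching the earlier example; adjust hosts and scopes to your own setup.

```shell
# On the master: checksum (SELECT), LOCK TABLES, and master status
# (REPLICATION CLIENT). Account name 'user'@'localhost' is assumed.
mysql -h localhost -P 3310 -u root -p -e \
  "GRANT REPLICATION CLIENT, LOCK TABLES, SELECT ON *.* TO 'user'@'localhost';"

# On each slave: start/stop the slave (SUPER), checksum and slave
# status (SELECT). Account name 'rpl'@'localhost' is assumed.
mysql -h localhost -P 3311 -u root -p -e \
  "GRANT SUPER, SELECT ON *.* TO 'rpl'@'localhost';"
```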
In the above example, the mysqlrplsync utility was used to check all the data on the servers. However, it is possible to check only specific databases and tables. For that purpose, the user needs only to specify the target databases and tables as arguments when invoking the utility. It is also possible to exclude specific databases and tables from the check using the --exclude option. For example, '--exclude=test_rplsync_db,test_db.t0' excludes the database 'test_rplsync_db' and the table 'test_db.t0' from the check performed by the utility.
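As a sketch based on the example topology above, checking only the 'test_db' database while excluding one of its tables might look like this:

```shell
# Check only 'test_db', but skip 'test_db.t0'. Databases and tables
# to include are positional arguments; exclusions use --exclude.
mysqlrplsync --master=user:pass@localhost:3310 \
             --slaves=rpl:pass@localhost:3311,rpl:pass@localhost:3312 \
             --exclude=test_db.t0 test_db
```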
The utility provides important options to control the execution time of the checksum queries performed on each table and the time to wait for slaves to reach an established synchronization point, namely: the --checksum-timeout and --rpl-timeout options. A polling process is applied on each slave to periodically check if replication has caught up with the defined sync point (all transactions have been processed). The periodic interval at which this check is performed can be adjusted with the --interval option. These options are fundamental to control the impact of the execution of the utility on the replication system, allowing you to limit the execution time of the checksum queries for large tables and the time slaves wait for replication to catch up. When the timeouts defined by those options are reached, the check is skipped. Nevertheless, the user can always execute the utility again later, only for the skipped tables, using higher timeout values.
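A sketch of a run with tighter control over execution time follows; the timeout values are arbitrary assumptions chosen for illustration, not recommended settings.

```shell
# Give each checksum query at most 10 seconds, wait up to 60 seconds
# for slaves to reach the sync point, and poll every 2 seconds.
mysqlrplsync --master=user:pass@localhost:3310 \
             --slaves=rpl:pass@localhost:3311,rpl:pass@localhost:3312 \
             --checksum-timeout=10 --rpl-timeout=60 --interval=2
```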
The utility also provides the flexibility to be executed separately for different sets of servers, affecting only part of the replication system each time. For example, consider a heterogeneous system (where the slaves have different performance characteristics) with one master 'M' and three slaves 'S1', 'S2', and 'S3'. To minimize the impact on the master, the user can run the utility first for the master 'M' and the fastest slave 'S1', and then run it again only for the slaves 'S1', 'S2', and 'S3'. If no consistency issues are found in the first execution (M = S1) and in the second execution (S1 = S2 = S3), then by transitivity, and because the same server 'S1' is included in both checks, it can be concluded that there are no consistency issues in the topology (M = S1 = S2 = S3) at the time the first check was completed. This kind of execution must be performed sequentially and not concurrently; otherwise, the synchronization process of each instance may affect the other and the checks may not work properly.
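The two-step check described above can be sketched as follows; the hostnames and ports are placeholders (assumptions), and 'S1' appears in both runs to provide the transitive link.

```shell
# Step 1: compare the master against the fastest slave only.
mysqlrplsync --master=user:pass@M_host:3310 \
             --slaves=rpl:pass@S1_host:3311

# Step 2: compare the slaves among themselves, with S1 as the link.
# Run this only after step 1 finishes; concurrent runs may interfere
# with each other's synchronization points.
mysqlrplsync --slaves=rpl:pass@S1_host:3311,rpl:pass@S2_host:3312,rpl:pass@S3_host:3313
```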