MySQL :: Extending replication instrumentation: an insight on transaction retries

MySQL 8.0.13 improves replication lag monitoring by extending the instrumentation for transaction transient errors. These temporary errors, which include lock timeouts caused by client transactions executing concurrently as the slave is replicating, do not stop the applier thread: instead, they cause a transaction to retry.

In this new release, we add eight new columns to the Performance Schema table replication_applier_status_by_worker reporting for both the last applied and the currently applying transaction:
– the number of times the transaction was retried by the worker after the first attempt (if the transaction was applied at the first attempt, this number is zero);
– the error number of the last transient error that caused the transaction to be retried;
– the message text for the last transient error that caused the transaction to be retried;
– the timestamp, in ‘YYYY-MM-DD HH:MM:SS[.fraction]’ format, for the last transient error that caused the transaction to be retried.

Until MySQL 8.0.12, in addition to checking the error log, we could monitor the number of retries that took place on each replication channel in the COUNT_TRANSACTIONS_RETRIES column in the Performance Schema table replication_applier_status, but this approach was not very effective for a multi-threaded slave. Additionally, existing LAST_ERROR_* columns in the performance schema table replication_applier_status_by_worker only display information about errors which lead the applier thread to stop. Now, we can easily monitor both transient and non-transient errors for the currently applying and last applied transactions directly from the Performance Schema table replication_applier_status_by_worker.

Let’s see this in action!

Considering that a master-slave replication topology has already been setup and the variable slave_transaction_retries set to a high number, for instance 10, let’s create a table on the master.

[on master]
CREATE TABLE t1 (a INT UNIQUE);

1 2	[on master] CREATE TABLE t1 (a INT UNIQUE);

After the table is replicated on the slave, we start a transaction on the slave to lock the table

[on slave]
BEGIN;
INSERT INTO t1 VALUES(1);

[on slave]

BEGIN;

INSERT INTO t1 VALUES(1);

and go back to the master to execute and replicate a transaction that writes the exact same row, so that the slave will be forced to retry applying this transaction.

[on master]
INSERT INTO t1 VALUES(1);

1 2	[on master] INSERT INTO t1 VALUES(1);

We can now check the new fields in performance schema table replication_applier_status_by_worker, which will tell us that the applying transaction has already been retried twice because of error 1205 and when the last retry was started.

SELECT * FROM performance_schema.replication_applier_status_by_worker;
CHANNEL_NAME channel_1
WORKER_ID 1
THREAD_ID 50
SERVICE_STATE ON
LAST_ERROR_NUMBER 0
LAST_ERROR_MESSAGE
LAST_ERROR_TIMESTAMP 0000-00-00 00:00:00.000000
LAST_APPLIED_TRANSACTION
LAST_APPLIED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP 0000-00-00 00:00:00.000000
LAST_APPLIED_TRANSACTION_IMMEDIATE_COMMIT_TIMESTAMP 0000-00-00 00:00:00.000000
LAST_APPLIED_TRANSACTION_START_APPLY_TIMESTAMP 0000-00-00 00:00:00.000000
LAST_APPLIED_TRANSACTION_END_APPLY_TIMESTAMP 0000-00-00 00:00:00.000000
APPLYING_TRANSACTION 12da8c69-cd30-11e8-93d0-0010e0734796:2
APPLYING_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP 2018-10-11 11:31:41.606029
APPLYING_TRANSACTION_IMMEDIATE_COMMIT_TIMESTAMP 2018-10-11 11:31:41.606029
APPLYING_TRANSACTION_START_APPLY_TIMESTAMP 2018-10-11 11:31:41.607262
LAST_APPLIED_TRANSACTION_RETRIES_COUNT 0
LAST_APPLIED_TRANSACTION_LAST_TRANSIENT_ERROR_NUMBER 0
LAST_APPLIED_TRANSACTION_LAST_TRANSIENT_ERROR_MESSAGE
LAST_APPLIED_TRANSACTION_LAST_TRANSIENT_ERROR_TIMESTAMP 0000-00-00 00:00:00.000000
APPLYING_TRANSACTION_RETRIES_COUNT 2
APPLYING_TRANSACTION_LAST_TRANSIENT_ERROR_NUMBER 1205
APPLYING_TRANSACTION_LAST_TRANSIENT_ERROR_MESSAGE Lock wait timeout exceeded; try restarting transaction
APPLYING_TRANSACTION_LAST_TRANSIENT_ERROR_TIMESTAMP 2018-10-11 11:31:48.649632

SELECT * FROM performance_schema.replication_applier_status_by_worker;

CHANNEL_NAME channel_1

WORKER_ID 1

THREAD_ID 50

SERVICE_STATE ON

LAST_ERROR_NUMBER 0

LAST_ERROR_MESSAGE

LAST_ERROR_TIMESTAMP 0000-00-00 00:00:00.000000

LAST_APPLIED_TRANSACTION

LAST_APPLIED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP 0000-00-00 00:00:00.000000

LAST_APPLIED_TRANSACTION_IMMEDIATE_COMMIT_TIMESTAMP 0000-00-00 00:00:00.000000

LAST_APPLIED_TRANSACTION_START_APPLY_TIMESTAMP 0000-00-00 00:00:00.000000

LAST_APPLIED_TRANSACTION_END_APPLY_TIMESTAMP 0000-00-00 00:00:00.000000

APPLYING_TRANSACTION 12da8c69-cd30-11e8-93d0-0010e0734796:2

APPLYING_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP 2018-10-11 11:31:41.606029

APPLYING_TRANSACTION_IMMEDIATE_COMMIT_TIMESTAMP 2018-10-11 11:31:41.606029

APPLYING_TRANSACTION_START_APPLY_TIMESTAMP 2018-10-11 11:31:41.607262

LAST_APPLIED_TRANSACTION_RETRIES_COUNT 0

LAST_APPLIED_TRANSACTION_LAST_TRANSIENT_ERROR_NUMBER 0

LAST_APPLIED_TRANSACTION_LAST_TRANSIENT_ERROR_MESSAGE

LAST_APPLIED_TRANSACTION_LAST_TRANSIENT_ERROR_TIMESTAMP 0000-00-00 00:00:00.000000

APPLYING_TRANSACTION_RETRIES_COUNT 2

APPLYING_TRANSACTION_LAST_TRANSIENT_ERROR_NUMBER 1205

APPLYING_TRANSACTION_LAST_TRANSIENT_ERROR_MESSAGE Lock wait timeout exceeded; try restarting transaction

APPLYING_TRANSACTION_LAST_TRANSIENT_ERROR_TIMESTAMP 2018-10-11 11:31:48.649632

Now let’s rollback the slave local transaction, so that the one being replicated from the master can commit.

[on slave]
ROLLBACK;

1 2	[on slave] ROLLBACK;

After the replicated transaction is finally applied, we can see that it was retried six times, the last transient error it experienced and when the last retry started. Also note that since the error in the applying transaction was transient, the LAST_ERROR fields are still clear.

SELECT * FROM performance_schema.replication_applier_status_by_worker;
CHANNEL_NAME channel_1
WORKER_ID 1
THREAD_ID 50
SERVICE_STATE ON
LAST_ERROR_NUMBER 0
LAST_ERROR_MESSAGE
LAST_ERROR_TIMESTAMP 0000-00-00 00:00:00.000000
LAST_APPLIED_TRANSACTION 12da8c69-cd30-11e8-93d0-0010e0734796:2
LAST_APPLIED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP 2018-10-11 11:31:41.606029
LAST_APPLIED_TRANSACTION_IMMEDIATE_COMMIT_TIMESTAMP 2018-10-11 11:31:41.606029
LAST_APPLIED_TRANSACTION_START_APPLY_TIMESTAMP 2018-10-11 11:31:41.607262
LAST_APPLIED_TRANSACTION_END_APPLY_TIMESTAMP 2018-10-11 11:32:19.682475
APPLYING_TRANSACTION
APPLYING_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP 0000-00-00 00:00:00.000000
APPLYING_TRANSACTION_IMMEDIATE_COMMIT_TIMESTAMP 0000-00-00 00:00:00.000000
APPLYING_TRANSACTION_START_APPLY_TIMESTAMP 0000-00-00 00:00:00.000000
LAST_APPLIED_TRANSACTION_RETRIES_COUNT 6
LAST_APPLIED_TRANSACTION_LAST_TRANSIENT_ERROR_NUMBER 1205
LAST_APPLIED_TRANSACTION_LAST_TRANSIENT_ERROR_MESSAGE Lock wait timeout exceeded; try restarting transaction
LAST_APPLIED_TRANSACTION_LAST_TRANSIENT_ERROR_TIMESTAMP 2018-10-11 11:32:14.652626
APPLYING_TRANSACTION_RETRIES_COUNT 0
APPLYING_TRANSACTION_LAST_TRANSIENT_ERROR_NUMBER 0
APPLYING_TRANSACTION_LAST_TRANSIENT_ERROR_MESSAGE
APPLYING_TRANSACTION_LAST_TRANSIENT_ERROR_TIMESTAMP 0000-00-00 00:00:00.000000

SELECT * FROM performance_schema.replication_applier_status_by_worker;

CHANNEL_NAME channel_1

WORKER_ID 1

THREAD_ID 50

SERVICE_STATE ON

LAST_ERROR_NUMBER 0

LAST_ERROR_MESSAGE

LAST_ERROR_TIMESTAMP 0000-00-00 00:00:00.000000

LAST_APPLIED_TRANSACTION 12da8c69-cd30-11e8-93d0-0010e0734796:2

LAST_APPLIED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP 2018-10-11 11:31:41.606029

LAST_APPLIED_TRANSACTION_IMMEDIATE_COMMIT_TIMESTAMP 2018-10-11 11:31:41.606029

LAST_APPLIED_TRANSACTION_START_APPLY_TIMESTAMP 2018-10-11 11:31:41.607262

LAST_APPLIED_TRANSACTION_END_APPLY_TIMESTAMP 2018-10-11 11:32:19.682475

APPLYING_TRANSACTION

APPLYING_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP 0000-00-00 00:00:00.000000

APPLYING_TRANSACTION_IMMEDIATE_COMMIT_TIMESTAMP 0000-00-00 00:00:00.000000

APPLYING_TRANSACTION_START_APPLY_TIMESTAMP 0000-00-00 00:00:00.000000

LAST_APPLIED_TRANSACTION_RETRIES_COUNT 6

LAST_APPLIED_TRANSACTION_LAST_TRANSIENT_ERROR_NUMBER 1205

LAST_APPLIED_TRANSACTION_LAST_TRANSIENT_ERROR_MESSAGE Lock wait timeout exceeded; try restarting transaction

LAST_APPLIED_TRANSACTION_LAST_TRANSIENT_ERROR_TIMESTAMP 2018-10-11 11:32:14.652626

APPLYING_TRANSACTION_RETRIES_COUNT 0

APPLYING_TRANSACTION_LAST_TRANSIENT_ERROR_NUMBER 0

APPLYING_TRANSACTION_LAST_TRANSIENT_ERROR_MESSAGE

APPLYING_TRANSACTION_LAST_TRANSIENT_ERROR_TIMESTAMP 0000-00-00 00:00:00.000000

The information in START_APPLY_TIMESTAMP fields was also refined in this release. Prior to 8.0.13, this timestamp was refreshed every time the worker retried to apply a transaction due to a transient error. Now, START_APPLY_TIMESTAMP is set to when the worker started its first attempt to apply the transaction, as you can see in the examples above. This means the time elapsed between START_APPLY_TIMESTAMP and END_APPLY_TIMESTAMP matches the total time the worker spent applying a transaction, including retries.

This sums up the improvements in monitoring replication lag in MySQL 8.0.13. We look forward to hearing your feedback!