WL#9131: Add RETRIES field to replication_applier_status_by_worker P_S table
Affects: Server-8.0
—
Status: Complete
Executive Summary
=================

A MySQL replication applier can retry a transaction that failed because of a transient error (see: https://dev.mysql.com/doc/refman/5.7/en/replication-options-slave.html#sysvar_slave_transaction_retries). Since MySQL 5.7.5, a multi-threaded slave can also retry transactions, but there is no way to monitor how many times a worker has tried to apply the current transaction.

The existing LAST_ERROR_* columns in the performance schema table replication_applier_status_by_worker only display information about errors that cause the applier thread to stop (a non-transient error, or a transient error that kept failing until the retry limit was reached). This means that performance_schema cannot be used to monitor transient errors while a transaction is being applied. Similarly, if a transaction was applied successfully after hitting transient errors, the performance_schema tables show no information about those errors.

This worklog introduces eight new columns in the performance schema table replication_applier_status_by_worker. They report the number of retries for the last transaction applied by the worker and for the transaction the worker is currently applying, together with the number, message and timestamp of the last transient error that caused a retry.

The worklog also changes the behaviour of the column APPLYING_TRANSACTION_START_APPLY_TIMESTAMP: instead of being reset every time the transaction is retried, it keeps the timestamp of when the worker first started applying the transaction.

User story
==========

As a MySQL user, I want to be able to tell whether a replica (whether it is a slave in a replication topology or a member of a group) is lagging because transactions are being retried due to transient errors and, if so, which error caused the retries.
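The retry behaviour described in the executive summary can be pictured with a small sketch. It is not the server's applier code: the names ApplyResult, ApplyOutcome and apply_with_retries are hypothetical, and the exact point at which the retry limit is checked is an assumption. The sketch only illustrates why a transaction that eventually succeeds leaves nothing in the LAST_ERROR_* columns, while the retry counter and the last transient error would still carry information.

```cpp
#include <cassert>
#include <functional>

// Hypothetical result of one attempt to apply a transaction.
struct ApplyResult {
  bool ok;                    // the attempt succeeded
  bool transient;             // on failure: is the error transient?
  unsigned int error_number;  // on failure: the error number
};

// Hypothetical summary of what monitoring could observe afterwards.
struct ApplyOutcome {
  bool applied;                       // transaction was finally applied
  unsigned long retries;              // how many times it was retried
  unsigned int last_transient_error;  // last error that caused a retry
  unsigned int last_error;            // error that stopped the applier, if any
};

// Retries a failing attempt while the error is transient and the retry
// limit (an assumed stand-in for slave_transaction_retries) is not reached.
ApplyOutcome apply_with_retries(const std::function<ApplyResult()> &try_apply,
                                unsigned long retry_limit) {
  ApplyOutcome out{false, 0, 0, 0};
  for (;;) {
    const ApplyResult r = try_apply();
    if (r.ok) {
      out.applied = true;
      return out;  // success: nothing is recorded as a stopping error
    }
    if (!r.transient || out.retries >= retry_limit) {
      out.last_error = r.error_number;  // the applier stops with this error
      return out;
    }
    out.last_transient_error = r.error_number;  // remember why we retry
    ++out.retries;
  }
}
```

In this sketch, a transaction that fails twice with a transient error and then succeeds ends with applied set, retries equal to 2 and last_error still zero, which is exactly the situation that was invisible before the new columns.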
Functional Requirements
=======================

F1. The table performance_schema.replication_applier_status_by_worker shall present updated information for the number of retries and last transient error number, message and timestamp for the last applied and currently applying transaction for each worker.

F2. LAST_TRANSIENT_ERROR_NUMBER shall be zero if there were no retries.

F3. LAST_TRANSIENT_ERROR_MESSAGE shall be empty if there were no retries.

F4. LAST_TRANSIENT_ERROR_TIMESTAMP shall show "0000-00-00 00:00:00.000000" if there were no retries.

F5. RETRIES_COUNT shall be zero if there were no retries.

F6. If RETRIES_COUNT is larger than zero, LAST_TRANSIENT_ERROR_NUMBER shall not be zero.

F7. If RETRIES_COUNT is larger than zero, LAST_TRANSIENT_ERROR_MESSAGE shall not be empty.

F8. If RETRIES_COUNT is larger than zero, LAST_TRANSIENT_ERROR_TIMESTAMP shall display a valid timestamp, larger than START_APPLY_TIMESTAMP and smaller than END_APPLY_TIMESTAMP.

F9. The persistence of this information shall follow the same rules as the other LAST_APPLIED and CURRENTLY_APPLYING columns.

F10. This information shall be available in both single-threaded slave and multi-threaded slave modes.

F11. If the performance schema instrumentation is not enabled, LAST_TRANSIENT_ERROR_TIMESTAMP shall not be collected.

F12. When a transaction is retried, i.e., when APPLYING_TRANSACTION_RETRIES_COUNT is incremented, APPLYING_TRANSACTION_START_APPLY_TIMESTAMP shall not be updated.

Non-Functional Requirements
===========================

NF1. This worklog shall not introduce more than 5% regression in performance.
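Requirements F2 to F8 describe invariants between the retry counter and the transient error fields, and they can be expressed as a small consistency check. The following is a minimal sketch, not server or test-suite code: the struct RetryColumns and the function satisfies_retry_requirements are hypothetical names, timestamps are modelled as microseconds with 0 standing for the zero timestamp, and the sketch assumes that instrumentation is enabled (see F11) and that an end timestamp of 0 means the transaction is still being applied.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical snapshot of the retry-related columns of one transaction.
// Timestamps are microseconds; 0 models "0000-00-00 00:00:00.000000".
struct RetryColumns {
  unsigned long retries_count = 0;
  unsigned int last_transient_error_number = 0;
  std::string last_transient_error_message;
  std::uint64_t last_transient_error_timestamp = 0;
  std::uint64_t start_apply_timestamp = 0;
  std::uint64_t end_apply_timestamp = 0;  // assumption: 0 while still applying
};

// Returns true when the snapshot is consistent with F2 to F8.
bool satisfies_retry_requirements(const RetryColumns &c) {
  if (c.retries_count == 0) {
    // F2, F3, F4: no retries means no transient error information.
    return c.last_transient_error_number == 0 &&
           c.last_transient_error_message.empty() &&
           c.last_transient_error_timestamp == 0;
  }
  // F6, F7: retries imply a non-zero error number and a non-empty message.
  if (c.last_transient_error_number == 0) return false;
  if (c.last_transient_error_message.empty()) return false;
  // F8: the error happened after the transaction started being applied...
  if (c.last_transient_error_timestamp <= c.start_apply_timestamp) return false;
  // ...and before it finished, when an end timestamp is available.
  if (c.end_apply_timestamp != 0 &&
      c.last_transient_error_timestamp >= c.end_apply_timestamp) {
    return false;
  }
  return true;
}

// Usage example with illustrative values only.
void check_retry_requirement_examples() {
  RetryColumns never_retried;
  assert(satisfies_retry_requirements(never_retried));

  RetryColumns retried;
  retried.retries_count = 2;
  retried.last_transient_error_number = 1213;  // ER_LOCK_DEADLOCK
  retried.last_transient_error_message = "Deadlock found";
  retried.start_apply_timestamp = 100;
  retried.last_transient_error_timestamp = 150;
  retried.end_apply_timestamp = 200;
  assert(satisfies_retry_requirements(retried));
}
```

A check of this shape could be applied to both the LAST_APPLIED_TRANSACTION_* and the APPLYING_TRANSACTION_* column groups, since F9 states that both follow the same rules.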
Eight new columns will be added to performance_schema.replication_applier_status_by_worker:

LAST_APPLIED_TRANSACTION_RETRIES_COUNT (ulong) - The number of times the last applied transaction was retried by the worker.

LAST_APPLIED_TRANSACTION_LAST_TRANSIENT_ERROR_NUMBER (uint) - The number of the last transient error.

LAST_APPLIED_TRANSACTION_LAST_TRANSIENT_ERROR_MESSAGE (varchar) - The message of the last transient error.

LAST_APPLIED_TRANSACTION_LAST_TRANSIENT_ERROR_TIMESTAMP (timestamp) - The timestamp of the last transient error.

APPLYING_TRANSACTION_RETRIES_COUNT (ulong) - The number of times the transaction that is currently being applied was retried until this moment.

APPLYING_TRANSACTION_LAST_TRANSIENT_ERROR_NUMBER (uint) - The number of the last transient error.

APPLYING_TRANSACTION_LAST_TRANSIENT_ERROR_MESSAGE (varchar) - The message of the last transient error.

APPLYING_TRANSACTION_LAST_TRANSIENT_ERROR_TIMESTAMP (timestamp) - The timestamp of the last transient error.
1) Since this worklog only adds new columns to an already existing performance schema table, no new files will be added.

2) Add two new fields to Trx_monitoring_info:

    Error last_transient_error
    ulong transaction_retries

3) Add a new method to Gtid_monitoring_info:

    void Gtid_monitoring_info::store_transient_error(uint transient_errno_arg,
                                                     const char *transient_err_message_arg,
                                                     ulong trans_retries_arg)
    {
      atomic_lock();
      /* Record the retry count and the transient error on the transaction
         that is currently being processed. */
      processing_trx->transaction_retries = trans_retries_arg;
      processing_trx->last_transient_error.number = transient_errno_arg;
      snprintf(processing_trx->last_transient_error.message,
               sizeof(processing_trx->last_transient_error.message),
               "%.*s", MAX_SLAVE_ERRMSG - 1, transient_err_message_arg);
      processing_trx->last_transient_error.update_timestamp();
      atomic_unlock();
    }

4) Add a new method to Relay_log_info:

    void retry_processing(uint transient_errno_arg,
                          const char *transient_err_message_arg,
                          ulong trans_retries_arg)
    {
      /* Forward the transient error to the GTID monitoring information. */
      gtid_processing->store_transient_error(transient_errno_arg,
                                             transient_err_message_arg,
                                             trans_retries_arg);
    }
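To make the interaction between the new fields and the unchanged start timestamp easier to follow, here is a self-contained sketch of the same idea outside the server. It is an illustration under stated assumptions, not the server implementation: TransientErrorSketch, TrxInfoSketch and MonitoringSketch are hypothetical stand-ins for Error, Trx_monitoring_info and Gtid_monitoring_info, a std::mutex replaces atomic_lock()/atomic_unlock(), and the buffer size kMaxErrMsgSketch is an assumed stand-in for MAX_SLAVE_ERRMSG.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <mutex>

// Assumed stand-in for the server's MAX_SLAVE_ERRMSG.
constexpr std::size_t kMaxErrMsgSketch = 1024;

// Hypothetical stand-in for the Error type used by last_transient_error.
struct TransientErrorSketch {
  unsigned int number = 0;
  char message[kMaxErrMsgSketch] = {0};
  std::uint64_t timestamp_us = 0;

  // Stores the current wall-clock time in microseconds.
  void update_timestamp() {
    using namespace std::chrono;
    timestamp_us = static_cast<std::uint64_t>(
        duration_cast<microseconds>(system_clock::now().time_since_epoch())
            .count());
  }
};

// Hypothetical stand-in for Trx_monitoring_info with the two new fields.
struct TrxInfoSketch {
  std::uint64_t start_apply_timestamp_us = 0;
  TransientErrorSketch last_transient_error;
  unsigned long transaction_retries = 0;
};

// Hypothetical stand-in for Gtid_monitoring_info.
class MonitoringSketch {
 public:
  // Called once, when the worker first starts applying a transaction.
  void start_applying(std::uint64_t now_us) {
    std::lock_guard<std::mutex> guard(lock_);
    processing_trx_ = TrxInfoSketch{};
    processing_trx_.start_apply_timestamp_us = now_us;
  }

  // Called on every retry; mirrors store_transient_error() above.
  void store_transient_error(unsigned int transient_errno_arg,
                             const char *transient_err_message_arg,
                             unsigned long trans_retries_arg) {
    assert(transient_err_message_arg != nullptr);
    std::lock_guard<std::mutex> guard(lock_);
    processing_trx_.transaction_retries = trans_retries_arg;
    processing_trx_.last_transient_error.number = transient_errno_arg;
    std::snprintf(processing_trx_.last_transient_error.message,
                  sizeof(processing_trx_.last_transient_error.message),
                  "%.*s", static_cast<int>(kMaxErrMsgSketch - 1),
                  transient_err_message_arg);
    processing_trx_.last_transient_error.update_timestamp();
    // start_apply_timestamp_us is deliberately left untouched (F12).
  }

  // Returns a consistent copy of the current transaction's information.
  TrxInfoSketch snapshot() const {
    std::lock_guard<std::mutex> guard(lock_);
    return processing_trx_;
  }

 private:
  mutable std::mutex lock_;
  TrxInfoSketch processing_trx_;
};
```

A caller would invoke start_applying() once per transaction and store_transient_error() once per retry, so that the retry counter and the last transient error change while the start timestamp keeps its original value.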
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.