WL#9131: Add RETRIES field to replication_applier_status_by_worker P_S table

Affects: Server-8.0   —   Status: Complete

Executive Summary
=================
A MySQL replication applier can retry a transaction that failed because of a
transient error (see:
https://dev.mysql.com/doc/refman/5.7/en/replication-options-slave.html#sysvar_slave_transaction_retries).
Since MySQL 5.7.5, a multi-threaded slave can also retry transactions, but there
is no way to monitor how many times a worker has tried to apply the current
transaction.
The existing LAST_ERROR_* columns in the performance schema table
replication_applier_status_by_worker only display information about errors that
cause the applier thread to stop (a non-transient error, or a transient error
that was retried until the retry limit was reached). This means that it is not
possible to use performance_schema to monitor transient errors while a
transaction is being applied. Similarly, if a transaction was applied
successfully after hitting transient errors, the performance_schema tables do
not show any information about these errors.
This worklog introduces eight new columns in the performance schema table
replication_applier_status_by_worker that report the number of retries for the
last transaction applied by the worker and the transaction the worker is
currently applying, and the number, message and timestamp of the last transient
error causing a retry.
The worklog also changes the behaviour of the column
APPLYING_TRANSACTION_START_APPLY_TIMESTAMP. Instead of being reset every time
the transaction is retried, it keeps the timestamp of the moment the worker
first started applying the transaction.

User story
==========
As a MySQL user, I want to be able to tell whether a replica (whether it is a
slave in a replication topology or a member of a group) is lagging because
transactions are being retried due to transient errors, and which error caused
the retries.

Functional Requirements
=======================

F1. The table performance_schema.replication_applier_status_by_worker shall
present updated information for the number of retries and last transient error
number, message and timestamp for the last applied and currently applying
transaction for each worker.

F2. LAST_TRANSIENT_ERROR_NUMBER shall be zero if there were no retries.

F3. LAST_TRANSIENT_ERROR_MESSAGE shall be empty if there were no retries.

F4. LAST_TRANSIENT_ERROR_TIMESTAMP shall show "0000-00-00 00:00:00.000000" if
there were no retries.

F5. RETRIES_COUNT shall be zero if there were no retries.

F6. If RETRIES_COUNT is larger than zero, LAST_TRANSIENT_ERROR_NUMBER shall not
be zero.

F7. If RETRIES_COUNT is larger than zero, LAST_TRANSIENT_ERROR_MESSAGE shall not
be empty.

F8. If RETRIES_COUNT is larger than zero, LAST_TRANSIENT_ERROR_TIMESTAMP shall
display a valid timestamp, larger than START_APPLY_TIMESTAMP and smaller than
END_APPLY_TIMESTAMP.

F9. The persistence of this information shall follow the same rules as the other
LAST_APPLIED and CURRENTLY_APPLYING columns.

F10. This information shall be available in both single-threaded slave and
multi-threaded slave modes.

F11. If the performance schema instrumentation is not enabled,
LAST_TRANSIENT_ERROR_TIMESTAMP shall not be collected.

F12. When a transaction is retried, i.e., when APPLYING_TRANSACTION_RETRIES_COUNT
is incremented, APPLYING_TRANSACTION_START_APPLY_TIMESTAMP shall not be updated.
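
Requirements F2 through F8 can be read as a consistency rule over one group of
retry columns. The following is a minimal sketch of such a check, not server
code: the RetryColumns struct, its field names and the function names are
hypothetical, timestamps are modelled as microseconds since the epoch with 0
standing for "0000-00-00 00:00:00.000000", and performance schema
instrumentation is assumed to be enabled (see F11).

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical mirror of the retry-related columns for one transaction
// (either the LAST_APPLIED_TRANSACTION_* or the APPLYING_TRANSACTION_* group).
// Timestamps are microseconds since the epoch; 0 stands for the zero timestamp.
struct RetryColumns {
  unsigned long retries_count = 0;
  unsigned int last_transient_error_number = 0;
  std::string last_transient_error_message;
  std::uint64_t last_transient_error_timestamp = 0;
  std::uint64_t start_apply_timestamp = 0;
  std::uint64_t end_apply_timestamp = 0;  // 0 while the transaction is applying
};

// Returns true when the row is consistent with F2-F8.
bool satisfies_retry_requirements(const RetryColumns &row) {
  if (row.retries_count == 0) {
    // F2, F3, F4: no retries means no transient error information.
    return row.last_transient_error_number == 0 &&
           row.last_transient_error_message.empty() &&
           row.last_transient_error_timestamp == 0;
  }
  // F6, F7: a retry always leaves an error number and message behind.
  if (row.last_transient_error_number == 0) return false;
  if (row.last_transient_error_message.empty()) return false;
  // F8: the error happened strictly after the apply started ...
  if (row.last_transient_error_timestamp <= row.start_apply_timestamp)
    return false;
  // ... and strictly before it ended, if it has ended at all.
  if (row.end_apply_timestamp != 0 &&
      row.last_transient_error_timestamp >= row.end_apply_timestamp)
    return false;
  return true;
}

// Debug helper: aborts if a row violates the requirements.
void check_row(const RetryColumns &row) {
  assert(satisfies_retry_requirements(row));
}
```

A test for this worklog could run a check of this kind against every row read
from the table, once for each column group.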

Non-Functional Requirements
===========================

NF1. This worklog shall not introduce more than 5% regression in performance.

Eight new columns will be added to
performance_schema.replication_applier_status_by_worker:

LAST_APPLIED_TRANSACTION_RETRIES_COUNT (ulong) - The number of times the last
                                  applied transaction was retried by the worker.
LAST_APPLIED_TRANSACTION_LAST_TRANSIENT_ERROR_NUMBER (uint) - The number of the
                                  last transient error.
LAST_APPLIED_TRANSACTION_LAST_TRANSIENT_ERROR_MESSAGE (varchar) - The message of
                                  the last transient error.
LAST_APPLIED_TRANSACTION_LAST_TRANSIENT_ERROR_TIMESTAMP (timestamp) - The
                                  timestamp of the last transient error.
APPLYING_TRANSACTION_RETRIES_COUNT (ulong) - The number of times the
                                  transaction that is currently being applied
                                  was retried until this moment.
APPLYING_TRANSACTION_LAST_TRANSIENT_ERROR_NUMBER (uint) - The number of the last
                                  transient error.
APPLYING_TRANSACTION_LAST_TRANSIENT_ERROR_MESSAGE (varchar) - The message of the
                                  last transient error.
APPLYING_TRANSACTION_LAST_TRANSIENT_ERROR_TIMESTAMP (timestamp) - The timestamp
                                  of the last transient error.
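
As an illustration of how a monitoring client might read these columns, the
sketch below assembles a SELECT statement for one of the two column groups. The
function name build_retry_monitoring_query is hypothetical. CHANNEL_NAME and
WORKER_ID are columns the table already has; the remaining column names are the
ones listed above. The sketch only builds the statement text and does not
connect to a server.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Builds a query over the retry columns of either the transaction being
// applied (currently_applying == true) or the last applied transaction.
std::string build_retry_monitoring_query(bool currently_applying) {
  const std::string prefix =
      currently_applying ? "APPLYING_TRANSACTION" : "LAST_APPLIED_TRANSACTION";
  // Both column groups share the same four suffixes.
  const std::vector<std::string> suffixes = {
      "RETRIES_COUNT", "LAST_TRANSIENT_ERROR_NUMBER",
      "LAST_TRANSIENT_ERROR_MESSAGE", "LAST_TRANSIENT_ERROR_TIMESTAMP"};
  assert(suffixes.size() == 4);

  std::string query = "SELECT CHANNEL_NAME, WORKER_ID";
  for (const std::string &suffix : suffixes) {
    query += ", " + prefix + "_" + suffix;
  }
  query += " FROM performance_schema.replication_applier_status_by_worker";
  return query;
}
```

Calling build_retry_monitoring_query(true) yields a statement that reports, per
worker, how often the transaction currently being applied has been retried and
which transient error caused the most recent retry.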

1) This worklog only adds new columns to an existing performance schema table,
so no new files will be added.

2) Add two new fields to Trx_monitoring_info:
 Error last_transient_error
 ulong transaction_retries

3) Add a new method to Gtid_monitoring_info:
void Gtid_monitoring_info::store_transient_error(
    uint transient_errno_arg, const char *transient_err_message_arg,
    ulong trans_retries_arg) {
  atomic_lock();
  // Record the retry count and error number for the transaction in progress.
  processing_trx->transaction_retries = trans_retries_arg;
  processing_trx->last_transient_error.number = transient_errno_arg;
  // Copy at most MAX_SLAVE_ERRMSG - 1 characters of the error message.
  snprintf(processing_trx->last_transient_error.message,
           sizeof(processing_trx->last_transient_error.message), "%.*s",
           MAX_SLAVE_ERRMSG - 1, transient_err_message_arg);
  processing_trx->last_transient_error.update_timestamp();
  atomic_unlock();
}
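
The method above depends on server internals, so it cannot be compiled on its
own. The following self-contained sketch models the same idea: store the retry
count, error number, a bounded copy of the message and a timestamp under a
lock. It is an approximation under stated assumptions: the RetryMonitor class,
the TransientError struct and the kMaxErrMsg constant are hypothetical
stand-ins, a std::mutex replaces atomic_lock()/atomic_unlock(), and the buffer
size of 1024 is assumed rather than taken from MAX_SLAVE_ERRMSG.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <mutex>
#include <string>

// Assumed message buffer size for this sketch only.
constexpr std::size_t kMaxErrMsg = 1024;

// Hypothetical stand-in for the error information kept per transaction.
struct TransientError {
  unsigned int number = 0;
  char message[kMaxErrMsg] = {0};
  std::chrono::system_clock::time_point timestamp{};
};

// Hypothetical, simplified model of the monitoring object.
class RetryMonitor {
 public:
  void store_transient_error(unsigned int err_no, const char *err_message,
                             unsigned long retries) {
    assert(err_message != nullptr);
    std::lock_guard<std::mutex> guard(lock_);
    retries_ = retries;
    last_error_.number = err_no;
    // snprintf writes at most sizeof(message) - 1 characters plus the
    // terminating null, so an oversized message is truncated, not overrun.
    std::snprintf(last_error_.message, sizeof(last_error_.message), "%s",
                  err_message);
    last_error_.timestamp = std::chrono::system_clock::now();
  }

  unsigned long retries() const {
    std::lock_guard<std::mutex> guard(lock_);
    return retries_;
  }
  unsigned int error_number() const {
    std::lock_guard<std::mutex> guard(lock_);
    return last_error_.number;
  }
  std::string error_message() const {
    std::lock_guard<std::mutex> guard(lock_);
    return std::string(last_error_.message);
  }

 private:
  mutable std::mutex lock_;
  unsigned long retries_ = 0;
  TransientError last_error_;
};
```

Updating all three values under a single lock means a reader never observes a
new retry count paired with the error details of an earlier retry.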

4) Add a new method to Relay_log_info:
void retry_processing(uint transient_errno_arg,
                      const char *transient_err_message_arg,
                      ulong trans_retries_arg) {
  gtid_processing->store_transient_error(transient_errno_arg,
                                         transient_err_message_arg,
                                         trans_retries_arg);
}
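
To show how a retry loop could drive this bookkeeping while honouring F12, here
is a minimal sketch. It is not the applier's actual loop: ApplyingInfo,
ApplyResult and apply_with_retries are hypothetical names, every failure is
assumed to be transient, and the clock is passed in as a function so that the
behaviour is deterministic. The start timestamp is written once, before the
first attempt, and is never touched again when a retry is recorded.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>

// Hypothetical view of the APPLYING_TRANSACTION_* retry columns.
struct ApplyingInfo {
  std::uint64_t start_apply_timestamp = 0;
  unsigned long retries_count = 0;
  unsigned int last_transient_error_number = 0;
  std::string last_transient_error_message;
  std::uint64_t last_transient_error_timestamp = 0;
};

// Hypothetical outcome of one attempt to apply a transaction.
struct ApplyResult {
  bool ok;
  unsigned int error_number;
  std::string error_message;
};

// Tries to apply a transaction, retrying up to max_retries times.
// Returns true if an attempt succeeds, false once the retry limit is reached.
bool apply_with_retries(const std::function<ApplyResult()> &try_apply,
                        const std::function<std::uint64_t()> &clock,
                        unsigned long max_retries, ApplyingInfo &info) {
  info = ApplyingInfo{};
  info.start_apply_timestamp = clock();  // set once, before the first attempt
  for (;;) {
    const ApplyResult result = try_apply();
    if (result.ok) return true;
    if (info.retries_count >= max_retries) return false;  // give up
    // Record the retry; start_apply_timestamp is deliberately left alone.
    ++info.retries_count;
    info.last_transient_error_number = result.error_number;
    info.last_transient_error_message = result.error_message;
    info.last_transient_error_timestamp = clock();
    assert(info.last_transient_error_timestamp >= info.start_apply_timestamp);
  }
}
```

With this shape, the difference between the current time and
start_apply_timestamp covers every attempt, so a transaction that keeps
retrying shows up as a long-running apply rather than as a series of short
ones.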