WL#15620: MTA: extended applier statistics
EXECUTIVE SUMMARY
This worklog aims at enhancing observability of replication. It gathers statistics across the replication pipeline and presents them in new PFS tables. Also, this work unifies the way statistics are gathered and presented to the user (replacing the logging of applier metrics into the error log). These will facilitate observability, troubleshooting and forensics, which is extremely important when running and operating a MySQL service.
From 9.1.0, this feature is available in Enterprise Edition.
From 9.7.0, this feature is available in Community and all other editions.
USER/DEV STORIES
U1. As a MySQL service operator,
- I want to query a table to get metrics of wait times, wait counts,
transaction counts, event counts, and byte counts for the applier
coordinator and worker threads,
- so that I build Grafana dashboards exposing the exact behavior and
bottlenecks of the replication applier given the current workload running
through the system.
SCOPE
Introduce new tables that expose the relevant metrics.
Remove existing text written to the error log, which is the wrong place to output performance statistics. Moreover, that output contains some meaningless metrics and lacks some important ones.
LIMITATIONS
The feature does not work with replica_parallel_workers=0.
Functional Requirements
--- For metrics grouped per channel
FR1. performance_schema.replication_applier_metrics must report the coordinator total active time duration since its measurements were restarted. If the applier is stopped and started then all periods are summed.
This information is reset when the replica server restarts. This information
is not reset when the applier stops.
The name of the field shall be TOTAL_ACTIVE_TIME_DURATION.
FR2. performance_schema.replication_applier_metrics must report the last time (since server start) when the channel's applier was started.
This information is reset each time the applier for a channel is started.
The name of the field shall be LAST_APPLIER_START.
FR3. performance_schema.replication_applier_metrics must report the number of transactions committed since the last time the metrics were reset.
This information is reset when the replica server restarts or the user
issues RESET REPLICA. This information is not reset when the applier stops.
The name of the field shall be TRANSACTIONS_COMMITTED_COUNT.
FR4. performance_schema.replication_applier_metrics must report the number of transactions currently being worked on.
This information is reset when the replica server restarts or the user
issues RESET REPLICA. This information is not reset when the applier stops.
A transaction is marked as ongoing when its first event is
scheduled on a worker; the counter decreases when the
transaction commits.
The name of the field shall be TRANSACTIONS_ONGOING_COUNT.
FR5. performance_schema.replication_applier_metrics must report the number of transactions waiting to be applied.
This information is reset when the replica server restarts or the user
issues RESET REPLICA. This information is not reset when the applier stops.
A transaction is pending when it is queued in the relay log and
stops being pending when it is committed. Transactions that are
ongoing are also counted as pending.
If a server is restarted and relay logs exist, the number of
pending transactions is only known when all old relay logs are
consumed. Until then the shown number is NULL.
The name of the field shall be TRANSACTIONS_PENDING_COUNT.
FR6. performance_schema.replication_applier_metrics must report the total size of transactions committed in bytes since the last time the metrics were reset.
This information is reset when the replica server restarts or the user
issues RESET REPLICA. This information is not reset when the applier stops.
The name of the field shall be TRANSACTIONS_COMMITTED_SIZE_BYTES_SUM.
FR7. performance_schema.replication_applier_metrics must report the total size in bytes of transactions that are currently being executed by workers.
This information is reset when the replica server restarts or the user
issues RESET REPLICA. This information is not reset when the applier stops.
This size is increased based on the first seen GTID or
transaction payload event (for compressed transactions) that is
scheduled for each transaction. The size decreases when the transaction commits.
The name of the field shall be TRANSACTIONS_ONGOING_FULL_SIZE_BYTES_SUM.
FR8. performance_schema.replication_applier_metrics must report the size in bytes of the parts of transactions that were already executed for ongoing transactions.
This information is reset when the replica server restarts or the user
issues RESET REPLICA. This information is not reset when the applier stops.
This size is increased for each executed event (given that the
event provides a data size). The size decreases when the transaction commits.
The name of the field shall be TRANSACTIONS_ONGOING_PROGRESS_SIZE_BYTES_SUM.
FR9. performance_schema.replication_applier_metrics must report the size in bytes of all the transactions that are awaiting execution.
This information is reset when the replica server restarts or the user
issues RESET REPLICA. This information is not reset when the applier stops.
A transaction is pending when it is queued in the relay log and
stops being pending when it is committed. Transactions that are
ongoing are also counted as pending.
This size is increased based on the first seen GTID or
transaction payload event (for compressed transactions) that is
queued for each transaction. The size decreases when the
transaction commits.
If a server is restarted and relay logs exist, the size of
pending transactions is only known when all old relay logs are
consumed. Until then the shown number is NULL.
The name of the field shall be TRANSACTIONS_PENDING_SIZE_BYTES_SUM.
FR10. performance_schema.replication_applier_metrics must report the number of events committed since the last time the metrics were reset.
This information is reset when the replica server restarts or the user
issues RESET REPLICA. This information is not reset when the applier stops.
For compressed transactions, this shall count the embedded events and not
count the payload event.
The name of the field shall be EVENTS_COMMITTED_COUNT.
FR11. performance_schema.replication_applier_metrics must report the number of times the applier waited for work from upstream, i.e., waited for the relay log to grow, since the last time the metrics were reset.
This information is reset when the replica server restarts or the user
issues RESET REPLICA. This information is not reset when the applier stops.
This counter might increase while there is no work to be
executed as the coordinator executes not one but a continuous
series of waiting cycles.
The name of the field shall be WAITS_FOR_WORK_FROM_SOURCE_COUNT.
FR12. performance_schema.replication_applier_metrics must report the time spent waiting for work from upstream, i.e., waiting for the relay log to grow, since the last time the metrics were reset.
This information is reset when the replica server restarts or the user
issues RESET REPLICA. This information is not reset when the applier stops.
The name of the field shall be WAITS_FOR_WORK_FROM_SOURCE_SUM_TIME.
FR13. performance_schema.replication_applier_metrics must report the number of times the coordinator waited with scheduling a transaction until some worker became available.
This information is reset when the replica server restarts or the user
issues RESET REPLICA. This information is not reset when the applier stops.
The name of the field shall be WAITS_FOR_AVAILABLE_WORKER_COUNT.
FR14. performance_schema.replication_applier_metrics must report the aggregated time (unit: nanoseconds) the coordinator waited with scheduling a transaction until some worker became available.
This information is reset when the replica server restarts or the user
issues RESET REPLICA. This information is not reset when the applier stops.
The name of the field shall be WAITS_FOR_AVAILABLE_WORKER_SUM_TIME.
FR15. performance_schema.replication_applier_metrics must report the number of times the coordinator waited for a preceding/dependent transaction to commit.
This information is reset when the replica server restarts or the user
issues RESET REPLICA. This information is not reset when the applier stops.
The name of the field shall be WAITS_COMMIT_SCHEDULE_DEPENDENCY_COUNT.
FR16. performance_schema.replication_applier_metrics must report the aggregated time the coordinator waited for a preceding/dependent transaction to commit.
This information is reset when the replica server restarts or the user
issues RESET REPLICA. This information is not reset when the applier stops.
The name of the field shall be WAITS_COMMIT_SCHEDULE_DEPENDENCY_SUM_TIME.
FR17. performance_schema.replication_applier_metrics must report the number of times the coordinator waited with scheduling an event to a worker until the worker has reduced the queue size below replica_pending_jobs_size_max bytes.
This information is reset when the replica server restarts or the user
issues RESET REPLICA. This information is not reset when the applier stops.
The name of the field shall be WAITS_FOR_WORKER_QUEUES_MEMORY_COUNT.
FR18. performance_schema.replication_applier_metrics must report the aggregated time the coordinator waited with scheduling an event to a worker until the worker has reduced the queue size below replica_pending_jobs_size_max bytes.
This information is reset when the replica server restarts or the user
issues RESET REPLICA. This information is not reset when the applier stops.
The name of the field shall be WAITS_FOR_WORKER_QUEUES_MEMORY_SUM_TIME.
FR19. performance_schema.replication_applier_metrics must report the number of times the coordinator waited because there were no empty slots in the worker queue to add more tasks.
This information is reset when the replica server restarts or the channel
on the replica server is deleted. This information is not reset when the
applier stops.
The name of the field shall be WAITS_WORKER_QUEUES_FULL_COUNT.
FR20. performance_schema.replication_applier_metrics must report the aggregated time the coordinator waited because there were no empty slots in the worker queue to add more tasks.
This information is reset when the replica server restarts or the channel
on the replica server is deleted. This information is not reset when the
applier stops.
The name of the field shall be WAITS_WORKER_QUEUES_FULL_SUM_TIME.
FR21. performance_schema.replication_applier_metrics must report cumulative time spent by the coordinator on reading events from relay log since the metrics were reset.
This information is reset when the replica server restarts or the channel
on the replica server is deleted. This information is not reset when the
applier stops.
The name of the field shall be READ_FROM_RELAY_LOG_SUM_TIME.
FR22. performance_schema.replication_applier_metrics must report the number of times workers have waited for the preceding transaction to commit before they could commit their own.
This information is reset when the replica server restarts or the channel
on the replica server is deleted. This information is not reset when the
applier stops.
The name of the field shall be WAITS_DUE_TO_COMMIT_ORDER_COUNT.
FR23. performance_schema.replication_applier_metrics must report the number of nanoseconds workers have waited for the preceding transaction to commit before they could commit their own.
This information is reset when the replica server restarts or the channel
on the replica server is deleted. This information is not reset when the
applier stops.
The name of the field shall be WAITS_DUE_TO_COMMIT_ORDER_SUM_TIME.
--- For metrics grouped per worker
FR24. performance_schema.replication_applier_progress_by_worker must report, for each worker, the type of the transaction being executed, i.e., whether it is a DML or a DDL transaction.
This information is only available for workers executing a
transaction.
For each transaction the type is only known after the GTID event
is processed. Until then the type is 'UNASSIGNED'.
The name of the field shall be ONGOING_TRANSACTION_TYPE.
FR25. performance_schema.replication_applier_progress_by_worker must report the total size in bytes of transactions that are currently being executed by the worker.
This information is only available for workers executing a
transaction.
This size is increased based on the first seen GTID or
transaction payload event (for compressed transactions) that is
processed by the worker. The size is reset when the transaction commits.
The name of the field shall be ONGOING_TRANSACTION_FULL_SIZE_BYTES.
FR26. performance_schema.replication_applier_progress_by_worker must report the size in bytes of the parts of transactions that were already executed for the worker ongoing transaction.
This information is only available for workers executing a
transaction.
This size is increased for each executed event (given that the
event provides a data size). The size is reset when the transaction commits.
The name of the field shall be ONGOING_TRANSACTION_APPLIED_SIZE_BYTES.
--- Overall requirements
FR27. After component installation, metrics shall be displayed for all channels created or restarted after that point. The same applies to Group Replication channels.
FR28. RESET REPLICA shall make statistics be reset as well.
FR29. RESET REPLICA ALL shall also reset all statistics for the given channel.
Non-Functional Requirements
NFR1. Applier throughput must not degrade more than 3% of the current throughput for the sysbench write only workload.
NFR2. Applier latency must not degrade more than 3% of the current apply latency for the sysbench write only workload.
SUMMARY OF THE APPROACH
There is a set of stats that are periodically written out to the error log. We are going to expose some of those through performance schema, keeping only those that are meaningful to operators/DBAs, not server developers. A new table, replication_applier_metrics, is introduced; it also includes new stats that we deemed useful for operators and DBAs.
Moreover, we are removing an undocumented error log entry that contains part of this information and instead putting it in the PFS table.
In this worklog we also add a new table that shows, for each worker, additional information about the transaction currently being applied. This table is replication_applier_progress_by_worker.
Code structure
The basis for information collection will be two classes, one for each table.
The server offers this information in the form of a service,
replication_applier_metrics.
With this service, its users can request information about either the coordinators or the workers for all channels in the server.
Note that several per-channel metrics are collected by workers (for example, metrics incremented at commit time).
The new tables will be created and supported by a new component:
component_replication_applier_metrics
The component will then use the server service to request information each time the user selects from one of the tables.
USER INTERFACE AND OBSERVABILITY
Columns in the new table performance_schema.replication_applier_metrics:
CHANNEL_NAME Type: CHAR(64) NOT NULL Comment: 'The channel name.' Dimension: Not applicable
TOTAL_ACTIVE_TIME_DURATION Type: BIGINT NOT NULL Comment: 'Total channel applier uptime (ns).' Dimension: Nanoseconds
LAST_APPLIER_START Type: TIMESTAMP Comment: 'The last time (since server start) when the channel applier was started.' Dimension: TIMESTAMP, 2038-01-19 03:14:07
TRANSACTIONS_COMMITTED_COUNT Type: BIGINT NOT NULL Comment: 'The number of transactions committed so far.' Dimension: Counter
TRANSACTIONS_ONGOING_COUNT Type: BIGINT NOT NULL Comment: 'The number of ongoing transactions in the applier.' Dimension: Counter
TRANSACTIONS_PENDING_COUNT Type: BIGINT Comment: 'The number of transactions that are in queue and not yet committed, including both unscheduled and ongoing transactions; or NULL when this has not yet been computed.' Dimension: Counter. Can be NULL
TRANSACTIONS_COMMITTED_SIZE_BYTES_SUM Type: BIGINT NOT NULL Comment: 'The size in bytes of transactions committed so far.' Dimension: Bytes
TRANSACTIONS_ONGOING_FULL_SIZE_BYTES_SUM Type: BIGINT NOT NULL Comment: 'The size in bytes of transactions currently being applied.' Dimension: Bytes
TRANSACTIONS_ONGOING_PROGRESS_SIZE_BYTES_SUM Type: BIGINT NOT NULL Comment: 'The size in bytes of already executed events for transactions currently being applied.' Dimension: Bytes
TRANSACTIONS_PENDING_SIZE_BYTES_SUM Type: BIGINT Comment: 'The size in bytes of transactions that are in queue and not yet committed, including both unscheduled and ongoing transactions; or NULL when this has not yet been computed.' Dimension: Bytes. Can be NULL
EVENTS_COMMITTED_COUNT Type: BIGINT NOT NULL Comment: 'The total number of events in all committed transactions. For compressed transactions, this counts all the embedded events, not the payload/envelope event.' Dimension: Counter
WAITS_FOR_WORK_FROM_SOURCE_COUNT Type: BIGINT NOT NULL Comment: 'Number of times when the replication applier is waiting for work from upstream, i.e., when the coordinator has scheduled all transactions that are received.' Dimension: Counter
WAITS_FOR_WORK_FROM_SOURCE_SUM_TIME Type: BIGINT NOT NULL Comment: 'Sum of the time (ns) spent by the replication applier waiting for work from upstream, i.e., when the coordinator has scheduled all transactions that are received.' Dimension: Nanoseconds
WAITS_FOR_AVAILABLE_WORKER_COUNT Type: BIGINT NOT NULL Comment: 'The number of times the replication applier waited with scheduling a transaction until some worker became available.' Dimension: Counter
WAITS_FOR_AVAILABLE_WORKER_SUM_TIME Type: BIGINT NOT NULL Comment: 'The aggregated time (ns) the replication applier waited with scheduling a transaction until some worker became available.' Dimension: Nanoseconds
WAITS_COMMIT_SCHEDULE_DEPENDENCY_COUNT Type: BIGINT NOT NULL Comment: 'The number of times the replication applier waited for the set of dependent transactions to commit before scheduling a transaction.' Dimension: Counter
WAITS_COMMIT_SCHEDULE_DEPENDENCY_SUM_TIME Type: BIGINT NOT NULL Comment: 'The aggregated time (ns) the replication applier waited for the set of dependent transactions to commit before scheduling a transaction.' Dimension: Nanoseconds
WAITS_FOR_WORKER_QUEUES_MEMORY_COUNT Type: BIGINT NOT NULL Comment: 'The number of times the replication applier waited with scheduling an event to a worker until the worker has reduced the queue size below replica_pending_jobs_size_max bytes.' Dimension: Counter
WAITS_FOR_WORKER_QUEUES_MEMORY_SUM_TIME Type: BIGINT NOT NULL Comment: 'The aggregated time (ns) the replication applier waited with scheduling an event to a worker until the worker has reduced the queue size below replica_pending_jobs_size_max bytes.' Dimension: Nanoseconds
WAITS_WORKER_QUEUES_FULL_COUNT Type: BIGINT NOT NULL Comment: 'The number of times the replication applier waited because the worker's queue contained the maximum number of 16384 events.' Dimension: Counter
WAITS_WORKER_QUEUES_FULL_SUM_TIME Type: BIGINT NOT NULL Comment: 'The aggregated time (ns) the replication applier waited because the worker's queue contained the maximum number of 16384 events.' Dimension: Nanoseconds
WAITS_DUE_TO_COMMIT_ORDER_COUNT Type: BIGINT UNSIGNED NOT NULL Comment: 'The number of times a worker waited for a preceding transaction to commit before committing the current transaction.' Dimension: Counter
WAITS_DUE_TO_COMMIT_ORDER_SUM_TIME Type: BIGINT UNSIGNED NOT NULL Comment: 'The time (ns) a worker waited for a preceding transaction to commit before committing the current transaction.' Dimension: Nanoseconds
READ_FROM_RELAY_LOG_SUM_TIME Type: BIGINT NOT NULL Comment: 'Time spent (ns) by the replication applier on reading events from the relay log.' Dimension: Nanoseconds
Columns in the new table performance_schema.replication_applier_progress_by_worker:
CHANNEL_NAME Type: CHAR(64) NOT NULL Comment: 'The channel name.' Dimension: Not applicable
WORKER_ID Type: BIGINT UNSIGNED NOT NULL Comment: 'The worker id.' Dimension: Not applicable
THREAD_ID Type: BIGINT UNSIGNED NOT NULL Comment: 'The worker thread id.' Dimension: Not applicable
ONGOING_TRANSACTION_TYPE Type: ENUM('UNASSIGNED', 'DML', 'DDL') NOT NULL Comment: 'The type of the transaction being applied.' Dimension: Not applicable
ONGOING_TRANSACTION_FULL_SIZE_BYTES Type: BIGINT NOT NULL Comment: 'The total size in bytes of the transaction being applied.' Dimension: Bytes
ONGOING_TRANSACTION_APPLIED_SIZE_BYTES Type: BIGINT NOT NULL Comment: 'The size in bytes of all the events executed for the ongoing transaction.' Dimension: Bytes
Motivation:
For the coordinator:
- TOTAL_ACTIVE_TIME_DURATION - Information about the coordinator's total active time since the beginning of the measurements. Useful, for example, to calculate the total throughput of the system since it first started.
- LAST_APPLIER_START - This information allows us to calculate stats since the last time the applier started, as the restart could have been due to a configuration change or an error.
- TRANSACTIONS_COMMITTED_COUNT - Analyzing this over time gives the transaction throughput of the coordinator (trx/s). The increment of this between two samples, divided by the delay between the samples, gives the throughput of the coordinator during that time interval, in transactions per second.
- TRANSACTIONS_ONGOING_COUNT - Gives an idea of the usage of the applier vs the number of configured workers. A number that is lower than the number of workers might mean several things. If the number of pending transactions is 0, there is not enough work and the number of workers can be reduced if desired. If the number of pending transactions is high, then some issue such as low parallelism exists.
- TRANSACTIONS_PENDING_COUNT - Analyzing this over time gives an idea of the delay in the applier for this channel. A growing value tells us the applier can't keep up with the load, so some action should be taken if possible.
- TRANSACTIONS_COMMITTED_SIZE_BYTES_SUM - Analyzing this over time gives the byte throughput of the coordinator (bytes/s). It also allows us to know the mean size of the transactions executed in the system.
- TRANSACTIONS_ONGOING_FULL_SIZE_BYTES_SUM - Allows evaluating the size of the load currently being applied and spotting big transactions by looking at sudden changes in the mean size.
- TRANSACTIONS_ONGOING_PROGRESS_SIZE_BYTES_SUM - Tells how much of the load is already executed, giving a global view of the progress of big transactions in the system.
- TRANSACTIONS_PENDING_SIZE_BYTES_SUM - Tells how much remains to be applied; combined with the commit rate in bytes, it gives a prediction of how long it will take to apply everything.
- EVENTS_COMMITTED_COUNT - Analyzing this over time gives the event throughput (events/s). The increment of this between two samples, divided by the delay between the samples, gives the throughput during that time interval, in events per second. Comparing this statistic with TRANSACTIONS_COMMITTED_COUNT gives the average number of events per transaction, which may impact contention between coordinator and workers, and can be reduced by enabling binlog transaction compression on the source. For compressed transactions, this gives the number of embedded events and not the number of payload events. If any user wants to know the number of payload events per transaction, it is always 2; and the rate of non-embedded events is exactly twice the transaction rate.
- WAITS_FOR_WORK_FROM_SOURCE_COUNT - Analyzing this over time measures when the coordinator waits for work from upstream. If there are too many idle workers, the user can reduce the number of workers if desired, or change the shape of the machine to save resources.
- WAITS_FOR_WORK_FROM_SOURCE_SUM_TIME - Same as above, this allows spotting idle appliers.
- WAITS_FOR_AVAILABLE_WORKER_COUNT - Analyzing this over time shows how many times the coordinator could not schedule work because there were no workers available, in which case the number of workers should be increased.
- WAITS_FOR_AVAILABLE_WORKER_SUM_TIME - Analyzing this over time and correlating it with the counter shows whether the waits are getting longer or shorter. A constant increase in this column means more workers should be added to keep up with the load.
- WAITS_COMMIT_SCHEDULE_DEPENDENCY_COUNT - Analyzing this over time shows if the coordinator has been waiting more or less due to dependencies clogging the pipeline. The difference of this value between two measurements, divided by the number of transactions scheduled between those two measurements, gives the fraction of transactions for which the parallelization was bounded by conflicting transactions.
- WAITS_COMMIT_SCHEDULE_DEPENDENCY_SUM_TIME - Analyzing this over time shows how much time the coordinator actually waited for dependencies to be satisfied. A steady increase in this column together with a low number of active workers means the user should try to change the workload to increase parallelism.
- WAITS_FOR_WORKER_QUEUES_MEMORY_COUNT - Analyzing this over time shows whether the workload is putting pressure on the memory available for the worker queues. If so, the user should increase replica_pending_jobs_size_max.
- WAITS_FOR_WORKER_QUEUES_MEMORY_SUM_TIME - Analyzing this over time shows whether the applier has been waiting longer because of insufficient worker-queue memory.
- WAITS_WORKER_QUEUES_FULL_COUNT - Analyzing this over time shows whether the queues are constantly full. If the issue is caused by a large number of events per transaction, then enabling binlog compression can help.
- WAITS_WORKER_QUEUES_FULL_SUM_TIME - Analyzing this shows how much time workers waited due to queues being full.
- READ_FROM_RELAY_LOG_SUM_TIME - Analyzing this over time one measures time spent by the coordinator on reading events from the relay log. A high value for this column might mean a slow drive where the relay log is stored.
- WAITS_DUE_TO_COMMIT_ORDER_COUNT - Analyzing this over time shows the aggregated count of how many times workers have waited for their turn to commit, waiting for the commit order to be satisfied. One possible cause for a high value is the existence of large transactions, so the user should try to break them down into smaller transactions.
- WAITS_DUE_TO_COMMIT_ORDER_SUM_TIME - Analyzing this over time shows the aggregated time workers have spent waiting for their turn to commit, waiting for the commit order to be satisfied. High values indicate large transactions.
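As an illustration of how the counters above can be turned into rates, the following sketch derives throughput figures from two successive samples of the table. The sample values and the helper names are hypothetical, not part of this worklog; only the column names come from the table definition.

```python
# Derive throughput figures from two samples of
# performance_schema.replication_applier_metrics (made-up values).

def rate(prev, curr, column, elapsed_seconds):
    """Increase of a counter column between two samples, per second."""
    return (curr[column] - prev[column]) / elapsed_seconds

# Two samples taken 10 seconds apart (hypothetical numbers).
s0 = {"TRANSACTIONS_COMMITTED_COUNT": 1000,
      "TRANSACTIONS_COMMITTED_SIZE_BYTES_SUM": 10_000_000,
      "EVENTS_COMMITTED_COUNT": 8000}
s1 = {"TRANSACTIONS_COMMITTED_COUNT": 2000,
      "TRANSACTIONS_COMMITTED_SIZE_BYTES_SUM": 30_000_000,
      "EVENTS_COMMITTED_COUNT": 16000}

# Transaction throughput: 1000 transactions in 10 s -> 100 trx/s.
trx_per_sec = rate(s0, s1, "TRANSACTIONS_COMMITTED_COUNT", 10)
# Byte throughput: 20 MB in 10 s -> 2,000,000 bytes/s.
bytes_per_sec = rate(s0, s1, "TRANSACTIONS_COMMITTED_SIZE_BYTES_SUM", 10)
# Average events per committed transaction in the interval.
events_per_trx = ((s1["EVENTS_COMMITTED_COUNT"] - s0["EVENTS_COMMITTED_COUNT"]) /
                  (s1["TRANSACTIONS_COMMITTED_COUNT"] - s0["TRANSACTIONS_COMMITTED_COUNT"]))
```

A monitoring application would obtain the two samples by querying the table at its own polling frequency.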
For the worker:
- ONGOING_TRANSACTION_TYPE: Lets the operator know whether a stuck transaction is a DDL or a DML.
- ONGOING_TRANSACTION_FULL_SIZE_BYTES: Shows the size of the transaction being applied; comparing this value to other workers allows spotting, and possibly stopping, the execution of big transactions.
- ONGOING_TRANSACTION_APPLIED_SIZE_BYTES: Allows evaluating the progress of a large transaction being executed by this worker.
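For example, the progress of a large transaction on a worker can be estimated from the two size columns. The helper and the row values below are hypothetical; only the column names come from the table definition.

```python
def progress_percent(row):
    """Fraction (in percent) of an ongoing transaction already applied,
    from a row of performance_schema.replication_applier_progress_by_worker."""
    full = row["ONGOING_TRANSACTION_FULL_SIZE_BYTES"]
    if full == 0:
        return None  # no transaction ongoing, or its size is not yet known
    return 100.0 * row["ONGOING_TRANSACTION_APPLIED_SIZE_BYTES"] / full

# Hypothetical row: 1 MB applied out of a 4 MB transaction -> 25%.
row = {"ONGOING_TRANSACTION_FULL_SIZE_BYTES": 4_000_000,
       "ONGOING_TRANSACTION_APPLIED_SIZE_BYTES": 1_000_000}
```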
RESET
The user can reset the current statistics values by executing the command
RESET REPLICA [FOR CHANNEL <name of channel>]
This command also resets TOTAL_ACTIVE_TIME_DURATION and LAST_APPLIER_START.
EXTRA INFORMATION ON ERROR LOG vs PFS
Statistics entries in the error log typically look like this:
2022-11-08T10:12:27Z Multi-threaded slave statistics for channel 'group_replication_applier': seconds elapsed = 120; events assigned = 1000000000; worker queues filled over overrun level = 70000; waited due a Worker queue full = 50; waited due the total size = 5200; waited at clock conflicts = 80000000000000 waited (count) when Workers occupied = 6000000 waited when Workers occupied = 100000000000
2022-11-08T10:14:27Z Multi-threaded slave statistics for channel 'group_replication_applier': seconds elapsed = 120; events assigned = 1000847872; worker queues filled over overrun level = 70000; waited due a Worker queue full = 50; waited due the total size = 5200; waited at clock conflicts = 80117605917400 waited (count) when Workers occupied = 6000029 waited when Workers occupied = 100083425000
The corresponding entries in PFS are:
events assigned = 1000000000 (oldrli->mts_events_assigned) - this is similar to, but not exactly the same as, performance_schema.replication_applier_metrics.EVENTS_COMMITTED_COUNT. events_assigned counts scheduled events (i.e., ongoing and committed), whereas EVENTS_COMMITTED_COUNT counts only committed events. The two metrics serve the same purposes, but EVENTS_COMMITTED_COUNT does not have the problem that it can count a transaction twice if the transaction is rolled back and restarted.
worker queues filled over overrun level = 70000 (oldrli->mts_wq_overrun_cnt) - this has NOT been migrated to performance schema.
waited due a Worker queue full = 50 (oldrli->mts_wq_overfill_cnt) - this is in performance_schema.replication_applier_metrics.WAITS_WORKER_QUEUES_FULL_COUNT.
waited due the total size = 5200 (oldrli->wq_size_waits_cnt) - this is in performance_schema.replication_applier_metrics.WAITS_FOR_WORKER_QUEUES_MEMORY_COUNT.
waited at clock conflicts (oldrli->mts_total_wait_overlap) - this is in performance_schema.replication_applier_metrics.WAITS_COMMIT_SCHEDULE_DEPENDENCY_SUM_TIME.
waited when Workers occupied (oldrli->mts_total_wait_worker_avail) - this is in performance_schema.replication_applier_metrics.WAITS_FOR_AVAILABLE_WORKER_SUM_TIME.
To analyze trends, one needs to calculate the difference between the values of successive error log entries. This indicates a trend over the last 120 seconds (a hardcoded interval). With the PFS table, one queries it over time and calculates the difference as well; the polling frequency is up to the application monitoring the server.
CONSISTENCY
It is not guaranteed that different metrics within the same row in the tables are mutually consistent. In other words, it is possible that the result set consists of metrics that were read at different points in time, and that the applier makes progress between those points in time. Even metrics that are closely related, such as TRANSACTIONS_COMMITTED_COUNT and TRANSACTIONS_COMMITTED_SIZE_BYTES_SUM, may not be obtained at the same logical point in time.
It is guaranteed that each individual metric is obtained at a point in time between the beginning and the end of the operation that reads the table. (But not necessarily the same point as other metrics.)
Here are some situations that can be perceived as anomalies, which follow from this:
The total number of transactions may be different from TRANSACTIONS_COMMITTED_COUNT + TRANSACTIONS_PENDING_COUNT, and the total size of all transactions may be different from TRANSACTIONS_PENDING_SIZE_BYTES_SUM + TRANSACTIONS_COMMITTED_SIZE_BYTES_SUM. (Some transaction may be counted as pending, then it gets committed, then it counts as committed too.)
TRANSACTIONS_ONGOING_PROGRESS_SIZE_BYTES_SUM may be greater than TRANSACTIONS_ONGOING_FULL_SIZE_BYTES_SUM (they may relate to different sets of transactions); ONGOING_TRANSACTION_APPLIED_SIZE_BYTES may be greater than ONGOING_TRANSACTION_FULL_SIZE_BYTES (they may relate to different transactions).
USER PROCEDURE
The user, or a monitoring robot, interacts with this feature by querying the performance schema, as is done for any other performance schema stats.
SECURITY CONTEXT
To access the tables, the user must first install the component; for that, the INSERT privilege on the 'component' table is needed.
To then use the component, the user needs the privilege to SELECT from the 'performance_schema' database.
SYSTEM RESOURCES
No considerable new memory storage is used. For each coordinator and worker we store some counters and time structures, which should not represent significant memory consumption.
Synchronization: all counters and time measurements are implemented using atomic 64-bit integers and are not guarded by locks. Reading from the table requires holding a lock that protects deletion of the channel. This lock exists in the server already; the table is only a new customer for the lock.
No new threads are introduced by this feature.
A new component is introduced that needs to be installed by the user.
No new execution stages are introduced by this feature.
DEPLOYMENT and INSTALLATION
There is a component to install, the user can do it with:
INSTALL COMPONENT 'file://component_replication_applier_metrics';
Note that installation does not affect channels that are already running. The user must restart replication for channels where metrics are to be observed. This also applies to Group Replication where the member has to be restarted.
PROTOCOL
No changes to any existing protocol.
FAILURE MODEL SPECIFICATION
N/A
UPGRADE/DOWNGRADE and CROSS-VERSION REPLICATION
After the user upgrades, there will be new PFS tables. This is a non-breaking change.
We will also remove the very cryptic error log entry that periodically pollutes the log and is not documented as something our users should engage with or decode.
Steps
Server statistics
First part is tracking the metrics in the server.
The Applier_metrics class stores metrics for the replication_applier_metrics table. The coordinator's Relay_log_info object has an instance of this class as member. Code executed by both coordinator and workers update the metrics in this class.
The Worker_metrics class stores metrics for the replication_applier_progress_by_worker table. The worker's Slave_worker object has an instance of this class as member. Code executed in the worker updates metrics in this class.
The auxiliary class Time_based_metric encapsulates two metrics relating to any process that can start and stop: one counter, representing the number of times the process started, and one duration, representing the total time that the process has been active. This is used for all the metric pairs WAITS_*_COUNT / WAITS_*_SUM_TIME.
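The internals of Time_based_metric are not spelled out in this worklog; the sketch below is a Python analogue of the idea, not the server's C++ implementation (which uses atomic 64-bit integers, see SYSTEM RESOURCES).

```python
import time

class TimeBasedMetric:
    """Sketch of the Time_based_metric idea: a start count plus the
    total time the measured process has been active, in nanoseconds."""

    def __init__(self):
        self.count = 0        # number of times the process started
        self.sum_time_ns = 0  # total active time, in nanoseconds

    def time_scope(self):
        """Return a context manager that measures one start/stop cycle."""
        return _Scope(self)

class _Scope:
    """Counts one start on entry and accumulates the duration on exit."""
    def __init__(self, metric):
        self._metric = metric
    def __enter__(self):
        self._metric.count += 1
        self._start = time.monotonic_ns()
    def __exit__(self, *exc):
        self._metric.sum_time_ns += time.monotonic_ns() - self._start
        return False

# Usage: each wait in the applier is wrapped in a scope, which yields
# the WAITS_*_COUNT / WAITS_*_SUM_TIME column pairs.
wait_for_worker = TimeBasedMetric()
with wait_for_worker.time_scope():
    pass  # ... wait for a worker to become available ...
```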
This section is compilable. This section has no tests.
Server service
This step introduces a service that exposes two methods, one for each table. Each method fetches the contents of the corresponding table and exposes it to the caller as an array, returned through a pointer argument.
This section is compilable. This section has no tests.
Component
This step introduces the component that defines the two tables, using the service from the previous step to fetch data.
This section is compilable. This section has no tests.
This section is feature-complete, so after this step it is possible to test all the requirements.
Testing
This step contains tests that verify the behavior of the tables. It is separated from the previous step in order to keep a clearer separation between different parts of the work during reviews and in revision history.
Performance
When metric collection is disabled, it is expected that no visible performance regression is seen.
When metric collection is enabled, sysbench measurements with standard configurations did not indicate a significant regression in transaction rate.
However, when the binlog and the redo log were disabled, there was a 3% regression in transaction rate. The regression was measured on a version that used a mutex while updating each time metric. The mutex has since been removed, so now only atomic variables are used. It remains to be measured whether the regression is smaller now.
Possible future optimization
Several atomic variables are updated by concurrent workers. If further optimization is needed, we may implement sharded counters, where each writer thread increments its own counter and reader threads sum the counters for all writers.
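The sharded-counter idea can be sketched as follows. This is illustrative Python under the stated assumption of one slot per writer thread; the server would implement it in C++ with per-thread atomic slots.

```python
class ShardedCounter:
    """Sketch of a sharded counter: each writer increments its own slot,
    and readers sum all slots. This trades an exact point-in-time value
    for reduced contention between concurrent writers."""

    def __init__(self, n_writers):
        self._slots = [0] * n_writers

    def add(self, writer_id, delta=1):
        # Only writer `writer_id` touches slot `writer_id`, so writers
        # never contend on the same memory location (or cache line,
        # given suitable padding in a native implementation).
        self._slots[writer_id] += delta

    def read(self):
        # The reader sums all slots; the result is approximate while
        # writers are active, which is acceptable for these metrics.
        return sum(self._slots)

# Four workers each add 10 to their own slot; the reader sees 40.
counter = ShardedCounter(n_writers=4)
for worker in range(4):
    counter.add(worker, delta=10)
```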