MySQL Replication Monitoring: Enhanced Features for the Enterprise Edition

For years, MySQL has been a cornerstone for businesses worldwide. Its ease of use and built-in data replication features have made it a perfect fit for building highly available databases. Moreover, it has made it possible to design and implement simple business continuity solutions based on live snapshots through replication.

As setups become more complex and performance demands increase, having robust monitoring tools is crucial to identify issues and allocate resources effectively. The ability to observe, monitor, operate and optimize is paramount, enabling better hardware utilization and cost savings by obviating the need for premature upgrades.

In this blog post, we showcase two enhancements introduced in MySQL Enterprise Edition, starting with MySQL 9.1.

  • Flow Control: More Information, More Control
  • Replication Applier: Enhanced Metrics for Better Performance

The first enhancement focuses on improving the observability of the flow control mechanism in MySQL clusters running with Group Replication. With this feature, users can easily identify situations where the system is throttled due to secondaries struggling to keep up with the primary commit rate. This visibility enables proactive performance management.

The second feature is an extension to the observability of the replication applier that displays advanced metrics. Such metrics give the user the ability to tune and optimize the replication applier to meet the workload requirements.

Now, let's explore each of these exciting new additions to the Enterprise Edition in more detail.


Flow Control: More Information, More Control

Flow control is an important aspect of Group Replication setups, ensuring stable and predictable performance across the group.

One fundamental concept behind Group Replication is that it acts as a replicated state machine. Therefore, it ensures that a transaction commits only once a majority of servers in the group have received it and agreed on the order of concurrent transactions. This approach works well when the total number of writes to the group doesn't exceed the write capacity of any individual member.

However, if some members have lower write throughput than others, they may start exhibiting what is commonly known as replication lag. To address this, the Flow Control mechanism introduces waits on the primary to ensure secondaries don't exceed their write capacity. Monitoring these occurrences proves to be invaluable as it informs users when the system is throttling itself.
Leveraging this information helps establish the root cause of potential replication lag and identify performance optimization opportunities.

To enhance Flow Control monitoring, we've added the following global status variables: 

  • Gr_flow_control_throttle_count : the number of times flow control has throttled transactions
  • Gr_flow_control_throttle_time_sum : the total throttle time, in microseconds
  • Gr_flow_control_throttle_active_count : the number of sessions currently being throttled
  • Gr_flow_control_throttle_last_throttle_timestamp : the timestamp of the last throttled transaction

This information can help you understand how flow control is impacting the primary and how to tune it to keep secondaries up to date.

How can you fetch these variables? 

These are status variables, so the usual Performance Schema queries can be used to retrieve their values.

SELECT * FROM performance_schema.global_status WHERE VARIABLE_NAME LIKE 'Gr_flow_control%';
+--------------------------------------------------+---------------------+
| VARIABLE_NAME                                     | VARIABLE_VALUE      |
+--------------------------------------------------+---------------------+
| Gr_flow_control_throttle_active_count             | 3                   |
| Gr_flow_control_throttle_count                    | 10                  |
| Gr_flow_control_throttle_last_throttle_timestamp  | 2024-10-16 12:50:56 |
| Gr_flow_control_throttle_time_sum                 | 10                  |
+--------------------------------------------------+---------------------+

A real usage scenario:

We now present how some of this data would look in a system like Grafana, SigNoz or similar.
Typically, data would be queried at fixed intervals and stored in a time-series database, allowing for historical analysis and trend identification.
Presenting this data to system administrators gives them valuable insights into the behavior of their setup over time.

With that in mind, let's consider a hypothetical system that is experiencing sporadic throttling issues.
By sampling Gr_flow_control_throttle_time_sum we can generate graphs like the following: 

[Figure: Flow control throttle time sum (left); flow control throttle time delta (right)]

On the left we have a straightforward representation of the Gr_flow_control_throttle_time_sum variable across time.
Steep increases of the throttle time sum indicate periods of time when flow control was activated.
On the right we have an alternative view that displays the deltas of the value, clearly showing when flow control was active.
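
If your monitoring stack does not compute deltas for you, the same calculation can be sketched directly in SQL. The snippet below is only an illustration: the 10-second window is an arbitrary choice, and a real deployment would sample from an external scheduler rather than using SLEEP().

-- Sample the throttle time sum twice and compute the delta over the window.
SELECT VARIABLE_VALUE INTO @throttle_t0
  FROM performance_schema.global_status
 WHERE VARIABLE_NAME = 'Gr_flow_control_throttle_time_sum';

DO SLEEP(10);

SELECT VARIABLE_VALUE - @throttle_t0 AS throttle_time_delta_us
  FROM performance_schema.global_status
 WHERE VARIABLE_NAME = 'Gr_flow_control_throttle_time_sum';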

At this point, it's up to the DBA to investigate potential issues causing throttling.  
Flow control throttling for Group Replication is configured by setting group_replication_flow_control_certifier_threshold and group_replication_flow_control_applier_threshold.
The first defines the number of transactions waiting for certification that, when surpassed, triggers a slowdown, while the second focuses on the number of transactions waiting to be applied.
Both of these metrics, COUNT_TRANSACTIONS_IN_QUEUE and COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE, are visible in the performance_schema.replication_group_member_stats table.
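
As a quick check, you can query those queue sizes directly, and if throttling proves too aggressive for your workload the thresholds can be raised. The columns below are the documented ones for that table; the threshold value shown is only an example (the documented default is 25000).

SELECT MEMBER_ID,
       COUNT_TRANSACTIONS_IN_QUEUE,
       COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE
  FROM performance_schema.replication_group_member_stats;

-- Example value only: raise the applier threshold if throttling is too aggressive.
SET GLOBAL group_replication_flow_control_applier_threshold = 50000;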

Of the two, the replication applier is the more likely bottleneck candidate, perhaps taking its time to apply a particular transaction (e.g., a long-running DDL).
With more information, the DBA would have a clear view of which transactions are causing the throttling and could evaluate whether they can be rewritten in a more efficient form.
This is a better allocation of your funds: instead of simply buying a faster, bigger machine, you can opt for better usage of your existing hardware.

Figuring out how to find this transaction or other underlying issues is a perfect segue to the next exciting feature of the Enterprise Edition: enhanced replication applier metrics.


Replication Applier: Enhanced Metrics for Better Performance

When replicating data, whether it is within a Group Replication cluster or via asynchronous replication, minimizing the delay between data being sent to a MySQL server and becoming queryable on replicas is quite important. 

To achieve this, monitoring the execution of transactions on replicas is key.
The ability to pinpoint issues affecting replication performance can lead to significant improvements and cost savings by making it easy to adjust the configuration to meet the workload requirements.

With that goal in mind, we introduce two new replication applier monitoring tables: 

  • performance_schema.replication_applier_metrics
  • performance_schema.replication_applier_progress_by_worker

What does it look like?

In this example, we can observe various metrics for the replication applier on a group replication member.

SELECT * FROM performance_schema.replication_applier_metrics where CHANNEL_NAME='group_replication_applier'\G
*************************** 1. row ***************************
                                CHANNEL_NAME: group_replication_applier
                  TOTAL_ACTIVE_TIME_DURATION: 82155916936
                          LAST_APPLIER_START: 2024-10-11 21:04:08
                TRANSACTIONS_COMMITTED_COUNT: 2
                  TRANSACTIONS_ONGOING_COUNT: 0
                  TRANSACTIONS_PENDING_COUNT: 0
       TRANSACTIONS_COMMITTED_SIZE_BYTES_SUM: 482
    TRANSACTIONS_ONGOING_FULL_SIZE_BYTES_SUM: 0
TRANSACTIONS_ONGOING_PROGRESS_SIZE_BYTES_SUM: 0
         TRANSACTIONS_PENDING_SIZE_BYTES_SUM: NULL
                      EVENTS_COMMITTED_COUNT: 7
            WAITS_FOR_WORK_FROM_SOURCE_COUNT: 5
         WAITS_FOR_WORK_FROM_SOURCE_SUM_TIME: 82097262349
            WAITS_FOR_AVAILABLE_WORKER_COUNT: 0
         WAITS_FOR_AVAILABLE_WORKER_SUM_TIME: 0
      WAITS_COMMIT_SCHEDULE_DEPENDENCY_COUNT: 2
   WAITS_COMMIT_SCHEDULE_DEPENDENCY_SUM_TIME: 22515305
         WAITS_FOR_WORKER_QUEUE_MEMORY_COUNT: 0
      WAITS_FOR_WORKER_QUEUE_MEMORY_SUM_TIME: 0
              WAITS_WORKER_QUEUES_FULL_COUNT: 0
           WAITS_WORKER_QUEUES_FULL_SUM_TIME: 0
             WAITS_DUE_TO_COMMIT_ORDER_COUNT: 0
          WAITS_DUE_TO_COMMIT_ORDER_SUM_TIME: 0
        TIME_TO_READ_FROM_RELAY_LOG_SUM_TIME: 123552
1 row in set (0.00 sec)

 

This table is designed to provide a comprehensive summary of transaction execution for the applier, including the number and size of transactions, as well as the number of pending transactions.
Additionally, it offers insights into potential bottlenecks, such as memory constraints, worker availability or the parallelization characteristics of the executed workload. 

Here are some of the valuable insights and actions you can take using this table:

  • Calculate Application Rates: Sample TRANSACTIONS_COMMITTED_COUNT and TRANSACTIONS_COMMITTED_SIZE_BYTES_SUM at regular intervals to determine the average rate of transaction execution per second or bytes per second. You can also roughly estimate the time required to clear the current backlog using TRANSACTIONS_PENDING_COUNT and TRANSACTIONS_PENDING_SIZE_BYTES_SUM (a query sketch follows this list).

  • Assess Idle Time: Analyze WAITS_FOR_WORK_FROM_SOURCE_COUNT/_SUM_TIME to understand how much time your system is idle. This information can help identify opportunities to optimize hardware costs by adjusting the shape of your machines.

  • Adjust Number of Configured Workers: The "WAITS_FOR_AVAILABLE_WORKER" metric indicates when to increase or decrease the number of workers for the replica applier, ensuring efficient resource utilization.

  • Profile Workload Characteristics: If you observe idle workers along with increasing values for WAITS_COMMIT_SCHEDULE_DEPENDENCY_SUM_TIME, it suggests that your workload has significant transaction dependencies. This causes the applier to wait for transactions to complete before executing new ones.

  • Analyze Resource Consumption: Full worker queues may indicate that compression could be beneficial. Memory counts can help determine if you need to tune your machine memory or adjust limits with replica_pending_jobs_size_max. Increasing commit order times may point to large transactions clogging the system.
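
As an illustration of the first bullet, here is a minimal sketch of the rate calculation done directly in SQL. The 60-second window is an arbitrary choice, and a monitoring system would normally perform this sampling and arithmetic for you.

SELECT TRANSACTIONS_COMMITTED_COUNT INTO @committed_t0
  FROM performance_schema.replication_applier_metrics
 WHERE CHANNEL_NAME = 'group_replication_applier';

DO SLEEP(60);

-- Transactions applied per second over the window, plus a rough estimate of
-- the time needed to clear the current backlog at that rate.
SELECT (TRANSACTIONS_COMMITTED_COUNT - @committed_t0) / 60 AS commits_per_second,
       TRANSACTIONS_PENDING_COUNT /
         NULLIF((TRANSACTIONS_COMMITTED_COUNT - @committed_t0) / 60, 0)
         AS seconds_to_clear_backlog
  FROM performance_schema.replication_applier_metrics
 WHERE CHANNEL_NAME = 'group_replication_applier';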

Track that big transaction: Are we there yet? 

To help you track big transactions that sometimes clog the system, we added the performance_schema.replication_applier_progress_by_worker table.

SELECT * FROM performance_schema.replication_applier_progress_by_worker\G
*************************** 1. row ***************************
                          CHANNEL_NAME:
                             WORKER_ID: 0
                             THREAD_ID: 47
              ONGOING_TRANSACTION_TYPE: DML
   ONGOING_TRANSACTION_FULL_SIZE_BYTES: 2388
ONGOING_TRANSACTION_APPLIED_SIZE_BYTES: 1313
*************************** 2. row ***************************
                          CHANNEL_NAME:
                             WORKER_ID: 1
                             THREAD_ID: 48
              ONGOING_TRANSACTION_TYPE: UNASSIGNED
   ONGOING_TRANSACTION_FULL_SIZE_BYTES: 0
ONGOING_TRANSACTION_APPLIED_SIZE_BYTES: 0
*************************** 3. row ***************************
                          CHANNEL_NAME:
                             WORKER_ID: 2
                             THREAD_ID: 49
              ONGOING_TRANSACTION_TYPE: UNASSIGNED
   ONGOING_TRANSACTION_FULL_SIZE_BYTES: 0
ONGOING_TRANSACTION_APPLIED_SIZE_BYTES: 0
*************************** 4. row ***************************
                          CHANNEL_NAME:
                             WORKER_ID: 3
                             THREAD_ID: 50
              ONGOING_TRANSACTION_TYPE: UNASSIGNED
   ONGOING_TRANSACTION_FULL_SIZE_BYTES: 0
ONGOING_TRANSACTION_APPLIED_SIZE_BYTES: 0
4 rows in set (0.01 sec)

 

In this example, we have four workers. Worker 0 is currently executing a DML transaction with a total size of 2388 bytes. 
The ONGOING_TRANSACTION_APPLIED_SIZE_BYTES and ONGOING_TRANSACTION_FULL_SIZE_BYTES columns indicate that roughly half of the transaction has been applied. Workers 1, 2, and 3 are idle, with no active transactions.

This table is particularly useful when you detect stagnation in your system.
By extracting the GTID from replication_applier_status_by_worker, you can track the progress of specific transactions and identify any bottlenecks or issues causing delays.
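
A hedged sketch of that correlation is shown below. APPLYING_TRANSACTION is a documented column of replication_applier_status_by_worker; joining the two tables on CHANNEL_NAME and WORKER_ID is an assumption made here for illustration.

-- Show, for each busy worker, the GTID being applied and how far along it is.
SELECT s.APPLYING_TRANSACTION AS gtid,
       p.ONGOING_TRANSACTION_TYPE,
       p.ONGOING_TRANSACTION_APPLIED_SIZE_BYTES,
       p.ONGOING_TRANSACTION_FULL_SIZE_BYTES,
       100 * p.ONGOING_TRANSACTION_APPLIED_SIZE_BYTES /
         NULLIF(p.ONGOING_TRANSACTION_FULL_SIZE_BYTES, 0) AS pct_applied
  FROM performance_schema.replication_applier_progress_by_worker p
  JOIN performance_schema.replication_applier_status_by_worker s
    ON s.CHANNEL_NAME = p.CHANNEL_NAME AND s.WORKER_ID = p.WORKER_ID
 WHERE p.ONGOING_TRANSACTION_TYPE <> 'UNASSIGNED';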

Additionally, the table will indicate if workers are executing DDLs, but the size shown is not useful for tracking their progress, as DDL sizes are unrelated to their complexity.

Going back to our performance issue.

Previously, we discussed a scenario where a Group Replication system was experiencing throttling, indicating that the secondary couldn't keep up with the primary's load.
Now, let's explore how the new Replication Applier metrics can assist in diagnosing various issues in such a scenario. 

We'll utilize time-series visualizations for the fields presented in this section to identify several potential issues and their solutions.

  • Scenario 1: Not enough configured workers.

In this scenario, we have a server that may be misconfigured with a low number of replication applier workers.
As the backlog of transactions to be applied increases, it triggers the flow control mechanism, leading to throttling across the group. 
The graph below shows the correlation between the increase in backlog and the activation of flow control.

[Figure: Transactions pending application]

While this graph hints at an issue, let's turn to other metrics to uncover the root cause.
Below we observe a significant increase in wait times for available workers, indicating a potential worker shortage.

[Figure: Wait on available workers time sum (left); wait on available workers time delta (right)]

By monitoring the number of active workers, we can spot the issue: flow control is triggered whenever the number of active workers reaches the configured maximum.

[Figure: Number of active workers]
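
One hedged way to reproduce an "active workers" series like the one above is to count, at each sampling interval, the workers that currently have a transaction assigned; this derivation is an assumption for illustration, not an official metric.

SELECT CHANNEL_NAME,
       SUM(ONGOING_TRANSACTION_TYPE <> 'UNASSIGNED') AS active_workers,
       COUNT(*) AS configured_workers
  FROM performance_schema.replication_applier_progress_by_worker
 GROUP BY CHANNEL_NAME;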

Increasing the number of workers could resolve the issue, given that the machine has sufficient resources to handle the additional load.
This adjustment would help process transactions more efficiently and reduce the need for throttling. 
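
If the machine does have headroom, the adjustment itself is a one-line change. The value below is only an example; replica_parallel_workers is the documented setting, and the new value takes effect for a channel once its applier threads are restarted (for example, by restarting replication or Group Replication on that member).

-- Example value only; size it to the host's CPU and I/O capacity.
SET GLOBAL replica_parallel_workers = 8;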

  • Scenario 2: The workload has low parallelism

In this scenario, similar to the previous example, we observe an increasing backlog of transactions.
However, this time, the number of active workers doesn't reach the maximum during the observation period.

[Figure: Number of active workers]

To diagnose the issue, let's turn to another metric. 
The graph below shows an increase in wait times due to transaction commit dependencies, indicating that the replication applier is waiting on data dependencies. 

[Figure: Scheduling dependency wait time sum (left); scheduling dependency wait time delta (right)]

This scenario is one of those cases where upgrading hardware might not improve overall system throughput, as the system has hotspots and bottlenecks artificially created by the application. Instead, the DBA or developer needs to modify the workload to take advantage of how the database and replication topology are set up and thus remove the bottlenecks. As a consequence, end-to-end throughput increases in a balanced way, without your business incurring extra costs.

  • Scenario 3: Uncovering a Large Transaction

In this scenario, we observe a familiar pattern: the backlog is increasing, the number of occupied workers isn't at its maximum, and wait times for dependencies are on the rise. 
However, there's a key difference: the WAITS_COMMIT_SCHEDULE_DEPENDENCY_COUNT metric remains mostly unchanged. 
Additionally, the number of active workers remains constant (and often low) for extended periods.

[Figure: Scheduling dependency wait count (left); number of active workers (right)]

If the number of active workers is greater than 1, you can also expect an increase in waits for commit order (WAITS_DUE_TO_COMMIT_ORDER_SUM_TIME).

To confirm our suspicions, let's turn to a representation of the transactions currently being executed by each worker and their progress (this is the only metric shown here that represents a given moment in time and not a time window).

[Figure: Worker transaction progress]

The graph above reveals a large transaction being executed by one of the workers.
Given its considerable size, only 60% of it has been applied so far, suggesting that large transactions like this one may be clogging the system.

Optimizing or breaking down large transactions into smaller ones could help improve replication performance and reduce the likelihood of clogging.
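
As a generic illustration of that advice (the table and column names here are hypothetical), a bulk cleanup can be split into bounded batches that each replicate as a small transaction, rerun until no rows remain:

-- Hypothetical example: delete old rows in batches of 10000 instead of one
-- huge transaction; repeat until ROW_COUNT() reports 0 affected rows.
DELETE FROM orders_archive
 WHERE created_at < '2020-01-01'
 LIMIT 10000;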

Note: The analysis and insights provided for the replication applier are equally applicable to asynchronous replication scenarios, where similar performance issues may occur. 


Conclusion

To unlock superior performance from your existing hardware and, in the long run, save costs by optimizing resource utilization, consider upgrading to the MySQL Enterprise Edition.
This edition is designed to help you make the most of your infrastructure, avoiding unnecessary upgrades and reducing hardware costs.

The new monitoring additions will make the life of a DevOps engineer, operator, or system administrator easier.
Be it in Group Replication or in standard asynchronous replication scenarios, the new metrics are an invaluable addition to the Enterprise Edition.
They are a powerful tool, enabling quick troubleshooting, a better understanding of workload characteristics, and a faster operational response to keep replicated systems efficient.