WL#9838: Group Replication: Flow-control fine tuning
Affects: Server-8.0 — Status: Complete
EXECUTIVE SUMMARY

1. Current flow-control scheme

Group Replication introduced flow control to avoid excessive buffering and to keep group members reasonably close to one another. For several reasons it is more efficient to keep buffering low but, as mentioned before, replication does not require members to be kept in sync in order to work: once a slave becomes delayed it will just increase the number of pending transactions in the relay log.

The flow-control mechanism enters the scene to bound the back-log members can accumulate, in terms of transactions to certify and transactions to apply. Once those bounds are exceeded, and while they remain exceeded, the flow-control mechanism limits the throughput of writer members to match the capacity of the slowest members of the group. By imposing conservative bounds it tries to make sure all members are able to keep up and can return safely to within the bounds defined by the user.

The flow-control system works asynchronously: the capacity to use in each period is calculated deterministically by each member, using the statistics that each member regularly sends to the group. The coordination is purposely kept coarse-grained, in order to reduce the number of messages in flight and to keep the system running as smoothly as possible.

The implementation depends on two basic mechanisms:
1) monitoring the members to gather statistics on throughput and queue sizes of all group members, in order to make educated guesses about the maximum write pressure each member can withstand; and
2) throttling the members so that none of them writes beyond its fair share of the capacity available in each step.
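The throttling mechanism described above can be pictured as a per-period commit budget. The following is only a minimal sketch under stated assumptions, not the Group Replication implementation: the class name PeriodQuota, its methods, and the use of None to mean "not throttled" are all hypothetical.

```python
class PeriodQuota:
    """Hypothetical per-period commit budget for one writer member."""

    def __init__(self, quota=None):
        # quota: commits allowed in the current period.
        # Assumption: None means flow control is not throttling this member.
        self.quota = quota
        self.used = 0

    def try_commit(self):
        """Return True if one more commit fits in this period's budget."""
        if self.quota is None or self.used < self.quota:
            self.used += 1
            return True
        # Budget exhausted: the caller would wait for the next period.
        return False

    def start_period(self, new_quota):
        """Install the quota computed for the next period and reset usage."""
        self.quota = new_quota
        self.used = 0
```

In this sketch a client thread calls try_commit() before committing, and the periodic flow-control step calls start_period() with the newly computed quota.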
The following variables are used to control flow-control behaviour in Group Replication:
- group-replication-flow-control-mode = QUOTA | DISABLED
- group-replication-flow-control-certifier-threshold = 0..n
- group-replication-flow-control-applier-threshold = 0..n

Flow-control can be enabled and disabled by selecting a flow-control algorithm with the group-replication-flow-control-mode option. QUOTA is the default flow-control mode, and the only one for now; it calculates a write quota for the group in terms of commits/period. That quota, after being divided by the number of members that tried to write in the last period, is the maximum allowance the client threads can spend during the next flow-control period.

Flow control takes into account two work queues: the certification queue and the group replication applier queue. A user may therefore choose 1) whether to do flow control at the certifier level, at the applier level, or both, and 2) the trigger size of each queue. The options group-replication-flow-control-certifier-threshold and group-replication-flow-control-applier-threshold specify the queue sizes that, when exceeded, force flow-control to throttle the throughput of the writer members.

More information about how these options work is available in the user's manual here: https://dev.mysql.com/doc/refman/5.7/en/group-replication-options.html

2. Limitations of current scheme

The current flow-control mechanism used by GR depends on heuristics that leave little control to the user, other than enabling/disabling it and setting the flow-control thresholds. However, there are several built-in values that could benefit from user input in some situations, and that is presently not possible.

3.
User stories
- as a user, I would like to be able to configure flow-control in order to optimize the lag between members, while extracting the highest throughput from the underlying system and running within what I consider safe operational limits;
- as a user, I would like to be able to configure flow-control to keep the throughput as stable as possible independently of the number of clients and of the demands of the workload;
- as a user, I would like to be able to configure flow-control so that all members of the group have a similar opportunity to submit transactions to the cluster independently of the number of clients and of the demands of the workload;
- as a user, I would like to perform management tasks on some members without reducing the throughput of the rest of the cluster in some cases;
- as a user, I would like to adapt flow-control policies to workloads with transactions that take a long time to commit as well as to workloads with shorter transactions;
- as a user, I would like to assign different tasks to some members of the group and be allowed to configure flow-control accordingly.

4. Areas requiring more user control

This worklog introduces user control over:

1. The minimum flow-control quota that can be assigned to a member: presently it is set to 5% of the lowest flow-control threshold, but these are unrelated variables and it makes little sense for them to be dependent.

2. The minimum recovering flow-control quota: different from the minimum flow-control quota in that it applies only while members are entering the group, so that the members entering the group don't excessively affect throughput. Some users complained that flow-control should not be used when a member is joining, but one of the original goals was precisely to avoid having members that never join because they cannot keep up with the group workload while joining.

3.
The maximum cluster commit quota: in order to keep the group operating within a safe throughput range (to improve predictability and the balance between members).

4. The proportion of the quota that is assigned to each member: when flow-control is active the quota of each writer member is similar, which may be sub-optimal if some members are expected to write more than others. This can also be used to assign more than 100% of the calculated quota (in total), so the user can over/under-subscribe the estimations done by the system, either because the calculations are too conservative, or because the user prefers to let some lag pile up, but more slowly than without flow control. It will also allow the user to set a maximum quota per member, as a part of the maximum commit quota.

5. The flow-control period: currently flow-control messages are sent once per second, a fixed rate that depends on the unfortunately also fixed rate of the garbage collection in GR. This can be inconvenient if the commit rate is very low and not enough transactions are executed per second to give the heuristics enough data to work properly.

6. The percentage of the quota that is reserved for catch-up and the percentage by which flow-control is released: presently constants in the code, 10% and 50% of the quota respectively. These require better tuning; setting them properly reduces the variations that are presently seen in the throughput when the system is close to exhausting its free capacity.

In addition, participation in flow-control should be optional, so that some members can be ignored by other members if they are doing some maintenance tasks, for instance. That would allow those members to remain in the group without the group trying to keep them in sync, at the cost of having the garbage collection in GR become less effective if the members stop applying the transactions that are being sent to the group.

Flow-control should remain local to the writer, and control the local member only.
If there is an unintentional mismatch between the configurations of the members, it needs to be addressed by changing the variables, something that should continue to be possible dynamically even while Group Replication is running.
FUNCTIONAL REQUIREMENTS

FR1. When a member has group_replication_flow_control_mode=DISABLED the other members should not be delayed even if it does not apply any transactions, or has a large back-log.

FR2. The computed commits/period quota assigned to each member of the group shall not be lower than group_replication_flow_control_min_recovery_quota (new option). This option will replace group_replication_flow_control_min_quota when the only reason for flow-control to throttle the master is that some members are in recovery mode; if other members exceed the queue sizes for other reasons, group_replication_flow_control_min_quota will be applied.

FR3. When flow-control is active the computed commits/period quota assigned to each member shall not be:
- lower than group_replication_flow_control_min_quota (new option)
- higher than group_replication_flow_control_max_quota (new option), multiplied by group_replication_flow_control_member_quota_percent and divided by 100 if group_replication_flow_control_member_quota_percent is not zero.

FR4. The total commits/period quota of the cluster, when flow control is active, shall be assigned to a member according to the following policy:
- divided equally by the number of members that wrote in the last period if group_replication_flow_control_member_quota_percent=0 (new option)
- assigned a quota = group_replication_flow_control_member_quota_percent / 100 * total quota if group_replication_flow_control_member_quota_percent is between 1% and 100% (the user is responsible for any quota over-subscription).

FR5. The flow-control messages and the execution of the heuristics that set the quota shall occur every group_replication_flow_control_period (new option) seconds. The period specified does not need to be the same on all nodes, but members that have periods exceeding 10 times that of other members will be ignored in some iterations, as stats information about remote members is discarded if those members do not update it at least once every 10 periods.

FR6.
When flow-control is active, the member commits/period quota shall be reduced by group_replication_flow_control_hold_percent (new option) percent to allow the back-log to decrease gradually.

FR7. When flow-control becomes inactive, the total cluster commits/period quota shall be increased by group_replication_flow_control_release_percent percent in each flow-control period, to allow the cluster to increase throughput gradually and avoid large spikes. To disable this option the user can set group_replication_flow_control_release_percent=0, in which case the quota shall be released immediately.

NON-FUNCTIONAL REQUIREMENTS

NF1. The user should be able to configure the flow-control settings to allow smooth execution over time, even when the cluster is near its capacity.

NF2. The flow-control settings should not reduce the cluster throughput significantly unless the user specifies it.
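The bounds in FR2 to FR4, FR6 and FR7 can be illustrated with a small worked sketch. It is not the server code: the function names, the parameter names (the option names without their group_replication_flow_control_ prefix), the integer arithmetic, and in particular the order in which the share, the hold percentage and the bounds are applied are assumptions made for illustration.

```python
def member_quota(total_quota, writers, *, min_quota=0, min_recovery_quota=0,
                 max_quota=0, member_quota_percent=0, hold_percent=10,
                 only_recovering=False):
    """Hypothetical per-member commits/period quota while flow control is active."""
    # FR4: equal split among last period's writers, or a fixed percentage.
    if member_quota_percent == 0:
        share = total_quota // max(writers, 1)
    else:
        share = total_quota * member_quota_percent // 100

    # FR6: hold back part of the quota so the back-log can shrink.
    share = share * (100 - hold_percent) // 100

    # FR3: upper bound; 0 means no maximum is set.
    if max_quota > 0:
        cap = max_quota
        if member_quota_percent != 0:
            cap = max_quota * member_quota_percent // 100
        share = min(share, cap)

    # FR2/FR3: lower bound; the recovery minimum applies only when
    # recovering members are the sole reason for throttling.
    floor = min_recovery_quota if only_recovering else min_quota
    return max(share, floor)


def released_quota(current_quota, release_percent):
    """FR7: one release step; None stands for 'released immediately'."""
    if release_percent == 0:
        return None
    return current_quota + current_quota * release_percent // 100
```

With these assumptions, a total quota of 1000 commits/period split between two writers and the default 10% hold yields 450 commits/period per member, and a release step at the default 50% raises a quota of 100 to 150.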
APPROACH

Change the current flow-control code to introduce further boundaries on the quotas assigned. To implement this worklog one needs to create 7 new options:
- group_replication_flow_control_min_quota= X commits/s
- group_replication_flow_control_min_recovery_quota= X commits/s
- group_replication_flow_control_max_quota= X commits/s
- group_replication_flow_control_member_quota_percent= Y %
- group_replication_flow_control_period= Z seconds
- group_replication_flow_control_hold_percent= Y %
- group_replication_flow_control_release_percent= Y %

In addition, an existing option needs to have its behaviour better defined, so that a member which has:
- group_replication_flow_control_mode= DISABLED
will be excluded from the flow-control calculations by other members.

DETAILS

* New system variables to allow user control over the flow-control:

- Name: group_replication_flow_control_min_quota
- Input Values: [0, MAX_INT]
- Default: 0
- Units: commits per flow-control period
- Description: This option controls the lowest flow-control quota that can be assigned to a member, independently of the calculated minimum quota executed in the last period. A value of 0 implies that there is no minimum quota. Cannot be larger than group_replication_flow_control_max_quota.
- Dynamic: yes.
- Scope: Global
- Type: Integer

- Name: group_replication_flow_control_min_recovery_quota
- Input Values: [0, MAX_INT]
- Default: 0
- Units: commits per flow-control period
- Description: This option controls the lowest quota that can be assigned to a member because of another recovering member in the group, independently of the calculated minimum quota executed in the last period. A value of 0 implies that there is no minimum quota. Cannot be larger than group_replication_flow_control_max_quota.
- Dynamic: yes.
- Scope: Global
- Type: Integer

- Name: group_replication_flow_control_max_quota
- Input Values: [0, MAX_INT]
- Default: 0
- Units: commits per flow-control period
- Description: This option defines the maximum flow-control quota of the cluster, or the maximum available quota for any period while flow-control is enabled. A value of 0 implies that there is no maximum quota set. Cannot be smaller than group_replication_flow_control_min_quota and group_replication_flow_control_min_recovery_quota.
- Dynamic: yes.
- Scope: Global
- Type: Integer

- Name: group_replication_flow_control_member_quota_percent
- Input Values: [0, 100%]
- Default: 0
- Description: This option defines the percentage of the quota that a member should assume is available for itself when calculating the quotas. A value of 0 implies that the quota should be split equally between members that were writers in the last period.
- Dynamic: yes.
- Scope: Global
- Type: Integer

- Name: group_replication_flow_control_period
- Input Values: [1, 60]
- Default: 1
- Units: seconds between flow-control periods
- Description: This option defines how many seconds to wait between flow-control iterations, in which flow-control messages are sent and flow-control management tasks are run.
- Dynamic: yes.
- Scope: Global
- Type: Integer

- Name: group_replication_flow_control_hold_percent
- Input Values: [0, 100%]
- Default: 10%
- Description: This option defines what percentage of the cluster quota should remain unused to allow a cluster under flow-control to catch up on backlog. A value of 0 implies that no part of the quota will be reserved for catching up on the work backlog.
- Dynamic: yes.
- Scope: Global
- Type: Integer

- Name: group_replication_flow_control_release_percent
- Input Values: [0, 1000%]
- Default: 50%
- Description: This option defines how the cluster quota should be released when flow-control no longer needs to throttle the writer members, with this percentage being the quota increase per flow-control period. A value of 0 implies that once the flow-control thresholds are within limits the quota will be released in a single flow-control iteration. The range allows the quota to be released at up to 10 times the current quota, as that allows a greater degree of adaptation, mainly when the flow-control period is large and the quotas are very small.
- Dynamic: yes.
- Scope: Global
- Type: Integer

All the variables are not only dynamic, but must also be changeable while Group Replication is running.

* Observability
- This worklog relies on the flow-control mechanism that we have in place already, only the commit quotas change.

* Cross-version group replication
- The flow-control decisions impacted by this worklog are local and should not affect cross-version replication.

* Rolling-upgrades
- The upgrade procedures do not change as a result of this worklog.
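The ranges, defaults and cross-variable constraints listed above can be summarised in a short validation sketch. It is illustrative only: the validate function and its dictionary keys (the option names without the group_replication_flow_control_ prefix) are hypothetical, MAX_INT is assumed to be 2^31 - 1, and the min/max comparison is assumed to apply only when a maximum quota is actually set (non-zero).

```python
MAX_INT = 2**31 - 1  # assumption: the worklog does not state the value of MAX_INT

# (lowest, highest) accepted value per option, as listed in DETAILS.
RANGES = {
    "min_quota": (0, MAX_INT),
    "min_recovery_quota": (0, MAX_INT),
    "max_quota": (0, MAX_INT),
    "member_quota_percent": (0, 100),
    "period": (1, 60),
    "hold_percent": (0, 100),
    "release_percent": (0, 1000),
}

# Defaults as listed in DETAILS.
DEFAULTS = {
    "min_quota": 0,
    "min_recovery_quota": 0,
    "max_quota": 0,
    "member_quota_percent": 0,
    "period": 1,
    "hold_percent": 10,
    "release_percent": 50,
}


def validate(settings):
    """Return a list of problems found in a (partial) settings dict."""
    merged = {**DEFAULTS, **settings}
    errors = []
    for name, (lo, hi) in RANGES.items():
        if not lo <= merged[name] <= hi:
            errors.append(f"{name} out of range [{lo}, {hi}]")
    # Assumption: 0 means "no maximum", so the comparison is skipped then.
    max_quota = merged["max_quota"]
    if max_quota > 0:
        for name in ("min_quota", "min_recovery_quota"):
            if merged[name] > max_quota:
                errors.append(f"{name} exceeds max_quota")
    return errors
```

For example, validate({"min_quota": 50, "max_quota": 10}) reports that min_quota exceeds max_quota, while the defaults on their own produce no errors.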
The main heuristics of the flow-control logic are executed in Flow_control_module::flow_control_step(). Other than adding the new options, the changes in this worklog consist of introducing boundaries on the quota calculations performed there, according to the specifications above. The changes in the flow-control interval are also controlled inside this function, which continues to be called every second but only allows flow-control to run when enough seconds have passed since it last executed. In order to ignore members that have flow-control-mode=DISABLED, the flow-control messages will add a char to share the flow-control-mode of each member.
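The two behaviours described above, running the step only once per configured period while still being called every second, and skipping members that advertise flow-control-mode=DISABLED, can be sketched as follows. This is a hedged illustration rather than the code in Flow_control_module::flow_control_step(): StepGate, members_to_consider and the "mode" key of the per-member stats are hypothetical names.

```python
class StepGate:
    """Hypothetical gate: called every second, fires once per period."""

    def __init__(self, period):
        self.period = period   # seconds between flow-control iterations
        self.elapsed = 0       # seconds since the last iteration ran

    def tick(self):
        """Return True when enough seconds have passed to run a step."""
        self.elapsed += 1
        if self.elapsed >= self.period:
            self.elapsed = 0
            return True
        return False


def members_to_consider(stats):
    """Drop members whose shared flow-control mode is DISABLED."""
    return [m for m in stats if m["mode"] != "DISABLED"]
```

With a period of 3, six consecutive one-second ticks run the step on the third and sixth call only.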
Copyright (c) 2000, 2023, Oracle Corporation and/or its affiliates. All rights reserved.