WL#9838: Group Replication: Flow-control fine tuning

Affects: Server-8.0 — Status: Complete

EXECUTIVE SUMMARY

1. Current flow-control scheme

Group Replication introduced flow control to avoid excessive buffering and to
keep group members reasonably close to one another.
For several reasons it is more efficient to keep buffering low, but it is not a
requirement that members are kept in sync for replication to work: once a slave
becomes delayed it will just increase the number of pending transactions in the
relay log.

The flow control mechanism enters the scene to bound the back-log members
can accumulate, in terms of transactions to certify and transactions to apply.
Once those bounds are exceeded, and while they remain exceeded, the flow-control
mechanism limits the throughput of writer members to adjust to the capacity of
the slowest members of the group. By imposing conservative bounds it tries to
make sure all members are able to keep up and can return safely to within the
bounds defined by the user.

The flow-control system works asynchronously: the capacity to use in each period
is calculated deterministically by each member, using the statistics that are
regularly sent to the group by each member. The coordination
is purposely kept coarse-grained, in order to reduce the number of messages in
flight and to keep the system running as smoothly as possible.

The implementation depends on two basic mechanisms: 1) monitoring the members to
gather statistics on the throughput and queue sizes of all group members, in
order to make educated guesses about the maximum write pressure each member can
withstand, and 2) throttling the members so that none writes beyond its fair
share of the capacity available in each step.

The following variables are used to control flow-control behaviour in Group
Replication:
  - group-replication-flow-control-mode = QUOTA | DISABLED
  - group-replication-flow-control-certifier-threshold = 0..n
  - group-replication-flow-control-applier-threshold = 0..n

Flow-control can be enabled and disabled by selecting a flow control algorithm
with the group-replication-flow-control-mode option. QUOTA is the default
flow-control mode, and the only one for now; it calculates a write quota for the
group in terms of commits/period. That quota, after being divided by the number
of members that tried to write in the last period, will be the maximum allowance
the client threads have to spend during the next flow-control period.

Flow control takes into account two work queues: the certification queue and the
group replication applier queue, so a user may choose 1) whether to do flow
control at the certifier level, at the applier level, or both, and 2) the
trigger size of each queue.

The options group-replication-flow-control-certifier-threshold and
group-replication-flow-control-applier-threshold specify the queue sizes that,
when exceeded, force flow-control to throttle the throughput of the writer
members. More information about how these options work is available in the
user's manual here:
  https://dev.mysql.com/doc/refman/5.7/en/group-replication-options.html

2. Limitations of current scheme

The current flow-control mechanism used by GR depends on heuristics that leave
little control to the user, other than enabling/disabling it and setting the
flow-control thresholds. However, there are several built-in values that could
benefit from user input in some situations, and that is presently not possible.

3. User stories
   - as a user, I would like to be able to configure flow-control in order to
optimize the lag between group members, while extracting the highest throughput
of the underlying system and running within what I consider safe operational limits;

   - as a user, I would like to be able to configure flow-control to keep the
throughput as stable as possible independently of the number of clients and of
the demands of the workload;

   - as a user, I would like to be able to configure flow-control so that all
members of the group have similar opportunity to submit transactions on the
cluster independently of the number of clients and of the demands of the workload;

   - as a user, I would like to perform management tasks on some members
without reducing the throughput of the rest of the cluster in some cases;

   - as a user, I would like to adapt flow-control policies to workloads whose
transactions take a long time to commit as well as to workloads with shorter
transactions;

   - as a user, I would like to assign different tasks to some members of the
group and be allowed to configure flow-control accordingly.

4. Areas requiring more user control

This worklog introduces user control over:

1. The minimum flow-control quota that can be assigned to a member: presently it
is set to 5% of the lowest flow-control thresholds, but these are unrelated
variables and it makes little sense for them to be dependent.

2. The minimum recovering flow-control quota: different from the minimum
flow-control quota in that it applies only to members entering the group, so
that the members entering the group don't excessively affect throughput. Some
users complained that flow-control should not be used when a member is joining,
but one of the original goals was precisely to avoid having members that never
join because they cannot keep up with the group workload while joining.

3. The maximum cluster commit quota: in order to keep the group operating within
a safe throughput range (to improve on predictable behaviour and balance between
members).

4. The proportion of the quota that is assigned to each member: when
flow-control is active the quota of each writer member is similar, which may be
sub-optimal if some members are expected to write more than others. This can
also be used to assign more than 100% of the calculated quota (total), so the
user can over/under-subscribe the estimations done by the system, either because
the calculations are too conservative, or because the user prefers to let some
lag pile up, but slower than without flow control. It will also allow the
user to set a maximum quota per member, as a part of the maximum commit quota.

5. The flow-control period: currently flow-control messages are sent once per
second, a fixed rate that depends on the unfortunately also fixed rate of the
garbage collection in GR, which can be inconvenient if the commit rate is very
low and not enough transactions are executed per second to give the heuristics
enough data to work properly.

6. The percentage of the quota that is reserved for catch-up and the percentage
applied when flow-control is released: presently constants in the code, 10% and
50% of the quota respectively. Tuning those properly would reduce the variations
that are presently seen in the throughput when the system is close to exhausting
its free capacity.

In addition, participation in flow-control should be optional, so that some
members can be ignored by other members while they are doing maintenance tasks,
for instance. That would allow those members to remain in the group without the
group trying to keep them in sync, at the cost of having the garbage collection
in GR become less effective if the members stop applying the transactions that
are being sent to the group.

Flow-control should remain local to the writer, and control the local member
only. If there is an unintentional mismatch between the configurations, it needs
to be addressed by changing the variables, something that should continue to be
done dynamically even while Group Replication is running.

FUNCTIONAL REQUIREMENTS

FR1. When a member has group_replication_flow_control_mode=DISABLED the other
members should not be delayed even if it does not apply any transactions, or has
a large back-log.

FR2. The computed commits/period quota assigned to each member of the group
shall not be lower than group_replication_flow_control_min_recovery_quota (new
option) when the only reason for flow-control to throttle the writer members is
that some members are in recovery mode. In that case this option replaces
group_replication_flow_control_min_quota; if other members exceed the queue
sizes for other reasons, group_replication_flow_control_min_quota is applied
instead.

FR3. When flow-control is active the computed commits/period quota assigned for
each member shall not be:
       - lower than group_replication_flow_control_min_quota (new option)
       - higher than group_replication_flow_control_max_quota (new option),
multiplied by group_replication_flow_control_member_quota_percent and divided
by 100 if group_replication_flow_control_member_quota_percent is not zero.
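
The bounds in FR3 can be read as a clamp on the computed quota. The following is
a minimal sketch of one such reading, not the server implementation: the
function name and the short parameter names are hypothetical, and it assumes
that a value of 0 for a minimum or maximum quota means "no bound" and that the
bare maximum applies when the member percentage is 0.

```python
def clamp_member_quota(computed, min_quota=0, max_quota=0,
                       member_quota_percent=0):
    """Bound a computed commits/period quota (one reading of FR3)."""
    quota = computed
    if max_quota > 0:  # assumption: 0 means "no maximum"
        ceiling = max_quota
        if member_quota_percent > 0:
            # The member may only use its configured share of the maximum.
            ceiling = max_quota * member_quota_percent // 100
        quota = min(quota, ceiling)
    if min_quota > 0:  # assumption: 0 means "no minimum"
        quota = max(quota, min_quota)
    return quota


# A member limited to 50% of a 1000 commits/period maximum.
print(clamp_member_quota(5000, max_quota=1000, member_quota_percent=50))
```

In this sketch the minimum is applied last, so it wins if the two bounds ever
conflict; the text does not say which bound takes precedence.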

FR4. The total commits/period quota of the cluster, when flow control is active,
shall be assigned to a member according to the following policy:
       - divided equally by the number of members that wrote in the last period
if group_replication_flow_control_member_quota_percent=0 (new option)
       - assigned a quota = total quota *
group_replication_flow_control_member_quota_percent / 100 if
group_replication_flow_control_member_quota_percent is between 1% and 100% (the
user is responsible for any quota over-subscription).
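
The two branches of the FR4 policy can be illustrated with a few lines of
arithmetic. This is a sketch under stated assumptions: member_share and its
parameters are hypothetical names, integer division is assumed, and a period
with no writers is treated as having one.

```python
def member_share(total_quota, writers_last_period, member_quota_percent=0):
    """Return the part of the cluster quota one member may use (FR4)."""
    if member_quota_percent == 0:
        # Equal split between the members that wrote in the last period.
        return total_quota // max(writers_last_period, 1)
    # Fixed share chosen by the user, regardless of the number of writers.
    return total_quota * member_quota_percent // 100


total = 900
print(member_share(total, 3))      # equal split between three writers
print(member_share(total, 3, 50))  # this member takes 50% of the total
```

If all three writers were configured with 50%, the shares would add up to 150%
of the calculated quota, which is the over-subscription that FR4 leaves to the
user.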

FR5. The flow-control messages and execution of heuristics to set the quota
shall occur every group_replication_flow_control_period (new option) seconds.
The period specified does not need to be the same on all nodes, but members that
have periods exceeding 10 times that of other members will be ignored in some
iterations, as stats information about remote members is discarded if those
members do not update them at least once every 10 periods.
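
The effect of mismatched periods can be estimated with a small simulation. This
is only a sketch: the function name is hypothetical, time is counted in whole
seconds, and the exact boundary at which statistics become stale (here, an age
of 10 or more local periods) is an assumption.

```python
def ignored_iterations(local_period, remote_period, horizon, stale_after=10):
    """Count local iterations in which a remote member's stats are stale."""
    ignored = 0
    for now in range(0, horizon, local_period):
        # Time of the most recent stats message from the remote member.
        last_update = (now // remote_period) * remote_period
        age_in_local_periods = (now - last_update) // local_period
        if age_in_local_periods >= stale_after:
            ignored += 1
    return ignored


# Local member runs every second, remote member reports every 15 seconds.
print(ignored_iterations(1, 15, 30))
```

With these numbers the remote member is ignored in 10 of the 30 local
iterations; with a remote period of 5 seconds it would never be ignored.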

FR6. When flow-control is active, the member commits/period quota shall be
reduced by group_replication_flow_control_hold_percent (new option) percent to
allow the back-log to decrease gradually.
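
As a worked example of FR6, the reduction is a simple percentage cut. The
function name is hypothetical and integer arithmetic is assumed.

```python
def quota_after_hold(member_quota, hold_percent=10):
    """Reserve hold_percent of the quota so the back-log can shrink (FR6)."""
    return member_quota * (100 - hold_percent) // 100


print(quota_after_hold(1000))     # default 10% held back
print(quota_after_hold(1000, 0))  # nothing reserved for catching up
```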

FR7. When flow-control becomes inactive, the total cluster commits/period quota
shall be increased by group_replication_flow_control_release_percent percent in
each flow control period to allow the cluster to increase throughput gradually
and avoid large spikes. To disable this option the user can set
group_replication_flow_control_release_percent=0, in which case the quota shall be
released immediately.
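
The gradual release in FR7 behaves like compound growth of the quota. The sketch
below is illustrative only: the function names are hypothetical, None is used to
stand for "no quota in force", and the condition that ends the release phase is
not modelled because the text does not specify it.

```python
def next_quota_after_release(current_quota, release_percent):
    """One release step: grow the quota, or lift it when the percent is 0."""
    if release_percent == 0:
        return None  # quota released immediately
    return current_quota + current_quota * release_percent // 100


def release_trajectory(start_quota, release_percent, periods):
    """Quota seen in each of the following flow-control periods."""
    quotas = []
    quota = start_quota
    for _ in range(periods):
        quota = next_quota_after_release(quota, release_percent)
        quotas.append(quota)
        if quota is None:
            break
    return quotas


print(release_trajectory(100, 50, 3))
```

Starting from 100 commits/period with the default 50%, the quota grows to 150,
225 and then 337 over three periods.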

NON-FUNCTIONAL REQUIREMENTS

NF1. The user should be able to configure the flow-control settings to allow
smooth execution over time, even when the cluster is near its capacity.

NF2. The flow-control settings should not reduce the cluster throughput
significantly unless the user specifies it.

APPROACH

Change current flow-control code to introduce further boundaries on the quotas
assigned.

To implement this worklog one needs to create 7 new options:
   - group_replication_flow_control_min_quota= X commits/s
   - group_replication_flow_control_min_recovery_quota= X commits/s
   - group_replication_flow_control_max_quota= X commits/s
   - group_replication_flow_control_member_quota_percent= Y %
   - group_replication_flow_control_period= Z seconds
   - group_replication_flow_control_hold_percent= Y %
   - group_replication_flow_control_release_percent= Y %

In addition, an existing option needs to have its behaviour better defined, so
that a member which has:
   - group_replication_flow_control_mode= DISABLED
is excluded from the flow-control calculations by other members.


DETAILS

* New system variables to allow user control over the flow-control:

- Name: group_replication_flow_control_min_quota
  - Input Values: [0, MAX_INT]
  - Default: 0
  - Units: commits per flow-control period
  - Description: This option controls the lowest flow-control quota that can be
assigned to a member, independently of the calculated minimum quota executed in
the last period. A value of 0 implies that there is no minimum quota.
    Cannot be larger than group_replication_flow_control_max_quota.
  - Dynamic: yes.
  - Scope: Global
  - Type: Integer

- Name: group_replication_flow_control_min_recovery_quota
  - Input Values: [0, MAX_INT]
  - Default: 0
  - Units: commits per flow-control period
  - Description: This option controls the lowest quota that can be assigned to a
member because of another recovering member in the group, independently of the
calculated minimum quota executed in the last period. A value of 0 implies that
there is no minimum quota. Cannot be larger than
group_replication_flow_control_max_quota.
  - Dynamic: yes.
  - Scope: Global
  - Type: Integer

- Name: group_replication_flow_control_max_quota
  - Input Values: [0, MAX_INT]
  - Default: 0
  - Units: commits per flow-control period
  - Description: This option defines the maximum flow-control quota of the
cluster, or the maximum available quota for any period while flow-control is
enabled. A value of 0 implies that there is no maximum quota set.
    Cannot be smaller than group_replication_flow_control_min_quota and
group_replication_flow_control_min_recovery_quota.
  - Dynamic: yes.
  - Scope: Global
  - Type: Integer

- Name: group_replication_flow_control_member_quota_percent
  - Input Values: [0, 100%]
  - Default: 0
  - Description: This option defines the percentage of the quota that a member
should assume is available for itself when calculating the quotas. A value of 0
implies that the quota should be split equally between members that were
writers in the last period.
  - Dynamic: yes.
  - Scope: Global
  - Type: Integer

- Name: group_replication_flow_control_period
  - Input Values: [1, 60]
  - Default: 1
  - Units: seconds between flow-control periods
  - Description: This option defines how many seconds to wait between
flow-control iterations, in which flow-control messages are sent and
flow-control management tasks are run.
  - Dynamic: yes.
  - Scope: Global
  - Type: Integer

- Name: group_replication_flow_control_hold_percent
  - Input Values: [0, 100%]
  - Default: 10%
  - Description: This option defines what percentage of the cluster quota should
remain unused to allow a cluster under flow-control to catch up on its backlog.
    A value of 0 implies that no part of the quota will be reserved for catching
up on the work backlog.
  - Dynamic: yes.
  - Scope: Global
  - Type: Integer

- Name: group_replication_flow_control_release_percent
  - Input Values: [0, 1000%]
  - Default: 50%
  - Description: This option defines how the cluster quota should be released
when flow-control no longer needs to throttle the writer members, with this
percentage being the quota increase per flow-control period. 
    A value of 0 implies that once the flow-control thresholds are within
limits the quota will be released in a single flow-control iteration.
    The range allows the quota to be released at up to 10 times current quota,
as that allows a greater degree of adaptation, mainly when the flow-control
period is large and the quotas are very small.
  - Dynamic: yes.
  - Scope: Global
  - Type: Integer

All the variables are not only dynamic, but must also be changeable while Group
Replication is running.
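
The ranges and cross-option constraints listed above can be summarised as a
validation routine. This is a sketch, not the plugin's check functions: the
dictionary keys are the option names without the
group_replication_flow_control_ prefix, the value of MAX_INT is assumed, and it
is an assumption that a maximum quota of 0 ("no maximum") disables the
min/max comparison.

```python
MAX_INT = 2**31 - 1  # assumed value; the text only says MAX_INT

RANGES = {
    "min_quota": (0, MAX_INT),
    "min_recovery_quota": (0, MAX_INT),
    "max_quota": (0, MAX_INT),
    "member_quota_percent": (0, 100),
    "period": (1, 60),
    "hold_percent": (0, 100),
    "release_percent": (0, 1000),
}

DEFAULTS = {
    "min_quota": 0, "min_recovery_quota": 0, "max_quota": 0,
    "member_quota_percent": 0, "period": 1,
    "hold_percent": 10, "release_percent": 50,
}


def validate(options):
    """Return a list of problems found in a full set of option values."""
    errors = []
    for name, (low, high) in RANGES.items():
        value = options[name]
        if not low <= value <= high:
            errors.append(f"{name}={value} outside [{low}, {high}]")
    max_quota = options["max_quota"]
    if max_quota > 0:  # assumption: 0 means no maximum, so nothing to compare
        for name in ("min_quota", "min_recovery_quota"):
            if options[name] > max_quota:
                errors.append(f"{name} cannot be larger than max_quota")
    return errors


print(validate({**DEFAULTS, "min_quota": 500, "max_quota": 100}))
```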


* Observability
  - This worklog relies on the flow-control mechanism that we have in place
already, only the commit quotas change.

* Cross-version group replication
  - The flow-control decisions impacted by this worklog are local and should not
affect cross-version replication.

* Rolling-upgrades
  - The upgrade procedures do not change as a result of this worklog.

The main heuristics of the flow-control logic are executed in
Flow_control_module::flow_control_step().

Other than adding the new options, the changes in this worklog consist of
introducing boundaries on the quota calculations performed there, according to
the specifications above.

The changes in the flow-control interval are also controlled inside this
function, which continues to be called every second but only allows flow-control
to run when enough seconds have passed since it last executed.
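
The gating described here can be sketched as a counter that is advanced on every
one-second call. The class below is a hypothetical illustration in Python, not
the code of Flow_control_module; the period is read on every tick so that a
dynamic change of the option takes effect without a restart.

```python
class PeriodGate:
    """Let the flow-control step run once every period_seconds ticks."""

    def __init__(self, period_seconds=1):
        self.period_seconds = period_seconds
        self.seconds_since_run = 0

    def tick(self):
        """Called once per second; return True when the step should run."""
        self.seconds_since_run += 1
        if self.seconds_since_run >= self.period_seconds:
            self.seconds_since_run = 0
            return True
        return False


gate = PeriodGate(period_seconds=3)
print([gate.tick() for _ in range(7)])
```

With a period of 3 seconds the step runs on the third and sixth call only; with
the default period of 1 it runs on every call, as before this worklog.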

In order to ignore members that have flow-control-mode=DISABLED, the
flow-control messages will add a char to share the flow-control-mode of each
member.
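
A minimal sketch of how a receiver could use the shared mode follows. The
MemberStats record and both functions are hypothetical, the mode is shown as a
string rather than a char for readability, and the thresholds are treated as
plain limits without any special meaning for 0.

```python
from dataclasses import dataclass


@dataclass
class MemberStats:
    member_id: str
    certifier_queue: int
    applier_queue: int
    mode: str  # "QUOTA" or "DISABLED", as announced by the member itself


def members_to_consider(stats):
    """Drop members that opted out of flow-control."""
    return [s for s in stats if s.mode != "DISABLED"]


def needs_throttling(stats, certifier_threshold, applier_threshold):
    """True if any participating member exceeds one of the thresholds."""
    return any(
        s.certifier_queue > certifier_threshold
        or s.applier_queue > applier_threshold
        for s in members_to_consider(stats)
    )


group = [
    MemberStats("a", 10, 10, "QUOTA"),
    MemberStats("b", 90000, 90000, "DISABLED"),  # under maintenance
]
print(needs_throttling(group, 25000, 25000))
```

Member "b" has a large back-log but does not delay the others, which is the
behaviour FR1 asks for.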