WL#9838: Group Replication: Flow-control fine tuning

Affects: Server-8.0   —   Status: Complete

EXECUTIVE SUMMARY

1. Current flow-control scheme

Group Replication introduced flow control to avoid excessive buffering and to
maintain group members reasonably close to one another. 
For several reasons, it's more efficient to keep buffering low but, as mentioned
before, it is not a requirement that members are kept in sync for replication to
work: once a slave becomes delayed it will just increase the number of pending
transactions in the relay log.

The flow control mechanism enters the scene to bound the back-log members
can accumulate, in terms of transactions to certify and transactions to apply.
Once those bounds are exceeded, and while they remain exceeded, the flow-control
mechanism limits the throughput of writer members to adjust to the capacity of
the slowest members of the group. By imposing conservative bounds it tries to
make sure all members are able to keep-up and can return safely to within the
bounds defined by the user.

The flow-control system works asynchronously, the capacity to use in each period
is calculated deterministically in each member, in each period, using the
statistics that are regularly sent to the group by each member. The coordination
is purposely kept coarse-grained, in order to reduce the number of messages in
flight and to keep the system running as smoothly as possible.

The implementation depends on two basic mechanisms: 1) monitoring the members to
gather statistics on throughput and queue sizes of all group members to make
educated guesses on what is the maximum write pressure each member can withstand
and 2) throttling the members to avoid writing beyond its fair-share of the
capacity available in each step.

The following variables are used to control flow-control behaviour in Group
Replication:
  - group-replication-flow-control-mode = QUOTA | DISABLED
  - group-replication-flow-control-certififier-threshold = 0..n
  - group-replication-flow-control-applier-threshold = 0..n

Flow-control can be enabled and disabled by selecting a flow control algorithm
with the group-replication-flow-control-mode option. QUOTA is the default
flow-control mode, and only one for now, and it calculates a write quota for the
group in terms of commits/period. That quota, after being divided by the number
of members that tried to write in the last period, will be the maximum allowance
the client threads have to spend during the next flow-control period.

Flow control takes into account two work queues: the certification queue and the
group replication applier queue, so a user may choose 1) doing flow control at
the certifier or at the applier level, or both and 2) what is the trigger size
on each queue.

The options group-replication-flow-control-certififier-threshold and
group-replication-flow-control-applier-threshold specify the size of the queues
that when exceeded will force flow-control to throttle the throughput on the
writer members. More information about how these options work is available in
the user's manual here:
  https://dev.mysql.com/doc/refman/5.7/en/group-replication-options.html

2. Limitations of current scheme

Current flow-control mechanism used by GR depends on heuristics that leave
little control to the user, other then enabling/disabling it and setting the
flow-control thresholds. However, there are several built-in values that could
benefit from user input in some situations, and that is presently not possible.

3. User stories
   - as a user, I would like to be able to configure flow-control in order to
optimize the lag between member members, while extracting the highest throughput
of the underlying system and running within what I consider safe operational limits;

   - as a user, I would like to be able to configure flow-control to keep the
throughput as stable as possible independently of the number of clients and of
the demands of the workload;

   - as a user, I would like to be able to configure flow-control so that all
members of the group have similar opportunity to submit transactions on the
cluster independently of the number of clients and of the demands of the workload;

   - as a user, I would like the perform management tasks on some members
without reducing the throughput of the rest of the cluster in some cases;

   - as a user, I would like to adapt flow-control policies to workloads which
take a long time to commit as well as it does to shorter transactions;

   - as a user, I would like to assign different tasks to some members of the
group and be allowed to configure flow-control accordingly.

4. Areas requiring more user control

This worklog introduces user control over:

1. The minimum flow-control quota that can be assigned to a member: presently it
is set to 5% of the lowest flow-control thresholds, but these are unrelated
variables and it makes little sense for them to be dependent.

2. The minimum recovering flow-control quota: different from the minimum
flow-control quota in that it applies only to members entering the group, so
that the members entering the group don't excessively affect throughput. Some
users complained that flow-control should not be used when a member is joining,
but one of the original goals was precisely to avoid having members that never
join because they cannot keep up with the group workload while joining.

3. The maximum cluster commit quota: in order to keep the group operating within
a safe throughput range (to improve on predictable behaviour and balance between
members).

4. The proportion of the quota that is assigned to each member: when
flow-control is active the quota of each writer member is similar, which may be
sub-optimal if some members are expected to write more than others. This can
also be used to assign more than 100% of the calculated quota (total), so the
user can over/under-subscribe the estimations done be the system, either because
the calculations are too conservative, or because the user prefers to let some
lag to pile up, but slower then without flow control. It will also allow the
user to set a maximum quota por member, as a part of the maximum commit quota.

5. The flow-control period: currently flow-control messages are sent once per
second, a fixed rate that depends on the unfortunately also fixed rate of the
garbage collection in GR, which can be inconvenient if the commit rate is very
low and not enough transactions are executed per second to give the heuristics
enough data to work properly.

6. The percentage of the quota that is reserved for catch-up and when
flow-control is released: presently constants in the code, 10% and 50% of the
quota respectively, require better tunning those properly reduce variations that
are presently seem in the throughput when the system is close to exhausting its
free capacity.

In addition, participation in flow-control should be optional, so that some
members can be ignored by other members if they are doing some maintenance tasks
for instance. That would allow the members to remain in the group, but would not
try to make them in sync, at the cost of having the garbage collection in GR
become less effective if the members stop applying the transactions that are
being sent to the group.

Flow-control should remain local to the writer, and control the local member
only. If there is a mismatch between the configurations, if unintentional, they
need to be adressed by changing the variables, something that should continue to
be doing dynamically even if Group Replication is running.