WL#11512: Allow XCom's event horizon value to be configurable

Affects: Server-8.0 — Status: Complete

Executive summary

The current implementation of the Group Communication System (GCS) component is XCom. The event horizon is one of XCom's internal parameters that govern its behaviour. XCom decides several consensus instances concurrently, and the event horizon value specifies the maximum number of outstanding consensus instances at any time. Therefore, XCom's performance on any particular deployment depends on the event horizon value. This worklog's goal is to allow users to tune the GCS' performance for their particular deployments. Specifically, to allow users to configure XCom's event horizon value.

1 User stories

As a MySQL user, I want to configure XCom's event horizon value so that I am able to tune performance on my particular deployment.

Requirements

1 Definitions

New instance: a server version that implements this worklog.

Old instance: a server version that does not implement this worklog.

Compile-time default: the default event horizon value (10) defined at compile-time on old instances.

2 Functional

It must be possible to change a group's event horizon through any member of the group if all members are running a newer instance.
It must not be possible to change the group's event horizon value if any member is running an older instance.
It must be possible to add an older instance to a group if the group's event horizon value is the compile-time default.
It must not be possible to add an older instance to a group if the group's event horizon value is not the compile-time default.
It must be possible to add a newer instance to a group regardless of the group's event horizon value.
It must not be possible for an instance that is not in a group to change or inspect the event horizon.
It must be possible to set the event horizon to a value inside the domain [10, 200].

3 Non-functional

All members of a group must agree on the event horizon value for every group configuration.
All members of a group must use the same event horizon value when processing a message.
It should be possible to change a group's event horizon value without having to stop the group's regular message flow, or the group itself.

4 Discussion

The functional requirements 1--4 and non-functional requirements 1 and 2, are safety properties: they are required for correct behaviour.

The functional requirements 1, 5, and 6, and non-functional requirement 3, are usability requirements.

Interface specification

1 Configuration

1.1 Modifying the event horizon

We create a new UDF group_replication_set_write_concurrency(new_event_horizon) to reconfigure the group's event horizon value to new_event_horizon at runtime.

For example:

mysql> SELECT group_replication_set_write_concurrency(42);

The UDF must fail with an appropriate error code and message if new_event_horizon is outside the domain of possible event horizon values.

But since group_replication_set_write_concurrency is asynchronous, its final outcome is only reflected in the server log. If succeeding would violate functional requirement 2 or 6, group_replication_set_write_concurrency fails and that fact, and reason, is logged in the server log. Otherwise, group_replication_set_write_concurrency succeeds and that fact is logged in the server log.

1.2 Inspecting the event horizon

We create a new UDF group_replication_get_write_concurrency to inspect the group's event horizon value at runtime.

For example:

mysql> SELECT group_replication_get_write_concurrency();
+-------------------------------------------+
| group_replication_get_write_concurrency() |
+-------------------------------------------+
|                                        42 |
+-------------------------------------------+

The command must fail with an appropriate error code and message if the event horizon cannot be inspected. The commands fails if succeeding would violate functional requirement 6.

2 Conclusion

A group is bootstrapped with the default event horizon value.

An user can inspect a group's event horizon value by connecting to a member and executing the group_replication_get_write_concurrency UDF.

An user can reconfigure a group's event horizon value by connecting to a member and executing the group_replication_set_write_concurrency UDF.

Design specification

1 Preliminaries

Each consensus instance is uniquely identified by a synode_no. synode_nos are totally ordered.

Each group configuration is stored in a site_def.

Each consensus instance is associated with a group configuration, i.e. there is a mapping from synode_no to site_def.

Consider a group reconfiguration r decided in consensus instance c. Currently, the reconfiguration r takes effect starting from consensus instance c + n + 1, where n is the event horizon value. For example, consider a group with 3 members m1, m2, and m3. If the reconfiguration add_member(m4) is decided in consensus instance 42 and the event horizon value is 10, then m4 becomes part of the group from consensus instance 42 + 10 + 1 = 53 onwards.

Besides being used in the calculation of when a reconfiguration takes place, the event horizon is also used as a "high water mark" on the number of outstanding consensus instances. Consider that consensus instance c is the latest whose command has been executed. Currently, the group can decide, but not execute, up to consensus instance c + n before consensus instance c + 1 is executed, where n is the event horizon value. For example, if consensus instance 41 is the latest executed instance and the event horizon value is 10, the group can decide up to instance 41 + 10 = 51 before instance 42 is executed.

2 Objective

Currently the event horizon value is static, i.e. it is defined at compile-time. Our objective is to have a dynamic event horizon value, i.e. that can be changed at runtime.

3 Design

Make the event horizon a group parameter. In practice this means that each site_def must now contain an event horizon value. It also means that changing the event horizon value requires a group reconfiguration equivalent to adding a member to the group, for example. Before this worklog, we had the relation site_def(synode_no) where each consensus instance (synode_no) had an associated group configuration (site_def). We also have a static event horizon value event_horizon. After this worklog, we change the event horizon from a static value to a relation event_horizon(site_def) where each group configuration (site_def) has an associated event horizon value (event_horizon).

On the normal case, when a group reconfiguration is decided on consensus instance c, the reconfiguration takes effect starting from consensus instance c + event_horizon(site_def(c)), i.e. after event-horizon instances according to the group configuration of instance c.

Similarly, on the normal case, if consensus instance c is the latest whose decision has been executed, then the group can decide, but not execute, up to consensus instance c + event_horizon(site_def(c)) before consensus instance c + 1 is executed.

3.1 Join

IIUC, a node receives the group configuration (via gcs_snapshot) when it joins the group. The joining node must receive the event horizon as part of the group configuration. Here a gist of the flow:

Node asks seed to join group.
Seed, on behalf of node, proposes to add node to group.
Group agrees when node becomes part of group.
Node will eventually receive are you alive? messages when it becomes part of the group.
Node replies to are you alive? with yes, but need configuration.
Node receives configuration.
Node is good to go and will in the mean time receive missed decisions.

3.2 Upgrade/downgrade

Older instances of the server, without this worklog implemented, have a static event horizon value defined at compile-time (10). Therefore, we need to ensure the invariant that a group with at least an older instance as a member must use an event horizon value of 10.

3.2.1 Join

When a node using an older instance of the server wants to join a group that is using an event horizon value different from 10, we need to ensure our invariant remains true.

One way to to maintain correctness is to prevent the older instance from joining the group. This requires an explicit reconfiguration of the event horizon value before an older instance can be added to a group, but it does not change settings from under the user. We will implement this.

3.2.2 Event horizon reconfiguration

If at least one of the group's members is using an older instance of the server then reconfigurations of the event horizon must be denied. The denial must be reflected in the log.

3.2.3 Concurrency between join and event horizon reconfiguration

There is also a possibility of breaking our invariant when there is a join concurrent with a reconfiguration of the event horizon value. We say a join and an event horizon reconfiguration are concurrent if one is decided on consensus instance i and thus takes effect on instance k, and the other is decided on consensus instance j such that i < j < k. This means that reconfiguration j is decided on a state where the effects of reconfiguration i have been scheduled but not yet applied.

For example, assume the event horizon value is 10. Consider the join of member m that uses an older instance of the server. The join is decided at consensus instance i and thus takes effect from instance i + 10 onwards. Now consider that someone issues a reconfiguration of the event horizon value to 20 which is decided at instance i + 1 and takes effect from instance i + 11 onwards. The invariant is broken from instance i + 11 onwards because we have member m using an older instance of the server and the group's event horizon value is not 10. A similar issue can occur if it is the event horizon reconfiguration that is decided before the join.

A possible solution is to validate reconfigurations at execution time. Because reconfigurations are state machine commands, they are executed sequentially in the same order on all members. Before executing a reconfiguration r we can validate it against the active configuration and all pending configurations. The active and pending configurations at this point are the same on all members because reconfigurations are state machine commands. If reconfiguration r is incompatible with the active or pending configurations then r must not be applied. (E.g. because we are reconfiguring the event horizon and there are older instances in the group, or because we are adding an older istance to a group with a custom event horizon value.) We will implement this.

3.3 Relationship with the `pax_machine` cache.

Currently XCom's code assumes that #cache > event_horizon * #members, i.e. there is at least one code path that assumes that it is always possible to obtain a free cache entry. Custom event horizon values, particularly ones much greater than the previous compile-time value (10) can possibly break this invariant. We must maintain this invariant.

We identify at least two possible solutions:

This proposal already validates reconfigurations to maintain correctness (§3.2.1). We can additionally validate that #cache > event_horizon * #members remains true if the reconfiguration takes place.
Dynamically grow the pax_machine cache so that the equation #cache > event_horizon * #members remains true when the reconfiguration takes place.

We will implement (1).

3.4 Logging

When a reconfiguration is applied, this fact must be logged.

When a reconfiguration is not applied, due to the reasons above (§3.2), this fact and the reason must be logged.

3.5 Concurrency between event horizon reconfigurations

It is possible for two, or more, reconfigurations of the event horizon to occur concurrently. We say two reconfigurations are concurrent if one is decided on consensus instance i and thus takes effect on instance k, and the other is decided on consensus instance j such that i < j < k. This means that reconfiguration j is decided on a state where the effects of reconfiguration i have been scheduled but not yet applied.

Consider the following example, where configs is the map from instance to configuration:

configs = {
    a: start=0 eh=10
}

instance   40: cmd=... <--------------- latest executed
instance   41: cmd=change eh to 1000 <- going to execute
instance   42: cmd=change eh to 1
...
instance   50: cmd=...

When we execute the configuration at instance 41, we search for the latest pending configuration that modifies the event horizon. In this case there are none, so we install the new configuration at this_instance + event_horizon(active_config) + 1, i.e. 41 + 10 + 1 = 52. This is what we already did before this WL. From this point on we know that the system may start deciding up to instance 52 + 1000 = 1052:

configs = {
    a: start=0  eh=10
    b: start=52 eh=1000
}

instance   40: cmd=...
instance   41: cmd=change eh to 1000 <- latest executed
instance   42: cmd=change eh to 1
...
instance   50: cmd=...
instance   51: cmd=...
instance   52: cmd=... <--------------- start(b)
...

However, let us consider the case where there is a concurrent reconfiguration:

configs = {
    a: start=0  eh=10
    b: start=52 eh=1000
}

instance   40: cmd=...
instance   41: cmd=change eh to 1000 <- latest executed
instance   42: cmd=change eh to 1 <---- going to execute
...
instance   50: cmd=...
instance   51: cmd=...
instance   52: cmd=... <--------------- start(b)
...

Again, we search for the latest pending configuration that modifies the event horizon, which is configuration b. (Configuration b is pending because this_instance < start(b), i.e. 42 < 52.) We apply the new configuration c after b. This means that the new configuration c with an event horizon of 1 will take effect at instance start(b) + event_horizon(b) + 1, i.e. 52 + 1000 + 1 = 1053. From this point on we know that the system may start deciding up to instance 1053 + 1 = 1054:

configs = {
    a: start=0    eh=10
    b: start=52   eh=1000
    c: start=1053 eh=1
}

instance   40: cmd=...
instance   41: cmd=change eh to 1000
instance   42: cmd=change eh to 1 <---- latest executed
...
instance   50: cmd=...
instance   51: cmd=...
instance   52: cmd=... <--------------- start(b)
...
instance 1053: cmd=... <--------------- start(c)

Using the same rationale for multiple reconfigurations:

configs = {
    a: start=0    eh=10
    b: start=52   eh=1000
    c: start=1053 eh=1
}

instance   40: cmd=...
instance   41: cmd=change eh to 1000
instance   42: cmd=change eh to 1 <---- latest executed
instance   43: cmd=change eh to 100 <-- going to execute
...
instance   50: cmd=...
instance   51: cmd=...
instance   52: cmd=... <--------------- start(b)
...
instance 1053: cmd=... <--------------- start(c)

The latest pending event horizon reconfiguration is c, so we apply the new configuration at start(c) + event_horizon(c) + 1, i.e. 1053 + 1 + 1 = 1055:

configs = {
    a: start=0    eh=10
    b: start=52   eh=1000
    c: start=1053 eh=1
    d: start=1055 eh=100
}

instance   40: cmd=...
instance   41: cmd=change eh to 1000
instance   42: cmd=change eh to 1
instance   43: cmd=change eh to 100 <-- latest executed
...
instance   50: cmd=...
instance   51: cmd=...
instance   52: cmd=... <--------------- start(b)
...
instance 1053: cmd=... <--------------- start(c)
...
instance 1055: cmd=... <--------------- start(d)

Succintly, we linearize concurrent configurations if one of them reconfigures the event horizon. Hence, for any given concurrent reconfiguration commands x and y s.t. x reconfigures the event horizon and instance(x) < instance(y), there are always event_horizon(x) instances between start(x) and start(y). Looking at the same diagrams as above:

configs = {
    a: start=0    eh=10
    b: start=52   eh=1000
    c: start=1053 eh=1
    d: start=1055 eh=100
}

instance   40: cmd=...
instance   41: cmd=change eh to 1000
instance   42: cmd=change eh to 1
instance   43: cmd=change eh to 100 <-- latest executed
...
instance   50: cmd=...
instance   51: cmd=...
instance   52: cmd=... <--------------- start(b)
[there are event_horizon(b) instances in this window]
instance 1053: cmd=... <--------------- start(c)
[there are event_horizon(c) instances in this window]
instance 1055: cmd=... <--------------- start(d)

3.6 Algorithms

3.6.1 Reconfigure event horizon

procedure apply_config(consensus_instance, new_config):
  // are event horizon reconfigurations scheduled?
  if event_horizon_reconfigurations_pending():
    // yes, apply this reconfiguration after the latest
    after_config := get_latest_event_horizon_reconfig()
    new_config.start(after_config.start + after_config.event_horizon + 1)
    install_config(new_config)
  else:
    // no event horizon reconfigurations scheduled, apply normally
    active_config := get_active_config()
    new_config.start(consensus_instance + active_config.event_horizon + 1)
    install_config(new_config)

function reconfig_event_horizon(consensus_instance, event_horizon) -> boolean:
  success := yes
  new_config := create_new_config()
  new_config.event_horizon(event_horizon)
  latest_config := get_latest_config()
  // can we reconfigure?
  if no_old_servers(latest_config) or is_default(event_horizon):
    apply_config(consensus_instance, new_config)
  else:
    success := no
  return success

void handle_config(app_data_ptr) will implement this algorithm.

3.6.2 Add new member

function reconfig_new_member(consensus_instance, new_member) -> boolean:
  success := yes
  new_config := create_new_config()
  new_config.add_member(new_member)
  latest_config := get_latest_config()
  // is the configuration compatible with the new member?
  if is_compatible(latest_config, new_member):
    apply_config(consensus_instance, new_config) // see §3.6.1
  else:
    success := no
  return success

void handle_config(app_data_ptr) will implement this algorithm.

3.6.3 Check if instance is inside event horizon

The function below features its current (before this WL) algorithm:

function is_instance_inside_event_horizon(consensus_instance) -> boolean:
  active_config := get_active_config()
  last_executed_instance := get_last_executed_instance()
  threshold := last_executed_instance + active_config.event_horizon
  return consensus_instance <= threshold

The function above may clash with executor_task's exit logic if there are reconfigurations that decrease the event horizon. The rationale of that logic is as follows. Consider a node n which is leaving the group. A new reconfiguration command c, which removes n from the group, is decided at consensus instance decided(c). When it comes the time to execute instance decided(c), the executor_task performs two actions:

Computes start(c), the instance where configuration c takes effect; and
Computes can_leave(c) = start(c) + event_horizon(c), the instance that, when decided, ensures that all instances up to start(c) were executed on a majority.

The invariant that, when can_leave(c) is decided, then start(c) - 1 has been executed by a majority of the remaining nodes in the new configuration, can be violated if we use is_instance_inside_event_horizon as defined above.

Consider the following example, where configs is the map from instance to configuration:

configs = {
    a: start=0 eh=10 nodes={n,...}
}
           instance   40: cmd=... <------------ latest executed
         / instance   41: cmd=change eh to 2 <- going to execute
event   |  instance   42: cmd=remove n
horizon |  ...
         \ instance   50: cmd=...

We execute an event horizon reconfiguration from 10 to 2:

configs = {
    a: start=0  eh=10 nodes={n,...}
    b: start=52 eh=2  nodes={n,...}
}
           instance   40: cmd=...
           instance   41: cmd=change eh to 2 <- latest executed
         / instance   42: cmd=remove n <------- going to execute
event   |  ...
horizon |  instance   50: cmd=...
         \ instance   51: cmd=...
           instance   52: cmd=...

And now we execute the removal of n:

configs = {
    a: start=0  eh=10 nodes={n,...}
    b: start=52 eh=2  nodes={n,...}
    c: start=55 eh=2  nodes={...}
}
           instance   40: cmd=...
           instance   41: cmd=change eh to 2
           instance   42: cmd=remove n <------- latest executed
         / ...
event   |  instance   50: cmd=...
horizon |  instance   51: cmd=...
         \ instance   52: cmd=...
           instance   53: cmd=...
           instance   54: cmd=...
           instance   55: cmd=... <------------ start(c)
           instance   56: cmd=...
           instance   57: cmd=... <------------ can_leave(c)

If we now roll forward until we execute instance 47:

configs = {
    a: start=0  eh=10 nodes={n,...}
    b: start=52 eh=2  nodes={n,...}
    c: start=55 eh=2  nodes={...}
}
           instance   40: cmd=...
           instance   41: cmd=change eh to 2
           instance   42: cmd=remove n
           ...
           instance   47: cmd=... <------------ latest executed
         / ...
event   |  instance   50: cmd=...
horizon |  instance   51: cmd=...
        |  instance   52: cmd=...
        |  instance   53: cmd=...
        |  instance   54: cmd=...
        |  instance   55: cmd=... <------------ start(c)
        |  instance   56: cmd=...
         \ instance   57: cmd=... <------------ can_leave(c)

It is apparent that after executing instance 47, it is possible to decide instance can_leave(c) = 57, which means we have violated the invariant that start(c) - 1 (54) was executed by a majority when can_leave(c) (57) is decided.

The core issue here is that is_instance_inside_event_horizon does not take into account pending event horizon reconfigurations, which can violate the invariant required by executor_task. We solve this issue by not extending the event horizon until it stabilizes on the reconfigured size:

function is_instance_inside_event_horizon(consensus_instance) -> boolean:
  last_executed_instance := get_last_executed_instance()
  active_event_horizon := get_active_config().event_horizon
  // are event horizon reconfigurations scheduled?
  if event_horizon_reconfigurations_pending():
    // compute common-case threshold
    possibly_unsafe_threshold := last_executed_instance + active_event_horizon
    // compute threshold taking into account first event horizon reconfiguration
    new_config := get_first_event_horizon_reconfig()
    maximum_safe_threshold := new_config.start - 1 + new_config.event_horizon
    // use the minimum of both for safety
    threshold := min(possibly_unsafe_threshold, maximum_safe_threshold)
  else:
    // no event horizon reconfigurations scheduled, common case
    threshold := last_executed_instance + active_event_horizon
  return consensus_instance <= threshold

With this new function, let us go back to our example:

configs = {
    a: start=0 eh=10 nodes={n,...}
}
           instance   40: cmd=... <------------ latest executed
         / instance   41: cmd=change eh to 2 <- going to execute
event   |  instance   42: cmd=remove n
horizon |  ...
         \ instance   50: cmd=...

We execute an event horizon reconfiguration from 10 to 2:

configs = {
    a: start=0  eh=10 nodes={n,...}
    b: start=52 eh=2  nodes={n,...}
}
           instance   40: cmd=...
           instance   41: cmd=change eh to 2 <- latest executed
         / instance   42: cmd=remove n <------- going to execute
event   |  ...
horizon |  instance   50: cmd=...
         \ instance   51: cmd=...
           instance   52: cmd=... <------------ start(b)

And now we execute the removal of n:

configs = {
    a: start=0  eh=10 nodes={n,...}
    b: start=52 eh=2  nodes={n,...}
    c: start=55 eh=2  nodes={...}
}
           instance   40: cmd=...
           instance   41: cmd=change eh to 2
           instance   42: cmd=remove n <------- latest executed
         / ...
event   |  instance   50: cmd=...
horizon |  instance   51: cmd=...
         \ instance   52: cmd=... <------------ start(b)
           instance   53: cmd=...
           instance   54: cmd=...
           instance   55: cmd=... <------------ start(c)
           instance   56: cmd=...
           instance   57: cmd=... <------------ can_leave(c)

If we now roll forward until we execute instance 47:

configs = {
    a: start=0  eh=10 nodes={n,...}
    b: start=52 eh=2  nodes={n,...}
    c: start=55 eh=2  nodes={...}
}
           instance   40: cmd=...
           instance   41: cmd=change eh to 2
           instance   42: cmd=remove n
           ...
           instance   47: cmd=... <------------ latest executed
         / ...
event   |  instance   50: cmd=...
horizon |  instance   51: cmd=...
        |  instance   52: cmd=... <------------ start(b)
         \ instance   53: cmd=...
           instance   54: cmd=...
           instance   55: cmd=... <------------ start(c)
           instance   56: cmd=...
           instance   57: cmd=... <------------ can_leave(c)

The event horizon stops extending beyond 53 (52 - 1 + 2) until configuration b (that decreases the event horizon to 2) takes effect, which happens on instance 52. For example, when we execute instance 50:

configs = {
    a: start=0  eh=10 nodes={n,...}
    b: start=52 eh=2  nodes={n,...}
    c: start=55 eh=2  nodes={...}
}
           instance   40: cmd=...
           instance   41: cmd=change eh to 2
           instance   42: cmd=remove n
           ...
           instance   47: cmd=...
           ...
           instance   50: cmd=... <------------ latest executed
event    / instance   51: cmd=...
horizon |  instance   52: cmd=... <------------ start(b)
         \ instance   53: cmd=...
           instance   54: cmd=...
           instance   55: cmd=... <------------ start(c)
           instance   56: cmd=...
           instance   57: cmd=... <------------ can_leave(c)

And then execute instace 51:

configs = {
    a: start=0  eh=10 nodes={n,...}
    b: start=52 eh=2  nodes={n,...}
    c: start=55 eh=2  nodes={...}
}
           instance   40: cmd=...
           instance   41: cmd=change eh to 2
           instance   42: cmd=remove n
           ...
           instance   47: cmd=...
           ...
           instance   50: cmd=...
           instance   51: cmd=... <------------ latest executed
event    / instance   52: cmd=... <------------ start(b)
horizon  \ instance   53: cmd=...
           instance   54: cmd=...
           instance   55: cmd=... <------------ start(c)
           instance   56: cmd=...
           instance   57: cmd=... <------------ can_leave(c)

And finally execute instance 52, when the event horizon reconfiguration takes place:

configs = {
    a: start=0  eh=10 nodes={n,...}
    b: start=52 eh=2  nodes={n,...}
    c: start=55 eh=2  nodes={...}
}
           instance   40: cmd=...
           instance   41: cmd=change eh to 2
           instance   42: cmd=remove n
           ...
           instance   47: cmd=...
           ...
           instance   50: cmd=...
           instance   51: cmd=...
           instance   52: cmd=... <------------ latest executed, start(b)
event    / instance   53: cmd=...
horizon  \ instance   54: cmd=...
           instance   55: cmd=... <------------ start(c)
           instance   56: cmd=...
           instance   57: cmd=... <------------ can_leave(c)

And if we roll forward until can_leave(c) is within the event horizon:

configs = {
    a: start=0  eh=10 nodes={n,...}
    b: start=52 eh=2  nodes={n,...}
    c: start=55 eh=2  nodes={...}
}
           instance   40: cmd=...
           instance   41: cmd=change eh to 2
           instance   42: cmd=remove n
           ...
           instance   47: cmd=...
           ...
           instance   50: cmd=...
           instance   51: cmd=...
           instance   52: cmd=... <------------ start(b)
           instance   53: cmd=...
           instance   54: cmd=...
           instance   55: cmd=... <------------ latest executed, start(c)
event    / instance   56: cmd=...
horizon  \ instance   57: cmd=... <------------ can_leave(c)

We can see that the invariant is maintained, i.e. instance start(c) - 1 (54) has been executed by a majority when can_leave(c) (57) is decided.

int too_far(synode_no) will implement this algorithm.