WL#11512: Allow XCom's event horizon value to be configurable
Executive summary
The current implementation of the Group Communication System (GCS) component is XCom. The event horizon is one of XCom's internal parameters that govern its behaviour. XCom decides several consensus instances concurrently, and the event horizon value specifies the maximum number of outstanding consensus instances at any time. Therefore, XCom's performance on any particular deployment depends on the event horizon value. This worklog's goal is to allow users to tune the GCS's performance for their particular deployments, specifically by allowing users to configure XCom's event horizon value.
1 User stories
- As a MySQL user, I want to configure XCom's event horizon value so that I am able to tune performance on my particular deployment.
Requirements
1 Definitions
New instance: a server version that implements this worklog.
Old instance: a server version that does not implement this worklog.
Compile-time default: the default event horizon value (10) defined at compile-time on old instances.
2 Functional
- It must be possible to change a group's event horizon through any member of the group if all members are running a newer instance.
- It must not be possible to change the group's event horizon value if any member is running an older instance.
- It must be possible to add an older instance to a group if the group's event horizon value is the compile-time default.
- It must not be possible to add an older instance to a group if the group's event horizon value is not the compile-time default.
- It must be possible to add a newer instance to a group regardless of the group's event horizon value.
- It must not be possible for an instance that is not in a group to change or inspect the event horizon.
- It must be possible to set the event horizon to a value inside the domain [10, 200].
3 Non-functional
- All members of a group must agree on the event horizon value for every group configuration.
- All members of a group must use the same event horizon value when processing a message.
- It should be possible to change a group's event horizon value without having to stop the group's regular message flow, or the group itself.
4 Discussion
Functional requirements 1--4 and non-functional requirements 1 and 2 are safety properties: they are required for correct behaviour.
Functional requirements 1, 5, and 6, and non-functional requirement 3, are usability requirements.
Interface specification
1 Configuration
1.1 Modifying the event horizon
We create a new UDF, group_replication_set_write_concurrency(new_event_horizon), to reconfigure the group's event horizon value to new_event_horizon at runtime.
For example:
mysql> SELECT group_replication_set_write_concurrency(42);
The UDF must fail with an appropriate error code and message if new_event_horizon is outside the domain of possible event horizon values. But since group_replication_set_write_concurrency is asynchronous, its final outcome is only reflected in the server log.
If succeeding would violate functional requirement 2 or 6, group_replication_set_write_concurrency fails, and that fact, and the reason, are logged in the server log.
Otherwise, group_replication_set_write_concurrency succeeds and that fact is logged in the server log.
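As a sketch of the domain check the UDF must perform (the constant names are illustrative, not XCom's; the bounds come from functional requirement 7):

```python
# Illustrative bounds from the requirements section; the names are ours.
EVENT_HORIZON_MIN = 10
EVENT_HORIZON_MAX = 200

def is_valid_event_horizon(value: int) -> bool:
    # The UDF must reject values outside the domain [10, 200].
    return EVENT_HORIZON_MIN <= value <= EVENT_HORIZON_MAX

print(is_valid_event_horizon(42))   # True
print(is_valid_event_horizon(201))  # False
```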
1.2 Inspecting the event horizon
We create a new UDF, group_replication_get_write_concurrency, to inspect the group's event horizon value at runtime.
For example:
mysql> SELECT group_replication_get_write_concurrency();
+-------------------------------------------+
| group_replication_get_write_concurrency() |
+-------------------------------------------+
| 42 |
+-------------------------------------------+
The command must fail with an appropriate error code and message if the event horizon cannot be inspected. The command fails if succeeding would violate functional requirement 6.
2 Conclusion
A group is bootstrapped with the default event horizon value.
A user can inspect a group's event horizon value by connecting to a member and executing the group_replication_get_write_concurrency UDF.
A user can reconfigure a group's event horizon value by connecting to a member and executing the group_replication_set_write_concurrency UDF.
Design specification
1 Preliminaries
Each consensus instance is uniquely identified by a synode_no. synode_nos are totally ordered.
Each group configuration is stored in a site_def.
Each consensus instance is associated with a group configuration, i.e. there is a mapping from synode_no to site_def.
Consider a group reconfiguration r decided in consensus instance c.
Currently, the reconfiguration r takes effect starting from consensus instance c + n + 1, where n is the event horizon value.
For example, consider a group with 3 members m1, m2, and m3.
If the reconfiguration add_member(m4) is decided in consensus instance 42 and the event horizon value is 10, then m4 becomes part of the group from consensus instance 42 + 10 + 1 = 53 onwards.
Besides being used in the calculation of when a reconfiguration takes place,
the event horizon is also used as a "high water mark" on the number of
outstanding consensus instances.
Consider that consensus instance c is the latest whose command has been executed.
Currently, the group can decide, but not execute, up to consensus instance c + n before consensus instance c + 1 is executed, where n is the event horizon value.
For example, if consensus instance 41 is the latest executed instance and the event horizon value is 10, the group can decide up to instance 41 + 10 = 51 before instance 42 is executed.
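The two rules above are simple arithmetic; a throwaway sketch (not XCom code) of both:

```python
def reconfig_takes_effect_at(decided_instance: int, event_horizon: int) -> int:
    # A reconfiguration decided at instance c takes effect at c + n + 1.
    return decided_instance + event_horizon + 1

def highest_decidable_instance(last_executed: int, event_horizon: int) -> int:
    # With c the latest executed instance, the group may decide up to c + n.
    return last_executed + event_horizon

print(reconfig_takes_effect_at(42, 10))    # 53, as in the add_member(m4) example
print(highest_decidable_instance(41, 10))  # 51
```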
2 Objective
Currently the event horizon value is static, i.e. it is defined at compile-time. Our objective is to make the event horizon value dynamic, i.e. changeable at runtime.
3 Design
Make the event horizon a group parameter.
In practice this means that each site_def must now contain an event horizon value.
It also means that changing the event horizon value requires a group reconfiguration, equivalent to, for example, adding a member to the group.
Before this worklog, we had the relation site_def(synode_no), where each consensus instance (synode_no) had an associated group configuration (site_def). We also had a static event horizon value, event_horizon.
After this worklog, we change the event horizon from a static value to a relation event_horizon(site_def), where each group configuration (site_def) has an associated event horizon value (event_horizon).
In the normal case, when a group reconfiguration is decided in consensus instance c, the reconfiguration takes effect starting from consensus instance c + event_horizon(site_def(c)) + 1, i.e. after event-horizon instances according to the group configuration of instance c.
Similarly, in the normal case, if consensus instance c is the latest whose decision has been executed, then the group can decide, but not execute, up to consensus instance c + event_horizon(site_def(c)) before consensus instance c + 1 is executed.
3.1 Join
IIUC, a node receives the group configuration (via gcs_snapshot) when it joins the group.
The joining node must receive the event horizon as part of the group configuration.
Here is a gist of the flow:
- Node asks seed to join group.
- Seed, on behalf of node, proposes to add node to group.
- Group agrees when node becomes part of group.
- Node will eventually receive "are you alive?" messages when it becomes part of the group.
- Node replies to "are you alive?" with "yes, but I need the configuration".
- Node receives the configuration.
- Node is good to go, and will in the meantime receive missed decisions.
3.2 Upgrade/downgrade
Older instances of the server, without this worklog implemented, have a static event horizon value defined at compile-time (10). Therefore, we need to ensure the invariant that a group with at least an older instance as a member must use an event horizon value of 10.
3.2.1 Join
When a node using an older instance of the server wants to join a group that is using an event horizon value different from 10, we need to ensure our invariant remains true.
One way to maintain correctness is to prevent the older instance from joining the group. This requires an explicit reconfiguration of the event horizon value before an older instance can be added to a group, but it does not change settings from under the user. We will implement this.
3.2.2 Event horizon reconfiguration
If at least one of the group's members is using an older instance of the server then reconfigurations of the event horizon must be denied. The denial must be reflected in the log.
3.2.3 Concurrency between join and event horizon reconfiguration
There is also a possibility of breaking our invariant when there is a join
concurrent with a reconfiguration of the event horizon value.
We say a join and an event horizon reconfiguration are concurrent if one is decided in consensus instance i, and thus takes effect at instance k, and the other is decided in consensus instance j such that i < j < k.
This means that reconfiguration j is decided on a state where the effects of reconfiguration i have been scheduled but not yet applied.
For example, assume the event horizon value is 10.
Consider the join of member m, which uses an older instance of the server.
The join is decided at consensus instance i and thus takes effect from instance i + 10 + 1 = i + 11 onwards.
Now consider that someone issues a reconfiguration of the event horizon value to 20, which is decided at instance i + 1 and takes effect from instance i + 1 + 10 + 1 = i + 12 onwards.
The invariant is broken from instance i + 12 onwards, because we have member m using an older instance of the server while the group's event horizon value is not 10.
A similar issue can occur if the event horizon reconfiguration is decided before the join.
A possible solution is to validate reconfigurations at execution time.
Because reconfigurations are state machine commands, they are executed sequentially in the same order on all members.
Before executing a reconfiguration r, we can validate it against the active configuration and all pending configurations.
The active and pending configurations at this point are the same on all members because reconfigurations are state machine commands.
If reconfiguration r is incompatible with the active or pending configurations, then r must not be applied.
(E.g. because we are reconfiguring the event horizon and there are older instances in the group, or because we are adding an older instance to a group with a custom event horizon value.)
We will implement this.
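A minimal sketch of this execution-time validation, under assumed data types (Config, has_old_members, and the function names are illustrative; XCom's real structures are site_defs):

```python
from dataclasses import dataclass

DEFAULT_EVENT_HORIZON = 10  # the compile-time default of old instances

@dataclass
class Config:
    event_horizon: int
    has_old_members: bool

def may_reconfigure_event_horizon(new_value: int, configs: list) -> bool:
    # Deny the change if the active or any pending configuration contains a
    # member running an older instance, unless we keep the default value.
    if new_value == DEFAULT_EVENT_HORIZON:
        return True
    return not any(c.has_old_members for c in configs)

def may_add_old_member(configs: list) -> bool:
    # Deny the join if the active or any pending configuration uses a
    # non-default event horizon value.
    return all(c.event_horizon == DEFAULT_EVENT_HORIZON for c in configs)

print(may_reconfigure_event_horizon(42, [Config(10, False)]))  # True
print(may_add_old_member([Config(42, False)]))                 # False
```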
3.3 Relationship with the pax_machine cache
Currently XCom's code assumes that #cache > event_horizon * #members, i.e. there is at least one code path that assumes that it is always possible to obtain a free cache entry.
Custom event horizon values, particularly ones much greater than the previous compile-time value (10), can possibly break this invariant.
We must maintain this invariant.
We identify at least two possible solutions:
1. This proposal already validates reconfigurations to maintain correctness (§3.2.1). We can additionally validate that #cache > event_horizon * #members remains true if the reconfiguration takes place.
2. Dynamically grow the pax_machine cache so that #cache > event_horizon * #members remains true when the reconfiguration takes place.
We will implement (1).
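The check in (1) could be sketched as follows (the cache size here is purely illustrative; the real value is internal to XCom):

```python
def keeps_cache_invariant(event_horizon: int, n_members: int,
                          cache_size: int) -> bool:
    # Invariant from the text: #cache > event_horizon * #members.
    return cache_size > event_horizon * n_members

# With an illustrative cache of 50000 slots and 9 members, an event horizon
# of 200 keeps the invariant; a cache of only 1000 slots would not.
print(keeps_cache_invariant(200, 9, 50000))  # True
print(keeps_cache_invariant(200, 9, 1000))   # False
```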
3.4 Logging
When a reconfiguration is applied, this fact must be logged.
When a reconfiguration is not applied, due to the reasons above (§3.2), this fact and the reason must be logged.
3.5 Concurrency between event horizon reconfigurations
It is possible for two, or more, reconfigurations of the event horizon to occur
concurrently.
We say two reconfigurations are concurrent if one is decided in consensus instance i, and thus takes effect at instance k, and the other is decided in consensus instance j such that i < j < k.
This means that reconfiguration j is decided on a state where the effects of reconfiguration i have been scheduled but not yet applied.
Consider the following example, where configs is the map from instance to configuration:
configs = {
a: start=0 eh=10
}
instance 40: cmd=... <--------------- latest executed
instance 41: cmd=change eh to 1000 <- going to execute
instance 42: cmd=change eh to 1
...
instance 50: cmd=...
When we execute the configuration at instance 41, we search for the latest pending configuration that modifies the event horizon.
In this case there are none, so we install the new configuration at this_instance + event_horizon(active_config) + 1, i.e. 41 + 10 + 1 = 52.
This is what we already did before this WL.
From this point on we know that the system may start deciding up to instance 52 + 1000 = 1052:
configs = {
a: start=0 eh=10
b: start=52 eh=1000
}
instance 40: cmd=...
instance 41: cmd=change eh to 1000 <- latest executed
instance 42: cmd=change eh to 1
...
instance 50: cmd=...
instance 51: cmd=...
instance 52: cmd=... <--------------- start(b)
...
However, let us consider the case where there is a concurrent reconfiguration:
configs = {
a: start=0 eh=10
b: start=52 eh=1000
}
instance 40: cmd=...
instance 41: cmd=change eh to 1000 <- latest executed
instance 42: cmd=change eh to 1 <---- going to execute
...
instance 50: cmd=...
instance 51: cmd=...
instance 52: cmd=... <--------------- start(b)
...
Again, we search for the latest pending configuration that modifies the event horizon, which is configuration b.
(Configuration b is pending because this_instance < start(b), i.e. 42 < 52.)
We apply the new configuration c after b.
This means that the new configuration c, with an event horizon of 1, will take effect at instance start(b) + event_horizon(b) + 1, i.e. 52 + 1000 + 1 = 1053.
From this point on we know that the system may start deciding up to instance 1053 + 1 = 1054:
configs = {
a: start=0 eh=10
b: start=52 eh=1000
c: start=1053 eh=1
}
instance 40: cmd=...
instance 41: cmd=change eh to 1000
instance 42: cmd=change eh to 1 <---- latest executed
...
instance 50: cmd=...
instance 51: cmd=...
instance 52: cmd=... <--------------- start(b)
...
instance 1053: cmd=... <--------------- start(c)
Using the same rationale for multiple reconfigurations:
configs = {
a: start=0 eh=10
b: start=52 eh=1000
c: start=1053 eh=1
}
instance 40: cmd=...
instance 41: cmd=change eh to 1000
instance 42: cmd=change eh to 1 <---- latest executed
instance 43: cmd=change eh to 100 <-- going to execute
...
instance 50: cmd=...
instance 51: cmd=...
instance 52: cmd=... <--------------- start(b)
...
instance 1053: cmd=... <--------------- start(c)
The latest pending event horizon reconfiguration is c, so we apply the new configuration at start(c) + event_horizon(c) + 1, i.e. 1053 + 1 + 1 = 1055:
configs = {
a: start=0 eh=10
b: start=52 eh=1000
c: start=1053 eh=1
d: start=1055 eh=100
}
instance 40: cmd=...
instance 41: cmd=change eh to 1000
instance 42: cmd=change eh to 1
instance 43: cmd=change eh to 100 <-- latest executed
...
instance 50: cmd=...
instance 51: cmd=...
instance 52: cmd=... <--------------- start(b)
...
instance 1053: cmd=... <--------------- start(c)
...
instance 1055: cmd=... <--------------- start(d)
Succinctly, we linearize concurrent reconfigurations if one of them reconfigures the event horizon.
Hence, for any given concurrent reconfiguration commands x and y s.t. x reconfigures the event horizon and instance(x) < instance(y), there are always event_horizon(x) instances between start(x) and start(y).
Looking at the same diagrams as above:
configs = {
a: start=0 eh=10
b: start=52 eh=1000
c: start=1053 eh=1
d: start=1055 eh=100
}
instance 40: cmd=...
instance 41: cmd=change eh to 1000
instance 42: cmd=change eh to 1
instance 43: cmd=change eh to 100 <-- latest executed
...
instance 50: cmd=...
instance 51: cmd=...
instance 52: cmd=... <--------------- start(b)
[there are event_horizon(b) instances in this window]
instance 1053: cmd=... <--------------- start(c)
[there are event_horizon(c) instances in this window]
instance 1055: cmd=... <--------------- start(d)
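We can check this property on the configurations from the diagram above (a throwaway script, not XCom code):

```python
# start and eh values taken from the configs map in the diagram above.
configs = [
    {"name": "b", "start": 52,   "eh": 1000},
    {"name": "c", "start": 1053, "eh": 1},
    {"name": "d", "start": 1055, "eh": 100},
]

# Between start(x) and start(y) there are exactly event_horizon(x) instances.
for x, y in zip(configs, configs[1:]):
    assert y["start"] - x["start"] - 1 == x["eh"], (x["name"], y["name"])
print("linearization property holds")
```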
3.6 Algorithms
3.6.1 Reconfigure event horizon
procedure apply_config(consensus_instance, new_config):
    // are event horizon reconfigurations scheduled?
    if event_horizon_reconfigurations_pending():
        // yes, apply this reconfiguration after the latest
        after_config := get_latest_event_horizon_reconfig()
        new_config.start(after_config.start + after_config.event_horizon + 1)
        install_config(new_config)
    else:
        // no event horizon reconfigurations scheduled, apply normally
        active_config := get_active_config()
        new_config.start(consensus_instance + active_config.event_horizon + 1)
        install_config(new_config)

function reconfig_event_horizon(consensus_instance, event_horizon) -> boolean:
    success := yes
    new_config := create_new_config()
    new_config.event_horizon(event_horizon)
    latest_config := get_latest_config()
    // can we reconfigure?
    if no_old_servers(latest_config) or is_default(event_horizon):
        apply_config(consensus_instance, new_config)
    else:
        success := no
    return success
void handle_config(app_data_ptr) will implement this algorithm.
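A runnable Python sketch of the scheduling rule in apply_config, replaying the example from §3.5 (plain dicts stand in for site_defs; every configuration here is an event horizon reconfiguration, so the latest pending event horizon reconfiguration is simply the pending configuration with the greatest start):

```python
def schedule_config(configs, consensus_instance, new_event_horizon):
    # Pending configurations are those taking effect after the instance
    # being executed.
    pending = [c for c in configs if c["start"] > consensus_instance]
    if pending:
        # Apply this reconfiguration after the latest pending one.
        after = max(pending, key=lambda c: c["start"])
        start = after["start"] + after["event_horizon"] + 1
    else:
        # Common case: schedule after the active configuration's event horizon.
        active = max((c for c in configs if c["start"] <= consensus_instance),
                     key=lambda c: c["start"])
        start = consensus_instance + active["event_horizon"] + 1
    configs.append({"start": start, "event_horizon": new_event_horizon})
    return start

configs = [{"start": 0, "event_horizon": 10}]
print(schedule_config(configs, 41, 1000))  # 52, as in the §3.5 example
print(schedule_config(configs, 42, 1))     # 1053
print(schedule_config(configs, 43, 100))   # 1055
```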
3.6.2 Add new member
function reconfig_new_member(consensus_instance, new_member) -> boolean:
    success := yes
    new_config := create_new_config()
    new_config.add_member(new_member)
    latest_config := get_latest_config()
    // is the configuration compatible with the new member?
    if is_compatible(latest_config, new_member):
        apply_config(consensus_instance, new_config) // see §3.6.1
    else:
        success := no
    return success
void handle_config(app_data_ptr) will implement this algorithm.
3.6.3 Check if instance is inside event horizon
The function below shows its current (before this WL) algorithm:

function is_instance_inside_event_horizon(consensus_instance) -> boolean:
    active_config := get_active_config()
    last_executed_instance := get_last_executed_instance()
    threshold := last_executed_instance + active_config.event_horizon
    return consensus_instance <= threshold
The function above may clash with executor_task's exit logic if there are reconfigurations that decrease the event horizon.
The rationale of that logic is as follows.
Consider a node n which is leaving the group.
A new reconfiguration command c, which removes n from the group, is decided at consensus instance decided(c).
When it comes time to execute instance decided(c), the executor_task performs two actions:
- Computes start(c), the instance where configuration c takes effect; and
- Computes can_leave(c) = start(c) + event_horizon(c), the instance that, when decided, ensures that all instances up to start(c) were executed on a majority.
The invariant that, when can_leave(c) is decided, then start(c) - 1 has been executed by a majority of the remaining nodes in the new configuration, can be violated if we use is_instance_inside_event_horizon as defined above.
Consider the following example, where configs is the map from instance to configuration:
configs = {
a: start=0 eh=10 nodes={n,...}
}
instance 40: cmd=... <------------ latest executed
/ instance 41: cmd=change eh to 2 <- going to execute
event | instance 42: cmd=remove n
horizon | ...
\ instance 50: cmd=...
We execute an event horizon reconfiguration from 10 to 2:
configs = {
a: start=0 eh=10 nodes={n,...}
b: start=52 eh=2 nodes={n,...}
}
instance 40: cmd=...
instance 41: cmd=change eh to 2 <- latest executed
/ instance 42: cmd=remove n <------- going to execute
event | ...
horizon | instance 50: cmd=...
\ instance 51: cmd=...
instance 52: cmd=...
And now we execute the removal of n:
configs = {
a: start=0 eh=10 nodes={n,...}
b: start=52 eh=2 nodes={n,...}
c: start=55 eh=2 nodes={...}
}
instance 40: cmd=...
instance 41: cmd=change eh to 2
instance 42: cmd=remove n <------- latest executed
/ ...
event | instance 50: cmd=...
horizon | instance 51: cmd=...
\ instance 52: cmd=...
instance 53: cmd=...
instance 54: cmd=...
instance 55: cmd=... <------------ start(c)
instance 56: cmd=...
instance 57: cmd=... <------------ can_leave(c)
If we now roll forward until we execute instance 47:
configs = {
a: start=0 eh=10 nodes={n,...}
b: start=52 eh=2 nodes={n,...}
c: start=55 eh=2 nodes={...}
}
instance 40: cmd=...
instance 41: cmd=change eh to 2
instance 42: cmd=remove n
...
instance 47: cmd=... <------------ latest executed
/ ...
event | instance 50: cmd=...
horizon | instance 51: cmd=...
| instance 52: cmd=...
| instance 53: cmd=...
| instance 54: cmd=...
| instance 55: cmd=... <------------ start(c)
| instance 56: cmd=...
\ instance 57: cmd=... <------------ can_leave(c)
It is apparent that after executing instance 47, it is possible to decide instance can_leave(c) = 57, which means we have violated the invariant that start(c) - 1 (54) was executed by a majority when can_leave(c) (57) is decided.
The core issue here is that is_instance_inside_event_horizon does not take into account pending event horizon reconfigurations, which can violate the invariant required by executor_task.
We solve this issue by not extending the event horizon until it stabilizes on
the reconfigured size:
function is_instance_inside_event_horizon(consensus_instance) -> boolean:
    last_executed_instance := get_last_executed_instance()
    active_event_horizon := get_active_config().event_horizon
    // are event horizon reconfigurations scheduled?
    if event_horizon_reconfigurations_pending():
        // compute common-case threshold
        possibly_unsafe_threshold := last_executed_instance + active_event_horizon
        // compute threshold taking into account first event horizon reconfiguration
        new_config := get_first_event_horizon_reconfig()
        maximum_safe_threshold := new_config.start - 1 + new_config.event_horizon
        // use the minimum of both for safety
        threshold := min(possibly_unsafe_threshold, maximum_safe_threshold)
    else:
        // no event horizon reconfigurations scheduled, common case
        threshold := last_executed_instance + active_event_horizon
    return consensus_instance <= threshold
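A runnable sketch of this new check, replaying the §3.6.3 example (the tuple argument stands in for the first pending event horizon reconfiguration; XCom works over site_defs):

```python
def is_instance_inside_event_horizon(instance, last_executed,
                                     active_event_horizon,
                                     first_pending_eh_reconfig=None):
    # Common-case threshold.
    threshold = last_executed + active_event_horizon
    if first_pending_eh_reconfig is not None:
        # Cap the threshold at start - 1 + event_horizon of the first
        # pending event horizon reconfiguration, for safety.
        start, new_eh = first_pending_eh_reconfig
        threshold = min(threshold, start - 1 + new_eh)
    return instance <= threshold

# Latest executed is 47, active eh is 10, and configuration b (start=52, eh=2)
# is pending: the threshold is min(47 + 10, 52 - 1 + 2) = 53.
print(is_instance_inside_event_horizon(53, 47, 10, (52, 2)))  # True
print(is_instance_inside_event_horizon(57, 47, 10, (52, 2)))  # False
```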
With this new function, let us go back to our example:
configs = {
a: start=0 eh=10 nodes={n,...}
}
instance 40: cmd=... <------------ latest executed
/ instance 41: cmd=change eh to 2 <- going to execute
event | instance 42: cmd=remove n
horizon | ...
\ instance 50: cmd=...
We execute an event horizon reconfiguration from 10 to 2:
configs = {
a: start=0 eh=10 nodes={n,...}
b: start=52 eh=2 nodes={n,...}
}
instance 40: cmd=...
instance 41: cmd=change eh to 2 <- latest executed
/ instance 42: cmd=remove n <------- going to execute
event | ...
horizon | instance 50: cmd=...
\ instance 51: cmd=...
instance 52: cmd=... <------------ start(b)
And now we execute the removal of n:
configs = {
a: start=0 eh=10 nodes={n,...}
b: start=52 eh=2 nodes={n,...}
c: start=55 eh=2 nodes={...}
}
instance 40: cmd=...
instance 41: cmd=change eh to 2
instance 42: cmd=remove n <------- latest executed
/ ...
event | instance 50: cmd=...
horizon | instance 51: cmd=...
\ instance 52: cmd=... <------------ start(b)
instance 53: cmd=...
instance 54: cmd=...
instance 55: cmd=... <------------ start(c)
instance 56: cmd=...
instance 57: cmd=... <------------ can_leave(c)
If we now roll forward until we execute instance 47:
configs = {
a: start=0 eh=10 nodes={n,...}
b: start=52 eh=2 nodes={n,...}
c: start=55 eh=2 nodes={...}
}
instance 40: cmd=...
instance 41: cmd=change eh to 2
instance 42: cmd=remove n
...
instance 47: cmd=... <------------ latest executed
/ ...
event | instance 50: cmd=...
horizon | instance 51: cmd=...
| instance 52: cmd=... <------------ start(b)
\ instance 53: cmd=...
instance 54: cmd=...
instance 55: cmd=... <------------ start(c)
instance 56: cmd=...
instance 57: cmd=... <------------ can_leave(c)
The event horizon stops extending beyond 53 (52 - 1 + 2) until configuration b (which decreases the event horizon to 2) takes effect, which happens at instance 52.
For example, when we execute instance 50:
configs = {
a: start=0 eh=10 nodes={n,...}
b: start=52 eh=2 nodes={n,...}
c: start=55 eh=2 nodes={...}
}
instance 40: cmd=...
instance 41: cmd=change eh to 2
instance 42: cmd=remove n
...
instance 47: cmd=...
...
instance 50: cmd=... <------------ latest executed
event / instance 51: cmd=...
horizon | instance 52: cmd=... <------------ start(b)
\ instance 53: cmd=...
instance 54: cmd=...
instance 55: cmd=... <------------ start(c)
instance 56: cmd=...
instance 57: cmd=... <------------ can_leave(c)
And then execute instance 51:
configs = {
a: start=0 eh=10 nodes={n,...}
b: start=52 eh=2 nodes={n,...}
c: start=55 eh=2 nodes={...}
}
instance 40: cmd=...
instance 41: cmd=change eh to 2
instance 42: cmd=remove n
...
instance 47: cmd=...
...
instance 50: cmd=...
instance 51: cmd=... <------------ latest executed
event / instance 52: cmd=... <------------ start(b)
horizon \ instance 53: cmd=...
instance 54: cmd=...
instance 55: cmd=... <------------ start(c)
instance 56: cmd=...
instance 57: cmd=... <------------ can_leave(c)
And finally execute instance 52, when the event horizon reconfiguration takes place:
configs = {
a: start=0 eh=10 nodes={n,...}
b: start=52 eh=2 nodes={n,...}
c: start=55 eh=2 nodes={...}
}
instance 40: cmd=...
instance 41: cmd=change eh to 2
instance 42: cmd=remove n
...
instance 47: cmd=...
...
instance 50: cmd=...
instance 51: cmd=...
instance 52: cmd=... <------------ latest executed, start(b)
event / instance 53: cmd=...
horizon \ instance 54: cmd=...
instance 55: cmd=... <------------ start(c)
instance 56: cmd=...
instance 57: cmd=... <------------ can_leave(c)
And if we roll forward until can_leave(c) is within the event horizon:
configs = {
a: start=0 eh=10 nodes={n,...}
b: start=52 eh=2 nodes={n,...}
c: start=55 eh=2 nodes={...}
}
instance 40: cmd=...
instance 41: cmd=change eh to 2
instance 42: cmd=remove n
...
instance 47: cmd=...
...
instance 50: cmd=...
instance 51: cmd=...
instance 52: cmd=... <------------ start(b)
instance 53: cmd=...
instance 54: cmd=...
instance 55: cmd=... <------------ latest executed, start(c)
event / instance 56: cmd=...
horizon \ instance 57: cmd=... <------------ can_leave(c)
We can see that the invariant is maintained, i.e. instance start(c) - 1 (54) has been executed by a majority when can_leave(c) (57) is decided.
int too_far(synode_no) will implement this algorithm.