WL#11615: MYSQL GCS: Improve XCom Cache Management

Affects: Server-8.0   —   Status: Complete   —   Priority: Medium

Introduction

=============

XCom keeps the messages exchanged as part of the consensus protocol in a cache. The XCom cache is the single most important consumer of memory in the GCS system as a whole and one of its most essential components. Among other things, the cache is used for node recovery: if an XCom node becomes unreachable for some time without being expelled, when it comes back to the group it will recover the missed messages from the caches of the other nodes.

Currently, the cache has a fixed number of entries and a fixed 1GB size limit. This can be problematic because, since WL#11570, users can define arbitrary expel timeouts, which opens the possibility for the caches of alive nodes to become exhausted while another node is unreachable but not yet expelled. If that occurs, when the unreachable node comes back it will not be able to recover and, as a consequence, will be expelled, even though it respected the expel timeout.

Having a fixed number of entries is also problematic in high load systems with small transactions, since the number of available cache entries can quickly be depleted. This can lead XCom to become too slow or even block.

User Stories

=============

  • As a MySQL DBA, I want to be able to configure the maximum size of the XCom cache, so that I am able to choose the ideal cache size, taking into account my system load and network requirements.

  • As a MySQL DBA, I want to set the cache size limit to a higher value than 1GB, so that I can maximize the chances that the messages missed by an unreachable node are still in the other nodes' caches when it becomes reachable again.

  • As a MySQL DBA, I want to be informed when messages needed by an unreachable node have been evicted from the cache, so that I can set a more appropriate value for the maximum size of the cache.

Goal

=====

This worklog will remove the static nature of the XCom cache by allowing users to set the size limit of the XCom cache. To that end, we will expose a new Group Replication (GR) option named group_replication_message_cache_size. In addition, the cache will no longer be bound by a fixed number of entries; instead, it will grow dynamically, so long as the size limit is respected.

To help guide the configuration of the new option, we will print a warning informing the user when a message that has been missed by an unreachable node is evicted from the cache of one of the alive nodes. Such an event indicates that the current size limit of that node's cache is not high enough to handle similar failures. The user can then act by tweaking the cache size appropriately.

Scope

======

This worklog is part of the effort to make GR more suitable for WAN operation and more maintenance-friendly. It will give users more control over the configuration of GR. It will also have the side effect of making users aware of the XCom cache, so that they consider it when doing capacity planning.

This worklog will not altogether prevent unreachable nodes from being expelled. In fact, this situation would only be completely preventable if the cache had infinite memory. It is, thus, the responsibility of the user to set the cache size limit to a value that is large enough to hold the amount of messages missed by a node that will be unreachable for the time specified by the user.

While the worklog will allow the user to have more control over the recovery process, no changes will be made to the recovery process itself.

Functional Requirements

==========================

FR1. It must be possible to configure the maximum size of the XCom cache via a new server option named group_replication_message_cache_size.

FR2. It must be possible to change the value of the cache size limit at runtime.

FR3. It must be possible to cache any arbitrary number of messages as long as the cache size limit is not exceeded.

FR4. Every time the current size of the cache exceeds the cache size limit, XCom must remove entries in LRU order until the current size is below the cache size limit.

FR5. It must be possible for different nodes in the group to have different cache size limits.

FR6. It must not be possible to set the cache size limit to a value lower than 1073741824 bytes (1GB).

FR7. When not explicitly set by the user the cache size limit shall default to 1073741824 bytes (1GB).

FR8. If a user tries to set an invalid value for the cache size, an error message shall be printed to the MySQL shell.

FR9. The first time that a message that has been missed by an unreachable node is evicted from the cache of another node, a warning message shall be printed to the error log.

Non-functional requirements

==============================

NFR1. The code modifications resulting from this worklog must not degrade the performance of GCS and XCom.

NFR2. There must not be a throughput degradation of more than 10% while the cache is being resized after the user changes the cache size limit.

NFR3. The code modifications resulting from this WL must not break compatibility with older GR versions.

Background

=============

Currently, the limits of the XCom cache are static and hardcoded, both regarding the maximum number of messages (50000) and the maximum size of the cache (1GB); when any of these limits is reached, XCom will remove (in LRU order) one or more messages to make room for new entries. Having these limits posed no problems to the recovery process until now because nodes used to be expelled automatically after being unreachable for 5 seconds; this is too short an interval to allow the system to generate enough load to evict any message missed by the unreachable node. Thus, nodes would always be able to recover missed messages and get back to the group.

This has changed with WL#11570, which introduced means for users to freely define the timeout after which nodes must be expelled. Users can now set longer timeouts, during which it is possible that missed messages are evicted from the caches of the alive nodes. The result is that unreachable nodes that are able to reconnect to the group before the timeout expires will not be able to recover and will be forcibly expelled.

Even when the system has no unreachable node, having a static cache prevents XCom from operating correctly when the system is under high load. For example, if the system generates a high number of small messages, the cache can become depleted, causing some cached messages that have not yet been decided by the consensus protocol to be removed from the cache. When a node receives a request for such a message, it will consider the node that submitted the request to be faulty and will forcibly expel it.

Summary of Changes

======================

To address the issues described above, we will expose the XCom cache to users and allow them to set the cache size limit to a value that makes sense given the configured expel timeout and the execution environment of the server. In addition, we will remove the static nature of the cache by allowing it to grow in terms of the number of entries, effectively removing the current limit of 50000 entries. Each increase will consist in creating an additional 50000 free entries, which corresponds to approximately 10MB of metadata.

In addition to the changes to the behavior of the cache, we will add a new warning message that informs the user that a message that is needed for recovery has been evicted. We detail this below in section Observability.

Behavior notes

================

By definition, the cache will continue to grow until the size limit is reached or surpassed, at which point XCom will continuously remove some of the older entries until the current size is below the limit. The current size of the cache includes the size of the cached data as well as the size of the metadata used by the cache. It should be noted, however, that the entry removal process will be best effort (as it is currently), because the cache does not remove entries that hold paxos messages that have been decided but not yet delivered to the application. In an extreme case, it is possible that all the cached messages are in this state (e.g., in a very high load system with large messages and a high event horizon), in which case no message can be removed; this is, however, an unlikely event.

For the same reason discussed in the paragraph above, when the user reduces the cache size limit, the reduction in cache size may not be immediate. For example, consider that the user reduces the cache size limit from 2G to 1G; if, say, 1.5G of the original 2G correspond to undelivered decided messages, the cache will not be able to garbage collect them. Hence, until those messages are delivered, the cache size will remain above the new limit set by the user.

Also of note is that we will increase the number of entries whenever necessary, even when that increase causes the cache size limit to be surpassed simply due to the additional metadata. We will do this to make sure that there are free entries available for use at all times. After the number of entries is increased, XCom will simply proceed by removing the necessary number of cached entries to restore the current cache size to below the limit. For example, consider that the cache size limit is 1G and the current cache size is 999M. If at this point the cache has 49999 used entries it will create 50000 new free entries. This will result in adding 10M of metadata to the cache size, which will cause it to go over the 1G limit. As a result of this, XCom will remove the necessary number of cached messages in order to release the excess 10M.
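The growth-then-evict behavior described above can be made concrete with a minimal sketch in C. All names (cache_state, maybe_grow, enforce_limit, evict_fixed) and the per-entry metadata constant are invented for illustration; this is not the actual XCom code.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative constants: each growth step adds 50000 entries, and
   50000 entries correspond to roughly 10MB of metadata (~200B each). */
#define ENTRY_BATCH 50000
#define ENTRY_METADATA_BYTES 200

struct cache_state {
  unsigned long long size_limit;   /* user-configured limit, in bytes   */
  unsigned long long current_size; /* cached data + metadata, in bytes  */
  size_t length;                   /* allocated entries                 */
  size_t occupied;                 /* entries currently holding data    */
};

/* Grow by one batch whenever all entries are in use, even if the extra
   metadata pushes the cache past its size limit. Returns 1 if grown. */
int maybe_grow(struct cache_state *c) {
  if (c->occupied < c->length) return 0;
  c->length += ENTRY_BATCH;
  c->current_size += (unsigned long long)ENTRY_BATCH * ENTRY_METADATA_BYTES;
  return 1;
}

/* Best-effort eviction in LRU order until the size is below the limit.
   evict_one() stands in for clearing the oldest evictable entry; it
   returns the bytes released, or 0 if nothing can be evicted (e.g. all
   remaining entries hold decided-but-undelivered messages). */
void enforce_limit(struct cache_state *c,
                   unsigned long long (*evict_one)(struct cache_state *)) {
  while (c->current_size > c->size_limit) {
    unsigned long long freed = evict_one(c);
    if (freed == 0) break; /* best effort: give up if nothing is evictable */
    c->current_size -= freed;
  }
}

/* Example evictor for illustration: frees a fixed 1MB per cleared entry. */
unsigned long long evict_fixed(struct cache_state *c) {
  if (c->occupied == 0) return 0;
  c->occupied -= 1;
  return 1048576ULL;
}
```

With a 1GB limit and a cache 5MB over it, enforce_limit clears five 1MB entries; maybe_grow adds a 50000-entry batch (10MB of metadata) as soon as the last free entry is used, which may then trigger exactly this eviction pass.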

While it will not be mandatory that all nodes in a group have the same cache size limit, users must be aware that when an unreachable node tries to recover a missed message, it will contact only one of its peers (chosen at random). If the node fails to recover (because the peer has already garbage collected the missed message) it will be expelled. For this reason, we recommend setting the same cache size on every node.

User Interface

===============

The new option introduced by the worklog shall have the following interface:

  • Name: group_replication_message_cache_size
  • Unit: Bytes
  • Scope: Global
  • Type: Integer
  • Minimum Value: 1073741824 (1GB)
  • Maximum Value (32-bit platforms): 4294967295 (4GB)
  • Maximum Value (64-bit platforms): 18446744073709551615 (16EiB)
  • Default: 1073741824 (1GB)
  • Dynamic: Yes
  • Credentials: SUPER/SYSTEM_VARIABLES_ADMIN
  • Persistence: PERSIST
  • Description: Maximum size of the message cache.

The size limit will be specified in bytes, in order to be consistent with other similar variables of MySQL (such as innodb_buffer_pool_size).

Observability

==============

If the user tries to set an invalid value for the variable an error message will be shown in the shell and the previous limit will remain unchanged. The error message presented depends on the value input by the user:

If the user inputs 1) an invalid value (e.g., a non-numeric value), 2) a negative value, or 3) a value higher than 4GB on 32-bit machines or 16EiB on 64-bit machines, the following message (with error code 1232) will be printed:

ERROR 1232 (42000): Incorrect argument type to variable 'group_replication_message_cache_size'

If the user inputs a non-negative value lower than the minimum value allowed for the variable, the following message (with error code 1231) will be printed:

ERROR 1231 (42000): The value <input value> is not within the range of accepted values for the
option group_replication_message_cache_size. The value must be between 1073741824 and
18446744073709551615 inclusive.

If the value defined by the user in the server configuration is not valid, an error message is printed to the error log.

A previous worklog (WL#9855) has already added memory instrumentation of the XCom cache. With this, users can query Performance Schema to observe the memory usage of the cache. To view this information, run the following query from the MySQL client:

SELECT * FROM performance_schema.memory_summary_global_by_event_name
  WHERE EVENT_NAME LIKE 'memory/group_rpl/GCS_XCom::xcom_cache';

For a detailed description of the output of this command, please refer to the Interface Specification of WL#9855. We highlight the following two columns, which are of particular relevance for this worklog:

  • CURRENT_COUNT_USED: This column shows the current number of cached entries.
  • CURRENT_NUMBER_OF_BYTES_USED: This column shows the current size of the cache, which corresponds to the aggregated size of all the messages cached plus the metadata.

To aid the user in setting an appropriate value for the cache size limit, GCS will print a warning message as soon as the cache of the local node evicts a message that has been missed by an unreachable node. The message will have the following format:

"Messages that are needed to recover node *x* have been evicted from the message cache. Consider
resizing the maximum size of the cache by setting group_replication_message_cache_size."

Note that the warning will be printed at the alive nodes. Furthermore, it will only be printed once (the first time that a missed message is evicted) for each unreachable node; otherwise, it would continue to be printed until the node became reachable again, since XCom continuously evicts messages in LRU order.

Security

=========

This worklog has no direct influence on the security of the system, as it will simply consist in turning the XCom cache from static to dynamic. However, since we allow users to set an arbitrarily high value for the cache size limit, users must be careful when setting it, in order to not run out of memory. MySQL contains several other caches and object pools (e.g., the Innodb buffer pool) whose size must be taken into account when planning the allocation of resources to the server.

Upgrade/Downgrade

======================

This worklog does not affect the upgrade/downgrade path. Setting the cache size is a local-only operation and cache sizes do not need to be symmetric across the group; this asymmetry will not affect the normal operation of the nodes. However, since the cache will take part in the recovery of nodes that have become unreachable, it is recommended that all nodes have the same value, in order to maximize the chances that recovery succeeds.

1. Introduction

=================

The worklog can be broken into three main building blocks: 1) introducing the new group_replication_message_cache_size option, 2) modifying the internals of the XCom cache to accommodate its new requirements, and 3) printing the warning message informing the user that a cached message needed for recovery has been evicted. A new MTR test will also be created. The worklog will have no effect on current default behavior, so no changes will need to be made to existing tests (both MTR and GCS).

2. Adding new server option

==============================

A. Setting the option in GR


The plugin.cc file will be modified to add the option's corresponding variable to GR, as well as its registration as a system variable. In addition, a new function called update_xcom_cache_size will be implemented, which will be responsible for calling into GCS to update the value of the cache size limit whenever the user changes it at runtime. Another new function, named check_xcom_cache_size, will also be implemented; it will be responsible for validating the value set by the user.

B. Passing the cache size limit to GCS/XCom


GR passes the value of the xcom_cache_size_var on to GCS on two occasions: 1) at startup (e.g., if the user explicitly sets the option in the .cnf file or on the command line) or 2) at runtime, via the update_xcom_cache_size function of GR.

To set the value of the cache size at startup, we will add a new GCS parameter named xcom_cache_size. This value will be read during the initialization of GCS and, after validation, passed on to XCom.
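As a rough illustration of this startup path, consider the following sketch. The function names (init_cache_size_from_param, set_xcom_cache_size) and the validation logic are assumptions for illustration, not the real GCS/XCom API.

```c
#include <assert.h>
#include <stdlib.h>

#define XCOM_MIN_CACHE_SIZE 1073741824ULL /* 1GB, the option's minimum */

/* Stand-in for XCom's internal cache-limit setter. */
static unsigned long long the_cache_limit = XCOM_MIN_CACHE_SIZE;
void set_xcom_cache_size(unsigned long long limit) { the_cache_limit = limit; }

/* Parse and validate the xcom_cache_size startup parameter, then hand
   it to XCom. Returns 0 on success, nonzero if the value is invalid. */
int init_cache_size_from_param(const char *xcom_cache_size_str) {
  char *end = NULL;
  unsigned long long v = strtoull(xcom_cache_size_str, &end, 10);
  if (end == xcom_cache_size_str || *end != '\0') return 1; /* not numeric */
  if (v < XCOM_MIN_CACHE_SIZE) return 2;                    /* below 1GB   */
  set_xcom_cache_size(v);
  return 0;
}
```

The point of the sketch is the ordering: validation happens in GCS before XCom ever sees the value, which is only safe because XCom has not yet started its own thread at this stage.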

The process described above can only be used before XCom is initialized. When the cache size limit is updated at runtime a different mechanism must be used in order to ensure concurrency correctness between the GCS and XCom threads; otherwise, it would be possible for the variable to be modified while XCom was reading it, leading to undefined behavior.

Hence, to deal with runtime cache size limit modifications, we will use code that already exists in XCom, but is currently unused - the xcom_client_set_cache_limit function (part of the XCom client API), which sends a message to XCom requesting it to change the cache size limit. For GR to call this function, we will add an intermediary GCS function called set_cache_limit. This function will be called by the update_xcom_cache_size function of GR.

3. Modifying the XCom cache

===============================

The bulk of this work will consist in transitioning the XCom cache from a static data type to a proper dynamic structure capable of growing. All code related to this change will be encapsulated inside the XCom cache source files, xcom_cache.c/h.

A. Background on cache internals


Currently, the XCom cache is statically pre-allocated with 50000 entries. This means that no entries are ever added to or removed from the cache. Thus, when we say that a new item is added to the cache, we mean that an existing cache entry is filled with new data (both metadata and actual paxos message data); similarly, when we say that an entry is removed, we mean that its metadata and message data are cleared - the entry itself is not removed.

Internally, the cache consists of an LRU of paxos machines. The LRU is logically divided into two: a Protected LRU consisting of the cache entries that hold cached paxos machines and a Probation LRU consisting of the unused machines. Logically, the Probation LRU will consist of the first n entries of the LRU, while the 50000-n entries of the Protected LRU will follow it:

        _    _    _    _    _    _    _    _    _    _    _    _    _
LRU:   |_|->|_|->|_|->|_|->|_|->|_|->|_|->|_|->|_|->|_|->|_|->|_|->|_|
        ^                             ^                             ^
        |                             |                             |
       HEAD                       Probation                     Protected

Physically, the LRU is implemented as an array of list items (i.e., an item with prev and next constructs) to simplify (de)initialization. Both internal LRUs are kept separate and items move between one and the other as they are cached and deleted. The result is a physical structure that could resemble the following (note that the following is an illustration; an actual running cache can have much more complex links between the LRU entries):

Cache:
      __S1__ __S2__ __S3__ __S4__ __S5__ __S6__ __S7__ __S8__ ___S9__
     |______|______|______|______|______|______|______|______|_______|
         |      |      |      |      |      |      |      |      |
        _v_    _v_    _v_    _v_    _v_    _v_    _v_    _v_    _v_
       |   |<-|   |<-|   |<-|   |<-|   |  |   |<-|   |<-|   |<-|   |
       |___|->|___|->|___|->|___|->|___|  |___|->|___|->|___|->|___|
        | ^      ProtectedLRU       | ^    | ^   ProbationLRU   | ^ 
        | |_________________________| |    | |__________________| |
        |_____________________________|    |______________________|

For fast lookup, the cache maintains an index that maps the synodes of the cached paxos machines to the LRU entries that hold them. The index is implemented as a hash table with separate chaining for collision resolution: each bucket of the table is a list of entries; to retrieve an entry with key synode, the cache has to iterate over the list held by the bucket with index hash(synode) until it finds the entry that has the requested synode.
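The lookup scheme just described can be sketched as follows. This is a simplified illustration: the synode is reduced to a plain msgno, and all names are invented rather than taken from xcom_cache.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef unsigned long long synode_no; /* simplified: msgno only */

/* One chained index entry mapping a synode to its cached machine. */
struct index_entry {
  synode_no key;
  void *pax_machine;        /* the cached paxos machine         */
  struct index_entry *next; /* chain for keys sharing a bucket  */
};

#define N_BUCKETS 50000

static size_t hash_synode(synode_no s) { return (size_t)(s % N_BUCKETS); }

/* Walk the bucket's chain until the requested synode is found. */
void *index_find(struct index_entry **buckets, synode_no s) {
  struct index_entry *e;
  for (e = buckets[hash_synode(s)]; e != NULL; e = e->next)
    if (e->key == s) return e->pax_machine;
  return NULL;
}

/* Prepend a new entry to its bucket's chain. Returns 0 on success. */
int index_insert(struct index_entry **buckets, synode_no s, void *machine) {
  struct index_entry *e = malloc(sizeof(*e));
  if (e == NULL) return 1;
  e->key = s;
  e->pax_machine = machine;
  e->next = buckets[hash_synode(s)];
  buckets[hash_synode(s)] = e;
  return 0;
}
```

Synodes 7 and 50007 collide under this hash (both map to bucket 7), so both end up in the same chain and are distinguished only by the key comparison during the walk.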

The cache has only one operation: pax_machine *get_cache(synode)

This operation serves both for retrieval and insertion: if a cache entry with synode exists, it will be retrieved; otherwise, a new one will be initialized and added to the cache.

The cache will continue to fill entries as long as there are free items in the Probation LRU. Otherwise, it will have to clear an item from the Protected LRU before re-using it. The size limit is enforced when a new item is added to the cache: after the addition, the cache verifies if its size is over the limit; if so, it will iterate over the Protected LRU and try to remove elements until the size is below the limit. As explained before, this operation is best effort and it may not be possible to remove any element, in which case nothing is done; this is, however, an unlikely event.

B. Dynamic resizing


In the following we use the term length to refer to the number of entries currently allocated for the cache, regardless of whether they are empty or actually holding data; we will use the term occupation to refer to the current number of entries that are being used (i.e., that hold cached data). The terms entry and slot will be used interchangeably.

Although we will modify the cache to allow for its length to be increased, we will not change the default length (50000), which will also be the minimum length. This means that the cache will always have, at least, 50000 slots. It also means that regardless of the cache size limit set (or not) by the user, XCom will continue to start with the 50000 slots that it currently uses. Hence, whenever XCom fills all of its available entries without reaching the cache size limit, it will have to increase its length, in order to make room for more messages to be cached. Conversely, XCom will also decrease the length of the cache whenever a decrease in the cache size limit causes too many cache entries to become unused.

It is important to note that, unlike what happens currently, after this worklog it will be possible to actually add and remove cache entries. We, thus, redefine the main cache operations as follows:

  • Adding an item to the cache will consist of allocating a new entry and adding it to the cache; this will result in an increase in the length of the cache.
  • Removing an item from the cache will consist of deallocating an existing entry from the cache; this will result in a decrease in the length of the cache.
  • Adding data to the cache will consist of filling an existing cache entry with metadata and message data; this corresponds to the add operation of the current implementation of the cache.
  • Clearing an item from the cache will consist of deleting the metadata and message data stored in that item, without removing the entry; this corresponds to the remove operation of the current implementation of the cache.

The new version of the cache will retain the two main structures currently used (the LRU and the hash table), but will make some changes to them in order to allow them to grow (as well as decrease). The following sections will describe these changes in detail.

B.1 Increasing the length


The length of the cache will be increased whenever the occupation of the cache reaches 95%. Since the cache operation is backed by both a list and a hash table, both structures will have to be resized. For simplicity, each resize will consist in adding 50000 new slots to each structure.

Since the LRU is implemented as an array, we could simply allocate a new 50000-slot array and connect the items at the end of both lists. However, this would make decreasing the length of the LRU a much more complex and heavier operation. For this reason, we will turn the LRU into a proper standalone list, which means that the list items will be created on the fly while initializing the LRU (currently, they are created automatically when the array is created). This should not introduce too much overhead, since XCom already has to traverse the list in order to initialize its items.
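A minimal sketch of growing such a standalone doubly linked list, with items allocated on the fly (the names are illustrative, not the real XCom structures):

```c
#include <assert.h>
#include <stdlib.h>

/* One item of the standalone LRU list described above. */
struct lru_item {
  struct lru_item *prev, *next;
  void *machine; /* the paxos machine held by this slot, if any */
};

/* Prepend `count` freshly allocated (zeroed) items to the list head and
   return the new head. On allocation failure, the list built so far is
   returned intact. */
struct lru_item *lru_grow(struct lru_item *head, size_t count) {
  size_t i;
  for (i = 0; i < count; i++) {
    struct lru_item *it = calloc(1, sizeof(*it));
    if (it == NULL) return head;
    it->next = head;
    if (head != NULL) head->prev = it;
    head = it;
  }
  return head;
}
```

Because the items are individual allocations rather than slices of one array, shrinking the list later is just unlinking and freeing items, which is the simplification this worklog is after.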

In addition to allocating the memory for the new list items, it is necessary to initialize them. To avoid the performance overhead of initializing (and, thus, iterating over) the whole new list at once, we will do so incrementally, as described in section B.3.

The hash table will also need to be resized in order to avoid having too many collisions. To do so, we will create a new hash table of 50000 slots and keep it side by side with the previous table. Because the cache can be resized multiple times, it is possible that at any time there are several tables, each with 50000 slots. To keep track of the multiple tables, they will be stored in a stack:

           ------      ------ ------     --------
HT_Stack: | HT_n | -> |Slot 1|Slot 2|...|Slot 50K| 
           ------      ------ ------     --------
          | ...  |
           ------      ------ ------     --------
          | HT_1 | -> |Slot 1|Slot 2|...|Slot 50K|
           ------      ------ ------     --------

Whenever a new hash table is added to the stack, we will record the maximum msgno seen by XCom and associate it with the new table. To cache a new item, XCom will iterate over the stack: items will be added to the first hash table whose associated msgno is lower than the msgno of the new item. Similarly, to retrieve an item, XCom will search the tables in stack order until it reaches the first hash table whose associated msgno is lower than the msgno of the requested item; since every table pushed after that item was cached records a msgno at least as high as the item's, the item cannot be in any older table and the search can stop there.
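The msgno-based routing between the stacked tables can be sketched as follows. This is a simplified illustration: the real stack elements hold full 50000-slot hash tables, and the names here are invented.

```c
#include <assert.h>
#include <stddef.h>

/* One stack element: a hash table plus the maximum msgno XCom had seen
   when the table was created. */
struct ht_stack_entry {
  unsigned long long max_msgno_at_creation;
  void *table; /* the 50000-slot hash table itself */
};

/* Tables are ordered newest first, so recorded msgnos decrease down the
   stack. Returns the index of the table a message with msgno m should go
   to (the first table whose recorded msgno is lower than m), or -1 if no
   table qualifies. */
int table_for_insert(const struct ht_stack_entry *stack, size_t n,
                     unsigned long long m) {
  size_t i;
  for (i = 0; i < n; i++)
    if (stack[i].max_msgno_at_creation < m) return (int)i;
  return -1;
}
```

Under normal operation new msgnos exceed the top table's recorded msgno, so the loop exits on its first iteration; only right after a new table is pushed do lookups fall through to the table below it, which matches the at-most-two-steps expectation described in the next paragraph.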

In terms of performance, we expect some overhead will be added after creating a new hash table, since some recently used items will have been cached in an older table (likely hash table n-1). However, as the paxos protocol makes progress, the synodes will increase and the vast majority (likely all) of the items requested from the cache will be present in the most recent hash table. This means that retrievals for cached entries will be served from the first hash table, while retrievals for non-cached entries and insertions will, under normal operation, require, at most, two steps.

B.2 Decreasing the length


Decreasing the length of the cache will only occur when the following conditions are met: 1) the cache size limit has been decreased by the user, 2) the occupation of the cache is below 70%, 3) there are empty hash tables at the bottom of the hash table stack, and 4) the metadata of the cache occupies 100MB or more (i.e., approximately 10% of the minimum cache size limit). This last condition corresponds to approximately 500000 slots, or ten times the default length of the cache. This is a conservative approach that prioritizes stability and reduced maintenance overhead over space efficiency.
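The four conditions above can be summarized in a small predicate. This is an illustrative sketch; the structure, field names, and exact accounting are assumptions, not the real implementation.

```c
#include <assert.h>

/* Snapshot of the cache state relevant to the length-decrease decision. */
struct cache_status {
  int limit_was_decreased;           /* condition 1: user lowered the limit */
  double occupation;                 /* condition 2: fraction of used slots */
  int empty_tables_at_stack_bottom;  /* condition 3: removable hash tables  */
  unsigned long long metadata_bytes; /* condition 4: total cache metadata   */
};

#define DECREASE_OCCUPATION_THRESHOLD 0.70
#define MIN_METADATA_FOR_DECREASE (100ULL * 1024 * 1024) /* 100MB */

/* A length decrease is triggered only when all four conditions hold. */
int should_decrease_length(const struct cache_status *s) {
  return s->limit_was_decreased &&
         s->occupation < DECREASE_OCCUPATION_THRESHOLD &&
         s->empty_tables_at_stack_bottom > 0 &&
         s->metadata_bytes >= MIN_METADATA_FOR_DECREASE;
}
```

The conjunction makes the conservatism explicit: a cache that is lightly occupied but still near its default length (little metadata), or one with no empty table to reclaim, is left alone.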

Each decrease operation will consist of removing 50000 items from the LRU, as well as the empty hash table at the bottom of the hash table stack. After each decrease operation terminates, a new one will be automatically triggered until the resulting length has, at most, 70% occupation. Within each decrease operation, entries will be removed from the LRU incrementally, as described in the next section.

B.3 Cache maintenance


XCom needs to verify if the current cache size respects the cache size limit in two moments. The first is when a new item is added to the cache. This verification will continue to be executed as it is currently: whenever a new data item is added to the cache, XCom will calculate the resulting cache size and, if it is above the cache size limit, XCom will continuously clear entries in LRU order until the cache size limit is respected.

The second moment is after the user decreases the cache size limit. Because the size decrease is arbitrarily defined by the user, the number of messages that need to be cleared from the cache afterwards can be too high to do at once. Hence, to prevent XCom from blocking the progress of the consensus protocol, we will clear messages incrementally, 100 messages at a time. To do so, we will take advantage of XCom's task system by having a background task do the message clearing. While the task is actively clearing messages, XCom will not execute the regular validation of the cache size after a message is added to the cache.

The background task will also be responsible for cache length maintenance, namely by 1) verifying if the cache needs resizing and 2) incrementally executing the resize. Each resizing round will add/remove 500 slots (i.e., 1% of 50000), so that it does not introduce too much overhead while still making progress quickly. Resizing will stop once the 50000-slot target increment/decrement is reached; after the resize terminates, XCom will start a new resize round if necessary (e.g., if, after a decrease, the occupation is still above the target but below the 70% threshold).
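The incremental stepping can be sketched as follows. The names are illustrative and the task-system mechanics (scheduling, yielding) are omitted; only the 500-slots-per-step progression toward the 50000-slot target is shown.

```c
#include <assert.h>
#include <stddef.h>

#define RESIZE_TARGET 50000 /* slots to add/remove per resize operation */
#define RESIZE_STEP 500     /* slots handled in one background-task step */

/* State of one in-progress resize operation. */
struct resize_op {
  size_t done; /* slots added/removed so far */
};

/* Performs one maintenance step; returns 1 while the resize is still in
   progress, 0 once the 50000-slot target has been reached. */
int resize_step(struct resize_op *op) {
  size_t remaining = RESIZE_TARGET - op->done;
  op->done += (remaining < RESIZE_STEP) ? remaining : RESIZE_STEP;
  return op->done < RESIZE_TARGET;
}
```

Run to completion, a resize takes 100 steps of 500 slots; between steps the background task yields, so the consensus protocol keeps making progress.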

It is possible that while a resize operation is taking place the user changes the cache size limit. The following table summarizes the behavior of the maintenance operations regarding concurrency:

 ___________________________________________________________________
|                     |                  USER ACTION                |
|  ONGOING OPERATION  |---------------------------------------------|
|                     |    *Increase Size*    |   *Decrease Size*   |
|=====================|=============================================|
|  *Length Increase*  | Complete resizing     | Pause until size    |
|                     |                       | decrease terminates |
|---------------------|---------------------------------------------|
|  *Length Decrease*  | Continue, but cancel  | Pause until size    |
|                     | if occupation >= 95%  | decrease terminates |
 -------------------------------------------------------------------

The table shows that if the user decreases the cache size limit while a resizing operation is taking place, that operation will block until the size decrease operation terminates; afterwards, XCom will re-evaluate the length of the cache and decide if the operation should continue or be reversed. If the user increases the cache size limit and a length decrease operation is taking place, then the length decrease operation will continue; however, if during the decrease the cache reaches 95% occupation (the threshold for increasing the length of the cache) the decrease operation will be replaced by an increase.

The following concurrency rule (not shown in the table) will also be applied: if during a size decrease operation the occupation of the cache falls below the threshold for decreasing the length of the cache (70%), the length decrease operation will not be triggered; after the size decrease terminates XCom will evaluate the length of the cache and trigger a length decrease if necessary.

C. Concluding remarks


All the resizing variables described above are tentative values based on analysis and will be subject to performance testing. They may, thus, be subject to change and not correspond to the final values used in the implementation. This applies to the following variables:

  • Length increase threshold: the occupation of the cache, in percentage, that triggers an increase - currently set to 95%.
  • Length decrease threshold: the minimum number of slots in the cache after which XCom considers a length decrease - currently set to 500K slots, i.e., approximately 100MB of metadata (about 10% of the minimum cache size limit of 1GB).
  • Length decrease target: the desired maximum occupation of the cache after a decrease operation - currently set to 70%.
  • Increase/decrease target: the number of slots to add/remove in each cache maintenance operation - currently set to 50000 slots.
  • Increase/decrease increment: the number of slots to add/remove in each step of the cache maintenance algorithm - currently set to 500 slots, i.e., 1% of the minimum number of cache entries (50000).
  • Size decrease increment: the number of entries to be cleared from the cache in each step of the cache size decrease operation - currently set to 100 entries.

4. Changes to memory instrumentation

=========================================

Currently, memory instrumentation in Performance Schema shows only the amount of data occupied by app_data. Since in this WL we will consider the metadata of each pax_machine when accounting for the cache size, we need to also account for this data in P_S. We will also have to change the type of the variables used to hold the current size of the cache. The changes will be as follows:

xcom_cache.c

-static size_t cache_size = 0;
+static unsigned long long cache_size = 0;

...

size_t pax_machine_size(pax_machine const *p) {
  size_t size = get_app_msg_size(p->proposer.msg);

  if (p->acceptor.msg && p->proposer.msg != p->acceptor.msg)
    size += get_app_msg_size(p->acceptor.msg);

  if (p->learner.msg && p->acceptor.msg != p->learner.msg &&
      p->proposer.msg != p->learner.msg)
    size += get_app_msg_size(p->learner.msg);
- return size;
+ return size + sizeof(pax_machine);
}

gcs_psi.cc

-static long long current_count = 0;
+static unsigned long long current_count = 0;