MySQL 9.1.0
Source Code Documentation
The Metadata Cache plugin communicates with the Metadata and Group Replication exposed by the cluster to obtain its topology and availability information. A digest of this information is then exposed to the Routing Plugin in the form of a routing table. Key components:
MD = metadata, several tables residing on the metadata server, which (among other things) contain cluster topology information. It reflects the desired "as it should be" version of topology.
GR = Group Replication, a server-side plugin responsible for synchronising data between cluster nodes. It exposes certain dynamic tables (views) that we query to obtain health status of the cluster. It reflects the real "as it actually is" version of topology.
MDC = MD Cache, this plugin (the code that you're reading now).
ATTOW = contraction for "at the time of writing".
[xx] (where x is a digit) = reference to a note in Notes section.
MDC refresh runs in its own thread and periodically queries both MD and GR for status, then updates the routing table which is queried by the Routing Plugin. Its entry point is MetadataCache::start(), which (indirectly) runs a "forever loop" in MetadataCache::refresh_thread(), which in turn is responsible for periodically running MetadataCache::refresh().
MetadataCache::refresh() is the "workhorse" of the refresh mechanism.
The MetadataCache::refresh_thread() call to MetadataCache::refresh() can be triggered in 2 ways:
<TTL> seconds have passed since the last refresh. This is implemented by running a sleep loop between refreshes: the loop sleeps 1 second at a time, until <TTL> iterations have passed.
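For illustration, below is a minimal sketch of such a trigger loop. All names are stand-ins rather than the real MetadataCache members (the actual MetadataCache::refresh_thread() also reacts to shutdown requests and other wake-up conditions); the point is the 1-second sleep granularity:

#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

// Hypothetical stand-ins for the real MetadataCache members.
std::atomic<bool> terminated{false};

void refresh() {  // placeholder for the real MD+GR refresh
  std::cout << "refresh()\n";
}

// Sketch of the trigger loop: run refresh(), then sleep 1 second at a time
// until <TTL> iterations have passed, so a shutdown request is noticed fast.
void refresh_thread(unsigned ttl_seconds) {
  while (!terminated) {
    refresh();
    for (unsigned i = 0; i < ttl_seconds && !terminated; ++i)
      std::this_thread::sleep_for(std::chrono::seconds(1));
  }
}

int main() {
  std::thread worker(refresh_thread, 3);                // TTL = 3 seconds
  std::this_thread::sleep_for(std::chrono::seconds(7));
  terminated = true;                                    // ask the loop to stop
  worker.join();
}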
Once refresh is called, it goes through the following stages:
Stage 1: Query MD
Stage 2: Query GR, determine availability
Stage 3: Update routing table
In subsequent sections, each stage is described in more detail.
Stage 1: Query MD
This stage can be divided into two major substages: connecting to a MD server (Stage 1.1) and extracting the metadata (Stage 1.2).
Stage 1.1: Connect to MD server
Implemented in: ClusterMetadata::connect()
MDC starts with a list of MD servers written in the dynamic configuration (state) file, such as:
"cluster-metadata-servers": [ "mysql://192.168.56.101:3310", "mysql://192.168.56.101:3320", "mysql://192.168.56.101:3330" ]
It iterates through the list and tries to connect to each server, until a connection succeeds.
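A simplified sketch of that connect loop follows; Connection and connect_to() are hypothetical stand-ins (the real code uses the router's MySQL session class and the URIs from the state file shown above):

#include <optional>
#include <string>
#include <vector>

struct Connection { std::string server; };  // hypothetical connection handle

std::optional<Connection> connect_to(const std::string &uri) {
  // The real code would attempt an actual MySQL connection here.
  return Connection{uri};
}

// Try each configured metadata server in order; stop at the first success.
std::optional<Connection> connect_to_first_available(
    const std::vector<std::string> &md_servers) {
  for (const auto &uri : md_servers) {
    if (auto conn = connect_to(uri)) return conn;  // success: use this server
    // failure: fall through and try the next server on the list
  }
  return std::nullopt;  // no MD server was reachable
}

int main() {
  auto conn = connect_to_first_available(
      {"mysql://192.168.56.101:3310", "mysql://192.168.56.101:3320",
       "mysql://192.168.56.101:3330"});
  return conn ? 0 : 1;
}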
Stage 1.2: Extract MD
Implemented in: ClusterMetadata::fetch_instances_from_metadata_server()
Using the connection established in Stage 1.1, MDC runs a SQL query which extracts a list of nodes (GR members) belonging to the cluster. Note that this is the configured "should be" view of the cluster topology, which might not correspond to the actual topology, if for example some nodes became unavailable, changed their role, or new nodes were added without updating MD on the server.
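For illustration only, a query in the spirit of what Stage 1.2 runs could look like the constant below. The view and column names are simplified assumptions, not the exact statement the router issues; the real query targets the mysql_innodb_cluster_metadata schema and differs between metadata schema versions:

#include <string>

// Illustrative only: an approximation of the kind of statement Stage 1.2
// runs against the metadata schema. The view and column names below are
// assumptions (simplified); the real query depends on the metadata version.
const std::string kFetchClusterInstances =
    "SELECT i.mysql_server_uuid, i.endpoint "
    "FROM mysql_innodb_cluster_metadata.v2_instances AS i";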
Stage 2: Query GR, determine availability
Implemented in: ClusterMetadata::update_cluster_status_from_gr()
Here MDC iterates through the list of GR members obtained from MD in Stage 1.2, until it finds a "trustworthy" GR node. A "trustworthy" GR node is one that passes the following substages:
If MDC doesn't find a "trustworthy" node, it clears the routing table, resulting in Routing Plugin not routing any new connections.
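Before going through those substages in detail, the overall Stage 2 loop can be sketched as follows (all names below are hypothetical simplifications of what ClusterMetadata::update_cluster_status_from_gr() does):

#include <string>
#include <vector>

// Hypothetical, simplified types standing in for the real metadata-cache ones.
struct GrStatus {};  // result of the GR status query
struct RoutingTable { std::vector<std::string> nodes; };

bool connect_to_gr_node(const std::string &) { return true; }
bool fetch_gr_status(const std::string &, GrStatus &) { return true; }
bool passes_quorum_test(const GrStatus &) { return true; }
RoutingTable build_routing_table(const GrStatus &) { return {}; }

// Iterate over the GR members listed in MD until one "trustworthy" node is
// found: reachable, answers the status query, and part of the quorum.
RoutingTable query_gr_availability(const std::vector<std::string> &md_members) {
  for (const auto &node : md_members) {
    if (!connect_to_gr_node(node)) continue;       // connection failed: next node
    GrStatus status;
    if (!fetch_gr_status(node, status)) continue;  // status query failed: next node
    if (!passes_quorum_test(status)) continue;     // no quorum: next node
    return build_routing_table(status);            // trustworthy node found
  }
  return RoutingTable{};  // no trustworthy node: routing table stays empty
}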
Stage 2.1: Connect to GR node
Implemented in: ClusterMetadata::update_cluster_status_from_gr()
A new connection to the GR node is established (on failure, Stage 2 progresses to the next iteration).
Stage 2.2: Extract GR status
Implemented in: fetch_group_replication_members()
A single SQL query is run to produce a status report of all nodes seen by this node (which would be the entire cluster if it were in perfect health, or a subset of it if some nodes have become unavailable or the cluster is experiencing a split-brain scenario).
If either SQL query fails to execute, Stage 2 iterates to next GR node.
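The status data in question lives in the Group Replication performance_schema tables. A query along the lines of the one below (illustrative; not necessarily the exact statement fetch_group_replication_members() issues) returns, for every member this node can see, its state (ONLINE, RECOVERING, UNREACHABLE, ...) and role (PRIMARY/SECONDARY):

#include <string>

// Illustrative approximation of the GR status query run in this substage.
const std::string kFetchGroupMembers =
    "SELECT member_id, member_host, member_port, member_state, member_role "
    "FROM performance_schema.replication_group_members";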
Stage 2.3: Quorum test
Implemented in: ClusterMetadata::update_cluster_status_from_gr() and ClusterMetadata::check_cluster_status_in_gr()
The MD and GR data collected so far are compared to see if the GR node just queried belongs to an available cluster (or to an available cluster partition, if the cluster has partitioned). For a cluster (partition) to be considered available, it has to have quorum, that is, meet the following condition:
count(ONLINE nodes) + count(RECOVERING nodes) is greater than 1/2 * count(all original nodes according to MD).
If a particular GR node does not meet the quorum condition, Stage 2 iterates to the next GR node.
OTOH, if the GR node is part of the quorum, Stage 2 will not iterate further, because doing so would be pointless (it is not possible to find a member that is part of another quorum, because there can only be one quorum, the one we just found). This matters, because having quorum does not automatically imply being available, as the next paragraph explains.
The availability test resolves the node's cluster to be in one of 4 possible states:
As already mentioned, reaching the 1st of the 4 above states will result in Stage 2 iterating to the next node. OTOH, reaching one of the remaining 3 states will cause MDC to move on to Stage 3, where it sets the routing table accordingly.
ATTOW, our Router has a certain limitation: it assumes that MD contains an exact set or superset of the nodes in GR. The user is normally expected to use MySQL Shell to reconfigure the cluster, which automatically updates both GR and MD, keeping them in sync. But if for some reason the user tinkers with GR directly and adds nodes without updating MD accordingly, availability/quorum calculations will be skewed. We run checks to detect such a situation, and log a warning like so:
log_error("Member %s:%d (%s) found in Group Replication, yet is not
defined in metadata!"
but beyond that we just act defensively by having our quorum calculation be conservative, and err on the side of caution when such a discrepancy happens (quorum becomes harder to reach than if MD contained all GR members).
Quorum is evaluated in code as follows:
bool have_quorum = (quorum_count > member_status.size()/2);
quorum_count is the sum of PRIMARY, SECONDARY and RECOVERING nodes that appear in MD and GR
member_status.size() is the sum of all nodes in GR, regardless of whether they show up in MD or not
Nodes that appear in MD but not in GR are counted by neither quorum_count nor member_status.size(); OTOH, nodes that appear in GR but not in MD are counted only by member_status.size(), making quorum harder to reach.
To illustrate how our quorum calculation will behave when GR and MD get out of sync, below are some example scenarios:
Scenario 1
MD defines nodes A, B, C
GR defines nodes A, B, C, D, E
A, B are alive; C, D, E are dead
Availability calculation should deem cluster to be unavailable, because only 2 of 5 nodes are alive, even though looking purely from MD point-of-view, 2 of its 3 nodes are still alive, thus could be considered a quorum. In such case:
quorum_count = 2 (A and B)
member_status.size() = 5
and thus:
have_quorum = (2 > 5/2) = false
Scenario 2
MD defines nodes A, B, C
GR defines nodes A, B, C, D, E
A, B are dead; C, D, E are alive
Availability calculation, if fully GR-aware, could deem the cluster to be available, because looking from a purely GR perspective, 3 of 5 nodes form a quorum. OTOH, looking from the MD perspective, only 1 of its 3 nodes (C) is alive.
Our availability calculation prefers to err on the side of caution. So here the availability is judged as not available, even though it could be. But that's the price we pay in exchange for the safety the algorithm provides, as demonstrated in the previous scenario:
quorum_count = 1 (C)
member_status.size() = 5
and thus:
have_quorum = (1 > 5/2) = false
Scenario 3
MD defines nodes A, B, C
GR defines nodes C, D, E
A, B are not reported by GR; C, D, E are alive
According to GR, there's a quorum between nodes C, D and E. However, from MD point-of-view, A and B went missing and only C is known to be alive.
Again, our availability calculation prefers to err on the safe side:
quorum_count = 1 (C)
member_status.size() = 3 (only C, D, E are reported by GR)
and thus:
have_quorum = (1 > 3/2) = false
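The three scenarios above can be reproduced with a few lines of code. The helper below is not router code, just an application of the counting rules stated earlier: quorum_count counts alive (ONLINE/RECOVERING) nodes present in both MD and GR, while member_status.size() corresponds to the number of nodes GR reports:

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <set>
#include <string>

// Hypothetical helper reproducing the document's quorum arithmetic.
bool have_quorum(const std::set<std::string> &md,
                 const std::set<std::string> &gr,
                 const std::set<std::string> &alive) {
  const std::size_t quorum_count = std::count_if(
      alive.begin(), alive.end(),
      [&](const std::string &n) { return md.count(n) && gr.count(n); });
  const std::size_t member_status_size = gr.size();  // nodes reported by GR
  return quorum_count > member_status_size / 2;
}

int main() {
  // Scenario 1: only 2 of the 5 GR nodes are alive -> no quorum (prints 0)
  std::cout << have_quorum({"A","B","C"}, {"A","B","C","D","E"}, {"A","B"}) << "\n";
  // Scenario 2: GR-side quorum exists, but of the alive nodes only C is in MD -> no quorum
  std::cout << have_quorum({"A","B","C"}, {"A","B","C","D","E"}, {"C","D","E"}) << "\n";
  // Scenario 3: GR reports only C, D, E; of those only C is in MD -> no quorum
  std::cout << have_quorum({"A","B","C"}, {"C","D","E"}, {"C","D","E"}) << "\n";
}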
Need for cluster configuration aside, there is another challenge: GR can provide the IP/hostnames of nodes as it sees them from its own perspective, but those IP/hostnames might not be externally reachable. OTOH, the MD tables provide the external IP/hostnames which Router relies upon to reach the GR nodes.
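This is why the resulting routing table is built from the addresses coming from MD. One way to express the matching (sketched below with hypothetical types, not the router's actual data structures) is to look up the externally reachable address in MD by the member's server UUID reported by GR:

#include <map>
#include <optional>
#include <string>

// Hypothetical, simplified record: MD knows the externally reachable address
// of each instance, keyed by that instance's server UUID.
struct MdInstance {
  std::string server_uuid;
  std::string external_endpoint;  // host:port that clients/Router can reach
};

// Given a member UUID reported by GR, return the external address from MD.
std::optional<std::string> external_address_for(
    const std::string &server_uuid,
    const std::map<std::string, MdInstance> &md_by_uuid) {
  const auto it = md_by_uuid.find(server_uuid);
  if (it == md_by_uuid.end()) return std::nullopt;  // node in GR but not in MD
  return it->second.external_endpoint;
}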
Stage 3: Update routing table
Implemented in: MetadataCache::refresh()
Once Stage 2 is complete, the resulting routing table is applied. It is also compared to the old routing table, and if there is a difference between them, appropriate log messages are issued advising of the availability change.
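A minimal sketch of that comparison step (hypothetical types; the real code goes through the router's logging facility and records more than plain endpoints):

#include <iostream>
#include <set>
#include <string>

// Hypothetical routing table: just the set of endpoints that new connections
// may be routed to. The real table also carries modes (read-write/read-only).
using RoutingTable = std::set<std::string>;

// Apply the freshly computed table and report availability changes.
void apply_routing_table(RoutingTable &current, const RoutingTable &fresh) {
  for (const auto &node : fresh)
    if (current.count(node) == 0)
      std::cout << "instance became available: " << node << "\n";
  for (const auto &node : current)
    if (fresh.count(node) == 0)
      std::cout << "instance became unavailable: " << node << "\n";
  current = fresh;  // from now on the Routing Plugin sees the new table
}

int main() {
  RoutingTable current{"host1:3306", "host2:3306"};
  apply_routing_table(current, {"host2:3306", "host3:3306"});
}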
Notes
[01] There has been a recent concern ATTOW that the MD returned might be stale if the MD server node is in RECOVERING state. This assumes the MD server is also deployed on an InnoDB cluster.
[02] It might be better to always start from the last successfully-connected server, rather than the 1st on the list, to avoid unnecessary connection attempts when the 1st server is dead.
[03] If the MD-fetching SQL statement fails to execute or be processed properly, it will raise an exception that is caught by the topmost catch of the refresh process, meaning that another MD server will not be queried. This is a bug, reported as BUG#28082473.