WL#8446: Group Replication: Implement error detection and sleep routines on donor selection
Affects: Server-5.7
Status: Complete
Under the Group Replication project, automatic distributed recovery is a key element in giving the user a good experience when using the plugin. The ability to automatically select a donor, and to fail over to another suitable server when an error occurs, in a transparent and configurable way is crucial for this purpose. To improve distributed recovery, this worklog enhances donor selection and adds better support for errors during event application, together with new sleep routines for when something fails.
FR1: Recovery should detect any failure in the process and either recover from it or error out.
FR2: Recovery should execute sleep routines after no donor is found in the group and before a new round of connection attempts begins.
FR3: Donor selection in Recovery should be less static, improved by a random selection of the next donor.
This worklog aims to improve Group Replication Recovery, which from here on we shall refer to simply as Recovery, since no other kind of server/slave recovery is mentioned in this worklog.

Let's start by stating what we do not intend to do here:
  1) Improve the selection algorithm with some kind of given metric (latency, executed GTIDs, etc.)
  2) Improve the selection algorithm to ignore, a priori, servers that we know will fail as donors (that have needed GTIDs purged, for example)
  3) Ignore possible donors that are already transmitting data to another joiner

This worklog will focus on the following tasks:
  1) Implement error detection while we are receiving data from a donor. As of now, the joiner only detects connection errors (nonexistent users, wrong passwords, etc.), but whenever an error occurs after this point, recovery gets stuck waiting for the process to end, not knowing that something bad happened.
  2) Implement proper sleep routines in the donor selection mechanism. As of now, only the number of retries is configurable, and whenever a connection error occurs we just switch to another donor, never sleeping between tries.
  3) Since we are changing the selection structures, there is also room to make the donor selection algorithm less static by making the selection random. This avoids the continuous selection of the first member in the group as the group donor.

We will now descend into each one of these tasks.

=== Error detection ===

As said above, as of now, whenever a connection to a donor is attempted, we wait for it to be established, detecting if an error occurred in the process. The same does not apply, however, to errors when applying/receiving events past this phase.

Such an error can occur when:
  a) The donor has purged GTIDs that the joiner needs.
  b) For some reason there is data that the joiner and the donor both share, but it is logged with different GTIDs. This can happen, for example, when some test data still remains in both servers.

In any of these situations, since recovery is not informed of these failures, the process goes on waiting forever for the final view change event to come (see WL#6837 for a general view of how recovery works). Being stuck, recovery also cannot fail over to another donor that could, in case a), have the needed GTIDs, or in case b), lack the extra conflicting data, allowing a successful conclusion of the process.

To address this, we propose the use of the server applier hook "applier_stop" to detect these errors. The algorithm shall be:

Algorithm
=========

*Note: Although the hook that fires whenever the receiver stops is actually named "thread_stop", we refer to it as "receiver_stop" for better readability.

Recovery running:
  Select donor
  Establish connection to donor (failover on connection errors)
  Register an observer for the "applier_stop" and "receiver_stop" hooks.
  While waiting for the last view change to come:
    If the (applier_stop / receiver_stop) hook fires:
      Something bad happened => donor_connection_thread_error= true
      Recovery loops again and connects to another donor.
    If the view change event arrives:
      Un-register the "applier/receiver stop" hooks
      Terminate the donor connection.
      Continue with the recovery process.

This way, whenever the recovery applier/receiver thread stops, we have access to this information and the process can react accordingly.
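As an illustration only, the following C++ sketch shows how the stop hook and the recovery wait loop could interact. Recovery_module_sketch, recovery_lock, recovery_condition and the method names are hypothetical stand-ins; only the "applier_stop"/"receiver_stop" reaction and the donor_connection_thread_error / view-change conditions come from the algorithm above.

  #include <condition_variable>
  #include <mutex>

  // Minimal sketch, not the actual plugin code: a hypothetical recovery
  // module that waits for the last view change unless the donor channel
  // applier/receiver thread reports that it stopped.
  class Recovery_module_sketch {
   public:
    // Invoked from the "applier_stop"/"receiver_stop" hook observer.
    void on_donor_channel_stop() {
      std::lock_guard<std::mutex> guard(recovery_lock);
      donor_connection_thread_error = true;  // something bad happened
      recovery_condition.notify_one();       // wake the recovery thread
    }

    // Invoked when the awaited view change event finally arrives.
    void on_last_view_change() {
      std::lock_guard<std::mutex> guard(recovery_lock);
      donor_transfer_finished = true;
      recovery_condition.notify_one();
    }

    // Recovery thread: wait until the transfer ends or the channel dies.
    // Returns true when a failover to another donor is needed.
    bool wait_for_donor_transfer() {
      std::unique_lock<std::mutex> lock(recovery_lock);
      recovery_condition.wait(lock, [this] {
        return donor_transfer_finished || donor_connection_thread_error;
      });
      return donor_connection_thread_error;  // caller loops and picks a new donor
    }

   private:
    std::mutex recovery_lock;
    std::condition_variable recovery_condition;
    bool donor_transfer_finished = false;
    bool donor_connection_thread_error = false;
  };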
Note: This hook mechanism can also be used to report errors in the applier module that are at this point ignored, with the exception of the channel thread being shown as Off in the performance schema.

=== Sleep routines ===

The other focus of this worklog is to develop proper sleep routines whenever a failure happens and failover kicks in. Currently, the failover process lacks any kind of waiting between connection attempts, something similar to its asynchronous slave connection counterpart. Here we also address the retry count.

The questions to answer in the design are:

1) How many times do we try to connect to a donor?

As of now, the default number of attempts made by a joining member to find a suitable donor is equal to the number of online members when it joined the group. The logic behind this reasoning is that if we tried all possible donors and none was suitable, it would be pointless to progress. With this worklog, the intention is to set the default to a more familiar number, as the old default was too small when compared to the similar CHANGE MASTER parameter: MASTER_RETRY_COUNT (86400 tries by default). After implementation this value will still come from the existing group_replication_recovery_retry_count variable, but its default will be updated.

Note: In recovery, the process of failing over to another server is not comparable to a simple slave re-connection, as there is a pool of servers available for connection. As such, we can distinguish between individual connection attempts and full group rotations. The decision here, even so, is to count every donor connection attempt and not full group rotations. We also now count the first connection as an attempt, contrary to the past where we counted only re-connections. This means that retries=1 now implies that recovery only tries to connect once and never changes donor, which matches the current slave behavior.

2) How much time do we sleep?

For this second question, we again look at the server, where the slave counterpart is a CHANGE MASTER option: MASTER_CONNECT_RETRY. The difference for the recovery process, though, is that we don't have an explicit CHANGE MASTER command, so we must fetch this value from a new plugin variable:

  group_replication_recovery_connect_retry

This variable, with a default of 60 seconds (same as on CHANGE MASTER), will then be used to determine how much time recovery should sleep between attempts.

3) When do we sleep?

Here, two options were possible:
  a) Sleep between all connection attempts.
  b) Sleep only when we attempted to connect to all group members and failed.

We went with b), because there is no reason to wait between connections, as the next donor may not have the problem that affected the previous, problematic one. This is inherently different from a slave connection scenario, where we connect over and over to the same server and should therefore wait for the underlying problem to be resolved. In recovery, the reasoning of "the DBA will solve it" only makes sense when all group options are exhausted, so we only wait after that.

Algorithm
=========

A generic view of the recovery algorithm and the sleep routines can be expressed as:

  recovery_retry_count = get_server_retry_count_value()

  Recovery running:
    connected= false; error= false
    suitable_donors= group.get_online_members()

    connection loop:
    while (!connected && !error)
      if (recovery_retry_count == 0)
        throw "No more tries!"
        return error
      if (!error && suitable_donors.is_empty)
        suitable_donors= group.get_online_members()
        if (suitable_donors.is_empty)
          throw "No available donors"
          return error
        else
          sleep(recovery_connect_retry)
      possible_donor = suitable_donors.pop()
      recovery_retry_count--
      connected= attempt_to_connect()

=== Random donor selection ===

The only change needed to guarantee a random selection of the donor server is to assure a random order in the suitable donor list. This can be done with a random shuffle when we pick the group members and select the viable ones that are then copied into another structure.

This change, however, implies a change to every MTR test that makes assumptions about the donor selection order. This is addressed below in the low level design.
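The loop and the shuffle can be illustrated with the following C++ sketch. get_online_members() and attempt_to_connect() are hypothetical stand-ins declared here only for the example, std::shuffle stands in for the std::random_shuffle mentioned in the low level design, and errors are reduced to exceptions; this is not the plugin implementation.

  #include <algorithm>
  #include <chrono>
  #include <random>
  #include <stdexcept>
  #include <string>
  #include <thread>
  #include <vector>

  // Assumed helpers, declared only for this sketch.
  std::vector<std::string> get_online_members();
  bool attempt_to_connect(const std::string &donor_uuid);

  std::string select_and_connect_to_donor(long retry_count,
                                          long reconnect_interval_secs) {
    std::mt19937 rng{std::random_device{}()};
    std::vector<std::string> suitable_donors = get_online_members();
    std::shuffle(suitable_donors.begin(), suitable_donors.end(), rng);

    bool connected = false;
    std::string donor;

    while (!connected) {
      if (retry_count == 0)
        throw std::runtime_error("No more tries!");

      if (suitable_donors.empty()) {
        // A full group rotation failed: refresh the donor list and sleep
        // before starting another round of attempts.
        suitable_donors = get_online_members();
        if (suitable_donors.empty())
          throw std::runtime_error("No available donors");
        std::shuffle(suitable_donors.begin(), suitable_donors.end(), rng);
        std::this_thread::sleep_for(
            std::chrono::seconds(reconnect_interval_secs));
      }

      donor = suitable_donors.back();  // pop the next randomly ordered donor
      suitable_donors.pop_back();
      retry_count--;
      connected = attempt_to_connect(donor);
    }
    return donor;
  }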
# For error detection #
=======================

New File: channel_state_observer.cc/h
=====================================

:: New class :: Channel_observation_manager

A class where we can register observers for channel state events.

Methods ::

  /* Initialize the class with plugin info for hook registration */
  initialize(MYSQL_PLUGIN plugin_info)

  /* A method to register observers to the events that come from the server. */
  register_channel_observer(Channel_observer observer)

  /* A method to remove observers. */
  unregister_channel_observer(Channel_observer observer)

:: New class :: Channel_state_observer

This is a class with abstract methods that recovery and the applier, for example, can implement to listen to events in the server channels. It can listen to the Binlog_relay_IO_observer events:

  thread_start
  thread_stop
  applier_stop
  before_request_transmit
  after_read_event
  after_queue_event
  after_reset_slave

Methods ::

  virtual int thread_start();
  ...

Updated File: recovery.cc/h
===========================

:: New fields ::

  donor_channel_thread_error - reports if the recovery applier/receiver thread is dead

:: New class :: Recovery_channel_state_observer: public Channel_state_observer

This class will implement the recovery reaction to applier stops.

Class fields ::

  thread_id - the thread id of the established channel

Methods ::

  int applier_stop()
  {
    declare applier_dead = true;
    terminate_recovery_slave_threads()
    awake the recovery_condition condition
  }

:: Updated logic ::

On establish_donor_connection():

    start_recovery_donor_threads
  + Recovery_channel_state_observer observer= new Recovery_channel_state_observer()
  + Channel_observation_manager.register_channel_observer(observer);

On terminate_recovery_slave_threads():

  + Channel_observation_manager.unregister_channel_observer(observer);
    donor_connection_interface.stop_threads

On recovery_thread_handle():

    //wait until..
    while (!donor_transfer_finished &&      // it's over
           !recovery_aborted &&             // recovery aborted
           !on_failover &&                  // switch to another donor
  +        !donor_channel_thread_error)     // connection applier/receiver is dead
    {
      //Wait for completion
      condition_wait(recovery_condition)
    }

# For sleep routines #
======================

Updated File: plugin.cc/h
=========================

:: New Sys Var :: group_replication_recovery_reconnect_interval

Variable created to allow the user to define the sleeping time between attempts.

Updated File: recovery.cc/h
===========================

:: New fields ::

  recovery_connection_retry - the time to wait after attempting all members
  suitable_donors - queue of suitable group members that is emptied as we try to connect

:: Removed field ::

  rejected_donors - replaced with suitable_donors.

:: Updated logic ::

On start_recovery and update_recovery:

  + suitable_donors= group.get_online_members();

On select_donor(), implement the algorithm presented above:

  while (!connected && !error)
    if (recovery_retry_count == 0)
      throw "No more tries!"
      return error
    if (!error && suitable_donors.is_empty)
      suitable_donors= group.get_online_members()
      if (suitable_donors.is_empty)
        throw "No available donors"
        return error
      else
        sleep(recovery_connect_retry)
    possible_donor = suitable_donors.pop()
    recovery_retry_count--
    connected= attempt_to_connect()

# Random donor selection #
==========================

Updated File: recovery.cc/h
===========================

:: Updated logic ::

On start_recovery and update_recovery, implement this in a way that guarantees a random member order:

  + suitable_donors= group.get_online_members();

std::random_shuffle is an option to consider when implementing this.
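To make the class relationships concrete, here is a minimal C++ sketch of the observer pair described above. The class names and the hook list come from this worklog; the method signatures, the callback wiring and the channel-name check are illustrative assumptions, not the real server observer API.

  #include <functional>
  #include <string>

  // Sketch of the listener interface: hooks default to no-ops so that
  // implementers only override the events they care about.
  class Channel_state_observer {
   public:
    virtual ~Channel_state_observer() {}
    virtual int thread_start(const std::string &channel_name) { return 0; }
    virtual int thread_stop(const std::string &channel_name) { return 0; }
    virtual int applier_stop(const std::string &channel_name) { return 0; }
    // before_request_transmit, after_read_event, after_queue_event and
    // after_reset_slave would follow the same pattern.
  };

  // Hypothetical recovery-side observer: when the recovery channel applier
  // or receiver thread stops, notify recovery so it can fail over.
  class Recovery_channel_state_observer : public Channel_state_observer {
   public:
    explicit Recovery_channel_state_observer(std::function<void()> on_channel_dead)
        : on_channel_dead(std::move(on_channel_dead)) {}

    int applier_stop(const std::string &channel_name) override {
      if (channel_name == "group_replication_recovery")
        on_channel_dead();  // sets donor_channel_thread_error and wakes recovery
      return 0;
    }

    int thread_stop(const std::string &channel_name) override {
      // The receiver ("receiver_stop" in the HLS) is handled the same way.
      return applier_stop(channel_name);
    }

   private:
    std::function<void()> on_channel_dead;
  };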
Updated File: group_replication_recovery_donor_failover.test
=============================================================

This test starts with a group containing server 1 and server 2. When server 3 joins, the test assumed server 1 would be selected as donor, killing it to prove that recovery still goes smoothly. With random selection, server 2 can now also be the donor, so the test must discover which server was chosen.

Decision: Use the recovery channel to determine the donor uuid:

  SELECT source_uuid FROM performance_schema.replication_connection_status
    WHERE channel_name='group_replication_recovery' AND service_state='ON'

With this uuid we query server 1's uuid and server 2's uuid and see who the donor is.

Proto algorithm:

  --connect server2
  server2_uuid = get_uuid
  donor_uuid= SELECT source_uuid...

  #default case
  donor_id= 1
  if(server2_uuid == donor_uuid)
    donor_id= 2

  --connect server$donor_id
  stop group replication
  ....

Other test changes:
===================

Every test that is based on the joiner failing when connecting to the donor and waiting for it to become offline will now need changes to set the number of retries to a low number that makes this process fast.

Code refactor: recovery_state_transfer.cc/h
===========================================

Much of the code related to the state transfer process can be moved to a separate file, to isolate this complex component from the global process.