WL#9426: Group Replication: Single Primary Mode
Affects: Server-5.7
—
Status: Complete
Executive Summary ================= This worklog implements a configuration mode for group replication that makes a single server act as a writable master (PRIAMRY) and the rest of the servers act as hot-standbys (SECONDARY). The group itself coordinates automatically to figure out which is the server that will act as the PRIMARY, through a simple leader election mechanism (e.g., grabbing the alive server with the lowest ID, in lexicographycal order, whenever there is a view change). NOTE: This worklog does NOT cover: - facility for the user to trigger a new primary election - facility for the user to specify that a server is now the new primary This should be covered in another worklog, since operations-wise, it would be good if the user is able to override the system and state which is the primary node. User Stories ============ - As a MySQL DBA I want to set the Replication group in a mode where only one server can accept updates so that I prevent multi-site concurrency issues and thus on-commit aborts. - As a MySQL DBA I want to set the Replication group in a mode where only one server can accept updates so that I am able to make use cascading foreign keys with more than one level and serializable isolation mode. - As a MySQL DBA I want to the Replication group in a mode where only one server can accept updates so that I am able to run DDL through the group in a more controlled way, i.e., without having to orchestrate
Functional Requirements ======================= FR1. The user SHALL be able to turn ON/OFF update everywhere extra checks, but only when the Group Replication plugin is stopped. FR2. When the update_everywhere_extra_checks is disabled, the user SHALL be able to run transactions with isolation level SERIALIZABLE. FR3. The user SHALL be able to turn ON/OFF single_primary_mode, but only when the Group Repication plugin is stopped. FR4. When the single_primary_mode is set, the first server to join the group SHALL be the PRIMARY server. All other servers that join afterwards SHALL be SECONDARIES. FR5. All SECONDARIES SHALL have their super_read_only variable set to ON after joining the group and being set as SECONDARY. FR6. SHOULD the user set super_read_only to OFF when the plugin sets it to ON (i.e., after classifying it as SECONDARY) then the user is overriding the plugin and SHALL not be guaranteed that read only properties will hold. Moreover, at a later stage, the plugin may reset the super_read_only back to ON. FR7. SHOULD a network partition happen in which the primary is put on the partition without majority, then servers cannot agree on whether the server is still primary or not. However, since there is no quorum the system will stop for that partition. The partition with majority SHALL elect a new primary. FR8. Whenever a server is promoted to PRIMARY, its super_read_only variable SHALL be set to OFF automatically and unconditinally. FR9. Whenever a server is set as SECONDARY, its super_read_only variable SHALL be set to ON automatically and unconditinally. Non-functional Requirements =========================== NFR1. The fact that with many servers serving requests shows us that we have some scalability improvement, suggests that a single server SHALL be a scalability bottleneck, even if just RW transactions are routed to it. NFR2. There SHALL not be any performance impact on replication throughput by enabling/disabling
High Level Specification ======================== Motivation ---------- The MySQL Group Replication plugin brings multi-master update everywhere replication support for MySQL. This means that a transaction first executes optimistically in a member of the group and on commit it is synchronized with the group to verify that it does not conflict with other transactions that had been executing concurrently on other members. If it conflicts, then the transaction is rolled back. The problem with the MySQL Server is that it has a lot of shortcomings: - DDL is not transactional, thence it cannot be certified or rolled back; - Architecture of the server puts the storage engine too far away from the execution and replication layers, meaning that the SE does things that upper layers may be oblivious to (e.g., foreign keys cascading, or different lock sets, ...); - Hierarchical locking (MDL and storage engine locking). These shortcomings undermine the ability to do completely safe multi-master update everywhere, unless the user is told a set of guidelines to cope with some restrictions. This works fine for a lot of users. However, some users don't really care about the update everywhere feature, but are interested in the split brain protection, commit quorum and membership framework. Running the write workload through a single server helps mitigating the above issues, since execution and coordination is performed on a single server. It will not help with the fact that DDL is not transactional, but helps with the fact that one confines to a single server the inconsistencies and hidden interactions between storage engine and server layers. This makes things slightly safer, since transactions are synchronized on a single server and thus there will not be cross-server execution conflicts to detect and deal with. In addition to what was outlined above, other internal project wants to use Group Replication in this single master mode. And this is our main motivation driver. Approach -------- The design decision is that we implement this at the Group Replication layer only in this worklog. This is the minimum one can do to implement this functionality. In the optimal case, in a single leader mode, we could bypass parts of the replication pipeline and even optimize XCom to have a single proposer instead of multiple proposers. However, that is huge surgery and needs to be done in a step-wise approach. Again, this worklog is just to have a single leader mode in group replication from the get go. Moreover, we are not implementing any new option for the user to trigger an election in this worklog or specifying which is the primary directly since this would bloat the work. Also, if we are thinking big, we should have election policies, that could be modified and then a way to trigger an election. This means that we need a proper metadata model in the group -- perhaps with persistence. This is already quite a bit of work to do and that would present high risk. There is nothing preventing us from doing it at a later stage though. Now, the details! To implement a single master mode, the approach is the following: SERVER ROLES. We will start by defining a role for servers in the group. We start by classifying servers as PRIMARY, SECONDARY. At any logical point in time, there is only one PRIMARY and zero or more SECONDARIES. CONFIGURATION. We will add a couple of options to the group replication plugin. One that enables single master mode and another that disables multi-master update everywhere checks. These options need to be set on all servers in the group. It enables/disables automatic functionality such as leader election. LEADER ELECTION. When there is a view change, if the PRIMARY goes away, a new PRIMARY needs to be elected from the SECONDARIES pool. Only SECONDARIES that are in the ONLINE recovery state have a chance to be elected. If there are no valid SECONDARY candidates no candidate is elected. READONLY STATE. When the single master option is turned on, then a single server, the PRIMARY, will have super_read_only disabled. All other servers the SECONDARIES will have super_read_only enabled. Once a SECONDARY is promoted to PRIMARY, then it disables super_read_only. User Interface -------------- There will be a couple of new system variable: - group_replication_single_primary_mode - group_replication_enforce_update_everywhere_checks and a new status variable** (group_replication_member_role). SYS_VAR: group_replication_single_primary_mode ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ These are the proposed system variables: NAME: single_primary_mode TYPE: BOOL DEFAULT: TRUE DYNAMIC: YES, but can only be changed while GR is stopped. SCOPE: GLOBAL DESCRIPTION: Activates the automatic election. This means that there will only be a primary server that takes incoming updates and all others in the group will be secondaries. On primary failures, the group will elect the new primary from the pool of secondaries. Default: TRUE. SYS_VAR: group_replication_enforce_update_everywhere_checks ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ NAME: enforce_update_everywhere_checks TYPE: BOOL DEFAULT: FALSE DYNAMIC: YES, but can only be changed while GR is stopped. SCOPE: GLOBAL DESCRIPTION: Enable/Disable strict consistency checks for multi-master update everywhere. Default: FALSE. System Variables Behavior ~~~~~~~~~~~~~~~~~~~~~~~~~ These variable options can be combined in the following way: E - enforce_update_everywhere_checks S - single_primary_mode |--------+--------------+--------------+--------------+--------------| | | S: OFF -> ON | S: ON -> OFF | E: OFF -> ON | E: ON -> OFF | |--------+--------------+--------------+--------------+--------------| | E: ON | ERROR | IMPOSSIBLE | ---- | ---- | | E: OFF | SUCCESS | SUCCESS | ---- | ---- | |--------+--------------+--------------+--------------+--------------| | S: ON | ---- | ---- | ERROR | IMPOSSIBLE | | S: OFF | ---- | ---- | SUCCESS | SUCCESS | |--------+--------------+--------------+--------------+--------------| In other words: - single primary mode cannot be set while enforcing update every where checks. - update everywhere checks cannot be turned on while in single primary mode. In addition, these options cannot be set while GR is running. If the user wants to reset them, he needs to stop GR, reset them and the restart GR. Finally, all servers in the group need to have been configured the same way, otherwise, a server that has a different configuration will not be allowed to join the group. The motivation for having a separate enforce_update_everywhere_checks option is to allow DevOps the flexibility to choose whether he wants to implement automatic primary promotion, or rely on existing infrastructure that is maintained by himself to do it. He may want to have all servers as PRIMARY but always ensure that he writes to a single PRIMARY at the application level. In other words, he may wish that some other layer takes the responsibility for picking the PRIMARY server, while at the same time not being held back by the strictness of an update everywhere mode and its checks. STATUS_VAR: group_replication_primary_member ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ This is the proposed status variable: NAME: group_replication_primary_member TYPE: String DEFAULT: "" SCOPE: GLOBAL DESCRIPTION: Shows the UUID of the primary member in the current view. Special cases: If member is on OFFLINE state or group_replication_single_primary_mode= FALSE, its value will be "" (empty). If member is on ERROR state, its value will be UNDEFINED. Primary Election ================ On view changes, a server is selected as primary for a given view when: 1. It was already the primary on the previous view, 2. Or it is the first ONLINE member on the new view after ordering all servers in lexicographycal order by using their UUIDs. This is performed using the std::sort function. Considerations ============== Security -------- - There are no security concerns we can think of. Upgrade/Downgrade ----------------- There are no update/downgrade concerns other than the fact that on downgrades, if the version one is downgrading to does not have this functionality, then the user cannot expect that this functionality works. Interfaces ---------- Interfacing with this new functionality is done through the regular system variables and/or status variable frameworks that we have in place. Logs go through the error log. ** Regarding the status variable, we can of course add a new column to the P_S table replication_group_members. In that column we can add text in json format with meta information. This information would initially contain the role of the server but in the future could contain more information. For example: --- a/scripts/mysql_system_tables.sql +++ b/scripts/mysql_system_tables.sql @@ -2251,7 +2251,8 @@ SET @cmd="CREATE TABLE performance_schema.replication_group_members(" "MEMBER_ID CHAR(36) collate utf8_bin not null," "MEMBER_HOST CHAR(60) collate utf8_bin not null," "MEMBER_PORT INTEGER," - "MEMBER_STATE CHAR(64) collate utf8_bin not null" + "MEMBER_STATE CHAR(64) collate utf8_bin not null," + "MEMBER_META TEXT not null" ") ENGINE=PERFORMANCE_SCHEMA;"; Then exporting the role would be a simple SELECT on this P_S table instead of selecting the status variable from the global_status P_S table. Moreover, this would display the roles of the other servers in the group as well instead of just the one we are querying. If not implemented as part of this worklog, this should be a follow-up work. DevOps Friendly and Non-intrusiveness ------------------------------------- This makes the system more autonomic, which can be a good thing or a bad thing. Elections and super_read_only modes are set automatically and some users prefer that this is manual or controlled by their infrastructure, such as their monitoring tools, etc. See the user interface section for an explanation on how the update everywhere checks can be lifted with group_replication_update_everywhere_extra_checks system variable. Integration with other Tools/Components --------------------------------------- Other tools or components can just check the system variable super_read_only and group_replication_member_current_role to figure out what is the role that a server is playing at any given time.
Low Level Design ================ On the low level design we have the following components that need to be changed: 1. Introduce two new plugin variables @@ -2338,6 +2343,25 @@ static MYSQL_SYSVAR_STR( NULL, /* update func*/ IP_WHITELIST_DEFAULT); /* default*/ +static MYSQL_SYSVAR_BOOL( + single_primary_mode, /* name */ + single_primary_mode_var, /* var */ + PLUGIN_VAR_OPCMDARG, /* optional var */ + "Whether the group should allow only a single leader or not.", + NULL, /* check func*/ + NULL, /* update func*/ + 0); /* default*/ + +static MYSQL_SYSVAR_BOOL( + update_everywhere_extra_checks, /* name */ + update_everywhere_extra_checks_var, /* var */ + PLUGIN_VAR_OPCMDARG, /* optional var */ + "Enable/Disable strict consistency checks for multi-master " + "update everywhere. Default: ON", + NULL, /* check func*/ + NULL, /* update func*/ + 1); /* default*/ + static SYS_VAR* group_replication_system_vars[]= { MYSQL_SYSVAR(group_name), MYSQL_SYSVAR(start_on_boot), @@ -2366,9 +2390,46 @@ static SYS_VAR* group_replication_system_vars[]= { MYSQL_SYSVAR(gtid_assignment_block_size), MYSQL_SYSVAR(ssl_mode), MYSQL_SYSVAR(ip_whitelist), + MYSQL_SYSVAR(single_primary_mode), + MYSQL_SYSVAR(update_everywhere_extra_checks), NULL, }; 2. Introduce a plugin status variable + +static int +show_member_role(MYSQL_THD thd, SHOW_VAR *var, char *buff) +{ + var->type= SHOW_CHAR; + var->value= NULL; + + if (group_member_mgr && local_member_info) + { + std::string uuid= local_member_info->get_uuid(); + Group_member_info *info= + group_member_mgr->get_group_member_info(uuid); + Group_member_info::Group_member_role role= + info ? info->get_role() : + Group_member_info::MEMBER_ROLE_END; + const char* str= + Group_member_info::get_member_role_string(role); + memcpy(buff, str, strlen(str)); + buff[strlen(str)]= '\0'; + var->value= buff; + delete info; + } + + return 0; +} + +static SHOW_VAR group_replication_status_vars[]= +{ + {"group_replication_member_current_role", + (char*) &show_member_role, + SHOW_FUNC, SHOW_SCOPE_GLOBAL}, + {NULL, NULL, SHOW_LONG, SHOW_SCOPE_GLOBAL}, +}; + + mysql_declare_plugin(group_replication_plugin) { MYSQL_GROUP_REPLICATION_PLUGIN, @@ -2380,7 +2441,7 @@ mysql_declare_plugin(group_replication_plugin) plugin_group_replication_init, /* Plugin Init */ plugin_group_replication_deinit, /* Plugin Deinit */ 0x0008, /* Plugin Version: major.minor */ - NULL, /* status variables */ + group_replication_status_vars, /* status variables */ group_replication_system_vars, /* system variables */ NULL, /* config options */ 0, /* flags */ 3. Extend member information that is exchanged with this additional metadata. 3.1 Deploy an enumeration capturing the member roles to the Group_member_info class + /* + @enum Group_member_role + + This enumeration describes all the roles a server can have. + */ + typedef enum + { + MEMBER_ROLE_PRIMARY= 1, + MEMBER_ROLE_SECONDARY, + MEMBER_ROLE_END + } Group_member_role; + 3.2 Add a new field to the Group_member_info class that holds the value of the current member role. Add getters and setters. @@ -165,6 +182,11 @@ public: Group_member_status get_recovery_status(); /** + @return the member role type code. + */ + Group_member_role get_role(); + + /** @return the member plugin version */ const Member_version& get_member_version(); @@ -206,6 +228,13 @@ public: std::string& retrieve_gtids); /** + Updates this object member role. + + @param[in] new_role the role to set. + */ + void set_role(Group_member_role new_role); + + /** Returns a textual representation of this object @return an std::string with the representation @@ -217,6 +246,12 @@ public: */ static const char* get_member_status_string(Group_member_status status); + + /** + @return the member role as string + */ + static const char* get_member_role_string(Group_member_role role); + /** Redefinition of operate == and <. They operate upon the uuid */ @@ -239,6 +274,7 @@ private: std::string retrieved_gtid_set; uint write_set_extraction_algorithm; ulonglong gtid_assignment_block_size; + Group_member_role m_role; }; 3.3 Add a new field to the enum_payload_item_type Group_member_info class --- a/rapid/plugin/group_replication/include/member_info.h +++ b/rapid/plugin/group_replication/include/member_info.h @@ -79,8 +79,11 @@ public: // Length of the payload item: 8 bytes PIT_GTID_ASSIGNMENT_BLOCK_SIZE= 10, + // length of the role item: 1 byte + PIT_MEMBER_ROLE= 11, + // No valid type codes can appear after this one. - PIT_MAX= 11 + PIT_MAX= 12 }; 3.4 Add an update member role the the Group_member_info manager class. @@ -327,6 +363,15 @@ public: virtual void update_gtid_sets(const std::string& uuid, std::string& gtid_executed, std::string& gtid_retrieved)= 0; + /** + Updates the role of a single member + + @param[in] uuid member uuid + @param[in] new_role role to change to + */ + virtual void + update_member_role(std::string& uuid, + Group_member_info::Group_member_role new_role)= 0; 3.5 Add encoding/decoding support for this new field in the Group_member_info message: @@ -113,6 +116,9 @@ Group_member_info::encode_payload(std::vector* buffer) encode_payload_item_int8(buffer, PIT_GTID_ASSIGNMENT_BLOCK_SIZE, gtid_assignment_block_size); + char role_aux= (uchar)m_role; + encode_payload_item_char(buffer, PIT_MEMBER_ROLE, role_aux); + DBUG_VOID_RETURN; } @@ -182,6 +188,12 @@ Group_member_info::decode_payload(const unsigned char* buffer, &payload_item_type, >id_assignment_block_size); + unsigned char role_aux= 0; + decode_payload_item_char(&slider, + &payload_item_type, + &role_aux); + m_role= (Group_member_role)role_aux; + DBUG_VOID_RETURN; } 4. Create a deterministic leader election function that promotes a server to PRIMARY on view changes (or when recovery finishes). The reason to run the leader election when a member states that it has finished recovery successfully, is because a leader election round may have been waiting for a server to become online to elect it as a PRIMARY (for example, there are no other serves with recovery state set to ONLINE). The function shall be put in Plugin_gcs_event_handler class, since then it can be called from the recovery or view change handlers. void Plugin_gcs_events_handler::handle_leader_election_if_needed() const { ... } This function needs to be deterministic, i.e., its output be the same regardless of the member. All members will run this function at some point in time to pick the next leader. An alternative is to rely on the members uuids and always sort it in lexicographical order before picking the first in the list. +static bool comparator_group_member_info(Group_member_info *m1, + Group_member_info *m2) +{ + std::string uuid1= m1->get_uuid(); + std::string uuid2= m2->get_uuid(); + + return uuid1.compare(uuid2) < 0; +} Then on the handle_leader_election_if_needed: + group_member_mgr->get_all_members(); + + std::vector ::iterator it; + std::sort(all_members_info->begin(), all_members_info->end(), + comparator_group_member_info); + Also, this function shall not be complex, all it needs to do is: 1. check if the current member is leaving In this case do nothing. 2. iterate over the set of active members in the group and check if there is a primary server already and it is ONLINE 3. If there was no primary server - pick one from the list that is ONLINE if the next primary is ourselves, update role to PRIMARY and set super_read_only to OFF. 4. If there was a primary server - set super_read_only to ON. - Update role for that server in the group_member_info to PRIMARY 5. Deploy the call to the handle_leader_election_if_needed calls in the recovery and view change handlers: @@ -196,6 +198,10 @@ Plugin_gcs_events_handler::handle_recovery_message(const Gcs_message& message) c recovery_module->update_recovery_process(false, false); } } + + // check if we were waiting for some server to recover to + // elect a new leader + this->handle_leader_election_if_needed(); } void @@ -225,11 +231,159 @@ Plugin_gcs_events_handler::on_view_changed(const Gcs_view& new_view, //Handle joining members this->handle_joining_members(new_view, is_joining, is_leaving); + // Handle leader election if needed + this->handle_leader_election_if_needed(); + //Signal that the injected view was delivered if (view_change_notifier->is_injected_view_modification()) view_change_notifier->end_view_modification(); } 6. Add functions to force setting and unsetting of super_read_only mode in the read_mode_handler. These should be called from the leader election function. Note that that function runs at the end of recovery, thence, when a member joins the group and recovery ends, it shall set itself to read_only if it was not selected as the PRIMARY. --- a/rapid/plugin/group_replication/src/read_mode_handler.cc +++ b/rapid/plugin/group_replication/src/read_mode_handler.cc @@ -99,3 +99,37 @@ reset_super_read_only_mode(Sql_service_command *command_interface) DBUG_RETURN(error); } + +long Read_mode_handler::force_set_super_read_only_mode(Sql_service_command *command_interface) +{ + DBUG_ENTER("force_set_super_read_only_mode"); + + long error =0; + long server_super_read_only_query= + command_interface->get_server_super_read_only(); + + if (server_super_read_only_query == 0) + error = command_interface->set_super_read_only(); + + DBUG_RETURN(error); +} + +long Read_mode_handler::force_unset_super_read_only_mode(Sql_service_command *command_interface) +{ + DBUG_ENTER("force_unset_super_read_only_mode"); + + long error =0; + long server_super_read_only_query= + command_interface->get_server_super_read_only(); + + /* + If the server had no read mode active we set the read_only to 0, + this resets both read_only and super_read_only. + If the server had the read mode active we reset the super_read_only to 0. + If super_read_only was active, nothing happens. + */ + if (server_super_read_only_query > 0) + error = command_interface->reset_super_read_only(); + + DBUG_RETURN(error); +} 7. Add MTR test cases for leader election and for the new options.
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.