WL#1720: Semi-synchronous replication
Affects: Server-5.5
—
Status: Complete
SUMMARY
- The interface for semi-synchronous replication was done in WL#4398.
- This WL#1720 will be closed when the components have been pushed to the
  main tree and are part of the server releases.

RATIONALE
Get HA by having the slave acknowledge a transaction before it is committed.
At COMMIT time on the master, wait for an ACK from the slave before returning
to the user. This ensures that the data always exists in two places.

DESIGN LEFT TO DO
1. How to configure the master? Especially if there are multiple slaves.
2. Ensure that the solution is general enough, e.g. any engine can use it.
3. Check if there can be a clean interface to the "replicator".

DESCRIPTION
This is a feature request from a customer, who calls it "semi-synchronous
replication":
- user sends COMMIT to master
- master writes the transaction to the binlog
- master waits for the slave to say "I have received the transaction and
  saved it into my relay log" (ACK of the slave I/O thread) or the same plus
  "I have executed the transaction" (ACK of the slave SQL thread).

More precisely, the thread which writes to the binlog waits for a global
variable "slave_pos" (and "slave_binlog") to be updated (a condition should be
signaled), and the Binlog_dump thread is responsible for updating this global
variable: the Binlog_dump thread sends the events to the slave, waits for an
"OK" packet from the slave, then broadcasts the condition.

Ideally, for multi-statement transactions, the Binlog_dump thread should not
wait for a slave "OK" for every event, but only for the events which really
update the slave's data: for InnoDB, the COMMIT; for MyISAM, any statement. In
other words, all events which get written directly to the on-disk binary log,
plus the COMMIT event. "This event requires an OK" could be a flag in the
event.

If the master does not receive an OK before a timeout is exceeded, the master
switches this slave to asynchronous replication (and prints a message to the
error log "slave XXX was switched to async rep (ie.
no more waits on COMMIT) because a timeout was exceeded for binary log YYY
position ZZZ"; when we have transaction IDs later we can print them too).

The DBA should be able to request semi-sync or async replication on the fly.
SHOW SLAVE HOSTS would show whether each slave is doing semi-sync or async
replication and why (timeout exceeded, or DBA requested this mode).

The customer has a more urgent need for "ACK that it's in the relay log" than
for "ACK that it's executed by the slave", and would agree to sponsor it. The
customer also suggested that, if the master has more than one slave, the
master wait for a configurable number of slaves to have ACKed the transaction.
The customer does not need this extension immediately.

This new feature is somewhat similar to 2-phase commit: the master waits for
the slave's commit before saying "ok" to the user. However, it does not roll
back if the slave failed, so it may be a better fit for critical sites which
need always-ongoing operations.

Guilhem should continue the discussion with the customer, and with Brian, to
clarify and amend the above description. Monty presently agrees _on_the_idea_
of this type of replication.

References:
- Google implemented a version of this feature:
  http://mysqlha.blogspot.com/2007/05/semi-sync-replication-for-mysql-5037.html

CAN WE USE THIS INTERFACE?
==========================
We should consider whether we can use (a superset/subset of) this interface
for semi-sync. The text below is a draft.

Components
----------
MySQL      - MySQL Server
Replicator - Library that is linked with the MySQL server

Structure
---------
+--------------------------------+
| MySQL                          |
+--------------------------------+
| Service and Plugin Interfaces  |
+--------------------------------+
| Replicator                     |
+--------------------------------+

Distributed concurrency control
===============================
A separate thread in the Replicator checks whether the prioritized transaction
and the local transactions conflict. If they do, the local transaction is
rolled back.
This proposal supports only an optimistic approach. For the conservative
approach, we would also need a notifier hook at the beginning of the
transaction.

MySQL - Replicator Interface
----------------------------
The interface has four main calls:

1) Commit requestor: request_commit(trx_id) (MySQL -> Replicator)

Before MySQL commits a transaction, it sends a request to the Replicator with
the following information:
a) Transaction Id

The Replicator replies with Yes or No, indicating whether the transaction is
allowed to commit. If "No", the transaction is rolled back. If "Yes", MySQL
can commit the transaction.

This should be implemented in handler.cc:ha_commit_trans(), after the 2PC
prepare and right before the call to ha_commit_one_phase().

Timeline:

       1    2    3    4    5    6    7    8
   ----|-------------|-------------|-------|--------|---------|--->
    prepare          |             ^     Store    commit    commit
                     |             |      Xid      SE1       SE2
                     v             |
                  Commit        Approval
                             (Set seq no)

NOTE Recovery: If the MySQL server crashes between 4 and 5, the Replicator
will send the transaction to other nodes, while it is not being recovered by
this MySQL server. This means that the Replicator needs to ask the server for
its last executed xid, and then send the lost local transactions back to the
MySQL server for reapplication, as part of the recovery protocol.

NOTE: The DBMS is not allowed to fail after step 5. The global commit has
already happened.

Sample code:

  /* Request commit from replicator */
  if (!replicator->request_commit()) {          /* NEW */
    ha_rollback_trans(thd, all);                /* NEW */
    error= 1;                                   /* NEW */
    goto end;                                   /* NEW */
  }                                             /* NEW */
  error=ha_commit_one_phase(thd, all) ? (cookie ?
                                         2 : 1) : 0;
  DBUG_EXECUTE_IF("crash_commit_before_unlog", abort(););
  if (cookie)
    tc_log->unlog(cookie, xid);
  DBUG_EXECUTE_IF("crash_commit_after", abort(););
end:
  if (is_real_trans)
    start_waiting_global_read_lock(thd);
  /* Report commit completed to replicator */
  replicator->report_commit();                  /* NEW */

2) Commit report (MySQL -> Replicator)

After MySQL has committed the transaction, it must report the transaction id
to the Replicator. See the code above.

3) Abort report (MySQL -> Replicator)

Add a call in: handler.cc:ha_rollback_trans(THD *thd, bool all)

4) Master: Report rows: log_subscriber::report(event) (MySQL -> Replicator)

The binlog event will be sent to the subscriber.

Extractor
---------
Every row that is applied in the handler interface is caught in the same way
as row-based replication does it, and reported to the Replicator:
a) Row being changed/binlog event
b) Transaction Id

Alternatives:
a) Trigger level: Catch rows
b) Handler level: Catch rows
c) Binlogging level: Catch binlog events
   (log.cc:MYSQL_BIN_LOG::write(IO_CACHE*))

DECISION: Try with c.

Serialization
-------------
Will be implemented using the log_event.cc:write_data_header() functions,
rewritten so that they keep the data in a buffer instead of writing it to a
file.

5) Transaction start (MySQL -> Replicator)

Called when a transaction is started.
a) Transaction id

6) Rollback request (Replicator -> MySQL)

The Replicator can at any time call MySQL with a request to roll back an
ongoing local transaction. MySQL then needs to roll it back.

7) Slave Executor

Alternatives:
a) SQL level - outside client
b) SQL level - mysql_parse()
c) Binlog events
d) Individual rows, executed at handler level

DECISION: Try with c.

FILTERING
---------
All events except rows events (3 types) and table map events are discarded.

EXECUTION
---------
This would probably mean taking the
sql_binlog.cc:mysql_client_binlog_statement(THD* thd) function, stripping away
the base64 handling, and using the same method to execute events.

8) Recovery function si::get_last_executed_transaction()

At recovery time the Replicator needs to know the last executed transaction,
so that it can replay any local transaction that was lost if the server failed
between steps 4 and 5 above, and any remote transaction that was not yet
applied. This is probably some functionality in log.cc for XA recovery.
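
The commit-time calls 1) to 3) above can be pictured with a small sketch. This
is only an illustration under stated assumptions: the class `Replicator`, the
toy `RecordingReplicator` and the function `commit_transaction` are
hypothetical names, not part of the server, and the sketch reports an abort on
the "No" path rather than reproducing the sample code's `goto end` flow.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical interface mirroring calls 1) to 3); not the server's real API.
class Replicator {
 public:
  virtual ~Replicator() = default;
  virtual bool request_commit(std::uint64_t trx_id) = 0;  // 1) Yes/No reply
  virtual void report_commit(std::uint64_t trx_id) = 0;   // 2) commit report
  virtual void report_abort(std::uint64_t trx_id) = 0;    // 3) abort report
};

// Toy replicator for illustration: it refuses odd transaction ids and
// records every report it receives.
class RecordingReplicator : public Replicator {
 public:
  bool request_commit(std::uint64_t trx_id) override {
    return trx_id % 2 == 0;
  }
  void report_commit(std::uint64_t trx_id) override {
    committed.push_back(trx_id);
  }
  void report_abort(std::uint64_t trx_id) override {
    aborted.push_back(trx_id);
  }
  std::vector<std::uint64_t> committed;
  std::vector<std::uint64_t> aborted;
};

enum class Outcome { Committed, RolledBack };

// Stand-in for the point in ha_commit_trans() after the 2PC prepare and
// before the storage engine commit.
Outcome commit_transaction(Replicator &replicator, std::uint64_t trx_id) {
  if (!replicator.request_commit(trx_id)) {
    // "No": roll the transaction back and tell the replicator.
    replicator.report_abort(trx_id);
    return Outcome::RolledBack;
  }
  // "Yes": the storage engine commit would happen here.
  replicator.report_commit(trx_id);
  return Outcome::Committed;
}
```

With the toy replicator, `commit_transaction(r, 2)` commits and
`commit_transaction(r, 3)` is rolled back, and each outcome is reported
exactly once.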
Semi-synchronous replication has two components: the semi-sync master plugin
and the semi-sync slave plugin.

The BINLOG_DUMP command has a flag argument, which is zero by default. When
the semi-sync plugin is loaded and enabled on the slave, this flag is set to
BINLOG_SEMI_SYNC (0x0002) to request semi-synchronous replication behaviour
from the master. If semi-sync is enabled on the master, the master checks this
flag before it starts dumping the binlog.

Semi-sync is enabled on the master by setting the global variable
`rpl_semi_sync_enabled' to 1, and disabled by setting it to 0. There is also a
status variable `Rpl_semi_sync_status' that indicates whether synchronous
replication is on or off. Synchronous replication is on by default when
semi-sync is enabled on the master. It is turned off when the master fails to
get an ack from the slave before the timeout (rpl_semi_sync_timeout) expires,
and is turned on again when the slave catches up.

When semi-sync is enabled on the master (rpl_semi_sync_enabled = 1) and
synchronous replication is on (Rpl_semi_sync_status = ON), the master waits
for an ack at the end of each transaction. The master only waits for an ack
for the last event of the transaction.

To tell the slave to issue a reply for the last event of a transaction, the
master reserves two extra bytes in the event packet header. The first byte is
a magic number (0xEF); the second byte is a sync flag, which is set when the
event is the last event of a transaction. The slave checks the sync flag to
decide whether it needs to send an ack for the event received from the master.

The master will not wait forever for the ack from the slave: it gives up if it
does not get an ack within a given period of time (rpl_semi_sync_timeout), and
then turns off synchronous replication. The master does not wait for an ack
for transactions while synchronous replication is turned off. When the slave
catches up, the master automatically turns synchronous replication on again if
semi-sync is enabled.
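
The on/off behaviour described above can be summarised as a small state
machine. The following is a minimal sketch, not the plugin's code: the class
`SemiSyncMasterState` and its methods are hypothetical, binlog positions are
reduced to a single integer, and "the slave catches up" is assumed to mean
that the slave has acked everything the master has written so far.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of rpl_semi_sync_enabled / Rpl_semi_sync_status.
class SemiSyncMasterState {
 public:
  // Models setting rpl_semi_sync_enabled; sync is on by default when enabled.
  void set_enabled(bool on) {
    enabled_ = on;
    status_on_ = on;
  }

  // Models Rpl_semi_sync_status.
  bool status_on() const { return status_on_; }

  // Does a committing thread have to wait for an ack?
  bool must_wait() const { return enabled_ && status_on_; }

  // The last event of a transaction was written at binlog position pos.
  void transaction_written(std::uint64_t pos) {
    if (pos > last_written_) last_written_ = pos;
  }

  // No ack arrived within rpl_semi_sync_timeout: fall back to async.
  void ack_timed_out() { status_on_ = false; }

  // An ack for position pos arrived from the slave.
  void ack_received(std::uint64_t pos) {
    if (pos > last_acked_) last_acked_ = pos;
    // Assumed catch-up rule: the slave has acked everything written so far.
    if (enabled_ && !status_on_ && last_acked_ >= last_written_)
      status_on_ = true;
  }

 private:
  bool enabled_ = false;
  bool status_on_ = false;
  std::uint64_t last_written_ = 0;
  std::uint64_t last_acked_ = 0;
};
```

In this model a timeout turns the status off, later commits do not wait, and
the status comes back on only once an ack covers the newest written position.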
Semi-sync master
================
The semi-sync master component includes class ActiveTranx and class
ReplSemiSyncMaster. The semi-sync master plugin implements the following
observers: Trans_observer, Binlog_storage_observer, Binlog_transmit_observer.

ActiveTranx
-----------
This class maintains a list of active transactions which are waiting for an
ack from the slave.

ReplSemiSyncMaster
------------------
This is the main class of the semi-sync master component. It maintains the
list of active transactions that need an ack from the slave, and receives acks
from the slave.

`updateSyncHeader'
  Set the sync flag in the extra header if the binlog filename and position
  of the event being sent out are recorded in the active transaction list.

`writeTranxInBinlog'
  Create a new active transaction node and record the binlog filename and
  position, insert the new node into the active list, and update the largest
  position the master is waiting for a reply on.

`readSlaveReply'
  Called by the dump thread to wait for and read an ack from the slave, and to
  resume transactions when it gets a reply.

`commitTrx'
  Called by the SQL transaction thread to wait for an ack from the slave. It
  resumes when the dump thread gets a reply for this transaction, or on
  timeout.

Transaction observer
--------------------
`after_commit'
  The master waits for an ack from the slave after committing to the storage
  engines.

Binlog storage observer
-----------------------
`after_flush'
  Create a new active transaction and record the binlog file name and
  position to wait for.

Binlog transmit observer
------------------------
`transmit_start'
  Check the BINLOG_DUMP flag to see if the slave requests semi-synchronous
  replication. Increase `rpl_semi_sync_clients' if it does.

`transmit_stop'
  Decrease `rpl_semi_sync_clients' if semi-sync is enabled.

`reserve_header'
  Reserve two extra bytes in the event packet header if the slave requested
  semi-synchronous replication.
  The first byte is the magic number, 0xEF.
  The second byte is the sync flag: 0 by default, 1 if the event needs a
  reply.

`before_send_event'
  Check if the position of the event being sent out is recorded in the active
  transaction list maintained by the master, and set the sync flag in the
  reserved header if it is.

`after_send_event'
  If the sync flag was set in `before_send_event', wait for the ack from the
  slave here. Wait for at most `rpl_semi_sync_timeout' milliseconds.

`after_reset_master'
  Reset the semi-sync master status variables.

Semi-sync slave
===============
The semi-sync slave plugin implements the Binlog_relay_IO_observer.

ReplSemiSyncSlave
-----------------
Main class of the semi-sync slave component.

`slaveReadSyncHeader'
  Read the semi-sync extra header, and return the sync flag and the payload of
  the event with the extra header removed.

`slaveReply'
  Send a reply to the master about the event just received.

Binlog relay IO observer
------------------------
`thread_start'
  Set `rpl_semi_sync_slave_status' to ON.

`thread_stop'
  Set `rpl_semi_sync_slave_status' to OFF.

`before_request_transmit'
  Set the BINLOG_SEMI_SYNC bit in the argument of the BINLOG_DUMP command sent
  to the master.

`after_read_event'
  Check the magic number and sync flag in the event packet, and remove the
  extra bytes from the header.

`after_queue_event'
  If the sync flag is set for this event packet, send a reply to the master.

`after_reset_slave'
  Nothing currently.

Reference
=========
The design of the replication interface is available here:
http://forge.mysql.com/wiki/ReplicationFeatures/ReplicationInterface
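
As an illustration of the two-byte header handled by `reserve_header',
`before_send_event' and `after_read_event', here is a minimal sketch. It
assumes the two bytes sit directly in front of the event bytes, which the text
does not state, and the names `wrap_event` and `unwrap_event` are
hypothetical rather than functions of the plugin.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef std::vector<unsigned char> Packet;

const unsigned char kMagic = 0xEF;     // magic number given in the design text
const unsigned char kSyncFlag = 0x01;  // 1 if the event needs a reply
const std::size_t kHeaderLen = 2;      // two reserved bytes

// Master side: prepend the reserved header and set the sync flag when the
// event is the last one of a transaction that is waiting for an ack.
Packet wrap_event(const Packet &event, bool need_ack) {
  Packet out;
  out.reserve(kHeaderLen + event.size());
  out.push_back(kMagic);
  out.push_back(need_ack ? kSyncFlag : static_cast<unsigned char>(0));
  out.insert(out.end(), event.begin(), event.end());
  return out;
}

// Slave side: check the magic number, read the sync flag, and strip the
// header. Returns false if the packet does not carry a semi-sync header.
bool unwrap_event(const Packet &packet, bool *need_ack, Packet *event) {
  if (packet.size() < kHeaderLen || packet[0] != kMagic) return false;
  *need_ack = (packet[1] & kSyncFlag) != 0;
  event->assign(packet.begin() + static_cast<std::ptrdiff_t>(kHeaderLen),
                packet.end());
  return true;
}
```

A slave following this sketch would queue the stripped event into the relay
log first and send its reply afterwards only when `need_ack` came back true,
matching the `after_queue_event' step.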
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.