WL#5721: Refactor replication dump thread
Affects: Server-Prototype Only
—
Status: Complete
The replication master works as follows. Each slave connects in a client
thread called a dump thread. Each dump thread reads binlog events from the
binlog and sends them to the associated slave.

Each dump thread holds a mutex lock while reading events from the binlog.
This lock prevents other client threads from writing the binlog and prevents
other dump threads from reading the binlog. The lock may be held for a
significant time span because the dump thread is doing I/O.

(For reference, the current implementation is something like this, grossly
simplified:

  current binlog write design:
    acquire LOCK_log
    write log event to binlog
    release LOCK_log
    signal update

  current dump thread design:
    while client is not killed:
      acquire LOCK_log
      read event from binlog
      release LOCK_log
      if EOF was reached in the previous read:
        acquire LOCK_log
        wait for update signal
        read event from binlog
        release LOCK_log
)

Moreover, there are unnecessarily many corner cases in this design, making
the code difficult to maintain (all code for processing events is
duplicated).

This worklog aims at simplifying the design of the dump thread. The dump
thread should only hold a lock for a very short time, while reading the
position up to which the binary log has been written. This is expected to
have the following benefits:

1. Performance improvement. We will not do any I/O while holding locks. This
   means we effectively remove almost all interference between dump threads
   and client threads.

   Currently, this is not expected to give any significant improvement.
   However, in the future we plan to implement parallel writing of the
   binlog. After we have implemented parallel binlog writing, the present
   worklog may give a significant performance improvement.

2. Better code with fewer corner cases. The current design requires that we
   handle some corner cases where a dump thread may read a partially written
   event from the current binary log when the disk becomes full. This makes
   the code unnecessarily complex.
It would be better if readers and writers synchronized on a global variable
that holds the position of the end of the last completely written event in
the binary log.

(For reference, the new design is something like this, grossly simplified:

  binlog write design:
    write event to binlog
    acquire lock_binlog_end
    binlog_end = current write position
    release lock_binlog_end
    signal update

  dump thread design:
    end_position = 0
    while client is not killed:
      if current read position == end_position:
        acquire lock_binlog_end
        while end_position == binlog_end and client is not killed:
          wait for update signal
        end_position = binlog_end
        release lock_binlog_end
      if client is killed: break
      read event from binlog
)
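The reader/writer synchronization above can be sketched in standard C++. This is only an illustration under stated assumptions, not the server code: BinlogSketch, append(), dump(), kill() and run_demo() are hypothetical names, the "binlog" is a pre-sized in-memory vector written by a single writer, and std::mutex / std::condition_variable stand in for the server's own mutex and condition primitives.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the binary log (single writer, many readers).
class BinlogSketch {
 public:
  explicit BinlogSketch(std::size_t capacity) : slots_(capacity) {}

  // Writer: the event is written without any lock; only publishing the new
  // end position is protected by lock_binlog_end_.
  void append(int event) {
    assert(write_pos_ < slots_.size());
    slots_[write_pos_++] = event;  // "write event to binlog"
    {
      std::lock_guard<std::mutex> guard(lock_binlog_end_);
      binlog_end_ = write_pos_;
    }
    update_.notify_all();  // "signal update"
  }

  // Marks the client as killed and wakes up waiting readers.
  void kill() {
    {
      std::lock_guard<std::mutex> guard(lock_binlog_end_);
      killed_ = true;
    }
    update_.notify_all();
  }

  // Reader (dump thread): reads at most max_events events and returns them.
  std::vector<int> dump(std::size_t max_events) {
    std::vector<int> sent;
    std::size_t read_pos = 0;
    std::size_t end_position = 0;
    while (!killed_ && sent.size() < max_events) {
      if (read_pos == end_position) {
        // The lock is held only while waiting for and copying the position.
        std::unique_lock<std::mutex> guard(lock_binlog_end_);
        update_.wait(guard,
                     [&] { return binlog_end_ != end_position || killed_; });
        end_position = binlog_end_;
      }
      if (killed_) break;
      sent.push_back(slots_[read_pos++]);  // "read event", no lock held
    }
    return sent;
  }

 private:
  std::vector<int> slots_;      // pre-sized, so it never reallocates
  std::size_t write_pos_ = 0;   // touched by the single writer only
  std::size_t binlog_end_ = 0;  // protected by lock_binlog_end_
  std::atomic<bool> killed_{false};
  std::mutex lock_binlog_end_;
  std::condition_variable update_;
};

// Runs one writer and one dump thread; returns what the dump thread read.
std::vector<int> run_demo(int event_count) {
  BinlogSketch binlog(static_cast<std::size_t>(event_count));
  std::vector<int> sent;
  std::thread dump_thread(
      [&] { sent = binlog.dump(static_cast<std::size_t>(event_count)); });
  for (int i = 0; i < event_count; ++i) binlog.append(i);
  dump_thread.join();
  return sent;
}
```

The reader only ever touches slots below the end position it copied under the lock, which is why the read itself needs no lock in this sketch.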
Analysis On All Write Operations Of Binary Log
==============================================
* Appending New Events (group flush queue header thread)
  1. Get LOCK_log
  2. Append events
       For each transaction Do
         Flush stmt cache and trx cache.
         Increase m_prep_xids (it is protected by a rwlock).
  3. Flush cache to file.
  4. Send update signal
  5. Release LOCK_log

* Purge Logs
  Purge can be done automatically or manually. The logic is in
  MYSQL_BIN_LOG::purge_logs(). It purges all log files before a given log
  file, whose name is passed to purge_logs() through the argument 'to_logs'.

  1. Get LOCK_index
  2. Get the list of names of all log files that should be purged
       For each entry in index file Do
         Read binlog file name.
         Check if it is the same as 'to_logs'. If true, break the loop:
           all file names that should be purged have been found.
         Check if it is the active log. If true, break the loop.
           The active file cannot be purged. This check needs no extra
           lock: the thread is holding LOCK_index, so the active file
           cannot change.
         Check if it is used by a dump thread. If true, break the loop.
           A file in use by a dump thread, and all files following it,
           cannot be purged. The checking process is:
             Get LOCK_thread_count
             For each thread Do
               Check if it is the same as the file which the dump thread
               is reading.
             Release LOCK_thread_count
           Dump threads cannot move to the next file in the meantime,
           because the purge thread is holding LOCK_index and dump threads
           have to get LOCK_index when moving to the next file.
  3. Remove log names from index file
  4. Remove binlog files.
  5. Release LOCK_index

  Note: The safe recovery logic is ignored here.

* Log Rotation
  1. Get LOCK_log
  2. Get LOCK_commit
       LOCK_commit is used in a loop:
         mysql_mutex_lock(&LOCK_commit);
         while (get_prep_xids() > 0)
           mysql_cond_wait(&m_prep_xids_cond, &LOCK_commit);
  3. Get LOCK_index
  4. Generate new binlog file name.
  5. Close active file
       Append Rotate_log_event to active file.
       Send update signal (update_cond).
       Close the file.
  6. Create new log file
       Set its name as active name (copy to log_file_name variable).
       Open the new file.
       Append Format_description_log_event.
       Update index file.
  7. Release locks
       Release LOCK_index
       Release LOCK_commit
       Release LOCK_log

* Reset Master
  1. Get LOCK_log
  2. Get LOCK_index
  3. Delete all log files
  4. Delete index file.
  5. Create and open new index file.
  6. Create and open new active file.
  7. Release locks
       Release LOCK_index
       Release LOCK_log

  Note: It doesn't check if there is any dump thread. So dump threads may
  encounter some weird situations, e.g. they could not find the next file.

Remove LOCK_log From Dump Thread
================================
Binary events can only be appended at the end of the active file. So it is
safe to read both inactive and active binlog files without a lock. The only
problem is that a dump thread may read an incomplete event at the end of the
active log file. To solve the problem, MYSQL_BIN_LOG needs a variable which
records the end position of the last event in the active file. It is shared
between all binlog read threads and binlog write threads.

- Write Threads
  Update the variable after appending events and before 'Signal update'.
  This applies to the 'Appending New Events', 'Log Rotation' and
  'Reset Master' operations.

- Read Threads
  Only read the events whose positions are before the variable. Dump threads
  wait for new events after reading all events before the variable. This may
  include the 'Dump' and 'Show binlog events' operations, but this worklog
  only works on Dump operations.

- To Solve Deadlock Problem
  The deadlock problem is described below.
    SESSION 1                        SESSION 2                        DUMP THREAD
    ---------                        ---------                        -----------
 1  Flush Stage(mark rotate flag)
 2  Sync Stage
 3  Waiting binlog ACK  <------------------- ACK -------------------  Send binlog
 4  Commit Stage                                                      Waiting New Events
 5                                   Flush Stage
                                     Signal update
 6  Rotate(hold LOCK_log)
    WAITING for Session2 to commit
 7                                   Sync Stage                       WAITING for Session1 to release LOCK_log
 8                                   WAITING for Group ACK
 9                                   [Commit Stage] not happened yet

  Session1 has to hold LOCK_log to block other sessions from appending new
  events. With the old design this also blocks dump threads from reading
  events. In this design, dump threads do not acquire LOCK_log, so they are
  never blocked by Session1 and the deadlock is broken.

    SESSION 1                        SESSION 2                        DUMP THREAD
    ---------                        ---------                        -----------
 1  Flush Stage(mark rotate flag)
 2  Sync Stage
 3  Waiting binlog ACK  <------------------- ACK -------------------  Send binlog
 4  Commit Stage                                                      Waiting New Events
 5                                   Flush Stage
                                     Signal update
 6  Rotate(hold LOCK_log)                                             Send binlog
    WAITING for Session2 to commit
 7                                   Sync Stage                       Send binlog
 8                                   WAITING Group ACK  <--- ACK ---  Send binlog
 9                                   Commit Stage
10  Rotate
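The second timeline can be illustrated with a very small C++ sketch. It is a simplification under stated assumptions, not the server logic: Session 1 and Session 2 are collapsed into one thread that holds a stand-in for LOCK_log while it waits for an ACK, the slave's ACK is modeled by a std::promise, and the function name and the position value 120 are hypothetical.

```cpp
#include <cassert>
#include <future>
#include <mutex>
#include <thread>

// Returns the position acknowledged by the dump thread while the rotating
// session was still holding the stand-in for LOCK_log.
unsigned long rotate_while_dump_thread_sends() {
  std::mutex lock_log;                 // stand-in for LOCK_log
  std::mutex lock_binlog_end_pos;      // stand-in for LOCK_binlog_end_pos
  unsigned long binlog_end_pos = 120;  // arbitrary, already flushed position

  std::promise<unsigned long> ack;
  std::future<unsigned long> ack_received = ack.get_future();

  // "Rotate (hold LOCK_log)": the session keeps LOCK_log for the whole wait.
  std::unique_lock<std::mutex> rotate_guard(lock_log);

  std::thread dump_thread([&] {
    unsigned long end_pos;
    {
      // The dump thread never touches lock_log, only the short-lived
      // end-position lock, so it cannot be blocked by the rotation.
      std::lock_guard<std::mutex> guard(lock_binlog_end_pos);
      end_pos = binlog_end_pos;
    }
    ack.set_value(end_pos);  // "Send binlog" plus the slave's ACK, collapsed
  });

  assert(rotate_guard.owns_lock());  // still holding LOCK_log while waiting
  unsigned long acked = ack_received.get();  // "WAITING Group ACK"
  rotate_guard.unlock();                     // rotation can now finish
  dump_thread.join();
  return acked;
}
```

If the dump thread in this sketch had to acquire lock_log before reading, the call to ack_received.get() would never return, which corresponds to the first timeline.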
Binlog end position
===================
* binlog_end_pos
  It is the shared variable described in HLS.

  DEFINITION: my_off_t binlog_end_pos;

* LOCK_binlog_end_pos
  It is used to protect binlog_end_pos. The lock is required for both
  writing and reading binlog_end_pos.

  binlog_end_pos and LOCK_binlog_end_pos could be used by both binary log
  and relay log. But in this WL, they are used only in binary log.

  DEFINITION: mysql_mutex_t LOCK_binlog_end_pos;

* update_binlog_end_pos()
  It is used to update binlog_end_pos, only by MYSQL_BIN_LOG.

  DEFINITION: void update_binlog_end_pos();

* get_binlog_end_pos()
  It returns the value of binlog_end_pos.

  DEFINITION: my_off_t get_binlog_end_pos();

* get_binlog_end_pos_lock()
  It returns a pointer to LOCK_binlog_end_pos.

  DEFINITION: mysql_mutex_t* get_binlog_end_pos_lock();

* lock_binlog_end_pos()
  It locks LOCK_binlog_end_pos.

  DEFINITION: void lock_binlog_end_pos();

* unlock_binlog_end_pos()
  It unlocks LOCK_binlog_end_pos.

  DEFINITION: void unlock_binlog_end_pos();

* update_cond
  The original update_cond is still used to wait for and signal event
  updates. It is used by both binary log and relay log. When waiting for a
  binary log update, the new lock LOCK_binlog_end_pos is used together with
  it. But when waiting for a relay log update, LOCK_log is still used.

* MYSQL_BIN_LOG::signal_update()
  It is used to broadcast update_cond to all other threads, so they can
  continue if they are waiting for new event updates.

  binlog_end_pos is updated in this function. An extra signal_update()
  should be called after creating a new log file. It is in
  MYSQL_BIN_LOG::open_binlog().

  In general, a signal should always be sent after binlog_end_pos is
  updated. So this function is called in update_binlog_end_pos().

* flush_io_cache()
  Events are first written into the cache, not into the binary log file. So
  the cache has to be flushed into the binary log file before updating
  binlog_end_pos.
  Otherwise, dump threads will encounter an error that binlog_end_pos is
  bigger than the valid file size. There are two places where
  signal_update() is called but data is not flushed into the log file. An
  extra flush_io_cache() should be added there:
  - MYSQL_BIN_LOG::new_file_impl() after appending Rotate_log_event.
  - MYSQL_BIN_LOG::close() after appending Stop_log_event.

Binlog_sender class
===================
The class is designed to give the code of mysql_binlog_send() better
modularity and readability. All code of the original mysql_binlog_send() is
moved to the class. Now mysql_binlog_send() just needs to call
Binlog_sender.

* run()
  It is the entry function which should be called by mysql_binlog_send().

  DEFINITION: void run();

  LOGIC:
    init();
    while (!has_error() && !m_thd->killed)
    {
      1. send faked Rotate_log_event
      2. open binlog file.
      3. call send_binlog() to dump events in the file.
      4. find next binlog file.
    }
    cleanup();

  The loop will be broken if any error is encountered in any step.

* init()
  It initializes the environment of Binlog_sender, and checks if the dump
  request is valid.

  DEFINITION: void init();

* cleanup()
  It cleans up the environment of Binlog_sender when all events are sent or
  an error is encountered.

  DEFINITION: void cleanup();

* send_binlog()
  It sends all events in a binlog file to the slave.

  DEFINITION: int send_binlog(IO_CACHE *log_cache, my_off_t start_pos);

  Parameters:
    log_cache  The binlog file that will be sent.
    start_pos  Events are sent starting from start_pos.

  Returns:
    0 if all events are read successfully.
    1 if an error is encountered or it should stop without waiting for new
      events.

  LOGIC:
    1. send Format_description_log_event
    2. check if the binlog file has Previous_gtids_log_event. If it has no
       Previous_gtids_log_event and it is in AUTO_POSITION mode, then
       return 0 to skip this file.
    3. while (!m_thd->killed)
       {
         3.1 call get_binlog_end_pos() to get end_pos of the binlog.
             if end_pos is 0, then return 0.
         3.2 call send_events() to dump all events before the end_pos.
       }

* get_binlog_end_pos()
  It gets the end position of the binlog file.

  DEFINITION: my_off_t get_binlog_end_pos(IO_CACHE *log_cache,
                                          bool skip_group,
                                          bool *sent_heartbeat_event);

  Parameters:
    log_cache  The binlog file being read.

  LOGIC:
    my_off_t end_pos;
    do
    {
      1. mysql_bin_log.lock_binlog_end_pos();
         end_pos= mysql_bin_log.get_binlog_end_pos();
         mysql_bin_log.unlock_binlog_end_pos();
      2. check if the file being read is the active file.
         If it is not the active file:
         2.1 save file size to end_pos.
         2.2 if end_pos equals my_b_tell(log_cache), all events in the
             file have already been read, so return 0.
         2.3 Otherwise, return end_pos.
         /* my_b_tell(log_cache) tells the position up to which it has
            read. */
      4. if end_pos is larger than my_b_tell(log_cache), then return
         end_pos.
      5. if it doesn't need to wait for new events then return 1.
         Otherwise, flush the net cache before waiting.
      6. call wait_new_events() to wait for update_cond broadcast.
      /* Return 1 if an error is encountered in any step. */
    } while (!m_thd->killed);

* send_events()
  It reads events from a binlog file and sends them.

  DEFINITION: int send_events(IO_CACHE *log_cache, my_off_t end_pos,
                              bool *skip_group,
                              my_off_t *skip_group_end_pos);

  Parameters:
    log_cache  The binlog file being sent.
    end_pos    Only the events before end_pos are sent.

  Returns:
    0 if all events are read successfully.
    1 if an error is encountered or it should stop without waiting for new
      events.

  LOGIC:
    while (!m_thd->killed && my_b_tell(log_cache) < end_pos)
    {
      1. read an event from the log file.
      2. check if the event should be skipped when AUTO_POSITION is on.
         Jump to the beginning of the loop if the event should be skipped.
      3. send a heartbeat if any event is skipped.
      4. send the packet.
      /* Return 1 if an error is encountered in any step. */
    }
    send a heartbeat if any event is skipped.

* wait_new_events()
  It waits for an update_cond broadcast until one is received or the wait
  times out. It will send a heartbeat if a heartbeat is required.

  DEFINITION: int wait_new_events(my_off_t log_pos);

  Returns:
    0 if it succeeds.
    1 if an error happens.
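
The bounded read loop of send_events() described above can be sketched as follows. This is a hedged illustration, not the Binlog_sender code: the event layout (one length byte followed by the payload), the in-memory byte vector used instead of IO_CACHE, and the names append_event() and send_events_sketch() are all assumptions made for the example. Skipping, heartbeats and error handling are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Appends one event (length byte + payload) and returns the new end
// position, i.e. the value a writer would publish as binlog_end_pos.
std::size_t append_event(std::vector<std::uint8_t>& log,
                         const std::string& payload) {
  assert(payload.size() <= 255);
  log.push_back(static_cast<std::uint8_t>(payload.size()));
  log.insert(log.end(), payload.begin(), payload.end());
  return log.size();
}

// Reads and "sends" every event that starts before end_pos. *read_pos plays
// the role of my_b_tell(log_cache) and is advanced past each event read.
std::vector<std::string> send_events_sketch(
    const std::vector<std::uint8_t>& log, std::size_t* read_pos,
    std::size_t end_pos) {
  assert(end_pos <= log.size());
  std::vector<std::string> sent;
  while (*read_pos < end_pos) {
    const std::size_t len = log[*read_pos];
    const std::size_t start = *read_pos + 1;
    sent.emplace_back(reinterpret_cast<const char*>(log.data() + start), len);
    *read_pos = start + len;
  }
  return sent;
}
```

Because the loop stops at end_pos, bytes of a partially written event that lie beyond the published end position are never interpreted, which is the property the design relies on.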
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.