WL#3970: Safe slave positions using XA Position Participant
Affects: Server-6.0
—
Status: Assigned
SUMMARY
=======
- To fix BUG#26540
- To ensure that the slave always has a correct position, we need to update
  the position using 2PC.

IMPLEMENTATION
==============
- Create a handlerton for the relay-log.info file ("rli file"). Register it
  for slave transactions so that it handles all updates of the rli file.

- On Prepare:
  - Add a new XA line to the rli file (something like this):
      "prepared: pos12456 Xid123456"

- On Commit:
  - When called as a handlerton, remove the "prepared" line and update the
    position. Something like this:

      rli->group_relay_log_pos= rli->group_relay_log_pos_prepared;
      rli->group_xid.null();
      strmov(rli->group_relay_log_name, rli->group_relay_log_name_prepared);
      flush_relay_log_info(rli);

- On Rollback:
  - When called as a handlerton, remove the "prepared" line and update the
    position in the same way as for commit. (There can be non-transactional
    updates that have been logged, so the position needs to be updated for
    rollbacks too.)

  COMMENT BY GUILHEM:
  Yes and no, there can be two types of rollbacks in STATEMENT mode:
  1) On master, the transaction involved only transactional engines, and
     the slave had to stop in the middle of it (STOP SLAVE, mysqladmin
     shutdown, or some unexpected duplicate key error). A rollback is then
     going to happen, but next time the slave should restart from the
     transaction's start. So we must NOT update the position.
  2) On master, the transaction updated a non-transactional table and then
     rolled back, so it was binlogged as:
       BEGIN; UPDATE innodb; UPDATE myisam; ROLLBACK;
     This ROLLBACK, when executed on the slave, must update the position.
  In ROW and MIXED modes, any update on a non-transactional table will be
  logged outside the context of a transaction. So there will be no need to
  log rollbacks in such modes.
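The prepare/commit/rollback rules above, including Guilhem's two rollback cases, can be sketched in isolation as follows. This is only an illustration under stated assumptions: `rli_sketch`, `rli_prepare`, `rli_commit`, `rli_rollback` and the `rollback_event_from_master` flag are hypothetical names, the XID is reduced to an integer where 0 means "no prepared line", and nothing is written to disk.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the slave's relay log info. */
typedef struct {
  unsigned long long group_relay_log_pos;          /* committed position */
  unsigned long long group_relay_log_pos_prepared; /* position awaiting commit */
  unsigned long long group_xid;                    /* 0 = no "prepared" line */
} rli_sketch;

/* Prepare: record the new position together with the XID. */
static void rli_prepare(rli_sketch *rli, unsigned long long xid,
                        unsigned long long new_pos)
{
  assert(xid != 0); /* 0 is reserved for "nothing prepared" in this sketch */
  rli->group_relay_log_pos_prepared = new_pos;
  rli->group_xid = xid;
}

/* Commit: the prepared position becomes the committed one. */
static void rli_commit(rli_sketch *rli)
{
  rli->group_relay_log_pos = rli->group_relay_log_pos_prepared;
  rli->group_xid = 0;
}

/*
  Rollback: always drop the "prepared" line. Advance the position only when
  the ROLLBACK was itself replicated from the master (case 2 above); when the
  slave stopped in the middle of a transaction (case 1), keep the old position
  so that the transaction is re-executed from its start.
*/
static void rli_rollback(rli_sketch *rli, bool rollback_event_from_master)
{
  if (rollback_event_from_master)
    rli->group_relay_log_pos = rli->group_relay_log_pos_prepared;
  rli->group_xid = 0;
}
```

In a real handlerton each of these functions would end with a call to flush_relay_log_info(); that step is left out here to keep the sketch free of I/O.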
- On Recovery:
  - On recovery(): return the Xid from relay-log.info.
  - On commit_by_xid(): update the position if the xid matches, otherwise
    ignore (or assert, perhaps; it should not happen).
  - On rollback_by_xid(): remove the prepared line without updating the
    position, or assert if the xid does not match:

      rli->group_xid.null();
      flush_relay_log_info(rli);

- Remove the old code that updates the position in relay log info, i.e. the
  callers of slave.cc:flush_relay_log_info(). This now needs to be handled
  by the handlerton (as described above).

- slave.cc:flush_relay_log_info() needs to have the following lines added to
  store the prepared info:

    if (!rli->group_xid.is_null())
    {
      /* pos is assumed to point just past the lines already in buff */
      pos=strmov(pos, rli->group_relay_log_name_prepared);
      *pos++='\n';
      pos=longlong2str(rli->group_relay_log_pos_prepared, pos, 10);
      *pos++='\n';
      pos=strmov(pos, rli->group_xid);
      *pos++='\n';
    }

- If the server decides not to do two-phase commit (this happens when at
  least one participant can't do XA), then the code needs to update the
  position without the prepare line in the commit/rollback calls.

TESTING
=======
- Testing transactional recovery

  The test should (in the ideal case) include tests for:
  a) "Crashing" (actually DBUG_EXECUTE_IFs) the slave at various points
  b) Testing with and without XA engines
  c) Testing with mixed XA and non-XA engines
  d) Testing with transactional and non-transactional engines
  e) Testing with all three kinds of groups that we have:
     1) statement groups (e.g. intvar + statement)
     2) RBR group (table map + rbr event)
     3) transaction (BEGIN, statement, COMMIT)

  The text below is a modified description of the test controls that
  already exist in the code.
  It is from two emails written by Serg:

  Use the test control injections that exist in handler.cc:

    % grep crash_ handler.cc
    DBUG_EXECUTE_IF("crash_commit_before", abort(););
    DBUG_EXECUTE_IF("crash_commit_after_prepare", abort(););
    DBUG_EXECUTE_IF("crash_commit_after_log", abort(););
    DBUG_EXECUTE_IF("crash_commit_before_unlog", abort(););
    DBUG_EXECUTE_IF("crash_commit_after", abort(););

  You can crash at any specific point in two-phase commit. There are two
  possibilities. Either you start mysqld as

    mysqld -#d,crash_commit_after_prepare

  and it will crash at this point on the first commit, or you activate it
  at runtime:

    SET SESSION debug="d,crash_commit_after_prepare"

  for the current connection, or

    SET GLOBAL debug="d,crash_commit_after_prepare"

  to do it for all connections. The next commit will crash at the specified
  point.

  On restart, recovery must happen automatically, and the transaction should
  be rolled back or committed depending on where you crashed: the first two
  "crash-points" are before the transaction is written to the binlog,
  meaning a rollback on recovery; the others are after the binlog write,
  meaning a commit. So, Falcon will always be consistent with the binlog.

  After a crash, even before the restart and recovery, you can examine the
  binlog independently with mysqlbinlog to see whether a transaction was
  logged or not.

  It is also important to be able to crash when some storage engines have
  already committed and others have not, that is, in the middle of the
  commit loop. Serg didn't have a "crash-point" there, as there was only one
  XA-able storage engine anyway.
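  The rule above (the first two crash points lead to a rollback on recovery,
  the remaining ones to a commit) could be captured in a small lookup that a
  test driver consults when checking the slave after restart. This is only a
  sketch: `recovery_outcome` and `expected_outcome` are hypothetical names,
  and the crash point names are the five listed by the grep above.

```c
#include <stddef.h>
#include <string.h>

typedef enum { OUTCOME_ROLLBACK, OUTCOME_COMMIT, OUTCOME_UNKNOWN } recovery_outcome;

/*
  Return the outcome recovery is expected to produce after crashing at the
  given point. The first two points are hit before the binlog write, the
  other three after it.
*/
static recovery_outcome expected_outcome(const char *crash_point)
{
  static const char *const points[] = {
    "crash_commit_before",        /* before binlog write: rollback */
    "crash_commit_after_prepare", /* before binlog write: rollback */
    "crash_commit_after_log",     /* after binlog write: commit */
    "crash_commit_before_unlog",  /* after binlog write: commit */
    "crash_commit_after"          /* after binlog write: commit */
  };
  size_t i;

  for (i = 0; i < sizeof(points) / sizeof(points[0]); i++)
    if (strcmp(crash_point, points[i]) == 0)
      return i < 2 ? OUTCOME_ROLLBACK : OUTCOME_COMMIT;
  return OUTCOME_UNKNOWN; /* not one of the five known crash points */
}
```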
  Here's a patch (untested) that adds a crash point inside the commit loop:

===== handler.cc 1.313 vs edited =====
--- 1.313/sql/handler.cc 2007-06-28 00:55:29 +02:00
+++ edited/handler.cc 2007-06-28 00:54:47 +02:00
@@ -781,6 +781,7 @@ int ha_commit_one_phase(THD *thd, bool a
         my_error(ER_ERROR_DURING_COMMIT, MYF(0), err);
         error=1;
       }
+      DBUG_EXECUTE_IF("crash_commit_between_engines", assert(ht == trans->ht););
       status_var_increment(thd->status_var.ha_commit_count);
       *ht= 0;
     }

NOTES
=====
- Note that in the future we would like it to be possible to select the
  storage media for the rli file (WL#2775): either use a file or use a
  table. The solution above is for a file, but it should be relatively easy
  to change the flush to write to a table instead (if the user has
  configured the system to use a table instead of a file). The code to store
  the rli file in a table is not part of this fix.

DRAWBACK OF THIS SOLUTION
=========================
- The rli needs to be synced twice to make it safe. This could be controlled
  via options:

    SET SYNC_RLI_TIME=COMMIT, PREPARE_AND_COMMIT
    SET SYNC_RLI_PERIOD=10  (example for every 10th transaction, not very safe)

  This new sync will make a transaction sync to file five times or so:
  1. At prepare rli
  2. At commit rli
  3. At prepare storage engine
  4. At commit storage engine (perhaps this is not needed)
  5. At prepare/commit binlog (one-phase participant)

  With WL#2775 the storage engine sync and the rli sync can be done at once,
  provided that the rli is stored in the same storage engine as the affected
  tables of the transaction.

  13:50 binlog needs to be synced too (once) and all affected storage engines (innodb) - twice
  13:52 there's a trivial optimization that can help immensely without sacrificing the transactional safety
  13:52 (unlike SET SYNC_RLI_PERIOD=10)
  13:53 simply ignore some commits. like, issue only 1 out of 10 commits
  13:53 it will make transaction larger
  13:53 who cares
  13:54 and more of a binlog to re-execute in a crash
  13:54 ah, yes. one problem
  13:54 in begin ... commit; begin ... commit; begin .. rollback;
  13:55 the second commit cannot be ignored, naturally
  13:55 but until rollback you don't know that
  13:55 the solution could be:
  13:55 instead of commit do savepoint "COMMIT_WAS_HERE"
  13:56 then on rollback you rollback to savepoint
  13:56 and every 10th commit you do a real commit
  13:56 or every 100th
  13:56 reduces the number of syncs everywhere, not only in rli
  13:56 and still completely safe in case of a crash

COMPANION TASK (Guilhem's suggestion)
=====================================
- At recovery, the relay log could be corrupted, master.info too, or
  master.info could be out of sync with the relay log, so we need to
  automatically re-fetch the pieces of relay log starting from the "safe
  position". Otherwise it's not crash-safe (WL#3970 makes relay-log.info
  trustable but not master.info...).
- If we don't do it, this can happen:
  - write event to relay log; crash;
  - master.info does not account for this event, so when the slave is
    restarted it fetches the event a second time, and the event will be
    executed twice.
- So the idea is that if the slave crashed, we throw away master.info and
  "replace" it with relay-log.info, which is now safe thanks to WL#3970.
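The savepoint idea from the IRC discussion in the DRAWBACK section (turn most commits into a savepoint, do a real commit only every Nth time, and roll back to the savepoint when a rollback arrives) can be sketched as a small decision helper. This is a sketch under assumptions, not part of the proposed fix: `commit_batcher`, `batch_action`, `on_commit` and `on_rollback` are hypothetical names, and the helper only decides which action to take; issuing the actual SAVEPOINT, COMMIT or ROLLBACK is left to the caller.

```c
typedef enum {
  ACT_SAVEPOINT,             /* replace the commit by SAVEPOINT "COMMIT_WAS_HERE" */
  ACT_REAL_COMMIT,           /* issue a real commit (and the syncs that go with it) */
  ACT_ROLLBACK_TO_SAVEPOINT, /* undo only the work done since the last savepoint */
  ACT_REAL_ROLLBACK          /* nothing deferred: an ordinary rollback is safe */
} batch_action;

typedef struct {
  unsigned period;   /* do a real commit every `period` commits (e.g. 10) */
  unsigned deferred; /* commits turned into savepoints since the last real one */
} commit_batcher;

/* Decide what to do when the applied transaction ends with COMMIT. */
static batch_action on_commit(commit_batcher *b)
{
  if (b->deferred + 1 >= b->period)
  {
    b->deferred = 0;
    return ACT_REAL_COMMIT;
  }
  b->deferred++;
  return ACT_SAVEPOINT;
}

/*
  Decide what to do when the applied transaction ends with ROLLBACK.
  Earlier "commits" that were only savepoints must survive, so with deferred
  work pending we may only roll back to the last savepoint.
*/
static batch_action on_rollback(const commit_batcher *b)
{
  return b->deferred > 0 ? ACT_ROLLBACK_TO_SAVEPOINT : ACT_REAL_ROLLBACK;
}
```

With `period` set to 10 this gives nine savepoints followed by one real commit. As noted in the discussion, the price is a larger transaction and more binlog to re-execute after a crash.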
In case it can be of use, and at Lars' request, here is a file which can do
two-phase commit in a durable way, using shadow pages (a header of 9 bytes
and then two pages of 20 bytes; of course all these sizes should be changed,
mysys should be used, etc.).

There are two pages, and a byte in the header (the "committed" pointer) tells
which of the two pages contains the last committed data. The other page is
where the future uncommitted data goes.

So, at update time, the page-for-uncommitted is written. Note that the
page-for-committed is not modified, so we don't overwrite our important
(committed) data.
At prepare, we store the XID in the header (this is for recovery).
At commit, the uncommitted data becomes committed: we bind the committed
pointer to the page with prepared data. The page with older committed data
becomes the page of future uncommitted data. We remove the XID (it's useless
now).
At rollback, we just remove the XID.

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <assert.h>
#include <string.h>

typedef unsigned long long ulonglong;

#define XID_OFFSET 0
#define XID_SIZE 8
#define COMMITTED_PTR_OFFSET 8
#define HEADER_SIZE 9
#define FILE_SIZE 20

/*
  Header: 8 bytes for the xid, 1 for the "committed" pointer.
  buff: FILE_SIZE bytes for the useful space.
  The file on disk has an additional FILE_SIZE (the second page).
*/
char header[HEADER_SIZE];
char buff[FILE_SIZE];

#define xid (*(ulonglong*)(header+XID_OFFSET))
#define committed (header[COMMITTED_PTR_OFFSET])

int fd;

void af_open(char *name)
{
  fd= open(name, O_RDWR|O_CREAT, S_IRWXU);
  assert(fd >= 0);
}

static void af_pwrite(int fd, char *buff, int count, int offset)
{
  int res;
  res= pwrite(fd, buff, count, offset);
  assert(res == count);
}

static void af_fsync(int fd)
{
  int res;
  res= fsync(fd);
  assert(res == 0);
}

static void af_write_header()
{
  af_pwrite(fd, header, HEADER_SIZE, 0);
}

/* Update: only the in-memory buffer changes, nothing goes to disk yet */
void af_update(char *new_data)
{
  memcpy(buff, new_data, FILE_SIZE);
}

/* Prepare: store the XID and write the data to the page NOT marked committed */
void af_prepare(ulonglong xid_arg)
{
  xid= xid_arg;
  af_write_header();
  af_pwrite(fd, buff, FILE_SIZE, HEADER_SIZE+(!committed)*FILE_SIZE);
  af_fsync(fd);
}

/* Rollback: just remove the XID; the committed page was never touched */
void af_rollback()
{
  xid= 0;
  af_write_header();
  af_fsync(fd);
}

/* Commit: flip the committed pointer, then remove the XID as for rollback */
void af_commit()
{
  committed= !committed;
  af_rollback();
}

void af_close()
{
  close(fd);
}

int main(void)
{
  char test_data[FILE_SIZE]="abcdefghijklmnopqrst";
  af_open("af");
  af_update(test_data);
  af_prepare(10);
  af_commit();
  af_update(test_data);
  af_prepare(10);
  af_commit();
  af_close();
  return 0;
}
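The program above does not show the recovery side. Since the XID is stored in the header "for recovery", a restart could read the 9-byte header back, report any XID still present, and let the coordinator resolve it, mirroring the recovery(), commit_by_xid() and rollback_by_xid() steps in the IMPLEMENTATION section. The following is a sketch under assumptions: `af_recover_xid` and `af_resolve` are hypothetical names, an XID of 0 is taken to mean "nothing prepared", and the functions work on an in-memory copy of the header that the caller is expected to write back and fsync.

```c
#include <string.h>

typedef unsigned long long ulonglong;

#define XID_OFFSET 0
#define XID_SIZE 8
#define COMMITTED_PTR_OFFSET 8
#define HEADER_SIZE 9

/* Return the XID left behind by an unresolved prepare, or 0 if there is none. */
static ulonglong af_recover_xid(const char *hdr)
{
  ulonglong x = 0;
  memcpy(&x, hdr + XID_OFFSET, XID_SIZE);
  return x;
}

/*
  Resolve an in-doubt transaction in the header image.
  If the stored XID does not match xid_arg, the header is left untouched.
  On commit the "committed" pointer is flipped to the page that holds the
  prepared data; in both cases the XID is cleared.
  Returns 1 if the header was changed (and must be written and synced), else 0.
*/
static int af_resolve(char *hdr, ulonglong xid_arg, int do_commit)
{
  if (xid_arg == 0 || af_recover_xid(hdr) != xid_arg)
    return 0;
  if (do_commit)
    hdr[COMMITTED_PTR_OFFSET] = !hdr[COMMITTED_PTR_OFFSET];
  memset(hdr + XID_OFFSET, 0, XID_SIZE);
  return 1;
}
```

Copying the XID with memcpy instead of casting the header pointer keeps the sketch independent of alignment. It still assumes, like the program above, that ulonglong is 8 bytes wide.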