WL#3970: Safe slave positions using XA Position Participant

Affects: Server-6.0   —   Status: Assigned

SUMMARY
=======
- To fix BUG#26540
- To ensure that the slave always has a correct position, we
  need to update the position using 2PC.

IMPLEMENTATION
==============
- Create handlerton for relay-log.info file ("rli file").  Register it
  for slave transactions so that it handles all updates of the rli file.

- On Prepare:
  - Add a new line with XA to rli file (something like this):
    "prepared: pos12456 Xid123456"

- On Commit:
  - When called as a handlerton, then
    remove the "prepared" line and update position.
    Something like this:

      rli->group_relay_log_pos= rli->group_relay_log_pos_prepared;
      rli->group_xid.null();
      memcpy(rli->group_relay_log_name, rli->group_relay_log_name_prepared,
             sizeof(rli->group_relay_log_name));
      flush_relay_log_info(rli);

- On Rollback
  - When called as a handlerton, then
    remove the "prepared" line and update position
    same as for commit.
    (There can be non-transactional updates that
    have been logged, so the position needs to be updated
    also for rollbacks.)

  COMMENT BY GUILHEM:
  Yes and no, there can be two types of rollbacks in STATEMENT mode:

  1) on master, transaction was only about transactional engines, 
     and slave had to stop in the middle of it (STOP SLAVE, 
     mysqladmin shutdown, or some unexpected duplicate key error), 
     then a rollback is going to happen, but next time slave 
     should restart from the transaction's start. 
     So we must NOT update the position.

  2) on master, transaction updated a non-transactional table 
     and then rolled back so was binlogged as:
     BEGIN;
     UPDATE innodb;
     UPDATE myisam;
     ROLLBACK;
     Then this ROLLBACK, when executed on slave, must update the position.

  In ROW and MIXED modes, any update on a non-transactional table will be
  logged outside the context of a transaction. So there will be no need to
  log rollbacks in such modes.

- On Recovery
  - On recovery() - return the Xid from relay-log.info
  - On commit_by_xid() - update the position if the xid matches,
    otherwise ignore it
    (or assert, perhaps - it should not happen)
  - On rollback_by_xid() - remove the prepared line without
    updating the position, or assert if the xid does not match:

      rli->group_xid.null();
      flush_relay_log_info(rli);

- Remove the old code that updates the position in relay log info,
  i.e. slave.cc:flush_relay_log_info()
  This now needs to be handled by the handlerton (as described
  above.)

- slave.cc:flush_relay_log_info() needs to have the
  following lines added to store the prepared info:

    if (!rli->group_xid.is_null()) {
      pos=strmov(buff, rli->group_relay_log_name_prepared);
      *pos++='\n';
      pos=longlong2str(rli->group_relay_log_pos_prepared, pos, 10);
      *pos++='\n';
      pos=strmov(pos, rli->group_xid);
      *pos++='\n';
    }

- If the server decides not to do two phase commit
  (this happens when at least one participant can't do XA),
  then the code needs to update the position in the
  commit/rollback calls without writing a prepare line.

TESTING
=======
- Testing transactional recovery

  Test should (in the ideal case) include tests for:
  a) "Crashing" (actually DBUG_EXECUTE_IFs) slave at various points
  b) Testing with and without XA engines
  c) Testing with mixed XA and non-XA engines
  d) Testing with transactional and non-transactional engines
  e) Testing with all three kinds of groups that we have:
     1) statement groups (e.g. intvar + statement)
     2) RBR group (table map + rbr event)
     3) transaction (BEGIN, statement, COMMIT)

  The text below is a modified description of the test controls that
  already exist in the code.  It is taken from two emails written by Serg:

  Use the test control injections that exist in
  handler.cc:

  % grep crash_  handler.cc
    DBUG_EXECUTE_IF("crash_commit_before", abort(););
      DBUG_EXECUTE_IF("crash_commit_after_prepare", abort(););
      DBUG_EXECUTE_IF("crash_commit_after_log", abort(););
    DBUG_EXECUTE_IF("crash_commit_before_unlog", abort(););
    DBUG_EXECUTE_IF("crash_commit_after", abort(););

  You can crash at any specific point in two phase commit.
  There are two possibilities: either you start mysqld as

     mysqld -#d,crash_commit_after_prepare

  and it'll crash at this point on the first commit, or you can
  activate it at runtime:

      SET SESSION debug="d,crash_commit_after_prepare"

  for the current connection or

      SET GLOBAL debug="d,crash_commit_after_prepare"

  to do it for all connections. The next commit will crash at the
  specified point. On restart the recovery must happen automatically,
  and the transaction should be rolled back or committed, depending on
  where you crash - first two "crash-points" are before a transaction is
  written to binlog, meaning a rollback on recovery; others are after
  binlog write, meaning a commit. So, Falcon will always be consistent
  with the binlog. After a crash, even before the restart and recovery
  you can examine binlog independently with mysqlbinlog to see if a
  transaction was logged or not.

  It is also important to be able to crash when some storage engines
  have already committed and others have not, that is, in the middle of
  the commit loop. Serg didn't have a "crash-point" there, as there was
  only one XA-capable storage engine anyway.

  Here's a patch (untested):

  ===== handler.cc 1.313 vs edited =====
  --- 1.313/sql/handler.cc        2007-06-28 00:55:29 +02:00
  +++ edited/handler.cc   2007-06-28 00:54:47 +02:00
  @@ -781,6 +781,7 @@ int ha_commit_one_phase(THD *thd, bool a
           my_error(ER_ERROR_DURING_COMMIT, MYF(0), err);
           error=1;
       }
  +      DBUG_EXECUTE_IF("crash_commit_between_engines", assert(ht == trans->ht););
         status_var_increment(thd->status_var.ha_commit_count);
         *ht= 0;
       }

NOTES
=====
- Note that in the future we would like it to be possible to select
  storage media for rli file (WL#2775) - either use file or use a table.  
  The solution above is for file, but it should be relatively easy to
  change the flush to flush to table instead (if the user has
  configured his system to use table instead of file).  The
  code to store the rli file into a table is not part of this
  fix.

DRAWBACK OF THIS SOLUTION
=========================
- The rli needs to be synced twice to make it safe.
  This could be controlled via options, e.g.:
  SET SYNC_RLI_TIME={COMMIT|PREPARE_AND_COMMIT}
  SET SYNC_RLI_PERIOD=10 (sync only every 10th transaction; not very safe)
  This new sync will make a transaction sync to file
  five times or so:
  1. At prepare rli
  2. At commit rli
  3. At prepare storage engine
  4. At commit storage engine (perhaps this is not needed)
  5. At prepare/commit binlog (one-phase participant)
  With WL#2775 the storage engine sync and the rli sync 
  can be done at once, provided that rli is stored in the
  same storage engine as the affected tables of the transaction.

13:50  binlog needs to be synced too (once) and all affected storage
engines (innodb) - twice
13:52  there's a trivial optimization that can help immensely without
sacrificing the transactional safety
13:52  (unlike SET SYNC_RLI_PERIOD=10)
13:53  simply ignore some commits. like, issue only 1 out of 10 commits
13:53  it will make transaction larger
13:53  who cares
13:54  and more of a binlog to re-execute in a crash
13:54  ah, yes. one problem
13:54  in begin ... commit; begin ... commit; begin .. rollback;
13:55  the second commit cannot be ignored, naturally
13:55  but until rollback you don't know that
13:55  the solution could be:
13:55  instead of commit do savepoint "COMMIT_WAS_HERE"
13:56  then on rollback you rollback to savepoint
13:56  and every 10th commit you do a real commit
13:56  or every 100th
13:56  reduces the number of syncs everywhere, not only in rli
13:56  and still completely safe in case of a crash


COMPANION TASK (Guilhem's suggestion)
=====================================
- At Recovery, the relay log could be corrupted, master.info too, 
  or master.info not in sync with relay log, and so we need to 
  automatically re-fetch the pieces of relay log starting 
  from the "safe position". 
  Otherwise it's not crash-safe (WL#3970 makes 
  relay-log.info trustable but not master.info...).
- If we don't do it, this can happen:
  - write event to relay log; crash;
  - so master.info does not account for this event, 
    and when slave is restarted, it fetches the 
    event a second time, and so event will be executed twice.
- So the idea is that if slave crashed we throw away 
  master.info and "replace" it with relay-log.info 
  which is now safe thanks to WL#3970.
In case it can be of use, and at Lars' request, here is a file which can do
two-phase commit in a durable way, using shadow pages (a header of 9 bytes and
then two pages of 20 bytes; of course all these sizes should be changed, mysys
should be used, etc).
There are two pages, and a byte in the header (the "committed" pointer) tells
which of the two pages contains the last committed data. The other page is where
the future uncommitted data goes.
So, at update time, the page-for-uncommitted is written. Note that the
page-for-committed is not modified, so we don't overwrite our important
(committed) data.
At prepare, we store the XID in the header (this is for recovery).
At commit, the uncommitted data becomes committed: we bind the committed pointer
to the page with prepared data. The page with older committed data becomes the
page of future uncommitted data. We remove the XID (it's useless now).
At rollback, we just remove the XID.

#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

typedef unsigned long long ulonglong;
#define XID_OFFSET 0
#define XID_SIZE 8
#define COMMITTED_PTR_OFFSET 8
#define HEADER_SIZE 9
#define FILE_SIZE 20

/*
  8 for xid, 1 for "committed" pointer,
  FILE_SIZE for the useful space.
  The file on disk has an additional FILE_SIZE.
*/
char header[HEADER_SIZE];
char buff[FILE_SIZE];
#define xid (*(ulonglong*)(header+XID_OFFSET))
#define committed (header[COMMITTED_PTR_OFFSET])
int fd;

void af_open(char *name)
{
  fd= open(name, O_RDWR|O_CREAT, S_IRWXU);
  assert(fd >= 0);
}

static void af_pwrite(int fd, char *buff, int count, int offset)
{
  int res;
  res= pwrite(fd, buff, count, offset);
  assert(res == count);
}

static void af_fsync(int fd)
{
  int res;
  res= fsync(fd);
  assert(res == 0);
}

static void af_write_header()
{
  af_pwrite(fd, header, HEADER_SIZE, 0);
}

void af_update(char *new)
{
  memcpy(buff, new, FILE_SIZE);
}

void af_prepare(ulonglong xid_arg)
{
  xid= xid_arg;
  af_write_header();
  af_pwrite(fd, buff, FILE_SIZE, HEADER_SIZE+(!committed)*FILE_SIZE);
  af_fsync(fd);
}

void af_rollback()
{
  xid= 0;
  af_write_header();
  af_fsync(fd);
}

void af_commit()
{
  committed= !committed;
  af_rollback();
}

void af_close()
{
  close(fd);
}

int main()
{
  char test_data[FILE_SIZE]="abcdefghijklmnopqrst";
  af_open("af");
  af_update(test_data);
  af_prepare(10);
  af_commit();
  af_update(test_data);
  af_prepare(10);
  af_commit();
  af_close();
  return 0;
}