WL#5493: Binlog crash-safe when master crashed

Affects: Server-5.6   —   Status: Complete   —   Priority: Medium

RATIONALE
=========
* Make the binlog itself crash-safe, so that the content of the binlog
  can be recovered, committed, or rolled back together with the changes
  to the (transactional) tables.


CONTEXT
=======
  It's related to BUG#53530, BUG#53527 and WL#5440.
  In the vicinity: WL#1790.
  Look at BUG#52620 too. /Alfranio

  When pushing this WL, the test for WL#5440 should be pushed 
  as well.

COMMENTS
========
By Yoshinori Matsunobu (12/08/2010):

 The following failure scenarios 1-4 should be considered (0 is the normal case).
 +----------------------+-----+-----+-----+-----+-----+
 |                      |  0  |  1  |  2  |  3  |  4  |
 +----------------------+-----+-----+-----+-----+-----+
 | Binlog Event Header  |  o  |  x  |  xx |  o  |  o  |
 | Binlog Data          |  o  |  -  |  -  |  x  |  xx |
 +----------------------+-----+-----+-----+-----+-----+

 o: correctly written (normal behavior)
 x: invalid (partly written)
 xx: wrongly written (written with 0x33, etc)

 0: No error (normal behavior)
 1: Failed to read header
 2: Read header, but the value (i.e. data_len) was incorrect
 3: Succeeded to read header, but failed to read body
 4: Succeeded to read header, but body was incorrect

 If 1-3 happens, the current Log_event::read_log_event() returns 0, so it is
easy to handle inside MYSQL_BIN_LOG::recover() without changing the binlog
format.
 If 4 happens, Log_event::read_log_event() will return the invalid event, so a
slave that receives the event will fail. My understanding is that Andrei's
WL#2540 (Binlog Checksum) will fix this issue.
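
 The scenarios above can be sketched as a reader that reports why a read
failed. This is only an illustration: the record layout (a 4-byte length
followed by the body), the names read_record/ReadResult and the kMaxDataLen
sanity limit are assumptions made for the example, not the real binlog format
or the Log_event API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical record layout, NOT the real binlog event format:
//   [4-byte little-endian data_len][data_len bytes of body]
enum class ReadResult {
  Ok,               // scenario 0 (scenario 4 also lands here, see below)
  EndOfFile,        // clean end of file, nothing left to read
  HeaderTruncated,  // scenario 1
  HeaderInvalid,    // scenario 2
  BodyTruncated     // scenario 3
};

constexpr std::size_t kHeaderLen = 4;
constexpr std::uint32_t kMaxDataLen = 1u << 20;  // assumed sanity limit

ReadResult read_record(const std::vector<std::uint8_t>& file,
                       std::size_t pos, std::size_t* next_pos)
{
  assert(pos <= file.size() && next_pos != nullptr);
  const std::size_t remaining = file.size() - pos;
  if (remaining == 0)
    return ReadResult::EndOfFile;
  if (remaining < kHeaderLen)
    return ReadResult::HeaderTruncated;

  const std::uint32_t data_len =
      static_cast<std::uint32_t>(file[pos]) |
      static_cast<std::uint32_t>(file[pos + 1]) << 8 |
      static_cast<std::uint32_t>(file[pos + 2]) << 16 |
      static_cast<std::uint32_t>(file[pos + 3]) << 24;
  if (data_len > kMaxDataLen)
    return ReadResult::HeaderInvalid;
  if (remaining - kHeaderLen < data_len)
    return ReadResult::BodyTruncated;

  // A garbage body (scenario 4) cannot be detected without a checksum.
  *next_pos = pos + kHeaderLen + data_len;
  return ReadResult::Ok;
}
```

 Note that a wrongly written length that happens to stay below the sanity
limit is reported as a truncated body (or is misparsed), which is one more
reason to rely on checksums for scenarios 2 and 4.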

 Currently three implementation plans can be considered.
   1) Extending Binlog File Header
      Write "actual binary log size" info into the binlog file header. During
crash recovery, check the actual size and trim the binlog if the file is larger
than that value. The binlog file size then ends up equal to the actual size
maintained in the binlog header.
     Cons 1: Replication compatibility is broken because additional info needs
to be added to the binlog header.
     Cons 2: Performance will suffer because each event requires writes to two
locations. When I tested, appending 256 bytes could be done 15000
times/sec, but (writing 8 bytes + appending 256 bytes) could be done only 10000
times/sec.
     Cons 3: sync-binlog must be 1. Otherwise the binlog size info inside the
header can no longer be trusted. For example, lots of "successfully written"
events might be trimmed if the binlog size info is not written to disk for a
long time.
     Cons 4: An extra write per event is not actually necessary. Instead, we
would flush the binlog cache as usual and sync the file, then do the extra
write and sync again. The extra activity happens every time the binlog cache
is flushed, not every time an event is written.


   2) Writing binlog event size info into binlog event header *after* binlog
data is written (From Mats) 
      Use zero for length (or event type) and write this single byte
      after the entire event has been written. This will allow the file
      to be scanned for the first zero length (or type) and decide the
      length of the file based on that.

   3) Using binlog checksum functionality (WL#2540)
      Use checksums on each event and find the first event that does not
      have a correct checksum and decide the length of the file based on
      that.
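
 A rough sketch of plan 3, assuming a made-up record layout
([4-byte length][body][4-byte CRC-32 over length and body]) rather than the
real event format from WL#2540. The helper names (append_record,
valid_prefix_length) are hypothetical. The scan stops at the first record
that is truncated or fails its checksum and returns the length of the valid
prefix.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Standard reflected CRC-32 (polynomial 0xEDB88320), bitwise variant.
std::uint32_t crc32(const std::uint8_t* data, std::size_t len)
{
  std::uint32_t crc = 0xFFFFFFFFu;
  for (std::size_t i = 0; i < len; ++i) {
    crc ^= data[i];
    for (int k = 0; k < 8; ++k)
      crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
  }
  return ~crc;
}

static std::uint32_t load_le32(const std::uint8_t* p)
{
  return static_cast<std::uint32_t>(p[0]) |
         static_cast<std::uint32_t>(p[1]) << 8 |
         static_cast<std::uint32_t>(p[2]) << 16 |
         static_cast<std::uint32_t>(p[3]) << 24;
}

static void store_le32(std::vector<std::uint8_t>& out, std::uint32_t v)
{
  for (int i = 0; i < 4; ++i)
    out.push_back(static_cast<std::uint8_t>((v >> (8 * i)) & 0xFFu));
}

// Hypothetical layout: [len][body][crc32 over len + body].
void append_record(std::vector<std::uint8_t>& file,
                   const std::vector<std::uint8_t>& body)
{
  const std::size_t start = file.size();
  store_le32(file, static_cast<std::uint32_t>(body.size()));
  file.insert(file.end(), body.begin(), body.end());
  const std::uint32_t crc = crc32(file.data() + start, 4 + body.size());
  store_le32(file, crc);
}

// Returns the offset just past the last record whose checksum is correct.
std::size_t valid_prefix_length(const std::vector<std::uint8_t>& file)
{
  std::size_t pos = 0;
  while (file.size() - pos >= 4) {
    const std::uint32_t len = load_le32(&file[pos]);
    const std::size_t avail = file.size() - pos - 4;
    if (avail < 4 || len > avail - 4)
      break;                                  // truncated record
    const std::uint8_t* rec = file.data() + pos;
    if (crc32(rec, 4 + len) != load_le32(rec + 4 + len))
      break;                                  // first incorrect checksum
    pos += 4 + len + 4;
  }
  assert(pos <= file.size());
  return pos;
}
```

 Covering the length field with the checksum means a run of zero bytes at the
end of the file does not pass as a sequence of empty records.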


PROPOSED SOLUTION
=================
When the MySQL server crashes in the middle of writing the binlog, trim the
crashed binlog file to the last valid transaction or (non-transactional)
event, based on the valid binlog size (valid_pos). In addition, use a
temporary file to make updates to the binlog index file crash-safe.

HOW TO GET THE POSITION OF THE LAST VALID EVENT
===============================================
Use the newly introduced "actual_size" variable to record the position of the
last valid transaction or (non-transactional) event. The position is found by
locating the first transaction or event with an incorrect checksum, or, when
the checksum option is disabled, by checking the return value of
Log_event::read_log_event().


HOW TO TRIM THE CRASHED BINLOG FILE
====================================
Truncate the crashed binlog file to the position of the last valid
transaction or (non-transactional) event.


CLEAR LOG_EVENT_BINLOG_IN_USE_F FLAG
=====================================
Clear the LOG_EVENT_BINLOG_IN_USE_F flag in the header of the crashed binlog
file after it has been trimmed during recovery.


HOW TO GUARANTEE BINLOG INDEX TO BE CRASH SAFE
===============================================
First write the updated data to a temporary file, then rename the temporary
file to the binlog index file. This keeps the index crash-safe both when
appending data to it and when purging it.



The checksum part will be added after WL#2540 is pushed into the main tree.

RECORD VALID POSITION FOR CRASHED BINLOG FILE
==============================================
When an incorrect event is encountered, record the position of the last
"BEGIN" Query_log_event as the valid position.
If no incorrect event is encountered, record the end position of the last
event as the valid position; in that case every "BEGIN" before it is paired
with a "COMMIT" or "XID COMMIT" event.

  while ((ev= Log_event::read_log_event(log,0,fdle)) && ev->is_valid())
  {
    /*
      Record the valid position for a crashed binlog file
      that contains incorrect events.
    */
    if (ev->get_type_code() == QUERY_EVENT &&
        !strcmp(((Query_log_event*)ev)->query, "BEGIN"))
      *valid_pos= last_valid_pos;
    last_valid_pos= my_b_tell(log);

    if (ev->get_type_code() == XID_EVENT)
    {
      ...
    }
    delete ev;
  }
  /*
     Record the valid position for a crashed binlog file
     that does not contain incorrect events.
  */
  if (!log->error)
    *valid_pos= last_valid_pos;
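
The bookkeeping in the loop above can be tried out in isolation with a small
simulation. The Event struct, the EventType values and find_valid_pos() are
invented for this sketch; real events come from Log_event::read_log_event().
The sketch follows the loop literally and does not model the elided XID_EVENT
branch.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified stand-ins for binlog events (hypothetical types).
enum class EventType { Begin, Other, Xid, Unreadable };

struct Event {
  EventType type;
  std::uint64_t length;  // bytes the event occupies in the file
};

// Mirrors the loop above: valid_pos is moved to the start of each BEGIN
// event; if the whole file is readable it becomes the end of the last event.
std::uint64_t find_valid_pos(const std::vector<Event>& events,
                             std::uint64_t start_pos)
{
  std::uint64_t valid_pos = start_pos;
  std::uint64_t last_valid_pos = start_pos;  // end of the last good event
  bool read_error = false;

  for (const Event& ev : events) {
    if (ev.type == EventType::Unreadable) {  // read_log_event() returned 0
      read_error = true;
      break;
    }
    if (ev.type == EventType::Begin)
      valid_pos = last_valid_pos;            // start offset of this BEGIN
    last_valid_pos += ev.length;
  }

  if (!read_error)
    valid_pos = last_valid_pos;
  assert(valid_pos <= last_valid_pos);
  return valid_pos;
}
```

For example, with events starting at offset 4, the sequence Other(100),
Begin(50), Other(30), Unreadable yields 104 (the start of the open
transaction), while the same sequence ending in Xid(20) instead of the
unreadable event yields 204.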


TRIM THE CRASHED BINLOG FILE
=============================
Truncate the crashed binlog file to the valid position.

      /* Change binlog file size to valid_pos */
      if (valid_pos < binlog_size)
      {
        if (my_chsize(file, valid_pos, 0, MYF(MY_WME)))
        {
          sql_print_error("Failed to trim the crashed binlog file "
                          "when master server is recovering it.");
          mysql_file_close(file, MYF(MY_WME));
          return -1;
        }
        else
        {
          sql_print_information("Crashed binlog file %s size is %llu, "
            "but recovered up to %llu. Binlog trimmed to %llu bytes.",
            log_name, binlog_size, valid_pos, valid_pos);
        }
      }
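
Outside the server, the same step can be sketched with plain POSIX calls.
The function name trim_to_valid_pos() is hypothetical, and the fsync() after
the truncation is an addition of this sketch, not something taken from the
snippet above.

```cpp
#include <cassert>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

// Shrinks the file at `path` to valid_pos bytes if it is larger.
// Returns 0 on success (including "nothing to trim"), -1 on failure.
int trim_to_valid_pos(const char* path, off_t valid_pos)
{
  assert(path != nullptr && valid_pos >= 0);
  const int fd = open(path, O_RDWR);
  if (fd < 0)
    return -1;

  struct stat st;
  int rc = fstat(fd, &st);
  if (rc == 0 && valid_pos < st.st_size) {
    rc = ftruncate(fd, valid_pos);  // drop everything after valid_pos
    if (rc == 0)
      rc = fsync(fd);               // make the new size durable
  }
  close(fd);
  return rc == 0 ? 0 : -1;
}
```

The file is never extended: if valid_pos is not smaller than the current
size, the call leaves the file untouched.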


CLEAR LOG_EVENT_BINLOG_IN_USE_F
================================
Clear LOG_EVENT_BINLOG_IN_USE_F flag for the crashed binlog file by
writing the cleared flag to its header.

      /* Clear LOG_EVENT_BINLOG_IN_USE_F */
      my_off_t offset= BIN_LOG_HEADER_SIZE + FLAGS_OFFSET;
      uchar flags= 0;
      if (mysql_file_pwrite(file, &flags, 1, offset, MYF(0)) != 1)
      {
        sql_print_error("Failed to clear LOG_EVENT_BINLOG_IN_USE_F "
                        "for the crashed binlog file when master "
                        "server is recovering it.");
        mysql_file_close(file, MYF(MY_WME));
        return -1;
      }  
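
A standalone variant of this step, for experimenting outside the server. It
differs from the snippet above in one respect: it reads the flag byte first
and clears only the in-use bit, so any other bits in that byte are preserved.
The constant values (4, 17, 0x1) are assumed here to match
BIN_LOG_HEADER_SIZE, FLAGS_OFFSET and LOG_EVENT_BINLOG_IN_USE_F; check them
against the server headers before relying on them. clear_in_use_flag() is a
hypothetical name.

```cpp
#include <cassert>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

// Assumed values, to be checked against the server headers.
constexpr off_t kBinLogHeaderSize = 4;       // magic bytes at file start
constexpr off_t kFlagsOffset = 17;           // flags field inside an event
constexpr unsigned char kInUseFlag = 0x1;

// Clears the in-use bit in the first event header of an open binlog file.
// Returns 0 on success, -1 on failure.
int clear_in_use_flag(int fd)
{
  assert(fd >= 0);
  const off_t offset = kBinLogHeaderSize + kFlagsOffset;
  unsigned char flags = 0;
  if (pread(fd, &flags, 1, offset) != 1)
    return -1;
  flags = static_cast<unsigned char>(flags & ~kInUseFlag);
  if (pwrite(fd, &flags, 1, offset) != 1)
    return -1;
  return fsync(fd) == 0 ? 0 : -1;
}
```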

GUARANTEE CRASH SAFE WHEN APPENDING BINLOG FILE NAME TO BINLOG INDEX
====================================================================
First copy the entire content of the index file to the temporary file, then
append the log file name to the temporary file. Finally, move the temporary
file over the index file.
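
A minimal sketch of this copy-append-rename sequence using the standard
library. The function name append_index_entry() and the ".tmp" suffix are
made up for the example. It relies on rename() replacing the target
atomically, as on POSIX systems, and it leaves out the fsync of the temporary
file and its directory that a real implementation would need.

```cpp
#include <cassert>
#include <cstdio>
#include <fstream>
#include <string>

// Rebuilds the index in a temp file with one more entry, then renames the
// temp file over the index. Returns true on success.
bool append_index_entry(const std::string& index_path,
                        const std::string& log_name)
{
  assert(!log_name.empty());
  const std::string tmp_path = index_path + ".tmp";  // hypothetical name
  {
    std::ifstream in(index_path, std::ios::binary);  // may not exist yet
    std::ofstream out(tmp_path, std::ios::binary | std::ios::trunc);
    if (!out)
      return false;
    // Copy the old content; skip the copy for an empty or missing index,
    // because streaming an empty buffer would set failbit on `out`.
    if (in && in.peek() != std::ifstream::traits_type::eof())
      out << in.rdbuf();
    out << log_name << '\n';
    out.flush();
    if (!out)
      return false;
  }  // both streams are closed here, before the rename
  return std::rename(tmp_path.c_str(), index_path.c_str()) == 0;
}
```

If the server crashes before the rename, the old index file is still intact
and the leftover temporary file can simply be discarded at startup.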

GUARANTEE CRASH SAFE WHEN PURGING BINLOG INDEX
==============================================
First copy the content of the index file, starting from the
index_file_start_offset recorded in log_info, to the temporary file, then
move the temporary file over the index file.
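
A minimal sketch of the purge step using the standard library. The function
name purge_index_prefix() and the ".tmp" suffix are made up for the example,
and start_offset stands in for index_file_start_offset. As with any such
sketch, the syncing a real implementation would need before the rename is
left out.

```cpp
#include <cassert>
#include <cstdio>
#include <fstream>
#include <string>

// Keeps only the part of the index that starts at start_offset by copying
// it to a temp file and renaming that file over the index.
bool purge_index_prefix(const std::string& index_path,
                        std::streamoff start_offset)
{
  assert(start_offset >= 0);
  const std::string tmp_path = index_path + ".tmp";  // hypothetical name
  {
    std::ifstream in(index_path, std::ios::binary);
    if (!in)
      return false;
    in.seekg(start_offset, std::ios::beg);
    if (!in)
      return false;
    std::ofstream out(tmp_path, std::ios::binary | std::ios::trunc);
    if (!out)
      return false;
    // Skip the copy when nothing remains, because streaming an empty
    // buffer would set failbit on `out`.
    if (in.peek() != std::ifstream::traits_type::eof())
      out << in.rdbuf();
    out.flush();
    if (!out)
      return false;
  }  // both streams are closed here, before the rename
  return std::rename(tmp_path.c_str(), index_path.c_str()) == 0;
}
```

For an index containing "./b.1", "./b.2" and "./b.3" on separate lines,
purging with start_offset 6 leaves "./b.2" and "./b.3".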