WL#5092: RBR: Options for writing partial or full row images in RBR events

Affects: Server-5.6 — Status: Complete

Description
High Level Architecture

Summary
=======

  Implement mysqld option so that the user is able to select
  whether to log minimal or full images on RBR events depending
  on the keys available in the master. When sending minimal rows,
  one can save binlog disk space, network resources and mysqld
  memory footprint.

BACKGROUND
==========

  Before and After Images
  -----------------------

  In row based replication row events may contain two copies for the
  row that they are changing. These are generally known as images. The
  first one, called *before image* (BI), contains data that existed on
  the row before it was actually changed. The second one, called
  *after image* (AI), contain the actual changes. Each, BI and AI,
  usage is confined to two different moments in the execution
  flow. The BI is used while the slave is searching for the row to be
  updated, while AI is used when replaying the changes in the row.

  Both have some restrictions:

    - BI: needs to hold a set of values that can be used by the
      slave to fetch the correct row. In other words, it should
      provide a set a values that *uniquely* identify the row to
      be changed;

    - AI: needs to hold values that are needed to replay all the
      changes, that were actually done during the original
      execution, in an identical set (meaning, same index
      structures, same engine, ...).

  Given that BI and AI have different usages, their usefulness can be
  mapped into the data modification row events:

    - Write_rows_log_event: *requires only AI*.

      There is no need for a BI because, we are adding a record,
      and not changing an existing one. The current
      implementation logs only the AI.

    - Delete_rows_log_event: *requires only BI*.

      There is no need for an AI because, the row ceases to
      exist, as it is removed. However, before removing it, one
      needs to find it, thence BI is required.

    - Update_rows_log_event: *requires both: AI and BI*.

      Both BI and AI, are required. The row needs to be found (BI
      comes to play) before being changed (AI comes to play).

  Summing up, BI must contain values that uniquely identifies rows,
  acting like a primary key equivalent (PKE), while AI must contain
  values that make possible changing the row according to the original
  execution.

  Primary Key Equivalent
  ----------------------

  Tables contain an index structure - mysql calls indexes keys. This
  structure holds information on which keys are declared, and these
  range from plain keys (K), to unique keys (UK), or even a primary
  key (PK).

  PK - When it comes to logging row based events, PK plays an
       important role, as it covers the BI requirement of uniquely
       identifying a given row just by searching using the PK
       value. Thence, if replicating the master contents, one should
       be fine by just logging the PK column. The slave would then use
       this value to search the correct row. Technically, "A PRIMARY
       KEY is a unique index where all key columns must be defined as
       NOT NULL. If they are not explicitly declared as NOT NULL,
       MySQL declares them so implicitly (and silently). A table can
       have only one PRIMARY KEY." [1]

  UK - Unique keys share the same usefulness of the PK, except that
       they need to be declared without nullable parts. From the
       manual [1]: "[...] For all engines, a UNIQUE index allows
       multiple NULL values for columns that can contain NULL."  In
       fact, if there is no PK declared in a table and an application
       requests one, MySQL "[...] returns the first UNIQUE index that
       has no NULL columns as the PRIMARY KEY." [1]
  
   K - Regular keys or nullable UK are of no particular interest when
       logging row events. Given that these are stripped from
       uniqueness property provided by UK and PK, these cannot be used
       to uniquely identify a row.

  There can be the case that a table does not declare any index at
  all, or all indexes are regular keys. In this case the master must
  ensure that the data provided in BI shall be enough to uniquely
  identify the row. As such, the alternative is to log the full
  row. This ensures that when searching (using the BI) and applying
  (using the AI) the next record to be fetched will be uniquely
  identified by all the fields in the BI. Should there be
  indistinguishable rows, searching and updating either one in any
  given order leads to a correct state.

  Consequently, a *Primary Key Equivalent* (PKE) is defined as:

    1. If a PK exists, the PKE is equal to the PK.

    2. Otherwise, if there exists a UK where all columns have the NOT
       NULL attribute, then that is the PKE (if there are more than
       one such UKs, then one is chosen arbitrarily).

    3. Otherwise, the PKE is equal to the set of all columns.

  Hereafter, we will be considering to PK to be a subset of PKE and it
  shall map into items 1. and 2. of PKE definition. Furthermore, if no
  explicit primary key nor UK NOT NULL exists in the table, it is said
  that the table has no PK.

PROBLEM STATEMENT
=================

  Given the definition of PKE, one can have a smaller set of columns
  to be logged (instead of the full row), and still be able to find
  (BI) and update a record (AI), while replaying the row event.

  This can be useful to reduce bandwidth (less network traffic) and
  storage (samller binlogs) usage as well as mysqld memory
  footprint. It becomes even more important if tables contain large
  blobs that do not need to be logged as part of the BI. Currently, in
  MySQL 5.1 GA, PKE is always assumed to be the full row, ie, the
  index structure is ignored. Consequently, AI and BI are always
  logged with all their columns.

  It should be possible for the user to configure this behavior
  such that he could request that BI and AI always log full rows,
  or a PK when available - for the BI - and changed columns only
  - for AI.

SOLUTION
========

  MySQL shall provide an option and the different configurations
  should be:

    - minimal: 
       Means PKE in the before image and changed columns
       in after image

    - full:
       Means all columns in both before and after image

    - noblob:
       Works as full but avoids sending blobs when these are not
       needed. Blobs are still replicated if:

         1. In AI, if they have been changed.
         2. In BI, if they are part of PK.

  It shall be named:

    --binlog-row-image={minimal,noblob,full}

  DEFAULT VALUE SHALL BE: 'FULL'.

  NOTE
  ----

  As a side remark, there was another option considered:

    - reversible 
       means PKE and changed columns in before and after image

  Nevertheless, it was decided that this will not be implemented
  for now, as there is no practical use for it (think Undo
  operations). So, for not risking to get the user confused, and
  because the 'noblob' option supersedes this one, we will put
  this on hold.

  Decision was made over Skype meeting: 
   - attendees: Alfranio, Mats and Luís
   - date: 09/09/09 (11:00 am CEST) 
      
REQUIREMENTS
============

  For the requirements statement, we assume that there is a mysqld
  instance that originally executes some statement and logs it as row
  events. Lets call this instance M (master). We also assume that
  there is a second instance that replays the events. Lets call this
  instance S (slave). S, may not necessarily be a slave, eg, if one is
  replaying some binlog file in an independent instance or even on a
  new master, etc...

  We also assume that there is a table on M Tm and a table on S
  Ts. Events are logged for changes in Tm and replayed on Ts. Unless
  stated otherwise, Tm and Ts are assumed to have same number of
  columns.

  Logging
  -------

  R1. Logging behavior must be done according to the one depicted in
      the following tables:

      - INSERTS

      |-----------+--------------+-----------------------------------|
      | OPTION    | BEFORE IMAGE | AFTER IMAGE                       |
      |-----------+--------------+-----------------------------------|
      | 'minimal' | ---          | All columns where a value         |
      |           |              | was specified. Autoinc columns    |
      |           |              | are also set if not specified.    |
      |-----------+--------------+-----------------------------------|
      | 'noblob'  | ---          | All columns where a value         |
      |           |              | was specified, and all            |
      |           |              | non-blob columns. Autoinc columns |
      |           |              | are also set if not specified.    |
      |-----------+--------------+-----------------------------------|
      | 'full     | ---          | All columns                       |
      |-----------+--------------+-----------------------------------|

      - UPDATES

      |-----------+----------------------------+---------------------------|
      | OPTION    | BEFORE IMAGE               | AFTER IMAGE               |
      |-----------+----------------------------+---------------------------|
      | 'minimal' | PKE                        | All columns where a value |
      |           |                            | was specified             |
      |-----------+----------------------------+---------------------------|
      | 'noblob'  | PKE + all non-blob columns | All columns where a value |
      |           |                            | was specified, and all    |
      |           |                            | non-blob columns          |
      |-----------+----------------------------+---------------------------|
      | 'full     | All columns                | All columns               |
      |-----------+----------------------------+---------------------------|

      - DELETES
 
      |-----------+----------------------------+-------------|
      | OPTION    | BEFORE IMAGE               | AFTER IMAGE |
      |-----------+----------------------------+-------------|
      | 'minimal' | PKE                        | ---         |
      |-----------+----------------------------+-------------|
      | 'noblob'  | PKE + all non-blob columns | ---         |
      |           |                            |             |
      |           |                            |             |
      |-----------+----------------------------+-------------|
      | 'full'    | All columns                | ---         |
      |-----------+----------------------------+-------------|

  Index structures
  ----------------

  - Searching for row:

  R1. If S and M differ in Ts and Tm index structure and there is no
      usable field in the BI for Ts' indexes, then S shall fall back
      to table scan when searching for the record.

      In this case, row matching will be done in a best-effort
      approach, the first row fetched from the SE that matches the
      existing data in the BI will be selected.

  R2. Let Tm be defined with a set of columns that do not exist in
      Ts - Cm. The remaining columns - Cc - exist on both Tm and
      Ts. If M is logging in minimal mode and Tm's PK is declared over
      Cm, then S will:

      - not search for the row and abort, if and only if, Tm's PK is
        declared *entirely* on Cm.

      - use available data on BI, regarding Cc, to conduct either
        index search/scan or table scan, depending on the data
        available for Cc in BI, and Ts' indexes.

  R3. For all binlog-row-image modes and when Ts has more columns than
      Tm, then S will always succeed when finding the row, regardless
      of any index structure differences between Tm and Ts.

  R4. If Ts index structure differs from Tm, then it shall use its own
      indexes while searching the row, provided that there are values
      for the index fields in the BI.

  - Applying the row:

  R5. A row event should always be applied successfully if Tm and Ts
      share the same index structure.

  R6. If Tm and Ts differ in index structures and Ts declares
      incompatible/stricter PK with Tm index, then replication
      may stop - eg, PK on slave does not allow duplicate entries
      for a given field that hadn't such restriction on
      master. This is the current MySQL 5.1 behavior.
  
  R7. If S is applying the events in IDEMPOTENT mode, then it is
      subject to the same restrictions as of MySQL 5.1.

  BLOBS
  -----

  - Related to BI logging:

  R1. If Tm declares a blob column (c1) and c1 is part of the PKE,
      then c1 shall always be replicated in the BI regardless of M's
      binlog-row-image mode.

  R2. If Tm declares a blob column (c1), which is not part of PKE, it
      shall not be logged when M is using minimal or noblob
      binlog-row-image modes.
 
  - Related to AI logging:

  R3. If Tm declares a blob column (c1) and c1 is updated/written then
      it shall be logged in the AI regardless of M's binlog-row-image
      mode.

  R4. If Tm declares a blob column (c1) and this column is not updated
      nor written, then c1 shall not be logged as part of AI when M is
      using noblob or minimal binlog-row-image modes.

  DEFAULT VALUES
  --------------

  - After Image

  R1. Columns missing a value in the AI are given default values
      according to the table definition from the database in
      which the row event is applied. This means that:

      1. Extra fields on slave get default values, as stated in
         the table definition:

         MASTER> CREATE TABLE t1 (a int, b int);
         SLAVE>  CREATE TABLE t1 (a int, b int, c int DEFAULT 100);
         MASTER> INSERT INTO t1 VALUES (1,1);
         sync_slave_with_master
         SLAVE>  SELECT * FROM t1;
         a   b   c
         1   1   100

      2. Fields missing a value in the AI (for instance when
         using 'MINIMAL' option) get the default value as stated
         in the table definition:

         MASTER> CREATE TABLE t1 (a int DEFAULT 100, b int);
         SLAVE>  CREATE TABLE t1 (a int DEFAULT 900, b int);
         MASTER> INSERT INTO t1(b) VALUES (1);
         MASTER> SELECT * FROM t1
         a   b
         100 1
         sync_slave_with_master
         SLAVE>  SELECT * FROM t1;
         a   b
         900 1

  R2. UPDATEs are not affected because they unpack changes (AI) 
      on top of a row fetched from the storage engine (SE).

      This means:

      1. DEFAULT values must have been handled at INSERT time 
         for the given row;
  
      2. Fields in the row fetched from the SE, that have no
         correspondent value in the AI, will keep their previous
         value.

LIST OF BUGS/WLs
================

  REFERENCES
  ----------

    - Implementation and tests:
    WL#5092  : RBR: Options for writing partial or full row images ...
    BUG#14068: RBR: Write only primary key instead of entire before image
    WL#5096  : Write test cases for WL#5092

    - Backports:
    BUG#33055: Replication fails for UPDATE when using falcon Storage engine 

    - Regressions found and fixed:
    BUG#47200: RBR: Absence of PK on slave leads to slave stop.
    BUG#47303: RBR: Replicating from master with PK into slave with KEY fails.
    BUG#49100: RBR: Unexpected behavior when AI contains no usable data ...
    BUG#53643: assert in Field_new_decimal::store_value on slave server
    BUG#53889: slaves stops with 1032; handler error HA_ERR_KEY_NOT_FOUND
    BUG#46554: RBR: pending rows event can be flushed too soon sometimes.


  NOTES
  -----

    N1. WL#5092/BUG#14068 builds on BUG#33055 work, meaning that
        patch for BUG#33055 had to be backported.

    N2. The following regressions were found during WL#5092 and it's
        likely that they have been introduced by BUG#33055, either 
        directly or indirectly, (they are now fixed as part of WL#5092 
        or in WL#5092 tree):
        - BUG#47200
        - BUG#47303
        - BUG#49100
        - BUG#46554

    N3. WL#5092 introduced the following regressions (found
        during QA analysis and are now fixed):
        - BUG#53643
        - BUG#53889

    N4. The following RBR bugs were found while working on WL#5092
        and are already fixed in 5.1+:
        - BUG#53893
        - BUG#47312

REFERENCES
==========

  [1] http://dev.mysql.com/doc/refman/5.1/en/create-table.html
  [2] BUG#33055
  [3] BUG#14068
  [4] BUG#47200
  [5] BUG#47303
  [6] WL#3281
  

PATCHES
=======

  The full patch history is in mysql-5.1-rpl-wl5092 and 
  mysql-next-mr-wl5092 trees.

PB2
===

  http://pb2.norway.sun.com/web.py?template=show_pushes&branch=mysql-next-mr-wl5092

REFACTORINGS
============

Rows_log_event::find_row method must be refactored so that it takes
into account several indexes and checks BI before actually using them.

In what follows the search_key_for_bi function tries to find a useful
index on the slave table. The order of the index searches PK, UK and
K. Search considers the BI available columns, the flags specifying
which kind of index, whether the index is active or not and if it is
defined in slave's table extra columns or not.

The proposed algorithm is the following:

def Rows_log_event::find_row:

  IF invalid_bi (table, &m_cols) THEN
    GOTO err;
  ENDIF

  LET key= CALL search_key_for_bi(table, m_cols, PK_FLAG);

  IF key == NULL THEN
    GOTO index_scan;
  ELSE
    LET error= CALL rnd_pos_by_record(...)
    RETURN error
  ENDIF

index_scan:

  LET key= CALL search_key_for_bi(table, m_cols, 
                                  PK_FLAG | UNIQUE_KEY_FLAG | MULTIPLE_KEY_FLAG);
  IF key == NULL THEN
    GOTO table_scan;
  ELSE
    CALL index_read_map(...)
    IF is_UK(key) OR is_PK(key) THEN
      GOTO ok;
    ELSE    
      WHILE (CALL record_compare(...))
        LET error= CALL index_next(...)
        IF error THEN
          GOTO err;
        ENDIF
      ENDWHILE
    ENDIF
    GOTO ok;
  ENDIF
        
table_scan:
  
  CALL rnd_next(...)
  WHILE error != eof AND (CALL record_compare(...))
    LET error= CALL rnd_next(...)
  ENDWHILE

  ASSERT (error == eof || error == 0)
  GOTO err;

ok:
  RETURN 0;

err:
  RETURN error;

ENDMETHOD