WL#3719: Replication table checksums

Affects: Server-6.x   —   Status: Assigned

MOTIVATION
----------
User wants to know that the master and slave servers contain the same data.
User wants to check this using an online algorithm.

POSSIBLE IMPLEMENTATION
-----------------------
- When executing CHECK REPLICATION FOR TABLE t, a checksum of the
  table is generated on the master and sent to the slave in a
  separate event, and the slave checks that its table contents are
  consistent with what the master had.

DECISION REGARDING THIS WL
--------------------------
- This was discussed at the Stockholm server lead meeting, and
  Brian and Monty really wanted it, both to ensure that row-based
  replication is well-tested and to make some customers happy.

IMPLEMENTATION
--------------
- The master calls the checksum function for a certain table.
- There is a new event "TABLE CHECKSUM FOR TABLE t".
- When the slave receives the event, it checks the checksum against
  its own copy of the table.
- If the checksums differ, give an error and stop replication (a
  sketch of this check follows below).
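
A minimal sketch of the slave-side check. All names here are
illustrative stand-ins, not actual server API: Table_checksum_event,
compute_table_checksum(), and stop_sql_thread() do not exist in the
server source.

  #include <cstdio>

  // Illustrative stand-ins; assumed for this sketch only.
  unsigned long long compute_table_checksum(const char *db, const char *table);
  void stop_sql_thread();

  struct Table_checksum_event {
    const char *db;          // database of the checked table
    const char *table;       // table name
    unsigned long long sum;  // checksum computed on the master
  };

  // Recompute the checksum locally, compare with the master's value,
  // and stop replication with an error on mismatch.
  bool apply_table_checksum_event(const Table_checksum_event &ev) {
    unsigned long long local = compute_table_checksum(ev.db, ev.table);
    if (local != ev.sum) {
      fprintf(stderr, "Table %s.%s is not consistent with master\n",
              ev.db, ev.table);
      stop_sql_thread();
      return false;
    }
    return true;
  }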

OPTIONAL
--------
- When the table has been verified on the slave, a note is printed
  in the error log saying "Table t1 verified to be consistent with
  master".

Related community work
----------------------
This Perl script does much the same thing:

"MySQL Table Checksum 1.1.0 adds many improvements, but the most important is
a new way to ensure slaves have the same data as their master. Instead of
checksumming the slave and the master, it can now insert the checksum
results directly on the master via an INSERT.. SELECT statement. This
statement will replicate to the slave, where a simple query can find tables
that differ from the master. This makes a consistent, lock-free checksum
trivially easy:
http://www.xaprb.com/blog/2007/05/05/mysql-table-checksum-110-released/"

Maintaining a hash incrementally
================================

Hash functions
--------------

Let *Record* be the domain of all records for a table, and let *Hash* denote
the finite set of all record hashes. Let *compute :: Record -> Hash* be a
function from records to record hashes such that the probability of collisions
is minimal.

Furthermore, let *Digest* be the domain of table hashes/digests (we use the
word *digest* to differentiate it from the record hash) and let *insert ::
(Digest, Hash) -> Digest* and *delete :: (Digest, Hash) -> Digest* be functions
such that for any table digest *digest* and record hash *hash* the following
properties hold:

    delete(insert(digest,hash),hash) = digest

    insert(delete(digest,hash),hash) = digest

We also assume that there is a unique element 0 in *Digest* (called the
*identity*) such that:

    insert(0,hash) = hash

In addition, we have a function *combine :: (Digest, Digest) -> Digest* such
that for any table digest *digest*, record hashes *hash* and *hash'*, and any
two functions *f* and *f'*, each being either *insert* or *delete*:

    combine(f(digest,hash), f'(0,hash')) = f'(f(digest,hash), hash')

That is, *combine* folds a delta digest built up from the identity into a
full table digest; this is what allows the changes of a transaction to be
merged into the table digest at commit time.

It is assumed that *delete* and *insert* have a minimal risk of collision,
i.e., given a table digest *digest* and two distinct record hashes *hash* and
*hash'*, it should be improbable that:

    insert(digest, hash) = insert(digest, hash')
    delete(digest, hash) = delete(digest, hash')

Typical examples of the record hash function are MD5 or any of the SHA
functions. Finding a good pair of *insert* and *delete* functions requires
research.
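
As a toy illustration (an assumption for concreteness, not a proposal for the
production functions), record hashes can be folded into the digest using
addition modulo 2^64: *insert* adds the record hash, *delete* subtracts it,
and *combine* adds two digests. This satisfies the properties above, though a
real design would need to be analyzed against deliberate collision attacks.
The sketch below uses FNV-1a as a stand-in record hash:

  #include <cassert>
  #include <cstdint>
  #include <cstring>

  typedef uint64_t Hash;
  typedef uint64_t Digest;

  // Stand-in record hash: FNV-1a over the record's bytes.
  Hash compute(const void *rec, size_t len) {
    const unsigned char *p = static_cast<const unsigned char *>(rec);
    Hash h = 1469598103934665603ULL;        // FNV-1a offset basis
    for (size_t i = 0; i < len; i++)
      h = (h ^ p[i]) * 1099511628211ULL;    // FNV-1a prime
    return h;
  }

  // Addition modulo 2^64 forms a commutative group, giving the
  // insert/delete/combine properties with 0 as the identity.
  Digest insert_hash(Digest d, Hash h) { return d + h; }
  Digest delete_hash(Digest d, Hash h) { return d - h; }
  Digest combine(Digest d1, Digest d2) { return d1 + d2; }

  int main() {
    Hash h1 = compute("record one", 10);
    Hash h2 = compute("record two", 10);
    Digest d = insert_hash(insert_hash(0, h1), h2);

    assert(delete_hash(insert_hash(d, h1), h1) == d);  // delete undoes insert
    assert(insert_hash(delete_hash(d, h1), h1) == d);  // insert undoes delete
    assert(insert_hash(0, h1) == h1);                  // 0 is the identity
    // combine folds a delta digest into a full digest
    assert(combine(insert_hash(d, h1), delete_hash(0, h2)) ==
           delete_hash(insert_hash(d, h1), h2));
    return 0;
  }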


Data structures
---------------

In order to describe the functions below, we define the following structure:

  struct Table {
    Digest digest;   // digest of the committed table contents
    Digest trans;    // delta digest of changes in the current transaction
  };


Bootstrapping and Recovery
--------------------------

On starting the server, the digest is computed by iterating over the records
of each table that should be tracked and folding each record hash into the
table digest, according to the following algorithm:

  function bootstrap(table)
    table.digest = 0
    for rec in table.records() do
      table.digest = insert(table.digest, compute(rec))
    end
  end


Maintaining the digest
----------------------

In order to maintain the digest, the following functionality has to be
implemented for the handler functions **ha_write_row()**, **ha_update_row()**,
and **ha_delete_row()**, as well as hooks for transaction begin, commit, and
rollback; a compilable sketch follows the pseudocode below.

    function ha_begin(table)
      -- start the transaction with an empty delta digest
      table.trans = 0
    end

    function ha_commit(table)
      -- fold the transaction's delta into the table digest
      table.digest = combine(table.digest, table.trans)
    end

    function ha_rollback(table)
      -- discard the transaction's delta
      table.trans = 0
    end

    function ha_write_row(table, rec)
      table.trans = insert(table.trans, compute(rec))
    end

    function ha_delete_row(table, rec)
      table.trans = delete(table.trans, compute(rec))
    end

    function ha_update_row(table, old_rec, new_rec)
      table.trans = delete(table.trans, compute(old_rec))
      table.trans = insert(table.trans, compute(new_rec))
    end
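
To check that the scheme hangs together, here is a compilable sketch using
the toy additive digest from the earlier example. For brevity, rows are
represented directly by their hashes rather than by records; all names are
illustrative, not actual server API:

  #include <cassert>
  #include <cstdint>

  typedef uint64_t Hash;
  typedef uint64_t Digest;

  // Toy digest algebra: addition modulo 2^64 (see the earlier sketch).
  Digest insert_hash(Digest d, Hash h) { return d + h; }
  Digest delete_hash(Digest d, Hash h) { return d - h; }
  Digest combine(Digest d1, Digest d2) { return d1 + d2; }

  struct Table {
    Digest digest;  // digest of the committed table contents
    Digest trans;   // delta digest of the current transaction
  };

  void ha_begin(Table &t)    { t.trans = 0; }
  void ha_commit(Table &t)   { t.digest = combine(t.digest, t.trans); }
  void ha_rollback(Table &t) { t.trans = 0; }
  void ha_write_row(Table &t, Hash h)  { t.trans = insert_hash(t.trans, h); }
  void ha_delete_row(Table &t, Hash h) { t.trans = delete_hash(t.trans, h); }
  void ha_update_row(Table &t, Hash old_h, Hash new_h) {
    t.trans = delete_hash(t.trans, old_h);
    t.trans = insert_hash(t.trans, new_h);
  }

  int main() {
    Table t = {0, 0};

    ha_begin(t);               // insert two rows and commit
    ha_write_row(t, 5);
    ha_write_row(t, 7);
    ha_commit(t);
    Digest committed = t.digest;

    ha_begin(t);               // update a row, then roll back
    ha_update_row(t, 5, 9);
    ha_rollback(t);
    assert(t.digest == committed);  // rolled-back changes leave no trace

    ha_begin(t);               // delete both rows and commit
    ha_delete_row(t, 5);
    ha_delete_row(t, 7);
    ha_commit(t);
    assert(t.digest == 0);     // empty table is back at the identity
    return 0;
  }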

Transferring the checksum to the slave
--------------------------------------

In order to transfer the table checksum to the slave in such a way that older
versions of the slave do not stop because they receive an unrecognized event,
the table checksum will be added to the variable header of the
Table_map_log_event.

With the introduction of an explicit Begin_stmt_log_event (introduced in
WL#????), the position of the table map log events is free and no longer
tightly connected to a statement. The table map log event then serves to
transfer information about a table to the slave: for example, checksums.
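
A sketch of how such a checksum field might be appended to the variable
header, assuming a type-length-value layout with a made-up field code;
TM_CHECKSUM_FIELD and append_checksum_field() are hypothetical, and the
actual layout and codes would be fixed by the binlog format:

  #include <cstdint>
  #include <cstring>
  #include <vector>

  // Hypothetical field code for a table checksum in the variable header.
  const uint8_t TM_CHECKSUM_FIELD = 0x80;

  // Append the checksum as a type-length-value field. A reader that does
  // not recognize the field code can skip 'length' bytes, which is what
  // lets an older slave ignore the checksum instead of stopping.
  void append_checksum_field(std::vector<uint8_t> &header, uint64_t checksum) {
    header.push_back(TM_CHECKSUM_FIELD);                        // type
    header.push_back(static_cast<uint8_t>(sizeof(checksum)));   // length
    uint8_t buf[sizeof(checksum)];
    // Value bytes; shown in host byte order here, although the binlog
    // format stores integers little-endian.
    std::memcpy(buf, &checksum, sizeof(checksum));
    header.insert(header.end(), buf, buf + sizeof(buf));
  }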