WL#2687: Write non-transactional binlog events directly to binary log

Affects: Server-5.5 — Status: Complete

Description
High Level Architecture
Low Level Design

Rationale
=========
Mixing transactional and non-transactional tables is not supported by the
current code-base and may lead to inconsistencies between the master and slaves.
In a nutshell, the problem stems from the fact that when mixing transactional
and non-transactional changes in in a transaction, it is often the case that the
non-transactional changes are not written to the binary log when they are made
visible, but rather are delayed and written together with the transactional
changes when the transaction commits.

For instance, consider the following statements and sessions (connections):

con1: BEGIN;
con1: INSERT INTO innodb_t values (1);
con1: UPDATE myisam_t SET a = 10;
con2: BEGIN;
con2: DELETE FROM myisam_t;
con2: COMMIT;
con1: COMMIT;

The binary log produced would look like this:

con2: BEGIN;
con2: DELETE FROM myisam_t;
con2: COMMIT;
con1: BEGIN;
con1: INSERT INTO innodb_t values (1);
con1: UPDATE myisam_t SET a = 10;
con1: COMMIT;

Clearly, on the master the table myisam_t would have with no rows. In contrast,
the slave would end up with (a = 10) in the same table. 

Consider the following statements and sessions (connections):

con1: BEGIN;
con1: INSERT INTO innodb_t values (1);
con2: INSERT INTO myisam_t values (1);
con1: UPDATE myisam_t SET a = a + 10;
con2: UPDATE myisam_t SET a = a * 10;
con1: COMMIT;

The binary log produced would look like this:

con2: INSERT INTO myisam_t values (1);
con2: UPDATE myisam_t SET a = a * 10;

con1: BEGIN;
con1: INSERT INTO innodb_t values (1);
con1: UPDATE myisam_t SET a = a + 10;
con1: COMMIT;

To simplify the reasoning, assume that the only data in table myisam_t is
(a = 1).  In such a case, the master would end up with (a = 110) and the slave
with (a = 20).

Goals
=====
The goal of this WL is to ensure that it is safe to mix changes on both
non-transactional and transactional tables under the ROW and MIXED modes.

Note however that MyISAM tables have issues since a crash might leave data in
the table on restart. So we can't guarantee results for non-transactional tables
unless they are crash-safe (see Section HLS).

Unfortunately, there is nothing that can be done to make this mixture completely
safe under the STMT mode. So, we will not change the behavior for STMT mode.

See bug:

- BUG#40278

Decisions regarding this WL
===========================
- This is a known issue with ROW in 5.1; the fix
  will therefore be pushed into the 5.1 tree.
  -- Trudy Pelzer, 2006-01-24
- Discussed with Mats Kindahl on February 9th.
  This is a larger job than anticipated and won't be
  ready in time for the 5.1 beta code freeze. It will
  be fixed in 5.2 instead.
  -- Trudy Pelzer, 2006-02-10
- Since this is only for non-transactional table updates within
  real transactions and the same problem has always existed for
  statement-based replication, Lars agrees that this is not
  needed before 5.2.
  -- Lars, 2007-02-20
-- This work should be pushed to 5.4
  -- Lars, 2009-06-09

Documentation
=============
After pushing this WL, we need to describe what is described here.

Highlights:
1 - Changing in behavior when in ROW or MIXED modes.
2 - The behavior was kept the same in STMT mode.
3 - Tagging statements as unsafe and how this influences MIXED mode.
4 - sync_bin_log does not happen per event but only when both caches 
are flushed.
5 - In general DDL events are written directly to the binary log. The
exceptions are the "CREATE TABLE ... SELECT FROM".


BUG REFERENCES / HISTORY
========================

 - CLOSED bugs were already pushed to 5.1
 - BUGS marked with WL#2687 will be fixed after WL#2687 is pushed.

|-----------+-------------------------------------------+---------+---------|
| REFERENCE | DESCRIPTION                               | STATUS  | MODE    |
|-----------+-------------------------------------------+---------+---------|
| BUG#28976 | Fixes flag issue that would               | CLOSED  | SBL     |
|           | log M stmts outside trans. boundaries     |         |         |
| BUG#40116 | Fix makes all events a transaction        | CLOSED  | SBL/RBL |
|           | to be written into the transaction cache. |         |         |
| BUG#44581 | Fixes the case where M stmts would        | CLOSED  | SBL     |
|           | lock timeout on N tables on slave         |         |         |
| BUG#45694 | Fixes the same case as BUG#44581          | CLOSED  | SBL     |
|           | but instead of lock timeout we            |         |         |
|           | have deadlock                             |         |         |
| BUG#46129 | Fixes all NTM problems for SBL            | CLOSED  | SBL     |
| BUG#46130 | Fixes Slave not correctly handling        | CLOSED  | SBL     |
|           | "expected errors"                         |         |         |
| BUG#46864 | Fixes uncovered bug by BUG#46129          | CLOSED  | SBL     |
|           | (ignored error flag on slave for          |         |         |
|           | a given stmt)                             |         |         |
| BUG#47287 | Reverts RBL behavior changed by BUG#46129 | CLOSED  | RBL     |
|-----------+-------------------------------------------+---------+---------|
| BUG#40278 | Too risky for 5.1 (lot of changes         | WL#2687 | RBL     |
|           | and refactoring around binlogging).       |         |         |
| BUG#43794 | DUPLICATE of BUG#40278                    | WL#2687 | RBL     |
| BUG#47175 | failing INSERT...SELECT in N              | WL#2687 | RBL     |
|           | table is not logged                       |         |         |
|-----------+-------------------------------------------+---------+---------|

Pushed to mysql-5.1-rep+3
=========================

revision-id: alfranio.correia@sun.com-20091103190256-637o8qxlveikrt3i

Definitions
===========
We are executing a transaction or statement containing a mixture of changes to
non-transactional and transactional tables.

We refer to tables managed by a non-transactional engine as N-tables and tables
managed by transactional engines as T-tables.

We further refer to statements accessing T-tables and changing at least one of
them as T-statements (or simply T); statements accessing N-tables and changing
at least one of them as N-statements (or simply N); statements referencing a
mixture of N- and T-tables and changing at least one of them as M-statements (or
simply M); M-statement is an abstraction for multi-table updates, triggers,
procedures and functions that references a mixture of N- and T-tables and
changes at least one of them.

A transaction can be either explicitly defined by calling START TRANSACTION or
BEGIN (B) or implicitly defined when auto-commit is not on, and is finished by
either calling commit (C) or rollback (R). In particular, a single statement
that accesses a T-table is implicitly in the boundaries of a transaction and
is either automatically committed if its execution succeeds or automatically
rolled back if it fails.

Any transactional table is by definition crash-safe but not all the
non-transactional tables are crash-safe. In MyIsam, for example, a statement
that fails do not have its changes rolled back, if there is any. Even by
changing individual rows, there is still a chance of having inconsistency. 


Solution
========
Row manipulations for N-tables are written to the binary log upon committing or
rolling back the statement that produced the changes, instead of being written
upon committing a transaction.

Note that this solution only works for ROW and MIXED mode and there is no
feasible solution for STMT mode. For the STMT mode, we just make sure that
consistency is preserved for T-tables and the few safe cases involving
N-statements are properly handled.


Solution in STMT mode
=====================
In general, if N-statements were written to the binary log upon committing or
rolling back the statement, causality (i.e. dependency among statements) might
be harmed thus possibly introducing inconsistency into T-tables even without
concurrency. For example, consider the following statements and session
(connection):

con1: BEGIN;
con1: INSERT INTO myisam_t values (1);
con1: INSERT INTO myisam_t values (2);
con1: INSERT INTO innodb_t values (1);
con1: UPDATE innodb_t set v = SELECT sum(*) from myisam_t;
con1: INSERT INTO myisam_t values (3);
con1: COMMIT;

The binary log produced would look like this:

con1: BEGIN;
con1: INSERT INTO myisam_t values (1);
con1: COMMIT
con1: BEGIN;
con1: INSERT INTO myisam_t values (2);
con1: COMMIT

con1: BEGIN;
con1: INSERT INTO innodb_t values (1);
con1: UPDATE innodb_t set v = SELECT sum(*) from myisam_t;
con1: INSERT INTO myisam_t values (3);
con1: COMMIT;

However, if you had written all the T-statements into the binary log upon
committing the statement, the innodb_t table would be inconsistent on the
master.

In general, it is impossible to automatically detect causality/dependency among
statements by just analyzing the statements sent to the server. This happen
because dependency may be hidden in the application code and it is necessary to
know a priori all the statements processed in the boundaries of a transaction
such as in a procedure. Moreover, even for the few cases that we could
automatically address in the server, the computation effort required could make
the approach infeasible.

So, we do as follows:

  - N-statements that happen before all statements that update T-tables in
  a transaction are written directly to the binary log upon committing or
 rolling back the statement.

  - Otherwise, N-statements are classified as unsafe, a warning message is
  printed out and their writing is delayed until a commit or rollback is
  requested on behalf of the transaction.

  - M-statements are always classified as unsafe and delayed.

  - When a N- or M-statement is rolled back, for example due to a duplicate key
  found in their N-table, changes may be left over if the table is not
  crash-safe. In this case, the statement is written to the binary log with 
  the associated error code.

  - Upon committing the transaction, the delayed statements are written to the
  binary log wrapped by a BEGIN/COMMIT.

  - Upon rolling back the transaction, the delayed statements are written to
  the binary log only if an N-table was changed. In this case, the transaction
  is written to the binary log wrapped in a BEGIN/ROLLBACK pair.

Note that in the STMT mode, the only safe combination happens when statements
that update N-tables are executed before all statements that update T-tables
in a transaction.


Solution in ROW Mode
====================
In ROW mode, we do as follows:

  - Rows from N-tables are written to the binary log upon committing or
  rolling back the statement that produced them.

  - In contrast, rows from T-tables are only written upon committing a
  transaction.

  - Clearly, rows from T-tables can be discarded upon rollback.

Note that this mode is completely safe and as such no warning message is printed
out.


Solution in MIXED Mode
======================
In this mode, changes may be written to the binary log as statements or rows.
Any statement classified as unsafe is automatically written to the binary log as
rows based on the facilities already provided by the server.

Thus, the MIXED mode is equivalent to the STMT mode when there is no unsafe
statement. Any unsafe statement, however, has its changes logged as rows which
makes it completely safe. Note that consistency may not be preserved if a user
switches between ROW/MIXED and STMT modes.

Note that M-statements such as 

1: INSERT INTO myisam_t SELECT * FROM innodb_t;

2: INSERT INTO innodb_t SELECT * FROM myisam_t;

are classified as unsafe to ensure that in mixed mode the execution is
completely safe and equivalent to the row mode. Consider the following
statements and sessions (connections) to understand the reason:

con1: INSERT INTO innodb_t VALUES (1);
con1: INSERT INTO innodb_t VALUES (100);

con1: BEGIN
con2: INSERT INTO myisam_t SELECT * FROM innodb_t;
con1: INSERT INTO innodb_t VALUES (200);
con1: COMMIT;

The point is that the concurrent statements may be written into the binary log
in a way different from the execution. For example,

BINARY LOG:

con2: BEGIN;
con2: INSERT INTO myisam_t SELECT * FROM innodb_t;
con2: COMMIT;
con1: BEGIN
con1: INSERT INTO innodb_t VALUES (200);
con1: COMMIT;

....

or

BINARY LOG:

con1: BEGIN
con1: INSERT INTO innodb_t VALUES (200);
con1: COMMIT;
con2: BEGIN;
con2: INSERT INTO myisam_t SELECT * FROM innodb_t;
con2: COMMIT;

Clearly, this may become a problem in STMT mode and setting the statement as
unsafe will make rows to be written into the binary log in MIXED mode and as
such the problem will not stand. 

In STMT mode, although such statement is classified as unsafe, i.e.

INSERT INTO myisam_t SELECT * FROM innodb_t;

there is no enough information to avoid writing it outside the boundaries of a
transaction. This is not a problem if we are considering snapshot isolation
level but if we have pure repeatable read or serializable the lock history on
the slave will be different from the master.

Summary of the solutions
========================
format  operation  T Before  error  action
------  ---------  --------  -----  ------
STMT    T          *         no     delay until commit/rollback is requested

STMT    T          *         yes    ignore them

STMT    N          no        no     no-delay write to the binary log

STMT    N          no        yes    no-delay write to the binary log

STMT    N          yes       no     delay until commit/rollback and warn
                                    "unsafe"

STMT    N          yes       yes    delay until commit/rollback and warn
                                    "unsafe"

STMT    M          *         no     delay until commit/rollback and warn
                                    "unsafe"

STMT    M          *         yes    delay until commit/rollback and warn
                                    "unsafe"

ROW     T          *         no     delay until commit/rollback is requested

ROW     T          *         yes    ignore them

ROW     N          *         no     no-delay write to the binary log

ROW     N          *         yes    no-delay write to the binary log

ROW     M          *         no     for rows from N-tables no-delay;
                                    for rows from T-tables delay until
                                    commit/rollback

ROW     M          *         yes    for rows from N-tables no-delay;
                                    ignore the rows from T-tables

STMT    C          *         no     write delayed changes to the binary log

STMT    C          *         yes    write delayed changes if a N-table was
                                    update wrapped by a rollback; otherwise
                                    ignore them

STMT    R          *         *      write delayed changes if a N-table was
                                    changed; otherwise ignore them

ROW     C          *         no     write delayed changes

ROW     C          *         yes    do nothing, rollback will be called

ROW     R          *         *      ignore them


Requirements
============

1. In STMT mode, causality (i.e. dependency among statements) *shall* be
   preserved by writing the statements to the binary log by the same order that
   they were executed in a transaction.

2. In ROW and MIXED modes, changes to N-tables *shall* be available in the
   binary log when the statement is committed or rolled back. On the 
   other hand, changes to T-tables *shall* be available in the binary log
   when the transaction is committed.

3. In the STMT or MIXED modes, N-statements that are executed before all
   T-statements in a transaction *shall* be available in the binary log when
   the *statement* is committed or rolled back.

4. A rollback of a transaction *shall* leave changes to N-tables in the binary
   log, if there is any. If along with such changes, changes to T-tables are
   also written, the transaction *shall* be end with a rollback to keep slaves 
   consistent with the master.

5. On a crash, any changes to N-tables before the last sync point (as specified
   by ``--sync-binlog`` shall be available/present on disk. No guarantee is
   made regarding changes to N-tables written after the last sync point.

Classifying a statement as unsafe
=================================
Note that the facilities to classify a statement as unsafe are already provided
by the server. In this WL, we only ensure that N-statements after an earlier
T-statement and also M-statements are classified as unsafe.

This means that both statements will trigger the printing out of the following
warning message in STMT mode: "Statement may not be safe to log in statement
format". And in MIXED mode, their changes will be written to the binary log as
rows.


Writing to the binary log
=========================
To guarantee that changes are corretly written to the binary log as previously
decribed, we have two cache structures per transaction: one for transactional
changes (trx-cache) and another one for non-transactional changes (stmt-cache).

The trx-cache is flushed and truncated upon committing or rolling back a
transaction. The stmt-cache is flushed and truncated upon committing or rolling
back a statement, i.e. while finishing it.


Mixing transactional and no-transactional statements
====================================================
The proposed implementation is divided into three layers as follows:

Layer 1 - Decides which format should be used: ROW or STATEMENT.
Layer 2 - Decides which type of cache should be used: stmt-cache or trx-cache.
Layer 3 - Flushes the cache to the binary log.

Note that only the flush operations must be synchronized. We omitted however
the synchronization primitives to avoid cluttering the description. We also
ommitted here the algorithm of some auxiliary functions. Namely,
stmt_tables_of(), row_tables_of(), all_is_transactional() and any_is_mixed()
whose description is presented below:

DESCRIPTION: Returns a set of tables referenced by an internal representation
             of a statement.

INPUT: stmt is a statement in an internal form.

OUTPUT: set of tables referenced by the statement.

set of tables <-- stmt_tables_of(stmt);

DESCRIPTION: Returns a set with the table referenced by a row.

INPUT: a row.

OUTPUT: a set with the table referenced by the row.

set of tables <-- row_tables_of(row);

DESCRIPTION: Checks if all tables in a set of tables passed are transactional.

INPUT: set_tables is a set of tables.

OUTPUT: TRUE if all the tables are transactional, FALSE otherwise.

bool <-- all_is_transactional(set_tables);

DESCRIPTION: Checks if in a set of tables, there are both transactional and
             non-transactional tables.

INPUT: set_tables is a set of tables.

OUTPUT: TRUE if there are both transactional and non-transactional tables,
        FALSE otherwise.

bool <-- any_is_mixed(set_tables);

Note that there are other auxiliary functions whose propose and behavior can
be easily inferred by the name of the function and its parameters. For that
reason, we don't describe them here. 


-----(Layer 1 - Deciding format)-----
INTRODUCTION: N/A

DESCRIPTION: This function decides which format should be used, either row or
statement, and is called before executing a statement and after parsing and
preparing it.

INPUT: mode represents either ROW, STMT or MIXED; stmt is a statement in an
internal form.

OUTPUT: either ROW or STMT which represent how the changes shall be written
to the binary log.

*FORMAT (mode, stm)* ->
  format= (mode == ROW ? ROW : STMT);
  if (mode == STMT or mode == STMT)
  {
    if ((!empty(trx-cache) && !all_is_transactional(stmt_tables_of(stm))) ||
      any_is_mixed(tables_of(stm)))
      if (mode == MIXED or mode == ROW)
        format= ROW;
      else
      {
        print out warning message;
        format= STMT;
      }
  }
  return (format);

-----(Layer 1 - Deciding format)-----

-----(Layer 2 - Caching data)-----
INTRODUCTION: The function to be used is chosen based on the format decision
previously described.

DESCRIPTION: This function writes a row to a cache.

INPUT: row is the changed row to be cached.

OUTPUT: N/A

*CACHE_ROW(row)* ->
   if (all_is_transactional(row_tables_of(row)))
      add_to(trx-cache, row)
   else
      add_to(stmt-cache, row)

DESCRIPTION: This function writes a statement to a cache.

INPUT: stm is an internal representation of the statement to be cached.

OUTPUT: N/A

CACHE_STMT(stm) ->

   if (empty(trx-cache) && !all_is_transactional(stmt_tables_of(stm)))
      add_to(stmt-cache, stm.string)
   else
      add_to(trx-cache, stm.string)

-----(Layer 2 - Caching data)-----

-----(Layer 3 - Flushing data)-----
INTRODUCTION: The function to be used is chosen based on the type of the
operation: commit or rollback. In what follows, we assume that a single
statement, either T, N or M that is successfully executed triggers the following
calls: - COMMIT(false) --> COMMIT(true). On the other hand a
transaction B T|N|M T|N|M C that is successfully executed triggers the
following calls: COMMIT(false) --> COMMIT(false) --> COMMIT(true). The
same reasoning applies to the rollback function.

DESCRIPTION: This function flushes the content of a cache upon commit.

INPUT: If all is equal to FALSE, a statement is being committed, otherwise
the transaction is being committed.

OUTPUT: N/A

COMMIT(all) ->
   if all then
      flush_with_boundaries(trx-cache, COMMIT)
   else
      flush_with_boundaries(stmt-cache, COMMIT)


DESCRIPTION: This function flushes the content of a cache upon rollback.

INPUT: If all is equal to FALSE, a statement is being rolled back, otherwise
the transaction is being rolled back.

OUTPUT: N/A

ROLLBACK(all) ->
   if all then
   {
      if (mode == ROW || mode == MIXED)
         truncate(trx-cache, 0)
      else
         flush_with_boundaries(trx-cache, ROLLBACK)
   }
   else
     flush_with_boundaries(stmt-cache, ROLLBACK)

-----(Layer 3 - Flushing data)-----

Definition table for Layer 1:
Format   (F): (statement (S), row (R))
Error    (E): (error (E), warning (W), ignored (I))


                     STMT            ROW               MIXED

1) B T T C/R         F - S           F - R             F - S
                     E - I           E - I             E - I

2) B T N C/R         F - S           F - R             F - R(N) S(T)
                     E - W(N) I(T)   E - I             E - I

3) T                 F - S           F - R             F - S
                     E - I           E - I             E - I

4) N                 F - S           F - R             F - S
                     E - I           E - I             E - I

5) M                 F - S           F - R             F - R
                     E - W           E - I             E - I

6) B N N T C/R       F - S           F - R             F - S
                     E - I           E - I             E - I

7) B N N C/R         F - S           F - R             F - S
                     E - I           E - I             E - I

8) B M T C/R         F - S           F - R             F - R(N) R(T) S(T)
                     E - W(M) I(T)   E - I             E - I

9) B M N C/R         F - S           F - R             F - R(N) R(T)
                     E - W(M) W(N)   E - I             E - I

10) B N M C/R        F - S           F - R             F - S(N) R(N) R(T)
                     E - W(M) I(N)   E - I             E - I

11) B T M C/R        F - S           F - R             F - R(N) R(T) S(T)
                     E - W(M) I(N)   E - I             E - I

Definition table for Layer 2 and 3:

Caching  (C): (stmt-cache (S), trx-cache (T), ignored (I))
Flushing (F): (upon committing/rolling back a statement (D), upon
              committing/rolling back a transaction (C/R), ignored (I))

                     STMT            ROW               MIXED

                     C - T           C - T             C - T
1.a) B T T C         F - C           F - C             F - C

                     C - T           C - T             C - T
1.b) B T T R         F - I           F - I             F - I

                     C - T           C - S(N) T(T)     C - S(N) T(T)
2.a) B T N C         F - C           F - D(N) C(T)     F - D(N) C(T)

                     C - T           C - S(N) T(T)     C - S(N) T(T)
2.b) B T N R         F - R           F - D(N) I(T)     F - D(N) I(T)

                     C - T           C - T             C - T
3.a) T               F - C           F - C             F - C

                     C - T           C - T             C - T
3.b) T with error    F - I           F - I             F - I

                     C - S           C - S             C - S
4.a) N               F - D           F - D             F - D

                     C - T           C - S(N) T(T)     C - S(N) T(T)
5.a) M               F - C           F - D(N) C(T)     F - D(N) C(T)

                     C - T           C - S(N) T(T)     C - S(N) T(T)
5.b) M with error    F - R           F - D(N) I(T)     F - D(N) I(T)

                     C - S(N) T(T)   C - S(N) T(T)     C - S(N) T(T)
6.a) B N N T C       F - D(N) C(T)   F - D(N) C(T)     F - D(N) C(T)

                     C - S(N) T(T)   C - S(N) T(T)     C - S(N) T(T)
6.b) B N N T R       F - D(N) R(T)   F - D(N) I(T)     F - D(N) I(T)

                     C - S(N)        C - S(N)          C - S(N)
7.a) B N N C         F - D(N)        F - D(N)          F - D(N)

                     C - S(N)        C - S(N)          C - S(N)
7.b) B N N R         F - D(N)        F - D(N)          F - D(N)

                     C - T           C - S(N) T(T)     C - S(N) T(T)
8.a) B M T C         F - C           F - D(N) C(T)     F - D(N) C(T)

                     C - T           C - S(N) T(T)     C - S(N) T(T)
8.b) B M T R         F - R           F - D(N) I(T)     F - D(N) I(T)

                     C - T           C - S(N) T(T)     C - S(N) T(T)
9.a) B M N C         F - C           F - D(N) C(T)     F - D(N) C(T)

                     C - S(N) T(M)   C - S(N) T(T)     C - S(N) T(T)
10.a) B N M C        F - D(N) C(M)   F - D(N) C(T)     F - D(N) C(T)

                     C - S(N) T(M)   C - S(N) T(T)     C - S(N) T(T)
10.b) B N M R        F - D(N) R(M)   F - D(N) I(T)     F - D(N) I(T)

                     C - T           C - S(N) T(T)     C - S(N) T(T)
11.a) B T M C        F - C           F - D(N) C(T)     F - D(N) C(T)

                     C - T           C - S(N) T(T)     C - S(N) T(T)
11.b) B T M R        F - R           F - D(N) I(T)     F - D(N) I(T)