WL#1720: Semi-synchronous replication

Affects: Server-5.5 — Status: Complete

SUMMARY
- The interface for semi-synchronous replication was done in WL#4398.
- This WL#1720 will be closed when the components have been pushed
  to the main tree and are part of the server releases.

RATIONALE

Get high availability (HA) by having the slave acknowledge a transaction
before it is committed. At COMMIT time on the master, wait for an ACK from the
slave before returning to the user. This ensures that the data always exists
in two places.

DESIGN LEFT TO DO

1. How to configure the master?  Especially if there are multiple slaves.
2. Ensure that the solution is general enough, e.g. that any engine
   can use it.
3. Check if there can be a clean interface to the "replicator".

DESCRIPTION

This is a feature request from a customer, who calls it "semi-synchronous
replication":
- the user sends COMMIT to the master
- the master writes the transaction to the binlog
- the master waits for the slave to say "I have received the transaction and
  saved it into my relay log" (ACK of the slave I/O thread), or the same plus
  "I have executed the transaction" (ACK of the slave SQL thread).

More precisely, the thread which writes to the binlog waits for a global
variable "slave_pos" (and "slave_binlog") to be updated (a condition should be
signaled), and the Binlog_dump thread is responsible for updating this global
variable: the Binlog_dump thread sends the events to the slave, waits for an
"OK" packet from the slave, then broadcasts the condition.

Ideally, for multi-statement transactions, the Binlog_dump thread should not
wait for a slave "OK" for every event, but only for the events which really
update the slave's data: for InnoDB, the COMMIT; for MyISAM, any statement. In
other words, all events which get written directly to the on-disk binary log,
plus the COMMIT event. "This event requires an OK" could be a flag in the
event.

If the master does not receive an OK before a timeout is exceeded, the master
switches this slave to asynchronous replication and prints a message to the
error log: "slave XXX was switched to async rep (i.e. no more waits on COMMIT)
because a timeout was exceeded for binary log YYY position ZZZ" (when we have
a transaction ID later, we can print it too).
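
The wait-with-timeout described above could be sketched as follows. This is
only an illustration, not the server's code: SlaveAckState, on_slave_ok() and
wait_for_slave_ack() are hypothetical names, and the sketch assumes a single
binlog file so that a position is one integer (it ignores "slave_binlog").

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Hypothetical shared state between the committing thread and Binlog_dump.
struct SlaveAckState {
  std::mutex mutex;
  std::condition_variable acked;  // broadcast by the Binlog_dump thread
  std::uint64_t slave_pos = 0;    // highest binlog position ACKed by the slave
  bool semi_sync = true;          // false once switched to async replication
};

// Called by the Binlog_dump thread when the slave's "OK" packet arrives.
void on_slave_ok(SlaveAckState &s, std::uint64_t acked_pos) {
  {
    std::lock_guard<std::mutex> lock(s.mutex);
    if (acked_pos > s.slave_pos) s.slave_pos = acked_pos;
  }
  s.acked.notify_all();
}

// Called by the committing thread after it has written to the binlog.
// Returns true if the slave ACKed commit_pos in time. Returns false if the
// slave is already async, or if the timeout expired (the slave is then
// switched to async so that later COMMITs no longer wait).
bool wait_for_slave_ack(SlaveAckState &s, std::uint64_t commit_pos,
                        std::chrono::milliseconds timeout) {
  assert(timeout.count() >= 0);
  std::unique_lock<std::mutex> lock(s.mutex);
  if (!s.semi_sync) return false;
  bool ok = s.acked.wait_for(lock, timeout,
                             [&] { return s.slave_pos >= commit_pos; });
  if (!ok) s.semi_sync = false;  // timeout exceeded: switch to async
  return ok;
}
```

A real implementation would also write the error log message at the point
where semi_sync is cleared.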

The DBA should be able to request semi-sync or async replication on the fly.

SHOW SLAVE HOSTS would report whether each slave is doing semi-sync or async
replication, and why (timeout exceeded, or the DBA requested this mode).

The customer needs "ACK that it's in the relay log" more urgently than
"ACK that it's executed by the slave". He would agree to sponsor it.

The customer also suggested that, if the master has more than one slave, the
master waits for a configurable number of slaves to have ACKed the transaction.
But the customer does not need this extension immediately.
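
As an illustration of the multi-slave extension, the check "have enough slaves
ACKed this transaction?" might look like the sketch below. The function name
and the per-slave position vector are assumptions made for the example; binlog
positions are again simplified to a single integer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: true once at least `required_acks` slaves have ACKed
// a binlog position at or beyond `commit_pos`.
bool enough_slaves_acked(const std::vector<std::uint64_t> &acked_pos_per_slave,
                         std::uint64_t commit_pos,
                         std::size_t required_acks) {
  assert(required_acks > 0);
  std::size_t count = 0;
  for (std::uint64_t pos : acked_pos_per_slave) {
    if (pos >= commit_pos && ++count >= required_acks) return true;
  }
  return false;
}
```

With required_acks set to 1 this reduces to the single-slave behaviour
described earlier.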

This new feature is somewhat similar to 2-phase commit: the master waits for
the slave's commit before saying "ok" to the user. However, it does not roll
back if the slave fails; hence it may be better suited to critical sites which
need uninterrupted operation.

Guilhem should continue discussion with the customer to clarify and amend the
above description, with Brian too. Monty presently agrees _on_the_idea_ of this
type of replication.

References:
- Google implemented a version of this feature:
  http://mysqlha.blogspot.com/2007/05/semi-sync-replication-for-mysql-5037.html


CAN WE USE THIS INTERFACE?
==========================
We should consider whether we can use (a superset/subset of)
this interface for semi-sync.  The following is a draft.

Components
----------
MySQL - MySQL Server
Replicator - Library that is linked with MySQL server

Structure
---------
  +--------------------------------+
  |             MySQL              |
  +--------------------------------+
  | Service and Plugin Interfaces  |
  +--------------------------------+
  |           Replicator           |
  +--------------------------------+


  
Distributed concurrency control
===============================
A separate thread in the Replicator checks whether the prioritized transaction
and the local transactions conflict.  If so, the local transaction
is rolled back.

This proposal supports only an optimistic approach.  For the
conservative approach, we also need a notifier hook at the beginning of
the transaction.

MySQL - Replicator Interface 
----------------------------
The interface has the following main calls and components:

1) Commit requestor: request_commit(trx_id) (MySQL -> Replicator)

   Before MySQL commits a transaction, it sends a request to the
   Replicator with the following information:

   a) Transaction Id

   The Replicator replies with Yes or No, indicating whether the transaction
   is allowed to commit.

   If "No", then the transaction is rolled back.
   If "Yes", then MySQL can commit the transaction.

   This should be implemented in handler.cc:ha_commit_trans(), after
   2PC prepare and right before the call to ha_commit_one_phase().

   Timeline:

        1             2   3     4   5       6        7         8
    ----|-------------|-------------|-------|--------|---------|--->
          prepare         |     ^     Store   commit   commit   
                          |     |      Xid     SE1      SE2    
                          v     |     
                      Commit Approval
                       (Set seq no)

   NOTE Recovery:
   If the MySQL server crashes between 4 and 5, then the Replicator will
   send the transaction to other nodes, while it is not being
   recovered by this MySQL server.  This means that the Replicator
   needs to ask the server for its last executed xid, and then
   send the lost local transactions back to the MySQL server for
   reapplication.  This is part of the recovery protocol.

   NOTE:
   The DBMS is not allowed to fail after step 5.  The global commit
   has already happened.

   Sample code:

     /* Request commit from replicator; "No" means roll back */
     if (!replicator->request_commit(xid)) { /* NEW */
       ha_rollback_trans(thd, all);         /* NEW */
       error= 1;                            /* NEW */
       goto end;                            /* NEW */
     }                                      /* NEW */
     error= ha_commit_one_phase(thd, all) ? (cookie ? 2 : 1) : 0;
     DBUG_EXECUTE_IF("crash_commit_before_unlog", abort(););
     if (cookie)
       tc_log->unlog(cookie, xid);
     DBUG_EXECUTE_IF("crash_commit_after", abort(););
   end:
     if (is_real_trans)
       start_waiting_global_read_lock(thd);
     /* Report commit to replicator, but not after a rollback or failure */
     if (!error)                            /* NEW */
       replicator->report_commit(xid);      /* NEW */

2) Commit report (MySQL -> Replicator)

   After MySQL has committed the transaction, it must report the 
   transaction id to Replicator.

   See code above.

3) Abort report  (MySQL -> Replicator)

   Add call in:
   handler.cc:ha_rollback_trans(THD *thd, bool all)

4) Master: Report rows: log_subscriber::report(event) (MySQL -> Replicator)

   The binlog event will be sent to the subscriber.

   Extractor
   ---------
   Every row that is applied in the handler interface is caught in the
   same way as row-based replication does and reported to Replicator:

   a) Row being changed/binlog event
   b) Transaction Id

   Alternatives:
   a) Trigger level: Catch rows
   b) Handler level: Catch rows
   c) Binlogging level: Catch binlog events 
      (log.cc:MYSQL_BIN_LOG::write(IO_CACHE*))

   DECISION: Try with c.

   Serialization
   -------------
   Will be implemented using the log_event.cc:write_data_header()
   functions, rewritten so that they do not write the data to a file but
   keep it in a buffer.

5) Transaction start (MySQL -> Replicator)

   Called when a transaction is started.

   a) Transaction id

6) Rollback request (Replicator -> MySQL)

   Replicator can at any time call MySQL with a request to rollback an
   ongoing local transaction.  MySQL then needs to roll it back.

7) Slave Executor

   Alternatives:
   a) SQL level - outside client
   b) SQL level - mysql_parse()
   c) Binlog events
   d) Individual rows, execute on handler level

   DECISION: Try with c.

   FILTERING
   ---------
   All events except rows events (3 types) and table map events 
   are discarded. 

   EXECUTION
   ---------
   This would probably mean taking the
   sql_binlog.cc:mysql_client_binlog_statement(THD* thd)
   function, stripping away the base64 stuff, and using the
   same method to execute events.

8) Recovery function  si::get_last_executed_transaction()

   At recovery time the Replicator needs to know the last executed
   transaction, so that it can replay any local transactions that
   were lost if the server failed between 4 and 5 above, and any remote
   transactions that were not yet applied.

   This is probably some functionality in log.cc for XA recovery.
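
To make the shape of the MySQL -> Replicator calls concrete, here is a small
sketch of calls 1, 2, 3 and 5 as an abstract class, with a toy implementation.
It is a draft under stated assumptions: the class and method names other than
request_commit and report_commit are hypothetical, a transaction id is assumed
to be a 64-bit integer, and commit_transaction() only imitates the control
flow of the sample code in call 1.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

using TrxId = std::uint64_t;  // assumed representation of a transaction id

// Calls made by the server into the replicator (MySQL -> Replicator).
class Replicator {
 public:
  virtual ~Replicator() = default;
  virtual void transaction_start(TrxId trx) = 0;  // call 5
  virtual bool request_commit(TrxId trx) = 0;     // call 1: true means "Yes"
  virtual void report_commit(TrxId trx) = 0;      // call 2
  virtual void report_abort(TrxId trx) = 0;       // call 3
};

// Toy replicator: says "No" to transactions marked as conflicting.
class ToyReplicator : public Replicator {
 public:
  void mark_conflict(TrxId trx) { conflicting_.insert(trx); }
  void transaction_start(TrxId trx) override { active_.insert(trx); }
  bool request_commit(TrxId trx) override {
    assert(active_.count(trx) == 1);  // the transaction must have started
    return conflicting_.count(trx) == 0;
  }
  void report_commit(TrxId trx) override { active_.erase(trx); ++committed; }
  void report_abort(TrxId trx) override { active_.erase(trx); ++aborted; }
  int committed = 0;
  int aborted = 0;

 private:
  std::set<TrxId> active_;
  std::set<TrxId> conflicting_;
};

// Imitates the commit path: ask first, then commit or roll back, then report.
bool commit_transaction(Replicator &r, TrxId trx) {
  if (!r.request_commit(trx)) {
    r.report_abort(trx);  // stands in for ha_rollback_trans()
    return false;
  }
  // ... the storage engine commit would happen here ...
  r.report_commit(trx);
  return true;
}
```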

WL#4398: Replication Interface for semi-synchronous replication

Semi-synchronous replication has two components: the semi-sync master and
semi-sync slave plugins.

The BINLOG_DUMP command has a flag argument, which is zero by
default. When the semi-sync plugin is loaded and enabled on the slave,
this flag is set to BINLOG_SEMI_SYNC (0x0002) to request
semi-synchronous replication behaviour from the master. If semi-sync is
enabled on the master, the master checks this flag before it starts
dumping the binlog.
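
The flag handshake can be sketched as two small functions, one per side. The
value 0x0002 comes from the text above; the function names and the 16-bit flag
width are assumptions made for this example.

```cpp
#include <cassert>
#include <cstdint>

// Bit value taken from the description above.
constexpr std::uint16_t BINLOG_SEMI_SYNC = 0x0002;

// Slave side: add the semi-sync bit to the BINLOG_DUMP flags when the
// slave plugin is loaded and enabled. Other bits are left untouched.
std::uint16_t build_dump_flags(std::uint16_t base_flags,
                               bool slave_semi_sync_enabled) {
  if (slave_semi_sync_enabled)
    return static_cast<std::uint16_t>(base_flags | BINLOG_SEMI_SYNC);
  return base_flags;
}

// Master side: semi-sync is used for this dump only if it is enabled on
// the master and the slave requested it.
bool master_uses_semi_sync(std::uint16_t dump_flags,
                           bool master_semi_sync_enabled) {
  return master_semi_sync_enabled && (dump_flags & BINLOG_SEMI_SYNC) != 0;
}
```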

Semi-sync is enabled on the master by setting the global variable
`rpl_semi_sync_enabled' to 1, and disabled by setting it to 0. There is also
a status variable `Rpl_semi_sync_status' that indicates whether synchronous
replication is on or off. Synchronous replication is on by default
when semi-sync is enabled on the master. It is turned off when the master
fails to get an ack from the slave before the timeout (rpl_semi_sync_timeout),
and turned on again when the slave catches up.

When semi-sync is enabled on the master (rpl_semi_sync_enabled = 1) and
synchronous replication is on (Rpl_semi_sync_status = ON), the master
waits for an ack at the end of each transaction. The master only waits
for an ack for the last event of the transaction. To tell the
slave to issue a reply for the last event of a transaction, the master
reserves two extra bytes in the event packet header: the first byte is
a magic number (0xEF), and the second byte is a sync flag, which is set
when the event is the last event of a transaction. The
slave checks the sync flag to decide whether it needs to send an ack for
the event received from the master. The master does not wait forever for
the ack from the slave: it gives up if it does not get an ack within a
given period of time (rpl_semi_sync_timeout), and then turns off
synchronous replication. The master does not wait for an ack for
transactions while synchronous replication is turned off. When the
slave catches up, the master automatically turns synchronous
replication on again if semi-sync is enabled.
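
The on/off behaviour of synchronous replication can be modelled as a small
state tracker. This is a sketch only: the class and its methods are
hypothetical, positions are simplified to one integer, and "the slave catches
up" is assumed to mean that the slave has acked the last transaction written
to the binlog.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the master's semi-sync status.
class SemiSyncStatus {
 public:
  // `enabled` plays the role of rpl_semi_sync_enabled; synchronous
  // replication starts ON when semi-sync is enabled.
  explicit SemiSyncStatus(bool enabled) : enabled_(enabled), on_(enabled) {}

  // A transaction ending at `pos` was written to the binlog.
  void on_binlog_write(std::uint64_t pos) {
    if (pos > written_pos_) written_pos_ = pos;
  }

  // The wait for an ack exceeded rpl_semi_sync_timeout.
  void on_ack_timeout() { on_ = false; }

  // The slave acknowledged everything up to `pos`.
  void on_slave_ack(std::uint64_t pos) {
    if (pos > acked_pos_) acked_pos_ = pos;
    // Assumed catch-up rule: nothing written is left unacknowledged.
    if (enabled_ && !on_ && acked_pos_ >= written_pos_) on_ = true;
  }

  // True if a committing transaction should wait for an ack.
  bool should_wait() const { return enabled_ && on_; }

 private:
  bool enabled_;
  bool on_;
  std::uint64_t written_pos_ = 0;
  std::uint64_t acked_pos_ = 0;
};
```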

Semi-sync master
================

The semi-sync master component includes class ActiveTranx and class
ReplSemiSyncMaster. The semi-sync master plugin implements the
following observers: Trans_observer, Binlog_storage_observer,
Binlog_transmit_observer.

ActiveTranx
-----------

This class maintains a list of active transactions which are waiting
for an ack from slave.

ReplSemiSyncMaster
------------------

This is the main class of the semi-sync master component. It maintains
a list of active transactions that need an ack from the slave, and
receives acks from the slave.

`updateSyncHeader'

Set the sync flag in the extra header if the binlog filename and
position of the event being sent are recorded in the active
transaction list.

`writeTranxInBinlog'

Create a new active transaction node, record the binlog filename
and position, insert the new node into the active list, and update the
largest position for which the master is waiting for a reply.
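
A minimal sketch of such an active transaction list is given below. It is not
the ActiveTranx class itself: the names are hypothetical, clear_up_to() is an
assumed helper for dropping acknowledged entries, and binlog file names are
assumed to share a prefix and a fixed-width numeric suffix so that string
order matches file order.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <string>
#include <tuple>

struct BinlogPos {
  std::string file;      // e.g. "master-bin.000002"
  std::uint64_t offset;  // end position of the transaction in that file
};

// Order by file name first, then by offset within the file.
inline bool operator<(const BinlogPos &a, const BinlogPos &b) {
  return std::tie(a.file, a.offset) < std::tie(b.file, b.offset);
}

class ActiveTranxSketch {
 public:
  // What writeTranxInBinlog does: remember where a transaction ends.
  void insert(const BinlogPos &end_pos) { nodes_.insert(end_pos); }

  // What updateSyncHeader asks: does an event at `pos` end a transaction?
  bool is_tranx_end(const BinlogPos &pos) const {
    return nodes_.count(pos) != 0;
  }

  // Assumed helper: forget every transaction at or before `acked`.
  void clear_up_to(const BinlogPos &acked) {
    nodes_.erase(nodes_.begin(), nodes_.upper_bound(acked));
  }

  bool empty() const { return nodes_.empty(); }

  // Largest position the master is waiting on; the list must not be empty.
  const BinlogPos &largest() const {
    assert(!nodes_.empty());
    return *nodes_.rbegin();
  }

 private:
  std::set<BinlogPos> nodes_;
};
```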

`readSlaveReply'

Called by the dump thread to wait for and read the ack from the slave, and to
resume transactions when it gets a reply.

`commitTrx'

Called by the SQL transaction thread to wait for an ack from the
slave. The thread resumes when the dump thread gets a reply for this
transaction, or on timeout.

Transaction observer
--------------------

`after_commit'

The master waits for an ack from the slave after the commit to the storage
engines.


Binlog storage observer
-----------------------

`after_flush'

Create a new active transaction and record the binlog file name and
position to wait.

Binlog transmit observer
------------------------

`transmit_start'

Check the BINLOG_DUMP flag to see if the slave requests semi-synchronous
replication. Increase `rpl_semi_sync_clients' if it does.

`transmit_stop'

Decrease `rpl_semi_sync_clients' if semi-sync is enabled.

`reserve_header'

Reserve two extra bytes in the event packet header if the slave requested
semi-synchronous replication.

The first byte is the magic number, 0xEF.

The second byte is the sync flag: 0 by default, 1 if the event needs a reply.
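
A sketch of the master side of this two-byte header follows. The magic value
0xEF is taken from the WL#4398 overview; the constant and function names are
hypothetical, and the sketch assumes the two bytes sit at the very start of an
otherwise empty packet buffer, with the event appended after them.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr unsigned char kMagic = 0xEF;     // magic number (first byte)
constexpr unsigned char kSyncFlag = 0x01;  // "this event needs a reply"

// Like reserve_header: append the two extra bytes, with the sync flag
// cleared. Returns the offset of the flag byte so it can be set later.
std::size_t reserve_header(std::vector<unsigned char> &packet) {
  packet.push_back(kMagic);
  packet.push_back(0);
  return packet.size() - 1;
}

// Like before_send_event: mark the event as requiring an ack.
void set_sync_flag(std::vector<unsigned char> &packet,
                   std::size_t flag_offset) {
  assert(flag_offset < packet.size());
  packet[flag_offset] = kSyncFlag;
}
```

Reserving the bytes first and setting the flag later matches the split between
`reserve_header' and `before_send_event': the header is always present for a
semi-sync slave, and only the flag byte changes per event.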

`before_send_event'

Check if the position of the event being sent is recorded in the
active transaction list maintained by the master, and set the sync flag in
the reserved header if it is.

`after_send_event'

If we have set the sync flag above, then we need to wait for the ack
from slave here. Wait for at most `rpl_semi_sync_timeout'
milliseconds.

`after_reset_master'

Reset semi-sync master status variables.

Semi-sync slave
===============

The semi-sync slave plugin implements the Binlog_relay_IO_observer.

ReplSemiSyncSlave
-----------------

Main class of the semi-sync slave component.

`slaveReadSyncHeader'

Read the semi-sync extra header, and get the sync flag and the payload of the
event after removing the extra header.

`slaveReply'

Send a reply to the master about the event just received.

Binlog relay IO observer
------------------------

`thread_start'

Set `rpl_semi_sync_slave_status' to ON.

`thread_stop'

Set `rpl_semi_sync_slave_status' to OFF.

`before_request_transmit'

Set the BINLOG_SEMI_SYNC bit in the argument of the BINLOG_DUMP command
sent to the master.

`after_read_event'

Check the magic number and sync flag in the event packet, remove the
extra bytes in the header.

`after_queue_event'

If the sync flag is set for this event packet, send a reply to the master.
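
The slave-side handling in `after_read_event' and `after_queue_event' could be
sketched as below. The names are hypothetical, the magic value 0xEF is taken
from the WL#4398 overview, and the two extra bytes are assumed to be the first
two bytes of the buffer passed in.

```cpp
#include <cassert>
#include <vector>

constexpr unsigned char kMagic = 0xEF;     // magic number (first byte)
constexpr unsigned char kSyncFlag = 0x01;  // "this event needs a reply"

struct ParsedPacket {
  bool valid;                          // magic number matched
  bool need_reply;                     // sync flag was set
  std::vector<unsigned char> payload;  // event without the extra header
};

// Check the magic number, read the sync flag, and strip the two extra
// bytes so that the rest of the slave sees an ordinary event.
ParsedPacket read_sync_header(const std::vector<unsigned char> &packet) {
  ParsedPacket out{false, false, {}};
  if (packet.size() < 2 || packet[0] != kMagic) return out;
  out.valid = true;
  out.need_reply = (packet[1] & kSyncFlag) != 0;
  out.payload.assign(packet.begin() + 2, packet.end());
  return out;
}
```

After the payload has been queued to the relay log, the slave would send its
reply only when need_reply is true.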

`after_reset_slave'

Nothing currently.

Reference
=========
The design of the replication interface is available here:
   http://forge.mysql.com/wiki/ReplicationFeatures/ReplicationInterface