WL#7169: Semisync: make master wait for more than one slave to ack back

Affects: Server-5.7 — Status: Complete

EXECUTIVE SUMMARY
=================

This worklog implements an option to make the master wait for N slaves
to acknowledge back, instead of just one, when semisync is ON. Choosing
to wait for N slaves (N > 1) increases resiliency to consecutive random
failures. It also improves transaction durability, as a transaction
gets persisted in more than two servers before the results are
externalized on the master.

USER BENEFIT
============

When semisync is turned ON, the master holds the session until it
gets an acknowledgement from a slave that it has written the
transaction to the relay log. If both the master and the semisync slave
crash, then the transaction is lost.

After this worklog, the user will be able to specify that a
transaction should be written to the relay log in N slaves before the
master returns the session to the client. As such, in the event the
master and F slaves crash at the same time (where F < N), either the
client has not seen the results of its transaction, or the transaction
is already persisted in some other slave. Consequently, a slave
promoted to become the new master will have learned about this
transaction (since promotion will create a candidate that knows all
transactions in the system [1]).

REFERENCES
==========
[1] http://svenmysql.blogspot.co.uk/2013/03/flexible-fail-over-policies-using-mysql.html
BUG#11762792: NUMBER OF SLAVES THAT MASTER WAITS BEFORE COMMIT

USER DOCUMENTATION
==================

http://dev.mysql.com/doc/refman/5.7/en/server-system-variables.html#sysvar_rpl_semi_sync_master_wait_for_slave_count
http://dev.mysql.com/doc/relnotes/mysql/5.7/en/news-5-7-3.html

Interface Specification
=======================
I-1: rpl_semi_sync_master_wait_for_slave_count
     It is a system variable defined in the semisync master plugin. It sets the
     minimum number of acks, from different slaves, that a transaction must
     wait for before it can proceed to engine commit or reply to the user.

     Scope:   GLOBAL
     Dynamic: YES
     Type:    NUMERIC
     Range:   1-65535
     Default: 1

     Note: The current design does not perform well with large values. It is
           optimized for small values, because almost no users need to set a
           large value.

Analysis On Original Semisync Process
=====================================

* Transaction Sessions
  1. Create ack request node when binlogging
  2. Wait until reply_file_name_ and reply_file_pos_ exceed its position
  3. engine commit

* Dump Threads
  1. Send events and ack request
  2. Receive the ack
  3. reportReplyBinlog()
     Update reply_file_name_ and reply_file_pos_ after receiving an ack.
     Remove all ack request nodes whose positions are smaller than the ack's
     position.
     Wake up transaction sessions.

  Note: Steps 2 and 3 will be done in a separate thread after WL#6630.

Design of the worklog
=====================
From the above analysis we can see that reply_file_name_ and reply_file_pos_
are the bridge between transaction sessions and dump threads.

The variables mean that any transaction whose position is smaller than the
variables can go ahead.

In other words, all events before the variables are already replicated to at
least N (rpl_semi_sync_master_wait_for_slave_count) slaves.
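
The comparison behind "position is smaller than the variables" can be sketched
as below. This is only an illustration: BinlogPos, compare_pos() and
can_proceed() are hypothetical names, not the plugin's own, and the sketch
assumes that a reply position equal to the transaction's position already
counts as acknowledged.

```cpp
#include <cstdint>
#include <string>

// Hypothetical stand-in for a binlog coordinate (file name + offset).
struct BinlogPos {
  std::string file_name;   // e.g. "binlog.000003"
  std::uint64_t file_pos;  // byte offset inside that file
};

// Returns <0, 0 or >0 when a is before, equal to, or after b.
// File names are compared first; offsets only matter within the same file.
inline int compare_pos(const BinlogPos &a, const BinlogPos &b) {
  int c = a.file_name.compare(b.file_name);
  if (c != 0) return c;
  if (a.file_pos < b.file_pos) return -1;
  return a.file_pos > b.file_pos ? 1 : 0;
}

// A waiting transaction may go ahead once the reply position has reached
// or passed its own position (assumption: equality is enough).
inline bool can_proceed(const BinlogPos &reply, const BinlogPos &trx) {
  return compare_pos(reply, trx) >= 0;
}
```

Comparing file names as strings relies on the zero-padded numeric suffix of
binlog file names, so "binlog.000009" sorts before "binlog.000010".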

So the design changes only the code related to the dump thread, not the
transaction session.

* Changes on dump thread

  3. If rpl_semi_sync_master_wait_for_slave_count == 1, call
     reportReplyBinlog(). Otherwise go to step 4.

  4. AckContainer::insert()
     Update the ack information.
     Check whether some position has already been acked by at least N
     (rpl_semi_sync_master_wait_for_slave_count) slaves. If so, return
     that ack information.
  5. Call reportReplyBinlog() if insert() returns any ack information.
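
Steps 3 to 5 can be sketched as a small dispatch function. This is a minimal
sketch, not the plugin code: AckPos, InsertFn, ReportFn and handle_ack() are
hypothetical names, and the two callables only stand in for
AckContainer::insert() and reportReplyBinlog(), whose real signatures differ.

```cpp
#include <cstdint>
#include <functional>
#include <string>

// Hypothetical stand-in for the position carried by an ack.
struct AckPos {
  std::string file_name;
  std::uint64_t file_pos;
};

// Hypothetical callable shapes for AckContainer::insert() and
// reportReplyBinlog().
using InsertFn =
    std::function<const AckPos *(std::uint32_t server_id, const AckPos &ack)>;
using ReportFn = std::function<void(const AckPos &pos)>;

// Called by the dump thread after it has received an ack from a slave.
inline void handle_ack(unsigned wait_for_slave_count, std::uint32_t server_id,
                       const AckPos &ack, const InsertFn &insert,
                       const ReportFn &report_reply_binlog) {
  if (wait_for_slave_count == 1) {
    // Step 3: a single ack is enough, so report it directly.
    report_reply_binlog(ack);
    return;
  }
  // Step 4: record the ack and ask whether some position is now acked by
  // at least N slaves.
  const AckPos *confirmed = insert(server_id, ack);
  // Step 5: report only when insert() returned ack information.
  if (confirmed != nullptr) report_reply_binlog(*confirmed);
}
```

Keeping the N == 1 case on the direct path means the original behavior does
not pay for the container at all.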

* AckContainer
  A container to maintain the latest ack information of slaves.

  ack information includes:
  AckInfo {
    slave's id
    max ack's log file name.
    max ack's log file pos.
  };

* ack array
  The container is an array whose size equals
  rpl_semi_sync_master_wait_for_slave_count - 1.
  A slave can occupy only one slot, and its new ack overwrites its old ack if
  it already occupies a slot. When the array is full and a new ack arrives from
  a slave that does not occupy a slot, the events before the minimum ack
  (counting the incoming ack) are already replicated to at least
  rpl_semi_sync_master_wait_for_slave_count slaves. In that case, call
  reportReplyBinlog() to update the related variables, and remove the minimum
  acks from the array.

* insert()
  Maintain the ack array, after receiving an ack.

  DEFINITION:
  AckInfo* insert(uint32 server_id, const char *log_file_name,
                  my_off_t log_file_pos)

  LOGIC:
  1. If the array already holds an ack from this slave (a slot whose server
     id equals server_id), update that slot's position with log_file_name and
     log_file_pos. Otherwise the ack needs a slot of its own, as follows.

  2. if the container is full
     {
       find the minimum ack of the global array (including the ack being
       inserted).
       empty the slots whose ack positions equal the minimum ack.
       store the ack being inserted in a slot if it isn't the minimum ack.
       return the minimum ack.
     }

  3. insert the ack into an empty slot.
  4. return NULL.
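
  The logic above can be sketched as follows. This is a simplified
  illustration under stated assumptions, not the plugin source: it uses
  std::string and std::vector instead of the server's own types, treats
  server_id 0 as the marker for an empty slot, and returns nullptr after
  updating an existing slot. The helper names empty(), before() and
  same_pos() are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Latest ack of one slave. server_id 0 marks an empty slot (assumption).
struct AckInfo {
  std::uint32_t server_id = 0;
  std::string file_name;
  std::uint64_t file_pos = 0;

  bool empty() const { return server_id == 0; }
  // True when this position is strictly before the other one.
  bool before(const AckInfo &o) const {
    int c = file_name.compare(o.file_name);
    return c < 0 || (c == 0 && file_pos < o.file_pos);
  }
  bool same_pos(const AckInfo &o) const {
    return file_name == o.file_name && file_pos == o.file_pos;
  }
};

class AckContainer {
 public:
  // The array has wait_for_slave_count - 1 slots.
  explicit AckContainer(unsigned wait_for_slave_count) {
    assert(wait_for_slave_count >= 1);
    slots_.resize(wait_for_slave_count - 1);
  }

  // Returns the minimum ack once N slaves have acked it, otherwise nullptr.
  // The returned pointer stays valid until the next call.
  const AckInfo *insert(std::uint32_t server_id, const std::string &file_name,
                        std::uint64_t file_pos) {
    AckInfo incoming;
    incoming.server_id = server_id;
    incoming.file_name = file_name;
    incoming.file_pos = file_pos;

    AckInfo *free_slot = nullptr;
    for (AckInfo &slot : slots_) {
      if (slot.empty()) {
        free_slot = &slot;
      } else if (slot.server_id == server_id) {
        slot = incoming;  // step 1: the new ack covers the old one
        return nullptr;
      }
    }
    if (free_slot != nullptr) {  // steps 3 and 4: the array is not full yet
      *free_slot = incoming;
      return nullptr;
    }

    // Step 2: the array is full, so with the incoming ack N slaves have
    // acked everything up to the minimum position.
    min_ack_ = incoming;
    for (const AckInfo &slot : slots_)
      if (slot.before(min_ack_)) min_ack_ = slot;
    for (AckInfo &slot : slots_) {
      if (slot.same_pos(min_ack_)) {
        slot = AckInfo();  // empty every slot holding the minimum ack
        free_slot = &slot;
      }
    }
    // If the incoming ack is not the minimum, a slot was just emptied.
    if (!incoming.same_pos(min_ack_)) *free_slot = incoming;
    return &min_ack_;
  }

 private:
  std::vector<AckInfo> slots_;
  AckInfo min_ack_;  // storage for the ack handed back to the caller
};
```

  With rpl_semi_sync_master_wait_for_slave_count = 3 the array has two slots,
  so the first ack from a third slave is the one that makes insert() return a
  position.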

* resize()
  Change the array's size when rpl_semi_sync_master_wait_for_slave_count is
  changed.

  DEFINITION:
  int resize(unsigned int size, const AckInfo **ackinfo);

  LOGIC:
  Back up the ack array.
  Create a new ack array that has size-1 slots if size-1 is greater than 0.
  Call insert() to insert each ack of the old ack array into the new one.
  Free the old ack array.
  If insert() returned an ack pointer, return it to the caller.
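
  A minimal sketch of resize(), under the same assumptions as the worklog's
  insert() logic (server_id 0 marks an empty slot, standard containers replace
  the server's own types). The helper make_ack() is hypothetical. One detail
  is an assumption of this sketch rather than something the worklog states:
  when several insert() calls return an ack while shrinking, the greatest one
  is handed back.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

struct AckInfo {
  std::uint32_t server_id = 0;  // 0 marks an empty slot (assumption)
  std::string file_name;
  std::uint64_t file_pos = 0;
  bool empty() const { return server_id == 0; }
  bool before(const AckInfo &o) const {
    int c = file_name.compare(o.file_name);
    return c < 0 || (c == 0 && file_pos < o.file_pos);
  }
  bool same_pos(const AckInfo &o) const {
    return file_name == o.file_name && file_pos == o.file_pos;
  }
};

// Hypothetical convenience constructor used by the usage example.
inline AckInfo make_ack(std::uint32_t id, const std::string &name,
                        std::uint64_t pos) {
  AckInfo a;
  a.server_id = id;
  a.file_name = name;
  a.file_pos = pos;
  return a;
}

class AckContainer {
 public:
  explicit AckContainer(unsigned size) {
    assert(size >= 1);
    slots_.resize(size - 1);
  }

  // Condensed form of the insert() logic from the worklog.
  const AckInfo *insert(const AckInfo &in) {
    AckInfo *free_slot = nullptr;
    for (AckInfo &s : slots_) {
      if (s.empty()) free_slot = &s;
      else if (s.server_id == in.server_id) { s = in; return nullptr; }
    }
    if (free_slot != nullptr) { *free_slot = in; return nullptr; }
    min_ack_ = in;
    for (const AckInfo &s : slots_)
      if (s.before(min_ack_)) min_ack_ = s;
    for (AckInfo &s : slots_)
      if (s.same_pos(min_ack_)) { s = AckInfo(); free_slot = &s; }
    if (!in.same_pos(min_ack_)) *free_slot = in;
    return &min_ack_;
  }

  // Rebuilds the array for a new wait count. Returns 0 and stores in
  // *ackinfo the greatest ack that became fully acknowledged, or nullptr.
  int resize(unsigned size, const AckInfo **ackinfo) {
    assert(size >= 1 && ackinfo != nullptr);
    std::vector<AckInfo> old;
    old.swap(slots_);                    // back up the old array
    slots_.assign(size - 1, AckInfo());  // new array with size-1 slots
    bool found = false;
    for (const AckInfo &a : old) {
      if (a.empty()) continue;
      const AckInfo *r = insert(a);      // re-insert each old ack
      if (r != nullptr && (!found || resized_.before(*r))) {
        resized_ = *r;
        found = true;
      }
    }
    *ackinfo = found ? &resized_ : nullptr;
    return 0;  // the old array is freed when `old` goes out of scope
  }

 private:
  std::vector<AckInfo> slots_;
  AckInfo min_ack_;
  AckInfo resized_;  // storage for the ack returned by resize()
};
```

  Shrinking the wait count can make positions count as fully acknowledged
  immediately, which is why resize() may hand an ack back to the caller;
  growing it never does.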

* Sys_rpl_semi_sync_master_wait_for_slave_count
  The object of the system variable rpl_semi_sync_master_wait_for_slave_count.

  DEFINITION:
  static Sys_var_uint Sys_rpl_semi_sync_master_wait_for_slave_count;

* fix_rpl_semi_sync_master_wait_for_slave_count()
  Handles and initializes internal state (e.g. the global ack array) when
  rpl_semi_sync_master_wait_for_slave_count is changed. It is called by
  Sys_rpl_semi_sync_master_wait_for_slave_count.

  DEFINITION:
  static void fix_rpl_semi_sync_master_wait_for_slave_count(MYSQL_THD thd,
                                                            SYS_VAR *var,
                                                            void *ptr,
                                                            const void *val);

  LOGIC:
  if the new value differs from the old value
  {
    Call AckContainer::resize() to change the array.
    Call reportReplyBinlog() if resize() returns an ack pointer.
  }

* Switch off semisync master
  Semisync master will switch off if the number of slaves is less than
  N (rpl_semi_sync_master_wait_for_slave_count), and it will switch back on
  once it receives an ACK from at least
  N (rpl_semi_sync_master_wait_for_slave_count) slaves.

* How to recover a crashed master
  Currently, all binlogged transactions are recovered. This worklog does not
  change that behavior. WL#7042 will make changes so that the recovery is more
  reasonable and practical.