WL#342: Replication Heartbeat

Affects: Server-5.5   —   Status: Complete

RATIONALE
---------
1. To solve BUG#20435 (extra relay log rotation);
2. to detect failures more easily and precisely;
3. to have a better Seconds_Behind_Master value on the slave (BUG#29309).

DESIGN
------

1. New extension to CHANGE MASTER:
   CHANGE MASTER TO master_heartbeat_period= val;
   The value is stored in the master info file and in the mi object,
   and provides the period that the slave sends to the master.

   The value 'val' must lie within a reasonable interval.
   Since the cost of creating and sending the event, and of handling it
   on the slave side, is expected to be low, val can be as small as
   1 second or even less.

2. The slave IO thread prepares and sends the query
     'SET @master_heartbeat_period= val'
   to the master. From this query the master's dump thread learns
   the slave's preferred heartbeat sending period.

3. The heartbeat event is created by the master's DUMP thread and sent each
   time @master_heartbeat_period elapses, which means that nothing has been
   added to the binlog during that period.

4. On the slave's side the heartbeat event is handled exclusively by the IO
   thread: it is not recorded in the relay log and does not engage the slave
   SQL thread.
   Upon receipt of a heartbeat, the last sent event's coordinates are compared
   against the ones the slave IO thread maintains and updates for each
   received event, except some "phantoms" including the new Heartbeat.
   The heartbeat event does not update that local information.

5. The heartbeat period and the number of received heartbeat events should be
    monitorable via
        SHOW STATUS like 'slave_heartbeat_period'  and
        SHOW STATUS like 'slave_received_heartbeats' respectively.
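
Items 3 and 4 above can be sketched as a small simulation. This is only an
illustration under stated assumptions, not the server code: the names
DumpThread, SlaveIO, send_event, tick and on_packet are hypothetical,
coordinates are modelled as (file, position) tuples, the heartbeat is assumed
to carry the coordinates of the last sent event, and time advances through
explicit numeric ticks rather than a real clock. What the slave does on a
coordinate mismatch is not specified here, so the sketch merely reports it.

```python
from dataclasses import dataclass, field


@dataclass
class DumpThread:
    """Hypothetical model of the master's dump thread serving one slave."""
    period: float                              # requested heartbeat period, seconds
    last_sent: tuple = ("binlog.000001", 4)    # coordinates of the last event sent
    idle: float = 0.0                          # time since the last send

    def send_event(self, coords):
        # A real binlog event resets the wait for the heartbeat condition.
        self.last_sent = coords
        self.idle = 0.0
        return ("event", coords)

    def tick(self, dt):
        # Advance time by dt; emit a heartbeat once the period has elapsed
        # with nothing new to send. A zero period disables heartbeats.
        if self.period <= 0:
            return None
        self.idle += dt
        if self.idle >= self.period:
            self.idle = 0.0
            return ("heartbeat", self.last_sent)
        return None


@dataclass
class SlaveIO:
    """Hypothetical model of the slave IO thread."""
    coords: tuple = ("binlog.000001", 4)
    received_heartbeats: int = 0
    relay_log: list = field(default_factory=list)

    def on_packet(self, packet):
        kind, coords = packet
        if kind == "heartbeat":
            # Handled by the IO thread only: counted and checked, but not
            # written to the relay log and not used to update self.coords.
            self.received_heartbeats += 1
            return coords == self.coords
        # Ordinary event: record it and update the local coordinates.
        self.relay_log.append(coords)
        self.coords = coords
        return True


master = DumpThread(period=2.0)
slave = SlaveIO()
slave.on_packet(master.send_event(("binlog.000001", 120)))
first = master.tick(1.0)    # period not yet elapsed, nothing is sent
second = master.tick(1.0)   # idle for a full period, heartbeat is sent
in_sync = slave.on_packet(second)
```

After this run the relay log holds only the real event, and the heartbeat has
incremented the counter without moving the slave's coordinates.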

FEATURES
--------
A newer heartbeat-aware slave will not get any error response from
an old master that is unaware of heartbeats to the query the slave issues
at connection time,
  set @master_heartbeat_period=val;
and naturally the old master will not send heartbeats.
In that case the slave will show its chosen heartbeat value in the status,
but no heartbeats will actually arrive.

Using the user variable @master_heartbeat_period instead of a system variable
avoids displaying the name within the list of available variables for a plain
user session.
The existence of the user variable master_heartbeat_period can be noticed only
via the general query log.

User observable behaviour
--------------------------
1. Requesting, on the slave, that the master send heartbeats with a period:

   CHANGE MASTER TO master_heartbeat_period= val
   
    where val is the period, of decimal type, with a value in the
    range [0.001, 4294967] seconds.
    Note that the master sends heartbeats only if there have been no
    unsent events in the current binlog file for a period longer than
    master_heartbeat_period.
    Whenever the master's binlog is updated with an event, the wait
    for the heartbeat sending condition is reset.

    If `val' is zero, no heartbeats will be sent.
    Note that the heartbeat is active by default, with the period
    slave_net_timeout/2.

2.  SHOW STATUS like 'slave_heartbeat_period'
    
    A slave-side status variable that takes its value from
    CHANGE MASTER, from master.info, or implicitly as `slave_net_timeout/2'
    (the default).
    The denominator 2 provides a reasonable default period that guarantees
    the slave will not reconnect to an idle master when
    slave_net_timeout elapses.

3. SHOW STATUS like 'slave_received_heartbeats';

    The counter is initialized at slave init time, is incremented by
    every received heartbeat and is reset to zero by CHANGE MASTER.
    The counter occupies the size of a ulonglong, i.e.
    normally 8 bytes.
    Overflowing it, even with the fastest heartbeat, is possible only
    on a cosmic time scale.

4.  RESET SLAVE;
   
    resets the current heartbeat period to the default (see 2.).
    The valid-range check still applies after slave_net_timeout/2 is
    computed: if the ratio exceeds the maximum allowed value, the period
    is dropped to that maximum.

5.  SET @@global.slave_net_timeout=`value less than the current hb period`

    produces a warning, since a net timeout shorter than the heartbeat
    period would be an irrational setting.
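
The rules in items 1 to 5 above can be sketched in a few lines. This is a
minimal illustration, not the server implementation: the function names
validate_period, default_period, set_slave_net_timeout and
years_to_overflow are hypothetical, and rejecting an out-of-range value with
an exception is an assumption, since the text above only states the range.

```python
import warnings

MIN_PERIOD = 0.001        # seconds, lower bound of the range in item 1
MAX_PERIOD = 4294967.0    # seconds, upper bound of the range in item 1


def validate_period(val):
    """Accept 0 (heartbeats off) or a value within [MIN_PERIOD, MAX_PERIOD]."""
    if val == 0:
        return 0.0
    if not (MIN_PERIOD <= val <= MAX_PERIOD):
        # Assumption: out-of-range values are rejected outright.
        raise ValueError("heartbeat period out of range: %r" % (val,))
    return float(val)


def default_period(slave_net_timeout):
    """Default period (items 2 and 4): slave_net_timeout/2, capped at the maximum."""
    return min(slave_net_timeout / 2, MAX_PERIOD)


def set_slave_net_timeout(new_timeout, current_period):
    """Item 5: warn when the timeout drops below the current heartbeat period."""
    if new_timeout < current_period:
        warnings.warn("slave_net_timeout is less than the heartbeat period")
    return new_timeout


def years_to_overflow(period=MIN_PERIOD, counter_bits=64):
    """Item 3: time for the received-heartbeats counter to wrap at a given period."""
    seconds = (2 ** counter_bits) * period
    return seconds / (365.25 * 24 * 3600)


period = default_period(3600)       # half of a one-hour timeout
capped = default_period(10_000_000) # ratio exceeds the maximum, so it is capped
overflow_years = years_to_overflow()
```

Even at the fastest period of one millisecond, the 8-byte counter would take
on the order of hundreds of millions of years to wrap, which is the "cosmic
time scale" mentioned in item 3.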

OLD INFO:
---------
Suggestion from Jeremy Zawodny at OSCON, July 2001, as a discussion idea:

Need active heartbeat detection mechanism for replication monitoring.  
Writing an error to log file is insufficient notice of failure.  
Instead need a way to trigger an alert message to DBA or 
a method for a watcher program to immediately detect 
when a slave has failed.