WL#6120: Change master without stopping slave

Affects: Server-5.7 — Status: Complete

In order to add or alter an option with the CHANGE MASTER TO command, it is
currently necessary to run STOP SLAVE before CHANGE MASTER. This worklog relaxes
this constraint. Let's look at the three scenarios below to understand more about
this task.

1) BOTH IO AND SQL THREAD ARE STOPPED

   When both the slave threads are stopped, there will be no change in
   behaviour. The CHANGE MASTER command will behave as it does now.

2) IO THREAD IS STOPPED, SQL THREAD IS RUNNING

   In order to switch the I/O thread over to read from another master it is
   currently necessary to stop the SQL thread as well. This worklog implements
   support for re-directing the I/O thread to another master without having to
   stop the SQL thread first, wherever possible.

3) SQL THREAD IS STOPPED, IO THREAD IS RUNNING

   In order to CHANGE MASTER TO RELAY_LOG_FILE/RELAY_LOG_POS/MASTER_DELAY, we
   currently have to stop the IO thread as well. This worklog will allow these
   CHANGE MASTER options without having to stop the IO thread, if possible.

1) Currently, the way we move from a topology M1->S to M2->S is:

    a) STOP SLAVE
    b) SHOW SLAVE STATUS to get (Read_Master_Log_Pos, Master_Log_File)
    c) START SLAVE UNTIL 
    d) SELECT MASTER_POS_WAIT()
    e) CHANGE MASTER 
    f) START SLAVE 

   The proposal is to reduce these steps to just

      CHANGE MASTER 

   wherever applicable. See points (a-d) below regarding these rules.

   a) If IO thread is running and SQL thread is stopped:
         - CHANGE MASTER TO RELAY_LOG_FILE/RELAY_LOG_POS/MASTER_DELAY will be
           allowed.
         - All other CHANGE MASTER options will be disallowed

   b) If SQL thread is running and IO thread is stopped:
         - CHANGE MASTER TO RELAY_LOG_FILE/RELAY_LOG_POS/MASTER_DELAY will be
           disallowed
         - All other CHANGE MASTER options will be allowed.

   c) CHANGE MASTER TO MASTER_AUTO_POSITION=1 will be allowed only if both IO
      and SQL threads are stopped.

   d) If the receiver/applier is running and the slave has open temporary
      tables, we print a warning on CHANGE MASTER.
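
   Rules (a) to (c) can be read as a per-option check. The following is a
   minimal C++ sketch of that reading, not the server's implementation: the
   names Verdict, is_applier_option() and check_option() are hypothetical, and
   the case where both threads are running is inferred from the rules rather
   than stated by them.

```cpp
#include <string>
#include <unordered_set>

// Illustrative sketch only: these names are hypothetical, not the server's.
enum class Verdict { Allowed, Disallowed };

// Options that rules (a) and (b) tie to the SQL (applier) thread.
inline bool is_applier_option(const std::string &opt) {
  static const std::unordered_set<std::string> applier_options = {
      "RELAY_LOG_FILE", "RELAY_LOG_POS", "MASTER_DELAY"};
  return applier_options.count(opt) != 0;
}

// Decide whether a single CHANGE MASTER option may be used, given which
// slave threads are currently running.
inline Verdict check_option(const std::string &opt, bool io_running,
                            bool sql_running) {
  // Rule (c): MASTER_AUTO_POSITION=1 needs both threads stopped.
  // Assumption: the caller passes this name only for the "=1" case.
  if (opt == "MASTER_AUTO_POSITION") {
    return (io_running || sql_running) ? Verdict::Disallowed
                                       : Verdict::Allowed;
  }
  // Rules (a)/(b): applier options are blocked by a running SQL thread,
  // every other option is blocked by a running IO thread.
  const bool blocked = is_applier_option(opt) ? sql_running : io_running;
  return blocked ? Verdict::Disallowed : Verdict::Allowed;
}
```

   For example, check_option("MASTER_HOST", false, true) yields Allowed under
   rule (b), while check_option("MASTER_DELAY", false, true) yields Disallowed.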

2) With the change described above, there can be a moment when the IO thread
   is reading from M2 while the SQL thread is still executing events that were
   received from M1, *both at the same time*. In addition, there is no overhead
   of killing and spawning new threads.

3) Currently, CHANGE MASTER purges relay log files unless the command uses the
   RELAY_LOG_FILE/RELAY_LOG_POS option. This behavior is kept intact when both
   threads are stopped.
   When either thread is running while we do a CHANGE MASTER, we don't delete
   relay logs; in particular, we can't remove the relay log(s) with a running
   SQL thread. Relay log deletion can be handled by using the relay-log-purge
   option.
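
   The purge decision in this item can be sketched as a small predicate. This
   is an illustration under assumptions, not the server's code: the struct
   ChangeMasterRequest and the function should_purge_relay_logs() are
   hypothetical names.

```cpp
// Illustrative sketch only: hypothetical names, not the server's code.
struct ChangeMasterRequest {
  bool sets_relay_log_file;  // command specifies RELAY_LOG_FILE
  bool sets_relay_log_pos;   // command specifies RELAY_LOG_POS
};

// Returns true when CHANGE MASTER should purge the relay log files.
inline bool should_purge_relay_logs(const ChangeMasterRequest &req,
                                    bool io_running, bool sql_running) {
  // With any slave thread running, relay logs are never deleted here.
  if (io_running || sql_running) return false;
  // Both threads stopped: keep the existing behavior, i.e. purge unless
  // the command positions the applier inside the existing relay logs.
  return !(req.sets_relay_log_file || req.sets_relay_log_pos);
}
```

   With both threads stopped and neither relay log option given, the predicate
   returns true, which matches the pre-existing behavior.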

4) With statement-based replication (SBR), we don't recommend using temporary
   tables. One reason is that the temporary tables may be left open forever
   after a failover. To warn users about this situation, we introduce warnings
   in the error log when one does a CHANGE MASTER or STOP SLAVE. More precisely,
   we apply the following rules:
    
     4.1 change master should never drop temp tables
     4.2 We introduce a new command to drop temp tables. 
     4.3 The options under change master can be grouped under three groups:
         a) To change a connection configuration but remain connected to
            the same master.
         b) To change positions in binary or relay log (e.g. master_log_pos).
         c) To change the master you are replicating from.
         Change master should generate a warning if there are open temp tables
         in cases a and b above.
     4.4 Stop slave should generate a warning if there are open temp tables.

1) Removed the following check, which required both IO and SQL threads to be
   stopped, from change_master():

   if (thread_mask) // We refuse if any slave thread is running
   {
     my_message(ER_SLAVE_MUST_STOP, ER(ER_SLAVE_MUST_STOP), MYF(0));
     ret= true;
     goto err;
   }

2) Introduced error/warning in the following cases:
   
   a) Added a new error, ER_SLAVE_IO_THREAD_MUST_STOP (This operation cannot be
      performed with a running slave io thread; run STOP SLAVE IO_THREAD
      first).

   b) Added a new warning- ER_WARN_OPEN_TEMP_TABLES_MUST_BE_ZERO (This operation
      may not be safe when the slave has temporary tables. The tables will be
      kept open until the server restarts or until the tables are deleted by any
      replicated DROP statement. Suggest to wait until slave_open_temp_tables =
      0.) 

4) Added checks for disallowing all CHANGE MASTER options except
   RELAY_LOG_FILE/RELAY_LOG_POS/MASTER_DELAY when the IO thread is running.

5) Added code to allow CHANGE MASTER TO RELAY_LOG_FILE/RELAY_LOG_POS/
   MASTER_DELAY when the IO thread is running and the SQL thread is stopped.

6) Following is the pseudocode describing changes in this WL:

  if both receiver and applier are stopped:
    no change in behavior

  if any change master options affect the receiver:
    if receiver is executing:
      error(ER_SLAVE_IO_THREAD_MUST_STOP)

  if any change master options affect the applier:
    if applier is executing:
      error(ER_SLAVE_SQL_THREAD_MUST_STOP)

  if applier is executing and applier has open temporary tables and the user
  does a change master/stop slave:
      warning(ER_WARN_OPEN_TEMP_TABLES_MUST_BE_ZERO)

  if there was an error above:
    return TRUE (consistent with current semantics)
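
   The pseudocode above could be rendered in C++ roughly as follows. This is a
   sketch under assumptions, not the code in change_master(): the types
   SlaveState, Request and Outcome and the function validate_change_master()
   are hypothetical, the enumerators only mirror the error and warning symbols
   named above, and collecting all errors before returning (rather than
   stopping at the first one) is a choice made for the illustration.

```cpp
#include <vector>

// Illustrative sketch only: types and function names are hypothetical.
// The enumerators mirror the error/warning symbols used in the pseudocode.
enum class Code {
  ER_SLAVE_IO_THREAD_MUST_STOP,
  ER_SLAVE_SQL_THREAD_MUST_STOP,
  ER_WARN_OPEN_TEMP_TABLES_MUST_BE_ZERO
};

struct SlaveState {
  bool receiver_running;  // IO thread
  bool applier_running;   // SQL thread
  int open_temp_tables;   // temporary tables opened by the applier
};

struct Request {
  bool affects_receiver;  // some option touches the receiver
  bool affects_applier;   // some option touches the applier
};

struct Outcome {
  std::vector<Code> errors;
  std::vector<Code> warnings;
  // Mirrors "return TRUE" on error in the pseudocode.
  bool failed() const { return !errors.empty(); }
};

inline Outcome validate_change_master(const Request &req,
                                      const SlaveState &st) {
  Outcome out;
  // Receiver options need a stopped receiver.
  if (req.affects_receiver && st.receiver_running)
    out.errors.push_back(Code::ER_SLAVE_IO_THREAD_MUST_STOP);
  // Applier options need a stopped applier.
  if (req.affects_applier && st.applier_running)
    out.errors.push_back(Code::ER_SLAVE_SQL_THREAD_MUST_STOP);
  // A running applier with open temporary tables only produces a warning.
  if (st.applier_running && st.open_temp_tables > 0)
    out.warnings.push_back(Code::ER_WARN_OPEN_TEMP_TABLES_MUST_BE_ZERO);
  return out;
}
```

   When both threads are stopped, neither error branch can fire, so the
   pre-existing behavior is unaffected by these checks.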

7) Added the following tests in mtr's rpl suite.

   a) rpl_change_master_without_stopping_slaves.test
   b) rpl_change_master_open_temp_tables.test
   c) rpl_change_master_relay_log_purge.test

8) While doing this, we also improved the readability and maintainability of
   the change_master() function.