WL#2775: System tables for master.info, relay_log.info

Affects: Server-5.6 — Status: Complete

SUMMARY
-------
Replace the master.info and relay_log.info files with system tables.

MOTIVATION
----------
1. Get rid of the files making the system more easy to manage.
   This becomes especially important with WL#1697 Multi-source
   since that implementation would create sets of files
   for *every* named master.

2. System table updates should be intermixed with the 
ongoing transaction, thus making the replication more fault tolerant.  
I.e. it would be impossible for the slave to crash between the data 
update and the relay_log_pos update...

3. Brian wants this as part of multi-source (WL#1697).


RELATED WORKLOGS
----------------
(Impact on and by)
WL#1697 - Multi-source replication.
WL#3584 - Group-ids (global-bin-log positions).
WL#4209 - Integrate Backup with Replication.
WL#4621 - Recovery process to synchronize the master.info, the relay-log.info
and the relay-log after a failure.
WL#5125	- Refactory of Slave master.info and slave.info

(Previous work towards this worklog)
WL#3089 - Refactoring for multi-source and info system tables.

(Additional thoughts)
WL#3835 - Slave tables to track progress of replication.

(Different alternative with XA)
WL#3970 - Safe slave positions.


DECISIONS
---------
This should be first discussed with Mats and then reviewed by 
Peter G.  Configuration of this new system table needs to be 
very clearly specified.  -- Lars Thalmann, 2009-01-26


OPEN ISSUES
-----------
1. Perhaps consider storing the password as encrypted.

"Due to the lack of time, we are not going to consider this as
part of this worklog." -- Alfranio, 2009-10-16

2. Although the code can be easily extended to support WL#1697 -
Multi-source replication, this still requires additional work.

3. We are not considering the integration with Backup, WL#4209 - 
Integrate Backup with Replication.


BUGS
----
BUG#55105 - Some PB2 tests fail sporadically with SIGSEGV in
inline_mysql_mutex_lock.

BUG#56324 - Race Condition while shutting down MySQL: "PSI_server"

BUG#51054 - mysqld segfault after binlog enabled

The bugs above need to be fixed so we can properly test this worklog.

REQUIREMENTS
------------
(REQ-1) It shall be possible to store the information in any kind of storage
engine.

(REQ-2) - It shall be possible to easily extend the proposed solution to handle
multi-source replication. See also WL#1697.

(REQ-3) The master.info table shall be updated whenever: (i) a relay log is
rotated, (ii) CHANGE MASTER is executed and (iii) I/O Thread is stopped.

(REQ-4) After a failure, a recovery process should throw away possible corrupted
relay log files and rely on the information in the relay.info
to start re-fetching data from the master. See also WL#4621.

(REQ-5) The relay.info table shall be updated whenever: (i) the relay log file
is rotated, (ii) transaction SQL Thread is stopped, (iii) while processing a
Xid_log_event, (iv) after processing commit or rollback, i.e. a Query_log_event,
and (v)after processing any statement written to the binary log without a
transaction context.

(REQ-6) The system should cope with the fact that an user may switch the type of
the repository without disabling the previous one.

REQ-1
-----
It shall be possible to create the system tables using any kind of storage
engine. Specifically, the system tables are the slave_master_info (master.info)
and slave_relay_log_info (relay.info) and by default both tables are created
using the default engine, which currently is MyISAM. This is done because the
MyIsam storage engine is always available.

For those users concerned about reliability, a "super user" may change the type
of the engine at any time. It is also possible to specify the type of the engine
in the installation procedure by defining the option -rep-engine=engine.

Common users are not be able to change (e.g. drop) the system tables. However, a
"super user" may alter them but this is not recommend as some changes may stop
the slave. For instance, the primary key should never be
changed or removed. 

In particular, changing default values for the columns in the system tables does
not affect default values defined in the server. This means that when a "CHANGE
MASTER" is executed the default values specified in the system tables are
ignored as they are overwritten by those internally defined in the server.
However, the "super user" can still stop the replication and directly perform
changes to the system tables and in such cases the default values defined for
the columns will be used.

Users are not able to directly update the system tables but any user can
read from it. In particular, a super user may read and also acquire a read
lock on them thus blocking the replication until the lock is released.

A read lock can be acquired with the command: "FLUSH TABLES WITH READ LOCK"


REQ-2
-----
It shall be as simple as having an *unique* index (PK or UK not null) in
master.info and relay.info tables in the column that stores the master id.
So supporting multi-source is as simple as adding new entries to the tables.


REQ-3
-----
(REQ-3) The master.info table shall be updated whenever: (i) a relay log is
rotated, (ii) CHANGE MASTER is executed and (iii) I/O Thread is stopped.

One may also control if the master.info is updated after reading an event from
the master and writing it to relay log. This is done through the option
--sync-master-info = N, where N is the number of events after which the update
will take place.

There is a key difference however between the use of a file and a system table.

When we update the positions into a file, we optimistically write to the OS'
cache and eventually the changes will be written to disk if the server does not
crash. If we want to guarantee that the data is successfully written to disk
right after returning from the update call we need to set the option
--sync-master-info.

However, there is no optimistic write to a table which means that nothing is
written before the sync period is reached. By default the value is 0 which
means that data is persisted only in cases (i), (ii) and (iii).

There is no need to set --sync-master-info as the recovery process described in
REQ-4 is mandatory to provide reliability/data integrity and by consequence some
possible corrupted relay log files are thrown away and the positions stored in
the master.info are updated based on the positions in the relay.info. Deleted
events are re-fetched from the master if necessary.


REQ-4
-----
To correctly recovery from failures, one shall relay on the following option:
--relay-log-recovery.  


REQ-5
-----
(REQ-5) The relay.info table shall be updated whenever: (i) the relay log file
is rotated, (ii) transaction SQL Thread is stopped, (iii) while processing a
Xid_log_event, (iv) after processing commit or rollback, i.e. a Query_log_event,
and (v)after processing any statement written to the binary log without a
transaction context.

The Xid_log_event is a commit for transactional engines and must be handled
differently to provide reliability/data integrity. While committing updates to
transactional engines the following behavior shall be implemented:

  . If the relay.info is stored in a transactional repository, for instance, a
  system table created using Innodb, the positions are updated in the context
  of the transaction that updated data. Therefore, should the server crash 
  before successfully committing the transaction the changes to the position 
  tables will be rolled back too

  . If the relay.info is stored in a non-transactional repository, for instance,
  a file or a system table created using MyIsam, the positions are update after
  processing the commit as in (iv) and (v).

In particular, when there are mixed transactions, i.e a transaction that updates
both transaction and non-transactional engines, the Xid_log_event is still used
but reliability/data integrity cannot be achieved as we shall explain in what
follows. 

Changes to non-transactional engines, such as MyIsam, cannot be rolled back if a
failure happens. For that reason, there is no point in updating the positions
within the boundaries of any on-going transaction. This is true for both commit
and rollback. If a failure happens after processing the pseudo-transaction but
before updating the positions, the transaction will be re-executed when the
slave is up most likely causing an error that needs to be manually circumvented.
This is a well-known issue when non-transactional statements are executed.

Specifically, if rolling back any transaction, positions are updated outside the
transaction boundaries. See WL#4801 for additional details.

Statements written to the binary log outside the boundaries of a transaction are
DDLs or maintenance commands which are not transactional. This means that they
cannot be rolled back if a failure happens. In such cases, the positions are
updated after processing the events. If a failure happens after processing the
statement but before updating the positions, the statement will be
re-executed when the slave is up most likely causing an error that needs to be
manually circumvented. This is a well-known issue when non-transactional
statements are executed.

The --sync-relay-log-info does not have effect when a transactional system table
is used.


REQ-6
-----
  The following table describes what shall happen when an user restarts the
  server with different repositories:

  |--------------+-----------------------+-----------------------|
  | Exists \ Opt |         TABLE         |          FILE         |
  |--------------+-----------------------+-----------------------|
  | !T,F         | Update (and delete F) | Read F                |
  | T,F          | ERROR                 | ERROR                 |
  | !T,!F        | Fill in T             | Create and Fill in F  |
  | T,!F         | Read T                | Update (and delete T) |
  |--------------+-----------------------+-----------------------|

  .F  - file with data, 
  .!F - no file,
  .!T - no data in table, 
  .T - table with data.


By default the persistent storage system is a file for both the master info and
relay log info. Two options are provided to change persistent storage system:

master_info_repository = (FILE | TABLE)

relay_log_info_repository = (FILE | TABLE)

ARCHITECTURE
------------
We extend the architecture defined in the WL#5125 as fallows:


         ----------------------     ----------------------
         |  Rpl_info_Handler  |<>---|   Rpl_info_fields  |
         ----------------------     ----------------------
                   ^
                   |
         ----------------------     -------------------------
         |   Rpl_info_table   |<>---| Rpl_info_table_access |
         ----------------------     -------------------------

A factory is responsible for creating and assembling this set of components,
i.e., setting the dependencies between the instances as follows:

                             ----------------------
                             |   Rpl_info_factory |
                             ---------------------- 

              ----------------------     ----------------------
              |     Master_info    |<>---|    Rpl_info_table  |
              ----------------------     ----------------------

              ----------------------     ----------------------
              |   Relay_log_info   |<>---|    Rpl_info_table  |
              ----------------------     ----------------------

CLASSES
-------

Rpl_info_Handler
================
. This an abstract class that provides a simple interface to build persistence
mechanisms.


Rpl_info_factory
================
. Creates and assembles the set of components presented in this section.


Master_info
===========
. Master_info will handle the following information:
  - Master_Log_File - The name of the master binary log currently being read
    from the master.
  - Read_Master_Log_Pos - The current position within the master binary log
    that have been read from the master.
  - Master_Host - The host name of the master.
  - Master_User - The user name used to connect to the master.
  - Master_Password (not shown by SHOW SLAVE STATUS) - The password used to
    connect to
    the master.
  - Master_Port - The network port used to connect to the master.
  - Connect_Retry - The period (in seconds) that the slave will wait before
    trying to reconnect to the master.
  - Master_SSL_Allowed - Indicates whether the server supports SSL connections.
  - Master_SSL_CA_File - The file used for the Certificate Authority (CA)
    certificate.
  - Master_SSL_CA_Path - The path to the Certificate Authority (CA)
    certificates.
  - Master_SSL_Cert - The name of the SSL certificate file.
  - Master_SSL_Cipher - The name of the cipher in use for the SSL connection.
  - Master_SSL_Key - The name of the SSL key file.
  - Master_SSL_Verify_Server_Cert - Whether to verify the server certificate.

Any change that may happen to the structure of the Master_info will be handled
at merging time.


Relay_log_info
==============
. Relay_log_info will handle the following information:
  - Relay_Log_File - The name of the current relay log file.
  - Relay_Log_Pos  - The current position within the relay log file. Events up
  to this position have been executed on the slave database.
  - Relay_Master_Log_File - The name of the master binary log file from which
  the events in the relay log file were read.
  - Exec_Master_Log_Pos - The equivalent position within the master's binary log
  file of events that have already been executed.

Any change that may happen to the structure of the Relay_log_info will be
handled at merging time.


class Rpl_info_table_access:
============================
. Provides the necessary means and interfaces to create threads and access 
information stored in a system table. I has the following methods:

  open_table() opens and locks a table in a database but before doing that,
  saves the previous lock state.

  close_table() releases the table lock and close the table and at the end
  restores the previous lock state.

  bool find_info_id() puts the cursor pointing to the record that corresponds
  to the server id.

  bool load_info_fields() prepares a data structure (e.g. Rpl_info_fields) to
  read data from a table.

  bool store_info_fields() prepares a data structure (e.g. Rpl_info_fields) to
  write data into a table.

  create_fake_thd() creates a fake thread to access a table when there is no
  current_thread.

  drop_fake_thd() drops a fake thread created to access a table.

{
public:
  Rpl_info_table_access();
  virtual ~Rpl_info_table_access();

  bool open_table(THD* thd, const char *dbstr, const char *tbstr,
                  uint max_num_field, enum thr_lock_type lock_type,
                  TABLE** table, Open_tables_state* backup);
  bool close_table(THD* thd, TABLE* table, Open_tables_state* backup);
  bool find_info_id(uint idx, LEX_STRING, TABLE*);
  bool load_info_fields(uint max_num_field, Field **fields, ...);
  bool load_info_fields(uint max_num_field, Field **fields,
                        Rpl_info_fields *field_values);
  bool store_info_fields(uint max_num_field, Field **fields, ...);
  bool store_info_fields(uint max_num_field, Field **fields,
                         Rpl_info_fields *field_values);
  THD *create_fake_thd();
  bool drop_fake_thd(THD* thd, bool error);

private:

  Rpl_info_table_access& operator=(const Rpl_info_table_access& info);
  Rpl_info_table_access(const Rpl_info_table_access& info);
};

class Rpl_info_table : public Rpl_info_handler
=====================
. This is a concrete class that inherits from the Rpl_info_handler.
See WL#5125.

{
public:
  Rpl_info_table(uint nparam, uint param_field_id, const char* param_schema,
                 const char *param_table);
  virtual ~Rpl_info_table();
private:

  Rpl_info_table_access *access;

  int do_init_info();
  int do_check_info();
  void do_end_info();
  int do_flush_info(const bool force);
  int do_reset_info();

  bool is_transactional();

  int do_prepare_info_for_read();
  int do_prepare_info_for_write();
  bool do_set_info(const int pos, const char *value);
  bool do_set_info(const int pos, const int value);
  bool do_set_info(const int pos, const ulong value);
  bool do_set_info(const int pos, const float value);
  bool do_set_info(const int pos, const Server_ids *value);
  bool do_get_info(const int pos, char *value, const size_t size,
                   const char *default_value);
  bool do_get_info(const int pos, int *value,
                   const int default_value);
  bool do_get_info(const int pos, ulong *value,
                   const ulong default_value);
  bool do_get_info(const int pos, float *value,
                   const float default_value);
  bool do_get_info(const int pos, Server_ids *value,
                   const Server_ids *default_value);

  Rpl_info_table& operator=(const Rpl_info_table& info);
  Rpl_info_table(const Rpl_info_table& info);
};


class Rpl_info_fields:
======================
. Provides a sequence of elements that can be used as a temporary area
to store information before flushing it to a repository.

  bool init() -  allocate memory according to the number of elements in the
  master_info ro relay_log_info.


  void restore() - Restore the pointer from "saved" to "use". For instance,
  this may happen when a the repository is a table and one reads from it.

  resize() - Extends memory if the pre-allocated space is not enough to
  temporary store data.
{
public:

  Rpl_info_fields(int param_ninfo): field(0),
    ninfo(param_ninfo) { };
  virtual ~Rpl_info_fields();

  bool configure();
  bool resize(int needed_size, int pos);

  info_fields *field;

private:
  int ninfo;

  Rpl_info_fields& operator=(const Rpl_info_fields& fields);
  Rpl_info_fields(const Rpl_info_fields& fields);
};

struct info_fields
{
  LEX_STRING use;
  LEX_STRING saved;
  int size;
} typedef info_fields;


In what follows, we describe the structure of the master.info (i.e.
slave_master_info) and relay.info (i.e. slave_relay_log_info) tables.


TABLES
------

MASTER.INFO

CREATE TABLE IF NOT EXISTS slave_master_info 
(
Master_id INTEGER UNSIGNED NOT NULL, 

Lines INTEGER UNSIGNED NOT NULL,

Master_log_name TEXT NOT NULL,

Master_log_pos INTEGER UNSIGNED NOT NULL,

Host TEXT utf8_bin,

User TEXT,

Password TEXT,

Port INTEGER UNSIGNED NOT NULL,

Connect_retry INTEGER UNSIGNED NOT NULL,

Ssl BOOLEAN NOT NULL,

Ssl_ca TEXT,

Ssl_capath TEXT,

Ssl_cert TEXT,

Ssl_cipher TEXT,

Ssl_key TEXT,

Ssl_verify_servert_cert BOOLEAN NOT NULL,

Heartbeat FLOAT NOT NULL,

Ignore_server_ids TEXT,

Retry_count INTEGER UNSIGNED NOT NULL,  

PRIMARY KEY(master_id(128))) 
ENGINE=MYISAM COMMENT 'Master Information';

RELAY.INFO:

CREATE TABLE IF NOT EXISTS slave_relay_log_info 
(
Master_id INTEGER UNSIGNED NOT NULL,

Relay_log_name TEXT,

Relay_log_pos INTEGER UNSIGNED NOT NULL,

Master_log_name TEXT NOT NULL,

Master_log_pos INTEGER UNSIGNED NOT NULL,

PRIMARY KEY(master_id(128)))
ENGINE=MYISAM COMMENT 'Relay Log Information';

For both tables UTF-8 is used as default.

-----

NOTE: Changing these tables directly by the user is not recommended. Updating
them while the slave is running will lead to unexpected/unpredictable behaviour,
thus is not supported.