WL#11636: InnoDB: Clone Remote provisioning

Affects: Server-8.0   —   Status: Complete

In WL#9210, we supported clone of remote database server. In order to provision
a replica using clone, it involved multiple steps as shown below.

1. Initialize a dummy database directory in recipient node and start mysql server.
2. Connect to this server and execute CLONE command to clone data from remote 
donor server to a new data directory in recipient.
3. Shut down mysql server in recipient node and restart on new data directory.
4. Delete/Remove the dummy data directory in recipient

This worklog aims to simplify the method of provisioning by allowing clone
directly into recipients current data directory and also allow cloning process
to be completely driven via SQL command after a server is started.

Also, before cloning, it is advantageous to detect some of the error cases
early. This WL would also support pre-conditions checks before clone.

A. User commands: Provision/Re-provision a node
------------------------------------------------
1. Choose a donor instance to clone from

2. Install clone plugin in donor 

3. Have clone user in donor with BACKUP_ADMIN privilege

4. Choose the recipient instance to Re-provision
   For new server provisioning, create a DB and start server.

5. Install clone plugin in recipient

6. Have admin user in clone with SYSTEM_VARIABLES_ADMIN privilege.
   Set valid clone sources.

7. Have clone user with following privilege
     CLONE_ADMIN : allow to drop and replace current database

* This privilege also allows clone to  
     - block all DDL  [implicit BACKUP_ADMIN]
     - restart server [implicit SHUTDOWN privilege]

8. Connect using clone user and clone

CLONE command would transparently replace the current data directory and restart
server. We use RESTART feature here introduced in WL#9809. The clone operation
can be monitored from another session.

Clone operation:
---------------
1. Take backup lock to prevent any more DDL in recipient.
2. Drop all existing user tablespace/data : 
      reduces the DB size for re-provisioning
3. Clone from remote instance in place
    - for user tablespace files copy as it is
    - for all system files keep it as .clone
      e.g. for DD mysqld.ibd.clone
5. Restart server
    - Replace system files from .clone
    - usual server start up [initialize DD, built-in plugin etc.]
    - allow users to connect

Clone Privilege:
----------------
1. [Existing] SYSTEM_VARIABLES_ADMIN privilege in recipient: 
   Allow configuring valid data source(s) for clone
      mysqld configuration variable that would store the list of IP:PORT
      ;; ....
      user with the above privilege can modify the configuration.
2. [New] CLONE_ADMIN privilege in recipient: 
    allow clone to replace current database instance
      CLONE INSTANCE FROM ...
      * It can clone only clone from one of the pre-configured
        locations defined in [A]
C. [Existing] BACKUP_ADMIN in both donor and recipient
D. [Existing] SHUTDOWN privilege in recipient for automatic RESTART

B. Pre-condition check
----------------------
1. Confirm that MYSQL server versions of donor and recipient match
2. Confirm that the OS is same in donor and recipient
3. Confirm that the disk space available in recipient is enough
4. Confirm the paths for files stored outside data directory
5. Confirm that the all plugins of donor are deployed in recipient
6. Confirm that the character sets of donor and recipient match
7. Confirm that Innodb specific data configuration match e.g innodb_page_size
8. Confirm that the security key plugins match
Functional requirements:
------------------------
F-1: CLONE SQL must be able to replace recipient server data directory.

F-2: CLONE SQL must be able to restart recipient server transparently.

* Recipient server must be running with a monitoring process. This requirement
is same as RESTART command. [https://dev.mysql.com/doc/refman/8.0/en/restart.html]

* If no monitoring process is present, CLONE would fail with an error during
restart and shutdown the server. The server must be restarted manually to
complete the clone operation.

F-3: User shall be able to check status(success/error) of last CLONE SQL.

F-4: User shall be able to connect and re-issue CLONE SQL command in case
     of recovery failure after CLONE.
 
F-5: User shall be able to monitor progress of CLONE.

F-6: CLONE SQL must check compatibility of donor and recipient servers.
  a. Match MYSQL server versions
  b. Match OS and platform(32/64 bit)
  c. Check for enough disk space in recipient
  d. Check paths for files stored outside data directory
  e. plugins of donor are there in recipient including security key plugins
  f. character sets of donor and recipient match
  g. Innodb specific data configurations
     -innodb_page_size
     -innodb_data_file_path count and size

F-7: CLONE SQL must keep the server configurations as it is in recipient. Any
configuration persisted with "SET PERSIST" before CLONE must also remain same
and persisted.
https://dev.mysql.com/doc/refman/8.0/en/server-option-variable-reference.html

Non-functional requirements:
----------------------------
NF-1: Security: Must have privilege in recipient to replace data using CLONE.

NF-2: Security: Must have configuration to restrict set of donors.

Limitations
-----------
L-1: No DDL during cloning [Truncate is considered DDL here]
L-2: Tables in "Storage Engines" other than Innodb are not cloned. e.g. MyIsam
and CSV storage engine tables would be empty after clone.
L-3: IPV6 address format is not supported as host name for clone. User can use
alias string instead.
I-1: SQL
--------
I-1.1: Make "DATA DIRECTORY" specification optional in CLONE SQL.

  CLONE INSTANCE FROM USER@HOST:PORT 
  IDENTIFIED BY ''
  [DATA DIRECTORY [=]'']
  [REQUIRE [NO] SSL];

To replace current instance data with cloned data and restart server one needs
to issue CLONE SQL command without specifying data directory. e.g.

  CLONE INSTANCE FROM USER@HOST:PORT IDENTIFIED BY '';

Note: "CLONE INSTANCE" SQL command was introduced in "WL#9210: Clone Remote
Replica".

I-2: Configurations
-------------------
I-2.1 clone_autotune_concurrency:
Scope: Global
Dynamic: Yes
Type: Boolean
Default Value:  TRUE
Description: Dynamically check and tune number of threads to use for clone up to
clone_max_concurrency. If set to FALSE, clone_max_concurrency threads are
spawned to perform clone.

I-2.2 clone_enable_compression:
Scope: Global
Dynamic: Yes
Type: Boolean
Default Value: FALSE
Description: Enable compression of data at network layer. It saves network
bandwidth at the cost of cpu. It is also possible to achieve higher overall data
transfer rate.

I-2.3 clone_ddl_timeout:
Scope: Global
Dynamic: Yes
Type: Integer
Default Value: 300 [5 minutes]
Minimum Value: 0 [skip DDL lock]
Maximum Value: 2592000[1 month]
Description: Time in seconds to wait for backup lock. Clone cannot work with
concurrent DDLs currently. It would acquire a BACKUP lock in donor and recipient
during clone operation. Clone waits for all DDL commands to finish. Once clone
has acquired the backup lock, all DDL commands wait for clone to finish.

"0" is special value to indicate no backup lock. In this case clone would
immediately come out with error if any ddl is attempted during clone.

I-2.4 clone_max_data_bandwidth:
Scope: Global
Dynamic: Yes
Type: Integer
Default Value: 0 [unlimited]
Minimum Value: 0 [unlimited]
Maximum Value: 1048576 [1 TiB/sec]
Description: Maximum data transfer rate in MiB/sec to throttle clone. This
parameter can be used to control impact on donor instance. This limit should be
set only when donor disk i/o bandwidth is saturated impacting performance. 0 is
special value indicating "unlimited" which is also the default. The default
allows clone to operate at highest data transfer rate and provides best performance.

I-2.5 clone_max_network_bandwidth:
Scope: Global
Dynamic: Yes
Type: Integer
Default Value: 0 [unlimited]
Minimum Value: 0 [unlimited]
Maximum Value: 1048576 [1 TiB/sec]
Description: Maximum approximate network transfer rate in MiB/sec to throttle
clone. This parameter can be used to control impact on network bandwidth. This
limit should be set only when network bandwidth is saturated impacting
performance of donor. 0 is special value indicating "unlimited" which is also
the default. The default allows clone to operate at highest data transfer rate
over network and provides best performance.

I-2.6 clone_valid_donor_list:
Scope: Global
Dynamic: Yes
Type: String
Default Value: NULL
Description: List of valid donor addresses allowed to clone from. The format of
the string is HOST1:PORT1,HOST2:PORT2,..." without any space in between.

I-3: Privileges
---------------
We need following privileges for clone.
 1.[Existing] SYSTEM_VARIABLES_ADMIN: admin to update CLONE_VALID_DONOR_LIST.

 2.[Existing] BACKUP_ADMIN: user connecting to donor to block DDL

 3.[New] CLONE_ADMIN: user executing clone at recipient to
     a. drop all data
     b. block all DDL [Implicit BACKUP_ADMIN privilege]
     c. restart/shutdown server [Implicit SHUTDOWN privilege]
 
 4.[Existing] BACKUP_ADMIN: USER in donor instance to clone data.

I-3.1: CLONE_ADMIN: This new privilege is needed to use this feature of
replacing current instance data with cloned data.

I-4: Performance Schema views
-----------------------------
Theses views are specific to clone recipient server and shows the information of
currently running or last executed clone command. The results survive across
shutdown and restart.

These views are owned by clone plugin and visible only after installing clone
plugin.

I-4.1 performance_schema.clone_status: Status of current or last executed clone
command.
+-----------------+--------------+------+-----+---------+-------+
| Field           | Type         | Null | Key | Default | Extra |
+-----------------+--------------+------+-----+---------+-------+
| ID              | int(6)       | YES  |     | NULL    |       |
| PID             | int(6)       | YES  |     | NULL    |       |
| STATE           | char(16)     | YES  |     | NULL    |       |
| BEGIN_TIME      | timestamp(3) | YES  |     | NULL    |       |
| END_TIME        | timestamp(3) | YES  |     | NULL    |       |
| SOURCE          | varchar(256) | YES  |     | NULL    |       |
| DESTINATION     | varchar(512) | YES  |     | NULL    |       |
| ERROR_NO        | int(6)       | YES  |     | NULL    |       |
| ERROR_MESSAGE   | varchar(512) | YES  |     | NULL    |       |
| BINLOG_FILE     | varchar(512) | YES  |     | NULL    |       |
| BINLOG_POSITION | bigint(20)   | YES  |     | NULL    |       |
| GTID_EXECUTED   | varchar(4096)| YES  |     | NULL    |       |
+-----------------+--------------+------+-----+---------+-------+
ID         : Unique clone identifier in current instance.
PID        : Process list ID of the session executing clone
STATE      : Current state of clone command
             "Not Started", "In Progress", "Completed", "Failed"
BEGIN_TIME : Timestamp when clone started
END_TIME   : Timestamp when clone finished
SOURCE     : Remote instance address in the form "host:port" 
             "LOCAL INSTANCE" if 'CLONE LOCAL'.
DESTINATION: Cloned data directory.
             "LOCAL INSTANCE" if cloning to current data directory. 

ERROR_NO        : Error number if clone is unsuccessful.
ERROR_MESSAGE   : Error message string
BINLOG_FILE     : Consistent binary log file name up to which data is cloned
BINLOG_POSITION : Consistent binary log file offset up to which data is cloned 
GTID_EXECUTED   : Consistent GTIDs for cloned data

I-4.2 performance_schema.clone_progress : Progress information of current or
last executed clone command.
+---------------+--------------+------+-----+---------+-------+
| Field         | Type         | Null | Key | Default | Extra |
+---------------+--------------+------+-----+---------+-------+
| ID            | int(6)       | YES  |     | NULL    |       |
| STAGE         | char(32)     | YES  |     | NULL    |       |
| STATE         | char(16)     | YES  |     | NULL    |       |
| BEGIN_TIME    | timestamp(6) | YES  |     | NULL    |       |
| END_TIME      | timestamp(6) | YES  |     | NULL    |       |
| THREADS       | int(6)       | YES  |     | NULL    |       |
| ESTIMATE      | bigint(20)   | YES  |     | NULL    |       |
| DATA          | bigint(20)   | YES  |     | NULL    |       |
| NETWORK       | bigint(20)   | YES  |     | NULL    |       |
| DATA_SPEED    | int(10)      | YES  |     | NULL    |       |
| NETWORK_SPEED | int(10)      | YES  |     | NULL    |       |
+---------------+--------------+------+-----+---------+-------+

ID         : Unique clone identifier in current instance.
STAGE      : Clone stage name 
             "DROP DATA", "FILE COPY", "PAGE_COPY", "REDO_COPY"
             "FILE_SYNC", "RESTART", RECOVERY
STATE      : Current state of this stage
BEGIN_TIME : Timestamp when stage is started
END_TIME   : Timestamp when stage is finished

THREADS       : Number of concurrent threads used in stage
ESTIMATE      : Estimated amount of data for current stage in bytes
DATA          : Amount of data transferred in current state in bytes
NETWORK       : Amount of network data transferred in current state in bytes
DATA_SPEED    : Current speed of data transfer in bytes/sec
NETWORK_SPEED : Current speed of network transfer in bytes/sec

I-5: Files and formats
----------------------
- A directory name #clone is created under data directory in recipient.

- Under #clone directory following files are created
  1.#view_progress: persists performance_schema.clone_progress data
  2.#view_status: Persists persists performance_schema.clone_status data

  3.#status_in_progress: Temporary file that exists when clone in progress
  4.#status_error: Temporary file to indicate incomplete clone.
  5.#status_recovery: Temporary file to hold recovery status information

  6.#new_files: List of all files created during clone
  7.#replace_files: List of all files to be replaced during recovery

- For all system table spaces that cannot be dropped, the cloned tablespace is
kept in file named .clone in same location where the tablespace
file exists. e.g.
  /mysql/mysql.ibd.clone
  /mysql/ibdata1.clone

All these files keeps data in text format. The data is not crash safe. It should
be ok as the recipient has no user data to loose and the information in these
files are not critical. In worst case there might be some stray left-overs after
a crash which can be cleaned by next clone command.

I-6: Error Codes
----------------
ER_CLONE_DDL_IN_PROGRESS
"Concurrent DDL is performed during clone operation. Please try again."

ER_CLONE_TOO_MANY_CONCURRENT_CLONES
"Too many concurrent clone operations. Maximum allowed - 1."

ER_CLONE_DONOR 
"Clone Donor Error: "

ER_CLONE_PROTOCOL
"Clone received unexpected response from Donor : "

ER_CLONE_DONOR_VERSION
"Clone Donor MySQL version:  is different from Recipient MySQL version "

ER_CLONE_OS
  eng "Clone Donor OS:  is different from Recipient "

ER_CLONE_PLATFORM
  eng "Clone Donor platform: <> is different from Recipient platform: <>"

ER_CLONE_CHARSET
  eng "Clone Donor collation:  is unavailable in Recipient."

ER_CLONE_CONFIG
  eng "Clone Configuration : Donor value:  is different from
Recipient value: "

ER_CLONE_SYS_CONFIG
  eng "Clone system configuration: "

ER_CLONE_PLUGIN_MATCH
  eng "Clone Donor plugin  is not active in Recipient."

ER_CLONE_LOOPBACK
  eng "Clone cannot use loop back connection while cloning into current data
directory."

ER_CLONE_DISK_SPACE
  eng "Clone estimated database size is . Available space  is not
enough."

User Scenario
-------------
US-1: Clone mysqld instance "inst-d" data to mysqld instance "inst-r".

Preconditions:
--------------
 1. Both inst-d and inst-r are up and running.
 2. User "admin-r" in inst-r has ALL privileges
 3. User "admin-d" in inst-d has ALL privileges
 4. inst-d is reachable from node where inst-r is running
    with host-d:port-d.

Steps
-----
A. Setup donor: Connect to inst-d with "admin-d"
  mysql> CREATE USER 'clone'@'%' identified by "xyz";
  mysql> GRANT BACKUP_ADMIN on *.* to 'clone'@'%';
  mysql> GRANT SELECT ON performance_schema.* TO 'clone'@'%';
  mysql> INSTALL PLUGIN CLONE SONAME "mysql_clone.so";

B. Setup recipient: Connect to inst-r with "admin-r"
  mysql> CREATE USER 'clone'@'%' identified by "xyz";
  mysql> GRANT BACKUP_ADMIN on *.* to 'clone'@'%';
  mysql> GRANT CLONE_ADMIN on *.* to 'clone'@'%';
  mysql> GRANT SELECT ON performance_schema.* TO 'clone'@'%';
  mysql> INSTALL PLUGIN CLONE SONAME "mysql_clone.so";
  mysql> SET GLOBAL clone_valid_donor_list = "host-d:port-d";

C. Start Clone: Connect to inst-r with "clone"
  mysql> CLONE INSTANCE FROM clone@host-d:port-d identified by "xyz";
  ...running ...

D. Check clone status: Make another connection to inst-r with "clone"
  mysql> SELECT * FROM performance_schema.clone_status;
  ...
  mysql> SELECT * FROM performance_schema.clone_progress;
  ...

E. If need to abort clone while in progress ...
  mysql> KILL QUERY ;  /* PID from performance_schema.clone_status */
  mysql> SELECT * FROM performance_schema.clone_status;
 
F. If not aborted in step [E], wait for clone to finish in step [C]. 

G. If CLONE command comes back with error in step [E], then take appropriate
action and back to step[C] to re-issue CLONE.

G. CLONE command finished successfully. All connections would be disconnected at
this point from inst-r and the instance will reboot. By this time the following
clone stages would be over. 
  "DROP DATA"
  "FILE COPY"
  "PAGE_COPY"
  "REDO_COPY"
  "FILE_SYNC",
The instance inst-r would be unavailable during the last 2 stages "RESTART" &
RECOVERY.
*Refer to "STAGE" column of performance_schema.clone_stage.

G. Keep trying to reconnect to inst-r with user "clone". 

*Note that if recovery is successful then inst-r is a clone of inst-d. OTOH, if
recovery is unsuccessful inst-d would be brought back with old data set without
any user data so that the error can be investigated and clone can be re-issued. 

7. Once connected verify that the clone command has been successful.
  mysql> SELECT STATE from performance_schema.clone_status;
+-----------+
| STATE     |
+-----------+
| Completed |
+-----------+

8. Check details of each stage (e.g. start and stop time)
mysql> select stage, state, end_time from performance_schema.clone_progress;
+-----------+-----------+----------------------------+
| stage     | state     | end_time                   |
+-----------+-----------+----------------------------+
| DROP DATA | Completed | 2019-01-27 22:45:43.141261 |
| FILE COPY | Completed | 2019-01-27 22:45:44.457572 |
| PAGE COPY | Completed | 2019-01-27 22:45:44.577330 |
| REDO COPY | Completed | 2019-01-27 22:45:44.679570 |
| FILE SYNC | Completed | 2019-01-27 22:45:44.918547 |
| RESTART   | Completed | 2019-01-27 22:45:48.583565 |
| RECOVERY  | Completed | 2019-01-27 22:45:49.626595 |
+-----------+-----------+----------------------------+

9. In case step [7] shows error in recovery stage, check the log files and
reissue clone - back to step[C].
  mysql> select state, error_no, error_message from performance_schema.clone_status;
+-----------+----------+---------------+
| STATE     | ERROR_NO | ERROR_MESSAGE |
+-----------+----------+---------------+
| Failed    |      xxx | "xxxxxxxxxxx" |
+-----------+----------+---------------+
The design is mostly trivial and understandable from source code. Some notes on
specific design aspects.

Clone files and failure handling
--------------------------------
We define following states for each file during clone operation. The state is
persistent i.e. we can derive the state based on the presence of one or more of
the clone files.
   1. 
   2. .clone
   3. .save

FILE_STATE_NORMAL   : [1] exists
FILE_STATE_CLONED   : [1] and [2] exists
FILE_STATE_SAVED    : [2] and [3] exists
FILE_STATE_REPLACED : [1] and [3] exists

State transfer:
---------------
[FILE_STATE_NORMAL] --> [FILE_STATE_CLONED]
  Remote data is cloned into another file named .clone.

[FILE_STATE_CLONED] --> [FILE_STATE_SAVED]
  Before recovery the data file is saved in a file named .save.

[FILE_STATE_SAVED] --> [FILE_STATE_REPLACED]
  Before recovery the cloned file is moved to data file.

[FILE_STATE_REPLACED] --> [FILE_STATE_NORMAL]
  After successful recovery the saved data file is removed.

Every state transition involves a single file create, delete or rename and we
consider them atomic.
In case of a failure the state rolls back exactly in reverse order.

We use two more files to keep a list of files to consider for rollback.
  1. #clone/#new_files     all new files to be removed in case of rollback
  2. #clone/#replace_files all files for rollback based on above state

These two files are written during clone operation and are not crash safe.
However, if there is a crash during clone operation, the state could be either
FILE_STATE_NORMAL or FILE_STATE_CLONED. So in worst case (when we missed
identifying files in state FILE_STATE_CLONED), we could have some
.clone files left over which should be cleaned up by next clone and
there should not be any issue with server startup after failure.

Auto tuning
-----------
1. We maintain one stat object for data and network byte transfer statistics.
   myclone::Client - One per thread
   myclone::Client_Share - One per clone operation shared between all threads
   myclone::Client_Stat - Contained in myclone::Client_Share.

2. Stat maintains array of thread specific context. A thread updates data and
network bytes in its own context after each block of cloned data is processed.
  myclone::Thread_Info

3. Stat maintains common information updated only by master thread at specific
interval. This includes total data
  - total data and network bytes transferred by all threads
  - history of data speed and network speed for last few evaluations

4. Master thread attempts to increase the number of threads in steps trying to
evaluate if it positive effect based on the history of data speed. Currently
there are some fixed set of rules defined for tuning. It might change in future.
Client_Stat::tune_has_improved()
- Each time we target doubling the current number of threads.
- We do it in steps of Thread_Tune_Auto::m_step = 4
- We check and stop if speed has degraded more than 5% after any step
- We check and stop if speed improvement is < 10% after 50% of target is reached
- After reaching target we proceed for next target if improvement is > 25%

5. At each stage we attempt to tune the state only once. We could change it in
future as the design supports adding new threads anytime.
  Thread_Tune_Auto::State
  [INIT] -> [ACTIVE] -> [DONE]

6. We don't attempt to decrease threads as of now. We could change it in future
as the design supports dropping threads anytime.

Throttle
--------
1. Master thread reads the global configuration at intervals and sets the target
bandwidth per thread.

2. Threads check the and compares the target bandwidth and current speed at
fixed intervals and sleeps if the current speed is more than desired.
   myclone::Client::check_and_throttle()