WL#9209: InnoDB: Clone local replica

Affects: Server-8.0   —   Status: Complete

Create a server plugin that can be used to retrieve a snapshot of the running
system. 

Here we would support syntax to take a Physical Snapshot of the database and
store it on the same machine/node where the database server is running.

We should be able to start a mysqld server on the cloned directory and access
the data. The clone operation should work with concurrent DMLs on the database.

This is one of the child work logs for provisioning a clone for replication.
Functional
----------
F-1: Must support a simple command to clone all data stored in InnoDB:
       INSTALL PLUGIN clone SONAME 'mysql_clone.so';
       CLONE LOCAL DATA DIRECTORY [=] 'data_dir';
       UNINSTALL PLUGIN clone;

F-2: Must have SUPER privilege to execute the CLONE command.

F-3: Must support cloning external system tablespaces - cloned into the data directory.

F-4: Must allow stopping a running clone command (KILL QUERY processlist_id).
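
For illustration, a running clone could be stopped from another session along
these lines (the id 123 is just a placeholder):

       SHOW PROCESSLIST;   -- find the processlist id of the CLONE session
       KILL QUERY 123;     -- stop the clone running in that session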


Non-functional
--------------
NF-1: [Availability] Should not block the donor for a long time (< 1 second)

NF-2: [Performance] Should not have a significant impact on donor performance
                    (should not be worse than MEB)

Limitations
-----------
1. No DDL during cloning [Truncate is considered DDL here]
* Concurrent Truncate would be supported later

2. No support for cloning encrypted database (TDE)
* Encrypted Database would be supported by WL#9682 later

3. No support for cloning general tablespaces with an absolute path.
* Cannot use the same path, as it would conflict with the source database for a
local clone. We would support this feature in WL#9210 for remote clone.

=== Clone plugin Operation ===
In this WL, we clone data stored in InnoDB only. MySQL binary logs or data
stored in other Storage Engines, like MyISAM, are not cloned.

Other than the data transfer, the Clone plugin module is responsible for
initializing, copying and ending the snapshot with InnoDB and applying the data
to the new data directory.

* Start InnoDB Snapshot
* Copy and apply InnoDB snapshot data
* End Snapshot

The next section explains the interfaces that the InnoDB SE would expose for
the Clone plugin to achieve this.

==== InnoDB Handlerton Interface ====
Here, we need two sets of handlerton interfaces:
# copy snapshot data
# apply snapshot data
The interfaces show only the major arguments and omit the obvious length
parameters.

copy data
=========
* clone_begin(locator[IN/OUT], clone_type[IN], ...)

Initialize the clone operation. The locator is the logical pointer to the data
snapshot. The type is for supporting multiple ways to take a physical snapshot
if needed.

* clone_copy(locator[IN], callback_context[IN], ...)

callback_context
{
...
clone_file_cbk(from_file_descriptor[IN], copy_length[IN], ...) 
clone_buffer_cbk(from_data_buffer[IN], copy_length[IN], ...)

data_descriptor
data_descriptor_length
...
}

The function repeatedly calls the user callback(s) with snapshot data to be
consumed by the caller. Here the user of the interface is the "clone plugin",
which implements the callback functions. There are two callback functions that
the caller needs to provide. One is called with an open file descriptor and the
length to be copied; the other is called with the data in a memory buffer and
its length. In both cases a serialized data descriptor is passed that describes
the data and must be passed on when applying the snapshot data.

The two callbacks provide an easy way to transfer snapshot data from the
database. When some files need to be copied directly, the descriptor is passed
and the clone plugin can directly apply the data to the destination. Some data
is possibly already in memory [buffer pages] and is more convenient to pass via
a memory buffer; for those cases the buffer callback is used.

In either case, the data descriptor provides the metadata needed for the SE to
identify and apply the data appropriately during the apply phase.

* clone_end(locator[IN], ...)

Ends a clone operation, deleting the locator from the cache.
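
Below is a minimal C++ sketch of how the copy-side callback context and entry
points could look. The struct layout, parameter types and locator
representation are assumptions for illustration, not the actual handlerton
definitions.

/* Illustrative sketch only: parameter types and struct layout are assumed. */
#include <cstdint>

struct Clone_copy_callbacks {
  void *cbk_ctx; /* opaque context owned by the clone plugin */

  /* Invoked when data can be copied straight from an open file. */
  int (*clone_file_cbk)(void *cbk_ctx, int from_file_descriptor,
                        uint64_t copy_length);

  /* Invoked when data is already in memory, e.g. buffer pool pages. */
  int (*clone_buffer_cbk)(void *cbk_ctx, const unsigned char *from_data_buffer,
                          uint64_t copy_length);

  /* Serialized descriptor identifying the data; must be forwarded to the
     apply side so the SE can place the data correctly. */
  const unsigned char *data_descriptor;
  uint64_t data_descriptor_length;
};

/* Copy-side entry points as described above (length and minor arguments
   omitted in this WL are also omitted here). */
int clone_begin(unsigned char **locator, uint32_t *locator_length,
                int clone_type);
int clone_copy(const unsigned char *locator, Clone_copy_callbacks *callbacks);
int clone_end(const unsigned char *locator);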

apply data
==========
* clone_apply_begin(locator[IN/OUT], data_directory[IN],...)

Initialize data apply. Prepare to copy snapshot data into the target data directory.

* clone_apply(locator[IN], callback_context[IN], ...)

callback_context
{
...
clone_apply_file_cbk(to_file_descriptor[IN], copy_length[IN], ...) 

data_descriptor
data_descriptor_length
...
}

Interface to apply data to the new data directory. The "clone plugin" needs to
call this interface for every piece of data provided by the clone_copy callback
function(s). The function clone_apply calls the user callback
"clone_apply_file_cbk()" with the file descriptor to copy the data to. The
"clone plugin" implements the callback to copy the data from source to
destination.

* clone_apply_end(locator[IN], ...)

Ends the apply phase.

Control Flow  between "clone plugin" and SE
===========================================
[SE] => Implemented in SE(Innodb)
[CLONE] => Implemented in "Clone plugin"

"clone plugin" ->clone_begin()[SE]
"clone plugin" ->clone_apply_begin()[SE]

"clone plugin" ->clone_copy()[SE] ->clone_file/buffer_callback[CLONE]
->clone_apply[SE] -> clone_apply_file_cbk[CLONE]

"clone plugin" ->clone_apply_end()[SE]
"clone plugin" ->clone_copy_end()[SE]

Here the SE provides the file descriptor/buffer to copy data to/from and the
"Clone plugin" transfers the data.

InnoDB [SE] Design
==================
We have three basic entities here, each with a "one to many" relationship to
the next one.

Snapshot --> Clone handle --> Task

1. Snapshot: A dynamic snapshot of the database. Multiple concurrent snapshots
can be supported by design. The current WL would support one snapshot at a
time; this can be extended later.

2. Clone Handle: Handle to copy the snapshot. One for each CLONE SQL command.
By design, multiple clone handles can be attached to the same snapshot. The
current WL would support one "Clone Handle" per "Snapshot".

It can be extended later. When multiple clone handles share a snapshot, they
would have the exact same snapshot of the database and would not need to
create their own metadata and end points for the snapshot stages.

3. Task: Task to do the actual operation. By design, multiple tasks can be
attached to a Clone Handle to perform the operation concurrently. The current
WL would support only one task per "Clone Handle"; future performance
improvements might extend this.

Currently, in this WL, there would be one Task, one Clone Handle and one
Snapshot.
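
As a rough illustration of the "one to many" design, the entities could be
sketched as below; the class and member names are placeholders, not the actual
InnoDB code.

/* Illustrative relationship only; names are placeholders. */
#include <cstdint>
#include <vector>

struct Clone_Task_Sketch {
  uint32_t task_index; /* identifies one worker doing copy/apply work */
};

struct Clone_Snapshot_Sketch; /* forward declaration */

/* One handle per CLONE SQL command; may own several tasks. */
struct Clone_Handle_Sketch {
  Clone_Snapshot_Sketch *snapshot;      /* snapshot it is attached to  */
  std::vector<Clone_Task_Sketch> tasks; /* exactly one task in this WL */
};

/* One snapshot, shareable by several clone handles by design. */
struct Clone_Snapshot_Sketch {
  std::vector<Clone_Handle_Sketch *> handles; /* exactly one in this WL */
};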

1. Snapshot:
============
For a clone operation, InnoDB would create a "Clone Handle" identified by a
locator returned to the clone plugin. The "Clone Handle" would be attached to a
"Snapshot". The diagram below shows the states of the "Snapshot" during
cloning.

[INIT] ---> [FILE COPY] ---> [PAGE COPY] ---> [REDO COPY] ---> [DONE]

The idea is to use the page delta to cover the larger part of the copy
operation (file copy) and use redo to cover the second part and get to a
consistent snapshot.

We would also implement two InnoDB system-wide features, "Page Tracking" and
"Redo Archiving", to be used by the clone operation. The tracking would start
from the current log sys LSN, i.e. the LSN of the last finished mtr. Note that
the changes before this LSN may not be flushed to redo or disk yet and the
caller should have appropriate considerations, as explained below.

Snapshot States
---------------
INIT: The clone object is initialized and identified by a locator.

FILE COPY: The state changes from INIT to "FILE COPY" when the snapshot_copy
interface is called. Before making the state change we start "Page Tracking" at
LSN "CLONE START LSN". In this state we copy all database files and send them
to the caller.

PAGE COPY: The state changes from "FILE COPY" to "PAGE COPY" after all files are
copied and sent. Before making the state change we start "Redo Archiving" at LSN
"CLONE FILE END LSN" and stop "Page Tracking". In this state, all modified
pages, as identified by page IDs between "CLONE START LSN" and "CLONE FILE END
LSN", are read from the "buffer pool" and sent. We would sort the pages by space
ID and page ID to avoid random reads (donor) and random writes (recipient) as
much as possible.

REDO COPY: The state changes from "PAGE COPY" to "REDO COPY" after all modified
pages are sent. Before making the state change we stop "Redo Archiving" at LSN
"CLONE LSN". This is the LSN of the cloned database. We would also need to
capture the replication coordinates at this point in the future. It should be
the replication coordinate of the last committed transaction up to the "CLONE
LSN". In this state we send the redo from the archived files, from "CLONE FILE
END LSN" to "CLONE LSN", before moving to the "DONE" state.

DONE: The clone object is kept in this state until destroyed by the snapshot_end() call.
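
For reference, the stages and their bounding LSNs can be summarized in a small
sketch; the enum and constant names below are illustrative only.

/* Illustrative only: the enum mirrors the state diagram above. */
enum Snapshot_State {
  SNAPSHOT_INIT,      /* locator created, nothing copied yet              */
  SNAPSHOT_FILE_COPY, /* "Page Tracking" started at "CLONE START LSN"     */
  SNAPSHOT_PAGE_COPY, /* "Redo Archiving" started at "CLONE FILE END LSN",
                         "Page Tracking" stopped                          */
  SNAPSHOT_REDO_COPY, /* "Redo Archiving" stopped at "CLONE LSN"          */
  SNAPSHOT_DONE       /* kept until snapshot_end() destroys the object    */
};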

Page Tracking
-------------
Modified pages can be tracked either at the time when an mtr adds them to the
flush list or at the time they are flushed to disk by the I/O threads. We
choose to track them when they are flushed to disk, to keep it consistent with
the checkpoint and to not impact mtr concurrency.

The consistency of this phase is defined as follows.
* At start, it guarantees to track all pages that are not yet flushed. All
flushed pages would be included in "FILE COPY".

* At end, it ensures that the pages are tracked at least up to the checkpoint
LSN. All modifications after checkpoint would be included in "REDO COPY".

Key design points:
==================
A. We would have an in-memory buffer to queue the modified page and space IDs.

B. We would check the frame LSN of the page before modification and would not
add the page if it is greater than the LSN at which we started page tracking.
This avoids duplicate page IDs.

C. The buffer would be flushed/appended to a disk file by the archiver
background thread from time to time.

D. We would not maintain the metadata information separately. The file name and
header would have the information.

E. The file header would have the start and end lsn. The end lsn would be
written when we stop tracking.

If the memory buffer gets full, it can block page flushing. The case, however,
seems very unlikely: with a 16k page size and roughly 8 bytes per tracked page
ID, filling 512k of buffer would take about 64k distinct pages, i.e. about 1GB
of page modifications. We would have enough time to flush the buffer to disk,
as the amount of redo that would need to be flushed to disk by that time would
be much higher than that.

F. If there is not enough memory, throw a warning and stop tracking. This might
force cloning to restart, but it is an unlikely case.

We would provide a way to create a page tracking client context with four interfaces:
    PageArchClientCtx::Start()
    PageArchClientCtx::Stop()
    PageArchClientCtx::get_pages(cbk_func()[IN], cbk_ctx[IN], ...)
    PageArchClientCtx::Release()

Once the caller releases a certain range, the corresponding file(s) would be
deleted. Later, when we provide "Page Tracking" for external users, we might
set some flag to disable this automatic purging. Also, currently the files
would be purged at startup.
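
For illustration, the copy stages could drive this context roughly as follows;
the method signatures, lsn_t out-parameters and send_page_cbk() are assumptions
layered on the four interfaces listed above.

/* Sketch only: exact signatures of the four interfaces are assumed. */
int copy_modified_pages(PageArchClientCtx *ctx) {
  lsn_t start_lsn = 0; /* becomes "CLONE START LSN"    */
  lsn_t stop_lsn = 0;  /* becomes "CLONE FILE END LSN" */

  ctx->Start(&start_lsn); /* before moving from INIT to FILE COPY      */
  /* ... FILE COPY stage: copy and send all database files ...         */
  ctx->Stop(&stop_lsn);   /* before moving from FILE COPY to PAGE COPY */

  /* Iterate the tracked page IDs; send_page_cbk is a hypothetical clone
     plugin callback that reads each page from the buffer pool and sends it. */
  int err = ctx->get_pages(send_page_cbk, /* cbk_ctx */ nullptr);

  ctx->Release(); /* allow the tracking file(s) to be purged */
  return err;
}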

Redo Archiving
--------------
A. The archiver background thread would start copying the redo from the
checkpoint LSN onwards. We define "Archived LSN" as the LSN up to which redo is
archived.

* The archiver background thread is common for redo log and page tracking.

B. We would copy in chunks from the redo file to the archive file, from the
last copied LSN to the current flush LSN.

C. The copy should generally be faster than the redo generated by the system.
mtrs would wait before going over the "Archived LSN", preventing redo
overwrite.

D. Archive files would be named by LSN.

E. We would provide a way to create a log archiving client context with four
interfaces:
    LogArchClientCtx::Start()
    LogArchClientCtx::Stop()
    LogArchClientCtx::get_files(cbk_func()[IN], cbk_ctx[IN], ...)
    LogArchClientCtx::Release()

Once the caller releases a certain range, the corresponding file(s) would be
deleted. Later, when we provide "Redo Archiving" for external users, we might
set some flag to disable this automatic purging. Also, currently the files
would be purged at server startup.
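
Analogously, the "REDO COPY" stage could consume the archived redo roughly as
follows; again the method signatures and send_redo_file_cbk() are assumptions
over the four listed interfaces.

/* Sketch only: exact signatures of the four interfaces are assumed. */
int copy_archived_redo(LogArchClientCtx *ctx) {
  lsn_t archive_start_lsn = 0; /* "CLONE FILE END LSN" */
  lsn_t archive_stop_lsn = 0;  /* "CLONE LSN"          */

  ctx->Start(&archive_start_lsn); /* before moving from FILE COPY to PAGE COPY */
  /* ... PAGE COPY stage: send the modified pages ...                          */
  ctx->Stop(&archive_stop_lsn);   /* before moving from PAGE COPY to REDO COPY */

  /* send_redo_file_cbk is a hypothetical clone plugin callback that sends
     one archived redo file (named by LSN) to be applied on the clone. */
  int err = ctx->get_files(send_redo_file_cbk, /* cbk_ctx */ nullptr);

  ctx->Release(); /* allow the archived file(s) to be purged */
  return err;
}
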
Server:
=======
Grammar changes to support the CLONE SQL command:
  sql/sql_lex.h
  sql/sql_yacc.yy

Clone Plugin:
=============
New plugin for transferring data from source to destination by interacting with
the SE [InnoDB].

plugin/clone/src
            /include

  clone_plugin.cc : plugin interface
  clone_local.cc  : clone operation
  clone_os.cc     : OS abstraction for raw data copy [sendfile/read/write]

SE [InnoDB]
===========
Clone: storage/innobase/clone
  clone0clone.cc     : clone task and runtime operation
  clone0snapshot.cc  : snapshot management
  clone0copy.cc      : copy specific methods
  clone0apply.cc     : apply specific methods
  clone0desc.cc      : serialized data descriptor

Archiver: storage/innobase/arch
  arch0arch.cc
  arch0page.cc
  arch0log.cc