WL#9209: InnoDB: Clone local replica
Affects: Server-8.0
—
Status: Complete
Create a server plugin that can be used to retrieve a snapshot of the running system. Here we would support syntax to take a physical snapshot of the database and store it on the same machine/node where the database server is running. We should be able to start a mysqld server on the cloned directory and access the data. The clone operation should work with concurrent DMLs on the database. This is one of the child worklogs for provisioning a clone for replication.
Functional
----------
F-1: Must support a simple command to clone all data stored in InnoDB:

       INSTALL PLUGIN clone SONAME 'mysql_clone.so';
       CLONE LOCAL DATA DIRECTORY [=] 'data_dir';
       UNINSTALL PLUGIN clone;

F-2: Must have SUPER privilege to execute the CLONE command.
F-3: Must support cloning external system tablespaces - cloned to the data directory.
F-4: Must allow stopping a running clone command (KILL QUERY processlist_id).

Non-functional
--------------
NF-1: [Availability] Should not block the donor for a long time (< 1 second).
NF-2: [Performance] Should not have a significant impact on donor performance (should not be worse than MEB).

Limitations
-----------
1. No DDL during cloning [TRUNCATE is considered DDL here].
   * Concurrent TRUNCATE would be supported later.
2. No support for cloning an encrypted database (TDE).
   * Encrypted databases would be supported later by WL#9682.
3. No support for cloning general tablespaces with an absolute path.
   * The same path cannot be used, as it would conflict with the source database for a local clone. We would support this feature in WL#9210 for remote clone.
=== Clone plugin Operation ===
In this WL, we clone only the data stored in InnoDB. MySQL binary logs and data stored in other storage engines, such as MyISAM, are not cloned. Other than the data transfer, the clone plugin module is responsible for initializing, copying and ending the snapshot with InnoDB, and for applying the data to the new data directory:

* Start InnoDB snapshot
* Copy and apply InnoDB snapshot data
* End snapshot

The next section explains the interfaces that the InnoDB SE would expose for the clone plugin to achieve this.

==== InnoDB Handlerton Interface ====
Here we need two sets of handlerton interfaces:

# copy snapshot data
# apply snapshot data

The interfaces below show only the major arguments and omit the obvious length parameters.

copy data
=========
* clone_begin(locator[IN/OUT], clone_type[IN], ...)

  Initializes the clone operation. The locator is the logical pointer to the data snapshot. The type is for supporting multiple ways to take a physical snapshot, if needed.

* clone_copy(locator[IN], callback_context[IN], ...)

  callback_context {
    ...
    clone_file_cbk(from_file_descriptor[IN], copy_length[IN], ...)
    clone_buffer_cbk(from_data_buffer[IN], copy_length[IN], ...)
    data_descriptor
    data_descriptor_length
    ...
  }

  The function repeatedly calls the user callback(s) with snapshot data to be consumed by the caller. Here the user of the interface is the clone plugin, which implements the callback functions. There are two callback functions that the caller needs to provide: one is called with an open file descriptor and the length to be copied; the other is called with the data in a memory buffer and its length. In both cases a serialized data descriptor is passed that describes the data and should be passed on while applying the snapshot data. The two callbacks provide an easy way to transfer snapshot data from the database. When some files need to be copied directly, the descriptor is passed and the clone plugin can directly apply the data to the destination. Some data may already be in memory [buffer pages] and is more convenient to pass via memory; for those cases the buffer callback is used. In either case, the data descriptor provides the metadata needed by the SE to identify and apply the data appropriately while applying snapshot data.

* clone_end(locator[IN], ...)

  Ends a clone operation, deleting the locator from the cache.

apply data
==========
* clone_apply_begin(locator[IN/OUT], data_directory[IN], ...)

  Initializes the data apply. Prepares to copy snapshot data into the target data directory.

* clone_apply(locator[IN], callback_context[IN], ...)

  callback_context {
    ...
    clone_apply_file_cbk(to_file_descriptor[IN], copy_length[IN], ...)
    data_descriptor
    data_descriptor_length
    ...
  }

  Interface to apply data to the new data directory. The clone plugin needs to call this interface for every piece of data provided by the clone_copy callback function(s). The function clone_apply calls the user callback clone_apply_file_cbk() with the file descriptor to copy the data to. The clone plugin implements the callback to copy the data from source to destination.

* clone_apply_end(locator[IN], ...)

  Ends the apply phase.

Control Flow between "clone plugin" and SE
==========================================
[SE]    => Implemented in SE (InnoDB)
[CLONE] => Implemented in "clone plugin"

"clone plugin" -> clone_begin()       [SE]
"clone plugin" -> clone_apply_begin() [SE]
"clone plugin" -> clone_copy()        [SE]
                    -> clone_file/buffer_callback [CLONE]
                         -> clone_apply()             [SE]
                              -> clone_apply_file_cbk [CLONE]
"clone plugin" -> clone_apply_end()   [SE]
"clone plugin" -> clone_end()         [SE]

Here the SE provides the file descriptor/buffer to copy data to/from, and the clone plugin transfers the data.

InnoDB [SE] Design
==================
We have three basic entities, each with a "one to many" relationship to the next one:

  Snapshot --> Clone Handle --> Task

1. Snapshot: A dynamic snapshot of the database. Multiple concurrent snapshots can be supported by design.
The current WL would support one snapshot at a time; this can be extended later.

2. Clone Handle: Handle to copy the snapshot, one for each CLONE SQL command. By design, multiple clone handles can be attached to the same snapshot. The current WL would support one "Clone Handle" per "Snapshot"; it can be extended later. When multiple clone handles share a snapshot, they have exactly the same snapshot of the database and do not need to create their own metadata and end points for the snapshot stages.

3. Task: Task to do the actual operation. By design, multiple tasks can be attached to a clone handle to perform the operation concurrently. The current WL would support only one task per "Clone Handle"; future performance improvements might extend it.

Currently, in this WL, there would be one Task, one Clone Handle and one Snapshot.

1. Snapshot:
============
For a clone operation, InnoDB would create a "Clone Handle" identified by a locator returned to the clone plugin. The "Clone Handle" would be attached to a "Snapshot". The diagram below shows the states of the "Snapshot" during cloning.

  [INIT] ---> [FILE COPY] ---> [PAGE COPY] ---> [REDO COPY] ---> [Done]

The idea is to use page deltas for the larger part of the copy operation (file copy) and use redo for the final part to get to a consistent snapshot. We would also implement two InnoDB system-wide features, "Page Tracking" and "Redo Archiving", to be used by the clone operation. Tracking would start from the current log sys LSN, i.e. the LSN of the last finished mtr. Note that the changes before this LSN may not yet be flushed to redo or disk, and the caller should take appropriate care, as explained below.

Snapshot States
---------------
INIT: The clone object is initialized and identified by a locator.

FILE COPY: The state changes from INIT to "FILE COPY" when the snapshot_copy interface is called. Before making the state change, we start "Page Tracking" at LSN "CLONE START LSN". In this state we copy all database files and send them to the caller.

PAGE COPY: The state changes from "FILE COPY" to "PAGE COPY" after all files are copied and sent. Before making the state change, we start "Redo Archiving" at LSN "CLONE FILE END LSN" and stop "Page Tracking". In this state, all modified pages, identified by the page IDs tracked between "CLONE START LSN" and "CLONE FILE END LSN", are read from the buffer pool and sent. We would sort the pages by space ID and page ID to avoid random reads (donor) and random writes (recipient) as much as possible.

REDO COPY: The state changes from "PAGE COPY" to "REDO COPY" after all modified pages are sent. Before making the state change, we stop "Redo Archiving" at LSN "CLONE LSN". This is the LSN of the cloned database. In future we would also need to capture the replication coordinates at this point; it should be the replication coordinate of the last committed transaction up to the "CLONE LSN". In this state, we send the redo logs from the archived files, from "CLONE FILE END LSN" to "CLONE LSN", before moving to the "Done" state.

Done: The clone object is kept in this state until destroyed by the snapshot_end() call.

Page Tracking
-------------
Modified pages can be tracked either at the time an mtr adds them to the flush list or at the time they are flushed to disk by the I/O threads. We choose to track them when they are flushed to disk, to keep tracking consistent with the checkpoint and not to impact mtr concurrency. The consistency of this phase is defined as follows:

* At start, it guarantees to track all pages that are not yet flushed. All flushed pages would be included in "FILE COPY".
* At end, it ensures that the pages are tracked at least up to the checkpoint LSN. All modifications after the checkpoint would be included in "REDO COPY".

Key design points:
==================
A. We would have an in-memory buffer to queue the modified page and space IDs.
B. We would check the frame LSN of the page before modification and would not add the page if it is newer than the LSN at which we started page tracking. This avoids duplicate page IDs.
C. The buffer would be flushed/appended to a disk file by the archiver background thread from time to time.
D. We would not maintain the metadata information separately. The file name and header would carry that information.
E. The file header would have the start and end LSN. The end LSN would be written when we stop tracking. If the memory buffer gets full, it can block page flushing. This case, however, seems very unlikely, as we would need 1 GB of page modifications to fill 512 KB of buffer with a 16 KB page size. We would have enough time to flush the buffer to disk, as the amount of redo that would need to be flushed by that time would be much higher than that.
F. If there is not enough memory, throw a warning and stop tracking. This might force the clone to restart, but it is an unlikely case.

We would provide a way to create a page tracking client context with four interfaces:

  PageArchClientCtx::Start()
  PageArchClientCtx::Stop()
  PageArchClientCtx::get_pages(cbk_func()[IN], cbk_ctx[IN], ...)
  PageArchClientCtx::Release()

Once the caller releases a certain range, the corresponding file(s) would be deleted. Later, when we provide "Page Tracking" for external users, we might set a flag to disable this automatic purging. Also, currently the files would be purged at start-up.

Redo Archiving
--------------
A. The archiver background thread would start copying the redo from the checkpoint LSN onwards. We define the "archived LSN" as the LSN up to which redo has been archived. (The archiver background thread is common to redo log and page tracking.)
B. We would copy in chunks from the redo file to the archive file, from the last copied LSN to the current flush LSN.
C. The copy should generally be faster than the redo generated by the system. mtrs would wait before going past the "archived LSN", preventing redo from being overwritten.
D. Archive files would be named by LSN.
E. We would provide a way to create a log archiving client context with four interfaces:

  LogArchClientCtx::Start()
  LogArchClientCtx::Stop()
  LogArchClientCtx::get_files(cbk_func()[IN], cbk_ctx[IN], ...)
  LogArchClientCtx::Release()

Once the caller releases a certain range, the corresponding file(s) would be deleted. Later, when we provide "Redo Archiving" for external users, we might set a flag to disable this automatic purging. Also, currently the files would be purged at server start-up.
Server:
=======
Grammar changes to support the CLONE SQL command.

  sql/sql_lex.h
  sql/sql_yacc.yy

Clone Plugin:
=============
New plugin for transferring data from source to destination by interacting with the SE [InnoDB].

  plugin/clone/src /include
    clone_plugin.cc : plugin interface
    clone_local.cc  : clone operation
    clone_os.cc     : OS abstraction for raw data copy [sendfile/read/write]

SE [InnoDB]
===========
Clone:
  storage/innobase/clone
    clone0clone.cc    : clone task and runtime operation
    clone0snapshot.cc : snapshot management
    clone0copy.cc     : copy specific methods
    clone0apply.cc    : apply specific methods
    clone0desc.cc     : serialized data descriptor

Archiver:
  storage/innobase/arch
    arch0arch.cc
    arch0page.cc
    arch0log.cc
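The control flow between the clone plugin and the SE described earlier (clone_begin, clone_apply_begin, clone_copy feeding each chunk into clone_apply, then the two end calls) can be sketched with toy stand-ins. ToySE, run_local_clone, and the string-chunk "snapshot" are invented for illustration; the real handlerton functions take locators, descriptors, and length arguments as listed in the interface section, not these simplified signatures.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

using Locator = uint64_t;

// Toy "SE": hands out snapshot chunks via the copy callback and receives
// them back through the apply path. Stands in for InnoDB.
class ToySE {
 public:
  Locator clone_begin() { return 1; }  // arbitrary locator value
  Locator clone_apply_begin(const std::string &dir) {
    target_dir_ = dir;
    return 2;
  }

  // Repeatedly invokes the caller's buffer callback with snapshot data.
  void clone_copy(Locator,
                  const std::function<void(const std::string &)> &buffer_cbk) {
    for (const auto &chunk : snapshot_) buffer_cbk(chunk);
  }

  // Applies one piece of data; calls back so the plugin does the transfer.
  void clone_apply(Locator, const std::string &chunk,
                   const std::function<void(const std::string &)> &apply_cbk) {
    apply_cbk(chunk);
  }

  void clone_apply_end(Locator) {}
  void clone_end(Locator) {}

  std::vector<std::string> snapshot_{"ibdata1", "t1.ibd", "redo"};
  std::string target_dir_;
};

// The "clone plugin" side: drives the call ordering from the control flow
// diagram and owns the destination (here just a vector of chunk names).
std::vector<std::string> run_local_clone(ToySE &se, const std::string &dir) {
  std::vector<std::string> cloned;
  Locator src = se.clone_begin();
  Locator dst = se.clone_apply_begin(dir);
  se.clone_copy(src, [&](const std::string &chunk) {
    // clone_buffer_cbk: feed every received piece straight into apply.
    se.clone_apply(dst, chunk, [&](const std::string &c) {
      cloned.push_back(c);  // clone_apply_file_cbk: write to destination
    });
  });
  se.clone_apply_end(dst);
  se.clone_end(src);
  return cloned;
}
```

The point of the sketch is the nesting: the SE drives the copy loop, while the plugin's callbacks route each piece into the apply path, so the plugin never needs to buffer the whole snapshot.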
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.