WL#4538: Make BACKUP/RESTORE commands interruptible
Affects: Server-6.0
—
Status: Complete
RATIONALE. This worklog is to fix BUG#35079, i.e. ensure that BACKUP/RESTORE commands can be interrupted in the middle if user hits Ctrl+C or client connection is killed. Note. Before new design can be completed, number of issues in the existing infrastructure for giving feedback from BACKUP/RESTORE commands should be fixed: BUG#40303 - This is needed so that in case an error is reported after an interruption, client still sees "Query execution was interrupted" error. BUG#40305 - This will give a feedback about the fact that command has been interrupted. BUG#40307 - This will add information useful for estimating possible problems after interrupting destructive RESTORE operation.
Here we define what feedback should be given to the user if BACKUP or RESTORE command is interrupted. Amending existing feedback ========================== Backup code reports errors and warnings in several places: - the BACKUP/RESTORE statement can return error as any other SQL statement, - errors and warnings are pushed on the connection error stack, - errors, warnings and notes are written to server's error log, - errors and other information are logged in the backup_progress log, - the status of the operation is reported in backup_history log. During normal operation, basic information about progress of the operation is logged. For BACKUP operation the following messages are sent: [Note] Backup: Starting backup process [Note] Backup: Backing up all databases or [Note] Backup: Backing up selected databases [Note] Backup: Backup completed These are sent only to server's error log. If there are no errors or warnings, the server's error stacks remains empty and the statement returns positive reply (the backup_id). In case of error, it is pushed on the error stack, logged in the servers error log as: [ERROR] Backup:and error reply is send to the client. There are some quirks of the existing error reporting infrastructure which will be relevant for giving correct feedback in case of interruption. 1. In case of error, the final note "Backup: Backup completed" is sometimes logged and sometimes not. There is no entry in the log which warns about the fact that the operation has *not* been completed. [To be fixed by BUG#40305] 2. The note "Backup: Backing up selected databases" does not inform about the databases being backed up. Also, in case of RESTORE no information about restored databases is logged. [To be fixed by BUG#40307] 3. In case of several errors reported, the client gets the *last* of them as the error from the SQL statement. But for other SQL statements the *first* error is returned and it seems more appropriate. [To be fixed by BUG#40303] All these are related to giving feedback in case of interrupting BACKUP/RESTORE statement by user. For example warning about uncompleted operation would be important in that case too. Logging list of databases affected by the operation will give better feedback for the user who interrupts it. Also, returning the first logged error is important for correct feedback in case of interruption, because in this case errors might be reported during clean-up after the interruption and still the interruption notification should be sent to the client, not the clean-up errors. Feedback when BACKUP/RESTORE is interrupted =========================================== When BACKUP/RESTORE command is interrupted, a note will be placed in the log and the operation will be aborted, generating "Operation aborted" warnings as in the case of abort on error. For example: [Note] Restore: Starting restore process. [Note] Restore: Restoring database(s) `db1`,`db2`,... [Note] Restore: Operation has been interrupted. [Warning] Restore: Operation aborted - data might be corrupted. The BACKUP/RESTORE statement will return error "Query execution was interrupted", like any other SQL statement which is interrupted. The same error will be placed on the error stack. In backup logs, change to "cancelled" state will be reported [1]. When BACKUP or RESTORE operation is interrupted, some clean-up must be done. For example the backup/restore drivers used in the operation must be properly shut down. Errors can happen during this clean-up phase. This may lead to additional error messages being reported together with the information that the operation has been interrupted. Example: [Note] Backup: Starting backup process. [Note] Backup: Backing up database(s) `db1`,`db2`,... [Note] Backup: Operation has been interrupted. [ERROR] Backup: MyISAM backup driver can't cancel its backup operation. [ERROR] Backup: Default backup driver can't cancel its backup operation. [Warning] Backup: Operation aborted. The BACKUP statement still returns "Query execution was interrupted". The error stack will contain: Error | xxxx | Query execution was interrupted. Error | yyyy | MyISAM backup driver can't cancel its backup operation. Error | zzzz | Default backup driver can't cancel its backup operation. In some cases, it will be possible that the statement is interrupted just after detecting an error. For example: [Note] Backup: Starting backup process. [Note] Backup: Backing up database(s) `db1`,`db2`,... [ERROR] Backup: Out of memory. [Note] Backup: Operation has been interrupted. [Warning] Backup: Operation aborted. In that case the BACKUP statement will return the first reported error, that is "Out of memory". The error stack will contain: Error | yyyy | Out of memory. Error | xxxx | Query execution was interrupted. In backup logs (backup_progress and backup_history) the fact that BACKUP/RESTORE command has been interrupted will be noted by reporting change to a new "cancelled" state (added to the existing "starting", "running", "error" etc.) [1]. As with other state changes, this will insert a new entry in the backup_progress log and update the entry in backup_history log. ---- [1] In the current server, the change of state to "cancelled" and any errors detected after an interruption will not show in backup log tables because of the issue described in BUG#39924.
Note: The design below assumes that BUG#40303 (returning first error reported) is fixed. Without it, the error response sent in case of interruption will sometimes not be as specified in HLS (it will be the last error reported, not the information about statement interruption). HOW INTERRUPTIONS ARE DETECTED ============================== In this design we rely on the server code to detect interruptions, such as when user hits Ctrl+C or closes connection. Checks for this condition are present in many places in the server code (esp. when the code waits for something). If interruption is detected, the thd->killed member is set to non-zero value. Backup code will check for thd->killed flag and assume that if it is non-zero then an interruption has been detected by the server. CHANGES IN THE KERNEL CODE ========================== 1. New method for detecting and reporting statement interruption will be added to backup::Logger: /** Report the fact that BACKUP/RESTORE operation has been cancelled. This method does nothing if no interruption has happened or if an interruption has already been reported. Otherwise it logs ER_BACKUP_INTERRUPTED note. @note The server takes care of giving feedback to the client in case of a cancelled statement - nothing needs to be done here (nothing pushed on the server's error stack) @returns TRUE if interruption has been detected and reported, FALSE otherwise. */ bool Logger::report_killed() 2. New error message. ER_BACKUP_INTERRUPTED eng "Operation has been interrupted." Note that when written to server's error log, it will be prefixed by "Backup:" or "Restore:". 3. Changed semantics of Logger::report_error() This is the main method for reporting errors in backup code. It will be changed so that it detect an interruption if it happened and then report the interruption instead of the error. This is appropriate in most cases. For example consider this code snippet: int ret= some_function(); if (ret) { log.report_error(ER_SOME_ERROR,...); return ...; } Some_function() can return non-zero return code also in the case an interruption has been detected inside it. In this case it will be incorrect to report ER_SOME_ERROR - the interruption should be reported instead. Thus report_error() will call Logger::report_killed() before doing anything else, and in case an interruption was detectd it will not log the error message. Note: If interruption happens but some_functions() returns success, the execution will continue until some other called function detects the interruption and signals error, or until we return to the caller, which will be then responsible for detecting it. 4. Detecting interruptions in the code. Checks for interruptions will be made at high level of the call hierarchy, inside Backup_restore_ctx methods and in the write/restore_table_data() functions. Methods of Backup_restore_ctx are supposed to detect interruptions and stop execution. All other functions and methods might, but don't have to detect interruptions. Hence when an external function or method is called from Backup_restore_ctx, the caller should check for interruption regardless of whether the call was successful or not. For example like in this place in Backup_restore_ctx::prepare_for_backup(): int ret= Logger::init(BACKUP, query); // Logs errors if (report_killed()) { fatal_error(ER_QUERY_INTERRUPTED); rerurn NULL; } if (ret) { fatal_error(ER_BACKUP_LOGGER_INIT); return NULL; } Regardless of whether in case of interruption Logger::init() returns error or not, report_killed() will be called and in case it detects (and reports) interruption, the context will be moved into error state and prepare_for_backup() method will return immediately signalling error (by returning NULL). Note that even if Logger::init() detects and reports an interruption, this code still works ok, because in that case report_killed() will not report the interruption for the second time but it will return TRUE informing about the fact that it has happened. If report_killed() says there is no interruption, then we will continue as before and check the value returned by Logger::init(). Another situation is when calling Backup_restore_ctx methods which should detect and report interruptions. For example, consider a call to Backup_restore_ctx::prepare() from Backup_restore_ctx::prepare_for_backup(): if (prepare(backupdir, orig_loc)) // Logs errors and detects interruptions return NULL; In this case there is no need to call report_killed() because if an interruption has happened then prepare() would detect and report it. 4. Interruptions during backup/restore of table data. In write_table_data() the main interruption check-point will be Scheduler::step() method which will start as follows: int Scheduler::step() { // Bail-out in case of an interruption. if (m_log.report_killed()) return ER_QUERY_INTERRUPTED; // Pick next driver to pump data from. ... In restore_table_data(), report_killed() will be used after calls to driver methods, as in this example: // Initialize the drivers. for (uint n=0; n < info.snap_count(); ++n) { res= drv[n]->begin(0); if (log.report_killed()) goto error; if (res == backup::ERROR) { log.report_error(ER_BACKUP_INIT_RESTORE_DRIVER, info.m_snap[n]->name()); goto error; } ... Similar, after calls to backup stream library functions. For example: case READING: bzero(&chunk_info, sizeof(chunk_info)); ret= bstream_rd_data_chunk(&s, &chunk_info); if (log.report_killed()) goto error; switch (ret) { ... FEEDBACK IN BACKUP LOGS ======================= When Logger::report_killed() reports the fact that interruption has happened, it will also report change to "cancelled" state by calling Logger::report_state(BUP_CANCEL). The latter will forward the notification to the backup log subsystem, (calling Backup_log::state(BUP_CANCEL)) which will write an appropriate entry to backup_progress log and record the new state in backup_history log. Limitation ---------- Because of BUG#39924 any writes/updates of backup logs made after the current connection has been killed (thd->killed is non-zero) can not be performed. Because of this, the change of state to BUP_CANCEL, even though reported by the code, will not be seen in the backup log tables. This should improve as soon as BUG#39924 is fixed. ADDITIONAL FEEDBACK =================== The changes implemented here should ensure that in case of an interruption, message [Note] Backup: Operation has been interrupted is correctly reported together with a change to "cancelled" state. Additionally, The usual closing message [Note] Backup: Backup completed Should be replaced by more appropriate one, e.g.: [Warning] Backup: Operation aborted or [Warning] Restore: Operation aborted - data might be corrupted This is important not only in the context of this task, but also when BACKUP/RESTORE is aborted due to errors. Fixing the final message is the scope of BUG#40305. When DBA sees the last message, it is useful to know which databases were affected by the operation. Logging the list of databases being backed up or restored is the scope of BUG#40307. TESTING STRATEGY ================ A simple test scenario is to create two connections to a server and then do the following. con1: run BACKUP or RESTORE command, which stops at an debug synchronization point and waits for a signal to continue. con2: kill the BACKUP/RESTORE query running in con1 and send signal so that it can continue. con1: wait for the query to terminate and see returned error. Also, examine warning stack and backup log tables. This scenario will be repeated for several synchronization points placed at various stages of BACKUP/RESTORE command execution. Another test scenario will use error insertion to trigger errors after an interruption has happened (e.g., when drivers are shut down) to see that contents of error stack and error reply from BACKUP/RESTORE are as expected. Limitation ---------- I don't know any easy way of examining server's error log from inside MTR test case. Also, until BUG#39924 is fixed, the interruption will not be traced in backup logs (however, test will show contents of these logs after interruption). This leaves checking only the error response given by interrupted BACKUP/RESTORE operation and the contents of server's error stack.
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.