WL#4538: Make BACKUP/RESTORE commands interruptible

Affects: Server-6.0 — Status: Complete

Description
High Level Architecture
Low Level Design

RATIONALE.

This worklog is to fix BUG#35079, i.e. ensure that BACKUP/RESTORE commands can
be interrupted in the middle if user hits Ctrl+C or client connection is killed.

Note. Before new design can be completed, number of issues in the existing
infrastructure for giving feedback from BACKUP/RESTORE commands should be fixed:

BUG#40303 - This is needed so that in case an error is reported after an
interruption, client still sees "Query execution was interrupted" error.

BUG#40305 - This will give a feedback about the fact that command has been
interrupted.

BUG#40307 - This will add information useful for estimating possible problems
after interrupting destructive RESTORE operation.

Here we define what feedback should be given to the user if BACKUP or
RESTORE command is interrupted.

Amending existing feedback
==========================

Backup code reports errors and warnings in several places:

- the BACKUP/RESTORE statement can return error as any other SQL statement,
- errors and warnings are pushed on the connection error stack,
- errors, warnings and notes are written to server's error log,
- errors and other information are logged in the backup_progress log,
- the status of the operation is reported in backup_history log.

During normal operation, basic information about progress of the operation is
logged. For BACKUP operation the following messages are sent:

  [Note] Backup: Starting backup process
  [Note] Backup: Backing up all databases  
or
  [Note] Backup: Backing up selected databases  
  [Note] Backup: Backup completed  

These are sent only to server's error log. If there are no errors or warnings,
the server's error stacks remains empty and the statement returns positive reply
(the backup_id). 

In case of error, it is pushed on the error stack, logged in the servers error
log as:

  [ERROR] Backup: 

and error reply is send to the client.

There are some quirks of the existing error reporting infrastructure which will
be relevant for giving correct feedback in case of interruption.

1. In case of error, the final note "Backup: Backup completed" is sometimes   
   logged and sometimes not. There is no entry in the log which warns about the 
   fact that the operation has *not* been completed. [To be fixed by BUG#40305]

2. The note "Backup: Backing up selected databases" does not inform about the 
   databases being backed up. Also, in case of RESTORE no information about 
   restored databases is logged. [To be fixed by BUG#40307]

3. In case of several errors reported, the client gets the *last* of them as the 
   error from the SQL statement. But for other SQL statements the *first* error 
   is returned and it seems more appropriate. [To be fixed by BUG#40303]

All these are related to giving feedback in case of interrupting BACKUP/RESTORE
statement by user. For example warning about uncompleted operation would be
important in that case too. Logging list of databases affected by the operation
will give better feedback for the user who interrupts it. Also, returning the
first logged error is important for correct feedback in case of interruption,
because in this case errors might be reported during clean-up after the
interruption and still the interruption notification should be sent to the
client, not the clean-up errors.

Feedback when BACKUP/RESTORE is interrupted
===========================================

When BACKUP/RESTORE command is interrupted, a note will be placed in the log and
the operation will be aborted, generating "Operation aborted" warnings as in the
case of abort on error. For example:

  [Note] Restore: Starting restore process.
  [Note] Restore: Restoring database(s) `db1`,`db2`,...
  [Note] Restore: Operation has been interrupted.
  [Warning] Restore: Operation aborted - data might be corrupted.

The BACKUP/RESTORE statement will return error "Query execution was
interrupted", like any other SQL statement which is interrupted. The same error
will be placed on the error stack. In backup logs, change to "cancelled" state
will be reported [1].

When BACKUP or RESTORE operation is interrupted, some clean-up must be done. For
example the backup/restore drivers used in the operation must be properly shut
down. Errors can happen during this clean-up phase. This may lead to additional
error messages being reported together with the information that the operation
has been interrupted. Example:

  [Note] Backup: Starting backup process.
  [Note] Backup: Backing up database(s) `db1`,`db2`,...
  [Note] Backup: Operation has been interrupted.
  [ERROR] Backup: MyISAM backup driver can't cancel its backup operation.
  [ERROR] Backup: Default backup driver can't cancel its backup operation.
  [Warning] Backup: Operation aborted.

The BACKUP statement still returns "Query execution was interrupted". The error
stack will contain:

  Error | xxxx | Query execution was interrupted.
  Error | yyyy | MyISAM backup driver can't cancel its backup operation.
  Error | zzzz | Default backup driver can't cancel its backup operation.

In some cases, it will be possible that the statement is interrupted just after
detecting an error. For example:

  [Note] Backup: Starting backup process.
  [Note] Backup: Backing up database(s) `db1`,`db2`,...
  [ERROR] Backup: Out of memory.
  [Note] Backup: Operation has been interrupted.
  [Warning] Backup: Operation aborted.

In that case the BACKUP statement will return the first reported error, that is
"Out of memory". The error stack will contain:

  Error | yyyy | Out of memory.
  Error | xxxx | Query execution was interrupted.


In backup logs (backup_progress and backup_history) the fact that BACKUP/RESTORE
command has been interrupted will be noted by reporting change to a new
"cancelled" state (added to the existing "starting", "running", "error" etc.)
[1]. As with other state changes, this will insert a new entry in the
backup_progress log and update the entry in backup_history log.

----
[1] In the current server, the change of state to "cancelled" and any errors
detected after an interruption will not show in backup log tables because of the
issue described in BUG#39924.

Note: The design below assumes that BUG#40303 (returning first error reported)
is fixed. Without it, the error response sent in case of interruption will
sometimes not be as specified in HLS (it will be the last error reported, not
the information about statement interruption).

HOW INTERRUPTIONS ARE DETECTED
==============================

In this design we rely on the server code to detect interruptions, such as when 
user hits Ctrl+C or closes connection. Checks for this condition are present in 
many places in the server code (esp. when the code waits for something). If 
interruption is detected, the thd->killed member is set to non-zero value. 
Backup code will check for thd->killed flag and assume that if it is non-zero 
then an interruption has been detected by the server.

CHANGES IN THE KERNEL CODE
==========================

1. New method for detecting and reporting statement interruption will be added
to backup::Logger:

/**
  Report the fact that BACKUP/RESTORE operation has been cancelled.

  This method does nothing if no interruption has happened or if an 
  interruption has already been reported. Otherwise it logs
  ER_BACKUP_INTERRUPTED note.

  @note The server takes care of giving feedback to the client in case of a 
  cancelled statement - nothing needs to be done here (nothing pushed on the 
  server's error stack)

  @returns TRUE if interruption has been detected and reported, FALSE otherwise.
 */ 
bool Logger::report_killed()

2. New error message.

 ER_BACKUP_INTERRUPTED
   eng "Operation has been interrupted."

Note that when written to server's error log, it will be prefixed by "Backup:"
or "Restore:".

3. Changed semantics of Logger::report_error()

This is the main method for reporting errors in backup code. It will be changed 
so that it detect an interruption if it happened and then report the 
interruption instead of the error. This is appropriate in most cases. For 
example consider this code snippet:

  int ret= some_function();
  if (ret)
  {
    log.report_error(ER_SOME_ERROR,...);
    return ...;
  }

Some_function() can return non-zero return code also in the case an interruption 
has been detected inside it. In this case it will be incorrect to report 
ER_SOME_ERROR - the interruption should be reported instead. Thus report_error() 
will call Logger::report_killed() before doing anything else, and in case an 
interruption was detectd it will not log the error message.

Note: If interruption happens but some_functions() returns success, the 
execution will continue until some other called function detects the 
interruption and signals error, or until we return to the caller, which will be 
then responsible for detecting it. 

4. Detecting interruptions in the code.

Checks for interruptions will be made at high level of the call hierarchy,
inside Backup_restore_ctx methods and in the write/restore_table_data() functions. 

Methods of Backup_restore_ctx are supposed to detect interruptions and stop
execution. All other functions and methods might, but don't have to detect
interruptions. Hence when an external function or method is called from
Backup_restore_ctx, the caller should check for interruption regardless of
whether the call was successful or not. For example like in this place in
Backup_restore_ctx::prepare_for_backup():

  int ret= Logger::init(BACKUP, query);  // Logs errors   
  if (report_killed())
  {
    fatal_error(ER_QUERY_INTERRUPTED);
    rerurn NULL;
  }
  if (ret)
  {
    fatal_error(ER_BACKUP_LOGGER_INIT);
    return NULL;
  }

Regardless of whether in case of interruption Logger::init() returns error or
not, report_killed() will be called and in case it detects (and reports)
interruption, the context will be moved into error state and
prepare_for_backup() method will return immediately signalling error (by
returning NULL). 

Note that even if Logger::init() detects and reports an interruption, this code
still works ok, because in that case report_killed() will not report the
interruption for the second time but it will return TRUE informing about the
fact that it has happened.

If report_killed() says there is no interruption, then we will continue as
before and check the value returned by Logger::init().

Another situation is when calling Backup_restore_ctx methods which should detect
and report interruptions. For example, consider a call to
Backup_restore_ctx::prepare() from Backup_restore_ctx::prepare_for_backup():

  if (prepare(backupdir, orig_loc))  // Logs errors and detects interruptions
    return NULL;

In this case there is no need to call report_killed() because if an interruption
has happened then prepare() would detect and report it. 

4. Interruptions during backup/restore of table data.

In write_table_data() the main interruption check-point will be
Scheduler::step() method which will start as follows:

int Scheduler::step()
{
  // Bail-out in case of an interruption.
  if (m_log.report_killed())
    return ER_QUERY_INTERRUPTED;

  // Pick next driver to pump data from.
  ...

In restore_table_data(), report_killed() will be used after calls to driver
methods, as in this example:

  // Initialize the drivers.
  for (uint n=0; n < info.snap_count(); ++n)
  {
    res= drv[n]->begin(0);
    if (log.report_killed())
      goto error;
    if (res == backup::ERROR)
    {
      log.report_error(ER_BACKUP_INIT_RESTORE_DRIVER, info.m_snap[n]->name());
      goto error;
    }
  ...

Similar, after calls to backup stream library functions. For example:

  case READING:

    bzero(&chunk_info, sizeof(chunk_info));
    ret= bstream_rd_data_chunk(&s, &chunk_info);

    if (log.report_killed())
      goto error;
    switch (ret) {
  ...


FEEDBACK IN BACKUP LOGS
=======================

When Logger::report_killed() reports the fact that interruption has happened, it
will also report change to "cancelled" state by calling
Logger::report_state(BUP_CANCEL). The latter will forward the notification to
the backup log subsystem, (calling Backup_log::state(BUP_CANCEL)) which will
write an appropriate entry to backup_progress log and record the new state in
backup_history log.

Limitation
----------
Because of BUG#39924 any writes/updates of backup logs made after the current
connection has been killed (thd->killed is non-zero) can not be performed.
Because of this, the change of state to BUP_CANCEL, even though reported by the
code, will not be seen in the backup log tables. This should improve as soon as
BUG#39924 is fixed.

ADDITIONAL FEEDBACK
===================

The changes implemented here should ensure that in case of an interruption, message

  [Note] Backup: Operation has been interrupted

is correctly reported together with a change to "cancelled" state. Additionally,
The usual closing message

  [Note] Backup: Backup completed

Should be replaced by more appropriate one, e.g.:

  [Warning] Backup: Operation aborted

or

  [Warning] Restore: Operation aborted - data might be corrupted

This is important not only in the context of this task, but also when
BACKUP/RESTORE is aborted due to errors. Fixing the final message is the scope
of BUG#40305.

When DBA sees the last message, it is useful to know which databases were
affected by the operation. Logging the list of databases being backed up or
restored is the scope of BUG#40307.

TESTING STRATEGY
================

A simple test scenario is to create two connections to a server and then do the
following.

con1: run BACKUP or RESTORE command, which stops at an debug synchronization
      point and waits for a signal to continue.
con2: kill the BACKUP/RESTORE query running in con1 and send signal so that it
can continue.
con1: wait for the query to terminate and see returned error. Also, examine
warning stack and backup log tables.

This scenario will be repeated for several synchronization points placed at
various stages of BACKUP/RESTORE command execution.

Another test scenario will use error insertion to trigger errors after an
interruption has happened (e.g., when drivers are shut down) to see that
contents of error stack and error reply from BACKUP/RESTORE are as expected.

Limitation
----------
I don't know any easy way of examining server's error log from inside MTR test
case. Also, until BUG#39924 is fixed, the interruption will not be traced in
backup logs (however, test will show contents of these logs after interruption).
This leaves checking only the error response given by interrupted BACKUP/RESTORE
operation and the contents of server's error stack.