WL#4384: Online backup: Revise error reporting

Affects: Server-6.0 — Status: Complete

RATIONALE
Make sure there are error messages for all failure conditions.
To solve BUG#36529.

SUMMARY
The goal of this task is to give an overview of all errors we want to detect 
when performing a backup or restore operation and for each error make sure that 
it is adequately reported by the code.

This task should ensure that error reporting is as good as it can be when the 
current (6.0) error reporting infrastructure is used. It should not change the 
infrastructure itself since there is a separate WL#4190 devoted to improving it.

Deliverables:

1. List of errors which should and can be detected by the 6.0 online backup system.

2. For each error in the list which is not correctly reported by the code the
number of WL or BUG which will fix that. If necessary, new bugs should be reported.

All listed reporting errors was fixed with patch for BUG#39089.

This task can be divided into several subtasks checking different aspects of 
error reporting inside backup code.

Things to check
===============

A) Are errors detected in all places where a function which can signal an error 
is called?

B) Are all detected errors reported accordingly (that is, an error message is 
send to the logger)?

C) Are all error messages internationalized?

D) Do all error messages used for reporting errors describe the problem 
adequately? Note that the description should be targeted at DBA and should 
contain information helping to detect the exact cause of the problem.

E) Is all available information reported when error is detected? For example, 
when we open backup stream it can fail for many different reasons but we always 
report it as "Can not open backup location" but perhaps we have more information 
available about the cause of the problem and this should be reported (i.e. 
several different error messages instead of a single, common one).

F) Should we add new code or change existing code so that we detect more 
problems which should be reported to the user? 

Marking functions/methods which report errors and these which don't
===================================================================

As described in WL#3904, backup kernel functions and methods are divided into
two categories: ones which report errors to the user and ones which don't
generate any error messages but report errors only to the caller (which is then
responsible for logging them if needed).

For example, Backup_restore_ctx::do_backup() should log any detected errors,
while all Image_info methods do not log errors but only report them to the
caller (because they can be called in various contexts).

For each existing function and method, it should be decided whether it reports
errors or not. This information should be stored in the documentation of the
function/method.


What follows is an attempt to create a list of potential problems which should 
be reported as errors.

Main phases of BACKUP operation
===============================

B1. Prepare context for the operation.

B2. Open backup stream.

B3. Collect info about objects which should be backed up.

B4. Write header and catalogue of the image.

B5. Write object's meta-data.

B6. Create and initialize backup drivers needed for the operation.

B7. Prepare context for drivers (open and lock tables).

B8. Poll drivers for backup data stepping through all the phases of the backup
protocol.

B7. Shut down the drivers.

B8. Write backup image summary.

B9. Close backup stream.


Main phases of RESTORE operation
================================

R1. Prepare context for the operation.

R2. Open backup stream.

R3. Read header and catalogue.

R4. Read meta-data and re-create objects.

R5. Create and initialize restore drivers.

R6. Read snapshot data and send it to drivers following all steps of restore
protocol.

R7. Shut down restore drivers.

R8. Close backup stream.

List of errors.
===============

- malformed file path (B2/R2)

- trying to write to existing file (B2)

- trying to read from non-existing file (R2)

- can not open file (for reading or writing)
  -- error on OS level
  -- error on backup stream lib level

- wrong magic byte sequence at the beginning of the stream (R2).

- unknown stream format (version number too high) (R2).

- stream read/write error.
  -- error on the OS level,
  -- error on backup stream lib level,
  -- lack of resources (e.g. memory for input buffers),
  -- disk full (do we need to do it in 6.0?),
  -- absurd data contents (e.g. time value),

- error reading image header (R3)
  -- unknown snapshot type.
  -- unknown engine (name or version) for native snapshot.
  -- wrong data snapshot format, i.e. version number higher than the version of 
     the driver.
  -- (warning?) future server version.

- non-existing object selected for backup.

- no backup driver found for a table.

- NDB table selected for backup.
  (shall we make it a separate error or handle as "no backup driver found")

- an object on which one of the selected objects depend is missing (backup set
inconsistency). (shall we detect that in 6.0?)

- referential errors inside backup image (on restore)
  -- image requires unknown backup engine,
  -- unknown item type in the catalogue,
  -- entry in meta-data section contains invalid catalogue coordinates
  -- data chunk for invalid table/snapshot.

- errors when using server object services (B3,B5,R4)
  -- when listing objects present in the server
  -- when determining object dependencies
  -- when serializing an object
  -- when re-creating an object
   --- unknown version of serialization string
   --- malformed serialization string
   --- failed creation of the object

- errors when using other server services:
  -- can't block/unblock DDLs
  -- can't block/unblock commits
  -- can't prepare server/connection for backup/restore (lock tables, turn fkey
constraints off and so on)
  -- can't read binlog position

- errors related to interaction between backup/restore and replication
  (see WL#4209)

- errors when using backup/restore drivers
  -- backup engine failed to create a driver
  -- driver reports error in one of its methods
   --- when initializing driver
   --- when cancelling operation
   --- when freeing driver
   --- during normal operation (get/send_data(), prepare(), lock() and so on)

- backup/restore thread has been killed

- internal error detected in backup kernel
 -- could not create dynamic object instance (new failed)
 -- could not allocate memory
 -- errors when using thread library

Questions
=========

Q: How to report OS level errors?
Q: How to report errors detected in stream library?
Q: How to report errors from server services?
Q: When server services are used, should these functions push errors/warnings on
thread's error stack as they do now? If not, how should they report errors?
Q: How to report errors from backup/restore drivers?

Limitations
===========

Backup/restore drivers can signal that an error has been detected but they give
no information about the cause of the problem.


Approach 
======== 

Based on a walk-through of error reporting issues identified by Rafal
(see Low-Level Design) and feedback on email, the following approach
will be used to improve the error reporting:

I have walked through parts of the error reporting issues that Rafal
identified and I have a few issues that I would like to discuss so
far:

1. The error reporting is not very uniform between classes, some
   functions return bool, some int with varying semantics.  I am not
   going to to make any attempt to change anything here, but will try
   to follow approach used in each individual class.


2. The standard way to log error is to call
   Backup_restore_ctx::fatal_error().  So when introducing error
   logging, I will do it similar to this (fatal_error() returns the
   error that is reported):

-  // FIXME: detect errors (when dbit==NULL). Perhaps just assert.
-  Image_info::Iterator *dbit= m_catalog->get_dbs();
+  Image_info::Iterator *dbit=m_catalog->get_dbs();
+  if (!dbit) {
+    return fatal_error(ER_BACKUP_CAT_ENUM);
+  }
 
   Not all code as access to Backup_restore_ctx object.  In such code,
   one can make the function return the specific error code, and catch
   that at a level where the backup context is available.
   

3. For calls to functions that does not return error, this fact should
   be commented in the code in order to explain why there is no error
   handling. E.g., 

-    // FIXME: detect errors.
-    // FIXME: error logging.
-    m_thd->main_da.reset_diagnostics_area();
+    m_thd->main_da.reset_diagnostics_area();  // Never reports error


4. List::push_back will push an error internally if out of memory.
   One can either check the return of push_back or one can check for
   error afterwards by calling thd->is_error():

-  // FIXME: detect errors if  reported.
-  // FIXME: error logging.
   field_list.push_back(new Item_empty_string(STRING_WITH_LEN("backup_id")));
   if (context.m_thd->is_error()) 
   {
     goto err;
   }


5. Generally, pthread_mutex_lock/unlock/init/destroy etc. is not
   checked in the code.  This is OK because there is a special
   pushbuild mode that covers such errors.


6. SQL allocators push an error into the message stack if they fail to
   allocate memory.  It is not necessary to push another messages here
   should allocation fail. E.g,

-      // FIXME: detect errors. Don't assert here but report error instead.
-      // FIXME: error logging.
       TABLE_LIST *ptr= backup::mk_table_list(*tbl, TL_WRITE, m_thd->mem_root);
-      DBUG_ASSERT(ptr);
+      if (!ptr) // Failed to allocate, error has already been reported
+      {
+	return ER_OUT_OF_RESOURCES;
+      }

   Note: new operator may be overloaded to use SQL allocators.  Hence,
   it may not be necessary to push an error should new fail, either.  


7. If something unexpected goes wrong, we should fail the
   backup/restore operation even if it might be possible to recover
   from it.  E.g., should closing a stream fail, backup will report
   failure even if backup image was successfully produced.

8. In some cases, one should continue executing a function after a
   failure.  E.g., if one close fails, one should still attempt to
   close the rest:

-void Input_stream::close()
+bool Input_stream::close()
 {
+  bool ret= true;
   if (m_fd < 0)
-    return;
+    return true;
+
+  if (bstream_close(this) == BSTREAM_ERROR) {
+    ret= false;
+  }

-  bstream_close(this);
 #ifdef HAVE_COMPRESS
   if (m_with_compression)
   {
@@ -700,7 +719,8 @@ void Input_stream::close()
     my_free(zbuf, (MYF(0)));
   }
 #endif
-  Stream::close();
+  ret &= Stream::close();
+  return ret;
 }


9. Error handling with respect to the protocol class should be done
   with caution.  Some clean-up may be needed in order to recover from
   a failure.

This patch  adds comments marking 
the places in the code where errors are not detected or not logged, as listed below.

Calls made but no errors detected
=================================

Here is a list of places inside backup kernel code where calls are made but no
errors are detected. Note that the backup tree is changing so the line numbers
and other details might change.

The list is based on mysql-6.0-backup tree rev 2672.

Things internal to the backup kernel
------------------------------------

Backup_restore_ctx::unlock_tables()
 called at Backup_restore_ctx::do_restore() around line 992 in kernel.cc
 called at Backup_restore_ctx::close() around line 737 in kernel.cc

== catalogue and metadata

backup::Image_info::get_db_objects()
 called at Backup_restore_ctx::restore_triggers_and_events() around line 885 in
kernel.cc

backup::Image_info::get_dbs()
 called at Backup_restore_ctx::restore_triggers_and_events() around line 874 in
kernel.cc

backup::Snapshot_info::get_table()
 called at Backup_restore_ctx::lock_tables_for_restore() around line 672 in
kernel.cc

backup::Image_info::get_tablespaces() 
 called at Backup_info::Global_iterator() around line 1243 in backup_info.cc

== backup stream

backup::Input_stream::next_chunk()
 called at Backup_restore_ctx::do_restore() around line 973 in kernel.cc

backup::Stream::close()
 called at Backup_restore_ctx::close() around line 750 in kernel.cc

== table data backup

backup::Scheduler::Pump_iterator()
 called at backup::Scheduler::cancel_backup() around line 906 in data_backup.cc
 called at backup::Scheduler::cancel_backup() around line 907 in data_backup.cc

backup::Scheduler::cancel_backup()
 called at backup::Scheduler::step() around line 774 in data_backup.cc
 called at backup::Scheduler::prepare() around line 926 in data_backup.cc
 called at backup::Scheduler::lock() around line 949 in data_backup.cc
 called at backup::Scheduler::unlock() around line 949 in data_backup.cc

backup::Scheduler::remove_pump()
 called at backup::Scheduler::step() around line 772 in data_backup.cc
 called at backup::Scheduler::cancel_backup() around line 906 in data_backup.cc
 called at backup::Scheduler::cancel_backup() around line 907 in data_backup.cc

backup::Backup_pump::cancel()
 called at backup::Scheduler::cancel_backup() around line 906 in data_backup.cc
 called at backup::Scheduler::cancel_backup() around line 907 in data_backup.cc

backup::Backup_pump::end()
 called at backup::Scheduler::step() around line 769 in data_backup.cc

backup::Block_writer::drop_buf()
 called at backup::Backup_pump::pump() around line 1232 in data_backup.cc
 called at backup::Backup_pump::pump() around line 1250 in data_backup.cc

== misc.

link_table_list()
 called at Backup_restore_ctx::lock_tables_for_restore() around line 672 in
kernel.cc

mk_table_list()
 called at Backup_restore_ctx::lock_tables_for_restore() around line 672 in
kernel.cc


Calling services from the kernel
--------------------------------

ddl_blocker_disable()
 called at Backup_restore_ctx::close() around line 741 in kernel.cc


Calling server code from the kernel
-----------------------------------

close_thread_tables()
 called at Backup_restore_ctx::do_restore() around line 983 in kernel.cc
 called at Backup_restore_ctx::do_restore() around line 984 in kernel.cc
 called at Backup_restore_ctx::close() around line 737 in kernel.cc

mdl_alloc_lock()
 called at set_table_list() around line 134 in backup_aux.cc

my_eof()
 called at send_reply() around line 285 in kernel.cc

mysql_reset_errors()
 called at Backup_restore_ctx::prepare() around line 369 in kernel.cc
 called at Backup_restore_ctx::prepare() around line 371 in kernel.cc

MYSQL_BINLOG::get_current_log()
 called at write_table_data() around line 582 in data_backup.cc

Diagnostic_arena::reset_diagnostic_area()
 called at Backup_restore_ctx::do_restore() around line 983 in kernel.cc
 called at Backup_restore_ctx::do_restore() around line 984 in kernel.cc


Protocol::prepare_for_resend()
 called at send_reply() around line 280 in kernel.cc

Protocol::send_fields()
 called at send_reply() around line 275 in kernel.cc

Protocol::store()
 called at send_reply() around line 281 in kernel.cc

Protocol::write()
 called at send_reply() around line 283 in kernel.cc


Calling commmon library code (mysys and friends)
------------------------------------------------

bitmap_init()
 called at backup::Backup_pump() around line 993 in data_backup.cc

hash_init()
 called at Backup_info() around line 309 in bacup_info.cc

init_alloc_root()
 called at backup::Image_info() around line 19 in image_info.cc

llstr()
 called at send_reply() around line 281 in kernel.cc

List::push_back()
 called at Backup_info() around line 320 in backup_info.cc
 called at Backup_info() around line 320 in backup_info.cc
 called at Backup_restore_ctx::restore_triggers_and_events() around line 893 in
kernel.cc
 called at send_reply() around line 274 in kernel.cc
 called at write_table_data() around line 462 in data_backup.cc


Calling Logger methods (we would probably never report errors from these)
-------------------------------------------------------------------------

backup::Logger::report_stats_post()
 called at Backup_restore_ctx::do_restore() around line 961 in kernel.cc
 called at Backup_restore_ctx::do_restore() around line 1034 in kernel.cc
 called at Backup_restore_ctx::do_backup() around line 821 in kernel.cc
 called at Backup_restore_ctx::do_backup() around line 846 in kernel.cc

backup::Logger::report_stats_pre()
 called at Backup_restore_ctx::do_restore() around line 961 in kernel.cc
 called at Backup_restore_ctx::do_restore() around line 1034 in kernel.cc
 called at Backup_restore_ctx::do_backup() around line 821 in kernel.cc
 called at Backup_restore_ctx::do_backup() around line 846 in kernel.cc

backup::Logger::save_errors()
 called at Backup_restore_ctx::prepare() around line 369 in kernel.cc
 called at Backup_restore_ctx::prepare() around line 371 in kernel.cc


Errors not logged even if detected
==================================

Here is a list of methods and functions where calls are made such that even if
error is detected, no error message is send to the logs. I.e. neither the direct
caller nor any of the callers which call given function or method, produce an
error message.

Note that for some of the calls it might turn out that the called function never
reports any errors. But this should be analysed in each case.

In kernel.cc
------------

Backup_restore_ctx::prepare()
 calling mysql_reset_errors()
 calling backup::Logger::save_errors()

Backup_restore_ctx::prepare_restore()
 calling Restore_info()

Backup_restore_ctx::close()
 calling ddl_blocker_disable()
 calling backup::Stream::close()

Backup_restore_ctx::do_backup()
 calling backup::Logger::report_stats_pre()
 calling backup::Logger::report_stats_post()

Backup_restore_ctx::do_restore()
 calling close_thread_tables()
 calling Diagnostic_arena::reset_diagnostic_area()
 calling backup::Logger::report_stats_pre()
 calling backup::Logger::report_stats_post()

Backup_restore_ctx::lock_tables_for_restore()
 calling link_table_list()

Backup_restore_ctx::restore_triggers_and_events()
 calling List::push_back()

send_reply()
 calling List::push_back()
 calling Protocol::send_fields()
 calling Protocol::prepare_for_resend()
 calling llstr()
 calling Protocol::store()
 calling Protocol::write()
 calling my_eof()

In backup_info.cc
-----------------

Backup_info()
 calling hash_init()
 calling List::push_back()

In image_info.cc
----------------

backup::Image_info::add_table()
 calling backup::Image_info::Db::add_table()
 calling backup::Image_info::add_snapshot()

backup::Image_info::get_db_objects()
 calling backup::Image_info::Dbobj_iterator()

backup::Image_info::get_dbs()
 calling backup::Image_info::Db_iterator()

In data_backup.cc
-----------------

write_table_data()
 calling List::push_back()
 calling MYSQL_BINLOG::get_current_log()

backup::Backup_pump::pump()
 calling backup::Block_writer::drop_buf()

backup::Scheduler::cancel_backup()
 calling backup::Scheduler::Pump_iterator()

restore_table_data()
 calling Restore_driver::free()

In backup_aux.h
---------------

mk_table_list()
 calling alloc_root()


=================================

Bug report BUG#39089 has been created to fix the specific issues identfied by Rafal.