WL#866: Online backup: MyISAM native driver (redo-log algorithm, modify source file, log byte changes)

Affects: Server-6.0 — Status: Complete

Description
High Level Architecture
Low Level Design

SUMMARY
--------

Make MyISAM capable of supporting the Backup API of WL#3169.

DESCRIPTION
---------
create a backup driver in MyISAM so that MyISAM tables can be backed up online,
and restored by the backup API (WL#3169).

Given that the original WL#866 of 2004 had a piece of its logic inside the
MySQL table cache, while the backup API's design imposes that the
logic be entirely in the engine, a re-design of WL#866 (compared to how it was
in 2004) is needed; however big pieces of the original design of WL#866 can be
reused.

Two variants
============

NO_INDEX
--------
= copy only the MYD file (MYI being redundant information)
- requires an index rebuild on the backup, which takes a lot of
time&disk&CPU on large tables; rebuild can be done at 3 moments:
   NO_INDEX/a. during backup
   NO_INDEX/b. after backup, possibly on a spare machine
   NO_UNDEX/c. during restore.
+ restored table is optimized (more efficient indexes)

FULL_INDEX
----------
= copy the MYD and MYI file
- more server-disturbing backup (more copying, more logging in MyISAM)
- bigger backup

If we call NO_INDEX/b uncommon for many DBAs, at least uncommon for
other DBMSs, and NO_INDEX/c unacceptable (that would mean long
downtime at restore), then many DBAs will have NO_INDEX/a as only
choice, which is at least as much disturbing as FULL_INDEX.

Both methods will be coded at the same time. If they will both be accessible to
the user and how, is to be decided.

BACKGROUND
----------

As specified by the backup API's design:
the MyISAM handlerton receives a list of tables to back up from the
upper backup API.
The MyISAM handlerton then sends packets of data (containing the
backup of requested tables) to the upper backup API.
Cooperation between the upper backup API and the MyISAM handlerton
ensures that the backup of this handlerton and of other handlertons is
of a single point in time (named "validity point").
The upper layer makes sure that no DDL happens on tables to be backed
up (note: this must include REPAIR/OPTIMIZE TABLE).

Here are proposals for FULL_INDEX, changes to do to have NO_INDEX are
written in the end.

PROPOSAL FOR BACKUP
-------------------

Definition: "backup logging" is that any update to a MyISAM table
(data or index file) is logged into a MyISAM-specific log; the
catching of updates happens in mi_*() functions and is already partly
existing: see option --log-isam; however --log-isam does "logical
logging" ("this row inserted") while we need "physical logging"
("these X bytes written at offset Y"). pwrite() write() ftruncate()
are the 3 system calls to catch, as well as writes to mmaped memory
(--myisam-use-mmap).

1. Make sure all tables (open now or in the future) do backup
logging ("begin()" method of WL#3473, calling a new function
mi_backup_start_logging_for_tables())

1a. Old table instances

A condition of correctness of online backup is that:
after the copy process has started, any update done to a table-to-back-up must
be present in the log.
A boolean member MYISAM_SHARE::backup_logging is added.
All physical writes to the data/index files (my_write, my_pwrite, memcpy to a
mmap'ed arean my_chsize) are made to be of this form:
  @code
  {
    write_to_data_or_index_file;
    if ((atomic read of MYISAM_SHARE::backup_logging)==1)
      write log record to backup log;
  }
  @endcode
(for "cached" writes via IO_CACHE (WRITE_CACHE) or via the key cache, post_write
callbacks are added to IO_CACHE and to the key cache).
 
mi_backup_start_logging_for_tables() sets MYISAM_SHARE::backup_logging to 1
using an atomic write (include/my_atomic.h). Atomic write happens before or
after atomic read above. If before, change will be in the log. If after, it is
also after the write_to_data_or_index_file and thus change will be in the copy.
So correctness is always guaranteed. Note the importance of checking
MYISAM_SHARE::logging always _after_ write_to_data_or_index_file for the
reasoning to hold.

1b. New table instances

In mi_open(), the name of the table is compared against the list of
tables to back up; if it matches, it declares itself ready for backup
logging.

2. Write data to the backup ("get_data()" method of WL#3473)

Dirty copying of all matching MYD and MYI files is done. Meanwhile,
backup logging appends to the log.
Upper layer is notified that the copy is done.
Copying of the log may start at this point.

3. Create a validity point ("prelock()" method of WL#3473)

This requires to wait for all running updates (on matching tables) to
finish.
Suggestion by Monty: do a LOCK TABLES READ on all matching tables currently open
(actually Monty suggested mi_open()+thr_multi_lock(), which is the same as LOCK
TABLES READ); for other matching tables (currently closed), if they open, their
opener should (in mi_open()) do a thr_lock() on the table and give this thr_lock
to the backup thread.
Shortcut by Guilhem, before implementing Monty's suggestion: do a LOCK TABLES
READ on all matching tables (this has the drawback that it cannot work for a lot
of tables, due to OS file descriptor limits; this is why it is not a final
solution). "A lot" is not well-defined: a normal Linux can have 200,000 open
files.
Because this may be a long blocking wait above, prelock() only launches a
separate thread which will do this locking. Then the driver will asynchronously
monitor the success of of that thread, inside get_data() calls. When finally
locks have all been acquired, any remaining cached data (WRITE CACHE, key cache)
is flushed to the table (thus goes into the log), backup logging is stopped for
all tables and backup log is closed.

4. lock() method of WL#3473

An empty operation as tables have been locked already

4. Finalize ("get_data()" method of WL#3473)

After the validity point has been created, copy the rest of the backup log to
the upper backup API. Delete the backup log.

PROPOSAL FOR RESTORE
--------------------

Background: the upper restore API first creates tables via a CREATE
TABLE, then asks the engine to restore the table.
1. Restore all data provided by the upper backup API into files
(overwriting the empty ones created by CREATE TABLE, but following
symbolic links) ("send_data()" method of WL#3473)
2. apply the backup log and delete it ("end()" method of WL#3473)

How to do NO_INDEX:
- we must still backup the header of the MYI (contains important
not-in-MYD information like the latest allocated auto_increment value,
the head of the list of deleted rows in the MYD etc); to make it
simple we copy 64k as we know that the header always fits in 64k
- backup logging must still happen for changes to this header
- no need to flush the indexes as the header is not in the key cache
- at some moment (NO_INDEX/a NO_INDEX/b or NO_INDEX/c) the backup log
is applied and the index is rebuilt. For now we do it at NO_INDEX/c
i.e. at restore time.
NO_INDEX and FULL_INDEX are both implemented.