WL#3576: Online Backup: Point in time recovery (Sync w/ binlog & rpl)

Affects: Server-6.0   —   Status: Complete

SUMMARY
-------

1. On master: Store binlog position

2. On slave: Stop replication on DDL error (document gotcha)
     - Store replication state
     - Slave position
     - master.info
     - relay.info

3. Add binlog position to the backup image summary block
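
A hypothetical sketch of the point-in-time recovery information the image
summary block would need to carry; the field names are illustrative only and
do not describe the actual backup image format.

#include <stdint.h>
#include <string>

struct Image_binlog_info          // taken on the master when the VP is created
{
  std::string binlog_file;        // binlog file name at the validity point
  uint64_t    binlog_pos;         // offset within that file
};

struct Image_slave_info           // only meaningful when backing up a slave
{
  std::string master_log_file;    // master binlog coordinates (from master.info)
  uint64_t    master_log_pos;
  std::string relay_log_file;     // relay log coordinates (from relay.info)
  uint64_t    relay_log_pos;
};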


Binlog synchronization problem
------------------------------
We need to prevent the backed-up database from having contents that differ
from what the binary log position indicates.  To do this we probably need
some kind of commit mutex.

One alternative is to use the binlog mutex, but if we do, then we
(probably) need to extend the time this mutex is held, so it might be
better to introduce a new backup mutex (possibly requiring that the
binlog mutex is taken first).


Suggested solution
------------------
When the backup is synchronization-locking the engines, it takes an
exclusive commit lock.  Any other statement/transaction that commits needs
shared access to this lock (see the sketch after the diagrams below).


Simple commit protocol (engines that do not support XA, i.e. 2PC):

 ------+-------------+-------->

   Statement     Write log

           SHARED
            LOCK



Two-phase commit protocol (our current internal XA implementation):

 ------+-------------+-------------+-------->

    Prepare         Log          Execute
                into binlog    Commit in DB

                         SHARED
                          LOCK



Backup protocol locking phase:

 ------+-------------+-------->
      Lock        Unlock

          EXCLUSIVE
            LOCK
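
A minimal sketch of the shared/exclusive commit lock pictured above, using a
pthread read-write lock.  A committing session holds the lock in shared mode
across the binlog write and engine commit; the backup kernel holds it
exclusively while the engines are being synchronization-locked, so no commit
can fall between the backed-up data and the recorded binlog position.  The
commented-out function names are placeholders, not actual server functions.

#include <pthread.h>

static pthread_rwlock_t commit_lock= PTHREAD_RWLOCK_INITIALIZER;

/* Commit path of an ordinary session (2PC case from the diagram above). */
void session_commit()
{
  /* prepare_in_engines();               -- phase 1 of 2PC, outside the lock */
  pthread_rwlock_rdlock(&commit_lock);   /* SHARED LOCK starts here          */
  /* write_transaction_to_binlog();      -- "Log into binlog"                */
  /* commit_in_engines();                -- "Execute Commit in DB"           */
  pthread_rwlock_unlock(&commit_lock);   /* SHARED LOCK ends here            */
}

/* Backup kernel, locking phase ("Lock" ... "Unlock" in the last diagram). */
void backup_locking_phase()
{
  pthread_rwlock_wrlock(&commit_lock);   /* EXCLUSIVE LOCK: waits for commits */
  /* lock_engines_and_create_vp();       -- create VP, read binlog position   */
  pthread_rwlock_unlock(&commit_lock);
}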


OPEN ISSUES
-----------
On Mon, Oct 30, 2006 at 07:08:22PM +0100, Guilhem Bichot wrote:
> So, in the new scheme with prelock() and such, I wonder where the
> binlog stands. Taking the binlog's mutex is possibly far from <1s: in
> flushing a X MB cached transaction to the binlog will hold
> the binlog's mutex for the time it takes to write X MB to disk (and if
> this cache went into a disk-based tmp file, it will also include
> reading X MB from disk). With X big (e.g. 200 MB), it's a problem.
> So the binlog may not be capable of an instant lock().
> On the other hand, if you acquire the binlog's mutex in a prelock()
> call, then do a prelock() on MyISAM, the whole server will be stalled
> for long (because the binlog's mutex is held during the MyISAM's long
> prelock()). None of the two solutions is satisfying in all cases.
> 
> So I would recommend doing this (maybe you already have planned it?):
> acquire the binlog's mutex only in lock(), but require engines
> (including the binlog) to observe a timeout (put it in the engines'
> API), something like:
> 
> foreach e in all_engines
>   e->prelock();
> retry:
> timeout=some_constant; // the time allowed to lock all engines (2 seconds?)
> end_max=now()+timeout;
> foreach e in all_engines
> {
>   e->lock(timeout);
>   timeout=end_max-now(); // the time remaining for next engines
>   if (timeout==0 || lock of e failed with timeout error) // cannot continue
>   {
>     foreach e1 in all_locked_engines
>       e1->unlock();
>     sleep 10;
>     if (retried too many times already)
>       finally bail out or increase sleep time;
>     else
>       goto retry;
>   }
>   all_locked_engines.append(e);
> }
> 
> 
> This is useful not only for the binlog, but also in this scenario: if
> engine A has a problem which makes it impossible for it to complete
> the lock() in the requested time, we would want to not stall other
> engines forever.
> Ideally, an engine should, in lock(), do no disk reads or writes
> (because those can get I/O errors, disk full problems etc). Only
> memory manipulations.
> 
> Now what if engine A has a bug and does not timeout after the
> requested timeout? I don't know what to do for that. Maybe hope this
> never happens.

(...)

On Mon, Oct 30, 2006 at 07:08:22PM +0100, Guilhem Bichot wrote:
> Rafal wrote:
> > Ok, I see the issue. We have discussed this with Lars and our idea is to 
> > improve locking in transaction commit code so that committing changes to 
> > the engines and writing corresponding event to binlog will be a single unit 
> > which will be separated from other threads (so that nothing can happen 
> > in-between). WL#3576 will be devoted to this issue.
> 
> This would mean serializing commits of the binlog and of engines, I
> am not sure this is desirable from a performance point of
> view. In 4.1 it was serialized, in 5.0 Serg de-serialized it. Please,
> ask Serg for his advice on this idea before proceeding.
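
A C++ rendering of the retry-with-timeout locking loop Guilhem sketches in the
first quote above.  The engine interface used here (prelock()/lock()/unlock())
is hypothetical and only illustrates the control flow; it is not an existing
handlerton API.

#include <chrono>
#include <thread>
#include <vector>

struct Engine                     // hypothetical engine interface
{
  virtual void prelock() = 0;
  virtual bool lock(std::chrono::milliseconds timeout) = 0;  // false on timeout
  virtual void unlock() = 0;
  virtual ~Engine() {}
};

// Returns true once all engines are locked, false if we finally bail out.
bool lock_all_engines(const std::vector<Engine*> &all_engines)
{
  using namespace std::chrono;

  for (Engine *e : all_engines)
    e->prelock();                            // may be slow, but does not stall others

  const int max_retries= 10;                 // arbitrary bound for this sketch
  for (int attempt= 0; attempt < max_retries; attempt++)
  {
    std::vector<Engine*> locked_engines;
    milliseconds timeout(2000);              // total budget for locking all engines
    auto end_max= steady_clock::now() + timeout;
    bool failed= false;

    for (Engine *e : all_engines)
    {
      if (timeout <= milliseconds(0) || !e->lock(timeout))
      {
        failed= true;                        // cannot continue: release and retry
        break;
      }
      locked_engines.push_back(e);
      timeout= duration_cast<milliseconds>(end_max - steady_clock::now());
    }
    if (!failed)
      return true;                           // all engines locked within the budget

    for (Engine *locked : locked_engines)    // back off, then try again
      locked->unlock();
    std::this_thread::sleep_for(seconds(10));
  }
  return false;                              // retried too many times: bail out
}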

Solution with a critical zone
=============================

The proposed solution is to introduce a critical zone which will be entered by
threads just before writing changes to the binlog and left after unlocking
tables.  Thus the following stages of statement execution will be contained
inside the critical zone:

1. writing to binlog
2. committing transaction
3. unlocking tables

Let's call these actions the critical sequence.  Table unlocking is included in
the critical sequence for the sake of non-transactional engines.  See below for
an explanation and a definition of (in)visible data changes.

The backup thread, when creating the VP, will close the critical zone (so that
no new threads can enter it) and wait for all threads which are inside to leave
it.  The sequence of actions in the synchronization phase will be as follows:

1. 
2. 
3. 
4. 
5. 
6. 

This way it is ensured that during VP creation no thread is executing the
critical sequence and thus the binlog is in sync with the committed/visible
data changes.

As a side effect, all commits will be blocked during the synchronization phase.
This ensures the "data freeze during synchronization" requirement of the backup
protocol for all transactional engines (with the exception of NDB).  The plan
is to use this in the general, consistent-snapshot driver.

Non-transactional engines and invisible data changes
====================================================

Changes to data stored in a non-transactional engine are made by a thread in
the following way:

1. 
2. 
3. 

The lock is taken at the beginning of statement execution and released when the
statement is finished.  Thus, before the lock is released, all changes belong
to an unfinished statement and should not be included in the backup image.
Let's call such changes *invisible*.  It is the job of the (native) backup
driver to ensure that no invisible changes are included in its backup image -
we assume that this is the case.

For backup/binlog synchronization, we must guarantee that when the binlog
position is saved during VP creation, there are no invisible data changes which
have already been written to the binlog.  Just after a statement is written to
the binlog, the binlog contains data changes which are not yet visible; they
become visible when the tables are unlocked.  Thus, for non-transactional
engines the table unlock plays the same role as commit for transactional ones,
and hence it is included in the critical sequence.
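
A minimal sketch (illustrative names, not actual server functions) of the
critical sequence for a statement on a non-transactional engine, using the
Critical_zone API described later in this WL.  The point is the ordering: the
zone covers both the binlog write and the table unlock, so a VP can never be
created between "change is in the binlog" and "change is visible".

/* Hypothetical stand-ins for server internals, for illustration only. */
static void write_statement_to_binlog() { /* append the statement's events */ }
static void unlock_tables()             { /* release the statement's table locks */ }

void end_of_nontrans_statement(Critical_zone &zone)
{
  Critical_zone::Ticket t= zone.enter();  // no VP can be created from here on
  write_statement_to_binlog();            // change is in the binlog, still invisible
  unlock_tables();                        // change becomes visible ("commit" here)
  t.drop();                               // leave the zone; VP creation may proceed
}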

How to implement the critical zone
==================================

There is a global read lock which basically implements a critical zone
mechanism.  If we were using it, the backup thread would do something like
this:

1. lock_global_read_lock()
2. create VP and store binlog position
3. unlock_global_read_lock()

However, this solution has disadvantages which make me reject it:

a) Taking the global read lock blocks any thread which wants to change table
data - it is not possible to optimize and block only those threads which change
the tables being backed-up.

b) Apparently, there are two places where the global read lock is checked: at
the beginning of execution of a data-changing statement and inside
ha_commit_trans().  It is not clear which one is in effect, but either case is
not good:

- If it is at the beginning of a statement, then there will be unnecessary
blocking during backup synchronization.  It is enough to block the critical
sequence, but the global read lock will block the whole statement execution.
In the case of a long INSERT ... SELECT this can make a difference and freeze
the whole instance for seconds.

- If it is inside ha_commit_trans(), then it is too late.  We need writing to
the binlog to also happen inside the critical zone.

c) The global read lock is not taken into account when applying Rows
replication events in the SQL thread, but we need to block this as well.

d) It seems that the critical zone mechanism can be useful in a couple of other
places, and thus it would be good to have a general solution which can be
applied where needed.  For example, I might need another critical zone in the
ARCHIVE backup driver, when I want to switch from parallel updates to
serialized ones and make sure that the change has happened.

For the above reasons I propose to implement a general critical zone mechanism,
independent of the global read lock, and use it for backup synchronization.
If it proves to be good, it would even be possible to change the implementation
of the global read lock to use the same mechanism.

I have some ideas for the implementation of the critical zone, but I think the
details can be shown in a separately reviewed patch, if this design is approved.
After all, the zone is rather trivial to implement.  I would use pthread's
read-write locks on platforms which support them, and otherwise a shared
counter protected with a mutex (a sketch of that fallback follows).
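
One possible shape of the "shared counter protected with a mutex" fallback
mentioned above, for platforms without pthread read-write locks.  This is only
a sketch (a condition variable is added so that enter() and wait() can block
instead of spinning); the reviewed patch may look different.

#include <pthread.h>

class Critical_zone_fallback
{
  pthread_mutex_t m_lock;
  pthread_cond_t  m_cond;
  unsigned        m_inside;  // number of threads currently inside the zone
  bool            m_closed;  // true while the backup thread keeps the zone closed

public:
  Critical_zone_fallback() : m_inside(0), m_closed(false)
  {
    pthread_mutex_init(&m_lock, NULL);
    pthread_cond_init(&m_cond, NULL);
  }
  ~Critical_zone_fallback()
  {
    pthread_cond_destroy(&m_cond);
    pthread_mutex_destroy(&m_lock);
  }

  void enter()                           // blocks while the zone is closed
  {
    pthread_mutex_lock(&m_lock);
    while (m_closed)
      pthread_cond_wait(&m_cond, &m_lock);
    m_inside++;
    pthread_mutex_unlock(&m_lock);
  }

  void leave()
  {
    pthread_mutex_lock(&m_lock);
    if (--m_inside == 0)
      pthread_cond_broadcast(&m_cond);   // wake a backup thread blocked in wait()
    pthread_mutex_unlock(&m_lock);
  }

  void close()                           // no new threads may enter after this
  {
    pthread_mutex_lock(&m_lock);
    m_closed= true;
    pthread_mutex_unlock(&m_lock);
  }

  void wait()                            // wait until the zone is empty
  {
    pthread_mutex_lock(&m_lock);
    while (m_inside > 0)
      pthread_cond_wait(&m_cond, &m_lock);
    pthread_mutex_unlock(&m_lock);
  }

  void open()                            // re-open the zone
  {
    pthread_mutex_lock(&m_lock);
    m_closed= false;
    pthread_cond_broadcast(&m_cond);     // wake threads blocked in enter()
    pthread_mutex_unlock(&m_lock);
  }
};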

When the zone should be entered
===============================

A thread which changes data in tables should enter the zone when executing the
critical sequence (binlog write, commit, unlock tables), after all data changes
have been requested.  The critical sequence is executed in these places in the
code:

1. In mysql_{insert,delete,update}() functions implementing INSERT/DELETE/UPDATE
statements.

2. In the classes select_insert, multi_delete and multi_update, implementing
multi-row versions of the above operations.

3. In the Rows_log_event::exec_event() method, applying Rows replication
events.

Note that the zone is not used for meta-data operations as these will be blocked
during the whole backup process.

Optimization of the zone (to be considered)
===========================================

Threads which do not modify any tables being backed-up don't have to enter the
critical zone - their modifications are irrelevant to backup/binlog
synchronization.

Let's call a thread "participating" if it modifies data in a table being
backed-up. Note that this notion makes sense only during an ongoing backup
operation.

During backup synchronization we want to block only participating threads.  To
do this we introduce a global backup_synchronization flag which is set to true
by the backup thread in the synchronization phase:

1. backup_synchronization= TRUE
2. 
3. 
4. 
5. 
6. backup_synchronization= FALSE

Each thread which is going to modify any tables will set a flag determining if
it is going to enter the zone or not. This flag, call it enter_zone, can be set
while tables are being locked. The logic is as follows:

if (!backup_synchronization)
  enter_zone= TRUE;
else if (thread is participating)  /* i.e., it modifies a table being backed-up */
  enter_zone= TRUE;
else
  enter_zone= FALSE;

When executing the critical sequence, a thread enters the critical zone only if
the enter_zone flag is set.  This way it is possible for a non-participating
thread to skip the zone and not be blocked when the zone is closed by the
backup thread.  Note that if backup_synchronization == FALSE then all threads
will enter the zone, but none of them will be blocked because the zone is not
going to be closed.

With this optimization the following invariant holds: for any thread which
executes the critical sequence, one of the following is true:

a) thread is in the critical zone, or
b) thread does not participate in the on-going backup process.

After closing the zone and waiting until it is empty, alternative a) is
excluded, and therefore we know that no participating thread is executing the
critical sequence, as required.

Note: with this optimization, it is possible that a non-participating thread
adds a binlog entry during the backup synchronization phase, but this should
not hurt, as the entry will not concern any of the tables being backed-up.

Here are some ideas about how to implement the critical zone.  First, a general
API for using a zone is described.  This API is independent of the way the zone
is implemented.  Then we show how the zone can be easily implemented using the
read-write lock mechanism of the pthread library.

Critical Zone API
=================

A critical zone is represented by an instance of Critical_zone class holding any
resources needed for operation of the zone.

The instance has methods:

- close(): close the zone so that no thread can enter it.
- wait(): wait until the zone is empty.
- open(): open the zone after it was closed.

- enter(): enter the zone.  The method returns a ticket - an instance of the
Critical_zone::Ticket class.  A thread is in the zone as long as it holds the
ticket.  The ticket can be dropped explicitly (using the drop() method) or
automatically when the ticket's destructor is called.
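
A possible shape of the class described above - a sketch only, assuming the
pthread read-write lock based implementation discussed later in this WL; it is
not the final, reviewed interface.

#include <pthread.h>

class Critical_zone
{
  pthread_rwlock_t m_rwlock;            // implementation detail, see below

public:
  class Ticket
  {
    Critical_zone *m_zone;              // NULL once the ticket has been dropped
  public:
    explicit Ticket(Critical_zone *zone) : m_zone(zone) {}
    void drop();                        // leave the zone explicitly
    ~Ticket() { drop(); }               // or automatically, on destruction
    // (copy/ownership handling is omitted in this sketch)
  };

  Critical_zone()  { pthread_rwlock_init(&m_rwlock, NULL); }
  ~Critical_zone() { pthread_rwlock_destroy(&m_rwlock); }

  Ticket enter();                       // enter the zone; blocks while it is closed
  void   close();                       // close the zone to new threads
  void   wait();                        // wait until the zone is empty
  void   open();                        // re-open the zone
};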

Usage Examples:

Critical_zone zone; // global instance representing a zone

thread 1:

Critical_zone::Ticket t= zone.enter(); // enter the zone

t.drop(); // leave the zone

thread 2:

{
 Critical_zone::Ticket t= zone.enter(); //enter zone
 
} // the zone is left automatically when t is destroyed.

thread 3:

zone.close();
zone.wait();
// now, no threads are in the zone
zone.open(); 

Implementation using pthread's read-write locks.
================================================

The pthread library provides a mechanism called read-write locks.  Any number
of threads can take a read lock at the same time.  A single thread can take a
write lock, and if it succeeds then no other thread holds the lock (either read
or write).  Thus:

- to enter a zone, take a read lock (pthread_rwlock_rdlock()),
- to leave zone, release the read lock taken (pthread_rwlock_unlock()),
- to close zone, take a write lock (pthread_rwlock_wrlock()),
- to open zone, release the write lock taken (pthread_rwlock_unlock()).

Note that in this case Critical_zone::close(), which calls
pthread_rwlock_wrlock(), will wait for the zone to be emptied, and thus
Critical_zone::wait() will be empty (a no-op).

pthread_rwlock_rdlock() will be called when creating a ticket object.  The
matching pthread_rwlock_unlock() will be called in the ticket's drop() method,
which is also called from its destructor.
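
A sketch of the mapping described above onto the pthread calls, assuming the
Critical_zone declaration sketched in the API section:

Critical_zone::Ticket Critical_zone::enter()
{
  pthread_rwlock_rdlock(&m_rwlock);           // blocks while the zone is closed
  return Ticket(this);
}

void Critical_zone::Ticket::drop()
{
  if (m_zone)
  {
    pthread_rwlock_unlock(&m_zone->m_rwlock); // leave the zone
    m_zone= NULL;                             // makes drop()/destructor idempotent
  }
}

void Critical_zone::close()
{
  pthread_rwlock_wrlock(&m_rwlock);           // waits until all readers have left
}

void Critical_zone::wait()
{
  // Nothing to do: with a write lock, close() has already waited for the
  // zone to become empty (see the note above).
}

void Critical_zone::open()
{
  pthread_rwlock_unlock(&m_rwlock);           // release the write lock
}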

See this patch for an initial implementation of the above ideas.