WL#6835: InnoDB: GCS Replication: Deterministic Deadlock Handling (High Prio Transactions in InnoDB)

Affects: Server-Prototype Only — Status: Complete

Description
Requirements
High Level Architecture
Low Level Design

EXECUTIVE SUMMARY
=================

In database state machine (DBSM), once a transaction is certified,
it MUST commit.
It cannot abort in one replica and be committed in another. This is part 
of the DBSM termination protocol. This task designs and implements 
high prio transactions that shall never be chosen to abort on a 
deadlock scenario. They will always commit (unless the database is 
in serious meltdown). Typically, transactions that have passed the
certification test need to commit (and eventually break the locks of
locally running transaction - ie, those that are still in the optimistic
execution phase).

CHALLENGES
==========

1. It can be that some commands require breaking locks in MDL in 
   addition to the locks in InnoDB. This needs to be investigated.

2. Check that InnoDB does not have intermediate locking (lockset 
   matches the write set used in certification).

Lets assume that a transaction is a sequence of Read, Write and
Commit/Abort operations. Transactions can be optimistic or
certified. If they have been certified already, they have higher
priority over the optimistic ones. In the following list of
requirements, we use the notation:
  
  T1=({... operations ...}, priority=[0..N]) 

  operations: commit - C, Abort - A, Read - R(?), Write - W(?)

to denote transactions in our system. In reality, the DBSM only needs
two levels of priority (optimistic or certified), but lets assume that
each transaction has a field denoting multiple levels of priority. The
higher the number the higher the priority.

R1. W-W lock conflict between optimistic and certified transaction.

    Consider the following transactions:
    - T1=({ R(B), W(B), C}, priority=1)
    - T2=({R(C), R(B), W(B)}, priority=0).

    T1 is being applied by the replication applier and T2 is being
    executed by a user.

    There is a conflict, since T2 holds a write lock on B while at the
    same time T1 needs it. Since T1 has higher priority than T2, the
    lock manager MUST abort T2 and give the lock to T1.
   
R2. R-W conflict between optimistic and certified transaction.

    Consider the following transactions:
    - T1=({R(B), W(B), C}, priority=1) 
    - T2=({R(B),R(C)}, priority=0). 

    T1 is being applied by the replication applier and T2 is being
    executed by a user.

    It is clear that there is a conflict, since T2 holds a read lock on
    B, while at the same time T1 is requesting a write lock on it. In 
    this case, two things can happen:

      1. T2 has been clearly marked/identified as a READ ONLY transaction
         by the user (START TRANSACTION READ ONLY) and as such T1 may 
         wait until T2 finishes, to proceed and grab the locks;

      2. There is no way to know whether T2 is READ ONLY or not, even
         if it has just issued read operations so far. In that case,
         T2 MUST be aborted and the locks granted to T1 as if T2 was a
         regular RW transaction.

    The approach to take at this point in time is implementation dependent.
    Aborting READ ONLY transactions does not hurt correctness, but may
    hurt performance.

R3. W-W-W conflict between multiple optimistic transactions and a
    certified transaction.

    Consider the following transactions being executed and applied at
    the same time:
    - T1= ({R(B), W(B), C}, priority=1)
    - T2= ({R(Z), R(B), W(Z), W(B)}, priority=0)
    - T3= ({R(C), R(B), W(B)}, priority=0)

    Assume that T2 holds the write lock on B, T3 is waiting for the
    write lock on B. Then T1 passes the certification test and comes 
    in to apply the changes to B. 

    Then, T2 *and* T3 MUST be aborted and write lock on B granted to
    T1.

R4. R-R-W conflict between multiple optimistic transactions and a
    certified transaction.

    Consider the following transactions being executed and applied at
    the same time:
    - T1= ({R(B), W(B), C}, priority=1)
    - T2= ({R(Z), R(B)}, priority=0)
    - T3= ({R(C), R(B)}, priority=0)

    Assuming that both T2 and T3 hold read locks on B.

    If both T2 and T3 are marked as READ ONLY, then T1 may wait for
    them to finish, but no other read transaction is granted lock on
    B. 

    Otherwise, those transactions that cannot be deemed as READ ONLY
    transactions MUST be aborted.

    Alternatively, T2 and T3 can also be aborted immediately, without
    T1 having to wait for the READ ONLY transactions to finish.

    The approach is implementation dependent and aborting T2 and T3 does
    not hurt correctness. Can hurt performance though.

R5. R-W-W conflict between multiple optimistic transactions and a
    certified transaction.

    Consider the following transactions being executed and applied at
    the same time.
    - T1= ({R(B), W(B), C}, priority=1)
    - T2= ({R(Z), R(B)}, priority=0)
    - T3= ({R(C), R(B), W(B)}, priority=0)

    T2 is marked as a READ ONLY transaction. T3 wants to update B, so
    is waiting for the lock from T2. T1 comes in from replication and
    needs to be applied.

    In this case, the lock manager may wait for T2 to finish or abort
    T2 immediately. For now, lets consider this implementation
    dependent as it does not hurt correctness.

    T3 MUST be aborted.

Limitations
===========
We will not rollback internal background persistent statistics gathering 
transactions. Higher priority transactions will end up waiting for such 
transactions, if a X -> S conflict arises. This should be in general be
rare and not affect overall throughput.

The problem is that currently we don't have a mechanism to block them while
and async rollback is taking place. If this is ever a real problem it should
be fixed separately.

INTRODUCTION
============

The database machine approach to replication (DBSM) is a multi-master
update everywhere replication protocol, based on group communication.
It is very simple and relies on a distributed state machine, where at
its core is an Atomic Broadcast, which basically allows all servers to
see all transactions in the same order everywhere.

In a nutshell, in the DBSM replication protocol, the execution of a
RW transaction, T, is split into three major stages:

  1. T executes optimistically at site S1, i.e., there is no a priori
     coordination with other transactions running concurrently on
     other sites; This is the first stage;

  2. T is globally ordered, between all transactions that are
     committing or have already committed. At this point, T is checked
     for conflicts among other preceding transactions. The procedure
     of checking for conflicts is known as certification. This is the
     second stage;

  3. T is applied at all sites, if no conflicts found, or rolled back 
     at S1 if a conflict was found.

Since T is globally ordered, all sites see all transactions in the
same order, so they can decide deterministically and unilaterally which
transactions conflict among themselves (they check if the write-sets
of committing transactions intersect - first committer wins).

If T passes the certification test, it is said to be serializable with
the previously committed transactions in the system, then it must be
applied everywhere. Since all sites receive the same transactions on
the same order and deterministically decide the same fate for T,
eventually T - as any other certified transaction - is applied
everywhere. Therefore, a certified transaction may be regarded as a
higher priority transaction.

THE PROBLEM
===========

In the paper, Fernando Pedone [1] mentions three cases to consider
when a certified transaction, T, is obtaining its locks on a remote
site:

  1. A locally and optimistic transaction, Ta, is executing and T
     requests locks that Ta is holding;

  2. A local transaction, Tb, has executed and was broadcast, but has
     not been certified yet, and T requests locks that Tb is holding;

  3. A certified transaction, Tc, is being applied, and T requests
     locks that Tc is holding.

Case #3 can be addressed by the multi-threaded applier. It shall not
apply certified transactions that update the same items in the
database in parallel. A future optimization would be, not do this
waiting and let the lock scheduler reorder the lock acquisition to
avoid deadlocks (transactions have priority to the locks depending on
the serialization order established by the replication protocol).

Case #2 shall be handled by the replication layer. If a transaction is
in the replication domain and gets aborted by a certified one, the
replication layer will register the fact and once the broadcast comes
around (and the certification outcome is decided - must be abort as
well since there was a conflict - the transaction is removed from the
context.).

Case #1 is the case that needs to be addressed by the storage engine
(InnoDB). In this case, roughly for any Ta executing optimistically 
that holds locks that a certified transaction T is requesting, then 
Ta must be aborted and locks granted to T (this is detailed in the
requirements section).

REFERENCES
==========

[1] http://infoscience.epfl.ch/record/88286/files/PGS03.pdf

There are three parts to this problem:

 1. Allow transactions that are tagged as high priority to jump the lock queue

 2. Kill transactions asynchronously

 3. See R2.1, we don't rollback active read-only transactions

Limitation, we will not kill internal transactions, even if they are blocking high 
priority transactions.

Tagging transactions
====================
InnoDB will ask the server for the transaction priority and also to arbitrate when 
there is a lock wait detected. If the server can't select a victim then InnoDB 
will rollback the requestor if both transactions are high priority. If both are 
non-high priority transactions then during a deadlock check existing rules apply, 
no change.

/**
Check if the transaction can be rolled back
@param[in] requestor    Session requesting the lock
@param[in] holder       Session that holds the lock
@return the session that will be rolled back, null don't care */

THD*
thd_trx_arbitrate(THD* requestor, THD* holder)
{
        /* Non-user (thd==0) transactions by default can't rollback, in
        practice DDL transactions should never rollback and that's because
        they should never wait on table/record locks either */

        ut_a(holder != NULL);
        ut_a(holder != requestor);

        THD*    victim = thd_tx_arbitrate(requestor, holder);

        ut_a(victim == NULL || victim == requestor || victim == holder);

        return(victim);
}

/**
@param[in] thd          Session to check
@return the priority */

int
thd_trx_priority(THD* thd)
{
        return(thd == NULL ? 0 : thd_tx_priority(thd));
}



Internal transactions are transactions started from within InnoDB. Any transaction 
that doesn't have the trx_t::mysql_thd data member set is deemed to be an internal 
transaction.

The locking semantics have to be changed to implement #1. When a high priority 
transaction has to wait for a transaction that already holds the record lock. The 
high priority transaction will first add its own lock at the head of the record 
lock wait queue. Then it will rollback any transactions were ahead in the queue 
that are waiting. This mechanism works in a similar way to the current cancel lock 
wait and rollback. We know that the transactions are inside InnoDB and waiting 
therefore all we have to do is set their rollback flag and wake them up.

Killing transactions that are active and ahead in the queue requires extra work. 
We first have to figure out what these transactions are doing. e.g., is the 
control inside InnoDB code or not. Once they are inside InnoDB it is close to 
impossible for us to know exactly what they are doing. Therefore we only check for 
two states, inside InnoDB or outside InnoDB. All user transactions must pass 
through the handler interface methods. For all these methods we add a "gate" by 
introducing a class: TrxInInnoDB(trx_t*);

The role of this class is two fold:

 1. Prevent transactions from crossing the boundary if it detects that the 
transaction is being considered for asynchronous rollback.

 2. Prevent asynchronous rollback if the transaction has crossed its boundary.

To check these control flow (execution context) states we introduce:

/** If this flag is set then the transaction cannot be rolled back
asynchronously. */
static const ib_uint32_t TRX_FORCE_ROLLBACK_DISABLE = 1 << 29;

/** Was the transaction rolled back asynchronously or by the
owning thread. This flag is relevant only if TRX_FORCE_ROLLBACK
is set.  */     
static const ib_uint32_t TRX_FORCE_ROLLBACK_ASYNC = 1 << 30;
                
/** Mark the transaction for forced rollback */
static const ib_uint32_t TRX_FORCE_ROLLBACK = 1 << 31;

/** For masking out the above four flags */
static const ib_uint32_t TRX_FORCE_ROLLBACK_MASK = 0x1FFFFFFF;

These flags are stored in trx_t::in_innodb. In order to reduce the mutex overhead 
they are encoded as the right most 3 bits, so that if performance becomes an issue 
we can use atomics to note and control the "inside InnoDB" state.

/** For tracking the transaction life cycle used in asynchronous rollback */
struct TrxVersion {
        TrxVersion(trx_t* trx);

        /**
        @return true if the trx_t instance is the same */
        bool operator==(const TrxVersion& rhs) const
        {
                return(rhs.m_trx == m_trx);
        }

        trx_t*          m_trx;
        ulint           m_version;
};


trx_t {
        ...

       /* These are protected by the trx_t::mutex */

       ib_uint32_t     in_innodb;       /*!< if the thread is executing
                                        in the InnoDB context count > 0. */

       bool            abort;           /*!< if this flag is set then
                                        this transaction must abort when
                                        it can */

       hit_list_t      hit_list;        /*!< List of transactions (and their
                                        version numbers) to kill,
                                        when a high priority transaction
                                        is blocked on a lock wait. */

       os_thread_id_t  killed_by;       /*!< The transaction (thread id) that
                                        wants to kill this transaction 
                                        asynchronously */

        /** Version of this instance. It is incremented each time the
        instance is re-used in trx_start_low() and on commit. It is used
        to track whether a transaction has been restarted since it was
        tagged for asynchronous rollback. */
        ulint           version;

        ...
};


Use case for asynchronous rollback
----------------------------------
CON1:
START TRANSACTION;
UPDATE t1 SET c1=1 WHERE c1=0;

CON2:
START TRANSACTION HIGH_PRIORITY;
UPDATE t1 SET c1=2 WHERE c1=0;
COMMIT;

CON1:
COMMIT;

The transaction on CON1 was rollback asynchronously but it will only find out when 
it goes to do a COMMIT, it will get an ER_ERROR_DURING_COMMIT error. This is 
because AFAIK our client protocol doesn't have a mechanism to push the 
FORCED_ROLLBACK event to the client. The ramifications for us are that we need to 
keep track of this new transaction state. For this we introduce a new state:

        /** Same as not started but with additional semantics that it
        was rolled back asynchronously the last time it was active. */
        TRX_STATE_FORCED_ROLLBACK

When a transaction is rolledback asynchronously we need to change its state to the 
above earlier we would set the state to TRX_STATE_NOT_STARTED. When the user 
thread has read the forced rollback state it will change it to 
TRX_STATE_NO_STARTED.

The way it works
================
When a high priority transaction needs to rollback the transactions ahead in the 
queue. It sets their state to :

        trx->abort = true;

        /* Note that we will attempt an async rollback. The _ASYNC
        flag will be cleared if the transaction is rolled back
        synchronously before we get a chance to do it. */

        trx->in_innodb |= TRX_FORCE_ROLLBACK | TRX_FORCE_ROLLBACK_ASYNC;

Setting the trx->in_innodb flags will block the transaction from entering InnoDB 
code past the TrxInInnoDB barrier. Those that are inside InnoDB can still keep 
executing. Currently we don't do any checks. These could be added later if 
required, should be trivial provided we know the transaction context.

        trx->killed_by = os_thread_get_curr_id();

        m_trx->kill.push_back(trx);

Collect all the transactions in the lock queue ahead hat need to be rolled back in 
trx_t::kill.

Once the victim transactions are tagged and their state changed the high priority 
transaction will exit the lock code and just before it does the actual wait it 
will kill all the transactions. It will wait for the transactions that are active 
inside InnoDB code complete before killing them.

                For each victim_trx in trx_t::hit_list =>

                /* Check that the transaction isn't active inside
                InnoDB code. We have to wait while it is executing
                in the InnoDB context. */
                trx_mutex_enter(victim_trx);

                ut_ad(!(trx->in_innodb & TRX_FORCE_ROLLBACK_DISABLE));

                while (victim_trx->version == version
                       && (victim_trx->in_innodb & TRX_FORCE_ROLLBACK_MASK) > 0
                       && trx_is_started(victim_trx)) {

                        trx_mutex_exit(victim_trx);

                        os_thread_sleep(20);

                        trx_mutex_enter(victim_trx);
                }

                ut_ad(it->m_version <= victim_trx->version);

                bool    rollback = victim_trx->version == it->m_version;

                ut_ad((victim_trx->in_innodb & TRX_FORCE_ROLLBACK)
                      || !rollback);

                trx_mutex_exit(victim_trx);

                char    buffer[1024];

                if (trx_is_started(victim_trx) && rollback) {

                        trx_id_t        id = victim_trx->id;

                        ut_ad(victim_trx->in_innodb & TRX_FORCE_ROLLBACK_ASYNC);

                        ut_ad(victim_trx->version == it->m_version);

                        trx_rollback_for_mysql(victim_trx);

                        ib_logf(IB_LOG_LEVEL_INFO,
                                "Killed transaction: ID: " TRX_ID_FMT " - %s",
                                id,
                                thd_security_context(
                                        victim_trx->mysql_thd,
                                        buffer, sizeof(buffer), 512));
                }

                trx_mutex_enter(victim_trx);

                victim_trx->in_innodb &= TRX_FORCE_ROLLBACK_MASK;

                trx_mutex_exit(victim_trx);