WL#6835: InnoDB: GCS Replication: Deterministic Deadlock Handling (High Prio Transactions in InnoDB)
Affects: Server-Prototype Only
—
Status: Complete
EXECUTIVE SUMMARY ================= In database state machine (DBSM), once a transaction is certified, it MUST commit. It cannot abort in one replica and be committed in another. This is part of the DBSM termination protocol. This task designs and implements high prio transactions that shall never be chosen to abort on a deadlock scenario. They will always commit (unless the database is in serious meltdown). Typically, transactions that have passed the certification test need to commit (and eventually break the locks of locally running transaction - ie, those that are still in the optimistic execution phase). CHALLENGES ========== 1. It can be that some commands require breaking locks in MDL in addition to the locks in InnoDB. This needs to be investigated. 2. Check that InnoDB does not have intermediate locking (lockset matches the write set used in certification).
Lets assume that a transaction is a sequence of Read, Write and Commit/Abort operations. Transactions can be optimistic or certified. If they have been certified already, they have higher priority over the optimistic ones. In the following list of requirements, we use the notation: T1=({... operations ...}, priority=[0..N]) operations: commit - C, Abort - A, Read - R(?), Write - W(?) to denote transactions in our system. In reality, the DBSM only needs two levels of priority (optimistic or certified), but lets assume that each transaction has a field denoting multiple levels of priority. The higher the number the higher the priority. R1. W-W lock conflict between optimistic and certified transaction. Consider the following transactions: - T1=({ R(B), W(B), C}, priority=1) - T2=({R(C), R(B), W(B)}, priority=0). T1 is being applied by the replication applier and T2 is being executed by a user. There is a conflict, since T2 holds a write lock on B while at the same time T1 needs it. Since T1 has higher priority than T2, the lock manager MUST abort T2 and give the lock to T1. R2. R-W conflict between optimistic and certified transaction. Consider the following transactions: - T1=({R(B), W(B), C}, priority=1) - T2=({R(B),R(C)}, priority=0). T1 is being applied by the replication applier and T2 is being executed by a user. It is clear that there is a conflict, since T2 holds a read lock on B, while at the same time T1 is requesting a write lock on it. In this case, two things can happen: 1. T2 has been clearly marked/identified as a READ ONLY transaction by the user (START TRANSACTION READ ONLY) and as such T1 may wait until T2 finishes, to proceed and grab the locks; 2. There is no way to know whether T2 is READ ONLY or not, even if it has just issued read operations so far. In that case, T2 MUST be aborted and the locks granted to T1 as if T2 was a regular RW transaction. The approach to take at this point in time is implementation dependent. Aborting READ ONLY transactions does not hurt correctness, but may hurt performance. R3. W-W-W conflict between multiple optimistic transactions and a certified transaction. Consider the following transactions being executed and applied at the same time: - T1= ({R(B), W(B), C}, priority=1) - T2= ({R(Z), R(B), W(Z), W(B)}, priority=0) - T3= ({R(C), R(B), W(B)}, priority=0) Assume that T2 holds the write lock on B, T3 is waiting for the write lock on B. Then T1 passes the certification test and comes in to apply the changes to B. Then, T2 *and* T3 MUST be aborted and write lock on B granted to T1. R4. R-R-W conflict between multiple optimistic transactions and a certified transaction. Consider the following transactions being executed and applied at the same time: - T1= ({R(B), W(B), C}, priority=1) - T2= ({R(Z), R(B)}, priority=0) - T3= ({R(C), R(B)}, priority=0) Assuming that both T2 and T3 hold read locks on B. If both T2 and T3 are marked as READ ONLY, then T1 may wait for them to finish, but no other read transaction is granted lock on B. Otherwise, those transactions that cannot be deemed as READ ONLY transactions MUST be aborted. Alternatively, T2 and T3 can also be aborted immediately, without T1 having to wait for the READ ONLY transactions to finish. The approach is implementation dependent and aborting T2 and T3 does not hurt correctness. Can hurt performance though. R5. R-W-W conflict between multiple optimistic transactions and a certified transaction. Consider the following transactions being executed and applied at the same time. - T1= ({R(B), W(B), C}, priority=1) - T2= ({R(Z), R(B)}, priority=0) - T3= ({R(C), R(B), W(B)}, priority=0) T2 is marked as a READ ONLY transaction. T3 wants to update B, so is waiting for the lock from T2. T1 comes in from replication and needs to be applied. In this case, the lock manager may wait for T2 to finish or abort T2 immediately. For now, lets consider this implementation dependent as it does not hurt correctness. T3 MUST be aborted. Limitations =========== We will not rollback internal background persistent statistics gathering transactions. Higher priority transactions will end up waiting for such transactions, if a X -> S conflict arises. This should be in general be rare and not affect overall throughput. The problem is that currently we don't have a mechanism to block them while and async rollback is taking place. If this is ever a real problem it should be fixed separately.
INTRODUCTION ============ The database machine approach to replication (DBSM) is a multi-master update everywhere replication protocol, based on group communication. It is very simple and relies on a distributed state machine, where at its core is an Atomic Broadcast, which basically allows all servers to see all transactions in the same order everywhere. In a nutshell, in the DBSM replication protocol, the execution of a RW transaction, T, is split into three major stages: 1. T executes optimistically at site S1, i.e., there is no a priori coordination with other transactions running concurrently on other sites; This is the first stage; 2. T is globally ordered, between all transactions that are committing or have already committed. At this point, T is checked for conflicts among other preceding transactions. The procedure of checking for conflicts is known as certification. This is the second stage; 3. T is applied at all sites, if no conflicts found, or rolled back at S1 if a conflict was found. Since T is globally ordered, all sites see all transactions in the same order, so they can decide deterministically and unilaterally which transactions conflict among themselves (they check if the write-sets of committing transactions intersect - first committer wins). If T passes the certification test, it is said to be serializable with the previously committed transactions in the system, then it must be applied everywhere. Since all sites receive the same transactions on the same order and deterministically decide the same fate for T, eventually T - as any other certified transaction - is applied everywhere. Therefore, a certified transaction may be regarded as a higher priority transaction. THE PROBLEM =========== In the paper, Fernando Pedone [1] mentions three cases to consider when a certified transaction, T, is obtaining its locks on a remote site: 1. A locally and optimistic transaction, Ta, is executing and T requests locks that Ta is holding; 2. A local transaction, Tb, has executed and was broadcast, but has not been certified yet, and T requests locks that Tb is holding; 3. A certified transaction, Tc, is being applied, and T requests locks that Tc is holding. Case #3 can be addressed by the multi-threaded applier. It shall not apply certified transactions that update the same items in the database in parallel. A future optimization would be, not do this waiting and let the lock scheduler reorder the lock acquisition to avoid deadlocks (transactions have priority to the locks depending on the serialization order established by the replication protocol). Case #2 shall be handled by the replication layer. If a transaction is in the replication domain and gets aborted by a certified one, the replication layer will register the fact and once the broadcast comes around (and the certification outcome is decided - must be abort as well since there was a conflict - the transaction is removed from the context.). Case #1 is the case that needs to be addressed by the storage engine (InnoDB). In this case, roughly for any Ta executing optimistically that holds locks that a certified transaction T is requesting, then Ta must be aborted and locks granted to T (this is detailed in the requirements section). REFERENCES ========== [1] http://infoscience.epfl.ch/record/88286/files/PGS03.pdf
There are three parts to this problem: 1. Allow transactions that are tagged as high priority to jump the lock queue 2. Kill transactions asynchronously 3. See R2.1, we don't rollback active read-only transactions Limitation, we will not kill internal transactions, even if they are blocking high priority transactions. Tagging transactions ==================== InnoDB will ask the server for the transaction priority and also to arbitrate when there is a lock wait detected. If the server can't select a victim then InnoDB will rollback the requestor if both transactions are high priority. If both are non-high priority transactions then during a deadlock check existing rules apply, no change. /** Check if the transaction can be rolled back @param[in] requestor Session requesting the lock @param[in] holder Session that holds the lock @return the session that will be rolled back, null don't care */ THD* thd_trx_arbitrate(THD* requestor, THD* holder) { /* Non-user (thd==0) transactions by default can't rollback, in practice DDL transactions should never rollback and that's because they should never wait on table/record locks either */ ut_a(holder != NULL); ut_a(holder != requestor); THD* victim = thd_tx_arbitrate(requestor, holder); ut_a(victim == NULL || victim == requestor || victim == holder); return(victim); } /** @param[in] thd Session to check @return the priority */ int thd_trx_priority(THD* thd) { return(thd == NULL ? 0 : thd_tx_priority(thd)); } Internal transactions are transactions started from within InnoDB. Any transaction that doesn't have the trx_t::mysql_thd data member set is deemed to be an internal transaction. The locking semantics have to be changed to implement #1. When a high priority transaction has to wait for a transaction that already holds the record lock. The high priority transaction will first add its own lock at the head of the record lock wait queue. Then it will rollback any transactions were ahead in the queue that are waiting. This mechanism works in a similar way to the current cancel lock wait and rollback. We know that the transactions are inside InnoDB and waiting therefore all we have to do is set their rollback flag and wake them up. Killing transactions that are active and ahead in the queue requires extra work. We first have to figure out what these transactions are doing. e.g., is the control inside InnoDB code or not. Once they are inside InnoDB it is close to impossible for us to know exactly what they are doing. Therefore we only check for two states, inside InnoDB or outside InnoDB. All user transactions must pass through the handler interface methods. For all these methods we add a "gate" by introducing a class: TrxInInnoDB(trx_t*); The role of this class is two fold: 1. Prevent transactions from crossing the boundary if it detects that the transaction is being considered for asynchronous rollback. 2. Prevent asynchronous rollback if the transaction has crossed its boundary. To check these control flow (execution context) states we introduce: /** If this flag is set then the transaction cannot be rolled back asynchronously. */ static const ib_uint32_t TRX_FORCE_ROLLBACK_DISABLE = 1 << 29; /** Was the transaction rolled back asynchronously or by the owning thread. This flag is relevant only if TRX_FORCE_ROLLBACK is set. */ static const ib_uint32_t TRX_FORCE_ROLLBACK_ASYNC = 1 << 30; /** Mark the transaction for forced rollback */ static const ib_uint32_t TRX_FORCE_ROLLBACK = 1 << 31; /** For masking out the above four flags */ static const ib_uint32_t TRX_FORCE_ROLLBACK_MASK = 0x1FFFFFFF; These flags are stored in trx_t::in_innodb. In order to reduce the mutex overhead they are encoded as the right most 3 bits, so that if performance becomes an issue we can use atomics to note and control the "inside InnoDB" state. /** For tracking the transaction life cycle used in asynchronous rollback */ struct TrxVersion { TrxVersion(trx_t* trx); /** @return true if the trx_t instance is the same */ bool operator==(const TrxVersion& rhs) const { return(rhs.m_trx == m_trx); } trx_t* m_trx; ulint m_version; }; trx_t { ... /* These are protected by the trx_t::mutex */ ib_uint32_t in_innodb; /*!< if the thread is executing in the InnoDB context count > 0. */ bool abort; /*!< if this flag is set then this transaction must abort when it can */ hit_list_t hit_list; /*!< List of transactions (and their version numbers) to kill, when a high priority transaction is blocked on a lock wait. */ os_thread_id_t killed_by; /*!< The transaction (thread id) that wants to kill this transaction asynchronously */ /** Version of this instance. It is incremented each time the instance is re-used in trx_start_low() and on commit. It is used to track whether a transaction has been restarted since it was tagged for asynchronous rollback. */ ulint version; ... }; Use case for asynchronous rollback ---------------------------------- CON1: START TRANSACTION; UPDATE t1 SET c1=1 WHERE c1=0; CON2: START TRANSACTION HIGH_PRIORITY; UPDATE t1 SET c1=2 WHERE c1=0; COMMIT; CON1: COMMIT; The transaction on CON1 was rollback asynchronously but it will only find out when it goes to do a COMMIT, it will get an ER_ERROR_DURING_COMMIT error. This is because AFAIK our client protocol doesn't have a mechanism to push the FORCED_ROLLBACK event to the client. The ramifications for us are that we need to keep track of this new transaction state. For this we introduce a new state: /** Same as not started but with additional semantics that it was rolled back asynchronously the last time it was active. */ TRX_STATE_FORCED_ROLLBACK When a transaction is rolledback asynchronously we need to change its state to the above earlier we would set the state to TRX_STATE_NOT_STARTED. When the user thread has read the forced rollback state it will change it to TRX_STATE_NO_STARTED. The way it works ================ When a high priority transaction needs to rollback the transactions ahead in the queue. It sets their state to : trx->abort = true; /* Note that we will attempt an async rollback. The _ASYNC flag will be cleared if the transaction is rolled back synchronously before we get a chance to do it. */ trx->in_innodb |= TRX_FORCE_ROLLBACK | TRX_FORCE_ROLLBACK_ASYNC; Setting the trx->in_innodb flags will block the transaction from entering InnoDB code past the TrxInInnoDB barrier. Those that are inside InnoDB can still keep executing. Currently we don't do any checks. These could be added later if required, should be trivial provided we know the transaction context. trx->killed_by = os_thread_get_curr_id(); m_trx->kill.push_back(trx); Collect all the transactions in the lock queue ahead hat need to be rolled back in trx_t::kill. Once the victim transactions are tagged and their state changed the high priority transaction will exit the lock code and just before it does the actual wait it will kill all the transactions. It will wait for the transactions that are active inside InnoDB code complete before killing them. For each victim_trx in trx_t::hit_list => /* Check that the transaction isn't active inside InnoDB code. We have to wait while it is executing in the InnoDB context. */ trx_mutex_enter(victim_trx); ut_ad(!(trx->in_innodb & TRX_FORCE_ROLLBACK_DISABLE)); while (victim_trx->version == version && (victim_trx->in_innodb & TRX_FORCE_ROLLBACK_MASK) > 0 && trx_is_started(victim_trx)) { trx_mutex_exit(victim_trx); os_thread_sleep(20); trx_mutex_enter(victim_trx); } ut_ad(it->m_version <= victim_trx->version); bool rollback = victim_trx->version == it->m_version; ut_ad((victim_trx->in_innodb & TRX_FORCE_ROLLBACK) || !rollback); trx_mutex_exit(victim_trx); char buffer[1024]; if (trx_is_started(victim_trx) && rollback) { trx_id_t id = victim_trx->id; ut_ad(victim_trx->in_innodb & TRX_FORCE_ROLLBACK_ASYNC); ut_ad(victim_trx->version == it->m_version); trx_rollback_for_mysql(victim_trx); ib_logf(IB_LOG_LEVEL_INFO, "Killed transaction: ID: " TRX_ID_FMT " - %s", id, thd_security_context( victim_trx->mysql_thd, buffer, sizeof(buffer), 512)); } trx_mutex_enter(victim_trx); victim_trx->in_innodb &= TRX_FORCE_ROLLBACK_MASK; trx_mutex_exit(victim_trx);
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.