Innodb Performance Schema data lock instrumentation

Data lock iterators

To provide content to the performance_schema.data_locks table, innodb implements Innodb_data_lock_iterator.

Likewise, table performance_schema.data_wait_locks is populated with Innodb_data_lock_wait_iterator.

Both these iterators need to return the data present in the innodb engine memory, which imply to take the proper mutex locks when inspecting it. The structure to inspect here is the transaction list (trx_sys)

How to implement this scan is critical for performances.

No full scan

Consider this implementation:

Take all necessary locks
Scan all the innodb internal locks
Report all of them to the performance schema
Release all the locks taken

This implementation materializes the entire table.

The benefits with this approach are:

The materialized table is consistent

The problems with this approach are:

The innodb engine is frozen for the entire duration, for a time that is unpredictable.
Memory consumption spikes, without bounds
Materializing all rows upfront is incompatible with supporting an index

For example with N = 10,000 transactions, a single scan reports all 10,000 transaction locks.

This alternative is rejected.

No single row scan

Consider this implementation:

Take all necessary locks
Resume the scan on innodb internal locks for 1 record
Report this record to the performance schema
Release all the locks taken

This implementation returns a row for a single transaction, or even a single lock, at a time.

The benefits with this approach are:

Memory consumption is well bounded, and low.

The problems with this approach are:

Data reported can be very inconsistent.
Implementing a restartable scan, on a very dynamic structure, without holding any lock, is complex.
Even assuming how to implement a scan is resolved, looping N times to find element i, i+1, i+2 ... in a list ends up having a complexity in O(N^2), consuming CPU.

For example with N = 10,000 transactions, the trx_list would be scanned 10,000 times to return 1 record each time. The total number of operations on the list is 100 Millions.

This alternative is rejected.

Restartable batch scan

What is implemented is:

Take all necessary locks
Resume the scan on innodb internal locks, for a range containing at most RANGE transactions
Report all the records of transactions in the range to the p_s
Release all the locks taken

This is a compromise, with the following properties:

Memory consumption is bounded, by the number of transactions processed in each range.
The duration of mutex locks on innodb structures is bounded by the number of records in each range
The data returned is not consistent, but at least it is "consistent by chunks"
As the trx_sys->rw_trx_list and trx_sys->mysql_trx_list lists we iterate over are in constant flux (transactions can start or end, when we don't hold trx_sys->mutex), and the lists aren't sorted by id nor by address, we can't use position within the list, nor pointer to trx, nor trx->id, as a "persistent cursor". Instead we use trx_immutable_id(trx) to map them to space of natural numbers, and divide that space into ranges, so that each range contains RANGE transactions (except for the last which can have less).
The overall scan complexity is O(N/RANGE)*N*lg(RANGE), as there will be at most (N/RANGE) ranges, and establishing where each range ends requires O(Nlg(RANGE)) operations. It could be O(N) if we used n_th_element, but that would require a Omega(N) memory, so instead we use a heap of size O(RANGE) to figure out the RANGE-th smallest not yet processed transaction. Each operation on this heap takes O(lg(RANGE)) and in worst case we need O(N) of them. In practice, if the sequence is random, the number of operations on the heap is expected to be sum(RANGE/i)=O(RANGE*lg(N)) as i-th element of the sequence has roughly RANGE/i chance of being among the RANGE smallest so far. So the total expected time for random sequences is around (N/RANGE)*(N+RANGE*lg(N)*lg(RANGE)) which is O(N^2/RANGE + Nlg(N)lg(RANGE)). This is still technically O(N^2), but as the RANGE is 256, in practice it should be reasonable. For example with N = 10,000 transactions, there are 40 batches at the trx list, where each batch reports (up to) 256 trx, with the trx locks. The total number of operations on the list is 400 thousands plus some number of operations on the heap, which we expect to be ~160 thousands and each takes ~8 steps within the heap.