Innodb Performance Schema data lock instrumentation

Data lock iterators

To provide content to the performance_schema.data_locks table, innodb implements Innodb_data_lock_iterator.

Likewise, table performance_schema.data_wait_locks is populated with Innodb_data_lock_wait_iterator.

Both these iterators need to return the data present in the innodb engine memory, which imply to take the proper mutex locks when inspecting it. The structure to inspect here is the transaction list (trx_sys)

How to implement this scan is critical for performances.

No full scan

Consider this implementation:

Take all necessary locks
Scan all the innodb internal locks
Report all of them to the performance schema
Release all the locks taken

This implementation materializes the entire table.

The benefits with this approach are:

The materialized table is consistent

The problems with this approach are:

The innodb engine is frozen for the entire duration, for a time that is unpredictable.
Memory consumption spikes, without bounds
Materializing all rows upfront is incompatible with supporting an index

For example with N = 10,000 transactions, a single scan reports all 10,000 transaction locks.

This alternative is rejected.

No single row scan

Consider this implementation:

Take all necessary locks
Resume the scan on innodb internal locks for 1 record
Report this record to the performance schema
Release all the locks taken

This implementation returns a row for a single transaction, or even a single lock, at a time.

The benefits with this approach are:

Memory consumption is well bounded, and low.

The problems with this approach are:

Data reported can be very inconsistent.
Implementing a restartable scan, on a very dynamic structure, without holding any lock, is complex.
Even assuming how to implement a scan is resolved, looping N times to find element i, i+1, i+2 ... in a list ends up having a complexity in O(N^2), consuming CPU.

For example with N = 10,000 transactions, the trx_list would be scanned 10,000 times to return 1 record each time. The total number of operations on the list is 100 Millions.

This alternative is rejected.

Restartable batch scan

What is implemented is:

As lock sys data structures are sharded with each shard having own latch, we inspect the shards one by one to avoid latching whole lock system
We first process table locks, then record locks
Table locks are processed one table at a time
Record locks are processed one internal hash table bucket at a time

This is a compromise, with the following properties:

Memory consumption is bounded, by the number of locks in each bucket and on each table.
The duration of mutex locks on innodb structures is bounded by the number of locks in each bucket and on each table.
The data returned is not consistent, but at least it is "consistent by chunks"