MySQL 8.0.40
Source Code Documentation
Performance schema aggregates.
Aggregates tables are tables that can be formally defined as SELECT ... FROM EVENTS_WAITS_HISTORY_INFINITE ... GROUP BY 'group clause'.
Each group clause defines a different kind of aggregate, and corresponds to a different table exposed by the performance schema.
Aggregates can be either:
- computed on the fly,
- computed on demand, based on other available data.

'EVENTS_WAITS_HISTORY_INFINITE' is a table that does not exist; the best approximation is EVENTS_WAITS_HISTORY_LONG. Aggregates computed on the fly are in fact based on EVENTS_WAITS_CURRENT, while aggregates computed on demand are based on other EVENTS_WAITS_SUMMARY_BY_xxx tables.
To better understand the implementation itself, a bit of math is required first, to understand the model behind the code: the code is deceptively simple, the real complexity resides in the flyweight of pointers between various performance schema buffers.
An event measured by the instrumentation has many attributes. An event is represented as a data point P(x1, x2, ..., xN), where each x_i coordinate represents a given attribute value.
Examples of attributes are:
- the time waited
- the object waited on
- the instrument waited on
- the thread that waited
- the operation performed
- per object or per operation additional attributes, such as a number of bytes
Computing an aggregate per thread is fundamentally different from computing an aggregate by instrument, so the "_BY_THREAD" and "_BY_EVENT_NAME" aggregates are different dimensions, operating on different x_i and x_j coordinates. These aggregates are "orthogonal".
A given x_i attribute value can convey either a single piece of basic information, such as a number of bytes, or implied information, such as an object fully qualified name.
For example, from the value "test.t1", the name of the object schema "test" can be separated from the object name "t1", so that now aggregates by object schema can be implemented.
In math terms, that corresponds to defining a function:

@verbatim
F_i (x): x --> y
@endverbatim

Applying this function to our point P gives another point P':

@verbatim
F_i (P):
P (x_1, x_2, ..., x_{i-1}, x_i, x_{i+1}, ..., x_N)
--> P' (x_1, x_2, ..., x_{i-1}, f_i(x_i), x_{i+1}, ..., x_N)
@endverbatim
That function in fact defines an aggregate! In SQL terms, this aggregate would look like the following table:

@verbatim
CREATE VIEW EVENTS_WAITS_SUMMARY_BY_Func_i AS
SELECT col_1, col_2, ..., col_{i-1},
       Func_i(col_i),
       COUNT(col_i),
       MIN(col_i), AVG(col_i), MAX(col_i), -- if col_i is a numeric value
       col_{i+1}, ..., col_N
  FROM EVENTS_WAITS_HISTORY_INFINITE
 GROUP BY col_1, col_2, ..., col_{i-1}, col_{i+1}, ..., col_N
@endverbatim
Note that not all columns have to be included; in particular, some columns that are dependent on the x_i column should be removed, so that in practice, MySQL's aggregation method tends to remove many attributes at each aggregation step. For example, when aggregating wait events by object instances, the thread that performed the wait no longer matters: thread-related columns are removed, and the per-event timer values are folded into statistics such as COUNT, MIN, AVG and MAX.
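This removal of attributes can be sketched in a few lines of C++. The sketch below is a hypothetical illustration with invented attribute names, not the server's actual data structures: grouping keeps only the surviving coordinate as the key, while the dropped timing coordinate feeds COUNT/SUM/MIN/MAX statistics.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical illustration, not the actual server code. An event is a
// point P(thread, event_name, wait_time); aggregating "by event name"
// removes the thread coordinate and folds wait_time into statistics.
struct Stat {
  unsigned long count = 0;
  unsigned long sum = 0;
  unsigned long min = ~0UL;
  unsigned long max = 0;
  void aggregate(unsigned long value) {
    ++count;
    sum += value;
    if (value < min) min = value;
    if (value > max) max = value;
  }
};

struct Event {
  std::string thread;
  std::string event_name;
  unsigned long wait_time;
};

// GROUP BY event_name: the map key keeps only the surviving coordinate.
std::map<std::string, Stat> aggregate_by_event_name(
    const std::vector<Event> &events) {
  std::map<std::string, Stat> summary;
  for (const Event &e : events) summary[e.event_name].aggregate(e.wait_time);
  return summary;
}
```

Aggregating the same events by thread instead would simply key the map on e.thread: the two aggregates are orthogonal, built from the same points by dropping different coordinates.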
Now, the "test.t1" --> "test" example was purely theory, just to explain the concept, and does not lead very far. Let's look at a more interesting example of data that can be derived from the row event.
An event creates a transient object, PFS_wait_locker, per operation. This object's life cycle is extremely short: it is created just before the start_wait() instrumentation call and destroyed in the end_wait() call.

The wait locker itself contains a pointer to the object instance waited on. This allows implementing a wait_locker --> object instance projection, using m_target. The object instance life cycle depends on _init and _destroy calls from the code, such as mysql_mutex_init() and mysql_mutex_destroy() for a mutex.

The object instance waited on contains a pointer to the object class, which is represented by the instrument name. This allows implementing an object instance --> object class projection. The object class life cycle is permanent, as instruments are loaded in the server and never removed.
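The two projections above can be sketched as a plain pointer chain. This is a hypothetical, heavily simplified layout (the real structures carry many more members): the point is that walking from a transient locker to the permanent instrument class costs two pointer dereferences and copies no data.

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified sketch of the pointer chain; field names mirror
// the projections described in the text, not the real structure layout.
struct PFS_mutex_class_sketch {
  std::string m_name;  // instrument name, e.g. "wait/sync/mutex/sql/LOCK_open"
};

struct PFS_mutex_sketch {
  PFS_mutex_class_sketch *m_class;  // instance --> class projection
};

struct PFS_wait_locker_sketch {
  PFS_mutex_sketch *m_target;  // locker --> instance projection
};

// Composing the two projections: locker --> instance --> class.
const std::string &event_name_of(const PFS_wait_locker_sketch &locker) {
  return locker.m_target->m_class->m_name;
}
```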
The object class is named in such a way (for example, "wait/sync/mutex/sql/LOCK_open", "wait/io/file/maria/data_file") that the component ("sql", "maria") it belongs to can be inferred. This allows implementing an object class --> server component projection.

Back to math again, we have, for example for mutexes:

@verbatim
F1 (l)      : PFS_wait_locker l --> PFS_mutex m = l->m_target.m_mutex
F1_to_2 (m) : PFS_mutex m --> PFS_mutex_class i = m->m_class
F2_to_3 (i) : PFS_mutex_class i --> const char *component =
                                      substring(i->m_name, ...)
@endverbatim

Per component aggregates are not implemented; this is just an illustration.

F1 alone defines this aggregate:
- EVENTS_WAITS_HISTORY_INFINITE --> EVENTS_WAITS_SUMMARY_BY_INSTANCE (or MUTEX_INSTANCE)

F1_to_2 alone could define this aggregate:
- EVENTS_WAITS_SUMMARY_BY_INSTANCE --> EVENTS_WAITS_SUMMARY_BY_EVENT_NAME

Alternatively, using function composition, with F2 = F1_to_2 o F1, F2 defines:
- EVENTS_WAITS_HISTORY_INFINITE --> EVENTS_WAITS_SUMMARY_BY_EVENT_NAME

Likewise, F_2_to_3 defines:
- EVENTS_WAITS_SUMMARY_BY_EVENT_NAME --> EVENTS_WAITS_SUMMARY_BY_COMPONENT

and F3 = F_2_to_3 o F_1_to_2 o F1 defines:
- EVENTS_WAITS_HISTORY_INFINITE --> EVENTS_WAITS_SUMMARY_BY_COMPONENT

What does all this have to do with the code?

Functions (or aggregates) such as F_3 are not implemented as is. Instead, they are decomposed into F_2_to_3 o F_1_to_2 o F1, and each intermediate aggregate is stored in an internal buffer. This allows supporting all the F1, F2, F3 aggregates from shared internal buffers, where computation already performed for F2 is reused when computing F3.

@section OBJECT_GRAPH Object graph

In terms of object instances, or records, pointers between the different buffers define an object instance graph.
For example, assuming the following scenario:
- A mutex class "M" is instrumented, the instrument name is "wait/sync/mutex/sql/M"
- This mutex instrument has been instantiated twice, mutex instances are noted M-1 and M-2
- Threads T-A and T-B are locking mutex instance M-1
- Threads T-C and T-D are locking mutex instance M-2

The performance schema will record the following data:
- EVENTS_WAITS_CURRENT has 4 rows, one for each mutex locker
- EVENTS_WAITS_SUMMARY_BY_INSTANCE shows 2 rows, for M-1 and M-2
- EVENTS_WAITS_SUMMARY_BY_EVENT_NAME shows 1 row, for M

The graph of structures will look like:

@startuml
object "PFS_wait_locker (T-A, M-1)" as TAM1
object "PFS_wait_locker (T-B, M-1)" as TBM1
object "PFS_wait_locker (T-C, M-2)" as TCM2
object "PFS_wait_locker (T-D, M-2)" as TDM2

object "PFS_mutex (M-1)" as M1
M1 : m_wait_stat

object "PFS_mutex (M-2)" as M2
M2 : m_wait_stat

object "PFS_mutex_class (M)" as M
M : m_wait_stat

M <-- M1
M1 <-- TAM1
M1 <-- TBM1

M <-- M2
M2 <-- TCM2
M2 <-- TDM2

M1 .right. M2
TAM1 .right. TBM1
TBM1 .right. TCM2
TCM2 .right. TDM2

note right of M
  PFS_mutex_class displayed in TABLE EVENTS_WAITS_SUMMARY_BY_EVENT_NAME
end note

note right of M2
  PFS_mutex displayed in TABLE EVENTS_WAITS_SUMMARY_BY_INSTANCE
end note

note right of TDM2
  PFS_wait_lockers displayed in TABLE EVENTS_WAITS_CURRENT
end note
@enduml

@section ON_THE_FLY On the fly aggregates

'On the fly' aggregates are computed during the code execution. This is necessary because the data the aggregate is based on is volatile, and can not be kept indefinitely.

With on the fly aggregates:
- the writer thread does all the computation
- the reader thread accesses the result directly

This model is to be avoided if possible, due to the overhead caused when instrumenting code.

@section HIGHER_LEVEL Higher level aggregates

'Higher level' aggregates are implemented on demand only.
The code executing a SELECT from the aggregate table is collecting data from multiple internal buffers to produce the result.

With higher level aggregates:
- the reader thread does all the computation
- the writer thread has no overhead

@section MIXED Mixed level aggregates

The 'Mixed' model is a compromise between 'On the fly' and 'Higher level' aggregates, used for internal buffers that are not permanent. While an object is present in a buffer, the higher level model is used. When an object is about to be destroyed, statistics are saved into a 'parent' buffer with a longer life cycle, to follow the on the fly model.

With mixed aggregates:
- the reader thread does a lot of complex computation
- the writer thread has minimal overhead, on destroy events

@section IMPL_WAIT Implementation for waits aggregates

For waits, the tables that contain aggregated wait data are:
- EVENTS_WAITS_SUMMARY_BY_ACCOUNT_BY_EVENT_NAME
- EVENTS_WAITS_SUMMARY_BY_HOST_BY_EVENT_NAME
- EVENTS_WAITS_SUMMARY_BY_INSTANCE
- EVENTS_WAITS_SUMMARY_BY_THREAD_BY_EVENT_NAME
- EVENTS_WAITS_SUMMARY_BY_USER_BY_EVENT_NAME
- EVENTS_WAITS_SUMMARY_GLOBAL_BY_EVENT_NAME
- FILE_SUMMARY_BY_EVENT_NAME
- FILE_SUMMARY_BY_INSTANCE
- SOCKET_SUMMARY_BY_INSTANCE
- SOCKET_SUMMARY_BY_EVENT_NAME
- OBJECTS_SUMMARY_GLOBAL_BY_TYPE

The instrumented code that generates waits events consists of:
- mutexes (mysql_mutex_t)
- rwlocks (mysql_rwlock_t)
- conditions (mysql_cond_t)
- file I/O (MYSQL_FILE)
- socket I/O (MYSQL_SOCKET)
- table I/O
- table lock
- idle

The flow of data between aggregates tables varies for each instrumentation.

@subsection IMPL_WAIT_MUTEX Mutex waits

@verbatim
  mutex_locker(T, M)
   |
   | [1]
   |
   |-> pfs_mutex(M) =====>> [B], [C]
   |    |
   |    | [2]
   |    |
   |    |-> pfs_mutex_class(M.class) =====>> [C]
   |
   |-> pfs_thread(T).event_name(M) =====>> [A], [D], [E], [F]
        |
        | [3]
        |
     3a |-> pfs_account(U, H).event_name(M) =====>> [D], [E], [F]
        .    |
        .    | [4-RESET]
        .    |
     3b .....+-> pfs_user(U).event_name(M) =====>> [E]
        .
     3c .....+-> pfs_host(H).event_name(M) =====>> [F]
@endverbatim

How to read this diagram:
- events that occur during the instrumented code execution are noted with numbers, as in [1]. Code executed by these events has an impact on overhead.
- events that occur during TRUNCATE TABLE operations are noted with numbers followed by "-RESET", as in [4-RESET]. Code executed by these events has no impact on overhead, since they are executed by independent monitoring sessions.
- events that occur when a reader extracts data from a performance schema table are noted with letters, as in [A]. The name of the table involved, and the method that builds a row, are documented. Code executed by these events has no impact on the instrumentation overhead. Note that the table implementation may pull data from different buffers.
- nominal code paths are in plain lines. A "nominal" code path corresponds to cases where the performance schema buffers are sized so that no records are lost.
- degenerated code paths are in dotted lines. A "degenerated" code path corresponds to edge cases where parent buffers are full, which forces the code to aggregate to grand parents directly.
Implemented as:
- [A] table_ews_by_thread_by_event_name::make_row()
- [B] table_events_waits_summary_by_instance::make_mutex_row()
- [C] table_ews_global_by_event_name::make_mutex_row()
- [D] table_ews_by_account_by_event_name::make_row()
- [E] table_ews_by_user_by_event_name::make_row()
- [F] table_ews_by_host_by_event_name::make_row()
Table EVENTS_WAITS_SUMMARY_BY_INSTANCE is an 'on the fly' aggregate, because the data is collected on the fly by (1) and stored into a buffer, pfs_mutex. The table implementation [B] simply reads the results directly from this buffer.

Table EVENTS_WAITS_SUMMARY_GLOBAL_BY_EVENT_NAME is a 'mixed' aggregate, because some data is collected on the fly (1), some data is preserved with (2) at a later time in the life cycle, and two different buffers, pfs_mutex and pfs_mutex_class, are used to store the statistics collected. The table implementation [C] is more complex, since it reads from both buffers.
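The mixed model for this table can be sketched as follows. This is a hypothetical simplification with invented type names, not the real server structures: waits accumulate on the fly in the live instance buffer, a destroy event flushes the instance statistics into the parent class buffer, and the reader of the global table merges both sources.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical simplified types. [1] is the on-the-fly part, (2) is the
// destroy-time flush into the parent buffer, [C] is the reader side.
struct WaitStat {
  unsigned long count = 0;
  unsigned long sum = 0;
  void aggregate(const WaitStat &other) {
    count += other.count;
    sum += other.sum;
  }
};

struct MutexClassSketch {
  WaitStat m_wait_stat;  // waits of already destroyed instances
};

struct MutexSketch {
  MutexClassSketch *m_class;
  WaitStat m_wait_stat;  // waits of this live instance
};

// [1] On the fly: the writer thread pays this cost at each wait.
void end_wait(MutexSketch &mutex, unsigned long wait_time) {
  ++mutex.m_wait_stat.count;
  mutex.m_wait_stat.sum += wait_time;
}

// (2) Mixed model: on destroy, preserve the stats in the parent buffer,
// so nothing is lost when the instance record is recycled.
void destroy_mutex(MutexSketch &mutex) {
  mutex.m_class->m_wait_stat.aggregate(mutex.m_wait_stat);
  mutex.m_wait_stat = WaitStat{};
}

// [C] Higher level side: the reader merges the class buffer with every
// live instance to build the global-by-event-name row.
WaitStat make_global_row(const MutexClassSketch &klass,
                         const MutexSketch *live, size_t n) {
  WaitStat row = klass.m_wait_stat;
  for (size_t i = 0; i < n; ++i) row.aggregate(live[i].m_wait_stat);
  return row;
}
```

Note how the writer's destroy-time work is a single aggregate() call, while the reader does the multi-buffer merge: exactly the cost split the mixed model aims for.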
@subsection IMPL_WAIT_RWLOCK Rwlock waits

@verbatim
  rwlock_locker(T, R)
   |
   | [1]
   |
   |-> pfs_rwlock(R) =====>> [B], [C]
   |    |
   |    | [2]
   |    |
   |    |-> pfs_rwlock_class(R.class) =====>> [C]
   |
   |-> pfs_thread(T).event_name(R) =====>> [A]
   |
   ...
@endverbatim
Implemented as:
- [A] table_ews_by_thread_by_event_name::make_row()
- [B] table_events_waits_summary_by_instance::make_rwlock_row()
- [C] table_ews_global_by_event_name::make_rwlock_row()
@subsection IMPL_WAIT_COND Cond waits

@verbatim
  cond_locker(T, C)
   |
   | [1]
   |
   |-> pfs_cond(C) =====>> [B], [C]
   |    |
   |    | [2]
   |    |
   |    |-> pfs_cond_class(C.class) =====>> [C]
   |
   |-> pfs_thread(T).event_name(C) =====>> [A]
   |
   ...
@endverbatim
Implemented as:
- [A] table_ews_by_thread_by_event_name::make_row()
- [B] table_events_waits_summary_by_instance::make_cond_row()
- [C] table_ews_global_by_event_name::make_cond_row()
@subsection IMPL_WAIT_FILE File waits

@verbatim
  file_locker(T, F)
   |
   | [1]
   |
   |-> pfs_file(F) =====>> [B], [C], [D], [E]
   |    |
   |    | [2]
   |    |
   |    |-> pfs_file_class(F.class) =====>> [C], [D]
   |
   |-> pfs_thread(T).event_name(F) =====>> [A]
   |
   ...
@endverbatim
Implemented as:
- [A] table_ews_by_thread_by_event_name::make_row()
- [B] table_events_waits_summary_by_instance::make_file_row()
- [C] table_ews_global_by_event_name::make_file_row()
- [D] table_file_summary_by_event_name::make_row()
- [E] table_file_summary_by_instance::make_row()
@subsection IMPL_WAIT_SOCKET Socket waits

@verbatim
  socket_locker(T, S)
   |
   | [1]
   |
   |-> pfs_socket(S) =====>> [A], [B], [C], [D], [E]
        |
        | [2]
        |
        |-> pfs_socket_class(S.class) =====>> [C], [D]
        |
        |-> pfs_thread(T).event_name(S) =====>> [A]
             |
             | [3]
             |
          3a |-> pfs_account(U, H).event_name(S) =====>> [F], [G], [H]
             .    |
             .    | [4-RESET]
             .    |
          3b .....+-> pfs_user(U).event_name(S) =====>> [G]
             .
          3c .....+-> pfs_host(H).event_name(S) =====>> [H]
@endverbatim
Implemented as:
- PFS_account::aggregate_waits()
- PFS_host::aggregate_waits()
- [A] table_ews_by_thread_by_event_name::make_row()
- [B] table_events_waits_summary_by_instance::make_socket_row()
- [C] table_ews_global_by_event_name::make_socket_row()
- [D] table_socket_summary_by_event_name::make_row()
- [E] table_socket_summary_by_instance::make_row()
- [F] table_ews_by_account_by_event_name::make_row()
- [G] table_ews_by_user_by_event_name::make_row()
- [H] table_ews_by_host_by_event_name::make_row()
@subsection IMPL_WAIT_TABLE Table waits

@verbatim
  table_locker(Thread Th, Table Tb, Event = I/O or lock)
   |
   | [1]
   |
1a |-> pfs_table(Tb) =====>> [A], [B], [C]
   |    |
   |    | [2]
   |    |
   |    |-> pfs_table_share(Tb.share) =====>> [B], [C]
   |         |
   |         | [3]
   |         |
   |         |-> global_table_io_stat =====>> [C]
   |         |
   |         |-> global_table_lock_stat =====>> [C]
   |
1b |-> pfs_thread(Th).event_name(E) =====>> [D], [E], [F], [G]
   |    |
   |    | [4-RESET]
   |    |
   |    |-> pfs_account(U, H).event_name(E) =====>> [E], [F], [G]
   |    .    |
   |    .    | [5-RESET]
   |    .    |
   |    .....+-> pfs_user(U).event_name(E) =====>> [F]
   |    .
   |    .....+-> pfs_host(H).event_name(E) =====>> [G]
   |
1c |-> pfs_thread(Th).waits_current(W) =====>> [H]
   |
1d |-> pfs_thread(Th).waits_history(W) =====>> [I]
   |
1e |-> waits_history_long(W) =====>> [J]
@endverbatim
Implemented as:
- [A] table_tiws_by_table::make_row()
- [B] table_os_global_by_type::make_table_row()
- [C] table_ews_global_by_event_name::make_table_io_row(), table_ews_global_by_event_name::make_table_lock_row()
- [D] table_ews_by_thread_by_event_name::make_row()
- [E] table_ews_by_account_by_event_name::make_row()
- [F] table_ews_by_user_by_event_name::make_row()
- [G] table_ews_by_host_by_event_name::make_row()
- [H] table_events_waits_common::make_row()
- [I] table_events_waits_common::make_row()
- [J] table_events_waits_common::make_row()
@section IMPL_STAGE Implementation for stages aggregates

For stages, the tables that contain aggregated data are:
- EVENTS_STAGES_SUMMARY_BY_ACCOUNT_BY_EVENT_NAME
- EVENTS_STAGES_SUMMARY_BY_HOST_BY_EVENT_NAME
- EVENTS_STAGES_SUMMARY_BY_THREAD_BY_EVENT_NAME
- EVENTS_STAGES_SUMMARY_BY_USER_BY_EVENT_NAME
- EVENTS_STAGES_SUMMARY_GLOBAL_BY_EVENT_NAME
@verbatim
  start_stage(T, S)
   |
   | [1]
   |
1a |-> pfs_thread(T).event_name(S) =====>> [A], [B], [C], [D], [E]
   |    |
   |    | [2]
   |    |
   | 2a |-> pfs_account(U, H).event_name(S) =====>> [B], [C], [D], [E]
   |    .    |
   |    .    | [3-RESET]
   |    .    |
   | 2b .....+-> pfs_user(U).event_name(S) =====>> [C]
   |    .    |
   | 2c .....+-> pfs_host(H).event_name(S) =====>> [D], [E]
   |    .    .    |
   |    .    .    | [4-RESET]
   | 2d .    .    |
1b |----+----+----+-> pfs_stage_class(S) =====>> [E]
@endverbatim
Implemented as:
- PFS_account::aggregate_stages()
- PFS_host::aggregate_stages()
- [A] table_esgs_by_thread_by_event_name::make_row()
- [B] table_esgs_by_account_by_event_name::make_row()
- [C] table_esgs_by_user_by_event_name::make_row()
- [D] table_esgs_by_host_by_event_name::make_row()
- [E] table_esgs_global_by_event_name::make_row()
@section IMPL_STATEMENT Implementation for statements aggregates

For statements, the tables that contain individual event data are:
- EVENTS_STATEMENTS_CURRENT
- EVENTS_STATEMENTS_HISTORY
- EVENTS_STATEMENTS_HISTORY_LONG

For statements, the tables that contain aggregated data are:
- EVENTS_STATEMENTS_SUMMARY_BY_ACCOUNT_BY_EVENT_NAME
- EVENTS_STATEMENTS_SUMMARY_BY_DIGEST
- EVENTS_STATEMENTS_SUMMARY_BY_HOST_BY_EVENT_NAME
- EVENTS_STATEMENTS_SUMMARY_BY_THREAD_BY_EVENT_NAME
- EVENTS_STATEMENTS_SUMMARY_BY_USER_BY_EVENT_NAME
- EVENTS_STATEMENTS_SUMMARY_GLOBAL_BY_EVENT_NAME
@verbatim
  statement_locker(T, S)
   |
   | [1]
   |
1a |-> pfs_thread(T).event_name(S) =====>> [A], [B], [C], [D], [E]
   |    |
   |    | [2]
   |    |
   | 2a |-> pfs_account(U, H).event_name(S) =====>> [B], [C], [D], [E]
   |    .    |
   |    .    | [3-RESET]
   |    .    |
   | 2b .....+-> pfs_user(U).event_name(S) =====>> [C]
   |    .    |
   | 2c .....+-> pfs_host(H).event_name(S) =====>> [D], [E]
   |    .    .    |
   |    .    .    | [4-RESET]
   | 2d .    .    |
1b |----+----+----+-> pfs_statement_class(S) =====>> [E]
   |
1c |-> pfs_thread(T).statement_current(S) =====>> [F]
   |
1d |-> pfs_thread(T).statement_history(S) =====>> [G]
   |
1e |-> statement_history_long(S) =====>> [H]
   |
1f |-> statement_digest(S) =====>> [I]
@endverbatim
Implemented as:
- PFS_account::aggregate_statements()
- PFS_host::aggregate_statements()
- [A] table_esms_by_thread_by_event_name::make_row()
- [B] table_esms_by_account_by_event_name::make_row()
- [C] table_esms_by_user_by_event_name::make_row()
- [D] table_esms_by_host_by_event_name::make_row()
- [E] table_esms_global_by_event_name::make_row()
- [F] table_events_statements_current::make_row()
- [G] table_events_statements_history::make_row()
- [H] table_events_statements_history_long::make_row()
- [I] table_esms_by_digest::make_row()
@section IMPL_TRANSACTION Implementation for transactions aggregates

For transactions, the tables that contain individual event data are:
- EVENTS_TRANSACTIONS_CURRENT
- EVENTS_TRANSACTIONS_HISTORY
- EVENTS_TRANSACTIONS_HISTORY_LONG

For transactions, the tables that contain aggregated data are:
- EVENTS_TRANSACTIONS_SUMMARY_BY_ACCOUNT_BY_EVENT_NAME
- EVENTS_TRANSACTIONS_SUMMARY_BY_HOST_BY_EVENT_NAME
- EVENTS_TRANSACTIONS_SUMMARY_BY_THREAD_BY_EVENT_NAME
- EVENTS_TRANSACTIONS_SUMMARY_BY_USER_BY_EVENT_NAME
- EVENTS_TRANSACTIONS_SUMMARY_GLOBAL_BY_EVENT_NAME
@verbatim
  transaction_locker(T, TX)
   |
   | [1]
   |
1a |-> pfs_thread(T).event_name(TX) =====>> [A], [B], [C], [D], [E]
   |    |
   |    | [2]
   |    |
   | 2a |-> pfs_account(U, H).event_name(TX) =====>> [B], [C], [D], [E]
   |    .    |
   |    .    | [3-RESET]
   |    .    |
   | 2b .....+-> pfs_user(U).event_name(TX) =====>> [C]
   |    .    |
   | 2c .....+-> pfs_host(H).event_name(TX) =====>> [D], [E]
   |    .    .    |
   |    .    .    | [4-RESET]
   | 2d .    .    |
1b |----+----+----+-> pfs_transaction_class(TX) =====>> [E]
   |
1c |-> pfs_thread(T).transaction_current(TX) =====>> [F]
   |
1d |-> pfs_thread(T).transaction_history(TX) =====>> [G]
   |
1e |-> transaction_history_long(TX) =====>> [H]
@endverbatim
Implemented as:
- PFS_account::aggregate_transactions()
- PFS_host::aggregate_transactions()
- [A] table_ets_by_thread_by_event_name::make_row()
- [B] table_ets_by_account_by_event_name::make_row()
- [C] table_ets_by_user_by_event_name::make_row()
- [D] table_ets_by_host_by_event_name::make_row()
- [E] table_ets_global_by_event_name::make_row()
- [F] table_events_transactions_current::rnd_next(), table_events_transactions_common::make_row()
- [G] table_events_transactions_history::rnd_next(), table_events_transactions_common::make_row()
- [H] table_events_transactions_history_long::rnd_next(), table_events_transactions_common::make_row()
@section IMPL_MEMORY Implementation for memory aggregates

For memory, there are no tables that contain individual event data.

For memory, the tables that contain aggregated data are:
- MEMORY_SUMMARY_BY_ACCOUNT_BY_EVENT_NAME
- MEMORY_SUMMARY_BY_HOST_BY_EVENT_NAME
- MEMORY_SUMMARY_BY_THREAD_BY_EVENT_NAME
- MEMORY_SUMMARY_BY_USER_BY_EVENT_NAME
- MEMORY_SUMMARY_GLOBAL_BY_EVENT_NAME
@verbatim
  memory_event(T, S)
   |
   | [1]
   |
1a |-> pfs_thread(T).event_name(S) =====>> [A], [B], [C], [D], [E]
   |    |
   |    | [2]
   |    |
   | 1+ | 2a |-> pfs_account(U, H).event_name(S) =====>> [B], [C], [D], [E]
   |    .    |
   |    .    | [3-RESET]
   |    .    |
   | 1+ | 2b .....+-> pfs_user(U).event_name(S) =====>> [C]
   |    .    |
   | 1+ | 2c .....+-> pfs_host(H).event_name(S) =====>> [D], [E]
   |    .    .    |
   |    .    .    | [4-RESET]
   | 2d .    .    |
1b |----+----+----+-> global.event_name(S) =====>> [E]
@endverbatim
Implemented as:
- carry_memory_stat_delta()
- PFS_account::aggregate_memory()
- PFS_host::aggregate_memory()
- [A] table_mems_by_thread_by_event_name::make_row()
- [B] table_mems_by_account_by_event_name::make_row()
- [C] table_mems_by_user_by_event_name::make_row()
- [D] table_mems_by_host_by_event_name::make_row()
- [E] table_mems_global_by_event_name::make_row()
@section IMPL_ERROR Implementation for error aggregates

For errors, there are no tables that contain individual event data.

For errors, the tables that contain aggregated data are:
- EVENTS_ERRORS_SUMMARY_BY_ACCOUNT_BY_ERROR
- EVENTS_ERRORS_SUMMARY_BY_HOST_BY_ERROR
- EVENTS_ERRORS_SUMMARY_BY_THREAD_BY_ERROR
- EVENTS_ERRORS_SUMMARY_BY_USER_BY_ERROR
- EVENTS_ERRORS_SUMMARY_GLOBAL_BY_ERROR
@verbatim
  error_event(T, S)
   |
   | [1]
   |
1a |-> pfs_thread(T).event_name(S) =====>> [A], [B], [C], [D], [E]
   |    |
   |    | [2]
   |    |
   | 2a |-> pfs_account(U, H).event_name(S) =====>> [B], [C], [D], [E]
   |    .    |
   |    .    | [3-RESET]
   |    .    |
   | 2b .....+-> pfs_user(U).event_name(S) =====>> [C]
   |    .    |
   | 2c .....+-> pfs_host(H).event_name(S) =====>> [D], [E]
   |    .    .    |
   |    .    .    | [4-RESET]
   | 2d .    .    |
1b |----+----+----+-> global.event_name(S) =====>> [E]
@endverbatim
Implemented as:
- PFS_account::aggregate_errors()
- PFS_host::aggregate_errors()
- [A] table_ees_by_thread_by_error::make_row()
- [B] table_ees_by_account_by_error::make_row()
- [C] table_ees_by_user_by_error::make_row()
- [D] table_ees_by_host_by_error::make_row()
- [E] table_ees_global_by_error::make_row()