WL#4895: PERFORMANCE_SCHEMA Instrumenting Table IO
Affects: Server-5.6
—
Status: Complete
Add the necessary instrumentation for tracking table input/output. Include row-level accesses to temporary/persistent base tables and indexes.
We will treat table io as a "wait" so data will appear in EVENTS_WAITS_CURRENT, in wait histories, and in wait summaries. The EVENT_NAME column will be "wait/io/table/..." -- compare "wait/io/file/...". The big change is that EVENTS_WAITS_CURRENT has two rows. Molecules and Atoms ------------------- Other waits (mutexes, file io, net io, memory io) are "atomic", that is, they do not include other waits, they do not overlap with other waits. Table io can be different; sometimes table io might include file io or memory movement (for example "update row" might be done with "read file" or might cause "memcpy"). We might logically conclude that table io is not class = wait, it is a separate class in the hierarchy statements -> stages -> table io -> waits. But that would significantly increase the number of performance_schema tables, and the amount of nesting. We want to avoid such trouble for something that's after all nearly atomic. We'll use the word "molecular" for events that are low level like this. Table io is a molecular wait. Possibly global read lock is also molecular. Molecular events don't get their own set of tables, but they do get their own rows in EVENTS_WAITS_CURRENT. Therefore EVENTS_WAITS_CURRENT will usually have two rows: #1 The latest wait event of any kind #2 The latest table io wait event (may be same as #1). For example, suppose that there is a 'row fetch' which causes a 'file read'. EVENTS_WAITS_CURRENT might look like this: Row# EVENT_NAME TIMER_START TIMER_END ---- ---------- ----------- --------- 1 wait/io/file/myisam/dfile 10001 10002 2 wait/io/table/sql/handler 10000 NULL In this example a table io event started before a file io event and has not finished, so it's clear that the file io event is "nested" in the table io. The table io event will not be discarded from EVENTS_WAITS_CURRENT until it has finished. In all other respects, table io treatment is the same as for other waits. What to instrument ------------------ Operations that affect whole tables -- CREATE, ALTER, DROP -- are visible as statements, so there's no need to monitor them separately. Some operations that affect blocks -- "block read", "block write" -- are visible as file io, so there's no need to monitor them separately. Other operations that affect blocks -- marking "has been read" or "will be flushed" -- take no time and are therefore not appropriate for instrumenting. Operations that affect data rows are: 'fetch' (or 'select'), 'insert' (or 'write') 'update', 'delete'. We don't consider 'lock' to be a table io operation. We don't include 'binlog' which shows up as file io. If an insert triggers an insert, that's two operations. If an update causes a cascade, that's two operations. In another worklog task we say 'fetch' is a 'stage'. [ Note added 2010-07-19: an update which causes a cascade via an InnoDB foreign key is not seen as two operations because our monitoring is only in the server. Okay. ] Instrumenting in handler.cc --------------------------- The Performance Schema instrumentation will be in sql/handler.cc, in ha_write_row, ha_update_row, and ha_delete_row. For example, the current mysql-next-mr code for handler.cc is: " int handler::ha_write_row(uchar *buf) { int error; Log_func *log_func= Write_rows_log_event::binlog_row_logging_function; DBUG_ENTER("handler::ha_write_row"); MYSQL_INSERT_ROW_START(...); /* DTrace instrument start */ mark_trx_read_write(); error= write_row(buf); <-- REPLACE THIS WITH INSTRUMENTING CALL MYSQL_INSERT_ROW_DONE(...); /* DTrace instrument end */ if (unlikely(error != 0)) DBUG_RETURN(error); if (unlikely(error= binlog_log_row(table, 0, buf, log_func))) DBUG_RETURN(error); /* purecov: inspected */ DBUG_RETURN(0); } " We will replace "write_row(buf)" with something like "start_table_wait ... write_row(buf) ... end_table_wait". We will do the same in ha_update_row and ha_delete_row. This means we have row io instrumentation in one place, rather than one place per storage engine. This also means we are instrumenting inside DTrace instrumentation. Peter had been thinking that it would be good to instrument in the same places as DTrace so we could compare our results to DTrace results when measuring the same things. But that's unimportant. Mikael Ronström says ""... we have decided that the Performance Schema probes will be inside of the start and end DTrace probes ... The reason I think was that DTrace probes has unpredictable timing since user code can be added there while Performance Schema code is predictable since no user code can be added to it." This also means that we are instrumenting outside innodb_metrics instrumentation, which is done at a deeper level. Sergei Golubchik thinks instrumenting in handler.cc is right. Marc Alff sees nothing wrong with it. However, the only operations are 'insert' (i.e. write), 'update', 'delete'. And only for individual rows. Instrumenting bulk io (Deferred) -------------------------------- This section has been moved to WL#5372 PERFORMANCE_SCHEMA Instrumenting Table IO (Deferred). Instrumenting 'fetch' (Deferred) -------------------------------- This section has been moved to WL#5372 PERFORMANCE_SCHEMA Instrumenting Table IO (Deferred). Indexing index operations (Deferred) ------------------------------------ This section has been moved to WL#5372 PERFORMANCE_SCHEMA Instrumenting Table IO (Deferred). EVENTS_WAITS_CURRENT -------------------- We can find table io individual event information with: SELECT * FROM EVENTS_WAITS_CURRENT; The columns of EVENTS_WAITS_CURRENT are as described for WL#2360, and we include the ones that will appear for WL#4678. Some columns might be NULL for users with insufficient privileges. THREAD_ID INTEGER Same as for mutexes. EVENT_ID BIGINT Same as for mutexes. EVENT_NAME VARCHAR(128) See below. SOURCE VARCHAR(64) Same as for mutexes. TIMER_START BIGINT Same as for mutexes. TIMER_END BIGINT Same as for mutexes. TIMER_WAIT BIGINT Same as for mutexes. SPINS INTEGER See below. OBJECT_SCHEMA VARCHAR(64) See below. OBJECT_NAME VARCHAR(512) See below. OBJECT_TYPE VARCHAR(64) See below. OBJECT_INSTANCE_BEGIN BIGINT See below. NESTING_EVENT_ID BIGINT Same as for mutexes. OPERATION VARCHAR(16) See below. NUMBER_OF_BYTES BIGINT See below. FLAGS INTEGER NULL. See below. EVENT_NAME. Following the "taxonomy" convention of WL#2360: Class: wait Order io Family table Genus sql Species handler Knowing that 'table io' is a subclass of 'io' which is a subclass of 'wait', uncontroversially class/order/family = 'wait/io/table'. The genus and species are arbitrary, they're based on the fact that the current instrument is in sql/handler.cc. If we were instrumenting some other method for a subclass of handler, it would still be appropriate to have the same event name. This does mean that when one says "GROUP BY EVENT_NAME" one will get all events (fetch, insert, update, delete, index read, etc.) in the same group, thus the same row. SPINS. NULL for a typical table-io operation, but could be the number of an index if later we instrument certain index accesses. Certainly SPINS is not an appropriate column name for an index number, so we are also considering using OBJECT_INSTANCE_BEGIN for this purpose, or a new column. OBJECT_SCHEMA. The name of the schema (database) that contains the table or index. OBJECT_NAME: The name of the table. Not a correlation name, synonym, or view name. It would be desirable to specify that it may not be a partition name or tablespace name, but we will say that's implementor-defined. Not an index name, even if the operation affects an index. OBJECT_TYPE. 'TABLE' (persistent base table), or 'TEMPORARY TABLE'. Even when we support indexes, we'll say 'TABLE' because index operations will be visible as "index fetch" in the OPERATION column. OBJECT_INSTANCE_BEGIN could be a ROWID, if there is a BIGINT identifier known at the time of the table io. Otherwise it's NULL. Putting "Block number" or "Offset within block" could be considered. However, we will say this is implementor-defined. It can be NULL. NESTING_EVENT_ID. Originally the wording here was: "Table io is nested within stages which are nested within statements. Undecided detail: should we indicate that an atomic wait (e.g. file io) is nested within a table io?" Now see the later section "NESTING_EVENT_ID (Deferred)". OPERATION is 'insert' or 'update' or 'delete'; later 'fetch' will be possible. In summaries a 'fetch' should be counted as a 'read', and any other operation should be counted as a 'write'. NUMBER_OF_BYTES should be known. If a row has 17 bytes including the size of a header, NUMBER_OF_BYTES = 17. It might be unknown when the event starts, so often NUMBER_OF_BYTES is NULL. FLAGS. Provided that there is no significant overhead, this will have things like "index read was done in KEYREAD mode". Here is Sergei Golubchik's opinion: " Different extra hints are important. At least these: HA_EXTRA_CACHE, HA_EXTRA_KEYREAD, HA_EXTRA_KEY_CACHE, HA_EXTRA_WRITE_CACHE, HA_EXTRA_DELETE_CANNOT_BATCH, HA_EXTRA_UPDATE_CANNOT_BATCH, but [... maybe more]. The above are values from enum ha_extra_function. For flags we'll probably need to remove HA_EXTRA_ prefix or rename them completely, using names clearer to the end user. They're important, because they affect how the read, for example, or delete is performed, and affect the spead. That is, without knowing these flags the row (in the waits table) cannot be interpreted correctly. " Meanwhile Marc Alff has proposed that this FLAGS column could have data type SET('cache','keyread','...'). Unfortunately the meaning of the bits can depend on the event's genus or operation, but Peter Gulutzan thinks we should try to work out something that's readable. That's for "later": he likes to think of cosmetic enhancements as paint, and we should paint after all the walls of the structure are up. We will consider FLAGS content to be implementor-defined. OTHER POSSIBLE COLUMNS: If we can guess that we will need to read 1000 rows, and we know that we're on the 500th fetch, that's useful information for a progress indicator. But perhaps it's more useful for a stage. Generally we do not want to include metadata in performance schema tables. So if you want to find out (for example) what the tablespace is, look elsewhere. NESTING_EVENT_ID (Deferred) --------------------------- This section has been moved to WL#5372 PERFORMANCE_SCHEMA Instrumenting Table IO (Deferred). Summaries --------- The history and summary tables EVENTS_WAITS_HISTORY, EVENTS_WAITS_HISTORY_LONG, EVENTS_SUMMARY_BY_THREAD..., and so on are along the same lines as for other events. There will not be a summary by table. That is part of a different task which comes later. See "Object Summaries" in WL#4816. Double Timing (Deferred) ------------------------ By default, in SETUP_INSTRUMENTS, we have rows like: NAME ENABLED TIMED ---- ------- ----- wait/io/table/sql/handler YES YES That is: table io should be instrumented and timed. With some storage engines table io is little more than a memory copy, so the instrumentation might take more time than the operation. The only slow operations will be due to file io, so we'll see this in EVENTS_WAITS_HISTORY: EVENT_NAME OPERATION TIMER_START TIMER_END TIMER_WAIT ---------- --------- ----------- --------- ---------- wait/io/table.... fetch 100 200 100 wait/io/file... read 101 199 98 In this example the file io starts after the table io starts, and the file io ends before the table io ends. So statements like SELECT SUM(TIMER_WAIT) FROM EVENTS_WAITS_HISTORY; are misleading, they count the same thing twice. This does not affect EVENTS_WAITS_SUMMARY_BY_EVENT_NAME or other aggregation which keep summaries separated. Some possible solutions (pick one): 1. Recommend to users that they should set TIMED='NO' for table instruments. This will cause TIMER_START and TIMER_END and TIMER_WAIT values to be NULL or zero. 2. Add a new column TIMER_WAIT_OF_NESTED_EVENTS. This goes up whenever a nested event finishes. The value in a table io TIMER_WAIT column is calculated then as (TIMER_END - TIMER_START) - TIMER_WAIT_OF_NESTED_EVENTS. 3. When calculating a global aggregate, rather than using a simple SUM(), add only the events which have no double counting. That is, skip wait events which are flagged as "nested". This requires a new column. Voting so far: Sergei Golubchik prefers "3". This matter has not been resolved, so for now we will ignore table io in all timing aggregations. That is, SUM_TIMER_WAIT and MIN_TIMER_WAIT and MAX_TIMER_WAIT and AVG_TIMER_WAIT do not include timer information from anything in the wait/io/table family. We will have to resolve this in a later worklog task. Global row io counters ---------------------- We want to be compatible with existing counters that track global io. At least they might be useful for checking validity of our results. But if our instrumentation is in a different place, our numbers will differ. Slow log: ROWS_EXAMINED. SHOW STATUS: Innodb_rows_read, Innodb_rows_updated, Key_reads, Key_writes. Google's SHOW ... STATISTICS patch: number of rows fetched per index, number of rows fetched or changed per table. DTrace: see source code, and MySQL Reference Manual "Row-Level Probes" http://dev.mysql.com/doc/refman/5.5/en/dba-dtrace-ref-rowlevel.html innodb_metrics: unknown Example from EVENTS_WAITS_CURRENT --------------------------------- THREAD_ID: 2 EVENT_ID: 815 EVENT_NAME: wait/io/table/sql/handler SOURCE: handler.cc:5509 TIMER_START: 405354593873242 TIMER_END: 405354595869556 TIMER_WAIT: 1996314 SPINS: NULL OBJECT_SCHEMA: test OBJECT_NAME: t1 OBJECT_TYPE: TABLE OBJECT_INSTANCE_BEGIN: 200 NESTING_EVENT_ID: NULL OPERATION: insert NUMBER_OF_BYTES: NULL FLAGS: 0 References ---------- dev-private thread "WL#2360 Performance Schema, Instrumenting Table IO" starting with [mysql intranet]/secure/mailarchive/mail.php?folder=4&mail=28725 dev-private thread "WL#4895 PERFORMANCE_SCHEMA Instrumenting Table IO" attached file containing most-relevant-looking emails up to 2010-05-04
The code in bzr mysql-next-mr-perfschema contains the low-level structures and main algorithms, commented and set apart from other code with if/endif, to support the situation as described in the HLS. That suffices as an LLD for a task of this nature.
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.