WL#4895: PERFORMANCE_SCHEMA Instrumenting Table IO

Affects: Server-5.6   —   Status: Complete

Add the necessary instrumentation for tracking
table input/output. Include row-level accesses
to temporary/persistent base tables and indexes.
We will treat table io as a "wait" so data will
appear in EVENTS_WAITS_CURRENT, in wait histories,
and in wait summaries. The EVENT_NAME column will be
"wait/io/table/..." -- compare "wait/io/file/...".

The big change is that EVENTS_WAITS_CURRENT has two rows.

Molecules and Atoms
-------------------

Other waits (mutexes, file io, net io, memory io) are
"atomic", that is, they do not include other waits,
they do not overlap with other waits. Table io can
be different; sometimes table io might include file io
or memory movement (for example "update row" might
be done with "read file" or might cause "memcpy").

We might logically conclude that table io is not
class = wait, it is a separate class in the hierarchy
statements -> stages -> table io -> waits.
But that would significantly increase the number of
performance_schema tables, and the amount of nesting.
We want to avoid such trouble for something that's
after all nearly atomic. We'll use the word "molecular"
for events that are low level like this. Table io is
a molecular wait. Possibly global read lock is also molecular.

Molecular events don't get their own set of tables,
but they do get their own rows in EVENTS_WAITS_CURRENT.
Therefore EVENTS_WAITS_CURRENT will usually have two rows:
#1 The latest wait event of any kind
#2 The latest table io wait event (may be same as #1).

For example, suppose that there is a 'row fetch' which
causes a 'file read'. EVENTS_WAITS_CURRENT might look like this:

Row# EVENT_NAME                 TIMER_START TIMER_END
---- ----------                 ----------- ---------
   1 wait/io/file/myisam/dfile        10001 10002
   2 wait/io/table/sql/handler        10000 NULL

In this example a table io event started before a
file io event and has not finished, so it's clear
that the file io event is "nested" in the table io.

The table io event will not be discarded from
EVENTS_WAITS_CURRENT until it has finished.

In all other respects, table io treatment is the
same as for other waits.

What to instrument
------------------

Operations that affect whole tables -- CREATE,
ALTER, DROP -- are visible as statements, so
there's no need to monitor them separately.

Some operations that affect blocks -- "block read",
"block write" -- are visible as file io, so
there's no need to monitor them separately.
Other operations that affect blocks -- marking
"has been read" or "will be flushed" -- take no
time and are therefore not appropriate for
instrumenting.

Operations that affect data rows are:
'fetch' (or 'select'), 'insert' (or 'write')
'update', 'delete'.
We don't consider 'lock' to be a table io operation.
We don't include 'binlog' which shows up as file io.
If an insert triggers an insert, that's two operations.
If an update causes a cascade, that's two operations.
In another worklog task we say 'fetch' is a 'stage'.

[ Note added 2010-07-19: an update which causes a cascade
via an InnoDB foreign key is not seen as two operations
because our monitoring is only in the server. Okay. ]

Instrumenting in handler.cc
---------------------------

The Performance Schema instrumentation will be
in sql/handler.cc, in ha_write_row, ha_update_row,
and ha_delete_row. For example, the current
mysql-next-mr code for handler.cc is:
"
int handler::ha_write_row(uchar *buf)
{
  int error;
  Log_func *log_func= Write_rows_log_event::binlog_row_logging_function;
  DBUG_ENTER("handler::ha_write_row");

  MYSQL_INSERT_ROW_START(...); /* DTrace instrument start */
  mark_trx_read_write();
  error= write_row(buf);       <-- REPLACE THIS WITH INSTRUMENTING CALL
  MYSQL_INSERT_ROW_DONE(...);  /* DTrace instrument end */
  
  if (unlikely(error != 0))
    DBUG_RETURN(error);
  if (unlikely(error= binlog_log_row(table, 0, buf, log_func)))
    DBUG_RETURN(error); /* purecov: inspected */
  DBUG_RETURN(0);
}
"
We will replace "write_row(buf)" with something like
"start_table_wait ... write_row(buf) ... end_table_wait".
We will do the same in ha_update_row and ha_delete_row.

This means we have row io instrumentation in
one place, rather than one place per storage engine.

This also means we are instrumenting inside DTrace
instrumentation. Peter had been thinking that it would
be good to instrument in the same places as DTrace so we
could compare our results to DTrace results when
measuring the same things. But that's unimportant.
Mikael Ronström says ""... we have decided that the
Performance Schema probes will be inside of the start
and end DTrace probes ... The reason I think was that
DTrace probes has unpredictable timing since user code
can be added there while Performance Schema code is
predictable since no user code can be added to it." 

This also means that we are instrumenting outside
innodb_metrics instrumentation, which is done at a
deeper level.

Sergei Golubchik thinks instrumenting in handler.cc is right.
Marc Alff sees nothing wrong with it.

However, the only operations are 'insert' (i.e. write),
'update', 'delete'. And only for individual rows.

Instrumenting bulk io (Deferred)
--------------------------------

This section has been moved to
WL#5372 PERFORMANCE_SCHEMA Instrumenting Table IO (Deferred).

Instrumenting 'fetch' (Deferred)
--------------------------------

This section has been moved to
WL#5372 PERFORMANCE_SCHEMA Instrumenting Table IO (Deferred).

Indexing index operations (Deferred)
------------------------------------

This section has been moved to
WL#5372 PERFORMANCE_SCHEMA Instrumenting Table IO (Deferred).

EVENTS_WAITS_CURRENT
--------------------

We can find table io individual event information with:
SELECT * FROM EVENTS_WAITS_CURRENT;
The columns of EVENTS_WAITS_CURRENT are as described for WL#2360,
and we include the ones that will appear for WL#4678.
Some columns might be NULL for users with insufficient privileges.

THREAD_ID             INTEGER         Same as for mutexes.
EVENT_ID              BIGINT          Same as for mutexes.
EVENT_NAME            VARCHAR(128)    See below.
SOURCE                VARCHAR(64)     Same as for mutexes.
TIMER_START           BIGINT          Same as for mutexes.
TIMER_END             BIGINT          Same as for mutexes.
TIMER_WAIT            BIGINT          Same as for mutexes.
SPINS                 INTEGER         See below.
OBJECT_SCHEMA         VARCHAR(64)     See below.
OBJECT_NAME           VARCHAR(512)    See below.
OBJECT_TYPE           VARCHAR(64)     See below.
OBJECT_INSTANCE_BEGIN BIGINT          See below.
NESTING_EVENT_ID      BIGINT          Same as for mutexes.
OPERATION             VARCHAR(16)     See below.
NUMBER_OF_BYTES       BIGINT          See below.
FLAGS                 INTEGER         NULL. See below.

EVENT_NAME. Following the "taxonomy" convention of WL#2360:
Class:     wait
Order      io
Family     table
Genus      sql
Species    handler
Knowing that 'table io' is a subclass of
'io' which is a subclass of 'wait', uncontroversially
class/order/family = 'wait/io/table'.
The genus and species are arbitrary, they're based
on the fact that the current instrument is in
sql/handler.cc. If we were instrumenting some
other method for a subclass of handler, it would
still be appropriate to have the same event name.
This does mean that when one says "GROUP BY EVENT_NAME"
one will get all events (fetch, insert, update, delete,
index read, etc.) in the same group, thus the same row.

SPINS. NULL for a typical table-io operation, but could
be the number of an index if later we instrument certain
index accesses. Certainly SPINS is not an appropriate column
name for an index number, so we are also considering using
OBJECT_INSTANCE_BEGIN for this purpose, or a new column.

OBJECT_SCHEMA. The name of the schema (database) that
contains the table or index.

OBJECT_NAME: The name of the table. Not a correlation name,
synonym, or view name. It would be desirable to specify
that it may not be a partition name or
tablespace name, but we will say that's implementor-defined.
Not an index name, even if the operation affects an index.

OBJECT_TYPE. 'TABLE' (persistent base table), or
'TEMPORARY TABLE'. Even when we support indexes,
we'll say 'TABLE' because index operations will
be visible as "index fetch" in the OPERATION column.

OBJECT_INSTANCE_BEGIN could be a ROWID, if there is a
BIGINT identifier known at the time of the table io.
Otherwise it's NULL. Putting "Block number" or "Offset
within block" could be considered. However, we will
say this is implementor-defined. It can be NULL.

NESTING_EVENT_ID. Originally the wording here was:
"Table io is nested within stages
which are nested within statements. Undecided detail:
should we indicate that an atomic wait (e.g.
file io) is nested within a table io?"
Now see the later section "NESTING_EVENT_ID (Deferred)".

OPERATION is 'insert' or 'update' or 'delete'; later
'fetch' will be possible. In summaries a 'fetch' should
be counted as a 'read', and any other operation should
be counted as a 'write'.

NUMBER_OF_BYTES should be known. If a row has 17 bytes
including the size of a header, NUMBER_OF_BYTES = 17.
It might be unknown when the event starts, so often
NUMBER_OF_BYTES is NULL.

FLAGS. Provided that there is no significant overhead,
this will have things like "index read was done in
KEYREAD mode". Here is Sergei Golubchik's opinion:
"
Different extra hints are important. At least these:
HA_EXTRA_CACHE, HA_EXTRA_KEYREAD, HA_EXTRA_KEY_CACHE,
HA_EXTRA_WRITE_CACHE, HA_EXTRA_DELETE_CANNOT_BATCH,
HA_EXTRA_UPDATE_CANNOT_BATCH, but [... maybe more].
The above are values from enum ha_extra_function.
For flags we'll probably need to remove HA_EXTRA_ prefix or rename them
completely, using names clearer to the end user.
They're important, because they affect how the read, for example, or
delete is performed, and affect the spead. That is, without knowing
these flags the row (in the waits table) cannot be interpreted
correctly.
"
Meanwhile Marc Alff has proposed that this FLAGS column could
have data type SET('cache','keyread','...').
Unfortunately the meaning of the bits can depend
on the event's genus or operation, but Peter Gulutzan
thinks we should try to work out something that's readable.
That's for "later": he likes to think of cosmetic
enhancements as paint, and we should paint after
all the walls of the structure are up.
We will consider FLAGS content to be implementor-defined.

OTHER POSSIBLE COLUMNS: If we can guess that we will
need to read 1000 rows, and we know that we're on the
500th fetch, that's useful information for a progress
indicator. But perhaps it's more useful for a stage.

Generally we do not want to include metadata in
performance schema tables. So if you want to find
out (for example) what the tablespace is, look elsewhere.

NESTING_EVENT_ID (Deferred)
---------------------------

This section has been moved to
WL#5372 PERFORMANCE_SCHEMA Instrumenting Table IO (Deferred).

Summaries
---------

The history and summary tables EVENTS_WAITS_HISTORY,
EVENTS_WAITS_HISTORY_LONG, EVENTS_SUMMARY_BY_THREAD...,
and so on are along the same lines as for other events.

There will not be a summary by table. That is
part of a different task which comes later. See "Object
Summaries" in WL#4816.

Double Timing (Deferred)
------------------------

By default, in SETUP_INSTRUMENTS, we have rows like:
  NAME                       ENABLED TIMED
  ----                       ------- -----
  wait/io/table/sql/handler  YES     YES
That is: table io should be instrumented and timed.

With some storage engines table io is little more
than a memory copy, so the instrumentation might
take more time than the operation. The only slow
operations will be due to file io, so we'll see
this in EVENTS_WAITS_HISTORY:
  EVENT_NAME        OPERATION TIMER_START  TIMER_END TIMER_WAIT
  ----------        --------- -----------  --------- ----------
  wait/io/table....     fetch         100        200        100
  wait/io/file...       read          101        199         98
In this example the file io starts after the table io starts,
and the file io ends before the table io ends. So statements like
SELECT SUM(TIMER_WAIT) FROM EVENTS_WAITS_HISTORY;
are misleading, they count the same thing twice.

This does not affect EVENTS_WAITS_SUMMARY_BY_EVENT_NAME
or other aggregation which keep summaries separated.

Some possible solutions (pick one):

1. Recommend to users that they should set TIMED='NO'
for table instruments. This will cause TIMER_START and
TIMER_END and TIMER_WAIT values to be NULL or zero.

2. Add a new column TIMER_WAIT_OF_NESTED_EVENTS. This
goes up whenever a nested event finishes. The value in
a table io TIMER_WAIT column is calculated then as
(TIMER_END - TIMER_START) - TIMER_WAIT_OF_NESTED_EVENTS.

3. When calculating a global aggregate, rather than
using a simple SUM(), add only the events which have
no double counting. That is, skip wait events which
are flagged as "nested". This requires a new column.

Voting so far:
Sergei Golubchik prefers "3".

This matter has not been resolved, so for now we will
ignore table io in all timing aggregations. That is,
SUM_TIMER_WAIT and MIN_TIMER_WAIT and MAX_TIMER_WAIT
and AVG_TIMER_WAIT do not include timer information
from anything in the wait/io/table family. We will
have to resolve this in a later worklog task.

Global row io counters
----------------------

We want to be compatible with existing counters
that track global io. At least they might be
useful for checking validity of our results.
But if our instrumentation is in a different
place, our numbers will differ.

Slow log: ROWS_EXAMINED.

SHOW STATUS: Innodb_rows_read, Innodb_rows_updated,
Key_reads, Key_writes.

Google's SHOW ... STATISTICS patch:
number of rows fetched per index,
number of rows fetched or changed per table.

DTrace: see source code, and MySQL Reference Manual "Row-Level Probes"
http://dev.mysql.com/doc/refman/5.5/en/dba-dtrace-ref-rowlevel.html

innodb_metrics: unknown

Example from EVENTS_WAITS_CURRENT
---------------------------------

THREAD_ID: 2
EVENT_ID: 815
EVENT_NAME: wait/io/table/sql/handler
SOURCE: handler.cc:5509
TIMER_START: 405354593873242
TIMER_END: 405354595869556
TIMER_WAIT: 1996314
SPINS: NULL
OBJECT_SCHEMA: test
OBJECT_NAME: t1
OBJECT_TYPE: TABLE
OBJECT_INSTANCE_BEGIN: 200
NESTING_EVENT_ID: NULL
OPERATION: insert    
NUMBER_OF_BYTES: NULL
FLAGS: 0

References
----------

dev-private thread "WL#2360 Performance Schema, Instrumenting Table IO"
starting with
[mysql intranet]/secure/mailarchive/mail.php?folder=4&mail=28725

dev-private thread "WL#4895 PERFORMANCE_SCHEMA Instrumenting Table IO"

attached file containing most-relevant-looking emails up to 2010-05-04
The code in bzr mysql-next-mr-perfschema contains the low-level
structures and main algorithms, commented and set
apart from other code with if/endif, to support the
situation as described in the HLS.
That suffices as an LLD for a task of this nature.