WL#4678: PERFORMANCE_SCHEMA Instrumenting File IO

Affects: Server-5.5   —   Status: Complete   —   Priority: Very High

A certain group of people said that "file io" should have
priority as the next thing to instrument in performance schema.
[mysql intranet]/secure/mailarchive/mail.php?folder=211&mail=2942

We will add the necessary instrumentation and PERFORMANCE_SCHEMA tables.
Names are "io" and "file"
-------------------------

The proper name is "io". We reject the other suggested terms
"IO", "i/o", "input/output". We don't want to use /s in names
since we use / as a separator. We don't want to use upper case
because we've chosen to use lower case for other things.

The proper name is "file". We reject the other suggested terms
"disk", "permanent storage". We don't want "disk" because the
data storage device might be a solid-state drive, and because
we don't really know whether file io causes disk io, or vice
versa (we don't monitor disk io caused by memory mapping here).
We don't want "permanent storage" because it's an uncommon term.

Thus we preferred lower case rather than upper case.
And we decided that "io" is a common abbreviation which everyone knows.
And we avoided using a punctuation character inside a name.
And "file io" is the name we started with.

Class, Order, Family, and Genus
-------------------------------

The class/order/family/genus hierarchy is used for the file event's
EVENT_NAME and for SETUP_INSTRUMENTS name.

Recall that a typical WAIT looks like this:
'Wait/Synch/MUTEX/LOCK_plugin'.
Class = Wait.
 Order = Synch (Synchronization Object).
  Family = Mutex.
   Genus = LOCK_plugin.

For File IO, we still have a structured name.
We didn't agree how it's broken down, and what's in it.
Peter Gulutzan suggested 'Disk' is an order within the Wait class
Sergei Golubchik disagrees, or at least he thinks Peter can be illogical.
Anyway, it won't be 'Disk', and there didn't seem to be a reason to go
for some other way, so perhaps that disagreement is now a moot point.

The choices suggested were:
The class is 'io', the order is 'file', then come lower levels.
The class is 'file', then come lower levels.
The class is 'wait', the order is 'file', then come lower levels.
The class is 'wait', the order is 'io', then come lower levels.
Without consensus, but without fierce disagreement, we pick the last choice.

Class = wait.
  Order = io.
    Family = file
      Genus = plugin-name e.g. 'maria', or 'sql', or 'mysys'
        Lowest Level = object type

Example: " wait/io/file/sql/master_info".


Class = 'wait':
We pick 'wait' for events that the server might 'wait' for.
So 'io' is in the same class as 'synch'.

Order = 'io':
We pick 'io' for events that involve 'io'.
Types of io are 'net' and 'file'.

Family = 'file':
We pick 'file' for events that involve 'the device that we use
for permanent storage (commonly the same as disk)', which is
normally formatted into files (but the category is still file
even for raw disk io).

Genus = plugin-name or 'sql' or 'mysys':
Examples of plugin-name are 'archive', 'blackhole', 'csv',
'example', 'falcon', 'federated', 'heap', 'innobase',
'maria', 'myisam', 'myisammrg', 'ndb' -- these happen to
be names of directories underneath /storage in MySQL source.
The other possibilities are 'sql' and 'mysys' -- these happen
to be names of top-level directories in MySQL source.

Lowest level = object type:
For example:
"tablespace"                anything made with CREATE TABLESPACE
"index"                     as in a MyISAM .MYI file
"data"                      as in a MyISAM .MYD file
"frm"                       as in a format (.frm) file
"backup"                    object of a BACKUP or RESTORE statement
"log"                       MySQL server binlog
"error log"                 MySQL logs that aren't binlogs
"temporary"                 any temporary file
"outfile"                   object of an OUTFILE clause
"table partition"           not the same as "index partition"
"other"                     what Peter forgot, e.g. general log
"unknown"                   not NULL becaue it's a "known unknown"
Notice that we don't mind if there are spaces, we don't use '_'s.
There may be more object types in some storage engines.

There is no lower level for 'read', 'write', etc.

Examples of names
-----------------

These examples come from the current mysql-6.0-perf implementation.
There are here for illustration. This section is not a compulsory
part of the specification.

 wait/io/file/sql/ERRMSG
 wait/io/file/sql/FRM
 wait/io/file/sql/Init
 wait/io/file/sql/file_parser
 wait/io/file/sql/DbOpt
 wait/io/file/sql/binlog
 wait/io/file/sql/binlog_index
 wait/io/file/sql/log
 wait/io/file/sql/backuplog
 wait/io/file/sql/tclog
 wait/io/file/sql/LOAD_FILE
 wait/io/file/sql/log_event_data
 wait/io/file/sql/log_event_info
 wait/io/file/sql/master_info
 wait/io/file/sql/send_file
 wait/io/file/sql/relay_log_info
 wait/io/file/sql/des_key_file
 wait/io/file/sql/partition
 wait/io/file/sql/load
 wait/io/file/sql/NDB
 wait/io/file/sql/case_test
 wait/io/file/sql/pid
 wait/io/file/sql/map
 wait/io/file/sql/partition_syntax
 wait/io/file/sql/global_ddl_log
 wait/io/file/mysys/proc_meminfo
 wait/io/file/mysys/cnf
 wait/io/file/mysys/charset
 wait/io/file/myisam/backup_log
 wait/io/file/myisam/dfile
 wait/io/file/myisam/kfile
 wait/io/file/tina/metadata
 wait/io/file/tina/data
 wait/io/file/tina/update
 wait/io/file/maria/Control
 wait/io/file/maria/key_file
 wait/io/file/maria/data_file
 wait/io/file/maria/data_tmp
 wait/io/file/maria/index_tmp
 wait/io/file/maria/iext
 wait/io/file/maria/LogDescriptorDirectory
 wait/io/file/maria/recovery_trace

EVENTS_WAITS_CURRENT
--------------------

We can find file io individual event information with:
SELECT * FROM EVENTS_WAITS_CURRENT;
The columns of EVENTS_WAITS_CURRENT are as described for WL#2360,
but there are new ones at the end which are specifically for file io.

THREAD_ID             INTEGER         Same as for mutexes.
EVENT_ID              BIGINT          Same as for mutexes.
EVENT_NAME            VARCHAR(128)    See below.
SOURCE                VARCHAR(64)     Same as for mutexes.
TIMER_START           BIGINT          Same as for mutexes.
TIMER_END             BIGINT          Same as for mutexes.
TIMER_WAIT            BIGINT          Same as for mutexes.
SPINS                 INTEGER         NULL
OBJECT_SCHEMA         VARCHAR(64)     NULL
OBJECT_NAME           VARCHAR(512)    See below.
OBJECT_TYPE           VARCHAR(64)     NULL
OBJECT_INSTANCE_BEGIN BIGINT          See below.
NESTING_EVENT_ID      BIGINT          Same as for mutexes.
OPERATION             VARCHAR(16)     New. See below.
NUMBER_OF_BYTES       BIGINT          New. See below.
FLAGS                 INTEGER         New. See below.

OBJECT_NAME: This has the file name. (We originally picked
OBJECT_INSTANCE_BEGIN for WL#2360 but we are changing it.)
We acknowledge that VARCHAR(64) is too small, this is now adjusted.
This always means "the file name as we know it in the table".
It is not the ultimate file that exists if we follow a symlink.
It is the encoded file name, we don't show an object's unencoded name.
For any event which does not have a file name, for example when
InnoDB tablespace is a raw device, the value of this column is NULL.
In other contexts this column could be a hex representation of an
address. In this case "file io" is just a pointer in memory, so
declaring it VARCHAR(512) won't add much.

OBJECT_INSTANCE_BEGIN: This is the offset within a file, when
that information is knowable without extra trouble. Thus, if
operation='seek', we will have the argument of the seek here.
We expect that for most operations the value will be NULL.

OPERATION: This is usually a verb that looks like a C API function name.
Rules:
* Generic names with no variants that depend on library or operating system
* Anything that relates to "disk" file activity, not just IO
* Nothing that relates to special situations like mmap or virtual
* Direct non-file action e.g. 'reformat floppy' will be added when needed.
The possible values are:
"open"
"seek"       including fsetpos and rewind
"read"
"write"      including fputc
"close"
"create"
"delete"
"mkdir"     prefer unixish name even if it's Windows CreateDirectory
"rmdir"     prefer unixish name even if it's Windows RemoveDirectory
"write"
"tell"       including fgetpos
"rename"
Other posssible names for this column: INTENT, ACTIVITY, ACTION.
Let's suppose there's an odd case: instead of forming a 100-byte string
and writing it, the programmer wrote a loop which says putc 100 times.
Do we really want to monitor each putc? Peter says "yes".

NUMBER_OF_BYTES: Initially (before the instrumented code completes)
this may be the "requested" number of bytes, or zero, or NULL.
After the instrumented code completes, it is the "actual" number
of bytes. If the byte count is irrelevant, then this is NULL.

FLAGS: If operation is 'open' and there is an O_DIRECT flag,
then we will save a flag value here. For other operations expect zero.
For most operations, the SOURCE column tells you where to look in the
source code.

OTHER POSSIBLE COLUMNS: We decided we don't need a column for "error
return" -- if there's a significant error, then the server will
stop and display an error message, so this would be redundant.

File Name
---------

The only item that seems troublesome is the file name.
Of course it has 'file name' when you select from the table.
But we're expecting that the source is a pointer to a string.
A read/write function call doesn't provide that automatically.
So Marc maintains a table which associates file-descriptor-pointer or
file-number with name. This table must, alas, be controlled by a
mutex at the moment. But Sergei has a lock-free hash table implementation,
and Marc intends to use that eventually. Marc is confident that
there will be no need for mutexes or similar synchronization
protections, and is confident that lookups cause no significant
overhead.

History and Event Summary Tables
--------------------------------

The history and summary tables are exactly the same as,
or equivalent to, EVENTS_WAITS_HISTORY, EVENTS_WAITS_SUMMARY_BY_THREAD,
EVENTS_WAITS_SUMMARY_GLOBAL, etc.

There is no change for the description of EVENTS_WAITS_SUMMARY tables.
The io events are summarized just like the mutex events.
The new columns in EVENTS_WAITS_CURRENT (FILE_NAME and OPERATION and
NUMBER_OF_BYTES) are not grouped or summarized in these tables.

Not all higher level event summary tables are done via this worklog task.

Object Summaries
----------------

Originally we felt that there could be no object summaries for files.
For reasons, see "[ OBJECT SUMMARIES ]" in WL#4816.
But see the suggestion for SETUP_OBJECTS, also in WL#4816.

WL#4816 also mentions WAIT_IO_FILE_READ_COUNT_STAR etc.

Well, we do have object summaries: FILE_SUMMARY_BY_INSTANCE and
FILE_SUMMARY_BY_EVENT.

Columns in object summary tables include:
OBJECT_NAME               VARCHAR          i.e. file name
EVENT_NAME                VARCHAR
COUNT_READ                BIGINT
COUNT_WRITE               BIGINT
SUM_NUMBER_OF_BYTES_READ  BIGINT
SUM_NUMBER_OF_BYTES_WRITE BIGINT

OBJECT_NAME is the pseudo-grouping column. EVENT_NAME is neither a
grouping column nor an aggregate column, but there may be a functional
dependency, so it's okay here. The other columns are aggregations.
The form of aggregation columns (as with the columns COUNT_STAR
and SUM_TIMER_WAIT mentioned in WL#2360) should always be:
aggregate-operator '_' name-of-column-we're-aggregating, or STAR.

Peter assumes we'll want a SETUP_OBJECTS table, as described in WL#2360
in section "Object Summaries", and it will have an entry for "file".

There is still some dispute about what we do with data for file X
after we close file X. There is still some dispute about what we
call the summary tables for this case. There is no dispute that we
will change FILE_SUMMARY tables significantly as we progress with
our ideas about summary tables (WL#4816). This section is not a compulsory
part of the specification.

Example
-------

This example comes from what actually was observed while rnnning with
mysql-6.0-perf.

mysql> select * from events_waits_current where thread_id = 1\G
*************************** 1. row ***************************
            THREAD_ID: 1
             EVENT_ID: 2
           EVENT_NAME: wait/io/file/sql/pid
               SOURCE: mysqld.cc:9258
          TIMER_START: 197474322940
            TIMER_END: 197506166308
           TIMER_WAIT: 31843368
                SPINS: NULL
        OBJECT_SCHEMA: NULL
          OBJECT_NAME: /usr/local/mysql/var/linux.pid
          OBJECT_TYPE: NULL
OBJECT_INSTANCE_BEGIN: NULL
     NESTING_EVENT_ID: NULL
            OPERATION: FILEWRITE
      NUMBER_OF_BYTES: 6
1 row in set (0.00 sec)

Example:

mysql> select * from events_waits_summary_by_event_name
       where event_name like '%/io/%' and count_wait > 0\G
*************************** 1. row ***************************
    EVENT_NAME: wait/io/file/sql/ERRMSG
    COUNT_WAIT: 3
SUM_TIMER_WAIT: 163141234
MIN_TIMER_WAIT: 9438828
AVG_TIMER_WAIT: 54380411
MAX_TIMER_WAIT: 137474608
*************************** 2. row ***************************
    EVENT_NAME: wait/io/file/sql/FRM
    COUNT_WAIT: 296
SUM_TIMER_WAIT: 1439109522
MIN_TIMER_WAIT: 63852
AVG_TIMER_WAIT: 4861856
MAX_TIMER_WAIT: 93699054
*************************** 3. row ***************************
    EVENT_NAME: wait/io/file/sql/pid
    COUNT_WAIT: 1
SUM_TIMER_WAIT: 31843368
MIN_TIMER_WAIT: 31843368
AVG_TIMER_WAIT: 31843368
MAX_TIMER_WAIT: 31843368
*************************** 4. row ***************************
    EVENT_NAME: wait/io/file/mysys/charset
    COUNT_WAIT: 1
SUM_TIMER_WAIT: 68423052
MIN_TIMER_WAIT: 68423052
AVG_TIMER_WAIT: 68423052
MAX_TIMER_WAIT: 68423052
*************************** 5. row ***************************
    EVENT_NAME: wait/io/file/maria/key_file
    COUNT_WAIT: 11
SUM_TIMER_WAIT: 41439948
MIN_TIMER_WAIT: 1364054
AVG_TIMER_WAIT: 3767268
MAX_TIMER_WAIT: 7402450
*************************** 6. row ***************************
    EVENT_NAME: wait/io/file/maria/data_file
    COUNT_WAIT: 2
SUM_TIMER_WAIT: 17721434
MIN_TIMER_WAIT: 8298882
AVG_TIMER_WAIT: 8860717
MAX_TIMER_WAIT: 9422552
6 rows in set (0.01 sec)

Notice, above, how the third row in EVENTS_WAITS_SUMMARY_BY_EVENT_NAME
shows that in the grouping for "wait/io/file/sql/pid" we had one wait,
and its sum timer_wait was 31843368 time units (probably approximately
0.0000318 seconds). This obviously is due to the wait/io/file/sql/pid
event which is in EVENTS_WAITS_CURRENT for thread 1.

mysql> select * from file_summary_by_instance where count_read > 0 limit 1\G
*************************** 1. row ***************************
                FILE_NAME: /usr/local/mysql/share/mysql/english/errmsg.sys
               EVENT_NAME: wait/io/file/sql/ERRMSG
               COUNT_READ: 3
              COUNT_WRITE: 0
 SUM_NUMBER_OF_BYTES_READ: 43250
SUM_NUMBER_OF_BYTES_WRITE: 0
1 row in set (0.01 sec)

mysql> select * from file_summary_by_event_name where count_read > 0 limit 1\G
*************************** 1. row ***************************
               EVENT_NAME: wait/io/file/sql/ERRMSG
               COUNT_READ: 3
              COUNT_WRITE: 0
 SUM_NUMBER_OF_BYTES_READ: 43250
SUM_NUMBER_OF_BYTES_WRITE: 0
1 row in set (0.00 sec)

The above two examples show what is in fact the effect of the same event,
but in one case it's effectively been grouped by FILE_NAME, in the other
case it's effectively been grouped by EVENT_NAME.

Private Remarks
---------------

Some private remarks have been removed from HLS to Progress Reports
for 2009-02-18.