WL#4678: PERFORMANCE_SCHEMA Instrumenting File IO
Affects: Server-5.5
—
Status: Complete
A certain group of people said that "file io" should have priority as the next thing to instrument in performance schema. [mysql intranet]/secure/mailarchive/mail.php?folder=211&mail=2942 We will add the necessary instrumentation and PERFORMANCE_SCHEMA tables.
Names are "io" and "file" ------------------------- The proper name is "io". We reject the other suggested terms "IO", "i/o", "input/output". We don't want to use /s in names since we use / as a separator. We don't want to use upper case because we've chosen to use lower case for other things. The proper name is "file". We reject the other suggested terms "disk", "permanent storage". We don't want "disk" because the data storage device might be a solid-state drive, and because we don't really know whether file io causes disk io, or vice versa (we don't monitor disk io caused by memory mapping here). We don't want "permanent storage" because it's an uncommon term. Thus we preferred lower case rather than upper case. And we decided that "io" is a common abbreviation which everyone knows. And we avoided using a punctuation character inside a name. And "file io" is the name we started with. Class, Order, Family, and Genus ------------------------------- The class/order/family/genus hierarchy is used for the file event's EVENT_NAME and for SETUP_INSTRUMENTS name. Recall that a typical WAIT looks like this: 'Wait/Synch/MUTEX/LOCK_plugin'. Class = Wait. Order = Synch (Synchronization Object). Family = Mutex. Genus = LOCK_plugin. For File IO, we still have a structured name. We didn't agree how it's broken down, and what's in it. Peter Gulutzan suggested 'Disk' is an order within the Wait class Sergei Golubchik disagrees, or at least he thinks Peter can be illogical. Anyway, it won't be 'Disk', and there didn't seem to be a reason to go for some other way, so perhaps that disagreement is now a moot point. The choices suggested were: The class is 'io', the order is 'file', then come lower levels. The class is 'file', then come lower levels. The class is 'wait', the order is 'file', then come lower levels. The class is 'wait', the order is 'io', then come lower levels. Without consensus, but without fierce disagreement, we pick the last choice. Class = wait. Order = io. Family = file Genus = plugin-name e.g. 'maria', or 'sql', or 'mysys' Lowest Level = object type Example: " wait/io/file/sql/master_info". Class = 'wait': We pick 'wait' for events that the server might 'wait' for. So 'io' is in the same class as 'synch'. Order = 'io': We pick 'io' for events that involve 'io'. Types of io are 'net' and 'file'. Family = 'file': We pick 'file' for events that involve 'the device that we use for permanent storage (commonly the same as disk)', which is normally formatted into files (but the category is still file even for raw disk io). Genus = plugin-name or 'sql' or 'mysys': Examples of plugin-name are 'archive', 'blackhole', 'csv', 'example', 'falcon', 'federated', 'heap', 'innobase', 'maria', 'myisam', 'myisammrg', 'ndb' -- these happen to be names of directories underneath /storage in MySQL source. The other possibilities are 'sql' and 'mysys' -- these happen to be names of top-level directories in MySQL source. Lowest level = object type: For example: "tablespace" anything made with CREATE TABLESPACE "index" as in a MyISAM .MYI file "data" as in a MyISAM .MYD file "frm" as in a format (.frm) file "backup" object of a BACKUP or RESTORE statement "log" MySQL server binlog "error log" MySQL logs that aren't binlogs "temporary" any temporary file "outfile" object of an OUTFILE clause "table partition" not the same as "index partition" "other" what Peter forgot, e.g. general log "unknown" not NULL becaue it's a "known unknown" Notice that we don't mind if there are spaces, we don't use '_'s. There may be more object types in some storage engines. There is no lower level for 'read', 'write', etc. Examples of names ----------------- These examples come from the current mysql-6.0-perf implementation. There are here for illustration. This section is not a compulsory part of the specification. wait/io/file/sql/ERRMSG wait/io/file/sql/FRM wait/io/file/sql/Init wait/io/file/sql/file_parser wait/io/file/sql/DbOpt wait/io/file/sql/binlog wait/io/file/sql/binlog_index wait/io/file/sql/log wait/io/file/sql/backuplog wait/io/file/sql/tclog wait/io/file/sql/LOAD_FILE wait/io/file/sql/log_event_data wait/io/file/sql/log_event_info wait/io/file/sql/master_info wait/io/file/sql/send_file wait/io/file/sql/relay_log_info wait/io/file/sql/des_key_file wait/io/file/sql/partition wait/io/file/sql/load wait/io/file/sql/NDB wait/io/file/sql/case_test wait/io/file/sql/pid wait/io/file/sql/map wait/io/file/sql/partition_syntax wait/io/file/sql/global_ddl_log wait/io/file/mysys/proc_meminfo wait/io/file/mysys/cnf wait/io/file/mysys/charset wait/io/file/myisam/backup_log wait/io/file/myisam/dfile wait/io/file/myisam/kfile wait/io/file/tina/metadata wait/io/file/tina/data wait/io/file/tina/update wait/io/file/maria/Control wait/io/file/maria/key_file wait/io/file/maria/data_file wait/io/file/maria/data_tmp wait/io/file/maria/index_tmp wait/io/file/maria/iext wait/io/file/maria/LogDescriptorDirectory wait/io/file/maria/recovery_trace EVENTS_WAITS_CURRENT -------------------- We can find file io individual event information with: SELECT * FROM EVENTS_WAITS_CURRENT; The columns of EVENTS_WAITS_CURRENT are as described for WL#2360, but there are new ones at the end which are specifically for file io. THREAD_ID INTEGER Same as for mutexes. EVENT_ID BIGINT Same as for mutexes. EVENT_NAME VARCHAR(128) See below. SOURCE VARCHAR(64) Same as for mutexes. TIMER_START BIGINT Same as for mutexes. TIMER_END BIGINT Same as for mutexes. TIMER_WAIT BIGINT Same as for mutexes. SPINS INTEGER NULL OBJECT_SCHEMA VARCHAR(64) NULL OBJECT_NAME VARCHAR(512) See below. OBJECT_TYPE VARCHAR(64) NULL OBJECT_INSTANCE_BEGIN BIGINT See below. NESTING_EVENT_ID BIGINT Same as for mutexes. OPERATION VARCHAR(16) New. See below. NUMBER_OF_BYTES BIGINT New. See below. FLAGS INTEGER New. See below. OBJECT_NAME: This has the file name. (We originally picked OBJECT_INSTANCE_BEGIN for WL#2360 but we are changing it.) We acknowledge that VARCHAR(64) is too small, this is now adjusted. This always means "the file name as we know it in the table". It is not the ultimate file that exists if we follow a symlink. It is the encoded file name, we don't show an object's unencoded name. For any event which does not have a file name, for example when InnoDB tablespace is a raw device, the value of this column is NULL. In other contexts this column could be a hex representation of an address. In this case "file io" is just a pointer in memory, so declaring it VARCHAR(512) won't add much. OBJECT_INSTANCE_BEGIN: This is the offset within a file, when that information is knowable without extra trouble. Thus, if operation='seek', we will have the argument of the seek here. We expect that for most operations the value will be NULL. OPERATION: This is usually a verb that looks like a C API function name. Rules: * Generic names with no variants that depend on library or operating system * Anything that relates to "disk" file activity, not just IO * Nothing that relates to special situations like mmap or virtual * Direct non-file action e.g. 'reformat floppy' will be added when needed. The possible values are: "open" "seek" including fsetpos and rewind "read" "write" including fputc "close" "create" "delete" "mkdir" prefer unixish name even if it's Windows CreateDirectory "rmdir" prefer unixish name even if it's Windows RemoveDirectory "write" "tell" including fgetpos "rename" Other posssible names for this column: INTENT, ACTIVITY, ACTION. Let's suppose there's an odd case: instead of forming a 100-byte string and writing it, the programmer wrote a loop which says putc 100 times. Do we really want to monitor each putc? Peter says "yes". NUMBER_OF_BYTES: Initially (before the instrumented code completes) this may be the "requested" number of bytes, or zero, or NULL. After the instrumented code completes, it is the "actual" number of bytes. If the byte count is irrelevant, then this is NULL. FLAGS: If operation is 'open' and there is an O_DIRECT flag, then we will save a flag value here. For other operations expect zero. For most operations, the SOURCE column tells you where to look in the source code. OTHER POSSIBLE COLUMNS: We decided we don't need a column for "error return" -- if there's a significant error, then the server will stop and display an error message, so this would be redundant. File Name --------- The only item that seems troublesome is the file name. Of course it has 'file name' when you select from the table. But we're expecting that the source is a pointer to a string. A read/write function call doesn't provide that automatically. So Marc maintains a table which associates file-descriptor-pointer or file-number with name. This table must, alas, be controlled by a mutex at the moment. But Sergei has a lock-free hash table implementation, and Marc intends to use that eventually. Marc is confident that there will be no need for mutexes or similar synchronization protections, and is confident that lookups cause no significant overhead. History and Event Summary Tables -------------------------------- The history and summary tables are exactly the same as, or equivalent to, EVENTS_WAITS_HISTORY, EVENTS_WAITS_SUMMARY_BY_THREAD, EVENTS_WAITS_SUMMARY_GLOBAL, etc. There is no change for the description of EVENTS_WAITS_SUMMARY tables. The io events are summarized just like the mutex events. The new columns in EVENTS_WAITS_CURRENT (FILE_NAME and OPERATION and NUMBER_OF_BYTES) are not grouped or summarized in these tables. Not all higher level event summary tables are done via this worklog task. Object Summaries ---------------- Originally we felt that there could be no object summaries for files. For reasons, see "[ OBJECT SUMMARIES ]" in WL#4816. But see the suggestion for SETUP_OBJECTS, also in WL#4816. WL#4816 also mentions WAIT_IO_FILE_READ_COUNT_STAR etc. Well, we do have object summaries: FILE_SUMMARY_BY_INSTANCE and FILE_SUMMARY_BY_EVENT. Columns in object summary tables include: OBJECT_NAME VARCHAR i.e. file name EVENT_NAME VARCHAR COUNT_READ BIGINT COUNT_WRITE BIGINT SUM_NUMBER_OF_BYTES_READ BIGINT SUM_NUMBER_OF_BYTES_WRITE BIGINT OBJECT_NAME is the pseudo-grouping column. EVENT_NAME is neither a grouping column nor an aggregate column, but there may be a functional dependency, so it's okay here. The other columns are aggregations. The form of aggregation columns (as with the columns COUNT_STAR and SUM_TIMER_WAIT mentioned in WL#2360) should always be: aggregate-operator '_' name-of-column-we're-aggregating, or STAR. Peter assumes we'll want a SETUP_OBJECTS table, as described in WL#2360 in section "Object Summaries", and it will have an entry for "file". There is still some dispute about what we do with data for file X after we close file X. There is still some dispute about what we call the summary tables for this case. There is no dispute that we will change FILE_SUMMARY tables significantly as we progress with our ideas about summary tables (WL#4816). This section is not a compulsory part of the specification. Example ------- This example comes from what actually was observed while rnnning with mysql-6.0-perf. mysql> select * from events_waits_current where thread_id = 1\G *************************** 1. row *************************** THREAD_ID: 1 EVENT_ID: 2 EVENT_NAME: wait/io/file/sql/pid SOURCE: mysqld.cc:9258 TIMER_START: 197474322940 TIMER_END: 197506166308 TIMER_WAIT: 31843368 SPINS: NULL OBJECT_SCHEMA: NULL OBJECT_NAME: /usr/local/mysql/var/linux.pid OBJECT_TYPE: NULL OBJECT_INSTANCE_BEGIN: NULL NESTING_EVENT_ID: NULL OPERATION: FILEWRITE NUMBER_OF_BYTES: 6 1 row in set (0.00 sec) Example: mysql> select * from events_waits_summary_by_event_name where event_name like '%/io/%' and count_wait > 0\G *************************** 1. row *************************** EVENT_NAME: wait/io/file/sql/ERRMSG COUNT_WAIT: 3 SUM_TIMER_WAIT: 163141234 MIN_TIMER_WAIT: 9438828 AVG_TIMER_WAIT: 54380411 MAX_TIMER_WAIT: 137474608 *************************** 2. row *************************** EVENT_NAME: wait/io/file/sql/FRM COUNT_WAIT: 296 SUM_TIMER_WAIT: 1439109522 MIN_TIMER_WAIT: 63852 AVG_TIMER_WAIT: 4861856 MAX_TIMER_WAIT: 93699054 *************************** 3. row *************************** EVENT_NAME: wait/io/file/sql/pid COUNT_WAIT: 1 SUM_TIMER_WAIT: 31843368 MIN_TIMER_WAIT: 31843368 AVG_TIMER_WAIT: 31843368 MAX_TIMER_WAIT: 31843368 *************************** 4. row *************************** EVENT_NAME: wait/io/file/mysys/charset COUNT_WAIT: 1 SUM_TIMER_WAIT: 68423052 MIN_TIMER_WAIT: 68423052 AVG_TIMER_WAIT: 68423052 MAX_TIMER_WAIT: 68423052 *************************** 5. row *************************** EVENT_NAME: wait/io/file/maria/key_file COUNT_WAIT: 11 SUM_TIMER_WAIT: 41439948 MIN_TIMER_WAIT: 1364054 AVG_TIMER_WAIT: 3767268 MAX_TIMER_WAIT: 7402450 *************************** 6. row *************************** EVENT_NAME: wait/io/file/maria/data_file COUNT_WAIT: 2 SUM_TIMER_WAIT: 17721434 MIN_TIMER_WAIT: 8298882 AVG_TIMER_WAIT: 8860717 MAX_TIMER_WAIT: 9422552 6 rows in set (0.01 sec) Notice, above, how the third row in EVENTS_WAITS_SUMMARY_BY_EVENT_NAME shows that in the grouping for "wait/io/file/sql/pid" we had one wait, and its sum timer_wait was 31843368 time units (probably approximately 0.0000318 seconds). This obviously is due to the wait/io/file/sql/pid event which is in EVENTS_WAITS_CURRENT for thread 1. mysql> select * from file_summary_by_instance where count_read > 0 limit 1\G *************************** 1. row *************************** FILE_NAME: /usr/local/mysql/share/mysql/english/errmsg.sys EVENT_NAME: wait/io/file/sql/ERRMSG COUNT_READ: 3 COUNT_WRITE: 0 SUM_NUMBER_OF_BYTES_READ: 43250 SUM_NUMBER_OF_BYTES_WRITE: 0 1 row in set (0.01 sec) mysql> select * from file_summary_by_event_name where count_read > 0 limit 1\G *************************** 1. row *************************** EVENT_NAME: wait/io/file/sql/ERRMSG COUNT_READ: 3 COUNT_WRITE: 0 SUM_NUMBER_OF_BYTES_READ: 43250 SUM_NUMBER_OF_BYTES_WRITE: 0 1 row in set (0.00 sec) The above two examples show what is in fact the effect of the same event, but in one case it's effectively been grouped by FILE_NAME, in the other case it's effectively been grouped by EVENT_NAME. Private Remarks --------------- Some private remarks have been removed from HLS to Progress Reports for 2009-02-18.
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.