WL#2360: Performance Schema
Affects: Server-5.5
—
Status: Complete
Example ------- Leaving out many details and variations, this is what might happen ... The DBA uses an SQL statement to "enable MUTEXES instrumentation". This statement requires special privileges because global monitoring has a slight impact on performance of all jobs. MUTEXES is a class of performance events that we can monitor. When a MySQL thread is about to acquire a mutex, it sets variables in its own area (possibly THD), including "start time" based on a low-overhead timer/counter. Whenever a thread has finished acquiring a mutex, it sets variables in the same structure, such as: "end time". A user says SELECT ... FROM PERFORMANCE_SCHEMA.EVENTS_WAITS_CURRENT. The thread which is handling this knows that all PERFORMANCE_SCHEMA information is in the memory of other threads. It scans them and makes rows: Thread ID Event Name Timer Wait --------- ---------- ---------- 1 "...mutex..." 11 2 "...mutex..." NULL /* (still waiting) */ The HLS has the variations and some implementation details. Terminology: the job which sets the mutexes is "Provider", the job which reads PERFORMANCE_SCHEMA.EVENTS_WAITS_CURRENT is a "Consumer".
WL#2373: Use cycle counter for timing
WL#2595: kernel-independent atomic operations
WL#4876: Parse options before initializing mysys
WL#2595: kernel-independent atomic operations
WL#4876: Parse options before initializing mysys
Contents -------- Example Why we're doing this The performance_schema schema SETUP PERFORMANCE_SCHEMA . EVENTS_WAITS_CURRENT Writing performance-structure components Ring buffers and PERFORMANCE_SCHEMA . EVENTS_WAITS_HISTORY Reading performance-structure components Consumer Snapshot Timer and Time Units Summaries Sampling Problem: Background threads and processes Problem: No mutex for reading performance-structure Overhead Removals Trace DTrace Embedded Terminology Requests for comment Performance_xxx routines Administrative Source code References For this one task, we can show information related to a few mutexes. (File IO is also in but as a separate worklog task.) "Mutexes" is an arbitrary choice; there will be other categories. We want to show as much mutex-related information as any other tool could provide, but continuously and as SQL tables. Upon task completion it will be demonstrable that the mechanism works, the Consumer doesn't crash, and the Provider is fast. Showing all performance-related information, instead of just a few mutexes, is a bigger job. The claim is: this task and only this task is going in the right direction. It is a genuine all-accoutrement monitor, instead of a temporary substitute. Example ------- See High-Level Description. Why we're doing this -------------------- See High_Level Description. The performance_schema schema ----------------------------- There is a new "database" (i.e. schema) PERFORMANCE_SCHEMA. The PERFORMANCE_SCHEMA schema is much like INFORMATION_SCHEMA, since it contains no persistent base tables, can be read but never changed by users, and always exists. We could use INFORMATION_SCHEMA rather than have a separate PERFORMANCE_SCHEMA schema. However, the PERFORMANCE_SCHEMA tables have a different purpose, so Peter thought they should be separate. See also WL#4513 "Implement PERFORMANCE_SCHEMA". Proposals for the schema name were DYNAMIC (but that is a reserved word in standard SQL), PERFORMANCE_SCHEMA (Jim Winstead's suggestion), PERFORMANCE (Peter's preference), and DBTrace because it sounds like DTrace. There was a vote (see dev-private thread "WL#2360 Vote: schema name and table name"). The winner was PERFORMANCE_SCHEMA. Values in PERFORMANCE_SCHEMA tables are non-deterministic, non-consistent, non-repeatable. PERFORMANCE_SCHEMA tables cannot be locked, indexed, or backed up. Selections on PERFORMANCE_SCHEMA tables should not be cached. Initially Peter thought: unless all those things are true, a table should be elsewhere, e.g. mysql database or INFORMATION_SCHEMA or perhaps even a new schema like FIPS_DOCUMENTATION http://www.itl.nist.gov/fipspubs/fip127-2.htm. But to make INSTRUMENTS and ACTORS and CONSUMERS (and PERFORMANCE_TIMERS) fit in, we add: "The schema also contains tables which exist solely as aids for managing the main performance tables and the monitoring process." There are four table groups in the performance_schema schema: "Setup" see "SETUP" section "Actors" e.g. threads, connections, events "Objects" e.g. locks, buffers, files, mutexes (see WL#4674) "Alien" tables that aren't part of this project but still fit in this schema because they're transient, unbackuped, etc. "Events" results of instrumentation There will eventually be more than 100 tables in PERFORMANCE_SCHEMA. Our first tables will be setup tables and several events table, including PERFORMANCE_SCHEMA.EVENTS_WAITS_CURRENT. For all tables described in following sections, the implementor may change BIGINT to BIGINT UNSIGNED or change INTEGER TO INTEGER UNSIGNED if overflow will otherwise occur. SETUP ----- The use of flags variables, e.g. SET GLOBAL performance_flags = {'' | 'MUTEXES'}; seemed too cumbersome given that there are so many flags, that there are so many ways to set them, and that users need to search and display the combinations as structures. So the settings will be in multiple performance_schema tables, SETUP_INSTRUMENTS, SETUP_CONSUMERS, and possibly SETUP_ACTORS depending on decisions for WL#4674 PERFORMANCE_SCHEMA Setup For Actors. The instruments table has rows for bits of code that are instrumented, e.g. "mutex for lock open". The actors table has rows for people or jobs which invoke the code, e.g. "joe@mysql.com". The consumers table has rows for summary tables, e.g. "the long history". SETUP_INSTRUMENTS table: NAME VARCHAR(128) ENABLED enum('YES','NO') TIMED enum('YES','NO') SETUP_CONSUMERS table: NAME VARCHAR(64) ENABLED enum('Yes','No') Initially the only rows in SETUP_INSTRUMENTS are '%Mutex%' rows, according to this specification. The implementor may choose to do more. (Note added later: the implementor chose to do more.) Later we'll add worklog tasks for several classes or subclasses. A very tentative list: short_functions e.g. string-comparison. only if mysqld is built with DEBUG mutexes mutexes, semaphores, spins ... all synchronization objects locks row lock, table lock, metadata lock. WL#2333 file "System IO" read, write, seek, open, close, delete. WL#4678 table "Table IO" read (row) etc. index "Index IO" read (key) etc. memory malloc/free, and perhaps memcpy strcmp etc. WL#3249 net other "other operating system call" idle stage well-known internal function e.g. sort/parse/commit, WL#4813 sql_functions function written with CREATE FUNCTION statement long_operations long operations (any operations which might take > N seconds) statements WL#2515 compound_statements transactions procedures engine_defined there might be more than one 'engine_defined' bit plugin_defined there might be more than one 'plugin_defined' bit all combination of all the above NB: We have now specified that, for io, class=wait, order=io. So the statement "io is in the same class as mutexes" is true. That's decided in WL#4678. At start-up, all instruments are globally disabled. They should always be enabled. But it's a new feature so we'll be cautious about making it the default. There might be a mysqld option to turn everything on at the start. There will be options for mysqld startup to specify the maximum lengths of internal tables. The setup is elaborate, so it's possible to make precise customizations. The ordinary user can ignore all the elaborateness, by just accepting the defaults or by using statements that turn everything on | off. It does take time to test each option, and we'll have to see if that's bad. Interesting effect: changing the instrumentation flag does not affect code currently waiting. So if an event started with enabled='yes', then a few rows might still be written in EVENTS_WAITS_CURRENT after you changed to enabled='no'." It would be possible to reset existing counters during updating of setup tables. But we won't. It's probably more convenient for users if we "freeze" counters. Description of another setup table, "PERFORMANCE_TIMERS", comes later. The rest of this section has rationale and details about setup tables. It is not necessary for everyone to read it. Most people will benefit by skipping to the next section. These details were discussed on dev-private, ending with this email: [mysql intranet]/secure/mailarchive/mail.php?folder=4&mail=25914 We split "SETUP" into three setup tables, SETUP_INSTRUMENTS and SETUP_CONSUMERS, and another table which we'll talk about in WL#4674. Some columns that are applicable to instruments are meaningless for consumers, and vice versa. [ WHY WE USE SETUP TABLES ] Maybe we'll have 1000 instruments plus 100 consumers. That's 1100 sets of flags. To handle so many, the use of tables give us abilities: (1) to avoid new syntax. (2) to turn multiple flag on/off at the same time for any user. (3) to change by group, e.g. "all mutex" or "all mutex lock open". (4) to see and change other people's flags, with appropriate privileges. (5) to have multiple flags per instrument, e.g. both "enabled" and "timed". (6) to inspect the complete setup. (7) to save a setup and reload it later. We discussed alternative flag-setting mechanisms ("SET performance_flags ...", "MONITOR") but they lacked some of these abilities. [ SETUP_INSTRUMENTS TABLE ] NAME VARCHAR(128) ENABLED enum('YES','NO') TIMED enum('YES','NO') Some alternatives to the name INSTRUMENTS: PROVIDERS, INSTRUMENTATION. Effectively there's a primary key over the NAME column, although show CREATE TABLE won't show it. There is no numeric identifier like INSTRUMENT_ID. We considered a COMMENT column. This would be the implementor's explanation of what is being measured, or what parameters mean. We decided it's unnecessary. We considered a TRACE_ENABLED column. This would cause contents to be dumped to a file, as well as go into a table. (Perhaps a trace file is more correctly a CONSUMER.) We decided it's unnecessary. We considered a DTRACE column. This would enable DTrace provider code, if any, that's in the same place as the instrumentation. We decided it's unnecessary. We considered a PLUGIN column. This would call a user-supplied plugin, with parameters = the same data as an EVENTS_WAITS row, whenever the instrumentation is enabled. We decided it's unnecessary. We considered a MORE column. This would invoke getrusage, or whatever other mechanism the chip or the operating system provides, to give material in addition to what the timer gives, for example "number of page faults". We decided it's unnecessary. The ENABLED and TIMED columns are updatable. But see later section "Privileges". There is no automatic ordering. Contents may change in every version, even minor versions. [ SETUP_INSTRUMENTS."NAME" COLUMN ] Definition: NAME VARCHAR(128) CHARACTER SET UTF8 COLLATE UTF8_GENERAL_CI NOT NULL Instrument names follow a 'taxonomy' convention: Phylum: Instrument /* not used, there's only one phylum */ Class: Wait Order Synchronization Object /* abbreviated 'Synch' */ Family Mutex Genus Lock_open Species Server Individual program_name.c_line_xx /* not used, our lowest level is species */ These all go together in the the NAME column, separated by '/'s. For example: 'Wait/Synch/MUTEX/LOCK_plugin'. Whenever a person instruments a new piece of code, he/she should specify a name following the convention. We will have a test to ensure that. We considered having individual columns, that is, CLASS VARCHAR(), ORDER VARCHAR(), FAMILY VARCHAR(), GENUS VARCHAR(), SPECIES VARCHAR(). We decided that's unnecessary. We considered going down to the "Individual" level, that is, 'Wait/Synch/MUTEX/LOCK_plugin/x.c_line_xx'. We decided that's unnecessary for the name in SETUP_INSTRUMENTS. It will appear in some Consumer tables, though. Alternatives to the name NAME are: WAIT_NAME. You can't update the NAME column. [ SETUP_INSTRUMENTS."ENABLED" COLUMN ] Definition: ENABLED enum('YES','NO') DEFAULT 'NO' NOT NULL ENABLED always means: "the instrumentation is on". However, it isn't necessarily on for all threads. Setting another flag in the SETUP_ACTORS table, described in WL#4674, would solve that ... but other means exist. [ SETUP_INSTRUMENTS."TIMED" COLUMN ] Definition: TIMED enum('YES','NO') DEFAULT 'NO' NOT NULL This turns the timer on. If the instrumentation is enabled but the timer is off, then the performance code always generates NULL for start and end times. You cannot specify the type of timer here. That's determined by a setting in another table, SETUP_TIMERS. We'll talk about timers in a later section. There was a suggestion that we could have the timer on "every 10th time". This could be done by making TIMED an integer and changing it to '10' for all users. We decided that's unnecessary. [ SETUP_CONSUMERS TABLE ] NAME VARCHAR(64) ENABLED enum('Yes','No') Alternatives to the name CONSUMERS are: TABLES. Effectively there's a primary key over the NAME column, although show CREATE TABLE won't show it. There is no numeric identifier. The ENABLED column is updatable. But see later section "Privileges". We considered having a MAXIMUM_ROWS column, that is, each table would be variable size. A little like an equivalent of profiling_history_size. We decided that's a bad idea, it's too risky. [ SETUP_CONSUMERS."NAME" column ] Definition: VARCHAR(64) CHARACTER SET UTF8 COLLATE UTF8_GENERAL_CI This contains, non-taxonomically, the name of a table. For example: 'EVENTS_WAITS', 'EVENTS_WAITS_HISTORY'. [ SETUP_CONSUMERS."ENABLED" COLUMN ] Definition: ENABLED enum('YES','NO') DEFAULT 'NO' NOT NULL If a consumer is enabled, then any data that's gathered (by the instrumentation) goes into, or is summarized into, the consumer table. If it's off, the table is frozen. Suppose, then, that all the CONSUMERS flags are disabled. That means that there is no point gathering the data. But there is no effect on SETUP_INSTRUMENTS.ENABLED when this happens. The gathering of the data depends on SETUP_INSTRUMENTS.ENABLED and perhaps the SETUP_ACTORS values (see WL#4674), not on SETUP_CONSUMERS.ENABLED. [ SETUP_CONSUMERS."NUMBER_OF_FILLED_ROWS" column ] (This column was removed on 2009-03-03. It turned out to be useless.) [ PRIVILEGES ] You also need privileges on the setup tables and columns. There is no script in mysql_install_db which says GRANT UPDATE (ENABLED, TIMED) ON SETUP_INSTRUMENTS TO [user]; GRANT UPDATE (ENABLED) ON SETUP_CONSUMERS TO [user]; ... it's not needed, since root starts off with all privileges and can grant to others. There is no need for SETUP privilege. Even if there was an ALTER privileges, you wouldn't be able to change the default values of setup tables. We may document that users simply lack privileges to do absurd things with PERFORMANCE_SCHEMA tables. In fact, though, the main prevention is that the storage engine cannot perform tasks like ALTER, so users will see a message referring to the storage engine rather than a missing-privilege error message. ALTER is always impossible, things like CREATE TABLE and DROP TABLE and CREATE TRIGGER may be possible for privileged users but will be useless and harmless. [ CHANGES HAPPEN IMMEDIATELY ] If you set a flag for another user, there's almost no delay enabling it. That's different from what you might be used to when you say "SET GLOBAL flag_name". It's necessary because you might want to start monitoring something that's already running. [ VIEWS ] Doubtless, with experience, we'll find that the raw setup tables aren't what users typically want to get at. They'll want views. We won't give them any. A little aside about what Peter thinks one of our goals is. We are not trying to make this feature "easy to use". We are trying to make it easy for everybody else to make this feature easy to use. For example, we give scope for the Enterprise folk to build a tool with it, for code contributors to add their own instruments, and for end users to add convenient views. Anyway, the agents who really want to use these particular tables in complex ways are either DBAs (who by definition know what they're doing) or tools programs (which of course will be written by people who don't worry about convenience). [ EXAMPLES OF STATEMENTS ] Assume default database = the performance_schema schema. SELECT * FROM SETUP_INSTRUMENTS; This lists all the instruments currently in existence. Sample rows of result: +----------------------------------------------------------+---------+-------+ | NAME | ENABLED | TIMED | +----------------------------------------------------------+---------+-------+ | wait/synch/mutex/sql/LOCK_des_key_file | YES | YES | | wait/synch/mutex/mysys/KEY_CACHE::cache_lock | YES | YES | | wait/synch/mutex/myisam/MI_SORT_INFO::mutex | YES | YES | | wait/synch/mutex/tina/tina_mutex | YES | YES | | wait/synch/mutex/archive/archive_mutex | YES | YES | | wait/synch/mutex/archive/azio_container::thresh_mutex | YES | YES | | wait/synch/mutex/blackhole/blackhole_mutex | YES | YES | | wait/synch/mutex/heap/HP_SHARE::intern_lock | YES | YES | | wait/synch/mutex/maria/MARIA_SHARE::key_del_lock | YES | YES | | wait/synch/mutex/myisammrg/MYRG_INFO::mutex | YES | YES | | wait/synch/rwlock/sql/LOCK_grant | YES | YES | | wait/synch/rwlock/mysys/WT_RESOURCE::lock | YES | YES | | wait/synch/rwlock/mysys/SAFE_HASH_mutex | YES | YES | | wait/synch/rwlock/myisam/MYISAM_SHARE::key_root_lock | YES | YES | | wait/synch/rwlock/maria/MARIA_KEYDEF::root_lock | YES | YES | | wait/synch/cond/sql/PAGE::cond | YES | YES | | wait/synch/cond/mysys/WT_RESOURCE::cond | YES | YES | | wait/synch/cond/myisam/MI_SORT_INFO::cond | YES | YES | | wait/synch/cond/archive/azio_container::threshhold | YES | YES | | wait/synch/cond/maria/MARIA_SHARE::intern_cond | YES | YES | | wait/io/file/sql/ERRMSG | YES | YES | | wait/io/file/mysys/proc_meminfo | YES | YES | | wait/io/file/myisam/backup_log | YES | YES | | wait/io/file/tina/metadata | YES | YES | | wait/io/file/maria/control | YES | YES | +----------------------------------------------------------+---------+-------+ UPDATE SETUP_INSTRUMENTS SET ENABLED='no' WHERE NAME LIKE '%/RWLOCK/%'; This turns off instrumentation for all RWLOCK instrumentation. CREATE TABLE test.SAVED_SETTINGS AS SELECT * FROM SETUP_INSTRUMENTS; ... UPDATE SETUP_INSTRUMENTS SET ENABLED=(SELECT ENABLED FROM test.SAVED_SETTINGS y WHERE x.NAME=y.NAME); This saves the SETUP_INSTRUMENTS names and flags, and then (presumably some time later) restores them. To say that I want to stop updating of every performance_schema table but one: UPDATE SETUP_CONSUMERS SET ENABLED='No' WHERE NAME<>'EVENTS_WAITS'; UPDATE SETUP_INSTRUMENTS SET ENABLED='YES'; UPDATE SETUP_CONSUMERS SET ENABLED='YES'; This turns on everything. We just recommend to ordinary users: by default everything is off, but we recommend turning it on. We may later require "per-actor" pre-filtering, perhaps with something like INSERT INTO SETUP_ACTORS VALUES ('ALL','ALL','ALL'); /* see WL#4674 */ To disable all the Mutex instrumentation, I say UPDATE SETUP_INSTRUMENTS SET ENABLED='NO' WHERE WAIT_NAME LIKE '/Mutex/%'; To disable the timer for all cases, I say: UPDATE SETUP_INSTRUMENTS SET Timed='no'; PERFORMANCE_SCHEMA . EVENTS_WAITS_CURRENT ----------------------------------------- Developer discussion of this section is in a dev-private thread starting with [mysql intranet]/secure/mailarchive/mail.php?folder=4&mail=25970 This section is now called EVENTS_WAITS_CURRENT because we won't use the earlier proposals for table names. An "event" is the execution of a snippet of code that is recorded by an instrument. EVENTS_WAITS_CURRENT will show what the latest event was for each thread (actor). The PERFORMANCE_SCHEMA.EVENTS_WAITS_CURRENT table always exists, but if everything in the setup_instruments is disabled, or all the actors are somehow disabled (see WL#4674), then there are no current rows in it. PERFORMANCE_SCHEMA.EVENTS_WAITS_CURRENT will be the first table we make, because: 1. originally mutex waits were an object of interest for another task. Times have changed, but changing our starting place seems unnecessary. Every user may be granted a SELECT privilege for any or all PERFORMANCE_SCHEMA tables, and no other privilege is necessary for viewing. This is reasonable so no user may update the tables (except for some columns in the setup tables, see SETUP section), even clearing the counters is impossible. Peter proposes that users without PROCESS privilege will be able to see other users' EVENTS_WAITS rows, even though such users cannot do SHOW PROCESSLIST for other users. Some other PERFORMANCE_SCHEMA tables will have sensitive information, for example SQL statement text, but we will just change the sensitive columns to NULL, or skip the row, instead of denying access to the table. If the server was started with mysqld --log-bin=binlog_file_name, we will flag all performance_schema tables as unsafe to replicate with both statement-level and row-level replication, so the replication code can generate an appropriate warning or error when then the user changes a persistent non-temporary table and refers to Performance Schema table, for example by saying CREATE TABLE t AS SELECT * FROM PERFORMANCE_SCHEMA.EVENTS_WAITS_CURRENT. However, WL#2360 won't dictate what the error message is or what its effects are, replication decides that. If somebody wants SHOW as well as SELECT, that's fine, as a separate task. The SHOW statement would read PERFORMANCE_SCHEMA.EVENTS_WAITS_CURRENT. [ THE CLASSES OF EVENTS TABLES ] From the SETUP section you'll know that we have a hierarchy like this: " Class: Wait Order Synchronization Object Family Mutex Genus Lock_open Species Server Individual program_name.c_line_xx /* not used, our lowest level is species */ These all go together in the the NAME column, separated by '/'s. " We will have a separate set of tables for each class. For example there will be a table EVENTS_STATEMENTS_CURRENT analogous to EVENTS_WAITS_CURRENT. In the rest of this section I'll only talk about EVENTS_WAITS because for WL#2360 we only are instrumenting waits; however, the same descriptions will apply for all the other EVENTS tables. [ EVENTS_WAITS_CURRENT TABLE ] Summary of definition: THREAD_ID Actually a connection id, possibly a background-thread id EVENTS_WAITS_ID A numeric identifier to distinguish from all other acts EVENTS_WAITS_NAME Corresponds to SETUP_INSTRUMENTS.NAME. Obviously it will start with 'Wait...'. TIMER_* Three columns calculated with the aid of the timer OBJECT_* Four columns that identify the object of the act List of columns and data types: THREAD_ID BIGINT EVENT_ID BIGINT EVENT_NAME VARCHAR(128) SOURCE VARCHAR(64) TIMER_START BIGINT TIMER_END BIGINT TIMER_WAIT BIGINT SPINS INTEGER OBJECT_SCHEMA VARCHAR(64) OBJECT_NAME VARCHAR(64) OBJECT_TYPE VARCHAR(64) OBJECT_INSTANCE_BEGIN BIGINT NESTING_EVENT_ID BIGINT All VARCHAR(64) columns are CHARACTER SET utf8 COLLATE utf8_general_ci, subject to change if we do something about identifiers for version 6.1. Note added by Peter Gulutzan 2009-10-18: Earlier we said utf8_unicode_ci. Alexander Barkov suggested a change to utf8_general_ci, Marc Alff and I concurred, and Mikael Ronström made no objection.) [ EVENTS_WAITS_CURRENT."THREAD_ID" COLUMN ] THREAD_ID BIGINT Identifier of the "thread" that's doing the event. Actually this is a connection id, possibly a background-thread id. It's the same as MySQL "Id" for those connection threads for which "Id" is defined. It used to be defined as INTEGER. But it's now BIGINT. That's because we might identify some threads by a special number. We might discuss special numbers in later emails. For now just take it that it's a number and doesn't have to be VARCHAR(64). The original idea was that this would correspond to what SHOW PROCESSLIST shows. However, we monitor more threads, and must supply different data. Therefore the implementor will add a new PERFORMANCE_SCHEMA.PROCESSLIST table which shows all threads. Having an extra temporary table in PERFORMANCE_SCHEMA is deemed harmless and we won't require its removal for purposes of WL#2360 review. Former name: WAIT_CONNECTION_ID [ EVENTS_WAITS_CURRENT."EVENT_ID" COLUMN ] EVENT_ID BIGINT Identifier of the event. The columns (THREAD_ID, EVENT_ID) are unique and not null. They constitute a primary key. That is, if a row in EVENTS_WAITS_CURRENT has THREAD_ID = 1 and EVENT_ID = 12345, then there is no other row in EVENTS_WAITS_CURRENT with the same THREAD_ID number and the same EVENT_ID number. Also there is no row in one of the other EVENTS_..._CURRENT tables (e.g. EVENTS_STATEMENTS_CURRENT) with the same THREAD_ID number and the same EVENT_ID number. EVENT_ID is a thread-local counter, that is, there's one for each thread. It starts at 0 when the corresponding THD object is created. It goes up by 1, that is, there is a source-code instruction to increment event_id, in the start_wait() for every instrument. Originally EVENT_ID was 32-bit unsigned. We changed to BIGINT after seeing that the number gets high quickly in real-world testing. EVENT_ID is not useful for comparing to see whether one event precedes another event in another thread. However, timer_start will be useful for such comparisons unless the user chooses a millisecond or tick timer. [ EVENTS_WAITS_CURRENT."EVENT_NAME" COLUMN ] EVENT_NAME VARCHAR(128) Identifies what is being waited for. This corresponds to a value in SETUP_INSTRUMENTS."NAME". For example, "Wait/Synch/MUTEX/LOCK_plugin". In any EVENTS_WAITS table, all names will begin with "Wait/...". EVENT_NAME has a full hierarchy. [ EVENTS_WAITS_CURRENT."SOURCE" COLUMN ] SOURCE VARCHAR(64) This is a string which tells us about the source code where the wait-related instruction takes place. We recommend to the person who adds the instrument: take this from __FILE__ and __LINE__. [ EVENTS_WAITS_CURRENT."TIMER_START" COLUMN ] TIMER_START BIGINT The value of TIMER_START is the moment that an event began. This will contain whatever the instrument code gets when it calls the user-designated timer (for example CYCLE) after adjustment to "canonical time units". This column, and the related columns TIMER_END and TMER_WAIT, are all BIGINT. Data types like TIME or INTERVAL would be inappropriate even for MILLISECOND timers because we lack fractional-seconds precision. The "start" value is therefore an integral number of units since an arbitrary moment like "when mysqld started". There is no way to retrieve or reset this start value. [ EVENTS_WAITS_CURRENT."TIMER_END" COLUMN ] TIMER_END BIGINT The value of TIMER_END is the moment that an event ended. This will contain whatever the instrument code gets when it calls the user-designated timer (for example CYCLE) after adjustment to "canonical time units". If the instrument has called the timer for TIMER_START but has not yet completed the call for TIMER_END (presumably because it's still executing the instrumented code), then TIMER_END value is NULL. [ EVENTS_WAITS_CURRENT."TIMER_WAIT" COLUMN ] TIMER_WAIT BIGINT This should tell us "how many time-units elapsed for this instrumentation". This is called WAIT_TIME in Robin's document. Some remarks about the meaning of TIMER_WAIT follow. For understanding the following, remember that an instrument generally looks like this: get-the-time() for the sake of TIMER_START do-the-event get-the-time() for the sake of TIMER_END But get-the-timer() itself takes time. If TIMER_END is NOT NULL, then TIMER_WAIT = TIMER_END - TIMER_START. If TIMER_END is NULL, then TIMER_WAIT is NULL. Alternatively, we could have calculated it as current_time - TIMER_START. Notice that the user has no way of getting current_time directly (there is no user-callable function that uses the current enabled timer). Either way, TIMER_WAIT is not exactly the same as "the time elapsed for the event", because -- depending on the point within the timer call when the value is actually retrieved -- either the TiMER_END or the TIMER_START call will not have the "overhead" of the call itself. The minimum overhead in cycles is available in PERFORMANCE_TIMERS, but we won't convert it to canonical time units and add it to TIMER_WAIT. There are also a few cycles wasted for checking whether timing is enabled, which will probably be a flag-test based on settings in setup tables. These cycles are not included in TIMER_WAIT. We calculate TIMER_START and TIMER_END using the same setup values. For example, if an instrument flag is "enabled" at the start, then we regard it as "enabled" at the end, even if the user has changed it. We'll discuss that in the section "Writing performance-structure components". Former name: WAIT_TIME [ EVENTS_WAITS_CURRENT."SPINS" COLUMN ] SPINS INTEGER This is the number of "spins" for a mutex, or else it's not. Peter has to mention this because there might be a whole group of columns whose values depend on what family the wait belongs to (for example is it a mutex), and even on peculiarities of the storage engine (for example a spin counter is often not going to be available). So sometimes there is going to be a bit of possibly-useful extra information, but usually it's "not available" or "not applicable" and therefore NULL. 5.0 SHOW MUTEX STATUS had Spin_waits and Spin_rounds for a while. But now you only see them when the moon is full. http://dev.mysql.com/doc/refman/5.0/en/show-mutex-status.html Someday somebody (perhaps not Sergei Golubchik) will say that we must keep SHOW MUTEX STATUS / SHOW ENGINE INNODB MUTEX STATUS until there's some other way to know about spins. Usefulness is a less important consideration. Peter considered: in that case, what folks really want is an "EXTRA" column and we'd leave it up to the person who adds the instrument, to decide what it means. But we'll decide: no. If you have extra information, you should petition to add new columns. Two spins for the same mutex acquisition are not counted as separate events. They won't get different EVENT_ID numbers. [ EVENTS_WAITS_CURRENT."OBJECT_..." COLUMNS ] OBJECT_SCHEMA VARCHAR(64) OBJECT_NAME VARCHAR(64) OBJECT_TYPE VARCHAR(64) OBJECT_INSTANCE_BEGIN BIGINT These four columns together identify the thing that's being acted on. It's up to the implementor, and the person adding the instrumentation, to decide "what's being acted upon". For example, maybe when we're "reading" we're reading a table, and at the same time we're reading a tablespace, and at the same time we're reading a partition file. Which one of these things is the "object" is something we deliberately keep vague. But for a mutex, it's reasonable to think that this is "usually null" except for a few cases where "it might be a table name". And for file io, it's reasonable to expect a file name and a byte number or block number. Maybe VARCHAR(64) is small but we won't increase it. Just truncate too-long values. There is no OBJECT_CATALOG column, since we probably won't support catalogs for a long time. [ EVENTS_WAITS_CURRENT."NESTING_EVENT_ID" COLUMN ] NESTING_EVENT_ID BIGINT This is always NULL. The idea is that we'll store the EVENT_ID of the immediately-encompassing event. For a wait, the nesting id will be of a stage or a statement. [ PRE-JOINED COLUMNS ] There has been discussion of essential columns that come with the instrumenting. But what about information that we could steal from elsewhere? For example INFORMATION_SCHEMA.PROCESSLIST has ID, USER, HOST, DB, COMMAND, TIME, STATE, INFO. If we want to know, for a given wait, "what user and host and statement text" it is associated with, we could join: SELECT ... FROM PERFORMANCE_SCHEMA.EVENTS_WAITS_CURRENT a, INFORMATION_SCHEMA.PROCESSLIST b WHERE a.thread_id = b.id; Or we could assume that users will usually want those columns, we'd add USER and HOST and INFO to EVENTS_WAITS_CURRENT. Such columns were in Robin's original proposal for sesscurrwait. Denormalizing is a fine thing in this situation. Peter believes we'll have to steal columns from other tables to avoid joining. But he's not positive. The answer will depend on performance, and perhaps on user feedback. He is not proposing that pre-joined columns should exist at this time. [ OTHER EVENTS TABLES ] The EVENTS_WAITS_CURRENT table is the fundamental table. All the other EVENTS tables are derived from CURRENT information. At least, that's the way that things will appear. The other EVENTS tables will be histories, summaries, snapshots and metrics. For example (NB: this includes tables in future versions): EVENTS_WAITS_HISTORY EVENTS_WAITS_HISTORY_LONG EVENTS_WAITS_SUMMARY EVENTS_WAITS_SUMMARY_BY_THREAD_BY_CLASS EVENTS_WAITS_SUMMARY_BY_THREAD_BY_ORDER EVENTS_WAITS_SUMMARY_BY_THREAD_BY_FAMILY EVENTS_WAITS_SUMMARY_BY_THREAD_BY_GENUS EVENTS_WAITS_SUMMARY_BY_THREAD_BY_TIME ... EVENTS_WAITS_SUMMARY_GLOBAL_BY_CLASS ... EVENTS_WAITS_SNAPSHOT ... EVENTS_WAITS_HISTORY_METRIC ... Each of these table names may appear in SETUP_CONSUMERS. Aggregation topics will be in later sections like "Summaries". Writing performance-structure components ---------------------------------------- For illustration suppose we have: one actor = a connection thread, and one instrument = a mutex acquisition. What code do we need to support one consumer = the events_waits_current table? By answering that we'll show the important structures and guidelines. Start with the fact that #ifdefs exist. The important one is HAVE_PERFORMANCE_MONITORING. There's also HAVE_PERFORMANCE_SCHEMA which is what goes on if one says ./configure --with-perfschema. (We have two #defines rather than one, because it's conceivable that someday we'll want the performance interface without the schema.) By default both #defines are on for a generic build like "BUILD/compile-pentium" (notice that we didn't say "-debug" or "-max"). There's also a "Performance Schema Engine", which we'll describe in the section "Reading performance-structure components". Recommendation to the Implementor: do not use the abbreviation PSE for Performance Schema Engine. Currently the MySQL server code has this: mysql_mutex_lock(X) We want it to have this: #ifdef HAVE_PERFORMANCE_MONITORING /* Look at SETUP_INSTRUMENTS."ENABLED" column value for mutex X */ if (setup.instruments.enabled" = "yes" and actors.enabled = "yes") start_wait(parameters); mysql_mutex_lock(X); end_wait(parameters); else mysql_mutex_lock(X); /* instrumentation disabled */ #else mysql_mutex_lock(X); /* instrumentation undefined */ #endif Yes, there will be a macro, the macro will be the contents of the LLD. And performance-structure has these components: (I'm not specifying whether the structure is private or opaque or secret; nor am I specifying whether the components are really stored or can be accessed via pointers to consts; nor am I specifying order and type! Those are implementor decisions. This is just the easiest way to clarify what's necessary. The field names correspond to column names in EVENTS_WAITS_CURRENT.) NAME LENGTH DESCRIPTION thread_id ? We're storing in a thread-specific area. Not necessarily THD and not necessarily local. But if it's global, it's indexed by thread_id. So there's no need to store thread_id now. event_id 8 A thread-local unsigned auto-increment number. event_name 8 An integer code for the type of event, or a pointer to a name of the event, or a pointer to a structure which has event_name + flags e.g. PERFORMANCE_MONITORING_GLOBAL_FLAG. timer_start 8 value returned by timer function timer_end 8 value returned by timer function spins 4 only certain storage engines will set this object_id 8 How can we get information for object_schema, object_name, and object_type? Minutes of the WL#2360 meeting, Riga 2008 say: - Naming tables Igor: what's the problem with a string copy ? Sergei Petrunia: implement a name dictionary To Igor we have to say "maybe we have to do a string copy". To Sergei Petrunia we have to say: "not until WL#2760". We could store an entire string here, e.g. "schema_name.table_name table", and indeed we might have to do so until there is a true object id, see WL#2760 re unique numeric identifiers. If it's a mutex, this can point to a key address. We can defer discussions of this until we have to refer to schema objects. object_instance_begin 8 If object type = file, this is offset within file. If object type = table, this is some sort of row id. If object type = mutex, this can have address in memory. In earlier editions of this specification, there were more columns: timer_release 0 For when a mutex was freed string-for-comments ? For "Statements" text, for "sql_functions" name, etc. Those columns no longer are there. For the SOURCE column in EVENTS_WAITS_CURRENT, Peter has not indicated the underlying performance-struc. Peter expects that there will be a constant pointer to event_name, and recommends that the source be concatenated to that (it can be stripped later). Everyone expects that the value will be a concatenation of __FILE__ and __LINE__ but we're stating that only as a recommendation to the person who adds the instrument. If something else is more appropriate, fine, as long as it's constant. So start_wait() operates thus: * Figure out what the active timer function is, as decided by SETUP_TIMERS.TIMER_NAME. * Point to the new instance of performance-structure. Getting the location is the job of the Implementor. * Fill in event_name, object_id, object_instance_begin. Supplying the parameters for these is the job of the person who adds the instrument. (But the macro should lighten this task.) * Fill in timer_start. Calling the active timer function is the job of the Implementor. * Set timer_end = 0. So if timer_end = 0, we know a job is still waiting. (The performance_schema . events_waits_current section says this appears as NULL.) And end_wait() operates thus: * Figure out the same active timer as start_wait() figured out. * Point to the same place that start_wait() pointed to. * Fill in spins if it's relevant. * Fill in timer_end. Calling the active timer function is the job of the Implementor. If (end-timer - start_timer) i.e. timer_wait = 5 000 000, we know that a job waited for 5 microseconds before the mutex call finished. Maybe. This figure is in "pseudo picoseconds" (see "Canonical Time Units"). It is probably reasonably close to the truth if the counter is really based on time of day. However, we make no guarantees of similarity to wall-clock time, and for cycle timers there could be considerable variance, and certainly it's not accurate to the nearest picosecond, not even to the nearest 1000 picoseconds. Users are free to make calculation themselves after checking for frequency changes, knowing the risks. This is a recommendation to the Implementor, it is not a requirement: Peter Zaitsev inspired the following suggestion. Suppose there are two pieces of instrumented code closely together: INSTRUMENTED_CODE_MACRO(mutex_1); INSTRUMENTED_CODE_MACRO(mutex_2); Since there is no code between the pieces, probably nothing has happened. In that case: do not bother to test the enabled flags for the second piece (assume they're the same as for the first piece), and do not bother to calculate timer_start (assume it's the same as timer_end from the first piece, plus 1 canonical time unit). We can dismiss the possibility that we'd get swapped out between the pieces. But how do we know there is no code between the two pieces? There might be automatic ways, for example look at row number of assembly output. But there should be a manual way, so that the person who is adding the instrument can specify something like start_wait_without_checking_enabled_flags_and_calling_timer(); instead of start_wait(). Ring Buffers and PERFORMANCE_SCHEMA . EVENTS_WAITS_HISTORY ---------------------------------------------------------- The performance-structure occurs N times for each thread. This is a ring (circular) buffer. N is fixed, it's a #define with a value of 10. We do not allow changing N at runtime. We do allow changing N at startup with mysqld --performance_structures = 20. We do allow changing N with a ./configure option. We do allow for a long history in a separate table, which might not be enabled. Since performance-structure occurs N times per thread, there is an additional pointer or index -- call it performance_structure_current -- so we know "what occurrence of performance-structure is current". The Provider's Start procedure will re-calculate performance_structure_current. One can see what the last N events were, with a separate table in the PERFORMANCE_SCHEMA schema, namely EVENTS_WAITS_HISTORY. The definition of PERFORMANCE_SCHEMA . EVENTS_WAITS_HISTORY is the same as the definition of PERFORMANCE_SCHEMA . EVENTS_WAITS_CURRENT. In fact EVENTS_WAITS_CURRENT can be regarded as a view of EVENTS_WAITS_HISTORY, containing the last 1 row per thread rather than the last N rows. It is possible to clear any history table with a TRUNCATE statement. For example: TRUNCATE TABLE PERFORMANCE_SCHEMA.EVENTS_WAITS_HISTORY; After this, a SELECT from EVENTS_WAITS_HISTORY will return zero rows, although the size of the underlying performance-structure is unaffected, and there will soon be new rows. Effects on EVENTS_WAITS_CURRENT are undefined. The required privilege is DROP; however, granting DROP on a history table will not enable anyone to really drop it. Reading performance-structure components ---------------------------------------- To handle an SQL statement which reads from an EVENTS table, such as SELECT ... FROM PERFORMANCE.EVENTS_WAITS_CURRENT ... the Consumer must read from the performance-structure in-memory areas. With performance-structure setting as described in the previous sections, it is easy for the consumer to see at any given time: whether a job is doing mutex acquisition, how long it has waited, what the mutex is for, and so on. The instrumentation provides it all. Notable aspects that affect reading are: * The Consumer has to do some interpreting and translating of what is in performance-structure. This is okay. The provider must be fast. The consumer can be slow. It's conceivable that in some shops the instrumentation will be on for everything all day and so will be invoked millions of times, and there will be a summary at the end of the day so reading will be invoked a few dozen times. * Because of the definition of the performance schema (see earlier section "The performance schema") there must be tolerance for errors. It seems reasonable to hope that we'll show sensible stuff 999 times out of 1000, but we give unobtrusiveness much higher priority than correctness. The reading will be via a new "performance schema" storage engine. That means that SHOW ENGINES will show the performance schema, and it means that "read row" calls become, after appropriate transmogrification in the storage-engine interface, accesses to the in-memory performance structures. We already know that this is feasible because Marc Alff's prototype does it this way. Consumer Snapshot ----------------- The Consumer can take a snapshot at the start of the statement: if (the statement refers to a performance_schema table e.g. events_waits_current) for (each thread) for (each occurrence of performance-structure) copy performance-structure data to local temporary in-memory table After that, all readings from events_waits_current would be readings of the local temporary table, and therefore consistent. But assuming 1000 threads * 10 performance-structures per thread * 100 bytes per performance-structure, this is a 1,00,000-byte copy. Alternative 1: We could only copy relevant-looking rows: if (the statement refers to a performance_schema table e.g. events_waits_current) for (each thread) for (each occurrence of performance-structure) if the data matches the WHERE clause conditions copy performance-structure data to local temporary table Alternative 2: We could copy more rarely: if (a 3-second interval has elapsed since the last snapshot) for (each thread) for (each occurrence of performance-structure) copy performance-structure data to global temporary table Alternative 3: We could keep snapshots until end-of-transaction instead of until end-of-statement. This could be an option that depends on isolation level. We reject all the alternatives. Timer and Time Units -------------------- We need a timer which is fast (less than 100 cycles to get) and precise (better than one tick on a wall clock). For worklog task WL#2373 "Use cycle counter for timing", Peter supplied code for accessing many low-precision or high-precision timers, and showed that "a choice of a low-level cycle counter is usually best". The implementor will use an extract of the code attached to WL#2373. . Depending on the platform, there will be a timer for at least one of: cycles, nanoseconds, microseconds, milliseconds, ticks. . The timer initialization function will discover what the possible choices are, and the characteristics: underlying routine, overhead, resolution. This will be made available in a table PERFORMANCE_TIMERS. . For most platforms, the default will be a cycle counter unless the timer initialization function shows problems with it. . For some platforms, at least as fallback, we'll use a wall clock This may be useless, but we want the package to "work" on all platforms even if the result is useless. Alternatively we could let the package "work", but sets wait-related columns to zero if there is no way to get a fast counter. We will allow setting a global variable. UPDATE SETUP_TIMERS SET timer_name = { 'CYCLE' | 'NANOSECOND' | 'MICROSECOND' | 'MILLISECOND' | 'TICK' }. Not all of these options are always available; check SETUP_TIMERS. "UPDATE SETUP SET timer_name = 'MICROSECOND'" usually involves gettimeofday. Times that the user sees in EVENTS_WAITS_CURRENT.TIMER_START or EVENTS_WAITS_CURRENT.TIMER_END are BIGINTs representing canonical time units. Extra precision sometimes looks useless but is good for aggregating. The rest of this section has rationale and details about timer tables. It is not necessary for everyone to read it. Most people will benefit by skipping to the next section. These details were discussed on dev-private, starting with this email: [mysql intranet]/secure/mailarchive/mail.php?folder=4&mail=25923 The instrumentation uses timers a lot so getting it right is good. We must tell the user what timers are available; we'll put that in the PERFORMANCE_TIMERS table. We must give the user a choice among the available timers; we'll put that in the SETUP_TIMERS table. And of course we must have the timers; code is already in for WL#2373 "Use cycle counter for timing" so later in this section we'll just say how WL#2373 relates to WL#2360. [ PERFORMANCE_TIMERS TABLE ] PERFORMANCE_TIMERS is a read-only table in the performance_schema schema. It will look like this on an ordinary 1.6-MHz x86 Acer laptop: mysql> select * from PERFORMANCE_TIMERS; +-------------+-----------------+------------------+----------------+ | TIMER_NAME | TIMER_FREQUENCY | RESOLUTION | TIMER_OVERHEAD | +-------------+-----------------+------------------+----------------+ | CYCLE | 1595661691 | 1 | 14 | | NANOSECOND | 1000000000 | 1 | 1350 | | MICROSECOND | 1000000 | 1 | 1227 | | MILLISECOND | 1004 | 1 | 1319 | | TICK | 103 | 1 | 947 | +-------------+-----------------+------------------+----------------+ 6 rows in set (0.13 sec) The table shows highest-frequency timer is CYCLE (the figure "1595661691 / second" is the same as "1.6GHz" which is my machine's speed); highest-overhead timer is NANOSECOND (it takes 1350 cycles to get one nanosecond timing); and the "interval" is always 1, which for instance means that the millisecond-frequency timer really does change every 1 millisecond. The user, or the user's program, can determine from this what the best timer is for a given purpose and time, taking into account the timer's frequency and the timer's overhead. The user must also take into account the well-known problems that some timers may cause, but that is a matter for documentation rather than PERFORMANCE_TIMERS. There are always five rows, ordered from highest frequency to lowest: CYCLE Usually this is an on-chip counter which we get via assembler instructions, such as RDTSC on x86. NANOSECOND Usually this comes from gethrtime() or clock_gettime() or read_real_time(). Or it's unavailable. Clocks that really tick over one billion times a second are rare. MICROSECOND Usually this comes from gettimeofday() on non-Windows, and from QueryPerformanceCounter on Windows. With QueryPerformanceCounter the timer frequency is not 1,000,000 but sometimes it's the same ballpark. MILLISECOND Usually this comes from ftime() if non-Windows, or GetTickCounter() if Windows. TICK A tick is the time between processor interrupts. It's been 1/100 second for a long time. But perhaps during the lifetime of MySQL 6.x many/most customers will see timer_frequency = 1000 for ticks. All these timers show "elapsed time". There are no available timers for "time used by this thread" or "elapsed time / kernel only". There are other timers, e.g. getrusage(), which try to give such data, and which SHOW PROFILE tries to use. We won't use getrusage(). On some platforms we couldn't find something for, or didn't bother to code for, every frequency. For example, the CYCLE counter might be unavailable. In that case users will still see a CYCLE row, but it will be obvious from the other data that we picked whatever was closest, for example it will be the same as the MICROSECOND row. PERFORMANCE_TIMERS is dynamic. We know that the data could change, particularly the frequency for CYCLE. But we don't only calculate the table values once on startup. We recalculate every time that a user selects from PERFORMANCE_TIMERS. What Peter has seen so far is: TICK has high overhead as well as low frequency so it looks like a bad choice on all platforms that we support. (This surprises him greatly; he can't explain the numbers that he gets.) So far we haven't found a use for ticks, but initially we want all possibilities to be there. If no user reports a case where it's worthwhile, we'll remove the TICK row for GA. [ PERFORMANCE_TIMERS."TIMER_NAME" COLUMN ] Definition: TIMER_NAME ENUM('CYCLE','NANOSECOND','MICROSECOND','MILLISECOND','TICK') The name corresponds to the name in the SETUP_TIMERS table. [ PERFORMANCE_TIMERS."TIMER_FREQUENCY" COLUMN ] Definition: TIMER_FREQUENCY BIGINT If a clock ticks once per second, then its frequency is 1. Typically, for a CYCLE timer, the frequency will be about the same as the rated frequency of the CPU. Thus on a 1.6-GHz laptop one should see a number close to 1.6 billion, but when one switches to "Power Save" mode one should see a number close to 800 million. Peter says lots about fluctuating frequency for CYCLE in WL#2373. For the nanosecond and microsecond timers, the numbers are fixed: 1,000,000,000 and 1,000,000. These are the only numbers in the table which are fixed instead of calculated at runtime. High frequency is better than low frequency. But in theory, any one of the timers could be okay. Even if frequency is only 100 times per second, the data is good enough for the most frequently-occurring real-world performance-analysis tasks (according to some advice by people with real-world experience). [ PERFORMANCE_TIMERS."RESOLUTION" COLUMN ] Definition: RESOLUTION BIGINT Sometimes resolution is defined as the inverse of frequency, that is, if frequency is 100 times per second then resolution is 1/100 second. That is not the meaning here. What we're trying to say with "resolution" is: how meaningful is the frequency? Take for example the case of MySQL's 'melody' machine, which has clock_gettime(), which cheerily reports time to the nearest billionth of a second. However, if you call it twice, you either get the same figure both times, or you get different figures but the second figure is one-millionth greater than the first, instead of 1-billionth greater than the first. It "jumps" by millionths not billionths. Peter wanted to call this column TIMER_INTERVAL, but Sergei insisted it's resolution, and nobody else objected to calling it RESOLUTION. Intervals greater than 1 are very common for nanosecond timers and for millisecond counters under Windows. See the tests in rdtsc3_notes.txt which is attached to WL#2373. [ PERFORMANCE_TIMERS."TIMER_OVERHEAD" COLUMN ] Definition: TIMER_OVERHEAD BIGINT This is the minimal overhead of a timer. When we initialize, we call the timer 20 times and what you see in "overhead" is the smallest figure. In fact the instrumentation will invoke the timer twice (before and after what's being instrumented) so multiply this figure by 2 to get the minimal timer overhead per instrumentation. This column is important because for 'always-on' monitoring we want the timer with the low overhead. In fact the low-overhead timer has always been the cycle timer, as is evident again from the test notes in rdtsc3_notes.txt which is attached to WL#2373. A thought experiment: These figures are simplified but not absurd: suppose a cycle takes 1 billionth of a second, suppose timer_overhead = 1000, suppose we monitor 1000 events per second per job, suppose there are 100 simultaneous jobs. When that happens, total instrumentation overhead is more than 20% of the time, and the DBA will have to tweak some dials. [ PERFORMANCE_TIMERS."OTHER" COLUMN? ] Each timer has an underlying routine which depends on platform. For example I mentioned that the microsecond timer depends on 'gettimeofday' for non-Windows, or depends on 'queryperformanceounter' on Windows. But we won't report that. Cycle timers may have a variable frequency. So we could recalculate periodically and have columns that show minimum / maximum rates. But we won't report that. (Recalculation does happen, though, whenever you select from the PERFORMANCE_TIMERS table.) Timers could have a 'base' or 'epoch'. This might be a moment in 1970, or it might be the time that the server started, or perhaps someday we cold allow it to be reset. But we won't report that. [ SETUP_TIMERS TABLE ] SETUP_TIMERS is a tiny table which just contains the information: which of the available timers will be used? NAME VARCHAR(64) TIMER_NAME enum('CYCLE','NANOSECOND','MICROSECOND','MILLISECOND','TICK') There is always one row in the table. On my machine it looks like this: mysql> SELECT * FROM SETUP_TIMERS; +------+------------+ | NAME | TIMER_NAME | +------+------------+ | Wait | CYCLE | +------+------------+ 1 row in set (0.00 sec) CYCLE will tend to be a default because of its lower overhead. Suppose that, after experiencing with my particular installation that CYCLE is not I want, I decide to use a MILLISECOND timer instead. I'd say: UPDATE SETUP_TIMERS SET TIMER_NAME='MILLISECOND' WHERE NAME='Wait'; This might return an error, or might set for the next-best timer, if there's no millisecond timer. Notice that you can't choose 'NONE'. There always is a timer. If you want to disable timing, you do that by setting ENABLED='No' in SETUP_INSTRUMENTS or by disabling actors (see WL#4674). [ SETUP_TIMERS."NAME" COLUMN ] The NAME is the same as the 'class' in SETUP_INSTRUMENTS. Remember the taxonomy in SETUP_INSTRUMENTS? It looked like " Instrument names follow a 'taxonomy' convention: Phylum: Instrument /* not used, there's only one phylum */ Class: Wait Order Synchronization Object Family Mutex Genus Lock_open Species Server Individual program_name.c_line_xx /* not used, our lowest level is species */ These all go together in the the NAME column, separated by '/'s. " At this time the only class we're instrumenting is 'Wait'. So there's only one row in the SETUP_TIMERS table, named 'Wait'. Eventually we'll monitor long and infrequent routines, not just waits. In that case, arguably, the overhead matters less and the relation to wall-clock time matters more. So for a different class, we might have a different row, okay. But two things in the same class and order, like MUTEX and RWLOCK, always have the same timer. You can't update the NAME column. [ SETUP_TIMERS."TIMER_NAME" COLUMN ] The TIMER_NAME is a reference to the possible names in the PERFORMANCE_TIMERS table: 'CYCLE', 'NANOSECOND', 'MICROSECOND', 'MILLISECONDD', or 'TICK'. TIMER_NAME is updatable. [ PRIVILEGES ] You also need privileges on the SETUP_TIMERS table and columns. Effectively, there is a script which says GRANT SELECT ON PERFORMANCE_TIMERS TO [user]; GRANT UPDATE (TIMER_NAME) ON SETUP_TIMERS TO [user]; Implementation details are the same as for other setup tables. [ WL#2373 ] WL#2373 "Use cycle counter for timing" has the timer code. For description, or for the code, look at WL#2373 and its attachments. But how does this relate to WL#2360? Thus: * WL#2373 stays in existence and Peter will keep it up-to-date for a while. It has been extremely useful for just testing the timers on multiple platforms, and perhaps QA will find more uses for it. But it will not be independently reviewed. WL#2373 was only made for the sake of WL#2360. In theory a high-resolution timer might be useful for other purposes (e.g. UUID), but we're not going to put such tricky code in unless we are doing real monitoring. * Marc incorporates the timer code in MySQL server code. (He's already done this actually, in mysql-6.0-perf.) * The WL#2360 code reviewers are responsible for checking all the code in Marc's patch, including inline assembler. * WL#2373 does not have to be on the "6.x roadmap". [ CANONICAL TIME UNIT ] We will have "canonical time units". This means that, no matter what timer you choose, MySQL will adjust it so that it's comparable with results from other timers. The unit is smaller than nanoseconds and smaller than cycles for efficiency reasons. the canonical time unit is a picosecond, 1 / 1 000 000 000 000, one millionth of a microsecond. Convert to picoseconds during instrumentation. For example: suppose we know that the cycle timer has a frequency of 1.8GHz, i.e. 1800 million cycles per second, so our Multiplier is (1 000 000 000 000 / 1 800 000 000) = 555.56. Round to 556. Store it forever in global 64-bit unsigned integer "Multiplier". Now suppose the timer returns 8888 (cycles). Multiply that by Multiplier = (8888 * 556) = 4941728. That's what goes into performance-structure . timer_start or timer_end. Precision loss: variance from the original exact value will occur due to rounding. The example shows one of the greatest variances that can appear in real 1.8-GHz computers: 0.5 out of 555. That is, 0.1%. That is, if we reverse the calculation by evaluating 4941728 / (1000000000000 / 1800000000), we get 8895, not 8888. This is negligible for our current purposes. Overhead: Multiplying by an integer is less than 10 cycles on Peter's x86. It occurs twice per instrumentation, and very rarely there would be a cache miss for getting Multiplier, but a cost of 20 cycles per instrumentation is the worst Peter can imagine. The key point is: we do not divide or resort to floating point. Overflow: According to Peter's idea of how arithmetic should work, (power(2,64) / 1000000000000) / (60 * 60 * 24) = 213 days. So if we say the epoch is always "when the instance started", overflow is quite unlikely. Stepping: the Multiplier is the wrong value if the frequency changes. Recalculating the frequency is very slow, and is not justified given that frequency change is both unlikely to happen and unlikely to cause difficulty if it does happen. Remember, we originally agreed that a cycle counter is okay regardless of frequency change. Reading: during reading you certainly don't want picoseconds, so you'll have to divide to get microseconds, cycles, or whatever. At this point, you'll have to divide by 1 000 000 for microseconds, divide by frequency for cycles, and so on. So, the reader does get slower. So the Benefit is that you always see the same time units. Epoch: so that timer_start with one timer is comparable with timer_start from another timer, we now must have a common epoch. This can't be fixed on some date (see above re 'Overflow'), so we standardize on: epoch = time the server started. For example, if we have a microsecond timer, we will say something like: x = getttimeofday x = x - mysqld-startup epoch value x = x * 1000000 /* i.e. number of microseconds in a picosecond */ timer_start = x, or timer_end = x. Putting it another way, this means that in effect, a value T returned by a timer is adjusted as: Normalized_t = (T - To=server-startup) * timer-unit-to-pico-factor and that's what gets recorded in TIMER_START, TIMER_END, and TIMER_WAIT. Let's make sure it's clear that there are no additional miracles. We'll say the dread words 'not a bug' if anyone ever claims these are flaws: * We calculate cycle frequency when the mysqld instance starts. We do not recalculate later. If frequency changes, that's acceptable. * If different cores have different frequencies, that's acceptable. * Because the epoch is from the mysqld start, you cannot usefully compare timer_start values from different instances, or timer_start values from logs made on different days. Summaries --------- The "Summaries" section has been shifted to a new worklog task, WL#4816 PERFORMANCE_SCHEMA Summaries Sampling -------- We could have a separate thread that reads performance schema events data once per second, filters so that "Idle" sessions are ignored, and puts the result (a snapshot of the status of active sessions) in a "sampling table". Similarly, "metrics" (e.g. rates, like number-of-waits-per-second) are usually calculated at long intervals, like one minute. We will need these for alerts. Finally, we could sample every 30 minutes by default, and keep the result in a permanent on-disk table, purging every week. The server will not do sampling, metrics, or workload repositories. We believe that the Enterprise Monitor crew can do it, with a scheduler such as CREATE EVENT or with their own scheduling methods. So our job as server people is to make sure that the tools people can make things like active session history and automatic workload repository without encountering showstopper-like difficulties. Problem: background threads and processes ----------------------------------------- [ The contents of this section are removed. We can remove the heading and the table-of-contents entry any time. ] Problem: no mutex for reading performance-structure --------------------------------------------------- It is possible that a thread (the Provider) will be writing to performance-structure while another thread (a Consumer) is reading it. We won't use a mutex to control every time a Provider wants to overwrite performance-structure, because that would add too much overhead. How can we then prevent crashes or ridiculous results? Conflict between writers and writers: threads write only to their own (thread-specific) memory areas, so two "writers" will not conflict. There are only two exceptions to this rule: 1. Clearing. We intend to support SQL statements which will remove (zero) everything for an EVENTS table. This could indeed occur at the same time as some other job is updating the same memory. However, the only effect is that the clearing doesn't (totally) succeed, sometimes. We document that, and forget it. 2. Thread termination. If we're reading a thread's memory, and the thread's memory is deallocated, that's no good. So this is controlled by a mutex. There's nothing new about this, though. And it's only a case of "the reader must watch a mutex"; there is no mutex necessary for the provider's instrumentation. There is no other time when memory might be allocated or deallocated affecting a reader; we insist on fixed-size storage at all times after mysqld has initialized. Conflict between readers and writers: readers will maybe be reading while writers are writing in the next structure that the reader is about to look at, or writing in the same structure, or even writing to the same field in a structure. Some scenarios, which would be classed as acceptable errors if it was possible for them to happen: * It's a 32-bit OS, and the reader reads the top half of an address, then the writer rewrites, then the reader reads the bottom half. This may not happen (see below), but it's recommended that Readers could check all addresses for sanity, though not for correctness. * It's a UTF8 string, and the reader sees exactly half of it, including the first byte of a multi-byte character. Readers could check for invalid strings and truncate them. Again, this may not happen (see below). * It's a SELECT with a WHERE and a GROUP BY and a JOIN and an ORDER BY and a UNION on many different ACTS tables. There's going to be a lot of read inconsistency happening. Readers don't have to check for this. The danger is only with logic like "if a > b ... else if a == b ... else if a < b ... else assert(a makes no sense)". Perhaps, though it's unlikely, you might see assertions with debug builds. * It's a writing job but it does some reading as well, from SETUP contents. At the same time somebody could be updating a SETUP table so there's temporary inconsistency. Sergei Golubchik has proposed an algorithm which requires no mutexes and no memory barriers, and yet is safe. Marc Alff has agreed to implement this algorithm. Therefore the implementor is not "required" to perform sanity checks. Peter Gulutzan only says that such cautious procedures are "recommended". See dev-private email thread starting at [mysql intranet]/secure/mailarchive/mail.php?folder=4&mail=26195 Combine all that with other potential error sources, such as cycle counter drifting (see WL#2373), and it's clear that the implementor and the users must take care. Remember the upside: Overhead is low (see "Overhead" section). Overhead -------- Guessing from the fact that all MOV or ADD instructions are cheap, and assuming very good luck with caching, the overhead "per event" (i.e. for the Provider's Start and End and Release procedures) might be around 200 cycles on a P4. Since events are very frequent, Peter has emphasized that the overhead "per event" must be minimal, minimal, minimal. That's why he is sure that the Provider should use a cycle counter, should not itself use mutexes unless it's terminating, and should pass on all calculations to the Consumer. Our overall overhead should be 3% or less. Peter thinks that's possible because he's seen Kelly's timing tests with DBT2 runs. If everything is disabled in the setup tables, the overhead is merely that we waste some memory for unused performance-structure and that we have test-and-jump instructions for every Provider procedure. And if there is no timing (setup_instruments.timed='no'), the overhead likely goes down about 75%. On the other hand, if there is timing and setup_timers.timer_name = 'microsecond', the overhead likely goes up, by as much as 4000%. For a modern x86 server the requirement is 200 cycles, for one instrumentation, with cycle timing on, with summaries and optional SETUP flags off. If we can't do 200 cycles, but we still can do less than 1000 cycles, then we'll proceed but we'll have to be less ambitious about adding summary tables. If we can't do less than 1000 cycles, then we won't proceed. The main thing is: we have a specific statement what is meant by "too much overhead". UPDATE. Kelly Long did a throughput DBT2 test with what Marc Alff has coded so far, on an AMD 8-way Dual Core. The results in transactions per second, with columns for 8 or 16 or 32 or 64 or 12 or 192 connections, comparing "IN" i.e. performance interface in and instruments all enabled, versus "OUT" i.e. performance interface not even compiled in, was: 8 16 32 64 128 192 ----- ----- ----- ----- ----- ----- IN 15096 18049 21259 20494 19886 19293 OUT 15680 21277 22183 21025 20474 19174 Overhead 3.72% 15.17% 4.17% 2.53% 2.87% -0.62% ... By now Marc Alff has implemented a lot, so these figures might resemble reality. Removals -------- Some tasks might become redundant, or might need changing to be part of the larger task. That probably includes all tasks that currently are plans for some bit of extra performance measurement. But, due to the delay implementing WL#2360, some things have gotten in and there is no hope of getting rid of them soon. They include: WL#2317 Add patch to show mutex stats Vadim Tkachenko added this. It has cumulations for InnoDB mutexes: Count, Spin_waits, Spin_rounds, OS_waits, OS_yields, OS_waits_time. Peter proposes that we remove all code for the SHOW MUTEX STATUS statement (now called SHOW ENGINE INNODB MUTEX), but if the information is truly necessary we will put it in the appropriate PERFORMANCE table. It is difficult to make changes to InnoDB code in any case. Google's SHOW TABLE STATISTICS http://code.google.com/p/google-mysql-tools/wiki/UserTableMonitoring This will almost certainly be in before we make it obsolete. Chad Miller's implementation of Jeremy Cole's profiling (BUG#24795). Peter said he'll try to get this deprecated; Sergei said he thinks we should wait till performance schema has everything profiling has. Christopher Powers's Falcon diagnostics. (However, Christopher was at the Riga meeting where we discussed the performance schema, and expressed support for instrumenting Falcon this way. Sergei Golubchik spoke similarly about Maria instrumenting.) Stewart Smith's NDB$INFO. We expect cluster will diverge from MySQL. Trace ----- [ The contents of this section have been moved to WL#4878 PERFORMANCE_SCHEMA Trace. ] DTrace ------ [ The content of this section has been removed. We can remove the heading and the table-of-contents entry any time. ] Embedded -------- We will not support Performance Schema with embedded MySQL / libmysqld. Terminology ----------- ACTOR. Something which does an event. A thread, a background process, or a group of threads belonging to a single role or user. CONSUMER. The server components which read information about the performance, in in-memory performance structures. Any code which reads, in order to produce something in a performance schema EVENTS table, is a Consumer. The reverse of Provider. EVENT. Something which is done. A deed. The meaning "act of a play" is not officially intended, but usually fits. Try not to confuse it with the schema object that you make with CREATE EVENT. EVENTS TABLES. Any table in performance schema whose name begins with EVENTS_. Rows in events tables can be "individual" (a mutex event happens at Time T1 for 1 time-unit), "summary" (9 mutex events happened today), "histogram" (4 mutex events happened today for less than 1 time-unit and 1 took 1 time-unit and 4 took 2 time-units), or "history" (a mutex event happened at Time T2, a mutex event happened at Time T2, a mutex event happened at Time T3, and so on). FILE. "File IO" is our term for any read or write call via the operating system which we have reason to believe may cause disk IO. If the operating system in fact doesn't do real disk IO at the time, for example because it caches, that's fine, we're not demanding that the disk IO be "real". INSTRUMENT. A pair of code pieces which bracket a code piece which is performing an Event. The bracketed code piece is "instrumented code". The first code piece of the instrument (before the instrumented code) is "start_wait". The second code piece of the instrument (after the instrumented code) is "end_wait". Instruments are in Providers. Typically an instrument will calculate performance data and store it in an in-memory performance structure. INSTRUMENTATION. The set of all Instruments in the server. METRIC. A routine or application which happens at long intervals (typically once per minute) and shows rates such as "acts per second in the last minute", which can be found from snapshots that Sampling produces. MUTEX. "Mutex" is the word we will use for all objects that are used to reserve access to processes or memory locations. Mutexes are not shared, and are not enqueued. A lock is not a mutex. We will not use the word "latch" except to explain that we don't use it. PERFORMANCE SCHEMA. What I hope will become the preferred term for the whole feature, instead of Monitoring (easily confused with MySQL Enterprise Monitor), or Profiling (which to me implies "summaries"), or Performance Analysis (too long), or Instrumentation (may not always be true) or "Performance Interface" (unless somebody really likes it). PROVIDER. The server components which supply information about the performance, in in-memory performance structures. The reverse of Consumer. Any code which contains an Instrument is a Provider. SAMPLING. A routine or application which happens at short intervals (typically once per second) and takes snapshots of performance schema tables. STAGE. An act that is nested in a statement, and may nest a wait. For example, "Filesort". WAIT. A "wait" occurs when a thread ceases to do useful work with the CPU. Typically waits involve an operating-system call (yield, sleep, get) which involves accessing a slow device (a net, a human, a disk) or which involves contention (another thread has asked for the same thing). The performance schema "wait" genus necessarily includes some events that involve waiting zero time units (disk block already in cache, mutex needs no wait, etc.), but we can't know that till we call, so there's no separate category. Requests for comment -------------------- [ The content of this section has been removed. We can remove the heading and the table-of-contents entry any time. ] Performance_xxx routines ------------------------ [ The content of this section has been removed. We can remove the heading and the table-of-contents entry any time. ] List of Mutexes --------------- [ The content of this section haa been removed. We can remove the heading and the table-of-contents entry any time. There is a separate task WL#4670 "Mutex documentation". ] Administrative -------------- Marc Alff added a soft dependency on WL#4601: there is no point in instrumenting the "fast mutex" implementation if it's going away. If there is already a user-created database named PERFORMANCE_SCHEMA, then mysql_upgrade will fail. Source code ----------- Status, as of 2010-01-16: The WL#2360 code is now merged in 5.5 and 6.0. The code is publicly available at: https://code.launchpad.net/~mysql/mysql-server/mysql-next-mr https://code.launchpad.net/~mysql/mysql-server/mysql-6.0-codebase The following branches are not maintained anymore: mysql-5.4 port, abandoned: https://code.launchpad.net/~marc.alff/mysql-server/mysql-azalea-perfschema mysql-6.0, abandoned: https://code.launchpad.net/~marc.alff/mysql-server/mysql-6.0-perfschema mysql-5.5, merged: https://code.launchpad.net/~marc.alff/mysql-server/mysql-trunk-perfschema References ---------- This section includes some references to other DBMSs. Monitoring is common to all products. For most DBMSs, the section only will show how little information they provide. Oracle Database Reference 11g Part III "Dynamic Performance Views": http://download.oracle.com/docs/cd/B28359_01/server.111/b28320/dynviews_part.htm#i403961 Oracle white paper "The Self-Managing Database ...": http://www.oracle.com/technology/products/manageability/database/pdf/twp03/TWP_manage_automatic_performance_diagnosis.pdf DB2 system monitor guide: ftp://ftp.software.ibm.com/ps/products/db2/info/vr9/pdf/letter/en_US/db2f0e90.pdf PostgreSQL manual chapter on monitoring database activity: http://www.postgresql.org/docs/8.3/static/monitoring.html (Measurements, in "pg_stat_" views, are counts, systemwide or per table or per index, for statements, reads, buffer hits. PostgreSQL also lets you see time used per statement, that's about all.) SQL Server 2008 books online, "sysprocesses": http://msdn2.microsoft.com/en-us/library/ms179881.aspx SQL Server 2008 books online, "sp_trace_setevent": http://technet.microsoft.com/en-us/library/ms186265(SQL.100).aspx (SQL Server 2000 is not a good model to follow.) "SQL Server 2008 Dynamic Management Views and Functions": http://msdn2.microsoft.com/en-us/library/ms188754(SQL.100).aspx "Profiler and trace in SQL Server 2005": zip including short power point file: http://www.sqlteam.com/item.asp?ItemID=17345 (SQL Server 2005 allows trace files as SQL views.) Sybase ASA manual re performance monitoring statistics: http://www.ianywhere.com/developer/product_manuals/sqlanywhere/0901/en/html/dbdaen9/00000765.htm (Sybase ASA provides cumulative statistics to the Windows performance monitor) Sybase ASE performance monitor, french manual: http://download.sybase.com/pdfdocs/mng1250f/mon.pdf sybooks.sybase.com "syslocks" etc.: http://sybooks.sybase.com:80/onlinebooks/group-as/asg1251e/tables/%40Generic__BookView WL#3363 ndb$info. Also ndb thread "Re: WL#3363 ndb$info": [mysql intranet]/secure/mailarchive/mail.php?folder=113&mail=6396 dev-private july 2007 thread re Ingo's time estimates [mysql intranet]/secure/mailarchive/mail.php?folder=4&mail=20154
The code in bzr mysql-6.0-perf contains the low-level structures and main algorithms, commented and set apart from other code with if/endif as described in the HLS. That suffices as an LLD for a task of this nature.
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.