WL#2360: Performance Schema

Affects: Server-5.5 — Status: Complete

Example                
-------                
                
Leaving out many details and variations, this is what might happen ...         

The DBA uses an SQL statement to "enable MUTEXES instrumentation".
This statement requires special privileges because global                
monitoring has a slight impact on performance of all jobs.
MUTEXES is a class of performance events that we can monitor.
                
When a MySQL thread is about to acquire a mutex, it sets variables in       
its own area (possibly THD), including "start time" based on a        
low-overhead timer/counter.                
Whenever a thread has finished acquiring a mutex, it sets variables                
in the same structure, such as: "end time".                
                
A user says SELECT ... FROM PERFORMANCE_SCHEMA.EVENTS_WAITS_CURRENT.
The thread which is handling this knows that all PERFORMANCE_SCHEMA
information is in the memory of other threads. It scans them and makes rows:   
            
Thread ID      Event Name     Timer Wait              
---------      ----------     ----------               
1              "...mutex..."  11               
2              "...mutex..."  NULL                  /* (still waiting) */    

The HLS has the variations and some implementation details.                

Terminology: the job which sets the mutexes  is "Provider",
the job which reads PERFORMANCE_SCHEMA.EVENTS_WAITS_CURRENT
is a "Consumer".

Contents
--------

Example
Why we're doing this
The performance_schema schema
SETUP
PERFORMANCE_SCHEMA . EVENTS_WAITS_CURRENT
Writing performance-structure components
Ring buffers and PERFORMANCE_SCHEMA . EVENTS_WAITS_HISTORY
Reading performance-structure components
Consumer Snapshot
Timer and Time Units
Summaries
Sampling
Problem: Background threads and processes
Problem: No mutex for reading performance-structure
Overhead
Removals
Trace
DTrace
Embedded
Terminology
Requests for comment
Performance_xxx routines
Administrative
Source code
References

For this one task, we can show information related to a few mutexes.
(File IO is also in but as a separate worklog task.)
"Mutexes" is an arbitrary choice; there will be other categories.
We want to show as much mutex-related information as any
other tool could provide, but continuously and as SQL tables.
Upon task completion it will be demonstrable that the mechanism
works, the Consumer doesn't crash, and the Provider is fast.

Showing all performance-related information, instead of just a
few mutexes, is a bigger job. The claim is: this task and only
this task is going in the right direction. It is a genuine
all-accoutrement monitor, instead of a temporary substitute.

Example
-------

See High-Level Description.

Why we're doing this
--------------------

See High_Level Description.

The performance_schema schema
-----------------------------

There is a new "database" (i.e. schema) PERFORMANCE_SCHEMA.

The PERFORMANCE_SCHEMA schema is much like INFORMATION_SCHEMA, since it
contains no persistent base tables, can be read but never changed
by users, and always exists. We could use INFORMATION_SCHEMA rather
than have a separate PERFORMANCE_SCHEMA schema. However, the
PERFORMANCE_SCHEMA tables have a different purpose, so Peter thought
they should be separate. See also WL#4513 "Implement PERFORMANCE_SCHEMA".

Proposals for the schema name were DYNAMIC (but that is a reserved word in
standard SQL), PERFORMANCE_SCHEMA (Jim Winstead's suggestion),
PERFORMANCE (Peter's preference), and DBTrace because it sounds like DTrace.
There was a vote (see dev-private thread "WL#2360 Vote: schema name and
table name"). The winner was PERFORMANCE_SCHEMA.

Values in PERFORMANCE_SCHEMA tables are non-deterministic, non-consistent,
non-repeatable. PERFORMANCE_SCHEMA tables cannot be locked, indexed, or
backed up. Selections on PERFORMANCE_SCHEMA tables should not be cached.
Initially Peter thought: unless all those things are true, a table
should be elsewhere, e.g. mysql database or INFORMATION_SCHEMA
or perhaps even a new schema like FIPS_DOCUMENTATION
http://www.itl.nist.gov/fipspubs/fip127-2.htm.
But to make INSTRUMENTS and ACTORS and CONSUMERS (and PERFORMANCE_TIMERS)
fit in, we add:
"The schema also contains tables which exist solely as aids
for managing the main performance tables and the monitoring process."

There are four table groups in the performance_schema schema:
"Setup" see "SETUP" section
"Actors" e.g. threads, connections, events
"Objects" e.g. locks, buffers, files, mutexes (see WL#4674)
"Alien" tables that aren't part of this project but still fit
in this schema because they're transient, unbackuped, etc.
"Events" results of instrumentation

There will eventually be more than 100 tables in PERFORMANCE_SCHEMA.

Our first tables will be setup tables and several events table,
including PERFORMANCE_SCHEMA.EVENTS_WAITS_CURRENT.

For all tables described in following sections, the implementor may
change BIGINT to BIGINT UNSIGNED or change INTEGER TO INTEGER UNSIGNED
if overflow will otherwise occur.

SETUP
-----

The use of flags variables, e.g.
SET GLOBAL performance_flags = {'' | 'MUTEXES'};
seemed too cumbersome given that there are so many flags,
that there are so many ways to set them, and that users
need to search and display the combinations as structures.
So the settings will be in multiple performance_schema tables,
SETUP_INSTRUMENTS, SETUP_CONSUMERS, and possibly SETUP_ACTORS
depending on decisions for WL#4674 PERFORMANCE_SCHEMA Setup For Actors.
The instruments table has rows for bits of code that are instrumented,
e.g. "mutex for lock open".
The actors table has rows for people or jobs which invoke the code,
e.g. "joe@mysql.com".
The consumers table has rows for summary tables,
e.g. "the long history".

SETUP_INSTRUMENTS table:
NAME VARCHAR(128)
ENABLED enum('YES','NO')
TIMED enum('YES','NO')

SETUP_CONSUMERS table:
NAME VARCHAR(64)
ENABLED enum('Yes','No')

Initially the only rows in SETUP_INSTRUMENTS are '%Mutex%' rows,
according to this specification. The implementor may choose to do more.
(Note added later: the implementor chose to do more.)

Later we'll add worklog tasks for several classes or subclasses.
A very tentative list:
short_functions e.g. string-comparison. only if mysqld is built with DEBUG
mutexes mutexes, semaphores, spins ... all synchronization objects
locks row lock, table lock, metadata lock. WL#2333
file "System IO" read, write, seek, open, close, delete. WL#4678
table "Table IO" read (row) etc.
index "Index IO" read (key) etc.
memory malloc/free, and perhaps memcpy strcmp etc. WL#3249
net
other "other operating system call"
idle
stage well-known internal function e.g. sort/parse/commit, WL#4813
sql_functions function written with CREATE FUNCTION statement
long_operations long operations (any operations which might take > N seconds)
statements WL#2515
compound_statements
transactions
procedures
engine_defined there might be more than one 'engine_defined' bit
plugin_defined there might be more than one 'plugin_defined' bit
all combination of all the above

NB: We have now specified that, for io, class=wait, order=io.
So the statement "io is in the same class as mutexes" is true.
That's decided in WL#4678.

At start-up, all instruments are globally disabled. They should always be
enabled. But it's a new feature so we'll be cautious about making it the
default. There might be a mysqld option to turn everything on at the start.
There will be options for mysqld startup to specify the maximum
lengths of internal tables.

The setup is elaborate, so it's possible to make precise customizations.
The ordinary user can ignore all the elaborateness, by just accepting
the defaults or by using statements that turn everything on | off.
It does take time to test each option, and we'll have to see if that's bad.

Interesting effect: changing the instrumentation flag does not
affect code currently waiting. So if an event started with enabled='yes',
then a few rows might still be written in EVENTS_WAITS_CURRENT
after you changed to enabled='no'."

It would be possible to reset existing counters during updating of setup tables.
But we won't. It's probably more convenient for users if we "freeze" counters.

Description of another setup table, "PERFORMANCE_TIMERS", comes later.

The rest of this section has rationale and details about setup tables.
It is not necessary for everyone to read it.
Most people will benefit by skipping to the next section.

These details were discussed on dev-private, ending with this email:
[mysql intranet]/secure/mailarchive/mail.php?folder=4&mail=25914
We split "SETUP" into three setup tables, SETUP_INSTRUMENTS and
SETUP_CONSUMERS, and another table which we'll talk about in WL#4674.
Some columns that are applicable to instruments are meaningless for
consumers, and vice versa.

[ WHY WE USE SETUP TABLES ]

Maybe we'll have 1000 instruments plus 100 consumers.
That's 1100 sets of flags.

To handle so many, the use of tables give us abilities:
(1) to avoid new syntax.
(2) to turn multiple flag on/off at the same time for any user.
(3) to change by group, e.g. "all mutex" or "all mutex lock open".
(4) to see and change other people's flags, with appropriate privileges.
(5) to have multiple flags per instrument, e.g. both "enabled" and "timed".
(6) to inspect the complete setup.
(7) to save a setup and reload it later.

We discussed alternative flag-setting mechanisms
("SET performance_flags ...", "MONITOR")
but they lacked some of these abilities.

[ SETUP_INSTRUMENTS TABLE ]

NAME VARCHAR(128)
ENABLED enum('YES','NO')
TIMED enum('YES','NO')

Some alternatives to the name INSTRUMENTS:
PROVIDERS, INSTRUMENTATION.

Effectively there's a primary key over the NAME
column, although show CREATE TABLE won't show it.

There is no numeric identifier like INSTRUMENT_ID.

We considered a COMMENT column. This would be the implementor's
explanation of what is being measured, or what parameters mean.
We decided it's unnecessary.

We considered a TRACE_ENABLED column. This would cause contents to
be dumped to a file, as well as go into a table.
(Perhaps a trace file is more correctly a CONSUMER.)
We decided it's unnecessary.

We considered a DTRACE column. This would enable DTrace
provider code, if any, that's in the same place as the instrumentation.
We decided it's unnecessary.

We considered a PLUGIN column. This would call a user-supplied
plugin, with parameters = the same data as an EVENTS_WAITS row,
whenever the instrumentation is enabled.
We decided it's unnecessary.

We considered a MORE column. This would invoke getrusage,
or whatever other mechanism the chip or the operating system
provides, to give material in addition to what the timer gives,
for example "number of page faults".
We decided it's unnecessary.

The ENABLED and TIMED columns are updatable.
But see later section "Privileges".

There is no automatic ordering.

Contents may change in every version, even minor versions.

[ SETUP_INSTRUMENTS."NAME" COLUMN ]

Definition: NAME VARCHAR(128) CHARACTER SET UTF8 COLLATE UTF8_GENERAL_CI NOT NULL

Instrument names follow a 'taxonomy' convention:
Phylum: Instrument /* not used, there's only one phylum */
Class: Wait
Order Synchronization Object /* abbreviated 'Synch' */
Family Mutex
Genus Lock_open
Species Server
Individual program_name.c_line_xx /* not used, our lowest level is species */
These all go together in the the NAME column, separated by '/'s.

For example: 'Wait/Synch/MUTEX/LOCK_plugin'.

Whenever a person instruments a new piece of code,
he/she should specify a name following the convention.
We will have a test to ensure that.

We considered having individual columns, that is,
CLASS VARCHAR(), ORDER VARCHAR(), FAMILY VARCHAR(), GENUS VARCHAR(),
SPECIES VARCHAR(). We decided that's unnecessary.

We considered going down to the "Individual" level, that is,
'Wait/Synch/MUTEX/LOCK_plugin/x.c_line_xx'.
We decided that's unnecessary for the name in SETUP_INSTRUMENTS.
It will appear in some Consumer tables, though.

Alternatives to the name NAME are:
WAIT_NAME.

You can't update the NAME column.

[ SETUP_INSTRUMENTS."ENABLED" COLUMN ]

Definition: ENABLED enum('YES','NO') DEFAULT 'NO' NOT NULL

ENABLED always means: "the instrumentation is on".
However, it isn't necessarily on for all threads.
Setting another flag in the SETUP_ACTORS table, described
in WL#4674, would solve that ... but other means exist.

[ SETUP_INSTRUMENTS."TIMED" COLUMN ]

Definition: TIMED enum('YES','NO') DEFAULT 'NO' NOT NULL

This turns the timer on. If the instrumentation is enabled
but the timer is off, then the performance code always
generates NULL for start and end times.

You cannot specify the type of timer here. That's determined by
a setting in another table, SETUP_TIMERS. We'll talk about timers in
a later section.

There was a suggestion that we could have the timer on
"every 10th time". This could be done by making TIMED
an integer and changing it to '10' for all users.
We decided that's unnecessary.

[ SETUP_CONSUMERS TABLE ]

NAME VARCHAR(64)
ENABLED enum('Yes','No')

Alternatives to the name CONSUMERS are:
TABLES.

Effectively there's a primary key over the NAME
column, although show CREATE TABLE won't show it.

There is no numeric identifier.

The ENABLED column is updatable.
But see later section "Privileges".

We considered having a MAXIMUM_ROWS column, that is,
each table would be variable size. A little like
an equivalent of profiling_history_size.
We decided that's a bad idea, it's too risky.

[ SETUP_CONSUMERS."NAME" column ]

Definition: VARCHAR(64) CHARACTER SET UTF8 COLLATE UTF8_GENERAL_CI

This contains, non-taxonomically, the name of a table.
For example: 'EVENTS_WAITS', 'EVENTS_WAITS_HISTORY'.

[ SETUP_CONSUMERS."ENABLED" COLUMN ]

Definition: ENABLED enum('YES','NO') DEFAULT 'NO' NOT NULL

If a consumer is enabled, then any data that's gathered
(by the instrumentation) goes into, or is summarized into,
the consumer table. If it's off, the table is frozen.

Suppose, then, that all the CONSUMERS flags are disabled.
That means that there is no point gathering the data.
But there is no effect on SETUP_INSTRUMENTS.ENABLED when this happens.
The gathering of the data depends on SETUP_INSTRUMENTS.ENABLED
and perhaps the SETUP_ACTORS values (see WL#4674), not on SETUP_CONSUMERS.ENABLED.

[ SETUP_CONSUMERS."NUMBER_OF_FILLED_ROWS" column ]

(This column was removed on 2009-03-03. It turned out to be useless.)

[ PRIVILEGES ]

You also need privileges on the setup tables and columns.
There is no script in mysql_install_db which says
GRANT UPDATE (ENABLED, TIMED) ON SETUP_INSTRUMENTS TO [user];
GRANT UPDATE (ENABLED) ON SETUP_CONSUMERS TO [user];
... it's not needed, since root starts off with all privileges
and can grant to others.

There is no need for SETUP privilege.

Even if there was an ALTER privileges, you wouldn't be able to change
the default values of setup tables. We may document that users simply
lack privileges to do absurd things with PERFORMANCE_SCHEMA tables.
In fact, though, the main prevention is that the storage engine cannot
perform tasks like ALTER, so users will see a message referring to the
storage engine rather than a missing-privilege error message. ALTER is
always impossible, things like CREATE TABLE and DROP TABLE and CREATE
TRIGGER may be possible for privileged users but will be useless and
harmless.

[ CHANGES HAPPEN IMMEDIATELY ]

If you set a flag for another user, there's almost no delay enabling it.
That's different from what you might be used to when you say
"SET GLOBAL flag_name". It's necessary because you might want to
start monitoring something that's already running.

[ VIEWS ]

Doubtless, with experience, we'll find that the raw setup tables aren't
what users typically want to get at. They'll want views.

We won't give them any.

A little aside about what Peter thinks one of our goals is.
We are not trying to make this feature "easy to use".
We are trying to make it easy for everybody else to make
this feature easy to use. For example, we give scope for
the Enterprise folk to build a tool with it, for code
contributors to add their own instruments, and for end
users to add convenient views.

Anyway, the agents who really want to use these particular
tables in complex ways are either DBAs (who by definition
know what they're doing) or tools programs (which of course
will be written by people who don't worry about convenience).

[ EXAMPLES OF STATEMENTS ]

Assume default database = the performance_schema schema.

SELECT * FROM SETUP_INSTRUMENTS;
This lists all the instruments currently in existence. Sample rows of result:
+----------------------------------------------------------+---------+-------+
| NAME | ENABLED | TIMED |
+----------------------------------------------------------+---------+-------+
| wait/synch/mutex/sql/LOCK_des_key_file | YES | YES |
| wait/synch/mutex/mysys/KEY_CACHE::cache_lock | YES | YES |
| wait/synch/mutex/myisam/MI_SORT_INFO::mutex | YES | YES |
| wait/synch/mutex/tina/tina_mutex | YES | YES |
| wait/synch/mutex/archive/archive_mutex | YES | YES |
| wait/synch/mutex/archive/azio_container::thresh_mutex | YES | YES |
| wait/synch/mutex/blackhole/blackhole_mutex | YES | YES |
| wait/synch/mutex/heap/HP_SHARE::intern_lock | YES | YES |
| wait/synch/mutex/maria/MARIA_SHARE::key_del_lock | YES | YES |
| wait/synch/mutex/myisammrg/MYRG_INFO::mutex | YES | YES |
| wait/synch/rwlock/sql/LOCK_grant | YES | YES |
| wait/synch/rwlock/mysys/WT_RESOURCE::lock | YES | YES |
| wait/synch/rwlock/mysys/SAFE_HASH_mutex | YES | YES |
| wait/synch/rwlock/myisam/MYISAM_SHARE::key_root_lock | YES | YES |
| wait/synch/rwlock/maria/MARIA_KEYDEF::root_lock | YES | YES |
| wait/synch/cond/sql/PAGE::cond | YES | YES |
| wait/synch/cond/mysys/WT_RESOURCE::cond | YES | YES |
| wait/synch/cond/myisam/MI_SORT_INFO::cond | YES | YES |
| wait/synch/cond/archive/azio_container::threshhold | YES | YES |
| wait/synch/cond/maria/MARIA_SHARE::intern_cond | YES | YES |
| wait/io/file/sql/ERRMSG | YES | YES |
| wait/io/file/mysys/proc_meminfo | YES | YES |
| wait/io/file/myisam/backup_log | YES | YES |
| wait/io/file/tina/metadata | YES | YES |
| wait/io/file/maria/control | YES | YES |
+----------------------------------------------------------+---------+-------+

UPDATE SETUP_INSTRUMENTS SET ENABLED='no' WHERE NAME LIKE '%/RWLOCK/%';
This turns off instrumentation for all RWLOCK instrumentation.

CREATE TABLE test.SAVED_SETTINGS AS SELECT * FROM SETUP_INSTRUMENTS;
...
UPDATE SETUP_INSTRUMENTS
SET ENABLED=(SELECT ENABLED FROM test.SAVED_SETTINGS y WHERE x.NAME=y.NAME);
This saves the SETUP_INSTRUMENTS names and flags, and then (presumably
some time later) restores them.

To say that I want to stop updating of every performance_schema table but one:
UPDATE SETUP_CONSUMERS SET ENABLED='No' WHERE NAME<>'EVENTS_WAITS';

UPDATE SETUP_INSTRUMENTS SET ENABLED='YES';
UPDATE SETUP_CONSUMERS SET ENABLED='YES';
This turns on everything. We just recommend to ordinary users:
by default everything is off, but we recommend turning it on.
We may later require "per-actor" pre-filtering, perhaps with something like
INSERT INTO SETUP_ACTORS VALUES ('ALL','ALL','ALL'); /* see WL#4674 */

To disable all the Mutex instrumentation, I say
UPDATE SETUP_INSTRUMENTS SET ENABLED='NO' WHERE WAIT_NAME LIKE '/Mutex/%';

To disable the timer for all cases, I say:
UPDATE SETUP_INSTRUMENTS SET Timed='no';

PERFORMANCE_SCHEMA . EVENTS_WAITS_CURRENT
-----------------------------------------

Developer discussion of this section is in a dev-private thread
starting with
[mysql intranet]/secure/mailarchive/mail.php?folder=4&mail=25970
This section is now called EVENTS_WAITS_CURRENT because we won't use the
earlier proposals for table names.

An "event" is the execution of a snippet of code that is
recorded by an instrument. EVENTS_WAITS_CURRENT will show
what the latest event was for each thread (actor).

The PERFORMANCE_SCHEMA.EVENTS_WAITS_CURRENT table always exists, but if
everything in the setup_instruments is disabled, or all the actors are
somehow disabled (see WL#4674), then there are no current rows in it.

PERFORMANCE_SCHEMA.EVENTS_WAITS_CURRENT will be the first table we make, because:
1. originally mutex waits were an object of interest for another task.
Times have changed, but changing our starting place seems unnecessary.

Every user may be granted a SELECT privilege for any or all
PERFORMANCE_SCHEMA tables, and no other privilege is necessary for viewing.
This is reasonable so no user may update the tables
(except for some columns in the setup tables, see SETUP section), even
clearing the counters is impossible. Peter proposes that users
without PROCESS privilege will be able to see other users' EVENTS_WAITS
rows, even though such users cannot do SHOW PROCESSLIST for other users.
Some other PERFORMANCE_SCHEMA tables will have sensitive information, for
example SQL statement text, but we will just change the sensitive columns
to NULL, or skip the row, instead of denying access to the table.

If the server was started with mysqld --log-bin=binlog_file_name,
we will flag all performance_schema tables as unsafe to replicate
with both statement-level and row-level replication, so the replication
code can generate an appropriate warning or error when then the user
changes a persistent non-temporary table and refers to Performance
Schema table, for example by saying
CREATE TABLE t AS SELECT * FROM PERFORMANCE_SCHEMA.EVENTS_WAITS_CURRENT.
However, WL#2360 won't dictate what the error message is or what its
effects are, replication decides that.

If somebody wants SHOW as well as SELECT, that's fine, as a
separate task. The SHOW statement would read
PERFORMANCE_SCHEMA.EVENTS_WAITS_CURRENT.

[ THE CLASSES OF EVENTS TABLES ]

From the SETUP section you'll know that we have a hierarchy like this:
"
Class: Wait
Order Synchronization Object
Family Mutex
Genus Lock_open
Species Server
Individual program_name.c_line_xx /* not used, our lowest level is species */
These all go together in the the NAME column, separated by '/'s.
"

We will have a separate set of tables for each class. For example
there will be a table EVENTS_STATEMENTS_CURRENT analogous to EVENTS_WAITS_CURRENT.
In the rest of this section I'll only talk about EVENTS_WAITS
because for WL#2360 we only are instrumenting waits; however,
the same descriptions will apply for all the other EVENTS tables.

[ EVENTS_WAITS_CURRENT TABLE ]

Summary of definition:
THREAD_ID Actually a connection id, possibly a background-thread id
EVENTS_WAITS_ID A numeric identifier to distinguish from all other acts
EVENTS_WAITS_NAME Corresponds to SETUP_INSTRUMENTS.NAME.
Obviously it will start with 'Wait...'.
TIMER_* Three columns calculated with the aid of the timer
OBJECT_* Four columns that identify the object of the act

List of columns and data types:
THREAD_ID BIGINT
EVENT_ID BIGINT
EVENT_NAME VARCHAR(128)
SOURCE VARCHAR(64)
TIMER_START BIGINT
TIMER_END BIGINT
TIMER_WAIT BIGINT
SPINS INTEGER
OBJECT_SCHEMA VARCHAR(64)
OBJECT_NAME VARCHAR(64)
OBJECT_TYPE VARCHAR(64)
OBJECT_INSTANCE_BEGIN BIGINT
NESTING_EVENT_ID BIGINT

All VARCHAR(64) columns are CHARACTER SET utf8 COLLATE utf8_general_ci,
subject to change if we do something about identifiers for version 6.1.

Note added by Peter Gulutzan 2009-10-18: Earlier we said utf8_unicode_ci.
Alexander Barkov suggested a change to utf8_general_ci, Marc Alff and I
concurred, and Mikael Ronström made no objection.)

[ EVENTS_WAITS_CURRENT."THREAD_ID" COLUMN ]

THREAD_ID BIGINT

Identifier of the "thread" that's doing the event.

Actually this is a connection id, possibly a background-thread id.

It's the same as MySQL "Id" for those connection
threads for which "Id" is defined. It used to be
defined as INTEGER. But it's now BIGINT.
That's because we might identify some threads by
a special number. We might discuss special numbers in
later emails. For now just take it that it's a number
and doesn't have to be VARCHAR(64).

The original idea was that this would correspond to what SHOW PROCESSLIST
shows. However, we monitor more threads, and must supply different data.
Therefore the implementor will add a new PERFORMANCE_SCHEMA.PROCESSLIST
table which shows all threads. Having an extra temporary table in
PERFORMANCE_SCHEMA is deemed harmless and we won't require its removal
for purposes of WL#2360 review.

Former name: WAIT_CONNECTION_ID

[ EVENTS_WAITS_CURRENT."EVENT_ID" COLUMN ]

EVENT_ID BIGINT

Identifier of the event.

The columns (THREAD_ID, EVENT_ID) are unique
and not null. They constitute a primary key.
That is, if a row in EVENTS_WAITS_CURRENT has
THREAD_ID = 1 and EVENT_ID = 12345,
then there is no other row in EVENTS_WAITS_CURRENT with the
same THREAD_ID number and the same EVENT_ID number. Also
there is no row in one of the other
EVENTS_..._CURRENT tables (e.g. EVENTS_STATEMENTS_CURRENT)
with the same THREAD_ID number and the same EVENT_ID number.

EVENT_ID is a thread-local counter, that is,
there's one for each thread. It starts at 0 when the
corresponding THD object is created. It goes up by 1,
that is, there is a source-code instruction to increment
event_id, in the start_wait() for every instrument.

Originally EVENT_ID was 32-bit unsigned. We changed to BIGINT after
seeing that the number gets high quickly in real-world testing.

EVENT_ID is not useful for comparing to see whether
one event precedes another event in another thread. However,
timer_start will be useful for such comparisons unless the user
chooses a millisecond or tick timer.

[ EVENTS_WAITS_CURRENT."EVENT_NAME" COLUMN ]

EVENT_NAME VARCHAR(128)

Identifies what is being waited for.

This corresponds to a value in SETUP_INSTRUMENTS."NAME".
For example, "Wait/Synch/MUTEX/LOCK_plugin". In any
EVENTS_WAITS table, all names will begin with "Wait/...".

EVENT_NAME has a full hierarchy.

[ EVENTS_WAITS_CURRENT."SOURCE" COLUMN ]

SOURCE VARCHAR(64)

This is a string which tells us about the source code where the
wait-related instruction takes place. We recommend to the person
who adds the instrument: take this from __FILE__ and __LINE__.

[ EVENTS_WAITS_CURRENT."TIMER_START" COLUMN ]

TIMER_START BIGINT

The value of TIMER_START is the moment that an event began.

This will contain whatever the instrument code gets when it
calls the user-designated timer (for example CYCLE) after
adjustment to "canonical time units".

This column, and the related columns TIMER_END and
TMER_WAIT, are all BIGINT. Data types like TIME or
INTERVAL would be inappropriate even for MILLISECOND timers
because we lack fractional-seconds precision. The "start"
value is therefore an integral number of units since an
arbitrary moment like "when mysqld started".
There is no way to retrieve or reset this start value.

[ EVENTS_WAITS_CURRENT."TIMER_END" COLUMN ]

TIMER_END BIGINT

The value of TIMER_END is the moment that an event ended.

This will contain whatever the instrument code gets when it
calls the user-designated timer (for example CYCLE) after
adjustment to "canonical time units".

If the instrument has called the timer for TIMER_START but
has not yet completed the call for TIMER_END (presumably
because it's still executing the instrumented code), then
TIMER_END value is NULL.

[ EVENTS_WAITS_CURRENT."TIMER_WAIT" COLUMN ]

TIMER_WAIT BIGINT

This should tell us "how many time-units elapsed for this
instrumentation". This is called WAIT_TIME in Robin's document.

Some remarks about the meaning of TIMER_WAIT follow.
For understanding the following, remember that
an instrument generally looks like this:
get-the-time() for the sake of TIMER_START
do-the-event
get-the-time() for the sake of TIMER_END
But get-the-timer() itself takes time.

If TIMER_END is NOT NULL, then TIMER_WAIT = TIMER_END - TIMER_START.

If TIMER_END is NULL, then TIMER_WAIT is NULL.
Alternatively, we could have calculated it as
current_time - TIMER_START. Notice that the user has
no way of getting current_time directly (there is no
user-callable function that uses the current enabled
timer).

Either way, TIMER_WAIT is not exactly the same as "the time
elapsed for the event", because -- depending on the point
within the timer call when the value is actually retrieved --
either the TiMER_END or the TIMER_START call will not
have the "overhead" of the call itself. The minimum
overhead in cycles is available in PERFORMANCE_TIMERS,
but we won't convert it to canonical time units and add it
to TIMER_WAIT.

There are also a few cycles wasted for checking whether
timing is enabled, which will probably be a flag-test
based on settings in setup tables. These cycles are not
included in TIMER_WAIT.

We calculate TIMER_START and TIMER_END using the same setup values.
For example, if an instrument flag is "enabled" at the start, then
we regard it as "enabled" at the end, even if the user has changed it.
We'll discuss that in the section "Writing performance-structure components".

Former name: WAIT_TIME

[ EVENTS_WAITS_CURRENT."SPINS" COLUMN ]

SPINS INTEGER

This is the number of "spins" for a mutex, or else it's not.

Peter has to mention this because there might be a whole group
of columns whose values depend on what family the wait belongs
to (for example is it a mutex), and even on peculiarities of
the storage engine (for example a spin counter is often not
going to be available). So sometimes there is going to
be a bit of possibly-useful extra information, but usually
it's "not available" or "not applicable" and therefore NULL.

5.0 SHOW MUTEX STATUS had Spin_waits and Spin_rounds for a while.
But now you only see them when the moon is full.
http://dev.mysql.com/doc/refman/5.0/en/show-mutex-status.html
Someday somebody (perhaps not Sergei Golubchik) will say that
we must keep SHOW MUTEX STATUS / SHOW ENGINE INNODB MUTEX STATUS
until there's some other way to know about spins. Usefulness is
a less important consideration.

Peter considered: in that case, what folks really want is an
"EXTRA" column and we'd leave it up to the person who
adds the instrument, to decide what it means. But
we'll decide: no. If you have extra information, you
should petition to add new columns.

Two spins for the same mutex acquisition are not counted
as separate events. They won't get different EVENT_ID numbers.

[ EVENTS_WAITS_CURRENT."OBJECT_..." COLUMNS ]

OBJECT_SCHEMA VARCHAR(64)
OBJECT_NAME VARCHAR(64)
OBJECT_TYPE VARCHAR(64)
OBJECT_INSTANCE_BEGIN BIGINT

These four columns together identify the thing that's being
acted on. It's up to the implementor, and the person adding
the instrumentation, to decide "what's being acted upon".
For example, maybe when we're "reading" we're reading a table,
and at the same time we're reading a tablespace, and at the
same time we're reading a partition file. Which one of these
things is the "object" is something we deliberately keep vague.

But for a mutex, it's reasonable to think that this is
"usually null" except for a few cases where "it might be a table name".
And for file io, it's reasonable to expect a file name and
a byte number or block number.

Maybe VARCHAR(64) is small but we won't increase it.
Just truncate too-long values.

There is no OBJECT_CATALOG column, since we probably won't
support catalogs for a long time.

[ EVENTS_WAITS_CURRENT."NESTING_EVENT_ID" COLUMN ]

NESTING_EVENT_ID BIGINT

This is always NULL. The idea is that we'll store the EVENT_ID of
the immediately-encompassing event. For a wait, the nesting id will
be of a stage or a statement.

[ PRE-JOINED COLUMNS ]

There has been discussion of essential columns that come with the instrumenting.
But what about information that we could steal from elsewhere?
For example INFORMATION_SCHEMA.PROCESSLIST has ID, USER, HOST, DB,
COMMAND, TIME, STATE, INFO.

If we want to know, for a given wait, "what user and host
and statement text" it is associated with, we could join:
SELECT ... FROM PERFORMANCE_SCHEMA.EVENTS_WAITS_CURRENT a,
INFORMATION_SCHEMA.PROCESSLIST b
WHERE a.thread_id = b.id;

Or we could assume that users will usually want those columns,
we'd add USER and HOST and INFO to EVENTS_WAITS_CURRENT.
Such columns were in Robin's original proposal for sesscurrwait.
Denormalizing is a fine thing in this situation.

Peter believes we'll have to steal columns from other tables
to avoid joining. But he's not positive. The answer will
depend on performance, and perhaps on user feedback.
He is not proposing that pre-joined columns should exist at this time.

[ OTHER EVENTS TABLES ]

The EVENTS_WAITS_CURRENT table is the fundamental table.
All the other EVENTS tables are derived from CURRENT information.
At least, that's the way that things will appear.

The other EVENTS tables will be histories, summaries, snapshots
and metrics. For example (NB: this includes tables in future versions):

EVENTS_WAITS_HISTORY
EVENTS_WAITS_HISTORY_LONG
EVENTS_WAITS_SUMMARY
EVENTS_WAITS_SUMMARY_BY_THREAD_BY_CLASS
EVENTS_WAITS_SUMMARY_BY_THREAD_BY_ORDER
EVENTS_WAITS_SUMMARY_BY_THREAD_BY_FAMILY
EVENTS_WAITS_SUMMARY_BY_THREAD_BY_GENUS
EVENTS_WAITS_SUMMARY_BY_THREAD_BY_TIME
...
EVENTS_WAITS_SUMMARY_GLOBAL_BY_CLASS
...
EVENTS_WAITS_SNAPSHOT
...
EVENTS_WAITS_HISTORY_METRIC
...

Each of these table names may appear in SETUP_CONSUMERS.
Aggregation topics will be in later sections like "Summaries".

Writing performance-structure components
----------------------------------------

For illustration suppose we have: one actor = a connection thread,
and one instrument = a mutex acquisition. What code do we need to
support one consumer = the events_waits_current table?
By answering that we'll show the important structures and guidelines.

Start with the fact that #ifdefs exist. The important one is
HAVE_PERFORMANCE_MONITORING. There's also HAVE_PERFORMANCE_SCHEMA
which is what goes on if one says ./configure --with-perfschema.
(We have two #defines rather than one, because it's conceivable
that someday we'll want the performance interface without the
schema.) By default both #defines are on for a generic build like
"BUILD/compile-pentium" (notice that we didn't say "-debug" or "-max").
There's also a "Performance Schema Engine", which we'll describe in the
section "Reading performance-structure components".
Recommendation to the Implementor: do not use the abbreviation
PSE for Performance Schema Engine.

Currently the MySQL server code has this:
mysql_mutex_lock(X)
We want it to have this:
#ifdef HAVE_PERFORMANCE_MONITORING
/* Look at SETUP_INSTRUMENTS."ENABLED" column value for mutex X */
if (setup.instruments.enabled" = "yes" and actors.enabled = "yes")
start_wait(parameters);
mysql_mutex_lock(X);
end_wait(parameters);
else
mysql_mutex_lock(X); /* instrumentation disabled */
#else
mysql_mutex_lock(X); /* instrumentation undefined */
#endif

Yes, there will be a macro, the macro will be the contents of the LLD.

And performance-structure has these components:

(I'm not specifying whether the structure is private or opaque or
secret; nor am I specifying whether the components are really
stored or can be accessed via pointers to consts; nor am I specifying
order and type! Those are implementor decisions. This is just the
easiest way to clarify what's necessary. The field names correspond
to column names in EVENTS_WAITS_CURRENT.)

NAME LENGTH DESCRIPTION
thread_id ? We're storing in a thread-specific area.
Not necessarily THD and not necessarily local.
But if it's global, it's indexed by thread_id.
So there's no need to store thread_id now.
event_id 8 A thread-local unsigned auto-increment number.
event_name 8 An integer code for the type of event, or a
pointer to a name of the event, or a pointer to
a structure which has event_name + flags e.g.
PERFORMANCE_MONITORING_GLOBAL_FLAG.
timer_start 8 value returned by timer function
timer_end 8 value returned by timer function
spins 4 only certain storage engines will set this
object_id 8 How can we get information for object_schema,
object_name, and object_type?
Minutes of the WL#2360 meeting,
Riga 2008 say:
- Naming tables
Igor: what's the problem with a string copy ?
Sergei Petrunia: implement a name dictionary
To Igor we have to say "maybe we have to do a string
copy".
To Sergei Petrunia we have to say: "not until
WL#2760".
We could store an entire string here, e.g.
"schema_name.table_name table", and indeed
we might have to do so until there is a true
object id, see WL#2760 re unique numeric identifiers.
If it's a mutex, this can point to a key address.
We can defer discussions of this until we have to
refer to schema objects.
object_instance_begin 8 If object type = file, this is offset within file.
If object type = table, this is some sort of row id.
If object type = mutex, this can have address in
memory.

In earlier editions of this specification, there were more columns:
timer_release 0 For when a mutex was freed
string-for-comments ? For "Statements" text, for "sql_functions" name, etc.
Those columns no longer are there.

For the SOURCE column in EVENTS_WAITS_CURRENT, Peter has not indicated the
underlying performance-struc. Peter expects that there will be a constant
pointer to event_name, and recommends that the source be concatenated to that
(it can be stripped later). Everyone expects that the value will be a
concatenation of __FILE__ and __LINE__ but we're stating that only as a
recommendation to the person who adds the instrument.
If something else is more appropriate, fine, as long as it's constant.

So start_wait() operates thus:
* Figure out what the active timer function is, as decided by
SETUP_TIMERS.TIMER_NAME.
* Point to the new instance of performance-structure.
Getting the location is the job of the Implementor.
* Fill in event_name, object_id, object_instance_begin.
Supplying the parameters for these is the job of the person who adds the
instrument. (But the macro should lighten this task.)
* Fill in timer_start.
Calling the active timer function is the job of the Implementor.
* Set timer_end = 0. So if timer_end = 0, we know a job is still waiting.
(The performance_schema . events_waits_current section says this appears as
NULL.)

And end_wait() operates thus:
* Figure out the same active timer as start_wait() figured out.
* Point to the same place that start_wait() pointed to.
* Fill in spins if it's relevant.
* Fill in timer_end.
Calling the active timer function is the job of the Implementor.

If (end-timer - start_timer) i.e. timer_wait = 5 000 000,
we know that a job waited for 5 microseconds before the mutex call finished.
Maybe. This figure is in "pseudo picoseconds" (see "Canonical Time Units").
It is probably reasonably close to the truth if the counter is
really based on time of day. However, we make no guarantees of
similarity to wall-clock time, and for cycle timers there could
be considerable variance, and certainly it's not accurate to the
nearest picosecond, not even to the nearest 1000 picoseconds.
Users are free to make calculation themselves after checking for
frequency changes, knowing the risks.

This is a recommendation to the Implementor, it is not a requirement:
Peter Zaitsev inspired the following suggestion.
Suppose there are two pieces of instrumented code closely together:
INSTRUMENTED_CODE_MACRO(mutex_1); INSTRUMENTED_CODE_MACRO(mutex_2);
Since there is no code between the pieces, probably nothing has happened.
In that case: do not bother to test the enabled flags for the second
piece (assume they're the same as for the first piece), and do not bother
to calculate timer_start (assume it's the same as timer_end from the
first piece, plus 1 canonical time unit).
We can dismiss the possibility that we'd get swapped out between the pieces.
But how do we know there is no code between the two pieces?
There might be automatic ways, for example look at row number of
assembly output. But there should be a manual way, so that the
person who is adding the instrument can specify something like
start_wait_without_checking_enabled_flags_and_calling_timer();
instead of start_wait().

Ring Buffers and PERFORMANCE_SCHEMA . EVENTS_WAITS_HISTORY
----------------------------------------------------------

The performance-structure occurs N times for each thread.
This is a ring (circular) buffer. N is fixed, it's a #define with a value of 10.
We do not allow changing N at runtime.
We do allow changing N at startup with mysqld --performance_structures = 20.
We do allow changing N with a ./configure option.
We do allow for a long history in a separate table, which might not be enabled.

Since performance-structure occurs N times per thread, there is an
additional pointer or index -- call it performance_structure_current --
so we know "what occurrence of performance-structure is current".
The Provider's Start procedure will re-calculate performance_structure_current.

One can see what the last N events were, with a separate
table in the PERFORMANCE_SCHEMA schema, namely EVENTS_WAITS_HISTORY.

The definition of PERFORMANCE_SCHEMA . EVENTS_WAITS_HISTORY is the
same as the definition of PERFORMANCE_SCHEMA . EVENTS_WAITS_CURRENT. In fact
EVENTS_WAITS_CURRENT can be regarded as a view of EVENTS_WAITS_HISTORY,
containing the last 1 row per thread rather than the last N rows.

It is possible to clear any history table with a TRUNCATE statement.
For example:
TRUNCATE TABLE PERFORMANCE_SCHEMA.EVENTS_WAITS_HISTORY;
After this, a SELECT from EVENTS_WAITS_HISTORY will return zero rows,
although the size of the underlying performance-structure is unaffected,
and there will soon be new rows. Effects on EVENTS_WAITS_CURRENT are
undefined. The required privilege is DROP; however, granting DROP on a
history table will not enable anyone to really drop it.

Reading performance-structure components
----------------------------------------

To handle an SQL statement which reads from an EVENTS table, such as
SELECT ... FROM PERFORMANCE.EVENTS_WAITS_CURRENT ...
the Consumer must read from the performance-structure in-memory areas.

With performance-structure setting as described in the previous sections,
it is easy for the consumer to see at any given time: whether a job
is doing mutex acquisition, how long it has waited, what the
mutex is for, and so on. The instrumentation provides it all.

Notable aspects that affect reading are:
* The Consumer has to do some interpreting and translating of what
is in performance-structure. This is okay. The provider must be fast.
The consumer can be slow. It's conceivable that in some shops the
instrumentation will be on for everything all day and so will be
invoked millions of times, and there will be a summary at the end
of the day so reading will be invoked a few dozen times.
* Because of the definition of the performance schema (see earlier
section "The performance schema") there must be tolerance for errors.
It seems reasonable to hope that we'll show sensible stuff 999 times
out of 1000, but we give unobtrusiveness much higher priority than
correctness.

The reading will be via a new "performance schema" storage engine.
That means that SHOW ENGINES will show the performance schema,
and it means that "read row" calls become, after appropriate
transmogrification in the storage-engine interface, accesses to
the in-memory performance structures. We already know that this
is feasible because Marc Alff's prototype does it this way.

Consumer Snapshot
-----------------

The Consumer can take a snapshot at the start of the statement:
if (the statement refers to a performance_schema table e.g. events_waits_current)
for (each thread)
for (each occurrence of performance-structure)
copy performance-structure data to local temporary in-memory table
After that, all readings from events_waits_current would be readings
of the local temporary table, and therefore consistent.
But assuming 1000 threads * 10 performance-structures per thread *
100 bytes per performance-structure, this is a 1,00,000-byte copy.

Alternative 1: We could only copy relevant-looking rows:
if (the statement refers to a performance_schema table e.g. events_waits_current)
for (each thread)
for (each occurrence of performance-structure)
if the data matches the WHERE clause conditions
copy performance-structure data to local temporary table

Alternative 2: We could copy more rarely:
if (a 3-second interval has elapsed since the last snapshot)
for (each thread)
for (each occurrence of performance-structure)
copy performance-structure data to global temporary table

Alternative 3: We could keep snapshots until end-of-transaction
instead of until end-of-statement. This could be an option that
depends on isolation level.

We reject all the alternatives.

Timer and Time Units
--------------------

We need a timer which is fast (less than 100 cycles to get)
and precise (better than one tick on a wall clock). For worklog task
WL#2373 "Use cycle counter for timing", Peter supplied code for
accessing many low-precision or high-precision timers, and showed
that "a choice of a low-level cycle counter is usually best".
The implementor will use an extract of the code attached to WL#2373.

. Depending on the platform, there will be a timer for
at least one of: cycles, nanoseconds, microseconds,
milliseconds, ticks.

. The timer initialization function will discover what the
possible choices are, and the characteristics:
underlying routine, overhead, resolution. This will
be made available in a table PERFORMANCE_TIMERS.

. For most platforms, the default will be a cycle counter unless
the timer initialization function shows problems with it.

. For some platforms, at least as fallback, we'll use a wall clock
This may be useless, but we want the package to "work" on all
platforms even if the result is useless. Alternatively we could let the
package "work", but sets wait-related columns to zero if there
is no way to get a fast counter.
We will allow setting a global variable.
UPDATE SETUP_TIMERS SET timer_name =
{ 'CYCLE' | 'NANOSECOND' | 'MICROSECOND' | 'MILLISECOND' | 'TICK' }.
Not all of these options are always available; check SETUP_TIMERS.
"UPDATE SETUP SET timer_name = 'MICROSECOND'" usually
involves gettimeofday.

Times that the user sees in EVENTS_WAITS_CURRENT.TIMER_START or
EVENTS_WAITS_CURRENT.TIMER_END are BIGINTs representing canonical
time units. Extra precision sometimes looks useless but is good for aggregating.

The rest of this section has rationale and details about timer tables.
It is not necessary for everyone to read it.
Most people will benefit by skipping to the next section.

These details were discussed on dev-private, starting with this email:
[mysql intranet]/secure/mailarchive/mail.php?folder=4&mail=25923

The instrumentation uses timers a lot so getting it right is good.
We must tell the user what timers are available; we'll put that in
the PERFORMANCE_TIMERS table. We must give the user a choice among
the available timers; we'll put that in the SETUP_TIMERS table.
And of course we must have the timers; code is already in for
WL#2373 "Use cycle counter for timing" so later in this section we'll
just say how WL#2373 relates to WL#2360.

[ PERFORMANCE_TIMERS TABLE ]

PERFORMANCE_TIMERS is a read-only table in the performance_schema schema.
It will look like this on an ordinary 1.6-MHz x86 Acer laptop:

mysql> select * from PERFORMANCE_TIMERS;
+-------------+-----------------+------------------+----------------+
| TIMER_NAME | TIMER_FREQUENCY | RESOLUTION | TIMER_OVERHEAD |
+-------------+-----------------+------------------+----------------+
| CYCLE | 1595661691 | 1 | 14 |
| NANOSECOND | 1000000000 | 1 | 1350 |
| MICROSECOND | 1000000 | 1 | 1227 |
| MILLISECOND | 1004 | 1 | 1319 |
| TICK | 103 | 1 | 947 |
+-------------+-----------------+------------------+----------------+
6 rows in set (0.13 sec)

The table shows highest-frequency timer is CYCLE
(the figure "1595661691 / second" is the same as "1.6GHz" which
is my machine's speed); highest-overhead timer is NANOSECOND
(it takes 1350 cycles to get one nanosecond timing); and the
"interval" is always 1, which for instance means that the
millisecond-frequency timer really does change every 1 millisecond.

The user, or the user's program, can determine from this what the
best timer is for a given purpose and time, taking into account
the timer's frequency and the timer's overhead. The user must also
take into account the well-known problems that some timers may cause,
but that is a matter for documentation rather than PERFORMANCE_TIMERS.

There are always five rows, ordered from highest frequency to lowest:
CYCLE Usually this is an on-chip counter which we get via
assembler instructions, such as RDTSC on x86.
NANOSECOND Usually this comes from gethrtime() or clock_gettime()
or read_real_time(). Or it's unavailable. Clocks that
really tick over one billion times a second are rare.
MICROSECOND Usually this comes from gettimeofday() on non-Windows,
and from QueryPerformanceCounter on Windows. With
QueryPerformanceCounter the timer frequency is not
1,000,000 but sometimes it's the same ballpark.
MILLISECOND Usually this comes from ftime() if non-Windows, or
GetTickCounter() if Windows.
TICK A tick is the time between processor interrupts. It's
been 1/100 second for a long time. But perhaps during
the lifetime of MySQL 6.x many/most customers will see
timer_frequency = 1000 for ticks.

All these timers show "elapsed time". There are no available timers
for "time used by this thread" or "elapsed time / kernel only".
There are other timers, e.g. getrusage(), which try to give
such data, and which SHOW PROFILE tries to use. We won't use getrusage().

On some platforms we couldn't find something for, or didn't bother to code for,
every frequency. For example, the CYCLE counter might be unavailable.
In that case users will still see a CYCLE row, but it will be obvious
from the other data that we picked whatever was closest, for example
it will be the same as the MICROSECOND row.

PERFORMANCE_TIMERS is dynamic. We know that the data could change,
particularly the frequency for CYCLE. But we don't only calculate the
table values once on startup. We recalculate every time
that a user selects from PERFORMANCE_TIMERS.

What Peter has seen so far is: TICK has high overhead as well as low frequency
so it looks like a bad choice on all platforms that we support.
(This surprises him greatly; he can't explain the numbers that he gets.)
So far we haven't found a use for ticks, but initially we want
all possibilities to be there. If no user reports a case where
it's worthwhile, we'll remove the TICK row for GA.

[ PERFORMANCE_TIMERS."TIMER_NAME" COLUMN ]

Definition: TIMER_NAME ENUM('CYCLE','NANOSECOND','MICROSECOND','MILLISECOND','TICK')

The name corresponds to the name in the SETUP_TIMERS table.

[ PERFORMANCE_TIMERS."TIMER_FREQUENCY" COLUMN ]

Definition: TIMER_FREQUENCY BIGINT

If a clock ticks once per second, then its frequency is 1.

Typically, for a CYCLE timer, the frequency will be about
the same as the rated frequency of the CPU. Thus on a 1.6-GHz
laptop one should see a number close to 1.6 billion, but when one
switches to "Power Save" mode one should see a number close to 800 million.
Peter says lots about fluctuating frequency for CYCLE in WL#2373.

For the nanosecond and microsecond timers, the numbers are
fixed: 1,000,000,000 and 1,000,000. These are the only numbers
in the table which are fixed instead of calculated at runtime.

High frequency is better than low frequency. But
in theory, any one of the timers could be okay. Even if frequency is
only 100 times per second, the data is good enough for the
most frequently-occurring real-world performance-analysis tasks
(according to some advice by people with real-world experience).

[ PERFORMANCE_TIMERS."RESOLUTION" COLUMN ]

Definition: RESOLUTION BIGINT

Sometimes resolution is defined as the inverse of frequency,
that is, if frequency is 100 times per second then resolution
is 1/100 second. That is not the meaning here.

What we're trying to say with "resolution" is: how meaningful is
the frequency? Take for example the case of MySQL's 'melody'
machine, which has clock_gettime(), which cheerily reports time
to the nearest billionth of a second. However, if you call it
twice, you either get the same figure both times, or you get
different figures but the second figure is one-millionth greater
than the first, instead of 1-billionth greater than the first.
It "jumps" by millionths not billionths. Peter wanted to call
this column TIMER_INTERVAL, but Sergei insisted it's resolution,
and nobody else objected to calling it RESOLUTION.

Intervals greater than 1 are very common for nanosecond timers
and for millisecond counters under Windows. See the tests in
rdtsc3_notes.txt which is attached to WL#2373.

[ PERFORMANCE_TIMERS."TIMER_OVERHEAD" COLUMN ]

Definition: TIMER_OVERHEAD BIGINT

This is the minimal overhead of a timer. When we initialize,
we call the timer 20 times and what you see in "overhead" is
the smallest figure. In fact the instrumentation will invoke
the timer twice (before and after what's being instrumented)
so multiply this figure by 2 to get the minimal timer overhead
per instrumentation.

This column is important because for 'always-on' monitoring
we want the timer with the low overhead. In fact the low-overhead
timer has always been the cycle timer, as is evident again from the
test notes in rdtsc3_notes.txt which is attached to WL#2373.

A thought experiment: These figures are simplified but not absurd:
suppose a cycle takes 1 billionth of a second,
suppose timer_overhead = 1000,
suppose we monitor 1000 events per second per job,
suppose there are 100 simultaneous jobs.
When that happens, total instrumentation overhead is more
than 20% of the time, and the DBA will have to tweak some dials.

[ PERFORMANCE_TIMERS."OTHER" COLUMN? ]

Each timer has an underlying routine which depends on platform.
For example I mentioned that the microsecond timer depends on
'gettimeofday' for non-Windows, or depends on 'queryperformanceounter'
on Windows. But we won't report that.

Cycle timers may have a variable frequency.
So we could recalculate periodically and have
columns that show minimum / maximum rates.
But we won't report that. (Recalculation does happen, though,
whenever you select from the PERFORMANCE_TIMERS table.)

Timers could have a 'base' or 'epoch'. This might be a moment in 1970,
or it might be the time that the server started, or perhaps someday we
cold allow it to be reset. But we won't report that.

[ SETUP_TIMERS TABLE ]

SETUP_TIMERS is a tiny table which just contains the information:
which of the available timers will be used?

NAME VARCHAR(64)
TIMER_NAME enum('CYCLE','NANOSECOND','MICROSECOND','MILLISECOND','TICK')

There is always one row in the table. On my machine it looks like this:

CYCLE will tend to be a default because of its lower overhead.
Suppose that, after experiencing with my particular installation that
CYCLE is not I want, I decide to use a MILLISECOND timer instead.
I'd say:

UPDATE SETUP_TIMERS SET TIMER_NAME='MILLISECOND' WHERE NAME='Wait';

This might return an error, or might set for the next-best timer,
if there's no millisecond timer.

Notice that you can't choose 'NONE'. There always is a timer.
If you want to disable timing, you do that by setting ENABLED='No'
in SETUP_INSTRUMENTS or by disabling actors (see WL#4674).

[ SETUP_TIMERS."NAME" COLUMN ]

The NAME is the same as the 'class' in SETUP_INSTRUMENTS.
Remember the taxonomy in SETUP_INSTRUMENTS? It looked like
"
Instrument names follow a 'taxonomy' convention:
Phylum: Instrument /* not used, there's only one phylum */
Class: Wait
Order Synchronization Object
Family Mutex
Genus Lock_open
Species Server
Individual program_name.c_line_xx /* not used, our lowest level is species */
These all go together in the the NAME column, separated by '/'s.
"
At this time the only class we're instrumenting is 'Wait'.
So there's only one row in the SETUP_TIMERS table, named 'Wait'.

Eventually we'll monitor long and infrequent routines, not just
waits. In that case, arguably, the overhead matters less and the
relation to wall-clock time matters more. So for a different class,
we might have a different row, okay. But two things in the same class
and order, like MUTEX and RWLOCK, always have the same timer.

You can't update the NAME column.

[ SETUP_TIMERS."TIMER_NAME" COLUMN ]

The TIMER_NAME is a reference to the possible names in the
PERFORMANCE_TIMERS table: 'CYCLE', 'NANOSECOND', 'MICROSECOND',
'MILLISECONDD', or 'TICK'.

TIMER_NAME is updatable.

[ PRIVILEGES ]

You also need privileges on the SETUP_TIMERS table and columns.
Effectively, there is a script which says
GRANT SELECT ON PERFORMANCE_TIMERS TO [user];
GRANT UPDATE (TIMER_NAME) ON SETUP_TIMERS TO [user];

Implementation details are the same as for other setup tables.

[ WL#2373 ]

WL#2373 "Use cycle counter for timing" has the timer code.
For description, or for the code, look at WL#2373 and its attachments.

But how does this relate to WL#2360? Thus:
* WL#2373 stays in existence and Peter will keep it up-to-date for
a while. It has been extremely useful for just testing
the timers on multiple platforms, and perhaps QA will
find more uses for it. But it will not be independently
reviewed. WL#2373 was only made for the sake of WL#2360.
In theory a high-resolution timer might be useful for
other purposes (e.g. UUID), but we're not going to put
such tricky code in unless we are doing real monitoring.
* Marc incorporates the timer code in MySQL server code.
(He's already done this actually, in mysql-6.0-perf.)
* The WL#2360 code reviewers are responsible for checking
all the code in Marc's patch, including inline assembler.
* WL#2373 does not have to be on the "6.x roadmap".

[ CANONICAL TIME UNIT ]

We will have "canonical time units". This means that, no matter
what timer you choose, MySQL will adjust it so that it's comparable
with results from other timers.

The unit is smaller than nanoseconds and smaller than cycles
for efficiency reasons.

the canonical time unit is a picosecond, 1 / 1 000 000 000 000,
one millionth of a microsecond. Convert to picoseconds during instrumentation.

For example: suppose we know that the cycle timer has a frequency
of 1.8GHz, i.e. 1800 million cycles per second, so our Multiplier is
(1 000 000 000 000 / 1 800 000 000) = 555.56. Round to 556.
Store it forever in global 64-bit unsigned integer "Multiplier".
Now suppose the timer returns 8888 (cycles).
Multiply that by Multiplier = (8888 * 556) = 4941728. That's what
goes into performance-structure . timer_start or timer_end.

Precision loss: variance from the original exact value will occur
due to rounding. The example shows one of the greatest variances
that can appear in real 1.8-GHz computers: 0.5 out of 555. That is, 0.1%.
That is, if we reverse the calculation by evaluating
4941728 / (1000000000000 / 1800000000), we get 8895, not 8888.
This is negligible for our current purposes.

Overhead: Multiplying by an integer is less than 10 cycles on Peter's x86.
It occurs twice per instrumentation, and very rarely there would be
a cache miss for getting Multiplier, but a cost of 20 cycles
per instrumentation is the worst Peter can imagine. The key point is:
we do not divide or resort to floating point.

Overflow: According to Peter's idea of how arithmetic should work,
(power(2,64) / 1000000000000) / (60 * 60 * 24) = 213 days.
So if we say the epoch is always "when the instance started",
overflow is quite unlikely.

Stepping: the Multiplier is the wrong value if the frequency changes.
Recalculating the frequency is very slow, and is not justified given
that frequency change is both unlikely to happen and unlikely to
cause difficulty if it does happen. Remember, we originally agreed
that a cycle counter is okay regardless of frequency change.

Reading: during reading you certainly don't want picoseconds, so you'll
have to divide to get microseconds, cycles, or whatever. At this point,
you'll have to divide by 1 000 000 for microseconds, divide by
frequency for cycles, and so on. So, the reader does get slower.

So the Benefit is that you always see the same time units.

Epoch: so that timer_start with one timer is comparable with timer_start
from another timer, we now must have a common epoch. This can't be fixed
on some date (see above re 'Overflow'), so we standardize on:
epoch = time the server started. For example, if we have a microsecond
timer, we will say something like:
x = getttimeofday
x = x - mysqld-startup epoch value
x = x * 1000000 /* i.e. number of microseconds in a picosecond */
timer_start = x, or timer_end = x.

Putting it another way, this means that in effect,
a value T returned by a timer is adjusted as:
Normalized_t = (T - To=server-startup) * timer-unit-to-pico-factor
and that's what gets recorded in TIMER_START, TIMER_END, and TIMER_WAIT.

Let's make sure it's clear that there are no additional miracles.
We'll say the dread words 'not a bug' if anyone ever claims these are flaws:
* We calculate cycle frequency when the mysqld instance starts.
We do not recalculate later. If frequency changes, that's acceptable.
* If different cores have different frequencies, that's acceptable.
* Because the epoch is from the mysqld start, you cannot usefully compare
timer_start values from different instances, or timer_start values from
logs made on different days.

Summaries
---------

The "Summaries" section has been shifted to a new worklog task,
WL#4816 PERFORMANCE_SCHEMA Summaries

Sampling
--------

We could have a separate thread that reads performance schema events data
once per second, filters so that "Idle" sessions are ignored,
and puts the result (a snapshot of the status of active sessions)
in a "sampling table".

Similarly, "metrics" (e.g. rates, like number-of-waits-per-second)
are usually calculated at long intervals, like one minute. We will
need these for alerts.

Finally, we could sample every 30 minutes by default, and keep
the result in a permanent on-disk table, purging every week.

The server will not do sampling, metrics, or workload repositories.
We believe that the Enterprise Monitor crew can do it, with a
scheduler such as CREATE EVENT or with their own scheduling methods.
So our job as server people is to make sure that the tools people can
make things like active session history and automatic workload
repository without encountering showstopper-like difficulties.

Problem: background threads and processes
-----------------------------------------

[ The contents of this section are removed. We can remove
the heading and the table-of-contents entry any time. ]

Problem: no mutex for reading performance-structure
---------------------------------------------------

It is possible that a thread (the Provider) will be writing to
performance-structure while another thread (a Consumer) is reading it.
We won't use a mutex to control every time a Provider wants to
overwrite performance-structure, because that would add too much
overhead. How can we then prevent crashes or ridiculous results?

Conflict between writers and writers: threads write only to their own
(thread-specific) memory areas, so two "writers" will not
conflict. There are only two exceptions to this rule:
1. Clearing. We intend to support SQL statements which will
remove (zero) everything for an EVENTS table. This could indeed
occur at the same time as some other job is updating the same
memory. However, the only effect is that the clearing doesn't
(totally) succeed, sometimes. We document that, and forget it.
2. Thread termination. If we're reading a thread's memory, and
the thread's memory is deallocated, that's no good. So this is
controlled by a mutex. There's nothing new about this, though.
And it's only a case of "the reader must watch a mutex"; there
is no mutex necessary for the provider's instrumentation.
There is no other time when memory might be allocated or
deallocated affecting a reader; we insist on fixed-size
storage at all times after mysqld has initialized.

Conflict between readers and writers: readers will maybe
be reading while writers are writing in the next structure
that the reader is about to look at, or writing in the same
structure, or even writing to the same field in a structure.
Some scenarios, which would be classed as acceptable errors
if it was possible for them to happen:
* It's a 32-bit OS, and the reader reads the top half of an
address, then the writer rewrites, then the reader reads the
bottom half. This may not happen (see below), but it's
recommended that Readers could check all addresses for sanity,
though not for correctness.
* It's a UTF8 string, and the reader sees exactly half of
it, including the first byte of a multi-byte character.
Readers could check for invalid strings and truncate them.
Again, this may not happen (see below).
* It's a SELECT with a WHERE and a GROUP BY and a JOIN and
an ORDER BY and a UNION on many different ACTS tables.
There's going to be a lot of read inconsistency happening.
Readers don't have to check for this. The danger is only
with logic like "if a > b ... else if a == b ... else if
a < b ... else assert(a makes no sense)". Perhaps, though
it's unlikely, you might see assertions with debug builds.
* It's a writing job but it does some reading as well,
from SETUP contents. At the same time somebody could be
updating a SETUP table so there's temporary inconsistency.

Sergei Golubchik has proposed an algorithm which requires no
mutexes and no memory barriers, and yet is safe. Marc Alff has
agreed to implement this algorithm. Therefore the implementor
is not "required" to perform sanity checks. Peter Gulutzan
only says that such cautious procedures are "recommended".
See dev-private email thread starting at
[mysql intranet]/secure/mailarchive/mail.php?folder=4&mail=26195

Combine all that with other potential error sources,
such as cycle counter drifting (see WL#2373), and it's
clear that the implementor and the users must take care.

Remember the upside: Overhead is low (see "Overhead" section).

Overhead
--------

Guessing from the fact that all MOV or ADD instructions are cheap,
and assuming very good luck with caching, the overhead "per event"
(i.e. for the Provider's Start and End and Release procedures) might
be around 200 cycles on a P4. Since events are very frequent, Peter
has emphasized that the overhead "per event" must be minimal,
minimal, minimal. That's why he is sure that the Provider should use a
cycle counter, should not itself use mutexes unless it's terminating,
and should pass on all calculations to the Consumer.

Our overall overhead should be 3% or less. Peter thinks that's
possible because he's seen Kelly's timing tests with DBT2 runs.

If everything is disabled in the setup tables, the overhead is merely
that we waste some memory for unused performance-structure and that
we have test-and-jump instructions for every Provider procedure. And
if there is no timing (setup_instruments.timed='no'), the overhead
likely goes down about 75%. On the other hand, if there is timing
and setup_timers.timer_name = 'microsecond', the overhead likely goes
up, by as much as 4000%.

For a modern x86 server the requirement is 200 cycles,
for one instrumentation, with cycle timing on, with summaries and
optional SETUP flags off. If we can't do 200 cycles, but
we still can do less than 1000 cycles, then we'll proceed but we'll
have to be less ambitious about adding summary tables. If we
can't do less than 1000 cycles, then we won't proceed.
The main thing is: we have a specific statement what is meant by
"too much overhead".

UPDATE. Kelly Long did a throughput DBT2 test with what Marc Alff has coded so
far, on an AMD 8-way Dual Core. The results in transactions per second, with
columns for 8 or 16 or 32 or 64 or 12 or 192 connections, comparing "IN" i.e.
performance interface in and instruments all enabled, versus "OUT" i.e.
performance interface not even compiled in, was:
8 16 32 64 128 192
----- ----- ----- ----- ----- -----
IN 15096 18049 21259 20494 19886 19293
OUT 15680 21277 22183 21025 20474 19174
Overhead 3.72% 15.17% 4.17% 2.53% 2.87% -0.62%
... By now Marc Alff has implemented a lot, so these figures might resemble
reality.

Removals
--------

Some tasks might become redundant, or might need changing to be
part of the larger task. That probably includes all tasks that
currently are plans for some bit of extra performance measurement.

But, due to the delay implementing WL#2360, some things have gotten
in and there is no hope of getting rid of them soon. They include:

WL#2317 Add patch to show mutex stats
Vadim Tkachenko added this. It has cumulations for InnoDB mutexes:
Count, Spin_waits, Spin_rounds, OS_waits, OS_yields, OS_waits_time.
Peter proposes that we remove all code for the SHOW MUTEX STATUS statement
(now called SHOW ENGINE INNODB MUTEX), but if the information is truly
necessary we will put it in the appropriate PERFORMANCE table.
It is difficult to make changes to InnoDB code in any case.

Google's SHOW TABLE STATISTICS
http://code.google.com/p/google-mysql-tools/wiki/UserTableMonitoring
This will almost certainly be in before we make it obsolete.

Chad Miller's implementation of Jeremy Cole's profiling (BUG#24795).
Peter said he'll try to get this deprecated; Sergei said he thinks
we should wait till performance schema has everything profiling has.

Christopher Powers's Falcon diagnostics. (However, Christopher was
at the Riga meeting where we discussed the performance schema, and
expressed support for instrumenting Falcon this way. Sergei Golubchik
spoke similarly about Maria instrumenting.)

Stewart Smith's NDB$INFO. We expect cluster will diverge from MySQL.

Trace
-----

[ The contents of this section have been moved to
WL#4878 PERFORMANCE_SCHEMA Trace. ]

DTrace
------

[ The content of this section has been removed. We can remove
the heading and the table-of-contents entry any time. ]

Embedded
--------

We will not support Performance Schema with embedded MySQL / libmysqld.

Terminology
-----------

ACTOR. Something which does an event. A thread, a background process, or
a group of threads belonging to a single role or user.

CONSUMER. The server components which read information about the
performance, in in-memory performance structures. Any code which
reads, in order to produce something in a performance schema EVENTS
table, is a Consumer. The reverse of Provider.

EVENT. Something which is done. A deed. The meaning "act of a play" is
not officially intended, but usually fits. Try not to confuse it with the
schema object that you make with CREATE EVENT.

EVENTS TABLES. Any table in performance schema whose name begins with EVENTS_.
Rows in events tables can be "individual" (a mutex event happens at Time T1 for
1 time-unit), "summary" (9 mutex events happened today), "histogram"
(4 mutex events happened today for less than 1 time-unit and
1 took 1 time-unit and 4 took 2 time-units), or "history"
(a mutex event happened at Time T2, a mutex event happened at Time T2,
a mutex event happened at Time T3, and so on).

FILE. "File IO" is our term for any read or write call via the operating system
which we have reason to believe may cause disk IO. If the operating
system in fact doesn't do real disk IO at the time, for example because
it caches, that's fine, we're not demanding that the disk IO be "real".

INSTRUMENT. A pair of code pieces which bracket a code piece which is
performing an Event. The bracketed code piece is "instrumented code".
The first code piece of the instrument (before the instrumented code)
is "start_wait". The second code piece of the instrument (after the
instrumented code) is "end_wait". Instruments are in Providers.
Typically an instrument will calculate performance data and store
it in an in-memory performance structure.

INSTRUMENTATION. The set of all Instruments in the server.

METRIC. A routine or application which happens at long intervals
(typically once per minute) and shows rates such as "acts per second
in the last minute", which can be found from snapshots that Sampling
produces.

MUTEX. "Mutex" is the word we will use for all objects that are used to
reserve access to processes or memory locations. Mutexes are not
shared, and are not enqueued. A lock is not a mutex. We will not
use the word "latch" except to explain that we don't use it.

PERFORMANCE SCHEMA. What I hope will become the preferred term for
the whole feature, instead of Monitoring (easily confused with MySQL
Enterprise Monitor), or Profiling (which to me implies "summaries"),
or Performance Analysis (too long), or Instrumentation (may not always be true)
or "Performance Interface" (unless somebody really likes it).

PROVIDER. The server components which supply information about the
performance, in in-memory performance structures. The reverse of
Consumer. Any code which contains an Instrument is a Provider.

SAMPLING. A routine or application which happens at short intervals
(typically once per second) and takes snapshots of performance
schema tables.

STAGE. An act that is nested in a statement, and may nest a wait.
For example, "Filesort".

WAIT. A "wait" occurs when a thread ceases to do useful work with the CPU.
Typically waits involve an operating-system call (yield, sleep, get)
which involves accessing a slow device (a net, a human, a disk) or
which involves contention (another thread has asked for the same thing).
The performance schema "wait" genus necessarily includes some events that
involve waiting zero time units (disk block already in cache, mutex needs
no wait, etc.), but we can't know that till we call, so there's no
separate category.

Requests for comment
--------------------

[ The content of this section has been removed. We can remove
the heading and the table-of-contents entry any time. ]

Performance_xxx routines
------------------------

[ The content of this section has been removed. We can remove
the heading and the table-of-contents entry any time. ]

List of Mutexes
---------------

[ The content of this section haa been removed. We can remove
the heading and the table-of-contents entry any time. There is
a separate task WL#4670 "Mutex documentation". ]

Administrative
--------------

Marc Alff added a soft dependency on WL#4601: there is no point
in instrumenting the "fast mutex" implementation if it's going away.

If there is already a user-created database named PERFORMANCE_SCHEMA,
then mysql_upgrade will fail.

Source code
-----------

Status, as of 2010-01-16:

The WL#2360 code is now merged in 5.5 and 6.0.
The code is publicly available at:
https://code.launchpad.net/~mysql/mysql-server/mysql-next-mr
https://code.launchpad.net/~mysql/mysql-server/mysql-6.0-codebase

The following branches are not maintained anymore:
mysql-5.4 port, abandoned:
https://code.launchpad.net/~marc.alff/mysql-server/mysql-azalea-perfschema
mysql-6.0, abandoned:
https://code.launchpad.net/~marc.alff/mysql-server/mysql-6.0-perfschema
mysql-5.5, merged:
https://code.launchpad.net/~marc.alff/mysql-server/mysql-trunk-perfschema

References
----------

This section includes some references to other DBMSs. Monitoring is
common to all products. For most DBMSs, the section only will show
how little information they provide.

Oracle Database Reference 11g Part III "Dynamic Performance Views":
http://download.oracle.com/docs/cd/B28359_01/server.111/b28320/dynviews_part.htm#i403961

Oracle white paper "The Self-Managing Database ...":
http://www.oracle.com/technology/products/manageability/database/pdf/twp03/TWP_manage_automatic_performance_diagnosis.pdf

DB2 system monitor guide:
ftp://ftp.software.ibm.com/ps/products/db2/info/vr9/pdf/letter/en_US/db2f0e90.pdf

PostgreSQL manual chapter on monitoring database activity:
http://www.postgresql.org/docs/8.3/static/monitoring.html
(Measurements, in "pg_stat_" views, are counts, systemwide
or per table or per index, for statements, reads, buffer hits.
PostgreSQL also lets you see time used per statement, that's about all.)

SQL Server 2008 books online, "sysprocesses":
http://msdn2.microsoft.com/en-us/library/ms179881.aspx

SQL Server 2008 books online, "sp_trace_setevent":
http://technet.microsoft.com/en-us/library/ms186265(SQL.100).aspx

(SQL Server 2000 is not a good model to follow.)

"SQL Server 2008 Dynamic Management Views and Functions":
http://msdn2.microsoft.com/en-us/library/ms188754(SQL.100).aspx
"Profiler and trace in SQL Server 2005": zip including short power point file:
http://www.sqlteam.com/item.asp?ItemID=17345
(SQL Server 2005 allows trace files as SQL views.)

Sybase ASA manual re performance monitoring statistics:
http://www.ianywhere.com/developer/product_manuals/sqlanywhere/0901/en/html/dbdaen9/00000765.htm
(Sybase ASA provides cumulative statistics to the Windows performance monitor)

Sybase ASE performance monitor, french manual:
http://download.sybase.com/pdfdocs/mng1250f/mon.pdf

sybooks.sybase.com "syslocks" etc.:
http://sybooks.sybase.com:80/onlinebooks/group-as/asg1251e/tables/%40Generic__BookView

WL#3363 ndb$info. Also ndb thread "Re: WL#3363 ndb$info":
[mysql intranet]/secure/mailarchive/mail.php?folder=113&mail=6396

dev-private july 2007 thread re Ingo's time estimates
[mysql intranet]/secure/mailarchive/mail.php?folder=4&mail=20154

The code in bzr mysql-6.0-perf contains the low-level
structures and main algorithms, commented and set
apart from other code with if/endif as described in
the HLS. That suffices as an LLD for a task of this nature.