WL#4878: PERFORMANCE_SCHEMA Trace

Affects: Server-9.x   —   Status: Un-Assigned

The initial description here has been shifted from the "Trace"
section of WL#2360 Performance Schema. This task depends on WL#2360.

"Trace" is not a requirement. However, this task description
must be here to show that, with WL#2360 as a basis,
we could produce trace files if we wanted.

Providers could dump to a trace file as well as copy to performance-structures.
It's certain to be slow (10%?) so doing it from the start would be an error.
The benefit is mostly to professional performance analysts.

Enabling Trace
--------------

There will be a new column in SETUP_INSTRUMENTS:
TRACE_ENABLED ENUM('yes','no')
It's in SETUP_INSTRUMENTS rather than SETUP_CONSUMERS because
it affects the instrument code path, and because we want to
enable/disable depending whether the event is a WAIT, a STATEMENT, etc.
The default value is 'no'.
It is an error to set TRACE_ENABLED='yes' if ENABLED='no'.

There will be a new column in the SETUP_ACTORS table (see WL#4674):
TRACE_ENABLED ENUM('yes','no')
The default value is 'no'.
Use of this column is affected by the FILE privilege and --secure-file-priv.

Trace file writing will occur for a particular instrumentation if:
SETUP_INSTRUMENTS.TRACE_ENABLED='yes', and
SETUP_ACTORS.TRACE_ENABLED='yes'.

There is no maximum length of a trace file.

Creating a trace file
---------------------

For each new thread,
or for each existing thread when a user updates a TRACE_ENABLED value:
  If (this thread is a thread for this setup_actors value)
    Set thread file name = datadir/"thread".actor.thread.
    For example, for datadir 'd' actor 'a' thread 1, = d/thread.a.1
    Create this file.

If a thread is associated with multiple actors,
for example host 'h' user 'a' role 'r', MySQL
will combine all names that fit the thread.

One File Per Thread
-------------------

We turn on tracing "per actor" but we have
one file "per thread". This is necessary
because we don't want two threads writing
to the same file at the same time -- either
thread#1 would wait for thread#2 (which is bad),
or thread#1 would not wait for thread#2 (which is bad).

It should be simple to make a utility
which combines all per-thread files and sorts
the rows by one of the timer columns (but not EVENT_ID
because EVENT_ID is now a per-thread counter).

Writing
-------

The instrument start_wait() and instrument end_wait()
code (both!) will write to the trace file.

There is no attempt to fsync, flush, or whatever.
You crash, you lose some of the trace.

The writing looks like
write(buffer);

This is a good possible spot for a plugin,
that is, we could replace "write(buffer)"
with "plugin(buffer)" if instead of writing
we wanted to examine the contents, cause a
crash, send an email, etc. We'll decide not
to do this.

File Errors
-----------

All kinds of errors can occur.
For example, the create fails due to privileges missing,
or the write fails due to full disk.

MySQL will not display an error message and stop
the statement processing. That is not what the
user, who is issuing the statement, expects.

MySQL will disable the trace for all threads
for the current actor. Thus, users will be able
to discover that a failure happened by seeing
that SETUP_ACTORS.TRACE_ENABLED becomes 'no'.
Thus, any error will disable tracing permanently.

Content and Format
------------------

This is a text file containing lines which
are equivalent to "events" rows in PERFORMANCE_SCHEMA
tables. Doubtless it would be quicker to dump raw
data from underlying structures, but when you ask
for a trace file you've given up on speed.
Doubtless it would be interesting to output
things in summary or state tables, but then
it wouldn't be tracing.

The header row is:
"/* MySQL Trace File for (actor_name) on (datetime) */".
This is redundant. Users can find out the
time the file was created, the actor, and the
thread, by saying 'ls -l'.

Each detail row takes one line and looks like an
INSERT statement, with all the columns of a row of
EVENTS_WAITS_CURRENT, EVENTS_STATEMENTS_CURRENT, etc.

Since different lines might be for rows of
different 'tables', a simple CSV format is not
possible. But it's easy to produce a database
file from INSERT statements.

Since writing occurs both before and after the
instrumentation point, everything is duplicated.

Example
-------

Suppose user Harry has thread 5.
Harry has privileges to update performance_schema tables.

Harry says:
UPDATE performance_schema.SETUP_INSTRUMENTS
  SET enabled='yes', trace_enabled='yes';
INSERT INTO performance_schema.SETUP_ACTORS
  VALUES ('%','%','%','yes');
This causes creation of trace files.

Harry now executes a statement
SET @a=5;
which happens to cause one mutex lock, no stages.

As a result, there is a trace file named
/usr/local/mysql/var/trace.harry.5
It contains
/* MySQL Trace File for Harry on 2009-04-07 14:18:00 */
INSERT INTO events_statements_current VALUES ('SET @a=5',...);
INSERT INTO events_waits_current VALUES ('mutex',...);
INSERT INTO events_waits_current VALUES ('mutex',...);
INSERT INTO events_statements_current VALUES ('SET @a=5',...);

To distinguish between 'start' and 'end' rows, check if TIMER_END is NULL.

Overhead
--------

The cost of taking a performance-struc and formatting it as
an INSERT statement is between 100 and 200 cycles. Most of
the time is string-move instructions. The cost of invoking
a write function is about 20 cycles on Peter's x86. The writing is deferred
so you don't immediately see the actual cost of writing to disk.
We have Kelly Long's test for "how many transactions per second on a
loaded system". We'll run it with "trace on and no update of
performance schema tables" versus "trace off and update of
performance schema tables".

mysqld option
-------------

Marc Alff pointed out: if one can enable a trace file with a mysqld option,
there's no need to worry about adding dozens more mysqld options that would
enable particular performance schema instruments for analyzing operations
that happen before the first user connects, for example crash recoveries.

For this, we could say that, if the server started with
mysqld --performance-schema-trace-enabled=true,
then any instrument which is 'enabled' is considered to be 'trace_enabled'.

More settings
-------------

We've said that there is no maximum length for a trace file,
and said that the destination is datadir/"thread".actor.thread.
But we know that users sometimes need to choose such things somehow.
So we could say that these things are
(a) specifiable in SETUP_INSTRUMENTS,
(b) specifiable via global variables similar to Oracle 11g's
    USER_DUMP_DEST and MAX_DUMP_FILE_SIZE,
(c) specifiable in one of the ways described by WL#3824 Resource Limits.

We've said that errors will be visible because SETUP_ACTORS.TRACE_ENABLED
becomes 'no'. But we can guess that users sometimes want to choose
what action to take on error. They could want to increment a counter,
or change file destination, or cause a warning from the SQL statement.