WL#5520: InnoDB: persistent optimizer statistics

Status: Complete

Description
High Level Architecture
Low Level Design

Overview

The statistics that InnoDB gathers for tables and indexes for the optimizer are
currently stored in memory only and are wiped away when the database server is
shutdown. These statistics are currently recalculated every time a table is
first opened (and on other events too) and thus the calculation needs to be
fast. However, because they are computed quickly, the statistics vary widely,
are inaccurate and sometimes even totally bogus. To provide stability and
accuracy of the statistics, and to provide user control over their collection,
we will make the statistics persistent, and computed only when the user issues
an ANALYZE command. 

User Value of Persistent Statistics

The statistics are not so valuable in and of themselves, as they are usually
only estimates of the real values that describe the content of tables. (Even
such things as the row count or number of distinct values in a column can differ
significantly from reality.) Users should not rely on optimizer statistics for
analysis.

The real value of approximate statistics describing a table is for query
optimization. Users would like good performance for their queries, and the query
optimizer uses (approximate) statistics to make decisions as to access paths,
join orders, etc. (the query execution plan). Users want statistics that
relatively accurately reflect the contents of the database, so that the plans
chosen by the optimizer are as good as they can be (given the quality of the
optimizer itself).

In addition to their quality, the stability of the statistics is important, so
that the optimizer will make the same choices for a given query over time. This
"plan stability" is desirable, so that query performance remains predictable.
Note that changes in the optimizer (even bug fixes) can affect the chosen query
execution plan. Thus, making statistics persistent can help ensure, but not
guarantee, that a given query will be executed in the same way each time it
appears. Only a more advanced capability, where the query plan itself is saved,
can ensure complete "plan stability" even across upgrades of the software.
Making the statistics themselves persistent, as well as giving users explicit
control over when statistics are calculated (rather than have them calculated as
a side-effect of accessing a table) provides the required stability of statistics.

Furthermore, users would like to be able to change the statistics, either to
revert the system to a point in time before an ANALYZE command was done, or to
do testing. For example, a user can do some testing and evaluation of optimizer
choices (with EXPLAIN PLAN) using a small data set, then change the statistics
to see how sensitive the optimizer is to changes in numbers of rows, numbers of
distinct values, etc.

Other database systems and storage engines

Most "mature" DBMSs maintain statistics persistently, and give users control
over their (re-)calculation. Oracle, for example, has a very comprehensive
feature list in this area, including:

    * persistent statistics
    * statistics gathered with ANALYZE command or stored procedure
    * manual or automatic (scheduled) statistics capture
    * user control over sampling techniques
    * parallel statistics gathering 

There are many other related features of Oracle as documented in Chapter 13:
"Managing Optimizer Statistics" of the Oracle 11g Database Performance Tuning Guide.

The PostgreSQL documentation describes its statistics, which are gathered by an
ANALYZE command and stored in the system catalog in a table called pg_statistic.
PostgreSQL lacks many of the features of Oracle and DB2 for example, but does
include histograms. More information is in the manual about how the "planner"
(optimizer) uses these statistics.

Requirements, Constraints and Considerations

The "Persistent Statistics" feature is motivated by the requirement for good
query performance, stable optimizer plans and user control of when statistics
are (re-)calculated and other factors.

Requirements

The following requirements should be met by this implementation:

    * Statistics stored in a database, in a normal InnoDB table
accessible via SQL
    * User control over when statistics are computed
    * Ability for the user to change (and delete) statistics, where delete means
to switch this feature OFF for a given table
    * Statistics should be computed for the entire table and each of its indexes
    * User control which tables have statistics computed/saved
    * User control over the number of scanned pages
    * Time-stamping of statistics collections
    * Appropriate behavior on CREATE/DROP/ALTER TABLE/INDEX (i.e., don't leave
orphan rows in stats table)
    * System behaves as now if statistics table, or related statistics do not exist
    * SHOW INDEX should report cardinality from persistent stats 

Non-requirements

We have considered a number of other features that could be done as part of this
feature, however there are reasons we want to limit the scope of this project.
Therefore, we specifically exclude the following possible aspects for this feature:

    * Special security on statistics tables
    * Integration with Transportable Tablespaces
    * Automatic statistics calculation based on how much a table has changed or
on a schedule
    * Use of statistics on system tables
    * Versioning of statistics collections
    * Histograms 

Constraints

The following constraints are imposed on the implementation of this feature:

    * No modifications to MySQL syntax
    * Cannot require a new on-disk file format

User Interface

This section describes the feature from the user perspective. How does a user
invoke statistics computation, what do they see, how can they use it. The
following factors must be addressed.

Creating Statistics Tables - the two statistics tables are automatically created
during mysql_install_db time in the mysql database - mysql.innodb_table_stats
and mysql.innodb_index_stats.

The current MySQL ANALYZE command is used to gather statistics for a normal user
table.

Semantics of the ANALYZE command

Persistent stats will be computed and saved by the ANALYZE command if
innodb_analyze_is_persistent is ON (it is OFF by default to avoid behavioral
hickups during upgrades). The current automatic collection of statistics will
occur only when persistent statistics do not exist for a given table. That
automatic collection will never store the stats, like it does before this WL.
Likewise, the ANALYZE command will always store the statistics, making them
persistent (when innodb_analyze_is_persistent=ON). Thus, there are two types of
statistics:

    * Transient stats: as now, computed by "n random dives", on table open, and
not stored
    * Persistent stats: the new feature, computed by a slower technique and stored 

With this approach the system behaves as it does before this WL. The ANALYZE
command has simple semantics: compute and store statistics (if
innodb_analyze_is_persistent=ON and behave as before if OFF).

Specifying the Sampling Technique

Before this WL, the parameter innodb_stats_sample_pages controls the number of
index dives that are done when estimating statistics. This parameter is renamed
to innodb_stats_transient_sample_pages and a new one is introduced -
innodb_stats_persistent_sample_pages. If the old method is to be used for stats
calculation, then the _transient config will be used. If the new method is to be
used, then the _persistent config will be used. This is needed, because one
could afford to increase the _persistent parameter significantly, provided that
the stats will not be calculated without the DBA explicitly executing ANALYZE
TABLE. One could afford to increase it so that ANALYZE takes a few minutes
(assuming the table is that big). If the number given in _persistent is too high
then we will trigger a full scan (the exact boundary is when
innodb_stats_persistent_sample_pages * n_uniq is larger than the number of leaf
pages in a given index - then a full scan of that index will be done instead of
sampling innodb_stats_persistent_sample_pages for each n_uniq key prefix. n_uniq
is the number of columns in the index that make each row unique, if the index is
non-unique and it does not contain unique indexed columns then that number is
the number of columns + 1).


Manipulating the Stats Table

Ordinary SQL commands can be used to modify the statistics table, INSERTing,
DELETEing or UPDATEing rows as desired to eliminate statistics for a table,
revert to a prior set of statistics, or experiment to see how the optimizer will
react to different values of statistics. While we will not try to prevent DBAs
from doing so, such direct manipulation of the stats tables can lead to errors
(such as orphan rows or meaninless values in the tables).

Effect of other SQL commands on stats

CREATE TABLE: For simplicity, and to avoid side-effects, CREATE TABLE will NOT
automatically invoke the ANALYZE command. So, there will be no stats on the
table until ANALYZE is run.

DROP TABLE and ALTER TABLE DROP INDEX: will delete the corresponding rows in the
stats tables so there are no orphan stats.

CREATE INDEX (or ALTER TABLE ADD INDEX): Do nothing special, this means for a
given table, some indexes might have stats and others not. At table open time,
the stats for the index(es) that didn't have them would be populated (in memory
only) using the current technique. No storing of transient stats that would
potentially be computed with a different method. 

If persistent statistics exist for a table (because ANALYZE has been run), then
none of the other commands that would today compute transient stats will do so.

SHOW INDEX "just works" because the cardinality estimate in memory will have
come from the persistent stats.

Timestamping

It is helpful to users that the date/time when statistics were collected is
preserved with the statistics. The timestamp will be recorded as a table-level
attribute in mysql.innodb_table_stats.stats_timestamp.

Some use cases for having a timestamp are:

    * If the stats are old, users might want to re-ANALYZE because the table has
been updated a lot.
    * Some future capability might automatically re-ANALYZE based on passage of
time and/or table volatility.
    * If the stats were "working well" (e.g., there was a good execution plan
and good performance during a certain time, a user might like to go back to the
stats from that time).
    * The timestamp is a convenient way to label sets of stats (e.g., copies of
the stats rows) for user-managed versioning
    * Although stats are only estimates, users might want to collect stats on a
regular basis, and analyze those stats over time to detect trends. 

The above use cases might be helpful in the documentation.

Number of Stats Tables

    * One stats table per instance (i.e., across MySQL "databases", or what most
database systems call "schemas")

Where to store the stats tables?

One possible place to store the stats table would be the "mysql" database. This
database will contain the two stats tables.

Security

The statistics tables raise a question of security. Even the existence of a
table may be considered sensitive information (for example, Oracle says "no such
table" if a user SELECTs from a table on which he has no privilege). With a
single set of statistics tables for the whole mysql instance, rows for all
tables will be visible to all users with the right to access the stats tables.
The statistics tables must be restricted to the super user only.

There is normally no need for application users to see or change the statistics.
In a hosted environment, where multiple customers use the same MySQL instance
(each with their own database), it is unlikely the hosting service provider
would want to provide superuser privileges. However, in such cases, the customer
probably has little need to manage statistics (e.g., when multiple customers use
WordPress, they may have their own (identical) databases, but none needs to be a
superuser with the ability to access or change the statistics.

Design and Implementation

This section details the implementation of the persistent optimizer statistics
feature.

What statistics are computed?

The statistics must support the current operations of the MySQL query optimizer.
In the future, additional statistics may be necessary to do so (e.g., histograms).

Additionally, we will capture in the stats table other information that InnoDB
currently collects but that the optimizer does not use. Furthermore, we will
capture additional information that could be useful to the DBA, even if the
optimizer does not require or use such information.

    * Per table
          o approximate number of rows: stat_n_rows
          o approximate clustered index size: stat_clustered_index_size
          o sum of other indexes sizes: stat_sum_of_other_index_sizes
    * Per index
          o approximate index size: stat_index_size
          o approximate number of leaf pages in the index: stat_n_leaf_pages
          o for each n-columns unique prefix in a composite (multi-column)
index/key, the approximate number of distinct values: stat_n_diff_key_vals (n
can be from 1 to 16). 

The above statistics are used to satisfy the requirements of the MySQL optimizer
(see below for the data structure definitions InnoDB uses).

Additional stats

In addition, to help the DBA, we will include the following information:

    * Per table:
          o timestamp when stats were collected
    * Per index:
          o timestamp when stats were collected
          o number of pages in the B-tree

C Data structures used by MySQL and InnoDB

MySQL currently asks for:
per table:
class ha_statistics
{
public:
...
  /*
    The number of records in the table.
      0    - means the table has exactly 0 rows
    other  - if (table_flags() & HA_STATS_RECORDS_IS_EXACT)
               the value is the exact number of records in the table
             else
               it is an estimate
  */
  ha_rows records;
  ha_rows deleted;                      /* Deleted records */
  ulong mean_rec_length;                /* physical reclength */
...
InnoDB keeps these per table:
                                /** Statistics for query optimization */
                                /* @{ */
        unsigned        stat_initialized:1; /*!< TRUE if statistics have
                                been calculated the first time
                                after database startup or table creation */
        ib_int64_t      stat_n_rows;
                                /*!< approximate number of rows in the table;
                                we periodically calculate new estimates */
        ulint           stat_clustered_index_size;
                                /*!< approximate clustered index size in
                                database pages */
        ulint           stat_sum_of_other_index_sizes;
                                /*!< other indexes in database pages */
        ulint           stat_modified_counter;
                                /*!< when a row is inserted, updated,
                                or deleted,
                                we add 1 to this number; we calculate new
                                estimates for the stat_... values for the
                                table and the indexes at an interval of 2 GB
                                or when about 1 / 16 of table has been
                                modified; also when the estimate operation is
                                called for MySQL SHOW TABLE STATUS; the
                                counter is reset to zero at statistics
                                calculation; this counter is not protected by
                                any latch, because this is only used for
                                heuristics */
                                /* @} */
InnoDB's "stat_n_rows" corresponds to MySQL's "records".
Per index:
typedef struct st_key {
...
 /*
    Array of AVG(#records with the same field value) for 1st ... Nth key part.
    0 means 'not known'.
    For temporary heap tables this member is NULL.
  */
  ulong *rec_per_key;
...
InnoDB keeps these per index:
        /** Statistics for query optimization */
        /* @{ */
        ib_int64_t*     stat_n_diff_key_vals;
                                /*!< approximate number of different
                                key values for this index, for each
                                n-column prefix where n <=
                                dict_get_n_unique(index); we
                                periodically calculate new
                                estimates */
        ulint           stat_index_size;
                                /*!< approximate index size in
                                database pages */
        ulint           stat_n_leaf_pages;
                                /*!< approximate number of leaf pages in the
                                index tree */
InnoDB's "stat_n_diff_key_vals" corresponds to MySQL's "rec_per_key".


Description of Statistics Tables

The statistics table(s) must:

    * Be able to identify stats by the name of the related database, table and index
    * Store table-level data (e.g., number of rows in the table)
    * Store index-level data, for each indexes
          o must accommodate the maximum number of columns in a composite key (16). 
    * Support the statistics required by the optimizer
    * Store extra stats that may not be used by the optimizer but are helpful to
the DBA
    * Record the sample size used to estimate stats
    * Be relatively easy to use via SQL 

There will be two stats tables, one containing table-level stats, and one
containing index-level stats, as follows.

CREATE TABLE IF NOT EXISTS innodb_table_stats (
        database_name                   VARCHAR(64) NOT NULL,
        table_name                      VARCHAR(64) NOT NULL,
        stats_timestamp                 TIMESTAMP NOT NULL,
        n_rows                          BIGINT UNSIGNED NOT NULL,
        clustered_index_size            BIGINT UNSIGNED NOT NULL,
        sum_of_other_index_sizes        BIGINT UNSIGNED NOT NULL,
        PRIMARY KEY (database_name, table_name)
) ENGINE=INNODB DEFAULT CHARSET=utf8;

CREATE TABLE IF NOT EXISTS innodb_index_stats (
        database_name                   VARCHAR(64) NOT NULL,
        table_name                      VARCHAR(64) NOT NULL,
        index_name                      VARCHAR(64) NOT NULL,
        stat_timestamp                  TIMESTAMP NOT NULL,
        /* there are at least:
        stat_name='size'
        stat_name='n_leaf_pages'
        stat_name='n_diff_pfx%' */
        stat_name                       VARCHAR(64) NOT NULL,
        stat_value                      BIGINT UNSIGNED NOT NULL,
        sample_size                     BIGINT UNSIGNED,
        stat_description                VARCHAR(1024) NOT NULL,
        PRIMARY KEY (database_name, table_name, index_name, stat_name),
        FOREIGN KEY (database_name, table_name)
          REFERENCES innodb_table_stats (database_name, table_name)
) ENGINE=INNODB DEFAULT CHARSET=utf8;

For any given table that has been ANALYZED, there will be one row in the
innodb_table_stats table and n rows in the innodb_index_stats table.

How and when statistics are calculated

Without this feature: The number of unique values in an index is calculated by
making N random dives into the index, fetching N leaf pages and counting the
unique values in them. Then the number of the unique values in the whole index
is estimated to be the average of the N random pages, multiplied by the total
number of pages in the index. N can be specified by the user via
innodb_stats_sample_pages which defaults to 8. These statistics are recalculated
when a table is opened, when user explicitly calls ANALYZE TABLE and when InnoDB
decides that the table has changed too much.

With this feature: The above N random dives algorithm is kept as one
alternative, along with the new algorithm. Previous stats, if any, are discarded
when ANALYZE is done. The stats are (re)calculated only when the user explicitly
calls ANALYZE TABLE. This differs from the InnoDB Plugin and MySQL 5.1 behavior.

Because the stats are persistent and not wiped away out of the control of the
user (when table is opened etc), the user can afford to scan 1000s of pages to
get more precise stats. The user can choose when the slow stats calculation
takes place because it only happens when ANALYZE TABLE is executed.

Another thing to consider, maybe in the next version, is to add an exception to
the recalculated-only-on-ANALYZE rule and recalculate the stats when the table
has changed "too much", users could specify how much in % globally or % per
table where 0 would mean - disable this feature.

Algorithm description

For now the following single algorithm will be used for sampling. If at some
point another algorithms are approved we may add them.

The algorithm is controlled by one number - A, which is the number of leaf pages
to analyze for a given index for each n-prefix (if the index is on 3 columns,
then 3*A pages will be analyzed). A represents innodb_stats_persistent_sample_pages.

Let the total number of leaf pages in the table be T.
Level 0 - leaf pages, level H - root.

A can be given as absolute number, or in future versions as a% in which
case A = T * a / 100.

Definition: Boring record is a record on a non-leaf page that equals the next
(to the right, cross page boundaries) record on the same level. The last (user)
record on a level is not boring (it does not equal the supremum on that page).
It is boring because all the records on the page below it equal that boring record.

We avoid diving below boring records when searching for a leaf page to estimate
the number of distinct records because we know that such a leaf page will have
number of distinct records == 1.

For each n-prefix start from the root level and full scan subsequent lower
levels until a level that contains at least A*10 distinct records is found. As
an optimization the search is canceled if it has reached level 1 (never descend
to the level 0 (leaf)) and also if the next level to be scanned would contain
more than A pages. The latter is because the user has asked to analyze A leaf
pages and it does not make sense to scan much more than A non-leaf pages with
the sole purpose of finding a good sample of A leaf pages.

After finding the appropriate level with >A*10 distinct records, divide it into
groups of equal records and pick A such groups. Then pick the last record from
each group. For example, let the level be:

index:  0,1,2,3,4,5,6,7,8,9,10
record: 1,1,1,2,2,7,7,7,7,7,9

There are 4 groups of distinct records and if A=2 random ones are selected, e.g.
1,1,1 and 7,7,7,7,7, then records with indexes 2 and 9 will be selected.

After selecting A records as described above, dive below them to find A leaf
pages and analyze them, finding the total number of distinct records. The dive
to the leaf level is performed by selecting a non-boring record from each page
and diving below it.

This way, a total of A leaf pages are analyzed.

Let the number of different key values found in page i be Pi (i=1..A)
Calculate the number of different key values in the whole level A,
  let that be V.
Then the total number of different key values in the whole tree is:
  V * (P1 + P2 + ... PN) / A.

The above describes how to calculate the cardinality of an index. This algorithm
is executed for each n-prefix of a multi-column index where n=1..n_uniq.

For other stats (like e.g. number of rows) use the same algorithm except that at
level A we will not unify equal key values into one group, that is - randomly
pick A keys from 1,1,1,2,2,7,7,7,7,7,9, no matter that some of them are equal.

Persistent storage modification

In InnoDB Plugin and MySQL 5.1 the stats are kept in memory only and discarded
when the server is shut down. In new InnoDB (with this feature) the stats will
be stored in an InnoDB table that the user can even change to alter the
statistics manually. The internal SQL parser will be used to access that table
from within InnoDB. 

UPDATE/INSERT of the persistent storage

    * The user calls ANALYZE TABLE t
          o The stats related members of the structs are updated as usual
          o The values of those members are added to the persistent storage:
                + If the persistent storage does not exist, do nothing
(eventually emit a notice to the user that persistent stats are not enabled
because the system tables are missing)
                + If the innodb_table_stats or innodb_index_stats table has the
wrong format, issue an error
                + If the relevant table/index is not in the persistent storage,
INSERT
                + If the relevant table/index is already in the persistent
storage, UPDATE 
    * The user UPDATEs the persistent storage directly
          o Add a trigger to the stats table that updates the dict_table_t
struct in the dictionary cache if there is one loaded already. Do not touch
MySQL structures, MySQL will fetch the info from the dictionary cache next time
it opens the table. This trigger will be a code that is inserted somewhere in an
INSERT and UPDATE code path that checks if the table being changed needs special
handling.

SELECTing from the persistent storage

    * When a table is not in the dictionary cache and is opened, read the
relevant stats from the persistent storage and assign to the stats members in
the structs described
    * MySQL can read the (in memory) structs at any time, then the persistent
storage is not involved 

DELETE from the persistent storage

    * When a table or index is dropped, then scan the persistent storage for
relevant entries and wipe them away 

IR: Do we allow user to delete rows from stat table directly (as a mechanism to
revert to current behavior)? If so, will we maintain RI constraints between the
innodb_table_stats table and innodb_index_stats table so that user cannot just
delete from the innodb_table_stats leaving behind orphan rows in innodb_index_stats?

Vasil: Yes, users will be able to DELETE, RI is in place.

The statistics are calculated by dict_update_statistics_low(). Add a boolean
parameter to this function that tells it whether to calculate the stats (as
before) and save them to the persistent storage OR to fetch them from the
persistent storage. If fetch is requested but there is no data for the given
table in the persistent storage, then calculate the stats and add them to the
persistent storage.

From all places, except ANALYZE, request fetching from the persistent storage,
from ANALYZE request recalculation.

Pseudo code / Raw code snippets

Translated to code changes, the above means:

    * add one more parameter to dict_update_statistics_low()
    * in dict_update_statistics_low():
          o if that parameter is RECALCULATE, then
                + if the persistent storage for the table exists then recalc the
stats using a new algorithm and save them
                + else do what the function currently does (and emit warning?) 
          o else (it must be FETCH_FROM_PERSISTENT_STORAGE)
                + if persistent storage exists and there are stats for the table
then read the stats from the persistent storage
                + else do what the function currently does 
    * all callers of dict_update_statistics_low() call it with
FETCH_FROM_PERSISTENT_STORAGE or are changed to accept one more parameter and
pass it down unchanged to dict_update_statistics_low()
    * the code path of ANALYZE is: ::analyze() -> ::info() ->
dict_update_statistics() -> dict_update_statistics_low(). Move the whole body of
::info() into a new function X that accepts one more parameter and passes it
down to dict_update_statistics(), in ::info() call X(...,
FETCH_FROM_PERSISTENT_STORAGE). In ::analyze(), instead of calling ::info(),
call X(..., RECALCULATE) 


Other Information

The statistics that InnoDB gathers for tables and indexes for the optimizer are
currently stored in memory only and are wiped away when the database server
is shutdown. These statistics are currently recalculated every time a table is
first opened (and on other events too) and thus the calculation needs to be
fast. However, because they are computed quickly, the statistics vary widely,
are inaccurate and sometimes even totally bogus. To provide stability and
accuracy of the statistics, and to provide user control over their collection,
we will make the statistics persistent, and computed only when the user issues
an ANALYZE command.

Performance Objectives

The key benefit to users is not the performance of the ANALYZE command itself,
but the stability of the optimizer plan. Nevertheless, since we are implementing
new sampling techniques, we should have some goals about the performance of
these new techniques.

Performance Tests

Given that there are potentially new sampling techniques, a performance testing
plan is required.

   1. What happens to the stats table on CREATE INDEX? 

Currently nothing is done - the table will have an index without
corresponding entry in mysql.innodb_index_stats, when fetching stats from
mysql.innodb_index_stats this index's stat fields will be left untouched. The
next ANALYZE command will add a row for the newly created index. We can handle
this differently at any time easily.

   1. What happens to the stats table on DROP INDEX? 

DELETE FROM mysql.innodb_index_stats will be hooked next to DELETE FROM
SYS_INDEXES to drop the necessary rows from innodb_index_stats.

   1. Allow updating the stats for one index only in ANALYZE?
          * Since changing the supported syntax in MySQL is beyond question, we
could introduce a session variable and look at it during analyze. Something like
"SET innodb_stats_only_for_indexes="index1;iii"; ANALYZE TABLE t;" (assuming
table t has 3 indexes - index1, iii and index2, the above would recalculate only
the stats for the first two indexes. This is not implemented.


References:

 * http://download.oracle.com/docs/cd/B28359_01/server.111/b28274/stats.htm#PFGRF003
 * http://download.oracle.com/docs/cd/B28359_01/appdev.111/b28419/d_stats.htm.

Ref: rb://373