WL#5520: InnoDB: persistent optimizer statistics
Status: Complete
Overview

The statistics that InnoDB gathers on tables and indexes for the optimizer are currently stored in memory only and are wiped away when the database server is shut down. They are recalculated every time a table is first opened (and on other events too), so the calculation needs to be fast. Because they are computed quickly, however, the statistics vary widely, are inaccurate, and are sometimes even totally bogus. To provide stable, accurate statistics, and to give users control over their collection, we will make the statistics persistent and compute them only when the user issues an ANALYZE command.

User Value of Persistent Statistics

The statistics are not valuable in and of themselves: they are usually only estimates of the real values that describe the contents of tables. (Even such things as the row count or the number of distinct values in a column can differ significantly from reality.) Users should not rely on optimizer statistics for analysis.

The real value of approximate statistics describing a table is for query optimization. Users want good performance for their queries, and the query optimizer uses (approximate) statistics to decide on access paths, join orders, etc. (the query execution plan). Users want statistics that reflect the contents of the database relatively accurately, so that the plans chosen by the optimizer are as good as they can be (given the quality of the optimizer itself).

In addition to their quality, the stability of the statistics is important, so that the optimizer makes the same choices for a given query over time. This "plan stability" is desirable so that query performance remains predictable. Note that changes in the optimizer (even bug fixes) can affect the chosen query execution plan. Thus, making statistics persistent can help ensure, but not guarantee, that a given query is executed the same way each time it appears.
Only a more advanced capability, where the query plan itself is saved, can ensure complete "plan stability" even across upgrades of the software. Making the statistics themselves persistent, and giving users explicit control over when statistics are calculated (rather than having them calculated as a side effect of accessing a table), provides the required stability of statistics.

Furthermore, users would like to be able to change the statistics, either to revert the system to a point in time before an ANALYZE command was run, or for testing. For example, a user can do some testing and evaluation of optimizer choices (with EXPLAIN) using a small data set, then change the statistics to see how sensitive the optimizer is to changes in the number of rows, the number of distinct values, etc.

Other database systems and storage engines

Most mature DBMSs maintain statistics persistently and give users control over their (re-)calculation. Oracle, for example, has a very comprehensive feature list in this area, including:

* persistent statistics
* statistics gathered with an ANALYZE command or a stored procedure
* manual or automatic (scheduled) statistics capture
* user control over sampling techniques
* parallel statistics gathering

There are many other related Oracle features, documented in Chapter 13, "Managing Optimizer Statistics", of the Oracle 11g Database Performance Tuning Guide.

The PostgreSQL documentation describes its statistics, which are gathered by an ANALYZE command and stored in the system catalog in a table called pg_statistic. PostgreSQL lacks many of the features of Oracle and DB2, for example, but does include histograms. The manual has more information about how the "planner" (optimizer) uses these statistics.
Requirements, Constraints and Considerations

The "Persistent Statistics" feature is motivated by the requirement for good query performance, stable optimizer plans, user control of when statistics are (re-)calculated, and other factors.

Requirements

The following requirements should be met by this implementation:

* Statistics stored in a database, in a normal InnoDB table accessible via SQL
* User control over when statistics are computed
* Ability for the user to change (and delete) statistics, where delete means to switch this feature OFF for a given table
* Statistics computed for the entire table and each of its indexes
* User control over which tables have statistics computed/saved
* User control over the number of scanned pages
* Time-stamping of statistics collections
* Appropriate behavior on CREATE/DROP/ALTER TABLE/INDEX (i.e., don't leave orphan rows in the stats tables)
* System behaves as now if the statistics tables, or the related statistics, do not exist
* SHOW INDEX should report cardinality from persistent stats

Non-requirements

We have considered a number of other features that could be done as part of this project, but there are reasons to limit its scope. Therefore, we specifically exclude the following:

* Special security on the statistics tables
* Integration with Transportable Tablespaces
* Automatic statistics calculation based on how much a table has changed, or on a schedule
* Use of statistics on system tables
* Versioning of statistics collections
* Histograms

Constraints

The following constraints are imposed on the implementation of this feature:

* No modifications to MySQL syntax
* Cannot require a new on-disk file format
User Interface

This section describes the feature from the user perspective: how a user invokes statistics computation, what they see, and how they can use it. The following factors must be addressed.

Creating Statistics Tables

The two statistics tables, mysql.innodb_table_stats and mysql.innodb_index_stats, are created automatically at mysql_install_db time in the mysql database. The existing MySQL ANALYZE command is used to gather statistics for a normal user table.

Semantics of the ANALYZE command

Persistent stats will be computed and saved by the ANALYZE command if innodb_analyze_is_persistent is ON (it is OFF by default to avoid behavioral hiccups during upgrades). The current automatic collection of statistics will occur only when persistent statistics do not exist for a given table, and that automatic collection will never store the stats, just as before this WL. The ANALYZE command, by contrast, will always store the statistics, making them persistent (when innodb_analyze_is_persistent=ON). Thus, there are two types of statistics:

* Transient stats: as now, computed by "N random dives", on table open, and not stored
* Persistent stats: the new feature, computed by a slower technique and stored

With this approach the system behaves as it did before this WL. The ANALYZE command has simple semantics: compute and store statistics if innodb_analyze_is_persistent=ON, and behave as before if it is OFF.

Specifying the Sampling Technique

Before this WL, the parameter innodb_stats_sample_pages controlled the number of index dives done when estimating statistics. This parameter is renamed to innodb_stats_transient_sample_pages, and a new one, innodb_stats_persistent_sample_pages, is introduced. If the old method is used for stats calculation, the _transient setting applies; if the new method is used, the _persistent setting applies.
This is needed because one can afford to increase the _persistent parameter significantly, given that the stats will not be recalculated unless the DBA explicitly executes ANALYZE TABLE. One could even increase it so that ANALYZE takes a few minutes (assuming the table is that big). If the number given in _persistent is too high, a full scan is triggered instead: the exact boundary is when innodb_stats_persistent_sample_pages * n_uniq is larger than the number of leaf pages in a given index, in which case a full scan of that index is done instead of sampling innodb_stats_persistent_sample_pages pages for each of the n_uniq key prefixes. (n_uniq is the number of columns in the index that make each row unique; if the index is non-unique and does not contain unique indexed columns, that number is the number of columns + 1.)

Manipulating the Stats Tables

Ordinary SQL commands can be used to modify the statistics tables: INSERTing, DELETEing or UPDATEing rows as desired to eliminate statistics for a table, revert to a prior set of statistics, or experiment to see how the optimizer reacts to different statistics values. While we will not try to prevent DBAs from doing so, such direct manipulation of the stats tables can lead to errors (such as orphan rows or meaningless values in the tables).

Effect of other SQL commands on stats

CREATE TABLE: For simplicity, and to avoid side effects, CREATE TABLE will NOT automatically invoke the ANALYZE command, so there will be no stats on the table until ANALYZE is run.

DROP TABLE and ALTER TABLE DROP INDEX: delete the corresponding rows in the stats tables so there are no orphan stats.

CREATE INDEX (or ALTER TABLE ADD INDEX): do nothing special. This means that for a given table, some indexes might have stats and others not. At table open time, the stats for the index(es) that lack them are populated (in memory only) using the current technique.
No transient stats, which would potentially be computed with a different method, are stored. If persistent statistics exist for a table (because ANALYZE has been run), then none of the other commands that today compute transient stats will do so. SHOW INDEX "just works" because the cardinality estimate in memory will have come from the persistent stats.

Timestamping

It is helpful to users that the date/time when statistics were collected is preserved with the statistics. The timestamp is recorded as a table-level attribute in mysql.innodb_table_stats.stats_timestamp. Some use cases for having a timestamp are:

* If the stats are old, users might want to re-ANALYZE because the table has been updated a lot.
* Some future capability might automatically re-ANALYZE based on the passage of time and/or table volatility.
* If the stats were "working well" (e.g., there was a good execution plan and good performance during a certain period), a user might like to go back to the stats from that time.
* The timestamp is a convenient way to label sets of stats (e.g., copies of the stats rows) for user-managed versioning.
* Although stats are only estimates, users might want to collect stats on a regular basis and analyze them over time to detect trends.

The above use cases might be helpful in the documentation.

Number of Stats Tables

* One set of stats tables per instance (i.e., across MySQL "databases", or what most database systems call "schemas")

Where to store the stats tables?

One possible place to store the stats tables is the "mysql" database. This database will contain the two stats tables.

Security

The statistics tables raise a question of security. Even the existence of a table may be considered sensitive information (for example, Oracle says "no such table" if a user SELECTs from a table on which he has no privilege).
With a single set of statistics tables for the whole MySQL instance, rows for all tables are visible to all users with the right to access the stats tables. The statistics tables must therefore be restricted to the super user only; there is normally no need for application users to see or change the statistics. In a hosted environment, where multiple customers use the same MySQL instance (each with their own database), it is unlikely that the hosting service provider would want to grant superuser privileges. However, in such cases the customer probably has little need to manage statistics (e.g., when multiple customers use WordPress, they may have their own (identical) databases, but none needs to be a superuser with the ability to access or change the statistics).
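To summarize the user-visible sampling behavior described under "Specifying the Sampling Technique" above, the choice of sample size and the full-scan boundary can be sketched as follows (a Python sketch with illustrative names, not the actual InnoDB code):

```python
def effective_sampling(persistent, transient_sample_pages,
                       persistent_sample_pages, n_uniq, n_leaf_pages):
    """Return (method, pages) for stats calculation on one index.

    persistent: True when innodb_analyze_is_persistent=ON and ANALYZE runs.
    n_uniq: number of columns that make each row unique.
    n_leaf_pages: number of leaf pages in the index.
    With persistent stats, sampling falls back to a full scan when
    sample_pages * n_uniq exceeds the number of leaf pages.
    """
    if not persistent:
        return ("transient_random_dives", transient_sample_pages)
    if persistent_sample_pages * n_uniq > n_leaf_pages:
        return ("full_scan", n_leaf_pages)
    return ("persistent_sampling", persistent_sample_pages)

# 3-column index, 50 leaf pages, 20 sample pages: 20 * 3 > 50, so full scan.
print(effective_sampling(True, 8, 20, 3, 50))    # ('full_scan', 50)
print(effective_sampling(True, 8, 20, 3, 1000))  # ('persistent_sampling', 20)
print(effective_sampling(False, 8, 20, 3, 1000)) # ('transient_random_dives', 8)
```

This is why the _persistent setting can safely be much larger than the _transient one: the expensive path is only taken on an explicit ANALYZE, and a setting that overshoots the index size simply degenerates into a full (and thus exact) scan.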
Design and Implementation

This section details the implementation of the persistent optimizer statistics feature.

What statistics are computed?

The statistics must support the current operations of the MySQL query optimizer. In the future, additional statistics (e.g., histograms) may be necessary. Additionally, we will capture in the stats tables other information that InnoDB currently collects but that the optimizer does not use, as well as further information that could be useful to the DBA even if the optimizer does not require it.

* Per table:
  o approximate number of rows: stat_n_rows
  o approximate clustered index size: stat_clustered_index_size
  o sum of the other indexes' sizes: stat_sum_of_other_index_sizes
* Per index:
  o approximate index size: stat_index_size
  o approximate number of leaf pages in the index: stat_n_leaf_pages
  o for each n-column unique prefix of a composite (multi-column) index/key, the approximate number of distinct values: stat_n_diff_key_vals (n can be from 1 to 16)

The above statistics satisfy the requirements of the MySQL optimizer (see below for the data structure definitions InnoDB uses).

Additional stats

In addition, to help the DBA, we will include the following information:

* Per table:
  o timestamp when stats were collected
* Per index:
  o timestamp when stats were collected
  o number of pages in the B-tree

C data structures used by MySQL and InnoDB

MySQL currently asks for, per table:

    class ha_statistics
    {
    public:
      ...
      /*
        The number of records in the table.
          0     - means the table has exactly 0 rows
          other - if (table_flags() & HA_STATS_RECORDS_IS_EXACT)
                  the value is the exact number of records in the table
                  else it is an estimate
      */
      ha_rows records;
      ha_rows deleted;       /* Deleted records */
      ulong mean_rec_length; /* physical reclength */
      ...
InnoDB keeps these per table:

    /** Statistics for query optimization */
    /* @{ */
    unsigned stat_initialized:1;
                /*!< TRUE if statistics have been calculated the
                first time after database startup or table creation */
    ib_int64_t stat_n_rows;
                /*!< approximate number of rows in the table; we
                periodically calculate new estimates */
    ulint stat_clustered_index_size;
                /*!< approximate clustered index size in
                database pages */
    ulint stat_sum_of_other_index_sizes;
                /*!< other indexes in database pages */
    ulint stat_modified_counter;
                /*!< when a row is inserted, updated, or deleted, we
                add 1 to this number; we calculate new estimates for
                the stat_... values for the table and the indexes at
                an interval of 2 GB or when about 1 / 16 of table has
                been modified; also when the estimate operation is
                called for MySQL SHOW TABLE STATUS; the counter is
                reset to zero at statistics calculation; this counter
                is not protected by any latch, because this is only
                used for heuristics */
    /* @} */

InnoDB's "stat_n_rows" corresponds to MySQL's "records".

Per index:

    typedef struct st_key {
      ...
      /*
        Array of AVG(#records with the same field value) for
        1st ... Nth key part. 0 means 'not known'. For temporary
        heap tables this member is NULL.
      */
      ulong *rec_per_key;
      ...

InnoDB keeps these per index:

    /** Statistics for query optimization */
    /* @{ */
    ib_int64_t* stat_n_diff_key_vals;
                /*!< approximate number of different key values for
                this index, for each n-column prefix where
                n <= dict_get_n_unique(index); we periodically
                calculate new estimates */
    ulint stat_index_size;
                /*!< approximate index size in database pages */
    ulint stat_n_leaf_pages;
                /*!< approximate number of leaf pages in the
                index tree */

InnoDB's "stat_n_diff_key_vals" corresponds to MySQL's "rec_per_key".
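The correspondence between stat_n_diff_key_vals and rec_per_key can be illustrated with a sketch of the conversion the server-side glue performs when filling the MySQL structures; the integer division and the floor of 1 are assumptions for illustration, not the exact server code:

```python
def rec_per_key_from_n_diff(n_rows, n_diff_key_vals):
    """Approximate MySQL's rec_per_key from InnoDB's per-prefix stats.

    rec_per_key[i] is the average number of rows sharing the same
    value in the first i+1 key columns, i.e. roughly
    n_rows / n_diff for that prefix, and never less than 1.
    """
    return [max(1, n_rows // max(1, n_diff)) for n_diff in n_diff_key_vals]

# 1000-row table, index on (a, b): 10 distinct values of a,
# 500 distinct (a, b) pairs.
print(rec_per_key_from_n_diff(1000, [10, 500]))  # [100, 2]
```

Because rec_per_key is derived from n_rows and the n_diff values, persisting those two pieces of data is enough to reconstruct the optimizer's inputs after a restart.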
Description of Statistics Tables

The statistics tables must:

* Be able to identify stats by the name of the related database, table and index
* Store table-level data (e.g., number of rows in the table)
* Store index-level data for each index
  o must accommodate the maximum number of columns in a composite key (16)
* Support the statistics required by the optimizer
* Store extra stats that may not be used by the optimizer but are helpful to the DBA
* Record the sample size used to estimate stats
* Be relatively easy to use via SQL

There will be two stats tables, one containing table-level stats and one containing index-level stats, as follows:

    CREATE TABLE IF NOT EXISTS innodb_table_stats (
        database_name            VARCHAR(64) NOT NULL,
        table_name               VARCHAR(64) NOT NULL,
        stats_timestamp          TIMESTAMP NOT NULL,
        n_rows                   BIGINT UNSIGNED NOT NULL,
        clustered_index_size     BIGINT UNSIGNED NOT NULL,
        sum_of_other_index_sizes BIGINT UNSIGNED NOT NULL,
        PRIMARY KEY (database_name, table_name)
    ) ENGINE=INNODB DEFAULT CHARSET=utf8;

    CREATE TABLE IF NOT EXISTS innodb_index_stats (
        database_name    VARCHAR(64) NOT NULL,
        table_name       VARCHAR(64) NOT NULL,
        index_name       VARCHAR(64) NOT NULL,
        stat_timestamp   TIMESTAMP NOT NULL,
        /* there are at least:
           stat_name='size'
           stat_name='n_leaf_pages'
           stat_name='n_diff_pfx%' */
        stat_name        VARCHAR(64) NOT NULL,
        stat_value       BIGINT UNSIGNED NOT NULL,
        sample_size      BIGINT UNSIGNED,
        stat_description VARCHAR(1024) NOT NULL,
        PRIMARY KEY (database_name, table_name, index_name, stat_name),
        FOREIGN KEY (database_name, table_name)
            REFERENCES innodb_table_stats (database_name, table_name)
    ) ENGINE=INNODB DEFAULT CHARSET=utf8;

For any given table that has been ANALYZEd, there will be one row in innodb_table_stats and n rows in innodb_index_stats.

How and when statistics are calculated

Without this feature: the number of unique values in an index is calculated by making N random dives into the index, fetching N leaf pages and counting the unique values in them.
The number of unique values in the whole index is then estimated as the average over the N sampled pages, multiplied by the total number of pages in the index. N can be specified by the user via innodb_stats_sample_pages, which defaults to 8. These statistics are recalculated when a table is opened, when the user explicitly runs ANALYZE TABLE, and when InnoDB decides that the table has changed too much.

With this feature: the above N-random-dives algorithm is kept as one alternative, alongside the new algorithm. Previous stats, if any, are discarded when ANALYZE is run. The stats are (re)calculated only when the user explicitly runs ANALYZE TABLE; this differs from the InnoDB Plugin and MySQL 5.1 behavior. Because the stats are persistent and not wiped out of the user's control (when a table is opened, etc.), the user can afford to scan thousands of pages to get more precise stats, and can choose when the slow stats calculation takes place, since it only happens when ANALYZE TABLE is executed.

Something to consider, maybe in the next version, is adding an exception to the recalculate-only-on-ANALYZE rule: recalculate the stats when the table has changed "too much", where users could specify how much as a percentage, globally or per table, with 0 meaning the feature is disabled.

Algorithm description

For now the following single algorithm will be used for sampling; if at some point other algorithms are approved, we may add them. The algorithm is controlled by one number, A, which is the number of leaf pages to analyze for a given index for each n-prefix (if the index is on 3 columns, then 3*A pages will be analyzed). A is given by innodb_stats_persistent_sample_pages. Let the total number of leaf pages in the index be T, with level 0 being the leaf level and level H the root. A can be given as an absolute number or, in future versions, as a percentage a, in which case A = T * a / 100.
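For comparison with the new algorithm described next, the old transient estimate ("without this feature" above) reduces to a simple extrapolation. The sketch below uses illustrative structures (a list of leaf pages, each a list of key values), not the actual InnoDB page format:

```python
import random

def transient_n_diff_estimate(leaf_pages, n_sample=8, seed=0):
    """Estimate distinct values by N random dives.

    Fetch n_sample random leaf pages, count distinct values on each,
    and extrapolate: (average distinct per sampled page) * (total pages).
    n_sample corresponds to innodb_stats_sample_pages (default 8).
    """
    rng = random.Random(seed)
    sampled = [rng.choice(leaf_pages) for _ in range(n_sample)]
    avg_distinct = sum(len(set(page)) for page in sampled) / n_sample
    return avg_distinct * len(leaf_pages)

# 100 pages, each holding 4 records with 2 distinct values per page:
pages = [[i, i, i + 1, i + 1] for i in range(100)]
print(transient_n_diff_estimate(pages))  # 200.0
```

The weakness motivating this WL is visible in the structure: with only 8 dives, the extrapolation is noisy and repeated recalculations (on every table open) can return very different values for the same data.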
Definition: a boring record is a record on a non-leaf page that equals the next record (to the right, across page boundaries) on the same level. The last user record on a level is not boring (it does not equal the supremum on that page). A record is boring because all the records on the page below it equal that boring record. We avoid diving below boring records when searching for a leaf page to estimate the number of distinct records, because we know that such a leaf page will contain exactly 1 distinct record.

For each n-prefix, start from the root level and fully scan subsequent lower levels until a level is found that contains at least A*10 distinct records. As an optimization, the search stops if it reaches level 1 (it never descends to level 0, the leaf level) and also if the next level to be scanned would contain more than A pages. The latter is because the user has asked to analyze A leaf pages, and it does not make sense to scan many more than A non-leaf pages with the sole purpose of finding a good sample of A leaf pages.

After finding the appropriate level with at least A*10 distinct records, divide it into groups of equal records and pick A such groups, then pick the last record from each group. For example, let the level be:

    index:  0,1,2,3,4,5,6,7,8,9,10
    record: 1,1,1,2,2,7,7,7,7,7,9

There are 4 groups of distinct records, and if A=2 random ones are selected, e.g. 1,1,1 and 7,7,7,7,7, then the records with indexes 2 and 9 are selected.

After selecting A records as described above, dive below them to find A leaf pages and analyze them, counting the distinct records on each. The dive to the leaf level is performed by selecting a non-boring record from each page and diving below it. This way, a total of A leaf pages are analyzed. Let the number of distinct key values found on leaf page i be Pi (i=1..A), and let the number of distinct key values in the whole chosen level be V. Then the total number of distinct key values in the whole tree is estimated as:

    V * (P1 + P2 + ... + PA) / A

The above describes how to calculate the cardinality of an index. This algorithm is executed for each n-prefix of a multi-column index, where n=1..n_uniq. For other stats (e.g., the number of rows), the same algorithm is used, except that on the chosen level equal key values are not unified into one group; that is, A keys are picked at random from 1,1,1,2,2,7,7,7,7,7,9, regardless of whether some of them are equal.

Persistent storage modification

In the InnoDB Plugin and MySQL 5.1 the stats are kept in memory only and discarded when the server is shut down. In the new InnoDB (with this feature) the stats are stored in an InnoDB table, which the user can even change to alter the statistics manually. The internal SQL parser is used to access that table from within InnoDB.

UPDATE/INSERT of the persistent storage

* The user calls ANALYZE TABLE t:
  o The stats-related members of the structs are updated as usual.
  o The values of those members are added to the persistent storage:
    + If the persistent storage does not exist, do nothing (eventually emit a notice to the user that persistent stats are not enabled because the system tables are missing).
    + If the innodb_table_stats or innodb_index_stats table has the wrong format, issue an error.
    + If the relevant table/index is not in the persistent storage, INSERT.
    + If the relevant table/index is already in the persistent storage, UPDATE.
* The user UPDATEs the persistent storage directly:
  o Add a trigger to the stats tables that updates the dict_table_t struct in the dictionary cache, if one is loaded already. Do not touch MySQL structures; MySQL will fetch the info from the dictionary cache the next time it opens the table. This "trigger" is code inserted somewhere in the INSERT and UPDATE code paths that checks whether the table being changed needs special handling.
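The final cardinality estimate from the algorithm description, V * (P1 + ... + PA) / A (writing PA for the last term, since A leaf pages are analyzed), can be checked with a small sketch using illustrative names:

```python
def persistent_n_diff_estimate(v_level_distinct, distinct_per_leaf):
    """Combine the non-leaf level scan with the A leaf-page dives.

    v_level_distinct (V): distinct records counted on the chosen
    non-leaf level. distinct_per_leaf: P1..PA, the distinct counts
    found on each of the A analyzed leaf pages. The tree-wide
    estimate is V * (P1 + ... + PA) / A, i.e. each of the V level
    groups is assumed to expand into an average leaf page's worth
    of distinct values.
    """
    a = len(distinct_per_leaf)
    return v_level_distinct * sum(distinct_per_leaf) / a

# A level with 40 distinct records; A=2 dives found 5 and 7 distinct
# values on the two analyzed leaf pages: estimate 40 * 12 / 2 = 240.
print(persistent_n_diff_estimate(40, [5, 7]))  # 240.0
```

The intuition: the level scan supplies an exact distinct count V at a coarse granularity, and the A dives estimate how many distinct values each coarse group hides at the leaf level, so the error is bounded by the dive sampling alone rather than by extrapolating from raw random pages.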
SELECTing from the persistent storage

* When a table is not in the dictionary cache and is opened, read the relevant stats from the persistent storage and assign them to the stats members of the structs described above.
* MySQL can read the (in-memory) structs at any time; the persistent storage is not involved then.

DELETE from the persistent storage

* When a table or index is dropped, scan the persistent storage for the relevant entries and wipe them away.

IR: Do we allow the user to delete rows from the stats tables directly (as a mechanism to revert to the current behavior)? If so, will we maintain referential-integrity constraints between innodb_table_stats and innodb_index_stats, so that the user cannot delete from innodb_table_stats and leave behind orphan rows in innodb_index_stats?

Vasil: Yes, users will be able to DELETE, and RI is in place.

The statistics are calculated by dict_update_statistics_low(). Add a parameter to this function that tells it whether to calculate the stats (as before) and save them to the persistent storage, OR to fetch them from the persistent storage. If a fetch is requested but there is no data for the given table in the persistent storage, then calculate the stats and add them to the persistent storage. From all places except ANALYZE, request fetching from the persistent storage; from ANALYZE, request recalculation.

Pseudo code / raw code snippets

Translated to code changes, the above means:

* add one more parameter to dict_update_statistics_low()
* in dict_update_statistics_low():
  o if that parameter is RECALCULATE, then
    + if the persistent storage for the table exists, recalculate the stats using the new algorithm and save them
    + else do what the function currently does (and emit a warning?)
  o else (it must be FETCH_FROM_PERSISTENT_STORAGE)
    + if the persistent storage exists and there are stats for the table, read the stats from the persistent storage
    + else do what the function currently does
* all callers of dict_update_statistics_low() call it with FETCH_FROM_PERSISTENT_STORAGE, or are changed to accept one more parameter and pass it down unchanged to dict_update_statistics_low()
* the code path of ANALYZE is: ::analyze() -> ::info() -> dict_update_statistics() -> dict_update_statistics_low(). Move the whole body of ::info() into a new function X that accepts one more parameter and passes it down to dict_update_statistics(); in ::info(), call X(..., FETCH_FROM_PERSISTENT_STORAGE); in ::analyze(), instead of calling ::info(), call X(..., RECALCULATE)

Other Information

Performance Objectives

The key benefit to users is not the performance of the ANALYZE command itself, but the stability of the optimizer plan. Nevertheless, since we are implementing new sampling techniques, we should have some goals for the performance of these new techniques.

Performance Tests

Given that there are potentially new sampling techniques, a performance testing plan is required.

1. What happens to the stats tables on CREATE INDEX?
Currently nothing is done: the table will have an index without a corresponding entry in mysql.innodb_index_stats, and when stats are fetched from mysql.innodb_index_stats this index's stat fields are left untouched. The next ANALYZE command will add rows for the newly created index. We can easily handle this differently at any time.

2. What happens to the stats tables on DROP INDEX?

A DELETE FROM mysql.innodb_index_stats will be hooked next to the DELETE FROM SYS_INDEXES to drop the necessary rows from innodb_index_stats.

3. Allow updating the stats for one index only in ANALYZE?

Since changing the supported syntax in MySQL is out of the question, we could introduce a session variable and look at it during ANALYZE, something like:

    SET innodb_stats_only_for_indexes="index1;iii";
    ANALYZE TABLE t;

Assuming table t has 3 indexes (index1, iii and index2), the above would recalculate only the stats for the first two. This is not implemented.

References:

* http://download.oracle.com/docs/cd/B28359_01/server.111/b28274/stats.htm#PFGRF003
* http://download.oracle.com/docs/cd/B28359_01/appdev.111/b28419/d_stats.htm

Ref: rb://373
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.