In MySQL 5.6 we introduced a huge improvement in the way that index and table statistics are gathered by InnoDB and subsequently used by the Optimizer during query optimization: Persistent Statistics. Some aspects of the way that Persistent Statistics work could be improved further though, and we’d really like your input on that.
How much to sample?
The statistics are gathered by picking some pages semi-randomly, analyzing them, and deriving some conclusions about the entire table and/or index from those analyzed pages. The number of pages sampled can be specified on a per-table basis with the STATS_SAMPLE_PAGES
clause. For example:
ALTER TABLE t STATS_SAMPLE_PAGES=500;
This way you can request that more pages be sampled for larger tables in order to make the estimates more accurate. The more pages that are sampled, the better the estimates will be and thus the better the chance that the Optimizer will be able to choose optimal query plans, with the only downside being that the index statistics gathering operation (the sampling) itself becomes slower.
Being able to specify the number of pages to sample on a per table basis is flexible, but what if one can specify the value as a percentage of the table size, rather than an absolute number? Would that make sense? Setting such a global value would not meet every need and use case, because the given percentage may not be sufficient to get a good estimate on a small table and the same percentage may make the sampling process too slow for a very large table. Setting the percentage on a per table basis may also have little advantage over the existent functionality as one already knows if the table is huge or tiny and can set the appropriate value as an absolute number of pages to sample. What do you think? We would love to get your input on this! Do you see a lot of value in being able to set the number of pages to sample as a percentage of the table size or have it depend on the table size in some other way? Cast your vote:
How often to sample?
Currently, if 10% of the table has changed (rows updated, deleted, or inserted) since the previous statistics gathering operation then InnoDB will automatically trigger a recalculation of the statistics (it will analyze some leaf pages as discussed in the previous paragraph, reading them from disk if they are not already cached).
In MySQL 5.5 and older, where the auto recalculation threshold was set to 6.25%, it was a problem that the statistics were updated too often which caused too many execution plan changes. In general, we’ve found that 10% is a good overall threshold for this and we haven’t seen a great need to make this dynamic. But do you agree? Do you want to be able to change that value globally or even on a per table basis? Would you increment it or decrement it if you could? Cast your vote:
We look forward to your feedback! And as always, THANK YOU for using MySQL!