# WL#9223: Using histogram statistics in the optimizer

Affects: Server-8.0
Status: Complete

8.0 implemented syntax for both creating and removing histogram statistics in the server. This worklog will make use of histogram statistics in the optimizer. The primary use case for histogram statistics is for calculating the selectivity (filter effect) of predicates on the form "COLUMN operator CONSTANT".

F-1: The optimizer should make use of any applicable histogram statistics without the user having to turn any features or flags on/off.

F-2: If a query cannot use any histogram statistics, the query plan/cost MUST NOT be changed.

F-3: The optimizer SHOULD be able to use histogram statistics for all of these data types:

- BOOLEAN, BIT, TINYINT, SMALLINT, MEDIUMINT, INTEGER, BIGINT
- ENUM, SET
- FLOAT, DOUBLE, DECIMAL
- DATE, TIME, YEAR, DATETIME, TIMESTAMP
- CHAR, VARCHAR, TINYTEXT, TEXT, MEDIUMTEXT, LONGTEXT
- BINARY, VARBINARY, TINYBLOB, BLOB, MEDIUMBLOB, LONGBLOB

NF-1: If no histogram statistics are present, this worklog should have no negative impact on query performance.

== When are histograms useful ==

Histogram statistics are useful for non-indexed columns. For instance, if we have a column with a normal (Gaussian) data distribution, a histogram will help the query optimizer determine the filtering effect of any predicates on this column. Given the following table definition:

  CREATE TABLE orders (o_id INT, o_amount DECIMAL, o_date DATETIME);

We can create histogram statistics for the column o_amount using ANALYZE TABLE:

  ANALYZE TABLE orders UPDATE HISTOGRAM ON o_amount WITH 64 BUCKETS;

The histogram statistics can help the query optimizer answer how many rows will be returned from the table when a query has a predicate like the following:

  SELECT * FROM orders WHERE o_amount BETWEEN 100.0 AND 300.0;

Adding an index to such a column might solve the same problem. However, an index needs maintenance when table data is modified, while a histogram is only created/updated when the user explicitly asks for it. A histogram therefore adds no overhead during update/insert, but may give some of the same benefits during query optimization.

For columns that are indexed, we already have records_in_range(), which gives us dynamic estimates based on "index dives". These estimates are rather accurate, so histogram statistics are not necessarily useful in that case.
== When will histograms be used ==

We will provide histogram statistics for the following predicates:

- Equal to: "COLUMN = CONSTANT"
- Not equal to: "COLUMN != CONSTANT"
- Not equal to: "COLUMN <> CONSTANT"
- Greater than: "COLUMN > CONSTANT"
- Less than: "COLUMN < CONSTANT"
- Greater than or equal to: "COLUMN >= CONSTANT"
- Less than or equal to: "COLUMN <= CONSTANT"
- Is null: "COLUMN IS NULL"
- Is not null: "COLUMN IS NOT NULL"
- Between: "COLUMN BETWEEN CONSTANT AND CONSTANT"
- Not between: "COLUMN NOT BETWEEN CONSTANT AND CONSTANT"
- In: "COLUMN IN (CONSTANT[, CONSTANT])"
- Not in: "COLUMN NOT IN (CONSTANT[, CONSTANT])"

The requirement is that the value we are comparing the column against MUST be a constant value. Note that this also includes functions that are constant, such as ABS and FLOOR:

  SELECT * FROM tbl WHERE col1 < ABS(-34);

The histogram statistics will return a selectivity value between 0 and 1.0 inclusive.

== What does the user need to do in order to use histograms ==

For histograms to be used, two requirements need to be satisfied:

- Histogram statistics need to be present/created for a column in the predicate list.
- The predicate list must contain a predicate on one of the forms listed above.

If these two requirements are fulfilled, the optimizer may choose to use histogram statistics. There is no need to enable any optimizer switches or any other options. For example, this sequence of commands may use histogram statistics in the final query:

  CREATE TABLE t1 (col1 INT, col2 INT);
  CREATE TABLE t2 (col1 INT, col2 INT);
  ... fill tables t1 and t2 with data ...
  ANALYZE TABLE t1 UPDATE HISTOGRAM ON col2 WITH 64 BUCKETS;
  SELECT * FROM t1 JOIN t2 ON (t1.col1 = t2.col1) WHERE t1.col2 = 435;

In the final query, the optimizer might use histogram statistics to find the filtering effect of the predicate "t1.col2 = 435". This may in turn affect the join order of the query.
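Most of the comparison operators above reduce to a couple of cumulative-distribution primitives. The following is a minimal sketch of that reduction, using hypothetical toy data (a uniform distribution over [0, 100)); none of these function names exist in the server, and a real histogram only approximates the fractions computed here.

```cpp
#include <cmath>

// Toy cumulative distribution for illustration: values uniformly spread
// over [0, 100). frac_less_than(v) = fraction of values strictly below v.
static double frac_less_than(double v) {
  if (v <= 0.0) return 0.0;
  if (v >= 100.0) return 1.0;
  return v / 100.0;
}

// Fraction of values exactly equal to v; zero for this continuous toy data.
static double frac_equal(double) { return 0.0; }

// The comparison operators reduce to these primitives:
double sel_less_than(double v)    { return frac_less_than(v); }
double sel_greater_than(double v) { return 1.0 - frac_less_than(v) - frac_equal(v); }
double sel_between(double lo, double hi) {
  // sel(BETWEEN lo AND hi) = sel(<= hi) - sel(< lo)
  return (frac_less_than(hi) + frac_equal(hi)) - frac_less_than(lo);
}
double sel_not_between(double lo, double hi) { return 1.0 - sel_between(lo, hi); }
```

All results stay in the required [0, 1.0] range as long as the underlying cumulative fractions do.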
The query optimizer has a "list" of preferred statistics that looks like this:

1) Row estimates from the range optimizer
2) Row estimates from index statistics (rec_per_key)
3) Guesstimates (hard-coded filter effect)

Histogram statistics will go in as the second preferred option:

1) Row estimates from the range optimizer
2) Histogram statistics
3) Row estimates from index statistics (rec_per_key)
4) Guesstimates (hard-coded filter effect)

This means that if the optimizer finds that the range optimizer can be used to calculate the selectivity/filter effect of a predicate, histogram statistics will not be evaluated.

== Limitations ==

The histogram statistics will not handle covariance statistics. For example, given a query like "SELECT * FROM cars WHERE brand = 'Volvo' AND model = 'A6'", a person interested in cars will quickly understand that this predicate returns zero rows, since Volvo doesn't produce an A6 (Audi does). Histogram statistics, however, will not understand this.

Also, histogram statistics will not always make a query go faster. There are multiple reasons for this:

1) Histogram statistics may be out of date, and thus provide wrong statistics.
2) Even though the optimizer gets better statistics, the execution plan generated may be worse than the original one due to flaws in the optimizer.

However, if histogram statistics cause a regression, the solution is simply to remove the histogram:

  ANALYZE TABLE tbl DROP HISTOGRAM ON col;

A different way of disabling histogram statistics is to turn off the optimizer switch "condition_fanout_filter":

  SET optimizer_switch='condition_fanout_filter=off';

Note that this will also turn off the guesstimates described above.

== Observing the effect of histogram statistics ==

If histogram statistics are used, the user will be able to see the resulting effect using EXPLAIN.
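The preference order can be sketched as a simple fallback chain. This is illustrative only: the struct, the std::optional-based interface and the constant below are hypothetical stand-ins, not the server's actual code.

```cpp
#include <optional>

// Illustrative only: each statistics source either produces a selectivity
// estimate for a predicate or declines (std::nullopt).
struct Estimates {
  std::optional<double> range_optimizer;  // 1) most preferred
  std::optional<double> histogram;        // 2) added by this worklog
  std::optional<double> rec_per_key;      // 3) index statistics
};

constexpr double kGuesstimate = 0.3333;   // hypothetical hard-coded fallback

double predicate_selectivity(const Estimates &e) {
  // If the range optimizer has an estimate, the histogram is never evaluated.
  if (e.range_optimizer) return *e.range_optimizer;
  if (e.histogram)       return *e.histogram;
  if (e.rec_per_key)     return *e.rec_per_key;
  return kGuesstimate;                    // 4) guesstimate
}
```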
Given the following query:

  SELECT * FROM t1 WHERE col1 < 24;

If histogram statistics tell us that 57% of the rows in t1 will satisfy the predicate "col1 < 24", EXPLAIN will show the number "57.00" in the "filtered" column, given that there is no index available for "col1".

== Where to hook in the histogram selectivity estimation ==

Each operator in a query predicate is represented by an Item_func_*. For instance, the operator "<" (less than) is represented by Item_func_lt. Each of these objects has a function named "get_filtering_effect" that estimates the filtering effect of the predicate. We will add the histogram selectivity estimation inside this code. For example, Item_func_lt::get_filtering_effect currently contains the following code:

  {
    const Item_field *fld=
      contributes_to_filter(read_tables, filter_for_table, fields_to_ignore);
    if (!fld)
      return COND_FILTER_ALLPASS;

    return fld->get_cond_filter_default_probability(rows_in_table,
                                                    COND_FILTER_INEQUALITY);
  }

With this worklog, we will modify the function so that it will look something like this:

  {
    const Item_field *fld=
      contributes_to_filter(read_tables, filter_for_table, fields_to_ignore);
    if (!fld)
      return COND_FILTER_ALLPASS;

    double histogram_selectivity;
    if (!histograms::Histogram::get_selectivity(
          thd, args, arg_count, histograms::enum_operator::LESS_THAN,
          &histogram_selectivity))
    {
      return static_cast<float>(histogram_selectivity);
    }

    return fld->get_cond_filter_default_probability(rows_in_table,
                                                    COND_FILTER_INEQUALITY);
  }

This ensures that if histogram statistics exist for a column, they will be preferred over the hard-coded cost constants.

== Caching and locking of histograms ==

When the optimizer is using a histogram for estimating statistics, we don't want to read the histogram data from disk each time. We also need to make sure that it's impossible to remove histogram data while it's being used by the optimizer. The original plan for this worklog was to make use of the caching/locking that the data dictionary provides, since we implemented column statistics as a dictionary object. However, the performance of the dictionary caching and locking does not match the requirements.
Let's take a look at some numbers. First, I created a new sysbench test suite that simply produced a lot of queries on the following form:

  SELECT sbtest1.c FROM sbtest1, sbtest2
  WHERE sbtest1.id = sbtest2.id AND sbtest1.k BETWEEN " .. num .. " AND " .. (num + 1)

This creates a two-table join with a BETWEEN predicate, so that the function Item_func_between::get_filtering_effect is invoked during query optimization. I then made a baseline test run with the following sysbench command:

  sysbench --db-driver=mysql --threads=4 --time=20 --warmup-time=10
    --max-requests=0 --mysql-table-engine=innodb --oltp-read-only=on
    --oltp-auto-inc=off --mysql-user=root --oltp-table-size=100
    --oltp-dist-type=uniform --mysql-socket=/tmp/efroseth.sock
    --db-ps-mode=disable /home/efroseth/perftest/oltp_multitable.lua prepare

This creates a table "sbtest1". I renamed this table to "sbtest2" and re-ran the sysbench prepare, so that we have the two tables "sbtest1" and "sbtest2". I also removed all indexes except for the primary keys on sbtest1.id and sbtest2.id. I then made a baseline test with commit b7a4ae02d5e89f38d73034a19eb0a79bb97162c9, using the same command as above but with "run" instead of "prepare":

  baseline: 30947.94 QPS

Then I applied a patch which implemented a solution where Item_func_between::get_filtering_effect asked the dictionary cache for any histogram statistics on each invocation. The next sysbench run was made with this patch, but without any histogram statistics added:

  histogram patch, no histogram statistics: 13863.15 QPS

This is a 55% reduction in QPS, due to the fact that we have a cache miss on every call to Item_func_between::get_filtering_effect. I then added a histogram on "sbtest1.k", which gave the following result:

  histogram patch, with histogram statistics: 26805.18 QPS

There is still a 13% reduction, which is not acceptable.
So the solution we are going for is the following:

1) Add "malloc_unordered_map<uint, const histograms::Histogram*> *histograms;" as a member variable to TABLE_SHARE. The "uint" key is the field index.
2) In the function "get_table_share", populate this map with any existing histogram statistics for the table. We will make a copy of the histogram object from the dictionary cache to the TABLE_SHARE mem_root.
3) During selectivity estimation, look at this map instead of the dictionary cache when searching for histogram statistics.
4) In Sql_cmd_analyze_table::handle_histogram_command, call tdc_remove_table() after a histogram has been updated, created or removed. This removes the table and TABLE_SHARE from the table definition cache, so that any new queries will get a new TABLE_SHARE with the updated histogram statistics.

With this approach, I'm unable to see any regression with the same sysbench testing as above. The histogram statistics stored in TABLE_SHARE will be protected by the versioning of TABLE_SHARE. Any ongoing transaction will use the TABLE_SHARE it has fetched, even if an "ANALYZE TABLE tbl DROP HISTOGRAM" is run in parallel; it will keep the TABLE_SHARE until the transaction is complete.

== Comparison between user-provided values and stored histogram values ==

Consider the following case:

  CREATE TABLE tbl_int (col1 INT);
  INSERT INTO tbl_int VALUES (1), (2);
  SELECT * FROM tbl_int WHERE col1 > "01";

Should the comparison be done in a string context or in a numeric context? In MySQL, the above condition is compared in a numeric context. For estimating selectivity with histogram statistics, we will try to interpret the user-provided value as the type given by the column. In the above case, we will call "val_int()" on the user-provided value, since the column we have a predicate on is an integer column. In order to correctly compare TIME, DATE, DATETIME, ENUM and SET, we will expand the enum "Value_map_type" with values for all the mentioned types.
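Steps 1) and 3) above amount to a per-TABLE_SHARE hash map keyed on field index. Here is a minimal sketch, with std::unordered_map standing in for the server's mem_root-allocated malloc_unordered_map and an empty stub in place of the real histograms::Histogram class:

```cpp
#include <unordered_map>

namespace histograms { class Histogram {}; }  // stub for illustration

struct Table_share_sketch {
  // Field index -> histogram copied out of the dictionary cache when the
  // TABLE_SHARE is created (step 2 above).
  std::unordered_map<unsigned, const histograms::Histogram *> histograms;

  // Selectivity estimation consults this map instead of the dictionary
  // cache, so the hot path needs no dictionary locking (step 3 above).
  const histograms::Histogram *find_histogram(unsigned field_index) const {
    auto it = histograms.find(field_index);
    return it == histograms.end() ? nullptr : it->second;
  }
};
```

Because each query keeps its TABLE_SHARE for the duration of the transaction, a concurrent DROP HISTOGRAM only affects shares handed out after tdc_remove_table() has run.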
We need to keep the exact type information in the histogram in order to know how we should interpret the user-provided value. So an Equi_height construction like this:

  Equi_height<longlong> histogram(mem_root, "db", "tbl", "col",
                                  Value_map_type::ENUM);

...means that the data is stored as a longlong (the template type), but interpreted as an ENUM.

== New files and functions ==

histogram.cc
------------

  class Histogram
  {
  public:
    /**
      Try to get the selectivity for a field.

      This function works for a variety of operators, and it has different
      requirements for each of them:

      EQUALS_TO, NOT_EQUALS_TO, GREATER_THAN, LESS_THAN, LESS_THAN_OR_EQUAL
      and GREATER_THAN_OR_EQUAL:
        One of the Items MUST be a field item, and the other item MUST be a
        constant item. The order of them does not matter; the function will
        handle both "field OP constant" and "constant OP field".

      IS_NULL, IS_NOT_NULL:
        For these operators, we only need a single Item, which MUST be a
        field item.

      BETWEEN, NOT_BETWEEN:
        For these two operators there MUST be three items, where the first
        MUST be a field item and the rest MUST be constant items. This is
        for queries on the form "col (NOT) BETWEEN constant AND constant".

      IN, NOT_IN:
        For these two operators there MUST be at least two items, where the
        first MUST be a field item and the rest MUST be constant items. This
        is for queries on the form "col (NOT) IN (constant, ..., constant)".

      If the above requirements are fulfilled AND we have a histogram for
      the field in question, this function will return the selectivity for
      the operator/field.

      @param      items       An array of items to be evaluated. This array
                              has different requirements based on which
                              operator is being used, as explained above.
      @param      item_count  The number of items in the array "items".
      @param      op          The operator to be used/carried out.
      @param[out] selectivity If the requirements are fulfilled, the
                              selectivity of the operation is stored here.

      @return false on success, true on error. Some of the requirements
              listed above may not be fulfilled.
    */
    bool get_selectivity(Item **items, size_t item_count, enum_operator op,
                         double *selectivity) const;
  };

  /**
    Ask the dictionary cache for histogram statistics for a given column.

    @param thd          Thread handle
    @param schema_name  Schema name
    @param table_name   Table name
    @param column_name  Column name

    @return nullptr if no histogram is found. Otherwise, a pointer to a
            histogram. Please take a copy of the histogram if it must
            survive longer than the current transaction!
  */
  const Histogram *find_histogram(THD *thd, const std::string &schema_name,
                                  const std::string &table_name,
                                  const std::string &column_name);

singleton.cc / equi_height.cc
-----------------------------

  class Singleton / Equi_height
  {
  public:
    /**
      Estimate the fraction of values that matches the given value.

      @param value The value to match.
      @return the selectivity in the range 0.0 to 1.0 (inclusive).
    */
    double get_equal_to_selectivity(const T &value) const;

    /**
      Estimate the fraction of values that are less than the given value.

      @param value The value to match.
      @return the selectivity in the range 0.0 to 1.0 (inclusive).
    */
    double get_less_than_selectivity(const T &value) const;

    /**
      Estimate the fraction of values that are greater than the given value.

      @param value The value to match.
      @return the selectivity in the range 0.0 to 1.0 (inclusive).
    */
    double get_greater_than_selectivity(const T &value) const;
  };

table.h
-------

  struct TABLE_SHARE
  {
    /// The histograms that this table has, where the uint key is the field
    /// index.
    malloc_unordered_map<uint, const histograms::Histogram *> *histograms;

    /**
      @return the histogram for the given field index. nullptr is returned
              if no histogram is found.
    */
    const histograms::Histogram *find_histogram(uint field_index) const;
  };
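To illustrate the kind of computation get_less_than_selectivity performs for an equi-height histogram, here is a sketch. The Bucket layout, the cumulative_frequency field and the within-bucket linear interpolation below are assumptions made for illustration, not the server's actual data structures.

```cpp
#include <vector>

// Assumed equi-height bucket: covers the value range [lower, upper], and
// cumulative_frequency holds the fraction of all values <= upper.
struct Bucket {
  double lower;
  double upper;
  double cumulative_frequency;
};

// Fraction of values strictly less than `value`, given buckets sorted in
// ascending order with non-decreasing cumulative frequencies.
double get_less_than_selectivity(const std::vector<Bucket> &buckets,
                                 double value) {
  double previous_cumulative = 0.0;
  for (const Bucket &b : buckets) {
    if (value <= b.lower) return previous_cumulative;  // bucket not reached
    if (value <= b.upper) {
      // Value falls inside this bucket: assume values are uniformly
      // distributed within the bucket (linear interpolation).
      double bucket_fraction = b.cumulative_frequency - previous_cumulative;
      double covered = (value - b.lower) / (b.upper - b.lower);
      return previous_cumulative + bucket_fraction * covered;
    }
    previous_cumulative = b.cumulative_frequency;
  }
  return 1.0;  // value is above the last bucket
}
```

The greater-than and equal-to estimates can be derived analogously, which is why the class exposes exactly these three primitives.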

Copyright (c) 2000, 2019, Oracle Corporation and/or its affiliates. All rights reserved.