WL#9223: Using histogram statistics in the optimizer

Affects: Server-8.0   —   Status: Complete

8.0 implemented syntax for both creating and removing histogram statistics
in the server. This worklog will make use of histogram statistics in the
optimizer. The primary use case for histogram statistics is calculating
the selectivity (filter effect) of predicates of the form
"COLUMN operator CONSTANT".
F-1: The optimizer should make use of any histogram statistics if applicable,
     without the user having to turn any features or flags on/off.

F-2: If a query cannot use any histogram statistics, the query plan/cost MUST
     NOT be changed.

F-3: The optimizer SHOULD be able to use histogram statistics for all of the
     following data types:

     BOOLEAN, BIT, TINYINT, SMALLINT, MEDIUMINT, INTEGER, BIGINT
     ENUM, SET
     FLOAT, DOUBLE, DECIMAL
     DATE, TIME, YEAR, DATETIME, TIMESTAMP
     CHAR, VARCHAR, TINYTEXT, TEXT, MEDIUMTEXT, LONGTEXT
     BINARY, VARBINARY, TINYBLOB, BLOB, MEDIUMBLOB, LONGBLOB

NF-1: If no histogram statistics are present, this worklog should have no
      negative impact on query performance.

== When are histograms useful ==
Histogram statistics are useful for non-indexed columns. For instance,
if we have a column with a normal (Gaussian) data distribution, a histogram
will help the query optimizer determine the filtering effect of any
predicates on this column. Given the following table definition:

  CREATE TABLE orders (o_id INT, o_amount DECIMAL, o_date DATETIME);

We can create histogram statistics for the column o_amount using ANALYZE TABLE:

  ANALYZE TABLE orders UPDATE HISTOGRAM ON o_amount WITH 64 BUCKETS;

The histogram statistics can help the query optimizer answer how many
rows will be returned from the table when a query has a predicate like
the following:

  SELECT * FROM orders WHERE o_amount BETWEEN 100.0 AND 300.0;

Adding an index to such a column might also solve the same problem. However,
an index needs maintenance when table data is modified, while a histogram is
only created/updated when the user explicitly asks for it. So a histogram does
not add any overhead during update/insert, but it may give some of the same
benefits during query optimization.

For columns that are indexed, we already have records_in_range(), which gives
us dynamic estimates based on "index dives". These are rather accurate
estimates, and histogram statistics are thus not necessarily useful in this
case.


== When will histograms be used ==
We will provide histogram statistics for the following predicates:

  Equal to:                 "COLUMN = CONSTANT"
  Not equal to:             "COLUMN != CONSTANT"
  Not equal to:             "COLUMN <> CONSTANT"
  Greater than:             "COLUMN > CONSTANT"
  Less than:                "COLUMN < CONSTANT"
  Greater than or equal to: "COLUMN >= CONSTANT"
  Less than or equal to:    "COLUMN <= CONSTANT"
  Is null:                  "COLUMN IS NULL"
  Is not null:              "COLUMN IS NOT NULL"
  Between:                  "COLUMN BETWEEN CONSTANT AND CONSTANT"
  Not between:              "COLUMN NOT BETWEEN CONSTANT AND CONSTANT"
  In:                       "COLUMN IN (CONSTANT[, CONSTANT])"
  Not in:                   "COLUMN NOT IN (CONSTANT[, CONSTANT])"

The requirement is that the value the column is compared against MUST be a
constant value. Note that this also includes functions that are constant,
such as ABS and FLOOR:

  SELECT * FROM tbl WHERE col1 < ABS(-34);

The histogram statistics will return a selectivity value between 0.0 and 1.0,
inclusive.
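
To make the use of such a value concrete, here is a minimal sketch (with
hypothetical names; this is not server code) of how a selectivity translates
into a row estimate:

  #include <cstdio>

  double estimated_rows(double selectivity, double rows_in_table)
  {
    return selectivity * rows_in_table;
  }

  int main()
  {
    // A selectivity of 0.57 on a 1000-row table gives an estimated
    // 570 qualifying rows.
    std::printf("%.0f\n", estimated_rows(0.57, 1000.0));  // prints 570
    return 0;
  }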

== What does the user need to do in order to use histograms ==
For histograms to be used, there are two requirements that need to be
satisfied:
  - Histogram statistics need to be present/created for a column in the
    predicate list.

  - The predicate list must contain a predicate of one of the forms listed
    above.

If these two requirements are fulfilled, the optimizer may choose to use
histogram statistics. There is no need to enable any optimizer switches or any
other options. For example, this sequence of commands may use histogram
statistics in the final query:

  CREATE TABLE t1 (col1 INT, col2 INT);
  CREATE TABLE t2 (col1 INT, col2 INT);
  ... fill tables t1 and t2 with data ...
  ANALYZE TABLE t1 UPDATE HISTOGRAM ON col2 WITH 64 BUCKETS;
  SELECT * FROM t1 JOIN t2 ON (t1.col1 = t2.col1) WHERE t1.col2 = 435;

In the final query, the optimizer might use histogram statistics to estimate
the filtering effect of the predicate "t1.col2 = 435". This in turn may or may
not affect the join order of the query.


The query optimizer has a "list" of preferred statistics that looks like this:

1) Row estimates from the range optimizer
2) Row estimates from index statistics (rec_per_key)
3) Guesstimates (hard-coded filter effect)

Histogram statistics will go in as the second preferred option:

1) Row estimates from the range optimizer
2) Histogram statistics
3) Row estimates from index statistics (rec_per_key)
4) Guesstimates (hard-coded filter effect)

This means that if the optimizer finds that the range optimizer can be used
to calculate the selectivity/filter effect of a predicate, histogram statistics
will not be evaluated.
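
To make the ordering concrete, the fall-through can be sketched as follows.
This is a simplified, self-contained sketch with hypothetical helper names;
the real logic lives in the various Item_func_*::get_filtering_effect
implementations:

  #include <optional>

  std::optional<double> range_optimizer_estimate()  { return std::nullopt; }
  std::optional<double> histogram_estimate()        { return std::nullopt; }
  std::optional<double> index_statistics_estimate() { return std::nullopt; }

  double predicate_selectivity()
  {
    if (auto s= range_optimizer_estimate())  return *s; // 1) range optimizer
    if (auto s= histogram_estimate())        return *s; // 2) histogram
    if (auto s= index_statistics_estimate()) return *s; // 3) rec_per_key
    return 1.0 / 3.0; // 4) guesstimate, e.g. COND_FILTER_INEQUALITY
  }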


== Limitations ==
The histogram statistics will not handle covariance statistics (correlation
between columns). For example, given a query like
"SELECT * FROM cars WHERE brand = 'Volvo' AND model = 'A6'", a person
interested in cars will quickly understand that this predicate returns zero
rows, since Volvo doesn't produce the A6 (Audi does). Histogram statistics,
however, will not understand this.


Also, histogram statistics will not always make a query go faster. There are
multiple reasons for this:

1) Histogram statistics may be out of date, and thus provide wrong statistics.
2) Even though the optimizer gets better statistics, the execution plan
   generated may be worse than the original one due to flaws in the optimizer.

However, if histogram statistics cause a regression, the solution is simply to
remove the histogram using "ANALYZE TABLE tbl DROP HISTOGRAM ON col".
A different way of disabling histogram statistics is to turn off the
optimizer switch "condition_fanout_filter":

  SET optimizer_switch='condition_fanout_filter=off';

Note that this will also turn off the guesstimates described above.

== Observing the effect of histogram statistics ==
If histogram statistics are used, the user will be able to see the resulting
effect using EXPLAIN. Given the following query;

  SELECT * FROM t1 WHERE col1 < 24;

If histogram statistics tell us that 57% of the rows in t1 will satisfy the
predicate "col1 < 24", EXPLAIN will show the number "57.00" (the selectivity
multiplied by 100) in the "filtered" column, given that no index is available
for "col1".

== Where to hook in the histogram selectivity estimation ==

Each operator in a query predicate is represented by an Item_func_* object.
For instance, the operator "<" (less than) is represented by Item_func_lt. Each
of these classes has a function named "get_filtering_effect" that estimates the
filtering effect of the predicate. We will add the histogram selectivity
estimation inside this code. For example, Item_func_lt::get_filtering_effect
currently contains the following code:

  {
    const Item_field* fld=
      contributes_to_filter(read_tables, filter_for_table, fields_to_ignore);
    if (!fld)
      return COND_FILTER_ALLPASS;

    return fld->get_cond_filter_default_probability(rows_in_table,
                                                    COND_FILTER_INEQUALITY);
  }

With this worklog, we will modify the function so that it will look something
like this:

  {
    const Item_field* fld=
      contributes_to_filter(read_tables, filter_for_table, fields_to_ignore);
    if (!fld)
      return COND_FILTER_ALLPASS;

    double histogram_selectivity;
    if (!histograms::Histogram::get_selectivity(thd, args, arg_count,
                                                histograms::enum_operator::LESS_THAN,
                                                &histogram_selectivity))
    {
      return static_cast<float>(histogram_selectivity);
    }

    return fld->get_cond_filter_default_probability(rows_in_table,
                                                    COND_FILTER_INEQUALITY);
  }

This will ensure that if histogram statistics exist for a column, they will be
preferred over the hard-coded cost constants.

== Caching and locking of histograms ==

When the optimizer is using histograms for estimating statistics, we don't
want to read the histogram data from disk each time. We also need to make sure
that it's impossible to remove histogram data while it's being used by the
optimizer. The original plan for this worklog was to make use of the
caching/locking that the data dictionary provides, since we implemented column
statistics as a dictionary object. However, the performance of the dictionary
caching and locking does not match the requirements. Let's take a look at some
numbers:

First, I created a new sysbench test suite that simply produced a lot of
queries of the following form:

  SELECT sbtest1.c FROM sbtest1, sbtest2 WHERE sbtest1.id = sbtest2.id
    AND sbtest1.k BETWEEN " .. num .. " AND " .. (num + 1)

This creates a two-table join with a BETWEEN predicate, so that the function
Item_func_between::get_filtering_effect is invoked during query optimization.
I first prepared the test data with the following sysbench command:

  sysbench --db-driver=mysql --threads=4 --time=20 --warmup-time=10
  --max-requests=0 --mysql-table-engine=innodb --oltp-read-only=on
  --oltp-auto-inc=off --mysql-user=root --oltp-table-size=100
  --oltp-dist-type=uniform --mysql-socket=/tmp/efroseth.sock
  --db-ps-mode=disable /home/efroseth/perftest/oltp_multitable.lua prepare

This creates a table "sbtest1". I renamed this table to "sbtest2" and re-ran
the sysbench prepare step, so that we have two tables "sbtest1" and "sbtest2".
I also removed all indexes except for the primary keys on sbtest1.id and
sbtest2.id.

I then made a baseline test run with commit
b7a4ae02d5e89f38d73034a19eb0a79bb97162c9, using the same command as above but
with "run" instead of "prepare":

  baseline: 30947.94 QPS

Then I applied a patch implementing a solution where
Item_func_between::get_filtering_effect asked the dictionary cache for
histogram statistics on each invocation. The next sysbench run was made with
this patch, but without any histogram statistics added:

  histogram patch, no histogram statistics: 13863.15 QPS

This is a 55% reduction in QPS, caused by a cache miss on every call to
Item_func_between::get_filtering_effect. I then added a histogram on
"sbtest1.k", which gave the following result:

  histogram patch, with histogram statistics: 26805.18 QPS

There is still a 13% reduction, which is not acceptable.

So the solution we are going for is the following:

1) Add "malloc_unordered_map<uint, const histograms::Histogram *> *histograms;"
   as a member variable to TABLE_SHARE. The "uint" key is the field index.

2) In function "get_table_share", populate this map with any existing histogram
   statistics for the table. We will make a copy of the histogram object from
   the dictionary cache to the TABLE_SHARE mem_root.

3) During selectivity estimation, look at this map instead of the dictionary
   cache for finding any histogram statistics (see the sketch after this
   list).

4) In Sql_cmd_analyze_table::handle_histogram_command, we will call
   tdc_remove_table() after a histogram has been updated, created or removed.
   This will remove the table and TABLE_SHARE from the table definition cache
   so that any new queries will get a new TABLE_SHARE with the updated histogram
   statistics.
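
As a sketch, the lookup in step 3 might look like the following (assuming the
map from step 1; it mirrors the TABLE_SHARE::find_histogram declaration listed
under "New files and functions" below):

  const histograms::Histogram*
  TABLE_SHARE::find_histogram(uint field_index) const
  {
    // No histograms were loaded for this table in get_table_share().
    if (histograms == nullptr)
      return nullptr;

    const auto found= histograms->find(field_index);
    if (found == histograms->end())
      return nullptr; // No histogram exists for this field index.
    return found->second;
  }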

With this approach, I'm unable to see any regression with the same sysbench
testing as above.

The histogram statistics that are stored in the TABLE_SHARE will be protected
by the versioning of the TABLE_SHARE. Any ongoing transaction will use the
TABLE_SHARE it has already fetched, even if an "ANALYZE TABLE tbl DROP
HISTOGRAM" is run in parallel, and will keep that TABLE_SHARE until the
transaction is complete.

== Comparison between user-provided values and stored histogram values ==

Consider the following case:

  CREATE TABLE tbl_int (col1 INT);
  INSERT INTO tbl_int VALUES (1), (2);
  SELECT * FROM tbl_int WHERE col1 > "01";

Should the comparison be done in a string context or in a numeric context? In
MySQL, the above condition is compared in a numeric context. For estimating
selectivity with histogram statistics, we will try to interpret the
user-provided value as the type given by the column. In the above case, we
will call "val_int()" on the user-provided value, since the column we have a
predicate on is an integer column.
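
As a stand-alone illustration (with stand-in types; this is not the server's
real Item class), the conversion picks the accessor that matches the column's
type:

  #include <cstdio>
  #include <cstdlib>

  // Stand-in for a constant Item in the predicate.
  struct Constant
  {
    const char* text;
    long long val_int() const { return std::atoll(text); }
  };

  int main()
  {
    Constant user_value{"01"};
    // The column is an INT column, so val_int() is used: the predicate
    // col1 > "01" is evaluated as col1 > 1 against the histogram.
    std::printf("col1 > %lld\n", user_value.val_int());
    return 0;
  }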

In order to correctly compare TIME, DATE, DATETIME, ENUM and SET values, we
will expand the enum "Value_map_type" with values for all the mentioned types.
We need to keep the exact type information in the histogram in order to know
how we should interpret the user-provided value. So an Equi_height construction
like this:

  Equi_height<longlong> histogram(mem_root, "db", "tbl", "col",
                                  Value_map_type::ENUM);

... means that the data is stored as a longlong (the template type), but the
data is interpreted as an ENUM.
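
A simplified stand-in (not the server's Equi_height class) illustrating this
separation between the storage type and the interpreted type:

  #include <string>

  enum class Value_map_type { ENUM, SET, INT /* ... */ };

  // T is the storage type; "interpreted_as" records how the stored
  // values should be interpreted when compared with user input.
  template <class T>
  struct Equi_height_sketch
  {
    std::string db, table, column;
    Value_map_type interpreted_as;
  };

  // Stored as long long, interpreted as ENUM, as in the example above.
  Equi_height_sketch<long long> histogram{"db", "tbl", "col",
                                          Value_map_type::ENUM};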

== New files and functions ==

histogram.cc
------------
class Histogram
{
public:
  /**
    Try to get the selectivity for a field.

    This function works for a variety of operators, and it has different
    requirements for each operator:

    EQUALS_TO, NOT_EQUALS_TO, GREATER_THAN, LESS_THAN, LESS_THAN_OR_EQUAL and
    GREATER_THAN_OR_EQUAL:
      If the operator is one of the above, one of the Items MUST be a field item
      and the other item MUST be a constant item. The order of them does not
      matter; the function will handle both "field OP constant" and "constant OP
      field".

    IS_NULL
    IS_NOT_NULL
      For these operators, we only need a single Item that MUST be a field item.

    BETWEEN
    NOT_BETWEEN
      For these two operators there MUST be three items where the first MUST be
      a field item and the rest MUST be constant items. This will be for
      queries of the form "col (NOT) BETWEEN constant AND constant".

    IN
    NOT_IN
      For these two operators there MUST be at least two items where the first
      MUST be a field item and the rest MUST be constant items. This will be
      for queries of the form "col (NOT) IN (constant, ..., constant)".

    If the above requirements are fulfilled AND we have a histogram for the
    field in question, this function will return the selectivity for the
    operator/field.

    @param items            An array of items to be evaluated. This array has
                            some different requirements based on which operator
                            is being used, as explained above.
    @param item_count       The number of items in the array "items".
    @param op               The operator to be used/carried out.
    @param[out] selectivity If the needed requirements are fulfilled, the
                            selectivity of the operation is stored here.

    @return false on success, true on error. Some of the requirements listed
            above may not be fulfilled.
  */
  bool get_selectivity(Item **items, size_t item_count, enum_operator op,
                       double *selectivity) const;
};

/**
  Ask the dictionary cache for histogram statistics for a given column.

  @param thd         Thread handle
  @param schema_name Schema name
  @param table_name  Table name
  @param column_name Column name

  @return nullptr if no histogram is found. Otherwise, a pointer to a histogram.
          Please take a copy of the histogram if it must survive longer than
          the current transaction!
*/
const Histogram *find_histogram(THD *thd, const std::string &schema_name,
                                const std::string &table_name,
                                const std::string &column_name);

singleton.cc / equi_height.cc
-----------------------------
class Singleton / Equi_height
{
public:
  /**
    Estimate the fraction of values that match the given value.

    @param value The value to match

    @return the selectivity in the range 0.0 to 1.0 (inclusive)
  */
  double get_equal_to_selectivity(const T& value) const;

  /**
    Estimate the fraction of values that are less than the given value.

    @param value The value to match

    @return the selectivity in the range 0.0 to 1.0 (inclusive)
  */
  double get_less_than_selectivity(const T& value) const;

  /**
    Estimate the fraction of values that are greater than the given value.

    @param value The value to match

    @return the selectivity in the range 0.0 to 1.0 (inclusive)
  */
  double get_greater_than_selectivity(const T& value) const;
};
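
Derived operators can be composed from these three primitives. For example,
for "col BETWEEN low AND high", the selectivity might be composed as follows
(an assumed composition that ignores NULL values, not necessarily how the
server combines them):

  template <class Histogram, class T>
  double between_selectivity(const Histogram& hist, const T& low, const T& high)
  {
    // P(low <= col <= high) = P(col < high) + P(col = high) - P(col < low)
    return hist.get_less_than_selectivity(high) +
           hist.get_equal_to_selectivity(high) -
           hist.get_less_than_selectivity(low);
  }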


table.h
-------
struct TABLE_SHARE
{
  /// The histograms that this table has, where the uint key is the field index.
  malloc_unordered_map<uint, const histograms::Histogram *> *histograms;

  /**
    @return the histogram for the given field index. nullptr is returned if
            no histogram is found.
  */
  const histograms::Histogram* find_histogram(uint field_index) const;
};