WL#8706: Persistent storage of Histogram data

Affects: Server-8.0 — Status: Complete

This worklog covers the work needed to store histogram data persistent. This is
needed so that histogram data does not have to be created each time the server
starts.

We have considered several ways of storing histograms persistent, and we
propose to store the histogram statistics in a new system table;
mysql.column_stats. The histogram itself will be stored in a JSON column, which
gives us a very flexible and powerful way of storing histograms. We also present
some of the other approaches we have considered.

Functional Requirements
=======================

F-1: The storage method chosen MUST be able to store histograms for all of the
     following data types:
     - BOOL/BOOLEAN, BIT, TINYINT, SMALLINT, MEDIUMINT, INT/INTEGER, BIGINT
     - FLOAT, DOUBLE, DECIMAL
     - DATE, TIME, YEAR, DATETIME, TIMESTAMP
     - CHAR, VARCHAR
     - TINYTEXT, MEDIUMTEXT, TEXT, LONGTEXT
     - BINARY, VARBINARY, TINYBLOB, BLOB, MEDIUMBLOB, LONGBLOB
     - ENUM, SET

F-2: After booting the server or upgrading from a previous version, one new
     table MUST be present in the database "mysql":

       column_stats

F-3: The new system table MUST be created in the InnoDB storage engine.

F-4: After installing MySQL or upgrading from a previous version without
     histogram support, the new table MUST be empty

F-5: SQL statements to mysql.column_stats SHALL be replicated unless anything
     else is specified at server startup. See "High-Level Specification" for a
     discussion around replication.

F-6: This worklog shall not have any effect on the query optimizer.

F-7: The new system table MUST be accessible by the root user only.

Non-Functional Requirements
===========================
NF-1: The storage SHOULD facilitate for expanding MySQL with table- and/or
      column statistics in the future.

NF-2: The storage SHOULD facilitate for adding new histogram types in the
      future. These new histogram types might require additional attributes to
      function properly.

For the first implementation of histogram statistics, we will implement two
different histogram types; a small variant of the equi-height histogram and a
"singleton" histogram.

For the equi-height histogram, we want to store the following attributes for
each histogram bucket:
- Lower inclusive value.
- Upper inclusive value.
- Cumulative frequency.
- Number of distinct values.

For the singleton histogram, we want to store the following attributes for each
histogram bucket:
- Value.
- Cumulative frequency.

We have considered multiple ways of modeling and storing the histograms, and we
present some of the solutions we have considered below.

DIFFERENT WAYS OF MODELING HISTOGRAMS
=====================================

== STORING HISTOGRAM IN A SYSTEM TABLE, WITH ONE ROW FOR EACH BUCKET ==

This approach seems like the most common among other systems. Basically, a
table will have one column for each attribute needed in the histogram. This
normally includes upper inclusive value, lower inclusive value, bucket
frequency, the bucket number and possibly a reference to the column the
histogram belongs to. The data will be easily readable by humans, have a decent
performance etc. This approach has a few drawbacks though:
- Finding the "best" data type for storing lower and upper bucket values.
  Since we must support several data types, we would probably need to use BLOB,
  BINARY or VARCHAR. A different approach could be to convert everything to a 
  BIGINT. This will however give us some big restrictions, especially with
  strings.
- We might want to implement other histogram types in the future. This can lead
  to changes in the schema, which may (or may not) cause issues.

With all of the attributes mentioned above, each bucket will require around
70 bytes on average, but this depends a lot on the data type stored in the
histogram.


== STORING HISTOGRAM IN A SINGLE FIELD ==

MariaDB has chosen this approach, where they store the entire histogram
in a VARBINARY(255). This gives however some big limitations. A histogram
over string columns is very limited (only a small prefix of the string is
considered), and it also imposes a limitation on the number of buckets. So we
want to use a "bigger" data type that can store more information.

With the new JSON type, one option is to store the entire histogram as a JSON
object (equi-height histogram shown):

  {
    "histogram-type": "equi-height",
    "last-updated": "2015-06-08 12:09:00.000000",
    "buckets":
    [
      { "low": 3, "high": 3, "freq": 0.02, "distinct": 1 },
      { "low": 4, "high": 4, "freq": 0.98, "distinct": 1 },
      ...
    ]
  }

Each bucket will require a bit over 100 bytes. In order to make it more compact,
each bucket can be stored as an array instead of an object. This will save
around 30% for each bucket:

  "buckets":
  [
    [ 3, 3, 0.02, 1 ],
    [ 4, 4, 0.98, 1 ],
    ...
  ]

By using this approach, it solves most of the challenges mentioned earlier:
- Virtually unlimited number of buckets if desirable.
- Any data type can be stored in a readable format.
- Strings can be stored in their full length if desirable.
- Adding new histogram types in the future is trivial.

So, what are the drawbacks with this approach?
- A very "untraditional" way of doing it.
- May require a bit more storage. This however depends on the data types
  stored in the histogram. Numeric values favor this approach, while other
  types mostly favor the other approach.

A different approach is to create a new format for storing histograms, and
serialize it into a BLOB column. This will however increase the complexity, and
most likely increase the amount of things that can go wrong. We have thus not
considered this approach any further.



TABLE DEFINITION ALTERNATIVES
=============================

- A histogram must support multiple different data types. The choice of storage
  must reflect this.
- It is possible that we want to extend MySQL with more statistics about tables
  and columns in the future. We should thus choose a storage option that
  facilitates this.
- The histogram should not require unreasonable amount of storage space.
- We might want to add more histogram types in the future, and the storage
  format chosen should facilitate this.

Based on the above observations, we present the following four ways to store
the histograms using the two modeling approaches descibed above:

1) Three new system tables in 'mysql':

  +----------------------------+
  | mysql.table_stats          |
  +----------------------------+
  | PK table_id      (INT)     |
  |    database_name (VARCHAR) |
  |    table_name    (VARCHAR) |
  +----------------------------+

  +--------------------------+
  | mysql.column_stats       |
  +--------------------------+
  | PK column_id   (INT)     |
  | PK table_id    (INT)     |
  |    column_name (VARCHAR) |
  +--------------------------+

  +---------------------------------+
  | mysql.histograms                |
  +---------------------------------+
  | PK column_id        (INT)       |
  | PK bucket_number    (INT)       |
  |    histogram_type   (ENUM)      |
  |    bucket_high      (VARBINARY) |
  |    bucket_low       (VARBINARY) |
  |    bucket_frequency (DOUBLE)    |
  |    num_distinct     (INT)       |
  +---------------------------------+

  -- ADVANTAGES --
  - Easy to expand for both table and column statistics in the future.

  -- DISADVANTAGES --
  - VARBINARY has a limited length of 255.
  - Quite a few new system tables, which leads to more FK constraints
    (complexity).
  - Using auto increment IDs imposes an upper limit a few places.

2) Two new system tables in 'mysql':

  +----------------------------+
  | mysql.table_stats          |
  +----------------------------+
  | PK table_id      (INT)     |
  |    database_name (VARCHAR) |
  |    table_name    (VARCHAR) |
  +----------------------------+

  +--------------------------+
  | mysql.column_stats       |
  +--------------------------+
  | PK table_id    (INT)     |
  | PK column_name (VARCHAR) |
  |    histogram   (JSON)    |
  +--------------------------+

  -- ADVANTAGES --
  - Flexible to store both different data types, new histogram types and such
    in the future.
  - Possible to implement both table and column statistics in the future.

  -- DISADVANTAGES --
  - Requires more storage space than proposal number 1.
  - Using auto increment IDs imposes an upper limit a few places.

3) Two new system tables in 'mysql':

  +----------------------------+
  | mysql.column_stats         |
  +----------------------------+
  | PK column_id     (INT)     |
  |    database_name (VARCHAR) |
  |    table_name    (VARCHAR) |
  |    column_name   (VARCHAR) |
  +----------------------------+

  +---------------------------------+
  | mysql.histograms                |
  +---------------------------------+
  | PK column_id        (INT)       |
  | PK bucket_number    (INT)       |
  |    histogram_type   (ENUM)      |
  |    bucket_high      (VARBINARY) |
  |    bucket_low       (VARBINARY) |
  |    bucket_frequency (DOUBLE)    |
  |    num_distinct     (INT)       |
  +---------------------------------+

  -- ADVANTAGES --
  - Less FK constraints/complexity than proposal number one.

  -- DISADVANTAGES --
  - VARBINARY has a limited length of 255.
  - Does not facilitate for table statistics in the future.
  - Using auto increment IDs imposes an upper limit a few places.

4) One new system table in 'mysql':

  +----------------------------+
  | mysql.column_stats         |
  +----------------------------+
  | PK database_name (VARCHAR) |
  | PK table_name    (VARCHAR) |
  | PK column_name   (VARCHAR) |
  |    histogram     (JSON)    |
  +----------------------------+

  -- ADVANTAGES --
  - Flexible to store both different data types, new histogram types and such
    in the future.
  - No limit with auto increment column as mentioned above.
  - Simple, but flexible design.

  -- DISADVANTAGES --
  - Repeating database name and table name is redundant.
  - Does not facilitate for table statistics in the future.



THE CHOSEN APPROACH
===================
We have chosen to go for suggestion number four, due to both the simplicity and
the flexibility it gives.



JSON HISTOGRAM EXAMPLES
=======================
The following section will give some examples on how much storage space the
chosen storage solution will use in different scenarios.

# 100 buckets with DATETIME values
  - 4781 bytes
  - A sample from the histogram:
  {
    "last-updated": "2015-06-08 12:09:00.000000",
    "histogram-type": "equi-height",
    "buckets":
    [
      [
        "2015-10-21 13:13:04.000000",
        "2015-11-01 23:36:02.000000",
        0.010009765625,
        984
      ],
      [
        "2015-11-01 23:37:20.000000",
        "2015-11-13 09:19:31.000000",
        0.02001953125,
        984
      ],
      ...
      [
        "2018-11-30 07:13:54.000000",
        "2018-12-11 18:43:43.000000",
        0.9909769694010417,
        984
      ],
      [
        "2018-12-11 19:04:44.000000",
        "2018-12-21 23:12:28.000000",
        1,
        886
      ]
    ]
  }

# 512 buckets with DOUBLE values
  - 21793 bytes
  - A sample from the histogram:
  {
    "last-updated": "2015-06-08 12:09:00.000000",
    "histogram-type": "equi-height",
    "buckets":
    [
      [
        4.296909448967231e303,
        3.0266566256216324e305,
        0.001983642578125,
        65
      ],
      [
        3.0912283005204318e305,
        6.8428144850887495e305,
        0.00396728515625,
        65
      ],
      ...
      [
        1.794196092769798e308,
        1.7975437041712213e308,
        0.999786376953125,
        65
      ],
      [
        1.7975458137001152e308,
        1.797660823206325e308,
        1,
        7
      ]
    ]
  }

# 512 buckets with STRING values. The histogram only considers the 42 first
# characters, and the average string length in the table is 22 characters.
  - 36059 bytes
  - A sample from the histogram:
  {
    "last-updated": "2015-06-08 12:09:00.000000",
    "histogram-type": "equi-height",
    "buckets":
    [
      [
        "0",
        "002a38227ecc7f0d952e85ffe37832d3f58910da",
        0.001978728666831561,
        10
      ],
      [
        "002de4494914a3a210d68f47b826df8d06a",
        "00b1dbd0070ba5ed",
        0.003957457333663122,
        32
      ],
      ...
      [
        "ff1f6d",
        "ff908e8fd6f",
        0.9988251298540731,
        32
      ],
      [
        "ff9725bad9bcb7d209a52c",
        "ffe82024b2e4870ef290b2cc7aae01695",
        1,
        19
      ]
    ]
  }


Since the table is created in the "mysql" database, it will only be visible
to the root user. Doing updates directly to the table by executing SQL
statements will be possible, but not recommended.

NOTE: The following solution suggests to add the new table to the mysql.*
database. However, it is expected that the new Data Dictionary will provide a
new API that allows us to define system tables that aren't visible to the
user. It will also most likely give us the ability to refer to a unique colum
ID instead of the actual column name. This will make the job of maintaining
the referential integrity to the original column much easier (that is, handling
DDL statements which alters the column name and such).

Defining the system table with the new DD API will hide the table from all
users, including root. If we want to make the histogram data visible to the
user, we will instead create a view over the system table with the
correct/necessary permissions.

This DD API is not expected to be available before early/mid 2016. When this is
available, we will consider to reimplement this worklog to use these new
features.


REPLICATION OF HISTOGRAMS
=========================
Manual changes to the new system table by running SQL statements will be
replicated, to keep it in line with the normal behavior of MySQL. Note that
manual changes is not recommended though.

The histograms are populated with data by running
"ANALYZE TABLE [table] UPDATE HISTOGRAMS". By default, "ANALYZE TABLE" is
replicated. Running this command on a master, will thus populate histograms
on the slaves as well. If specifying "ANALYZE LOCAL TABLE", the statement is not
replicated.

In order to avoid "double work" when running ANALYZE TABLE, binlogging will be
temporarily disabled while inserting/updating mysql.column_stats. This will
avoid that both ANALYZE TABLE AND the statements from inserting/updating
mysql.column_stats will be replicated. Only ANALYZE TABLE will be replicated.


GENERATED COLUMNS
=================
Histograms in MySQL will support both "normal" columns and generated columns
(both virtual and stored). The storage for normal and generated will be similar,
and this worklog will thus not take this into account.

Note: All the J_* data types mentioned below are taken from enum_json_type in
      json_dom.h.

System table definition
-----------------------
CREATE TABLE IF NOT EXISTS column_stats (
  database_name VARCHAR(64) NOT NULL,
  table_name VARCHAR(64) NOT NULL,
  column_name VARCHAR(64) NOT NULL,
  histogram JSON NOT NULL,
  PRIMARY KEY (database_name, table_name, column_name)
) ENGINE=InnoDB CHARACTER SET=utf8 COLLATE=utf8_bin
COMMENT="Column statistics";

This definition will be added to "scripts/mysql_system_tables.sql". This will
take care of creating the table on both initialize and upgrade.


Equi-height JSON definition
---------------------------
{
  // Last time the histogram was updated. As of now, this means "when the
  // histogram was created" (incremental updates are not supported). Date/time
  // is given in UTC.
  // -- J_DATETIME
  "last-updated": "2015-11-04 15:19:51.000000",

  // Histogram type. Always "equi-height" for equi-height histograms.
  // -- J_STRING
  "histogram-type": "equi-height",

  // Histogram buckets. This will always be at least one bucket.
  // -- J_ARRAY
  "buckets":
  [
    [
      // Lower inclusive value.
      // -- Data type depends on the source column.
      "0",

      // Upper inclusive value.
      // -- Data type depends on the source column.
      "002a38227ecc7f0d952e85ffe37832d3f58910da",

      // Cumulative frequence
      // -- J_DOUBLE
      0.001978728666831561,

      // Number of distinct values in this bucket.
      // -- J_UINT
      10
    ]
  ]
}


Singleton JSON definition
-------------------------
{
  // Last time the histogram was updated. As of now, this means "when the
  // histogram was created" (incremental updates are not supported). Date/time
  // is given in UTC.
  // -- J_DATETIME
  "last-updated": "2015-11-04 15:19:51.000000",

  // Histogram type. Always "singleton" for singleton histograms.
  // -- J_STRING
  "histogram-type": "singleton",

  // Histogram buckets. This will always be at least one bucket.
  // -- J_ARRAY
  "buckets":
  [
    [
      // Value value.
      // -- Data type depends on the source column.
      42,

      // Cumulative frequence
      // -- J_DOUBLE
      0.001978728666831561,
    ]
  ]
}