WL#8707: Classes/structures for Histograms

Affects: Server-8.0   —   Status: Complete   —   Priority: Medium

This worklog gives an overview of the new classes necessary to implement
histogram statistics in MySQL. We will introduce four new classes, where most of
the classes will be templatized so that different data types can be handled
differently. For instance, we want/need to handle strings quite different from
numeric values.

This worklog will not define the final version of the classes, but only the
interfaces needed to implement ANALYZE TABLE with histogram support. This is to
keep the scope of this WL as small as possible. The classes
will be expanded further by Histogram caching.

We will implement the necessary functions for building both singleton and
equi-height histograms, as well as serializing the histograms to JSON format.
This worklog will not have any visible effect to MySQL users, as it won't expose
any new SQL functions or alter any existing behavior.

User Documentation
==================

No user-visible changes. No user documentation needed.
F-1: All histogram types must support the following data types:
     - BOOLEAN, BIT, TINYINT, SMALLINT, MEDIUMINT, INTEGER, BIGINT
     - ENUM, SET
     - FLOAT, DOUBLE, DECIMAL
     - DATE, TIME, YEAR, DATETIME, TIMESTAMP
     - CHAR, VARCHAR, TINYTEXT, TEXT, MEDIUMTEXT, LONGTEXT
     - BINARY, VARBINARY, TINYBLOB, BLOB, MEDIUMBLOB, LONGBLOB

F-2: If the number of buckets specified is greater than or equal to the number
     of distinct values in the input data set, the function that auto-selects
     the most appropriate histogram type (Histogram::build_histogram) MUST
     create a singleton histogram.

F-3: If the number of buckets specified is less than the number of distinct
     values in the input data set, the function that auto-selects the most
     appropriate histogram type (Histogram::build_histogram) MUST create an
     equi-height histogram.

F-4: A histogram over strings MUST be sorted correctly according to the
     character set and collation.

F-5: For the data types CHAR, VARCHAR, TINYTEXT, TEXT, MEDIUMTEXT and LONGTEXT,
     a maximum of 42 characters should be considered. If the first 42 characters
     of two values are equal, the two values MUST be considered equal regardless
     of the contents beyond character number 42. For data types without any
     character set (BLOB, BINARY and similar), a maximum of 42 BYTES should be
     considered.

F-6: All histogram types MUST be serializable into a JSON format.
Terms and "gotchas"
===================
value map
---------
When we want to create a histogram, we must pass a "value map" to the histogram
constructor function. A value map is a map containing a value as the key, and
the frequency as the map value. Imagine the following data set:

  1, 1, 1, 1, 2, 2, 3, 5, 7, 7, 7

The "value map" for this data set would be the following map:
  [value, frequency]
  ------------------
  [    1, 4        ]
  [    2, 2        ]
  [    3, 1        ]
  [    5, 1        ]
  [    7, 3        ]

The main reasons for using a value map instead of the complete data set, is:

  1) In most cases, a value map will use less memory than the complete data set.
  2) It is much easier to decide which histogram type to build (number of
     disinct values is known).
  3) Building the histogram will be quicker.


histogram comparator
--------------------
For the value map (and other collections needed) to be sorted correctly, we will
provide a custom compare class. This allows us to sort different data types
correctly. For instance, we can then sort strings based on the character set/
collation instead of directly comparing the char values. And a few of the types
we will support does not have the "less than"-operator overloaded.


New classes
===========
We will introduce four different classes in order to represent histograms in
memory:

namespace histograms
{
  Histogram_comparator

  Histogram
  |
  +-- Equi_height<T>
  |
  |
  +-- Singleton<T>

  namespace equi_height
  {
    Bucket<T>
  }
}

Histogram_comparator<T>
-----------------------
A struct that can be used as comparator in std::map, and possibly other
structures. This struct will ensure that all the different data types are sorted
correctly.

Histogram
--------------
This class is the base class for all histogram types. It will provide the basic
interface for interacting with histograms, which mainly is fetching selectivity
estimates. This class will not be templatized.

We inherit this class from Sql_alloc, for allocating Histogram objects on a
MEM_ROOT.

Equi_height<T>
--------------
A subclass of Histogram, which represents an equi-height histogram. This class
will be templatized in order to handle different data types correctly.

Bucket<T>
----------------------
Bucket represents a single equi-height bucket, containing the
properties:
  - lower inclusive value
  - upper inclusive value
  - cumulative frequency
  - number of distinct values

It is placed in the namespace histograms::equi_height.

Singleton<T>
------------
A subclass of Histogram, which represents a singleton histogram. This class will
be templatized in order to handle different data types correctly. We do not
provide a separate class for representing singleton buckets, due to the fact
that we can simply use a std::map<T, double> for this purpose (no need to
re-invent the wheel).


Memory allocation
=================
All histogram constructors, as well as the function Histogram::build_histogram,
takes a MEM_ROOT as one of its argument. This means that a histogram and its
content will be allocated on the supplied MEM_ROOT, and thus have the same
life time as the supplied MEM_ROOT. The main reasoning for this is that in a
future worklog, we will attach histograms to a TABLE_SHARE. We can then use the
same MEM_ROOT as the TABLE_SHARE uses, to ensure that the histograms survives as
long as they should as well as properly freeing the memory used.

One limitation with MEM_ROOT is that memory cannot be reclaimed. This is however
not an issue in the case of histograms, since histogram contents will not
change after it is built/created.


Templates and supported types
=============================
As mentioned in F-1, the following data types must be supported:

- BOOLEAN, BIT, TINYINT, SMALLINT, MEDIUMINT, INTEGER, BIGINT
- ENUM, SET
- FLOAT, DOUBLE, DECIMAL
- DATE, TIME, YEAR, DATETIME, TIMESTAMP
- CHAR, VARCHAR, TINYTEXT, TEXT, MEDIUMTEXT, LONGTEXT
- BINARY, VARBINARY, TINYBLOB, BLOB, MEDIUMBLOB, LONGBLOB

To properly support all these data types in memory, the templated histograms
will be instansiated with the types ulonglong, longlong, MYSQL_TIME, String,
double and my_decimal:

- BOOLEAN, BIT, TINYINT, SMALLINT, MEDIUMINT, INTEGER, BIGINT SIGNED, YEAR, ENUM
  and SET will be instansiated with longlong.

- BIGINT UNSIGNED will be instansiated with ulonglong.

- DATE, TIME, DATETIME and TIMESTAMP will be instansiated with MYSQL_TIME.

- CHAR, VARCHAR, TINYTEXT, TEXT, MEDIUMTEXT and LONGTEXT will be instansiated
  with String.

- BINARY, VARBINARY, TINYBLOB, BLOB, MEDIUMBLOB and LONGBLOB will be
  instansiated with String.

- FLOAT and DOUBLE will be instansiated with double.

- DECIMAL will be instansiated with my_decimal.
sql/histograms/histogram.h
===============================
template<typename T>
using value_map_allocator = Memroot_allocator<std::pair<T, ha_rows> >;

template<typename T>
using value_map_type = std::map<T, ha_rows, Histogram_comparator,
                                value_map_allocator<T> >;

class Histogram : public Sql_alloc
{
public:
 /// All supported histogram types in MySQL.
  enum class enum_histogram_type
  {
    EQUI_HEIGHT,
    SINGLETON
  };

  /// The different fields in mysql.column_stats.
  enum enum_fields
  {
    FIELD_DB_NAME,
    FIELD_TBL_NAME,
    FIELD_COL_NAME,
    FIELD_HISTOGRAM
  };
  
  /**
    Histogram constructor.
  */
  Histogram(MEM_ROOT *mem_root,std::string db_name, std::string tbl_name,
            std::string col_name, enum_histogram_type type);

  /**
    @return Name of the database this histogram represents.
  */
  std::string get_database_name() const;

  /**
    @return Name of the table this histogram represents.
  */
  std::string get_table_name() const;

  /**
    @return Name of the column this histogram represents.
  */
  std::string get_column_name() const;

  /**
    @return Type of this histogram.
  */
  enum_histogram_type get_histogram_type() const;

  /**
    @return the fraction of NULL values, in the range [0.0, 1.0]
  */
  double get_null_values_fraction() const;

  /**
    Returns the histogram type as a readable string.

    @return    A readable string representation of the histogram type.
  */
  virtual std::string histogram_type_to_str() const = 0;

  /**
    @return Number of buckets in this histogram.
  */
  virtual size_t get_num_buckets() const = 0;

  /**
    Converts the histogram to a JSON object.

    @param[in,out] json_object output where the histogram is to be stored. The
                   caller is responsible for allocating/deallocating the JSON
                   object

    @return     true on error, false otherwise
  */
  virtual bool histogram_to_json(Json_object *json_object) const;
};

/**
  Create a histogram from a value map.

  This function will build a histogram from a value map. The histogram type
  depends on both the size of the input data, as well as the number of buckets
  specified. If the number of distinct values is less than or equal to the
  number of buckets, a Singleton histogram will be created. Otherwise, an
  equi-height histogram will be created.

  The histogram will be allocated on the supplied mem_root, and it is the
  callers responsibility to properly clean up when the histogram isn't needed
  anymore.

  @param[in] mem_root        The MEM_ROOT where the histogram contents will
                             be allocated.
  @param[in] value_map       A value map containing [value, frequency].
  @param[in] num_null_values The number of NULL values in the data set.
  @param[in] num_buckets     The maximum number of buckets to create.
  @param[in] db_name         Name of the database this histogram represents.
  @param[in] tbl_name        Name of the table this histogram represents.
  @param[in] col_name        Name of the column this histogram represents.

  @return    A histogram, using at most "num_buckets" buckets. The histogram
             type depends on the size of the input data, and the number of
             buckets.
*/
template <class T>
Histogram *build_histogram(MEM_ROOT *mem_root,
                           const value_map_type<T> &value_map,
                           ha_rows num_null_values, size_t num_buckets,
                           std::string db_name, std::string tbl_name,
                           std::string col_name)

sql/histograms/equi_height.h
========================================

template <class T>
class Equi_height<T> : public Histogram
{
public:
  /**
    Equi-height constructor.

    This will not build the histogram, but only set its properties.

    @param[in] mem_root The mem_root where the histogram contents will be
                        allocated.
    @param[in] db_name  Name of the database this histogram represents.
    @param[in] tbl_name Name of the table this histogram represents.
    @param[in] col_name Name of the column this histogram represents.
  */
  Equi_height(MEM_ROOT *mem_root, std::string db_name, std::string tbl_name,
              std::string col_name);

  /**
    Build the histogram.

    This function will build a new histogram from a "value map". The function
    will create at most num_buckets buckets, but may use less.

    @param  value_map       a value map, where the map key is a value and the
                            map value is the absolute frequency for that value
    @param  num_null_values the number of NULL values in the data set
    @param  num_buckets     maximum number of buckets to create

    @return true on error, false otherwise
  */
  bool build_histogram(const value_map_type<T> &value_map,
                       ha_rows num_null_values, ha_rows num_buckets);

  /**
    @return Number of values/buckets in this histogram.
  */
  size_t get_num_buckets() const override;

  /**
    Convert this histogram to a JSON object.

    This function will take the contents of the current histogram and put
    it in the output parameter "json_object".

    @param[in,out] json_object output where the histogram is to be stored The
                   caller is responsible for allocating/deallocating the JSON
                   object

    @return        true on error, false otherwise
  */
  bool histogram_to_json(Json_object *json_object) const override;

  /**
    Returns the histogram type as a readable string.

    @return a readable string representation of the histogram type
  */
  std::string histogram_type_to_str() const override;
}


sql/histograms/equi_height_bucket.h
====================================
template <class T>
class Bucket
{
public:
  /**
    Equi-height bucket constructor.

    Does nothing more than setting the member variables.

    @param lower         lower inclusive value
    @param upper         upper inclusive value
    @param freq          the cumulative frequency
    @param num_distinct  number of distinct/unique values in this bucket
  */
  Bucket(T lower, T upper, double freq, ha_rows num_distinct);

  /**
    @return lower inclusive value
  */
  const T& get_lower_inclusive() const;

  /**
    @return upper inclusive value
  */
  const T& get_upper_inclusive() const;

  /**
    @return cumulative frequency
  */
  double get_cumulative_frequency() const;

  /**
    @return number of distinct values
  */
  ha_rows get_num_distinct() const;

  /**
    Convert this equi-height bucket to a JSON array.

    This function will take the contents of the current equi-height bucket
    and put it in the output parameter "json_array". The result is an array
    with the following contents:
      Index 0: Lower inclusive value.
      Index 1: Upper inclusive value.
      Index 2: Cumulative frequency.
      Index 3: Number of distinct values.

    @param[out] json_array output where the bucket content is to be stored

    @return     true on error, false otherwise
  */
  bool bucket_to_json(Json_array *json_array) const;
};

sql/histograms/singleton.h
====================================
template <class T>
class Singleton : public Histogram
{
public:
  /**
    Singleton constructor.

    This will not build the histogram, but only set its properties.

    @param[in] mem_root The mem_root where the histogram contents will be
                        allocated.
    @param[in] db_name  Name of the database this histogram represents.
    @param[in] tbl_name Name of the table this histogram represents.
    @param[in] col_name Name of the column this histogram represents.
  */
  Singleton(MEM_ROOT *mem_root, std::string db_name, std::string tbl_name,
            std::string col_name);

  /**
    Build the Singleton histogram.

    @param   value_map       values to create the histogram for
    @param   num_null_values the number of NULL values in the data set

    @return  true on error, false otherwise
  */
  bool build_histogram(const value_map_type<T> &value_map,
                       ha_rows num_null_values);

  /**
    Convert this histogram to a JSON object.

    This function will take the contents of the current histogram and put
    it in the output parameter "json_object".

    @param[in,out] json_object output where the histogram is to be stored. The
                   caller is responsible for allocating/deallocating the JSON
                   object

    @return        true on error, false otherwise
  */
  bool histogram_to_json(Json_object *json_object) const override;

  /**
    @return Number of values/buckets in this histogram.
  */
  size_t get_num_buckets() const override;

  /**
    Returns the histogram type as a readable string.

    @return a readable string representation of the histogram type
  */
  std::string histogram_type_to_str() const override;
};