MySQL Blog Archive
Histogram statistics in MySQL

As of MySQL 8.0.3, you have the ability to create histogram statistics, giving the optimizer more knowledge about your data. In this blog post, we will have a look at how you can create histogram statistics, and we will explain when it might be useful to have them.

What is a histogram

The query optimizer is the part of the database that is responsible for translating a SQL query into the most efficient execution plan possible. Sometimes, the query optimizer fails to find the most efficient plan and ends up spending a lot more time executing the query than necessary. The main reason for this is often that the optimizer doesn’t have enough knowledge about the data it is about to query:

  • How many rows are there in each table?
  • How many distinct values are there in each column?
  • How is the data distributed in each column?

One simple example where the query optimizer lacks important information is a table containing the time of day people go to bed:
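The original example table is not shown here, but a minimal sketch could look like this (the table and column names are hypothetical, chosen for illustration):

```sql
-- Hypothetical table: one row per person, with the time they go to bed
CREATE TABLE bedtimes (
  person_id   INT PRIMARY KEY,
  time_of_day TIME
);
```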

For the column “time_of_day”, most values will probably be around 11:00 p.m. since that is when most people go to bed. So the first query below will likely return a lot more rows than the second query:
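The two queries might look like this (a sketch, assuming a hypothetical table “bedtimes” with a “time_of_day” column):

```sql
-- Likely matches many rows: most people go to bed around 11:00 p.m.
SELECT * FROM bedtimes
WHERE time_of_day BETWEEN '22:30:00' AND '23:30:00';

-- Likely matches few rows: few people go to bed in the afternoon
SELECT * FROM bedtimes
WHERE time_of_day BETWEEN '14:30:00' AND '15:30:00';
```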

Without any statistics available, the optimizer will assume that the values in “time_of_day” are evenly distributed (i.e. that a person is equally likely to go to bed around 3:00 p.m. as around 11:00 p.m.). How can you make the query optimizer aware of this skewness in the data? One solution is to create histogram statistics for that column.

A histogram is an approximation of the data distribution for a column. It can tell you with reasonable accuracy whether your data is skewed or not, which in turn helps the database server understand the nature of the data it contains. Histograms come in many different flavours, and in MySQL we have chosen to support two types: the “singleton” histogram and the “equi-height” histogram. Common to all histogram types is that they split the data set into a set of “buckets”. MySQL automatically divides the values into buckets, and will also automatically decide which type of histogram to create.

How to create and remove histogram statistics

In order to manage histogram statistics, we have extended the ANALYZE TABLE statement with two new clauses: UPDATE HISTOGRAM and DROP HISTOGRAM.

The first syntax will allow you to create histogram statistics for one or more columns at the same time:
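The syntax follows this pattern:

```sql
ANALYZE TABLE tbl_name
UPDATE HISTOGRAM ON col_name [, col_name] ...
WITH N BUCKETS;
```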

Note that the number of buckets must be specified, and can be in the range from 1 to 1024. How many buckets you should choose for your data set depends on several factors; how many distinct values do you have, how skewed is your data set, how high accuracy do you need etc. However, after a certain amount of buckets the increased accuracy is rather low. So we suggest to start at a lower number such as 32, and increase it if you see that it doesn’t fit your needs.
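For instance (a sketch using the “payment” table and “amount” column mentioned later in this post):

```sql
-- Create a histogram for "amount" with 32 buckets
ANALYZE TABLE payment UPDATE HISTOGRAM ON amount WITH 32 BUCKETS;

-- Running UPDATE HISTOGRAM again replaces the existing histogram
ANALYZE TABLE payment UPDATE HISTOGRAM ON amount WITH 64 BUCKETS;
```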

In the example above, we can see that we have built a histogram for the column “amount” twice. In the first query, a new histogram is created. In the second query, the histogram for “amount” is automatically overwritten.

If you want to remove any histogram statistics that you have created, you simply use the DROP HISTOGRAM syntax for that:
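For example (the table and column names are illustrative):

```sql
ANALYZE TABLE payment
DROP HISTOGRAM ON amount;
```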

As with UPDATE HISTOGRAM, you can specify multiple columns in the same command. One feature worth noting is that the ANALYZE TABLE command will try to do as much work as possible, even if something goes wrong in the middle of the command. Let us say that you specify three columns, but the second one doesn’t exist. The server will still create and store histograms for the first and third columns:
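A sketch of such a statement (the column names are illustrative; “no_such_column” is deliberately missing):

```sql
-- "no_such_column" does not exist; histograms are still created and
-- stored for "amount" and "payment_date"
ANALYZE TABLE payment
UPDATE HISTOGRAM ON amount, no_such_column, payment_date WITH 32 BUCKETS;
```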

What happens under the hood?

If you have read the MySQL manual, you might have seen the new system variable histogram_generation_max_mem_size. This variable controls approximately how much memory, in bytes, the server is allowed to use while generating histogram statistics. So why would you want to control that?

When you specify that you want to build a histogram, the server will read all the data into memory and do all the work in memory (including sorting). If you then want to generate a histogram over a very big table, you might risk reading hundreds of megabytes of data into memory, which might not be desirable. So in order to deal with this problem, MySQL will calculate how many rows of data it can fit into memory given the amount of memory specified by the system variable histogram_generation_max_mem_size. If it realizes that it can only fit a subset of the rows within the given memory limit, it will resort to sampling. This can be observed by looking at the property “sampling-rate“:
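One way to inspect the property is through the information_schema.COLUMN_STATISTICS table, where the histogram is exposed as JSON (a sketch; the TPC-DS table “customer” discussed below is assumed):

```sql
-- "sampling-rate" is one of the properties of the histogram JSON object
SELECT HISTOGRAM->>'$."sampling-rate"' AS sampling_rate
FROM information_schema.column_statistics
WHERE table_name = 'customer'
  AND column_name = 'c_birth_country';
```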

Here we can see that the optimizer created a histogram by reading approximately 4.8% of the data in the column “c_birth_country“. It is worth noting that the sampling is non-deterministic, so that two subsequent calls to “ANALYZE TABLE tbl UPDATE HISTOGRAM …” on the same data set might give you two different histograms if sampling is used.

Query examples

So, what can you expect from using histogram statistics? Let us take a look at a few queries from the TPC-DS benchmark suite where adding a single histogram can make a big difference in query execution time. Below we use TPC-DS with a scale factor of 1, which means that the database is around 1GB in size. The machine is an Intel Core i7-4770 running Debian Stretch with MySQL 8.0 RC1. The configuration is fairly standard, except that innodb_buffer_pool_size is increased to 2G so that the entire database fits in the buffer pool.

In order for the optimizer to actually use the statistics provided by histograms, you only have to ensure that the optimizer switch “condition_fanout_filter” is turned on. Note that this is turned on by default.
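You can check and set the switch like this:

```sql
-- Verify that the switch is enabled (1 means on)
SELECT @@optimizer_switch LIKE '%condition_fanout_filter=on%' AS fanout_filter_on;

-- Enable it explicitly if it has been turned off
SET optimizer_switch = 'condition_fanout_filter=on';
```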

Query 90

This query is described by the Benchmark Suite as “What is the ratio between the number of items sold over the internet in the morning to the number of items sold in the evening of customers with a specified number of dependents. Consider only websites with a high amount of content.”:

As we can see, the query takes around 1.5 seconds to execute. This doesn’t seem like much, but by adding a histogram on a single column we can make this query run three times faster (the query is truncated for readability):
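The histogram in question can be created like this (the bucket count of 16 is an arbitrary choice, not taken from the original post):

```sql
ANALYZE TABLE web_page
UPDATE HISTOGRAM ON wp_char_count WITH 16 BUCKETS;
```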

With this single histogram, the query now takes around 0.5 seconds. Why is that? The main reason can be found by looking at the predicate “web_page.wp_char_count BETWEEN 5000 AND 5200“. Without any statistics available, the optimizer assumes that 11.11% of the rows in the table “web_page” matches the given predicate. However, this is wrong. By inspecting the table, we can see that only 1.6% matches this predicate (one out of 60 rows):
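A query along these lines shows the actual selectivity:

```sql
-- At scale factor 1: 60 rows in total, of which only 1 matches (~1.6%)
SELECT COUNT(*) AS total_rows,
       SUM(wp_char_count BETWEEN 5000 AND 5200) AS matching_rows
FROM web_page;
```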

With histogram statistics available, the optimizer now knows this, pushes the table earlier in the join order, and thus produces an execution plan that executes three times faster.

Query 61

This query is described as “Find the ratio of items sold with and without promotions in a given month and year. Only items in certain categories sold to customers living in a specific time zone are considered.” It is a big, complex query with multiple joins involved:

As you can see from the output, the query takes around 2.8 seconds to execute. However, what the query optimizer doesn’t realize is that there is only one distinct value in the column “s_gmt_offset“. Without any statistics available, the optimizer uses some hard-coded guesstimates, and thus assumes that 10% of the rows will match the predicate “s_gmt_offset = -5“. If we add a histogram for this column, the optimizer is now aware that all the rows in the table will satisfy the condition, and thus gives us a better execution plan (the query is truncated for better readability):
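The histogram can be created like this (the bucket count is an arbitrary choice; with a single distinct value in the column, even one bucket would suffice):

```sql
ANALYZE TABLE store
UPDATE HISTOGRAM ON s_gmt_offset WITH 8 BUCKETS;
```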

With this histogram, the query execution time goes down to less than 1.4 seconds, which is a 2X improvement. The reason is that in the first plan, the optimizer chooses for the first derived table to do a full table scan on the table store, and then do primary key lookups in item, store_sales, date_dim, customer and finally customer_address. But when it realizes that the table store will return more rows than it guessed without the histogram statistics available, the optimizer chooses to do a full table scan on the table item instead, with primary key lookups in store_sales, store, date_dim, customer and finally customer_address.

But, why not an index?

Several of you might now think that an index would do the job equally well, which is true:
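For instance, an index on the column from Query 90 would give the optimizer equally good range estimates:

```sql
-- An index on wp_char_count lets the optimizer estimate the range
-- predicate accurately, at the cost of index maintenance
CREATE INDEX idx_wp_char_count ON web_page (wp_char_count);
```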

However, there are two reasons for why you might consider a histogram instead of an index:

  1. Maintaining an index has a cost. If you have an index, every INSERT/UPDATE/DELETE causes the index to be updated. This is not free, and will have an impact on your performance. A histogram on the other hand is created once and never updated unless you explicitly ask for it. It will thus not hurt your INSERT/UPDATE/DELETE performance.
  2. If you have an index, the optimizer will do what we call “index dives” to estimate the number of records in a given range. This also has a certain cost, and it might become too costly if you have for instance very long IN-lists in your query. Histogram statistics are much cheaper in this case, and might thus be more suitable.

Inspecting histogram statistics

Histogram statistics are stored in the data dictionary as JSON objects, which makes them both flexible and readable. You can for instance use the built-in JSON functions to extract information from the histograms. Let us say that you want to know when your histogram was created/updated for the column “amount” in the table “payment”. You can easily use the JSON unquoting extraction operator to find this information:
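A query along these lines does the job (the histogram JSON stores the timestamp under the “last-updated” key):

```sql
SELECT HISTOGRAM->>'$."last-updated"' AS last_updated
FROM information_schema.column_statistics
WHERE table_name = 'payment'
  AND column_name = 'amount';
```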

Or let us say that you want to find out how many buckets there are in the histogram compared to how many buckets you specified in your ANALYZE TABLE statement:
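This can be done with the built-in JSON functions (the number of buckets actually created can be lower than the number requested, for example when the column has few distinct values):

```sql
SELECT JSON_LENGTH(HISTOGRAM->'$.buckets') AS buckets_created,
       HISTOGRAM->>'$."number-of-buckets-specified"' AS buckets_specified
FROM information_schema.column_statistics
WHERE table_name = 'payment'
  AND column_name = 'amount';
```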

We refer to the manual for more information about what kind of information you can extract from the histogram.

Optimizer trace

If you want to know the estimations that the histogram has done, the simplest way is to look at the EXPLAIN output:
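For example (a sketch; “customer” and “c_birth_day” come from the TPC-DS schema used above):

```sql
-- The "filtered" column of the EXPLAIN output shows the estimated
-- percentage of rows that satisfy the conditions on the table
EXPLAIN SELECT * FROM customer WHERE c_birth_day <= 20;
```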

If you look at the column “filtered”, you will see that it changes from the default 11.11% to a much more accurate 32.12%. However, if you have multiple conditions where some of the columns have histogram statistics and others don’t, it will be difficult to know what the optimizer has estimated:

If you want to know the estimations done by the histograms in more detail, you can have a look at the trace for the query:
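A sketch of how to capture and filter the trace (the JSON path used here is an assumption about the trace layout, which may differ between versions):

```sql
SET optimizer_trace = 'enabled=on';

-- Run the query you want to inspect
SELECT COUNT(*) FROM customer WHERE c_birth_day <= 20;

-- Extract the selectivity estimates from the trace; the path
-- '$**.filtering_effect' is an assumption about the trace structure
SELECT JSON_EXTRACT(trace, '$**.filtering_effect')
FROM information_schema.optimizer_trace;

SET optimizer_trace = 'enabled=off';
```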

Here we are using the JSON_EXTRACT function to extract the relevant part of the trace output. For each condition where a histogram is used, we can see the estimated selectivity. In this case, the selectivity was only estimated for one of the conditions (c_birth_day <= 20), and the histogram estimated that 63.76% of the rows in the column would match this condition. And in fact, this matches the actual data in the column:
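You can verify this with a simple query (a sketch against the TPC-DS “customer” table):

```sql
-- Fraction of non-NULL c_birth_day values that match the condition;
-- this should come out close to the 63.76% estimated by the histogram
SELECT AVG(c_birth_day <= 20) AS fraction_matching
FROM customer;
```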