As of MySQL 8.0.3, you can create histogram statistics to give the optimizer more information about the data in your tables. In this blog post, we will look at how to create histogram statistics, and we will explain when they might be useful.
What is a histogram?
The query optimizer is the part of the database that is responsible for translating a SQL query into the most efficient execution plan possible. Sometimes, the query optimizer fails to find the most efficient plan and ends up spending a lot more time executing the query than necessary. The main reason for this is often that the optimizer doesn’t have enough knowledge about the data it is about to query:
- How many rows are there in each table?
- How many distinct values are there in each column?
- How is the data distributed in each column?
One simple example where the query optimizer lacks important information is a table containing the time of day at which people go to bed:
CREATE TABLE bedtime (
  person_id INT,
  time_of_day TIME
);
For the column “time_of_day”, most values will probably be around 11:00 p.m., since that is when most people go to bed. So the first query below will likely return a lot more rows than the second query:
1) SELECT * FROM bedtime WHERE time_of_day BETWEEN "22:00:00" AND "23:59:00"
2) SELECT * FROM bedtime WHERE time_of_day BETWEEN "12:00:00" AND "14:00:00"
Without any statistics available, the optimizer will assume that the values in “time_of_day” are evenly distributed (i.e. that it is equally likely that a person goes to bed around 3:00 p.m. as around 11:00 p.m.). How can you make the query optimizer aware of this skewness in the data? One solution is to create histogram statistics for that column.
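For the bedtime example, that could look like the following sketch (the ANALYZE TABLE syntax is explained in detail in the next section; the choice of 16 buckets is just an illustrative assumption):

-- Build a histogram over the skewed column so the optimizer learns its distribution.
-- The bucket count of 16 is an arbitrary example value.
ANALYZE TABLE bedtime UPDATE HISTOGRAM ON time_of_day WITH 16 BUCKETS;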
A histogram is an approximation of the data distribution for a column. It can tell you with reasonable accuracy whether your data is skewed or not, which in turn helps the database server understand the nature of the data it contains. Histograms come in many different flavours, and in MySQL we have chosen to support two different types: the “singleton” histogram and the “equi-height” histogram. Common to all histogram types is that they split the data set into a set of “buckets”. MySQL automatically divides the values into buckets, and also automatically decides which type of histogram to create.
How to create and remove histogram statistics
In order to manage histogram statistics, we have extended ANALYZE TABLE with two new clauses:
ANALYZE TABLE tbl_name UPDATE HISTOGRAM ON col_name [, col_name] WITH N BUCKETS;
ANALYZE TABLE tbl_name DROP HISTOGRAM ON col_name [, col_name];
The first form allows you to create histogram statistics for one or more columns at the same time:
mysql> ANALYZE TABLE payment UPDATE HISTOGRAM ON amount WITH 32 BUCKETS;
+----------------+-----------+----------+---------------------------------------------------+
| Table          | Op        | Msg_type | Msg_text                                          |
+----------------+-----------+----------+---------------------------------------------------+
| sakila.payment | histogram | status   | Histogram statistics created for column 'amount'. |
+----------------+-----------+----------+---------------------------------------------------+
1 row in set (0.27 sec)

mysql> ANALYZE TABLE payment UPDATE HISTOGRAM ON amount, payment_date WITH 32 BUCKETS;
+----------------+-----------+----------+----------------------------------------------------------+
| Table          | Op        | Msg_type | Msg_text                                                 |
+----------------+-----------+----------+----------------------------------------------------------+
| sakila.payment | histogram | status   | Histogram statistics created for column 'amount'.        |
| sakila.payment | histogram | status   | Histogram statistics created for column 'payment_date'.  |
+----------------+-----------+----------+----------------------------------------------------------+
Note that the number of buckets must be specified, and can be anywhere from 1 to 1024. How many buckets you should choose for your data set depends on several factors: how many distinct values you have, how skewed your data set is, how high an accuracy you need, and so on. However, beyond a certain number of buckets the gain in accuracy is rather small, so we suggest starting at a lower number such as 32 and increasing it if you find that it doesn’t fit your needs.
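One way to get a feel for a suitable bucket count is to check how many distinct values the column holds; a minimal sketch, using the sakila.payment table from the example above:

-- If the number of distinct values is small, a correspondingly small bucket
-- count is usually enough to describe the distribution accurately.
SELECT COUNT(DISTINCT amount) AS distinct_amounts
FROM sakila.payment;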
In the example above, we can see that we have built a histogram for the column “amount” twice. In the first query, a new histogram is created. In the second query, the histogram for “amount” is automatically overwritten.
If you want to remove any histogram statistics that you have created, you simply use the DROP HISTOGRAM syntax for that:
mysql> ANALYZE TABLE payment DROP HISTOGRAM ON payment_date;
+----------------+-----------+----------+----------------------------------------------------------+
| Table          | Op        | Msg_type | Msg_text                                                 |
+----------------+-----------+----------+----------------------------------------------------------+
| sakila.payment | histogram | status   | Histogram statistics removed for column 'payment_date'.  |
+----------------+-----------+----------+----------------------------------------------------------+
As with UPDATE HISTOGRAM, you can specify multiple columns in the same command. One feature worth noting is that the ANALYZE TABLE command will try to do as much work as possible, even if something goes wrong in the middle of the command. Let us say that you specify three columns, but the second one doesn’t exist. The server will still create and store histograms for the first and third columns:
mysql> ANALYZE TABLE customer UPDATE HISTOGRAM ON c_birth_day, c_foobar, c_birth_month WITH 32 BUCKETS;
+----------------+-----------+----------+----------------------------------------------------------+
| Table          | Op        | Msg_type | Msg_text                                                 |
+----------------+-----------+----------+----------------------------------------------------------+
| tpcds.customer | histogram | status   | Histogram statistics created for column 'c_birth_day'.   |
| tpcds.customer | histogram | status   | Histogram statistics created for column 'c_birth_month'. |
| tpcds.customer | histogram | Error    | The column 'c_foobar' does not exist.                    |
+----------------+-----------+----------+----------------------------------------------------------+
3 rows in set (0.15 sec)
What happens under the hood?
If you have read the MySQL manual, you might have seen the new system variable histogram_generation_max_mem_size. This variable controls approximately how much memory, in bytes, the server is allowed to use while generating histogram statistics. So why would you want to control that?
When you specify that you want to build a histogram, the server will read all the data into memory and do all the work in memory (including sorting). If you then want to generate a histogram over a very big table, you might risk reading hundreds of megabytes of data into memory, which might not be desirable. So in order to deal with this problem, MySQL will calculate how many rows of data it can fit into memory given the amount of memory specified by the system variable histogram_generation_max_mem_size. If it realizes that it can only fit a subset of the rows within the given memory limit, it will resort to sampling. This can be observed by looking at the property “sampling-rate“:
mysql> SET histogram_generation_max_mem_size = 1000000;
Query OK, 0 rows affected (0.00 sec)

mysql> ANALYZE TABLE customer UPDATE HISTOGRAM ON c_birth_country WITH 16 BUCKETS;
+----------------+-----------+----------+------------------------------------------------------------+
| Table          | Op        | Msg_type | Msg_text                                                   |
+----------------+-----------+----------+------------------------------------------------------------+
| tpcds.customer | histogram | status   | Histogram statistics created for column 'c_birth_country'. |
+----------------+-----------+----------+------------------------------------------------------------+
1 row in set (0.22 sec)

mysql> SELECT histogram->>'$."sampling-rate"'
    -> FROM information_schema.column_statistics
    -> WHERE table_name = "customer"
    -> AND column_name = "c_birth_country";
+---------------------------------+
| histogram->>'$."sampling-rate"' |
+---------------------------------+
| 0.048743243211626014            |
+---------------------------------+
1 row in set (0.00 sec)
Here we can see that the server created a histogram by reading approximately 4.8% of the data in the column “c_birth_country“. It is worth noting that the sampling is non-deterministic, so two subsequent calls to “ANALYZE TABLE tbl UPDATE HISTOGRAM …” on the same data set might give you two different histograms if sampling is used.
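If you want to know what the current memory limit is on your server before building histograms over large tables, you can inspect the variable directly; a minimal sketch:

-- Show the current limit (in bytes) for histogram generation in this session.
SELECT @@histogram_generation_max_mem_size;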
Query examples
So, what can you expect from using histogram statistics? Let us take a look at a few queries from the TPC-DS Benchmark Suite where adding a single histogram can make a big difference in query execution time. Below we will be using TPC-DS with a scale factor of 1, which means that the database is around 1 GB in size. The machine is an Intel Core i7-4770 running Debian Stretch with MySQL 8.0 RC1. The configuration is fairly standard, except that innodb_buffer_pool_size is increased to 2G so that we can fit the entire database in the buffer pool.
In order for the optimizer to actually use the statistics provided by histograms, you only have to ensure that the optimizer switch “condition_fanout_filter” is turned on (it is turned on by default).
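If you want to verify this on your own server, or re-enable the switch after it has been turned off, a minimal sketch:

-- Check whether condition_fanout_filter is enabled in the current session.
SELECT @@optimizer_switch LIKE '%condition_fanout_filter=on%' AS fanout_filter_on;

-- Re-enable it for the current session if needed.
SET SESSION optimizer_switch = 'condition_fanout_filter=on';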
Query 90
This query is described by the Benchmark Suite as “What is the ratio between the number of items sold over the internet in the morning to the number of items sold in the evening of customers with a specified number of dependents. Consider only websites with a high amount of content.”:
mysql> SELECT CAST(amc AS DECIMAL(15, 4)) / CAST(pmc AS DECIMAL(15, 4)) am_pm_ratio
    -> FROM (SELECT COUNT(*) amc
    ->       FROM web_sales,
    ->            household_demographics,
    ->            time_dim,
    ->            web_page
    ->       WHERE ws_sold_time_sk = time_dim.t_time_sk
    ->         AND ws_ship_hdemo_sk = household_demographics.hd_demo_sk
    ->         AND ws_web_page_sk = web_page.wp_web_page_sk
    ->         AND time_dim.t_hour BETWEEN 9 AND 9 + 1
    ->         AND household_demographics.hd_dep_count = 2
    ->         AND web_page.wp_char_count BETWEEN 5000 AND 5200) at,
    ->      (SELECT COUNT(*) pmc
    ->       FROM web_sales,
    ->            household_demographics,
    ->            time_dim,
    ->            web_page
    ->       WHERE ws_sold_time_sk = time_dim.t_time_sk
    ->         AND ws_ship_hdemo_sk = household_demographics.hd_demo_sk
    ->         AND ws_web_page_sk = web_page.wp_web_page_sk
    ->         AND time_dim.t_hour BETWEEN 15 AND 15 + 1
    ->         AND household_demographics.hd_dep_count = 2
    ->         AND web_page.wp_char_count BETWEEN 5000 AND 5200) pt
    -> ORDER BY am_pm_ratio
    -> LIMIT 100;
+-------------+
| am_pm_ratio |
+-------------+
|  1.27619048 |
+-------------+
1 row in set (1.48 sec)
As we can see, the query takes around 1.5 seconds to execute. This doesn’t seem like much, but by adding a histogram on a single column we can make this query run three times faster (the query is truncated below for readability):
mysql> ANALYZE TABLE web_page UPDATE HISTOGRAM ON wp_char_count WITH 8 BUCKETS;
+----------------+-----------+----------+----------------------------------------------------------+
| Table          | Op        | Msg_type | Msg_text                                                 |
+----------------+-----------+----------+----------------------------------------------------------+
| tpcds.web_page | histogram | status   | Histogram statistics created for column 'wp_char_count'. |
+----------------+-----------+----------+----------------------------------------------------------+
1 row in set (0.06 sec)

mysql> SELECT ...
+-------------+
| am_pm_ratio |
+-------------+
|  1.27619048 |
+-------------+
1 row in set (0.50 sec)
With this single histogram, the query now takes around 0.5 seconds. Why is that? The main reason can be found by looking at the predicate “web_page.wp_char_count BETWEEN 5000 AND 5200“. Without any statistics available, the optimizer assumes that 11.11% of the rows in the table “web_page” match the given predicate. However, this is wrong. By inspecting the table, we can see that only 1.67% of the rows match this predicate (one out of 60 rows):
mysql> SELECT
    ->   (SELECT COUNT(*) FROM web_page WHERE web_page.wp_char_count BETWEEN 5000 AND 5200)
    ->   /
    ->   (SELECT COUNT(*) FROM web_page) AS ratio;
+--------+
| ratio  |
+--------+
| 0.0167 |
+--------+
1 row in set (0.00 sec)
With histogram statistics available, the optimizer now knows this and pushes the table earlier in the join order, thus producing an execution plan that executes three times faster.
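If you want to see this estimate in isolation, a quick check is to run EXPLAIN on just that predicate (output omitted here); with the histogram in place, the “filtered” column for web_page should be close to the 1.67% we calculated above rather than the default 11.11%:

-- Single-table EXPLAIN to inspect the filtering estimate for the predicate.
EXPLAIN SELECT * FROM web_page WHERE wp_char_count BETWEEN 5000 AND 5200;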
Query 61
This query is described as “Find the ratio of items sold with and without promotions in a given month and year. Only items in certain categories sold to customers living in a specific time zone are considered.” It is a big, complex query with multiple joins involved:
mysql> SELECT promotions,
    ->        total,
    ->        CAST(promotions AS DECIMAL(15, 4)) / CAST(total AS DECIMAL(15, 4)) * 100
    -> FROM (SELECT SUM(ss_ext_sales_price) promotions
    ->       FROM store_sales,
    ->            store,
    ->            promotion,
    ->            date_dim,
    ->            customer,
    ->            customer_address,
    ->            item
    ->       WHERE ss_sold_date_sk = d_date_sk
    ->         AND ss_store_sk = s_store_sk
    ->         AND ss_promo_sk = p_promo_sk
    ->         AND ss_customer_sk = c_customer_sk
    ->         AND ca_address_sk = c_current_addr_sk
    ->         AND ss_item_sk = i_item_sk
    ->         AND ca_gmt_offset = -5
    ->         AND i_category = 'Home'
    ->         AND ( p_channel_dmail = 'Y'
    ->                OR p_channel_email = 'Y'
    ->                OR p_channel_tv = 'Y' )
    ->         AND s_gmt_offset = -5
    ->         AND d_year = 2000
    ->         AND d_moy = 12) promotional_sales,
    ->      (SELECT SUM(ss_ext_sales_price) total
    ->       FROM store_sales,
    ->            store,
    ->            date_dim,
    ->            customer,
    ->            customer_address,
    ->            item
    ->       WHERE ss_sold_date_sk = d_date_sk
    ->         AND ss_store_sk = s_store_sk
    ->         AND ss_customer_sk = c_customer_sk
    ->         AND ca_address_sk = c_current_addr_sk
    ->         AND ss_item_sk = i_item_sk
    ->         AND ca_gmt_offset = -5
    ->         AND i_category = 'Home'
    ->         AND s_gmt_offset = -5
    ->         AND d_year = 2000
    ->         AND d_moy = 12) all_sales
    -> ORDER BY promotions,
    ->          total
    -> LIMIT 100;
+------------+------------+--------------------------------------------------------------------------+
| promotions | total      | CAST(promotions AS DECIMAL(15, 4)) / CAST(total AS DECIMAL(15, 4)) * 100 |
+------------+------------+--------------------------------------------------------------------------+
| 3213210.07 | 5966836.78 |                                                              53.85114741 |
+------------+------------+--------------------------------------------------------------------------+
1 row in set (2.78 sec)
As you can see from the output, the query takes around 2.8 seconds to execute. However, what the query optimizer doesn’t realize is that there is only one distinct value in the column “s_gmt_offset“. Without any statistics available, the optimizer uses some hard-coded guesstimates, in this case assuming that 10% of the rows will match the predicate “s_gmt_offset = -5“. If we add a histogram for this column, the optimizer becomes aware that all the rows in the table will satisfy the condition and thus gives us a better execution plan (the query is truncated for better readability):
mysql> ANALYZE TABLE store UPDATE HISTOGRAM ON s_gmt_offset WITH 8 BUCKETS;
+-------------+-----------+----------+----------------------------------------------------------+
| Table       | Op        | Msg_type | Msg_text                                                 |
+-------------+-----------+----------+----------------------------------------------------------+
| tpcds.store | histogram | status   | Histogram statistics created for column 's_gmt_offset'.  |
+-------------+-----------+----------+----------------------------------------------------------+
1 row in set (0.06 sec)

mysql> SELECT ...
+------------+------------+--------------------------------------------------------------------------+
| promotions | total      | CAST(promotions AS DECIMAL(15, 4)) / CAST(total AS DECIMAL(15, 4)) * 100 |
+------------+------------+--------------------------------------------------------------------------+
| 3213210.07 | 5966836.78 |                                                              53.85114741 |
+------------+------------+--------------------------------------------------------------------------+
1 row in set (1.37 sec)
With this histogram, the query execution time goes down to less than 1.4 seconds, which is a 2x improvement. The reason is that in the first plan, the optimizer chooses, for the first derived table, to do a full table scan on the table store and then do primary key lookups into item, store_sales, date_dim, customer and finally customer_address. But once it realizes that the table store will return more rows than it guessed without histogram statistics available, the optimizer instead chooses to do a full table scan on the table item, followed by primary key lookups into store_sales, store, date_dim, customer and finally customer_address.
But, why not an index?
Several of you might now think that an index would do the job equally well, which is true:
mysql> CREATE INDEX s_gmt_offset_idx ON store (s_gmt_offset);
Query OK, 0 rows affected (0.53 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> SELECT ...
+------------+------------+--------------------------------------------------------------------------+
| promotions | total      | CAST(promotions AS DECIMAL(15, 4)) / CAST(total AS DECIMAL(15, 4)) * 100 |
+------------+------------+--------------------------------------------------------------------------+
| 3213210.07 | 5966836.78 |                                                              53.85114741 |
+------------+------------+--------------------------------------------------------------------------+
1 row in set (1.41 sec)
However, there are two reasons why you might consider a histogram instead of an index:
- Maintaining an index has a cost. If you have an index, every INSERT, UPDATE and DELETE causes the index to be updated. This is not free, and it will have an impact on your performance. A histogram, on the other hand, is created once and never updated unless you explicitly ask for it, so it will not hurt your INSERT/UPDATE/DELETE performance (a cleanup sketch follows after this list).
- If you have an index, the optimizer will do what we call “index dives” to estimate the number of records in a given range. This also has a certain cost, and it can become too costly if you have, for instance, very long IN lists in your query. Histogram statistics are much cheaper in this case, and might thus be more suitable.
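If you conclude that the histogram gives the optimizer the statistics it needs and you do not need the index for lookups, you could drop the index again and keep only the histogram; a hypothetical cleanup sketch using the index and histogram from the examples above:

-- Remove the index to avoid the maintenance cost on writes...
DROP INDEX s_gmt_offset_idx ON store;

-- ...and rely on the histogram, which is only refreshed when you ask for it.
ANALYZE TABLE store UPDATE HISTOGRAM ON s_gmt_offset WITH 8 BUCKETS;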
Inspecting histogram statistics
Histogram statistics are stored in the data dictionary as JSON objects, which makes them both flexible and readable. You can, for instance, use the built-in JSON functions to extract information from the histograms. Let us say that you want to know when the histogram for the column “amount” in the table “payment” was created or last updated. You can easily use the JSON unquoting extraction operator to find this information:
mysql> SELECT
    ->   HISTOGRAM->>'$."last-updated"' AS last_updated
    -> FROM INFORMATION_SCHEMA.COLUMN_STATISTICS
    -> WHERE
    ->   SCHEMA_NAME = "sakila"
    ->   AND TABLE_NAME = "payment"
    ->   AND COLUMN_NAME = "amount";
+----------------------------+
| last_updated               |
+----------------------------+
| 2017-09-15 11:54:25.000000 |
+----------------------------+
Or let us say that you want to find out how many buckets there are in the histogram compared to how many buckets you specified in your ANALYZE TABLE statement:
mysql> SELECT
    ->   TABLE_NAME,
    ->   COLUMN_NAME,
    ->   HISTOGRAM->>'$."number-of-buckets-specified"' AS num_buckets_specified,
    ->   JSON_LENGTH(HISTOGRAM, '$.buckets') AS num_buckets_created
    -> FROM INFORMATION_SCHEMA.COLUMN_STATISTICS
    -> WHERE
    ->   SCHEMA_NAME = "sakila";
+------------+--------------+-----------------------+---------------------+
| TABLE_NAME | COLUMN_NAME  | num_buckets_specified | num_buckets_created |
+------------+--------------+-----------------------+---------------------+
| payment    | amount       | 32                    | 19                  |
| payment    | payment_date | 32                    | 32                  |
+------------+--------------+-----------------------+---------------------+
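Since MySQL decides the histogram type automatically, you may also want to see which type it chose for each column; a minimal sketch, assuming the same sakila histograms (the “histogram-type” field holds either “singleton” or “equi-height”):

-- List the histogram type chosen for each column in the sakila schema.
SELECT
  TABLE_NAME,
  COLUMN_NAME,
  HISTOGRAM->>'$."histogram-type"' AS histogram_type
FROM INFORMATION_SCHEMA.COLUMN_STATISTICS
WHERE SCHEMA_NAME = "sakila";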
We refer to the manual for more details about what kind of information you can extract from the histogram objects.
Optimizer trace
If you want to see the estimates that the histogram statistics produce, the simplest way is to look at the EXPLAIN output:
mysql> EXPLAIN SELECT * FROM customer WHERE c_birth_day BETWEEN 1 AND 10;
+----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+
| id | select_type | table    | partitions | type | possible_keys | key  | key_len | ref  | rows  | filtered | Extra       |
+----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+
|  1 | SIMPLE      | customer | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 98633 |    11.11 | Using where |
+----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+
1 row in set, 1 warning (0.00 sec)

mysql> ANALYZE TABLE customer UPDATE HISTOGRAM ON c_birth_day WITH 32 BUCKETS;
+----------------+-----------+----------+--------------------------------------------------------+
| Table          | Op        | Msg_type | Msg_text                                               |
+----------------+-----------+----------+--------------------------------------------------------+
| tpcds.customer | histogram | status   | Histogram statistics created for column 'c_birth_day'. |
+----------------+-----------+----------+--------------------------------------------------------+
1 row in set (0.10 sec)

mysql> EXPLAIN SELECT * FROM customer WHERE c_birth_day BETWEEN 1 AND 10;
+----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+
| id | select_type | table    | partitions | type | possible_keys | key  | key_len | ref  | rows  | filtered | Extra       |
+----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+
|  1 | SIMPLE      | customer | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 98633 |    32.12 | Using where |
+----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+
1 row in set, 1 warning (0.00 sec)
If you look at the column “filtered”, you will see that it changes from the default 11.11% to a much more accurate 32.12%. However, if you have multiple conditions where some of the columns have histogram statistics and others don’t, it can be difficult to tell what the optimizer has estimated for each of them:
mysql> EXPLAIN SELECT * FROM customer WHERE c_birth_day <= 20 AND c_birth_year = 1967;
+----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+
| id | select_type | table    | partitions | type | possible_keys | key  | key_len | ref  | rows  | filtered | Extra       |
+----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+
|  1 | SIMPLE      | customer | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 98633 |     6.38 | Using where |
+----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+
1 row in set, 1 warning (0.00 sec)
If you want to see the estimates made by the histograms in more detail, you can have a look at the optimizer trace for the query:
mysql> SET OPTIMIZER_TRACE = "enabled=on";
Query OK, 0 rows affected (0.00 sec)

mysql> SET OPTIMIZER_TRACE_MAX_MEM_SIZE = 1000000;
Query OK, 0 rows affected (0.00 sec)

mysql> EXPLAIN SELECT * FROM customer WHERE c_birth_day <= 20 AND c_birth_year = 1967;

mysql> SELECT JSON_EXTRACT(TRACE, "$**.filtering_effect") FROM INFORMATION_SCHEMA.OPTIMIZER_TRACE;
+-----------------------------------------------------------------------------------------+
| JSON_EXTRACT(TRACE, "$**.filtering_effect")                                             |
+-----------------------------------------------------------------------------------------+
| [[{"condition": "(`customer`.`c_birth_day` <= 20)", "histogram_selectivity": 0.6376}]]  |
+-----------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
Here we use the JSON_EXTRACT function to pull the relevant part out of the trace output. For each condition where a histogram is used, we can see the estimated selectivity. In this case, the selectivity was only estimated for one of the conditions (c_birth_day <= 20), and the histogram estimated that 63.76% of the rows in the column would match this condition. And in fact, this matches the actual data in the column:
mysql> SELECT
    ->   (SELECT COUNT(*) FROM customer WHERE c_birth_day <= 20)
    ->   /
    ->   (SELECT COUNT(*) FROM customer) AS ratio;
+--------+
| ratio  |
+--------+
| 0.6376 |
+--------+
1 row in set (0.03 sec)