MySQL :: A new dimension to MySQL query optimizations

This is “A new dimension to MySQL query optimizations – part 2”. If you haven’t read part 1 yet, I suggest you skim through it before reading on.

To summarize, the problem at hand is this: Given a query with a join between two or more tables, the MySQL optimizer’s mission is to find the best combination of join order and access method so that the response time becomes as low as possible. The optimizer does this by calculating the cost of each combination and then picking the cheapest one.

Consider the following query:

SELECT *
FROM employee JOIN department ON employee.dept_no=department.dept_no
WHERE employee.first_name="John" AND
      employee.hire_date BETWEEN "2012-01-01" AND "2012-06-01"

The optimizer will calculate the cost of the alternative plans as follows:

total cost = cost(access_method_table1) +
             prefix_rows_table1 x cost(access_method_table2)

As explained in part 1, the problem with this calculation is that the cost of accessing table2 should not be multiplied by the number of rows returned by the chosen access method on table1, but rather by the number of rows in table1 that evaluate to true for all conditions. MySQL got this wrong up to and including 5.6.

And that’s where condition filtering comes in: it provides a much better prefix rows estimate by taking into account not only conditions that are used by the chosen access method but all other relevant conditions as well.
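To make the difference concrete, here is a minimal Python sketch of the cost formula above. The function name total_cost and the two per-access cost figures are made up for illustration and do not reflect the optimizer's real cost units; only the prefix row counts (8 rows fetched by the access method, 1 row matching all conditions) are taken from the example used later in this post.

```python
def total_cost(cost_table1, prefix_rows_table1, cost_table2):
    """total cost = cost(access_method_table1)
                    + prefix_rows_table1 x cost(access_method_table2)"""
    return cost_table1 + prefix_rows_table1 * cost_table2


# Hypothetical cost units, chosen only to illustrate the formula.
COST_REF_EMPLOYEE = 10.0        # assumed cost of the access method on employee
COST_EQ_REF_DEPARTMENT = 1.0    # assumed cost of one lookup in department

# Prefix rows taken as the rows returned by the access method on table1.
cost_without_filtering = total_cost(COST_REF_EMPLOYEE, 8, COST_EQ_REF_DEPARTMENT)

# Prefix rows taken as the rows that satisfy all conditions on table1.
cost_ideal = total_cost(COST_REF_EMPLOYEE, 1, COST_EQ_REF_DEPARTMENT)

print(cost_without_filtering)  # 18.0
print(cost_ideal)              # 11.0
```

With these assumed numbers, the second term of the formula is inflated eightfold when the access method's row count is used as prefix rows.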

How it works

Before we start with the examples, here are the most important things you need to know:

The conditions are investigated for each table, and a condition will contribute to the filtering estimate for that table only if:
1. it refers to the table at hand, and
2. the condition depends only on constant values or values from tables earlier in the join sequence, and
3. the condition is not in use by the access method
If a condition contributes to the filtering estimate, the estimate will be based on the range optimizer’s analysis since this is very accurate. If that is not available, index statistics are used instead. If those are not available either, heuristic numbers are used.
Conditions are assumed to have no correlation.
The condition filter estimate is shown as a percentage in the “filtered” column of EXPLAIN. While “rows” shows the estimated number of rows fetched by the chosen access method, the prefix rows estimate for the next table is rows multiplied by filtered.
Condition filtering is only calculated if it can cause a change of plan. Since it only affects the cost of accessing tables later in the join sequence, it is not calculated for the last table. Thus, by definition it is not calculated for single-table queries. However, there is one exception: it is always calculated for EXPLAIN so that you can see its value.
It can be turned on and off with the optimizer_switch flag condition_fanout_filter (e.g. SET optimizer_switch='condition_fanout_filter=on').
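The three eligibility rules and the no-correlation assumption can be sketched in a few lines of Python. This is only an illustration of the rules as stated above, not MySQL's implementation: the Condition class, its fields, and the functions contributes and filter_estimate are hypothetical names, and the selectivity of the join condition is an invented placeholder. Treating conditions as uncorrelated is modeled here by multiplying their selectivities.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class Condition:
    table: str                   # table the condition filters
    depends_on: frozenset        # other tables whose values the condition needs
    used_by_access_method: bool  # True if the chosen access method already applies it
    selectivity: float           # estimated fraction of rows that pass (0.0 to 1.0)


def contributes(cond, table, earlier_tables):
    """Apply the three rules: refers to this table, depends only on constants
    or earlier tables, and is not already used by the access method."""
    return (cond.table == table
            and cond.depends_on <= set(earlier_tables)
            and not cond.used_by_access_method)


def filter_estimate(conditions, table, earlier_tables):
    """Multiply the selectivities of all contributing conditions
    (no correlation assumed). Returns 1.0 if nothing contributes."""
    return prod(c.selectivity for c in conditions
                if contributes(c, table, earlier_tables))


# employee is first in the join order and is read by ref access on first_name.
conditions = [
    # first_name = "John": used by the access method, so it is excluded.
    Condition("employee", frozenset(), True, 8 / 1024),
    # hire_date BETWEEN ...: constant-only and unused, so it contributes.
    Condition("employee", frozenset(), False, 0.1631),
    # Join condition: needs department, which comes later, so it is excluded.
    # Its selectivity is an invented placeholder.
    Condition("employee", frozenset({"department"}), False, 0.5),
]

print(filter_estimate(conditions, "employee", earlier_tables=[]))  # 0.1631
```

Only the BETWEEN condition passes all three rules here, so its selectivity alone becomes the filtering estimate for employee.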

Examples, please!

In the example queries below, there are:

1024 rows in the employee table
12 rows in the department table
150 rows with hire_date BETWEEN "2012-01-01" AND "2012-06-01"
8 rows with first_name="John"
1 row with first_name="John" AND hire_date BETWEEN …

First, let’s see what it looks like in MySQL 5.6:

mysql> EXPLAIN SELECT * FROM employee
->     WHERE hire_date BETWEEN "2012-01-01" AND "2012-06-01" AND
->           first_name="John";
+----+----------+------+---------------+------+-------+------+----------+
| id | table    | type | possible_keys | key  | ref   | rows | filtered | 
+----+----------+------+---------------+------+-------+------+----------+ 
| 1  | employee | ref  | name,h_date   | name | const | 8    | 100.00   | 
+----+----------+------+---------------+------+-------+------+----------+

MySQL estimates that it will read 8 rows through ref access. Now let’s try to join with department. MySQL 5.6 now assumes that prefix rows for department is 8 and the chosen access method of department therefore has to be executed 8 times. However, we already know that the correct number is 1 since there is only one row that matches both conditions. Although we can’t see this from the EXPLAIN, the cost of accessing department is greatly exaggerated because of this.

mysql> EXPLAIN
->     SELECT *
->     FROM employee JOIN department ON employee.dept_no=department.dept_no
->     WHERE employee.first_name="John" AND
->           employee.hire_date BETWEEN "2012-01-01" AND "2012-06-01";
+----+------------+--------+------------------+---------+---------+------+----------+
| id | table      | type   | possible_keys    | key     | ref     | rows | filtered |
+----+------------+--------+------------------+---------+---------+------+----------+
| 1  | employee   | ref    | name,h_date,dept | name    | const   | 8    | 100.00   |
| 1  | department | eq_ref | PRIMARY          | PRIMARY | dept_no | 1    | 100.00   |
+----+------------+--------+------------------+---------+---------+------+----------+

Now let’s take a look at MySQL 5.7. Notice that prefix rows (rows * filtered = 8 * 16.31% = 1.3) is now much closer to reality. Just like before, 8 in the “rows” column is the estimated number of rows that will be read by ref access, while the new condition filtering information is shown in the “filtered” column. Since first_name="John" is used by the ref access method, 16.31% is the condition filtering effect estimated from the remaining BETWEEN condition. When joined with department, prefix rows for department is now 1.3 instead of 8. In turn, the cost calculation is much more accurate.

mysql> EXPLAIN SELECT * FROM employee
->     WHERE hire_date BETWEEN "2012-01-01" AND "2012-06-01" AND
->           first_name="John";
+----+----------+------+---------------+------+-------+------+----------+
| id | table    | type | possible_keys | key  | ref   | rows | filtered |
+----+----------+------+---------------+------+-------+------+----------+
| 1  | employee | ref  | name,h_date   | name | const | 8    | 16.31    |
+----+----------+------+---------------+------+-------+------+----------+

mysql> EXPLAIN
->     SELECT *
->     FROM employee JOIN department ON employee.dept_no=department.dept_no
->     WHERE employee.first_name="John" AND
->           employee.hire_date BETWEEN "2012-01-01" AND "2012-06-01";
+----+------------+--------+------------------+---------+---------+------+----------+
| id | table      | type   | possible_keys    | key     | ref     | rows | filtered |
+----+------------+--------+------------------+---------+---------+------+----------+
| 1  | employee   | ref    | name,h_date,dept | name    | const   | 8    | 16.31    |
| 1  | department | eq_ref | PRIMARY          | PRIMARY | dept_no | 1    | 100.00   |
+----+------------+--------+------------------+---------+---------+------+----------+

If we force a table scan, none of the conditions are used by the access method and the filtered column is updated accordingly. Now we get rows * filtered = 1024 * 0.12% = 1.23 , which is also pretty close to the correct value of 1.

mysql> EXPLAIN SELECT * FROM employee IGNORE INDEX(name,h_date)
->     WHERE hire_date BETWEEN "2012-01-01" AND "2012-06-01" AND
->           first_name="John";
+----+----------+------+---------------+------+-------+------+----------+
| id | table    | type | possible_keys | key  | ref   | rows | filtered |
+----+----------+------+---------------+------+-------+------+----------+
| 1  | employee | ALL  | NULL          | NULL | NULL  | 1024 |   0.12   |
+----+----------+------+---------------+------+-------+------+----------+
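The prefix rows figures quoted for these plans can be recomputed directly from the “rows” and “filtered” columns. The helper prefix_rows below is a hypothetical name used only for this sketch; the input numbers are the ones shown in the EXPLAIN outputs in this post.

```python
def prefix_rows(rows, filtered_percent):
    """Prefix rows for the next table: rows multiplied by filtered,
    where filtered is the percentage shown by EXPLAIN."""
    return rows * filtered_percent / 100.0


mysql56_ref = prefix_rows(8, 100.00)    # 5.6: no condition filtering
mysql57_ref = prefix_rows(8, 16.31)     # 5.7: ref access on name
mysql57_scan = prefix_rows(1024, 0.12)  # 5.7: forced table scan

print(f"{mysql56_ref:.2f}")   # 8.00
print(f"{mysql57_ref:.2f}")   # 1.30
print(f"{mysql57_scan:.2f}")  # 1.23
```

Both 5.7 estimates land close to the single row that actually matches both conditions, regardless of which access method reads employee.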

These are of course only basic examples to illustrate how it works. It gets much more interesting once we look at many-table joins, e.g. the queries in DBT-3 that show up to an 88% reduction in response time. I might follow up with a part 3 to explain these bigger queries later. In the meantime, you can experiment with your own data by downloading the MySQL 5.7 labs release.

Oh, and by the way: “Condition filtering” is only one of many planned steps towards a new and improved cost model which includes brand new features and a lot of refactoring. There are some subtle traces of this work in the 5.7.4 release; a few new APIs that don’t do much on their own. Stay tuned for more info on this subject!