WL#13066: Improve predicate handling in loose index scan for group by.
Affects: Server-8.0
—
Status: Complete
Based on BUG#90951 Contribution by Facebook: Enhance group-by loose index scan This worklog improves loose index scan(LIS) used for GROUP BY queries by lifting the current limitation of its usage w.r.t the number of infix ranges (range on non-grouping column) that can be present in the query. This improvement would make optimizer pick LIS for more queries. Currently LIS cannot be used when a query has more than one infix range. Consider the following test case: CREATE TABLE t1 (f1 INT NOT NULL, f2 INT NOT NULL, f3 INT NOT NULL, PRIMARY KEY(f1, f2, f3)); INSERT INTO t1 VALUES (1,1,1), (1,2,2), (1,3,3), (1,4, 4), (1,5,5), (2,1,1), (2,2,2), (2,3,3), (2,4, 4), (2,5,5), (3,1,1), (3,2,2), (3,3,3), (3,4, 4), (3,5,5); INSERT INTO t1 SELECT f1, f2 + 5, f3 + 5 FROM t1; INSERT INTO t1 SELECT f1, f2 + 10, f3 + 10 FROM t1; INSERT INTO t1 SELECT f1, f2 + 20, f3 + 20 FROM t1; INSERT INTO t1 SELECT f1, f2 + 40, f3 + 40 FROM t1; ANALYZE TABLE t1; The following query uses LIS as there is only one infix range ("f2=2") SELECT f1, MAX(f3) FROM t1 WHERE (f1 > 2) AND (f2 = 2) GROUP BY f1; However this query does not use LIS if one more infix range is added (say "f2 = 4")like in the query below. SELECT f1, MAX(f3) FROM t1 WHERE (f1 > 2) AND (f2 = 2 OR f2 = 4) GROUP BY f1; When optimizer picks the key(f1,f2,f3) for doing loose index scan, ->It gets a distinct value for "f1" ->This distinct value followed by the next key columns's value ("f2"s value would be 2 for the query above) is used to form a key. ->It then uses this key to read the first/last (min/max) qualifying row. However when it has two different values for "f2", the second step cannot be performed with the current code. Hence LIS is not picked for the second query mentioned above. As a result, currently we have the following restriction mentioned in the rules applied for picking LIS for group by: "NGA3.If BA <> {}, there can only be one range. TODO: This is a code limitation and is not strictly needed. See BUG#15947433 (Sakila)" The objective of this task is to lift the limitation mentioned above to allow dis-junction of equality predicates for the infix ranges. The following example shows the advantage of using LIS. As can be seen, the number of accessed rows are decreased if LIS is used. Without LIS --------------------- explain SELECT f1, MAX(f3) FROM t1 WHERE (f1 > 2) AND (f2 = 2 OR f2 = 4) GROUP BY f1; id select_type table partitions type possible_keys key key_len ref rows filtered Extra 1 SIMPLE t1 NULL range PRIMARY PRIMARY 4 NULL 80 19.00 Using where; Using index Warnings: Note 1003 /* select#1 */ select `test`.`t1`.`f1` AS `f1`,max(`test`.`t1`.`f3`) AS `MAX(f3)` from `test`.`t1` where ((`test`.`t1`.`f1` > 2) and ((`test`.`t1`.`f2` = 2) or (`test`.`t1`.`f2` = 4))) group by `test`.`t1`.`f1` FLUSH STATUS; SELECT f1, MAX(f3) FROM t1 WHERE (f1 > 2) AND (f2 = 2 OR f2 = 4) GROUP BY f1; f1 MAX(f3) 3 4 SHOW STATUS LIKE 'handler_read%'; Variable_name Value Handler_read_first 0 Handler_read_key 1 Handler_read_last 0 Handler_read_next 80 Handler_read_prev 0 Handler_read_rnd 0 Handler_read_rnd_next 0 --------------------- With LIS --------------------- explain SELECT f1, MAX(f3) FROM t1 WHERE (f1 > 2) AND (f2 = 2 OR f2 = 4) GROUP BY f1; id select_type table partitions type possible_keys key key_len ref rows filtered Extra 1 SIMPLE t1 NULL range PRIMARY PRIMARY 8 NULL 3 100.00 Using where; Using index for group-by Warnings: Note 1003 /* select#1 */ select `test`.`t1`.`f1` AS `f1`,max(`test`.`t1`.`f3`) AS `MAX(f3)` from `test`.`t1` where ((`test`.`t1`.`f1` > 2) and ((`test`.`t1`.`f2` = 2) or (`test`.`t1`.`f2` = 4))) group by `test`.`t1`.`f1` FLUSH STATUS; SELECT f1, MAX(f3) FROM t1 WHERE (f1 > 2) AND (f2 = 2 OR f2 = 4) GROUP BY f1; f1 MAX(f3) 3 4 SHOW STATUS LIKE 'handler_read%'; Variable_name Value Handler_read_first 0 Handler_read_key 5 Handler_read_last 1 Handler_read_next 0 Handler_read_prev 0 Handler_read_rnd 0 Handler_read_rnd_next 0 ---------------------
Functional requirements: F-1: LIS will be considered for the case mentioned in HLS and is picked if it has better cost. F-2: If the index can be used for LIS, it should be visible in 'possible_keys' column of the EXPLAIN output. Non-Functional requirements: NF-1: The performance shall be as good or better for queries where LIS can be applied for 90% of the test queries. NF-2: A higher performance degradation clearly caused by skewed data is acceptable as long as NF1 is satisfied.
A detailed explanation of when LIS is considered for queries having GROUP BY is present in opt_range.cc (get_best_group_min_max()) This worklog lifts the limitation NGA3 as mentioned in HLD i.e it allows queries to have a dis-junction of equality predicates on a non-grouping column. 1. Changes to the LIS algorithm for queries having GROUP BY Current LIS implementation allows multiple prefix ranges (range over grouping key columns) and one infix range(range over non-grouping key column). Consider we have KEY(f1, f2, f3) and a query like SELECT f1, MAX(f3) FROM t1 WHERE (f1 = 2 OR f1 = 7) AND (f2 = 2) GROUP BY f1; Prefix ranges here are 'f1 = 2' and 'f1 = 7'. Infix range is 'f2 = 2'. LIS for this query would do the following: For each prefix range ('f1 = 2' for the first iteration ) { Get a distinct value for the prefix range (Note that it need not have to be an equality predicate) Construct a compound range using prefix and infix range. ('f1 = 1 AND f2 = 2' for the first iteration, 'f1 = 7 AND f2 = 2' for second) Read rows qualified by the compound range. (Aggregations that use LIS are min(), max(), count(distinct), avg(distinct), sum(distinct)) } Continue with next prefix range('f1 = 7'). Let us now add multiple infix ranges to the example query. SELECT f1, MAX(f3) FROM t1 WHERE (f1 = 2 OR f1 = 7) AND (f2 = 2 OR f2 = 3) GROUP BY f1; Prefix ranges here are 'f1 = 2' and 'f1 = 7'. Infix ranges are 'f2 = 2' and 'f2 = 3'. To allow multiple infix ranges, LIS for this query needs to do the following: For each prefix range ('f1 = 2' for the first iteration ) { Get a distinct value for the prefix range. (Note that it need not have to be an equality predicate) For each infix range ('f2 = 2' for the first iteration { Construct a compound range using prefix and infix range. (First range would be "f1 = 1 AND f2 = 2" , second "f1 = 1 AND f2 = 3" , third "f1=7 and f2=2" and fourth "f1=7 and f2=3") Read rows qualified by the compound range. (Aggregations that use LIS are min(),max(), count(distinct),avg(distinct),sum(distinct)) } Continue with the next infix range ('f2 = 3') } Continue with next prefix range('f1 = 7'). Note that we now loop over multiple infix ranges to construct the compound range. 2. Changes in the cost model. Since multiple infix ranges are allowed, the cost model is updated so that the number of io_blocks and cpu_cost is increased proportional to the total number of permutations of equality predicates over the non-grouping columns(infix factor). Number of rows to be examined is also increased proportional to the infix factor. In doing so, we have lifted the following limitation "NGA3.If BA <> {}, there can only be one range. TODO: This is a code limitation and is not strictly needed. See BUG#15947433".
1. Changes in sql/opt_range.h QUICK_GROUP_MIN_MAX_SELECT class. Added following fields: + bool min_max_keypart_asc; /* TRUE if min_max key part is ascending. */ + uint key_infix_parts; /* Indicates the number of gap attributes we have */ + // The current infix range position (in key_infix_ranges) used for row + // retrieval. + uint cur_infix_range_position[MAX_REF_PARTS]; + // Indicates if all infix ranges have been used to retrieve rows (all ranges + // in key_infix_ranges) + bool seen_all_infix_ranges; + Quick_ranges_array key_infix_ranges; /* Array of key infix range arrays. */ Array above holds infix ranges. Consider we have the following infix ranges: (f2 = 1 OR f2 = 2 OR f2 = 3) AND (f3 = 4 OR f3 = 5). In this case, key_infix_ranges[0] has {'f2 = 1', 'f2 = 2', 'f2 =3'} and key_infix_ranges[1] has {'f3 = 4', 'f3 = 5'} ranges. This array is filled in TRP_GROUP_MIN_MAX::make_quick() and used in QUICK_GROUP_MIN_MAX_SELECT::append_next_infix() method. Methods added: + bool append_next_infix(); // Determine and append the next infix. + void reset_group(); // Reset various variables used to track position within a group. Method changes: + bool add_range(SEL_ARG *sel_range, int idx); Added new argument 'idx', if negative, inserts into min_max_ranges otherwise inserts into key_infix_ranges[idx] + void update_min_result(bool *reset); // Argument is changed to IN/OUT + void update_max_result(bool *reset); // Argument is changeds to IN/OUT 2. Changes in sql/opt_range.cc. 2.1 Added new argument 'uint num_key_parts' to the functions get_quick_select(), get_quick_keys(). num_key_parts Number of key parts used for creating QUICK_RANGE_SELECT. Note: QUICK_GROUP_MIN_MAX creates ranges only for key parts on grouped columns. Rest of the scans create ranges using all usable keyparts. Hence the default is MAX_REF_PARTS. This is necessary for the certain type of queries, consider we have SELECT DISTINCT a FROM t WHERE a = 1 AND (b = 2 OR b = 15); If all possible parts are used for the range, get_quick_keys() constructs two prefix ranges('a = 1 AND b = 2', 'a = 1 AND b = 15') and the QUICK_GROUP_MIN_MAX_SELECT::next_prefix() treats this as two different prefix ranges. So the prefix range 'a = 1' is processed twice and it leads to duplicate result. 2.2 Function get_quick_keys() is refactored so that it can generate ranges in DESC order if processed key part is descending. Because of this, some code related to handling of DESC key parts became unnecessary and removed, see: range_seq_t quick_range_seq_init(void *init_param, uint, uint); uint quick_range_seq_next(range_seq_t rseq, KEY_MULTI_RANGE *range); QUICK_SELECT_DESC::QUICK_SELECT_DESC(); 2.3 TRP_GROUP_MIN_MAX::make_quick() Added code which populate key_infix_ranges from index_tree. 2.4 Function get_constant_key_infix() is replaced with check_key_infix(). The following restriction is removed: "NGA3.If BA <> {}, there can only be one range. TODO: This is a code limitation and is not strictly needed. See BUG#15947433" Added calculation of 'infix_factor' which is used for the cost model. 2.5 QUICK_GROUP_MIN_MAX_SELECT::get_next(). Function is changed so that it supports multiple infix range processing. 3. Test case. Data file: include/group_min_max_ext_data.inc Result file for InnoDB, MyISAM engines: group_min_max_ext.result Test file for InnoDB, MyISAM engines: group_min_max_ext.test Auxiliary test, executes query, verifies result: include/group_min_max_ext_query.inc Main test: include/group_min_max_ext_test.inc
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.