WL#6986: Make switching of index due to small limit cost-based
Affects: Server-Prototype Only
Status: Complete
The decision to switch to an index that returns rows in the order requested by "ORDER BY" is not cost-based in all places. For queries with "ORDER BY ... LIMIT N", the optimizer tries to switch to an index that supports the "ORDER BY" order. This currently happens in two places, one of which is make_join_select(). There the switch is not cost-based: if a usable index exists, the optimizer switches to it. This can lead to poor performance when the query has to read a large part of, or the entire, index in order to find enough qualifying records.

The goal of this worklog is to make the decision in make_join_select() of whether to switch to a new index in order to support "ORDER BY ... LIMIT N" cost-based. Note that since we have little knowledge about how much data actually has to be read before the query has produced N records, this worklog will not prevent queries from ending up reading the entire index, but it should prevent switching in cases where it is likely that a lot of the index has to be read.
Functional requirements:
------------------------
F1) The feature shall use the cost model present in "test_if_cheaper_ordering"
    in "make_join_select" when switching to an index that returns rows in the
    requested order. This may result in changed query execution plans.

Non-functional requirements:
----------------------------
NF1) The performance shall be as good as or better than before for 95% of all
     test queries.
NF2) A performance degradation of 5% for up to 5% of the queries is tolerable
     in the general case.
NF3) A higher performance degradation clearly caused by skewed or covariant
     data [1] is acceptable as long as NF1 is satisfied.
For queries that have "ORDER BY ... LIMIT N" it is in many cases beneficial to use an index that supports the "ORDER BY" order, since we (a) can avoid doing a filesort of the result and (b) can stop after having produced the first N result records. The optimizer already supports this by checking whether it is possible to switch to such an index. The check is done after the join order has been decided, and today it happens in two different places in the code:

1. In make_join_select(): If the current access method is a quick select, the query has a low limit value, and there exists an alternative index that can be used for reading in "ORDER BY" order, then we switch to this index. The switch is not cost-based, and it is done even in cases where we might have to read a large portion of the new index.

2. When calling test_if_skip_sort_order() we use test_if_cheaper_ordering() to decide whether to switch to an alternative index in order to avoid filesort. This decision is cost-based and tries to take into account an estimate of how much will likely have to be read from the index when a limit is given.

The switch done in make_join_select() can cause increased execution time if the query has to read much of, or the entire, index in order to find enough qualifying records.

Problem statement:

Consider the following query:

  SELECT patientId, time FROM t1
  WHERE patientId > 41 AND patientId < 43 AND time > 0 AND illness="Headache"
  ORDER BY time LIMIT 1;

If the table has indexes on "patientId" and "time", the optimizer can choose between a range scan on "patientId > 41 AND patientId < 43" and a range scan on "time > 0". It can also use the index on "time" to return rows in the order requested by the ORDER BY clause. In this case, say the range on "patientId" matches 5 rows and the range on "time" matches 1000 rows. Without the ORDER BY, the index on "patientId" would have been chosen. But because the index on "time" returns rows in the requested order and the limit is very small, the optimizer decides to switch to the index on "time". The assumption behind this is that we will likely encounter a row satisfying "patientId > 41 AND patientId < 43" and "illness = 'Headache'" early in the scan, resulting in a near-optimal plan.

This is the problem at hand. What if the range "patientId > 41 AND patientId < 43" does not return any rows? If the optimizer chooses the index on "time", finding the one record we are looking for may force query execution to scan the entire index, which is clearly not a good thing to do. In that case it would have been wiser to choose the index on "patientId". Even in cases where the two possible ranges return a comparable number of rows (unlike in this example), it is better to make the switch to an ordering index cost-based. In the example above, if we had estimated the cost from the number of rows we might have to scan when "time" is chosen, based on "quick_condition_rows" (which in this case would be 0), the optimizer could have avoided the switch.

Cost-based switching in test_if_cheaper_ordering:

With respect to the example above, say the range on "patientId" returns 5 records and the range on "time" returns 1000.
If we need to switch to "time" to get rows in the requested order, we can expect to scan about 200 records before finding the one record required by the LIMIT clause. The 200 comes from presuming that the qualifying rows are randomly distributed over the index and using the following estimate (see the sketch below):

  Number of records to be scanned =
      select_limit * quick_rows[this index] / quick_condition_rows

In the example above:
  select_limit                 = 1
  quick_rows for index "time"  = 1000
  quick_condition_rows         = 5

From this the optimizer can compare the cost of scanning ~200 records via "time" with the cost of scanning 5 records via "patientId" and doing filesort. In this case the latter is cheaper, so "patientId" is chosen over "time". It would therefore be wise to use this cost model in "make_join_select" as well.

As stated above, the cost model in "test_if_cheaper_ordering" makes a presumption about the distribution of the data. Let us modify the statistics of the example a little: say the index on "patientId" returns 200 records and the index on "time" returns 1000. Using the equation above, the optimizer estimates that only 5 records have to be scanned when "time" is used; since more rows satisfy the range on "patientId", the probability of finding qualifying rows early is high, and the index on "time" is chosen. Now, what if the data is distributed such that those 200 records lie at the end of the index on "time", i.e. they are the 801st to 1000th entries? Then 801 records have to be scanned to find the one record needed by the query, which again results in a bad plan. Ideally the cost model in "test_if_cheaper_ordering" should be improved to give a more accurate estimate, but with the information currently available about the data distribution the optimizer has to live with this assumption. The customer has suggested providing a configuration variable to turn this optimization off, which could be considered.

This worklog should make the choice in make_join_select() of whether to switch index cost-based, in order to reduce the likelihood of switching in cases where a large part of the index would be read. To address this we can reuse the cost model implemented in test_if_cheaper_ordering(). The worklog should also give the user a configuration variable to turn the optimization off if needed.

Related bugs:
BUG#57001 INCORRECT OPTIMIZATION OF ORDER BY CLUSTERED_KEY LIMIT
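As a minimal standalone sketch of the row estimate discussed above (this is not server code; the helper name estimate_rows_to_scan and the zero-row fallback are assumptions made for illustration, and the values are the ones from the example):

  #include <algorithm>
  #include <iostream>

  // Hypothetical helper mirroring the estimate above: with randomly
  // distributed qualifying rows, the expected number of index entries scanned
  // before LIMIT N matching rows are found is
  //   N * quick_rows[idx] / quick_condition_rows,
  // capped at the number of entries the index range contains.
  static double estimate_rows_to_scan(double select_limit,
                                      double quick_rows_idx,
                                      double quick_condition_rows)
  {
    if (quick_condition_rows <= 0)
      return quick_rows_idx;  // assumed worst case: the whole range is read
    return std::min(quick_rows_idx,
                    select_limit * quick_rows_idx / quick_condition_rows);
  }

  int main()
  {
    // Worklog example: LIMIT 1, 1000 rows in the "time" range, 5 rows
    // expected to satisfy all conditions (quick_condition_rows).
    std::cout << estimate_rows_to_scan(1, 1000, 5) << "\n";    // 200
    // Modified statistics: 200 rows satisfy all conditions.
    std::cout << estimate_rows_to_scan(1, 1000, 200) << "\n";  // 5
    return 0;
  }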
Implementation:

Currently, when there is an ORDER BY with a low limit, make_join_select() uses the following heuristic to choose the range scan: for all the possible "usable_keys", check whether any key gives results in the order requested by the ORDER BY clause, and give that key as input to "test_quick_select" to re-check for a possible range scan. This would be changed to use "test_if_cheaper_ordering":

  if (recheck_reason == LOW_LIMIT)
  {
    /*
      Do a cost based search on all the keys for an index which
      can give ordered results
    */
    test_if_cheaper_ordering(tab, join->order, tab->table,
                             usable_keys, -1, select_limit,
                             &best_key, &read_direction,
                             &select_limit);
    /*
      test_if_cheaper_ordering compares range/ref scan vs choosing an
      index for ordering. But, in this case we might have a table
      scan/index scan in hand. So re-check if there is a possibility
      for a range scan that can give ordered results even though it
      was not the cheapest.
    */
    if (best_key < 0)
    {
      for (uint idx= 0; idx < tab->table->s->keys; idx++)
      {
        if (usable_keys.is_set(idx) &&
            (tab->table->quick_keys.is_set(idx) &&
             tab->table->quick_rows[idx] ==
             tab->table->quick_condition_rows))
        {
          const int read_direction= test_if_order_by_key(join->order,
                                                         tab->table, idx);
          if (read_direction != 0)
            best_key= idx;
          break;
        }
      }
    }

    interesting_order= (read_direction == -1 ? ORDER::ORDER_DESC :
                                               ORDER::ORDER_ASC);

    if (best_key < 0)
      recheck_reason= DONT_RECHECK;  // No usable keys

    /*
      If the current plan is to use a range access on an index that
      provides the order dictated by the ORDER BY clause there is no
      need to recheck index usage; we already know from the former
      call to test_quick_select() that a range scan on the chosen
      index is cheapest. Note that previous calls to
      test_quick_select() did not take order direction (ASC/DESC)
      into account, so in case of DESC ordering we still need to
      recheck.
    */
    if (sel->quick && (sel->quick->index != MAX_KEY) &&
        best_key == (int)sel->quick->index &&
        (interesting_order != ORDER::ORDER_DESC ||
         sel->quick->reverse_sorted()))
    {
      recheck_reason= DONT_RECHECK;
    }

    if (best_key >= 0)
    {
      usable_keys.clear_all();
      usable_keys.set_bit(best_key);
    }
  }

The main change from the previous implementation is the call to "test_if_cheaper_ordering". Unlike the previous implementation, this is how the index would be chosen using test_if_cheaper_ordering:

  for (nr=0; nr < table->s->keys ; nr++)
  {
    int direction;
    uint used_key_parts;
    if (keys.is_set(nr) &&
        (direction= test_if_order_by_key(order, table, nr, &used_key_parts)))
    {
      /*
        Don't use an index scan with ORDER BY without limit.
        For GROUP BY without limit always use index scan
        if there is a suitable index.
        Why we hold to this asymmetry hardly can be explained
        rationally. It's easy to demonstrate that using
        temporary table + filesort could be cheaper for grouping
        queries too.
      */
      if (is_covering ||
          select_limit != HA_POS_ERROR ||
          (ref_key < 0 && (group || table->force_index)))
      {
        double rec_per_key;
        double index_scan_time;
        KEY *keyinfo= table->key_info+nr;
        if (select_limit == HA_POS_ERROR)
          select_limit= table_records;
        /*
          If tab=tk is not the last joined table tn then to get first
          L records from the result set we can expect to retrieve only
          L/fanout(tk,tn) where fanout(tk,tn) says how many
          rows in the record set on average will match each row tk.
          Usually our estimates for fanouts are too pessimistic.
          So the estimate for L/fanout(tk,tn) will be too optimistic
          and as result we'll choose an index scan when using ref/range
          access + filesort will be cheaper.
        */
        select_limit= (ha_rows) (select_limit < fanout ?
                                 1 : select_limit/fanout);
        /*
          We assume that each of the tested indexes is not correlated
          with ref_key. Thus, to select first N records we have to scan
          N/selectivity(ref_key) index entries.
          selectivity(ref_key) = #scanned_records/#table_records =
          refkey_rows_estimate/table_records.
          In any case we can't select more than #table_records.
          N/(refkey_rows_estimate/table_records) > table_records
          <=> N > refkey_rows_estimate.
        */
        if (select_limit > refkey_rows_estimate)
          select_limit= table_records;
        else
          select_limit= (ha_rows) (select_limit *
                                   (double) table_records /
                                   refkey_rows_estimate);
        rec_per_key=
          keyinfo->rec_per_key[keyinfo->user_defined_key_parts - 1];
        set_if_bigger(rec_per_key, 1);
        /*
          Here we take into account the fact that rows are
          accessed in sequences rec_per_key records in each.
          Rows in such a sequence are supposed to be ordered
          by rowid/primary key. When reading the data
          in a sequence we'll touch not more pages than the
          table file contains.
          TODO. Use the formula for a disk sweep sequential access
          to calculate the cost of accessing data rows for one
          index entry.
        */
        index_scan_time= select_limit/rec_per_key *
                         min(rec_per_key, table->file->scan_time());
        if ((ref_key < 0 && is_covering) ||
            (ref_key < 0 && (group || table->force_index)) ||
            index_scan_time < read_time)
        {
          ha_rows quick_records= table_records;
          ha_rows refkey_select_limit=
            (ref_key >= 0 && table->covering_keys.is_set(ref_key)) ?
            refkey_rows_estimate : HA_POS_ERROR;
          if ((is_best_covering && !is_covering) ||
              (is_covering && refkey_select_limit < select_limit))
            continue;
          if (table->quick_keys.is_set(nr))
            quick_records= table->quick_rows[nr];
          if (best_key < 0 ||
              (select_limit <= min(quick_records,best_records) ?
               keyinfo->user_defined_key_parts < best_key_parts :
               quick_records < best_records))
          {
            best_key= nr;
            best_key_parts= keyinfo->user_defined_key_parts;
            if (saved_best_key_parts)
              *saved_best_key_parts= used_key_parts;
            best_records= quick_records;
            is_best_covering= is_covering;
            best_key_direction= direction;
            best_select_limit= select_limit;
          }
        }
      }
    }
  }
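The core of the decision above is the comparison of index_scan_time against read_time. The following is a minimal standalone sketch of that comparison, not the server code path; the struct, function name and all numeric values are hypothetical, chosen only to mirror the formula quoted above:

  #include <algorithm>
  #include <iostream>

  // Simplified, self-contained version of the comparison made in
  // test_if_cheaper_ordering(): scanning the ordering index is considered
  // only if the expected number of entries read (the adjusted select_limit)
  // costs less than the cost estimate (read_time) of the currently chosen
  // access method.
  struct OrderingIndexCandidate
  {
    double select_limit;     // limit adjusted for fanout and ref selectivity
    double rec_per_key;      // rows per distinct prefix of the candidate index
    double table_scan_time;  // handler estimate for a full table scan
  };

  static bool prefer_ordering_index(const OrderingIndexCandidate &c,
                                    double read_time)
  {
    // Rows are fetched in runs of rec_per_key entries; one run never costs
    // more than a full table scan, mirroring the min() in the server formula.
    const double index_scan_time=
      c.select_limit / c.rec_per_key *
      std::min(c.rec_per_key, c.table_scan_time);
    return index_scan_time < read_time;
  }

  int main()
  {
    // Illustrative numbers: after adjustment we expect to scan ~200 entries
    // of the "time" index, while the current plan ("patientId" range) is
    // costed at the equivalent of a handful of row reads.
    OrderingIndexCandidate time_index{200.0, 1.0, 50.0};
    std::cout << std::boolalpha
              << prefer_ordering_index(time_index, 7.0) << "\n";  // false
    return 0;
  }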
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.