WL#2489: Better ONLY_FULL_GROUP_BY mode
Status: Complete — Priority: Medium
The existing ONLY_FULL_GROUP_BY mode aims at protecting the user by rejecting non-deterministic-result queries which contain GROUP BY or aggregate functions. Example of rejected query: 'SELECT a,b FROM t1 GROUP BY a' (what value of 'b' could be returned for one group having identical values of 'a'?). GOALs/MEANS of this WL: G-1: adress community complains in: http://rpbouman.blogspot.co.uk/2007/05/debunking-group-by-myths.html and BUG#51058 M-1: make this mode more permissive (like described in SQL2011) while still rejecting non-deterministic queries. G-2: make code simpler to understand and maintain. M-2: refactor it. G-3: fix Bug ONLY_FULL_GROUP_BY REJECTS VALID QUERY USING VIEW and Bug SELECT AGGR + NON-AGGR FROM JOIN WITH VIEW IS NOT REJECTED BY ONLY_FULL_GROUP_BY M-3: side-effect of refactoring G-4: fix ONLY_FULL_GROUP_BY DOES NOT BLOCK "SELECT DISTINCT A ORDER BY B" which causes false positives in QA M-4: reuse machinery developed for the rest of this WL G-5: protect more users from non-deterministic queries, as suggested in http://mechanics.flite.com/blog/2013/02/12/why-i-use-only-full-group-by/ https://github.com/datamapper/dm-core/issues/69 "Add ONLY_FULL_GROUP_BY to sql_mode in do_mysql" http://codeascraft.etsy.com/2013/03/19/the-perils-of-sql_mode/ "ONLY_FULL_GROUP_BY,STRICT_ALL_TABLES' is the absolute minimum we’d recommend" M-5: enable ONLY_FULL_GROUP_BY by default (users would still be able to turn it off). G-6: reduce our future work. M-6: by enabling the mode by default in the future, we will avoid that users file bug reports about non-deterministic results like BUG#68254. See M-5. G-7: improve our compatibility with the SQL standard. M-7: add this mode to sql_mode=ansi.
<pre> Notations: PREWL: trunk before this worklog is pushed POSTWL: trunk after this worklog is pushed ON: only_full_group_by mode on OFF: only_full_group_by mode off Functional requirements. F-1: queries accepted by PREWL+ON should be accepted by POSTWL+ON, unless it can be proven that it was a bug in PREWL+ON to accept them. F-2: more queries should be accepted by POSTWL+ON than by PREWL+ON, following the prescriptions of SQL 2011. The essence of those prescriptions is to accept a query when the selected expressions are guaranteed uniquely determined inside a group. In detail, these queries should be accepted: F-2.1 ("primary key"): select t1.pk, t1.a+t1.c from t1 group by t1.pk; Same holds if the primary key is multi-column, as long as all its columns are in the GROUP list. Same holds if this is a unique key over only non-nullable columns. F-2.2 ("equality in WHERE"): select t1.a, t2.b*5 from t1 where t1.a=t2.b group by t1.a; F-2.3 ("equality in join condition"): cross join, inner join, natural join; outer join (with restrictions) F-2.4 ("views and derived tables"): select d.x,d.y from (select t2.a as x, t2.a*5 as y from t2) d group by d.x; F-2.5: all combinations of the above, of any complexity, for example: select d.x,d.y from (select t2.pk as x, t2.a as y from t2) d group by d.x; (= F-2.1 + F-2.4) A copy of the relevant SQL2011 section can be provided on demand. F-3: a new function any_value(arg) is be introduced. It has the same type and return value as its argument, and is not checked by only_full_group_by. It will allow people to force MySQL to accept queries which MySQL thinks should be rejected. Example: select t1.b from t1 group by t1.a; where t1.a is not indexed: nothing guarantees determinism, MysQL will reject this, user can force acception: select any_value(t1.b) from t1 group by t1.a; Specification of any_value(arg): * in grouped query, returns one value freely chosen among the group * in implicitely grouped query (with aggregation), returns one value freely chosen among the relation or NULL if relation is empty * if used in ORDER BY in query with DISTINCT, returns one value freely chosen in the group of distinct expressions. * if used in a non-aggregated context, like select * from t where any_value(t.col)=3; or select any_value(t.col) from t; any_value() returns the value of its argument, like if it would be COALESCE(t.col, NULL). * any_value() does not make a query aggregated, unlike real aggregate functions SUM/MAX/etc: select any_value(t.col) from t; is not an aggregated query; select any_value(t.col),sum(t.col2) from t; is. F-4: encourage users to use only_full_group_by: add only_full_group_by to the default value of @@sql_mode. Add it to sql_mode=ansi. F-5: only_full_group_by should allow aliases of selected expressions to be used in HAVING condition. F-6: bugs listed in HLD should be fixed. Nonfunctional requirements. NF-1: PREWL+OFF and POSTWL+OFF should run all queries at identical speed. NF-2: PREWL+OFF and POSTWL+ON should run queries without DISTINCT, GROUP BY, aggregates at identical speed. NF-3: queries with DISTINCT, GROUP BY, aggregates should get a performance penalty lower than 2% in POSTWL+ON compared to PREWL+ON (here I envision the possibility that the re-factoring, by having dedicated "item tree walks" instead of piggybacking on fix_fields (), could make the query slower). </pre>
<pre> Behavior of ONLY_FULL_GROUP_BY mode should be improved. == Implement F-2) make it less strict about selected/order expressions == Currently this mode requires that every selected expression is either: - identical to one group expression, - or a function of aggregates, outer references, literals, group columns; for example, it allows: select a+1 from t1 group by a+1; (selected expr == group expr) select a+b from t1 group by b,a; (a+b function of group columns) Recent versions of the SQL standard introduce the optional feature ''functional dependencies'' (T301): - if this feature is not supported, the standard prescribes a behavior which is that of only_full_group_by above. - if this feature is supported, the standard prescribes that more queries should be allowed, for example: select a from t1 group by pk2,pk1; where (pk1,pk2) is the primary key (or unique not null constraint) of the table. And also: select t2.a from t1,t2 where t2.a=t1.a group by t1.a; In those two queries, the value of the selected expression is constant over a group. The standard has several pages describing exactly what functional dependencies should be recognized, including those in outer joins. The case about primary keys was mentioned in one blog in the community ( http://rpbouman.blogspot.co.uk/2007/05/debunking-group-by-myths.html ). The case about equalities in WHERE was not. It could be useful if one table is split into two (say, the user has split table "customer" : moved rarely used parts of the row in a separate table "customer2", to make the frequently used parts a smaller table "customer1"): select customer1.id, customer2.address, sum(invoice.pricepaid) from customer1, customer2, invoice where customer1.id=customer2.id and customer1.id=invoice.customer_id group by customer1.id; Assuming that customer1.id and customer2.id are primary keys, by looking at the WHERE we can see that each group has a unique value of customer2.id and thus a unique value of customer2.address, so the query is valid according to the standard. The workarounds which people can use today are: - add the functionally dependent fields to the GROUP BY list; drawback: it prevents the Optimizer from doing GROUP BY with a simple index scan (adds a filesort or temporary table), making the query slower. - Put the functionally dependent field inside a dummy MAX(), making it look like an aggregate; drawback: makes the query a little slower (needs to manage the aggregate), and if the schema changes (some field is made non-unique), the functional dependency goes away (randomness possibly comes back), but the MAX() keeps the query legal, which is probably not what the user would have wanted. Implementation of functional dependency recognition: see F-2.* . According to the public documentation: - Oracle, SQL Server, Sybase do not have T301 - PostgreSQL recognizes functional dependencies on a primary key. References to public documentation: - Oracle 11, see "Restrictions on the Select List" in docs.oracle.com/cd/B28359_01/server.111/b28286/statements_10002.htm - SQL Server 12: http://msdn.microsoft.com/en-us/library/ms177673.aspx "Each table or view column in any nonaggregate expression in the <select> list must be included in the GROUP BY list " - Sybase: http://dcx.sybase.com/1200/en/dbreference/select-statement.html "When GROUP BY is used, the select-list, HAVING clause, and ORDER BY clause must not reference any identifier that is not named in the GROUP BY clause. The exception is that the select-list and HAVING clause can contain aggregate functions." - PostgreSQL: http://www.postgresql.org/docs/9.2/static/sql-select.html "it is not valid for the SELECT list expressions to refer to ungrouped columns except within aggregate functions or if the ungrouped column is functionally dependent on the grouped columns, since there would otherwise be more than one possible value to return for an ungrouped column. A functional dependency exists if the grouped columns (or a subset thereof) are the primary key of the table containing the ungrouped column." The standard mandates that every member of the GROUP BY clause should be a column of a table of the FROM clause. MySQL allow any expression instead. We will continue supporting that, but if a non-column expression is used, we will not try to recognize functional dependences with it. For example: select 1+cos(a) group by cos(a); will still be rejected. Thus, this WL will not affect http://bugs.mysql.com/bug.php?id=69310 Searching for functional dependencies is done by analyzing the definition of indexes, and equalities. If the user's query complies with the 5.6 rules of only_full_group_by (all selected non-aggregated columns are in GROUP BY), search for functional dependencies is unneeded and will not be performed, so the user should not experience any speed degradation. Regarding GROUP BY ... WITH ROLLUP: the Standard says that: - the syntactical transformation of ROLLUP is to make a union of queries, and in each such query, some group column references are replaced with a NULL literal. - functional dependencies should be recognized only after that transformation. But there cannot be a key-based or equality-based functional dependency on a NULL literal, so if the query has ROLLUP, the only_full_group_by mode does not need to research functional dependencies, it can behave as it does in 5.6. Bonus side-effect: 5.6 does only_full_group_by checks at every execution; this WL does them only once, by testing "first_execution". Implement F-3: any_value() ========================== For cases where the user knows some functional dependency which MySQL can't see (too complex to see, or based on some property of the data), or is ok with any value of the group to be chosen, a pseudo-aggregate function ANY_VALUE will be implemented. It will freely pick a value in the group, and silence only_full_group_by checks on its argument. Example: select a, any_value(b) from t1 group by a; select sum(a), any_value(b) from t1; Implement F-5: make it allow aliases in HAVING ============================================== MySQL has this extension: it allows aliases in the HAVING clause, like this: select sum(x) as foo ... having foo>2; Currently only_full_group_by rejects such query. For example BUG#51058. With the justification that foo is not an aggregate, it's a reference to an aggregate... But this justification is not very convincing. And it forces people to write: select sum(x) as foo ... having sum(x)>2; which will compute the sum twice. We should allow such query. Implement F-6: make it apply some restrictions to DISTINCT + ORDER BY ===================================================================== in a query with DISTINCT and ORDER BY, we can get random order: select distinct a from t1 order by b; during execution, depending on the query plan, we may apply DISTINCT before ORDER BY; when we apply DISTINCT we may pick a value of 'b' in the group of rows which have the same value of 'a'. That gives random order of the final result. This was filed as: Bug #13581713 ONLY_FULL_GROUP_BY DOES NOT BLOCK "SELECT DISTINCT A ORDER BY B". We see that this is a problem which is similar in nature to what only_full_group_by wants to fight. Thus, we should make only_full_group_by add restrictions on DISTINCT + ORDER BY. The proper restrictions are described in the standard: if a query has ORDER BY, and one expression in ORDER BY is not identical to some expression in the SELECT list, and involves columns of tables of the FROM clause, then the query should not have DISTINCT. This worklog will fix Bug #13581713 . Thanks to code re-factoring done in this worklog, this worklog will fix Bug #16021396 ONLY_FULL_GROUP_BY REJECTS VALID QUERY USING VIEW Bug #18993257 SELECT AGGR + NON-AGGR FROM JOIN WITH VIEW IS NOT REJECTED BY ONLY_FULL_GROUP_BY Implement F-4: enable it by default =================================== When only_full_group_by mode has been improved, we should enable it by default. A few blogs recommend using this mode: http://mechanics.flite.com/blog/2013/02/12/why-i-use-only-full-group-by/ https://github.com/datamapper/dm-core/issues/69 "Add ONLY_FULL_GROUP_BY to sql_mode in do_mysql" http://codeascraft.etsy.com/2013/03/19/the-perils-of-sql_mode/ "ONLY_FULL_GROUP_BY,STRICT_ALL_TABLES' is the absolute minimum we’d recommend" it would also avoid bug reports like BUG#68254. We should also add this mode to sql_mode=ansi (it used to be there, see BUG#8510). Future directions ================= The logic developed by this worklog (recognizing functional dependencies of a set of table columns) could be reused to implement the following optimization. If a GROUP BY clause is: GROUP BY column1, expr2, expr3 ... we could see if exprX is functionally dependent on column1, and if it is, then we could remove exprX from the clause. This would make a simpler clause, which could be resolvable with an index method instead of a temporary table. Note that column1 must be before exprX, in the clause. This optimization may also apply to ORDER BY. See also the eq_ref_table() function, which has the same idea. </pre>
<pre> In current trunk, the only_full_group_by code is scattered over: - several functions of name resolution: some which collect info, some which reject query: resolve_ref_in_select_and_group(), Item_field::fix_outer_field(), Item_field::fix_fields(), Item_sum::check_sum_func(), Field_iterator_table::create_item() - JOIN::prepare() (rejects query) - setup_without_group() (uses collected info above and rejects query). - a hack in Item_in_subselect::single_value_transformer() All this will be centralized in some "walk" over the item trees during JOIN::prepare(), which will check and reject the query. This allows to remove scattered code, and some members of SELECT_LEX ("collected info"): List<Item_field> non_agg_fields; bool m_non_agg_field_used, int cur_pos_in_all_fields, non_agg_field_used(), set_non_agg_field_used() and this hack in sql_partition.cc and setup_without_group(): const bool save_agg_field= thd->lex->current_select->non_agg_field_used(); The new method will be the following. In JOIN::prepare(), a function check_only_full_group_by() is called which calls aggregate_check::Group_check::check_query(), which enumerates the SELECT/HAVING/ORDER expressions, and for each such expression: - if it's equal to a GROUP BY expression, okay - otherwise walk it (ignoring parts under any_value) and for each found column reference in it: - if this column is in GROUP BY, ok - otherwise, if this column is functionally dependent on columns in GROUP BY, ok - otherwise error. Functional dependencies of GROUP BY columns are determined only if needed above; with this algorithm: we start with the set of GROUP BY columns, search for primary/unique-not-null keys in this set (any such key allows to add to the set all columns of the table), search for equalities involving this set (we can then add the other member of the equality to the set); this way we are able to grow the set, until we find the desired column (the current one of the current inspected SELECT/HAVING/ORDER expression), or until the set stops growing (error). The above was for checking clauses against GROUP BY (F2&F5 in the HLS). F6 in the HLS (DISTINCT vs ORDER BY) is simpler, will also be done with a walk (which was already partly reviewed): check_only_full_group_by() calls aggregate_check::Distinct_check::check_query(). </pre>
Copyright (c) 2000, 2017, Oracle Corporation and/or its affiliates. All rights reserved.