WL#2489: Better ONLY_FULL_GROUP_BY mode
Status: Complete
The existing ONLY_FULL_GROUP_BY mode aims at protecting the user by rejecting non-deterministic-result queries which contain GROUP BY or aggregate functions. Example of rejected query: 'SELECT a,b FROM t1 GROUP BY a' (what value of 'b' could be returned for one group having identical values of 'a'?). GOALs/MEANS of this WL: G-1: adress community complains in: http://rpbouman.blogspot.co.uk/2007/05/debunking-group-by-myths.html and BUG#51058 M-1: make this mode more permissive (like described in SQL2011) while still rejecting non-deterministic queries. G-2: make code simpler to understand and maintain. M-2: refactor it. G-3: fix Bug ONLY_FULL_GROUP_BY REJECTS VALID QUERY USING VIEW and Bug SELECT AGGR + NON-AGGR FROM JOIN WITH VIEW IS NOT REJECTED BY ONLY_FULL_GROUP_BY M-3: side-effect of refactoring G-4: fix ONLY_FULL_GROUP_BY DOES NOT BLOCK "SELECT DISTINCT A ORDER BY B" which causes false positives in QA M-4: reuse machinery developed for the rest of this WL G-5: protect more users from non-deterministic queries, as suggested in http://mechanics.flite.com/blog/2013/02/12/why-i-use-only-full-group-by/ https://github.com/datamapper/dm-core/issues/69 "Add ONLY_FULL_GROUP_BY to sql_mode in do_mysql" http://codeascraft.etsy.com/2013/03/19/the-perils-of-sql_mode/ "ONLY_FULL_GROUP_BY,STRICT_ALL_TABLES' is the absolute minimum we’d recommend" M-5: enable ONLY_FULL_GROUP_BY by default (users would still be able to turn it off). G-6: reduce our future work. M-6: by enabling the mode by default in the future, we will avoid that users file bug reports about non-deterministic results like BUG#68254. See M-5. G-7: improve our compatibility with the SQL standard. M-7: add this mode to sql_mode=ansi.
Notations: PREWL: trunk before this worklog is pushed POSTWL: trunk after this worklog is pushed ON: only_full_group_by mode on OFF: only_full_group_by mode off Functional requirements. F-1: queries accepted by PREWL+ON should be accepted by POSTWL+ON, unless it can be proven that it was a bug in PREWL+ON to accept them. F-2: more queries should be accepted by POSTWL+ON than by PREWL+ON, following the prescriptions of SQL 2011. The essence of those prescriptions is to accept a query when the selected expressions are guaranteed uniquely determined inside a group. In detail, these queries should be accepted: F-2.1 ("primary key"): select t1.pk, t1.a+t1.c from t1 group by t1.pk; Same holds if the primary key is multi-column, as long as all its columns are in the GROUP list. Same holds if this is a unique key over only non-nullable columns. F-2.2 ("equality in WHERE"): select t1.a, t2.b*5 from t1 where t1.a=t2.b group by t1.a; F-2.3 ("equality in join condition"): cross join, inner join, natural join; outer join (with restrictions) F-2.4 ("views and derived tables"): select d.x,d.y from (select t2.a as x, t2.a*5 as y from t2) d group by d.x; F-2.5: all combinations of the above, of any complexity, for example: select d.x,d.y from (select t2.pk as x, t2.a as y from t2) d group by d.x; (= F-2.1 + F-2.4) A copy of the relevant SQL2011 section can be provided on demand. F-3: a new function any_value(arg) is be introduced. It has the same type and return value as its argument, and is not checked by only_full_group_by. It will allow people to force MySQL to accept queries which MySQL thinks should be rejected. Example: select t1.b from t1 group by t1.a; where t1.a is not indexed: nothing guarantees determinism, MysQL will reject this, user can force acception: select any_value(t1.b) from t1 group by t1.a; Specification of any_value(arg): * in grouped query, returns one value freely chosen among the group * in implicitely grouped query (with aggregation), returns one value freely chosen among the relation or NULL if relation is empty * if used in ORDER BY in query with DISTINCT, returns one value freely chosen in the group of distinct expressions. * if used in a non-aggregated context, like select * from t where any_value(t.col)=3; or select any_value(t.col) from t; any_value() returns the value of its argument, like if it would be COALESCE(t.col, NULL). * any_value() does not make a query aggregated, unlike real aggregate functions SUM/MAX/etc: select any_value(t.col) from t; is not an aggregated query; select any_value(t.col),sum(t.col2) from t; is. F-4: encourage users to use only_full_group_by: add only_full_group_by to the default value of @@sql_mode. Add it to sql_mode=ansi. F-5: only_full_group_by should allow aliases of selected expressions to be used in HAVING condition. F-6: bugs listed in HLD should be fixed. Nonfunctional requirements. NF-1: PREWL+OFF and POSTWL+OFF should run all queries at identical speed. NF-2: PREWL+OFF and POSTWL+ON should run queries without DISTINCT, GROUP BY, aggregates at identical speed. NF-3: queries with DISTINCT, GROUP BY, aggregates should get a performance penalty lower than 2% in POSTWL+ON compared to PREWL+ON (here I envision the possibility that the re-factoring, by having dedicated "item tree walks" instead of piggybacking on fix_fields (), could make the query slower).
Behavior of ONLY_FULL_GROUP_BY mode should be improved. == Implement F-2) make it less strict about selected/order expressions == Currently this mode requires that every selected expression is either: - identical to one group expression, - or a function of aggregates, outer references, literals, group columns; for example, it allows: select a+1 from t1 group by a+1; (selected expr == group expr) select a+b from t1 group by b,a; (a+b function of group columns) Recent versions of the SQL standard introduce the optional feature ''functional dependencies'' (T301): - if this feature is not supported, the standard prescribes a behavior which is that of only_full_group_by above. - if this feature is supported, the standard prescribes that more queries should be allowed, for example: select a from t1 group by pk2,pk1; where (pk1,pk2) is the primary key (or unique not null constraint) of the table. And also: select t2.a from t1,t2 where t2.a=t1.a group by t1.a; In those two queries, the value of the selected expression is constant over a group. The standard has several pages describing exactly what functional dependencies should be recognized, including those in outer joins. The case about primary keys was mentioned in one blog in the community ( http://rpbouman.blogspot.co.uk/2007/05/debunking-group-by-myths.html ). The case about equalities in WHERE was not. It could be useful if one table is split into two (say, the user has split table "customer" : moved rarely used parts of the row in a separate table "customer2", to make the frequently used parts a smaller table "customer1"): select customer1.id, customer2.address, sum(invoice.pricepaid) from customer1, customer2, invoice where customer1.id=customer2.id and customer1.id=invoice.customer_id group by customer1.id; Assuming that customer1.id and customer2.id are primary keys, by looking at the WHERE we can see that each group has a unique value of customer2.id and thus a unique value of customer2.address, so the query is valid according to the standard. The workarounds which people can use today are: - add the functionally dependent fields to the GROUP BY list; drawback: it prevents the Optimizer from doing GROUP BY with a simple index scan (adds a filesort or temporary table), making the query slower. - Put the functionally dependent field inside a dummy MAX(), making it look like an aggregate; drawback: makes the query a little slower (needs to manage the aggregate), and if the schema changes (some field is made non-unique), the functional dependency goes away (randomness possibly comes back), but the MAX() keeps the query legal, which is probably not what the user would have wanted. Implementation of functional dependency recognition: see F-2.* . According to the public documentation: - Oracle, SQL Server, Sybase do not have T301 - PostgreSQL recognizes functional dependencies on a primary key. References to public documentation: - Oracle 11, see "Restrictions on the Select List" in docs.oracle.com/cd/B28359_01/server.111/b28286/statements_10002.htm - SQL Server 12: http://msdn.microsoft.com/en-us/library/ms177673.aspx "Each table or view column in any nonaggregate expression in the
In current trunk, the only_full_group_by code is scattered over: - several functions of name resolution: some which collect info, some which reject query: resolve_ref_in_select_and_group(), Item_field::fix_outer_field(), Item_field::fix_fields(), Item_sum::check_sum_func(), Field_iterator_table::create_item() - JOIN::prepare() (rejects query) - setup_without_group() (uses collected info above and rejects query). - a hack in Item_in_subselect::single_value_transformer() All this will be centralized in some "walk" over the item trees during JOIN::prepare(), which will check and reject the query. This allows to remove scattered code, and some members of SELECT_LEX ("collected info"): Listnon_agg_fields; bool m_non_agg_field_used, int cur_pos_in_all_fields, non_agg_field_used(), set_non_agg_field_used() and this hack in sql_partition.cc and setup_without_group(): const bool save_agg_field= thd->lex->current_select->non_agg_field_used(); The new method will be the following. In JOIN::prepare(), a function check_only_full_group_by() is called which calls aggregate_check::Group_check::check_query(), which enumerates the SELECT/HAVING/ORDER expressions, and for each such expression: - if it's equal to a GROUP BY expression, okay - otherwise walk it (ignoring parts under any_value) and for each found column reference in it: - if this column is in GROUP BY, ok - otherwise, if this column is functionally dependent on columns in GROUP BY, ok - otherwise error. Functional dependencies of GROUP BY columns are determined only if needed above; with this algorithm: we start with the set of GROUP BY columns, search for primary/unique-not-null keys in this set (any such key allows to add to the set all columns of the table), search for equalities involving this set (we can then add the other member of the equality to the set); this way we are able to grow the set, until we find the desired column (the current one of the current inspected SELECT/HAVING/ORDER expression), or until the set stops growing (error). The above was for checking clauses against GROUP BY (F2&F5 in the HLS). F6 in the HLS (DISTINCT vs ORDER BY) is simpler, will also be done with a walk (which was already partly reviewed): check_only_full_group_by() calls aggregate_check::Distinct_check::check_query().
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.