WL#2489: Better ONLY_FULL_GROUP_BY mode

Status: Complete   —   Priority: Medium

The existing ONLY_FULL_GROUP_BY mode aims at protecting the user by rejecting
non-deterministic-result queries which contain GROUP BY or aggregate functions.
Example of rejected query: 'SELECT a,b FROM t1 GROUP BY a' (what value of 'b'
could be returned for one group having identical values of 'a'?).

GOALs/MEANS of this WL:

G-1: address community complaints in:
 and BUG#51058
M-1: make this mode more permissive (like described in SQL2011) while still
rejecting non-deterministic queries. 

G-2: make code simpler to understand and maintain.
M-2: refactor it.

M-3: side-effect of refactoring

  DISTINCT A ORDER BY B" which causes false positives in QA
M-4: reuse machinery developed for the rest of this WL

G-5: protect more users from non-deterministic queries, as suggested in
https://github.com/datamapper/dm-core/issues/69 "Add ONLY_FULL_GROUP_BY to
sql_mode in do_mysql"
"ONLY_FULL_GROUP_BY,STRICT_ALL_TABLES' is the absolute minimum we’d recommend"
M-5: enable ONLY_FULL_GROUP_BY by default (users would still be able to
turn it off).

G-6: reduce our future work.
M-6: once the mode is enabled by default, users should no longer file bug
reports about non-deterministic results like BUG#68254. See M-5.

G-7: improve our compatibility with the SQL standard.
M-7: add this mode to sql_mode=ansi.

PREWL: trunk before this worklog is pushed
POSTWL: trunk after this worklog is pushed
ON: only_full_group_by mode on
OFF: only_full_group_by mode off

Functional requirements.

F-1: queries accepted by PREWL+ON should be accepted by POSTWL+ON, unless it can
be proven that it was a bug in PREWL+ON to accept them.

F-2: more queries should be accepted by POSTWL+ON than by PREWL+ON, following
the prescriptions of SQL 2011. The essence of those prescriptions is to accept a
query when the selected expressions are guaranteed uniquely determined inside a
group. In detail, these queries should be accepted:
F-2.1 ("primary key"):
select t1.pk, t1.a+t1.c from t1 group by t1.pk;
Same holds if the primary key is multi-column, as long as all its columns are in
the GROUP list. Same holds if this is a unique key over only non-nullable columns.
F-2.2 ("equality in WHERE"):
select t1.a, t2.b*5 from t1 where t1.a=t2.b group by t1.a;
F-2.3 ("equality in join condition"):
cross join, inner join, natural join; outer join (with restrictions)
F-2.4 ("views and derived tables"):
select d.x,d.y from (select t2.a as x, t2.a*5 as y from t2) d
group by d.x; 
F-2.5: all combinations of the above, of any complexity, for example:
select d.x,d.y from (select t2.pk as x, t2.a as y from t2) d
group by d.x; 
(= F-2.1 + F-2.4)
A copy of the relevant SQL2011 section can be provided on demand.

F-3: a new function any_value(arg) is introduced. It has the same type
and return value as its argument, and is not checked by only_full_group_by. It
will allow people to force MySQL to accept queries which MySQL thinks should be
rejected. Example:
select t1.b from t1 group by t1.a;
where t1.a is not indexed: nothing guarantees determinism, MySQL will reject
this; the user can force acceptance with:
select any_value(t1.b) from t1 group by t1.a;
Specification of any_value(arg):
* in a grouped query, returns one value freely chosen from the group
* in an implicitly grouped query (with aggregation), returns
one value freely chosen from the relation, or NULL if the relation is empty
* if used in ORDER BY in query with DISTINCT, returns one value freely chosen in
the group of distinct expressions.
* if used in a non-aggregated context, like
  select * from t where any_value(t.col)=3;
  select any_value(t.col) from t;
any_value() returns the value of its argument, like if it would be
* any_value() does not make a query aggregated, unlike real aggregate functions
  select any_value(t.col) from t;
is not an aggregated query;
select any_value(t.col),sum(t.col2) from t;
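
The grouped and implicitly grouped cases of the specification above can be modelled outside the server. The following Python sketch is only an illustration of the intended semantics, not MySQL code: the helper names any_value_grouped() and any_value_implicit() are hypothetical, and keeping the first value seen is just one of the freely permitted choices.

```python
def any_value_grouped(rows, group_col, arg_col):
    """Model of any_value(arg_col) in a query grouped by group_col.

    Returns {group value: one value of arg_col taken from that group}.
    The choice is free; this sketch happens to keep the first value seen.
    """
    picked = {}
    for row in rows:
        picked.setdefault(row[group_col], row[arg_col])
    return picked


def any_value_implicit(rows, arg_col):
    """Model of any_value(arg_col) in an implicitly grouped query
    (for example next to sum()): one value of the relation,
    or None (standing for NULL) if the relation is empty."""
    return rows[0][arg_col] if rows else None


t1 = [{"a": 1, "b": 10}, {"a": 1, "b": 20}, {"a": 2, "b": 30}]
grouped = any_value_grouped(t1, "a", "b")   # select a, any_value(b) ... group by a
single = any_value_implicit(t1, "b")        # select sum(a), any_value(b) from t1
empty = any_value_implicit([], "b")         # same query on an empty table
```

A caller may only rely on the result being some value of the group (or relation), never on which one.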

F-4: encourage users to use only_full_group_by: add only_full_group_by to the
default value of @@sql_mode. Add it to sql_mode=ansi.

F-5: only_full_group_by should allow aliases of selected expressions to be used
in HAVING condition.

F-6: bugs listed in HLD should be fixed.

Nonfunctional requirements.

NF-1: PREWL+OFF and POSTWL+OFF should run all queries at identical speed.

NF-2: PREWL+OFF and POSTWL+ON should run queries without DISTINCT, GROUP BY,
aggregates at identical speed.

NF-3: queries with DISTINCT, GROUP BY, aggregates should get a performance
penalty lower than 2% in POSTWL+ON compared to PREWL+ON (here I envision the
possibility that the re-factoring, by having dedicated "item tree walks" instead
of piggybacking on fix_fields(), could make the query slower).

Behavior of ONLY_FULL_GROUP_BY mode should be improved.

== Implement F-2: make it less strict about selected/order expressions ==

Currently this mode requires that every selected expression is either:
- identical to one group expression,
- or a function of aggregates, outer references, literals, group columns; for
example, it allows:
select a+1 from t1 group by a+1; (selected expr == group expr)
select a+b from t1 group by b,a; (a+b function of group columns)
Recent versions of the SQL standard introduce the optional feature ''functional
dependencies'' (T301):
- if this feature is not supported, the standard prescribes a behavior which is
that of only_full_group_by above.
- if this feature is supported, the standard prescribes that more queries
should be allowed, for example:
select a from t1 group by pk2,pk1;
where (pk1,pk2) is the primary key (or unique not null constraint) of the table.
And also:
select t2.a from t1,t2 where t2.a=t1.a group by t1.a;
In those two queries, the value of the selected expression is constant over a group.
The standard has several pages describing exactly what functional dependencies
should be recognized, including those in outer joins.

The case about primary keys was mentioned in one blog in the community
( http://rpbouman.blogspot.co.uk/2007/05/debunking-group-by-myths.html ).
The case about equalities in WHERE was not. It could be useful if one table
is split into two (say, the user has split table "customer": moved the rarely
used parts of the row into a separate table "customer2", to make the frequently
used parts a smaller table "customer1"):
select customer1.id, customer2.address, sum(invoice.pricepaid)
from customer1, customer2, invoice
where customer1.id=customer2.id
and customer1.id=invoice.customer_id
group by customer1.id;
Assuming that customer1.id and customer2.id are primary keys, by looking at the
WHERE we can see that each group has a unique value of customer2.id and thus a
unique value of customer2.address, so the query is valid according to the standard.
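
The reasoning above can be checked by brute force on sample data. The Python sketch below is an illustration only: the rows are invented, and the join and grouping are done by hand rather than by MySQL. It evaluates the same join condition, groups by customer1.id, and verifies that every group sees a single address.

```python
from collections import defaultdict

# Invented sample rows; id is the primary key of customer1 and customer2.
customer1 = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
customer2 = [{"id": 1, "address": "1 Main St"}, {"id": 2, "address": "9 Elm Rd"}]
invoice = [
    {"customer_id": 1, "pricepaid": 10},
    {"customer_id": 1, "pricepaid": 5},
    {"customer_id": 2, "pricepaid": 7},
]

# where customer1.id=customer2.id and customer1.id=invoice.customer_id
joined = [
    (c1["id"], c2["address"], inv["pricepaid"])
    for c1 in customer1
    for c2 in customer2
    for inv in invoice
    if c1["id"] == c2["id"] and c1["id"] == inv["customer_id"]
]

# group by customer1.id: collect the addresses seen and sum the prices.
addresses = defaultdict(set)
totals = defaultdict(int)
for cid, address, price in joined:
    addresses[cid].add(address)
    totals[cid] += price

# customer2.id is unique, so each group contains exactly one address.
single_address_per_group = all(len(a) == 1 for a in addresses.values())
```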

The workarounds which people can use today are:
- add the functionally dependent fields to the GROUP BY list; drawback: it
prevents the Optimizer from doing GROUP BY with a simple index scan (adds a
filesort or temporary table), making the query slower.
- Put the functionally dependent field inside a dummy MAX(), making it look like
an aggregate; drawback: makes the query a little slower (needs to manage the
aggregate), and if the schema changes (some field is made non-unique), the
functional dependency goes away (randomness possibly comes back), but the MAX()
keeps the query legal, which is probably not what the user would have wanted.

Implementation of functional dependency recognition: see F-2.* .

According to the public documentation:
- Oracle, SQL Server, Sybase do not have T301
- PostgreSQL recognizes functional dependencies on a primary key.

References to public documentation:
- Oracle 11, see "Restrictions on the Select List" in
- SQL Server 12:
"Each table or view column in any nonaggregate expression in the <select> list
must be included in the GROUP BY list " 
- Sybase:
"When GROUP BY is used, the select-list, HAVING clause, and ORDER BY
clause must not reference any identifier that is not named in the GROUP
BY clause. The exception is that the select-list and HAVING clause can
contain aggregate functions."
- PostgreSQL:
"it is not valid for the SELECT list expressions to refer to ungrouped columns
except within aggregate functions or if the ungrouped column is functionally
dependent on the grouped columns, since there would otherwise be more than one
possible value to return for an ungrouped column. A functional dependency exists
if the grouped columns (or a subset thereof) are the primary key of the table
containing the ungrouped column." 

The standard mandates that every member of the GROUP BY clause should be a
column of a table of the FROM clause. MySQL allows any expression instead. We
will continue supporting that, but if a non-column expression is used, we will
not try to recognize functional dependencies with it. For example:
select 1+cos(a) group by cos(a);
will still be rejected.
Thus, this WL will not affect

Searching for functional dependencies is done by analyzing the definition of
indexes, and equalities. If the user's query complies with the 5.6 rules of
only_full_group_by (all selected non-aggregated columns are in GROUP BY), search
for functional dependencies is unneeded and will not be performed, so the user
should not experience any speed degradation.

Regarding GROUP BY ... WITH ROLLUP: the Standard says that:
- the syntactical transformation of ROLLUP is to make a union of queries, and in
each such query, some group column references are replaced with a NULL literal.
- functional dependencies should be recognized only after that transformation.
But there cannot be a key-based or equality-based functional dependency on a
NULL literal, so if the query has ROLLUP, the only_full_group_by mode does not
need to search for functional dependencies; it can behave as it does in 5.6.

Bonus side-effect: 5.6 does only_full_group_by checks at every execution; this
WL does them only once, by testing "first_execution".

== Implement F-3: any_value() ==

For cases where the user knows some functional dependency which MySQL can't see
(too complex to see, or based on some property of the data), or is ok with any
value of the group to be chosen, a pseudo-aggregate function ANY_VALUE will be
implemented. It will freely pick a value in the group, and silence
only_full_group_by checks on its argument. Example:
select a, any_value(b) from t1 group by a;
select sum(a), any_value(b) from t1;

== Implement F-5: make it allow aliases in HAVING ==

MySQL has this extension: it allows aliases in the HAVING clause, like this:
select sum(x) as foo ... having foo>2;
Currently only_full_group_by rejects such a query (see for example BUG#51058),
with the justification that foo is not an aggregate but a reference to an
aggregate. This justification is not very convincing, and it forces people to
write:
select sum(x) as foo ... having sum(x)>2;
which will compute the sum twice.
We should allow such a query.

== Implement F-6: make it apply some restrictions to DISTINCT + ORDER BY ==

In a query with DISTINCT and ORDER BY, we can get random order:
select distinct a from t1 order by b;
During execution, depending on the query plan, we may apply DISTINCT before
ORDER BY; when we apply DISTINCT we may pick any value of 'b' in the group of
rows which have the same value of 'a'. That gives random order of the final
result.
This was filed as:
We see that this is a problem which is similar in nature to what
only_full_group_by wants to fight. Thus, we should make only_full_group_by add
restrictions on DISTINCT + ORDER BY. The proper restrictions are described in
the standard: if a query has ORDER BY, and one expression in ORDER BY is not
identical to some expression in the SELECT list, and involves columns of tables
of the FROM clause, then the query should not have DISTINCT.
This worklog will fix Bug #13581713 .
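
The random order described above can be reproduced with a small model. The Python sketch below is an illustration only, with invented rows; distinct_a_order_by_b() is a hypothetical helper that applies DISTINCT first and lets the caller decide which 'b' survives for each value of 'a', which is the freedom the server has.

```python
t1 = [{"a": 1, "b": 3}, {"a": 2, "b": 2}, {"a": 1, "b": 0}]


def distinct_a_order_by_b(rows, pick):
    """Model 'select distinct a from t1 order by b' when DISTINCT runs first.

    pick chooses which 'b' represents each group of rows sharing the
    same 'a'; any value of the group is a legitimate choice.
    """
    groups = {}
    for row in rows:
        groups.setdefault(row["a"], []).append(row["b"])
    representative = {a: pick(bs) for a, bs in groups.items()}
    # ORDER BY b, using the surviving 'b' of each distinct 'a'.
    return sorted(representative, key=representative.get)


order_if_smallest_b_kept = distinct_a_order_by_b(t1, min)  # a=1 keeps b=0
order_if_largest_b_kept = distinct_a_order_by_b(t1, max)   # a=1 keeps b=3
```

Both results are valid answers to the same query, yet they list the values of 'a' in opposite order.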

Thanks to code re-factoring done in this worklog, this worklog will fix

== Implement F-4: enable it by default ==

When only_full_group_by mode has been improved, we should enable it by default.
A few blogs recommend using this mode:
https://github.com/datamapper/dm-core/issues/69 "Add ONLY_FULL_GROUP_BY to
sql_mode in do_mysql"
"ONLY_FULL_GROUP_BY,STRICT_ALL_TABLES' is the absolute minimum we’d recommend"

It would also avoid bug reports like BUG#68254.
We should also add this mode to sql_mode=ansi (it used to be there, see BUG#8510).

== Future directions ==

The logic developed by this worklog (recognizing functional dependencies of a
set of table columns) could be reused to implement the following optimization.
If a GROUP BY clause is:
GROUP BY column1, expr2, expr3 ...
we could see if exprX is functionally dependent on column1, and if it is, then
we could remove exprX from the clause. This would make a simpler clause, which
could be resolvable with an index method instead of a temporary table. Note that
column1 must be before exprX, in the clause.
This optimization may also apply to ORDER BY. See also the eq_ref_table()
function, which has the same idea.
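
A minimal sketch of that pruning, in Python and for illustration only. It assumes the functional dependencies are already known and handed in as a map (the hypothetical parameter determines), handles plain columns rather than arbitrary expressions, and does not compute transitive dependencies. It respects the ordering constraint: a column is dropped only if an earlier kept column determines it.

```python
def prune_group_by(group_columns, determines):
    """Drop GROUP BY columns that depend on an earlier column of the clause.

    group_columns: column names, in clause order.
    determines: assumed input, {column: set of columns it determines}.
    """
    kept = []
    known = set()  # columns whose value is fixed by the columns kept so far
    for col in group_columns:
        if col in known:
            continue  # redundant: an earlier column already fixes its value
        kept.append(col)
        known.add(col)
        known |= determines.get(col, set())
    return kept


# pk determines a and b, so only pk needs to stay.
pruned = prune_group_by(["pk", "a", "b"], {"pk": {"a", "b"}})
# Here a comes before pk, so nothing can be removed.
unchanged = prune_group_by(["a", "pk"], {"pk": {"a"}})
```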

In current trunk, the only_full_group_by code is scattered over:
- several functions of name resolution: some which collect info, some which
reject query: resolve_ref_in_select_and_group(), Item_field::fix_outer_field(),
Item_field::fix_fields(), Item_sum::check_sum_func(),
- JOIN::prepare() (rejects query)
- setup_without_group() (uses collected info above and rejects query).
- a hack in Item_in_subselect::single_value_transformer()

All this will be centralized in some "walk" over the item trees during
JOIN::prepare(), which will check and reject the query.

This makes it possible to remove scattered code, and some members of SELECT_LEX ("collected
List<Item_field> non_agg_fields;
bool m_non_agg_field_used, int cur_pos_in_all_fields,
non_agg_field_used(), set_non_agg_field_used()
and this hack in sql_partition.cc and setup_without_group():
const bool save_agg_field= thd->lex->current_select->non_agg_field_used();

The new method will be the following. In JOIN::prepare(), a function
check_only_full_group_by() is called which calls
aggregate_check::Group_check::check_query(), which enumerates the
SELECT/HAVING/ORDER expressions, and for each such expression:
- if it's equal to a GROUP BY expression, okay
- otherwise walk it (ignoring parts under any_value) and for each found column
reference in it:
  - if this column is in GROUP BY, ok
  - otherwise, if this column is functionally dependent on columns in GROUP BY, ok
  - otherwise error.
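
The per-expression check listed above can be sketched as follows. This is Python for illustration, not the server's item-tree code: expressions are modelled as nested tuples, offending_columns() is a hypothetical name, and the functional dependency test is passed in as a black-box callback. Skipping the arguments of aggregates is an assumption of this sketch, based on aggregates being allowed in selected expressions.

```python
def offending_columns(expr, group_exprs, is_dependent):
    """Return the column references that make expr fail the check.

    expr is a nested tuple: ("col", name), ("lit", value),
    ("func", name, args), ("agg", name, arg) or ("any_value", arg).
    is_dependent(name) tells whether a column is functionally
    dependent on the GROUP BY columns.
    """
    if expr in group_exprs:
        return []  # identical to a GROUP BY expression
    kind = expr[0]
    if kind in ("lit", "agg", "any_value"):
        # Literals are constant; parts under any_value are ignored;
        # aggregate arguments are skipped (assumption of this sketch).
        return []
    if kind == "col":
        return [] if is_dependent(expr[1]) else [expr[1]]
    bad = []  # kind == "func": walk every argument
    for arg in expr[2]:
        bad.extend(offending_columns(arg, group_exprs, is_dependent))
    return bad


group_by = [("col", "a")]
no_fd = lambda name: False
ok = offending_columns(("func", "+", (("col", "a"), ("lit", 1))), group_by, no_fd)
rejected = offending_columns(("func", "+", (("col", "a"), ("col", "b"))), group_by, no_fd)
forced = offending_columns(("any_value", ("col", "b")), group_by, no_fd)
```
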
Functional dependencies of GROUP BY columns are determined only if needed above,
with this algorithm: we start with the set of GROUP BY columns, search for
primary/unique-not-null keys in this set (any such key allows adding all
columns of its table to the set), and search for equalities involving this set
(we can then add the other member of the equality to the set). This way we grow
the set until we find the desired column (the one currently inspected in the
current SELECT/HAVING/ORDER expression), or until the set stops growing (error).
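
The set-growing search can be sketched as a fixed-point loop. The Python below is an illustration under simplifying assumptions, not the actual implementation: determined_columns() and its parameters are hypothetical, columns are plain "table.column" strings, and the restrictions that apply to equalities in outer joins are ignored. Unlike the description above, it computes the whole set instead of stopping as soon as the desired column is found.

```python
def determined_columns(group_columns, unique_keys, table_columns, equalities):
    """Grow the set of columns whose value is fixed within a group.

    group_columns: set of "table.column" names from GROUP BY.
    unique_keys: {table: list of keys}, each key a set of column names
                 (primary or unique-not-null keys only).
    table_columns: {table: set of all column names of that table}.
    equalities: list of (column, column) pairs from WHERE / join conditions.
    """
    known = set(group_columns)
    grew = True
    while grew:
        grew = False
        # A fully covered key determines every column of its table.
        for table, keys in unique_keys.items():
            if any(key <= known for key in keys) and not table_columns[table] <= known:
                known |= table_columns[table]
                grew = True
        # An equality lets the other member join the set.
        for left, right in equalities:
            if (left in known) != (right in known):
                known |= {left, right}
                grew = True
    return known


keys = {"customer1": [{"customer1.id"}], "customer2": [{"customer2.id"}]}
cols = {
    "customer1": {"customer1.id", "customer1.name"},
    "customer2": {"customer2.id", "customer2.address"},
}
with_equality = determined_columns(
    {"customer1.id"}, keys, cols, [("customer1.id", "customer2.id")]
)
without_equality = determined_columns({"customer1.id"}, keys, cols, [])
```

With the equality, customer2.address ends up in the set, so selecting it is accepted; without it, the set stops growing at the columns of customer1 and the query is rejected.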

The above was for checking clauses against GROUP BY (F-2 and F-5 in the HLS).
F-6 in the HLS (DISTINCT vs ORDER BY) is simpler and will also be done with a
walk (which was already partly reviewed): check_only_full_group_by() calls