WL#2489: Better ONLY_FULL_GROUP_BY mode

Status: Complete

The existing ONLY_FULL_GROUP_BY mode aims at protecting the user by rejecting
non-deterministic-result queries which contain GROUP BY or aggregate functions.
Example of rejected query: 'SELECT a,b FROM t1 GROUP BY a' (what value of 'b'
could be returned for one group having identical values of 'a'?).

GOALs/MEANS of this WL:

G-1: adress community complains in:
 and BUG#51058
M-1: make this mode more permissive (like described in SQL2011) while still
rejecting non-deterministic queries. 

G-2: make code simpler to understand and maintain.
M-2: refactor it.

M-3: side-effect of refactoring

  DISTINCT A ORDER BY B" which causes false positives in QA
M-4: reuse machinery developed for the rest of this WL

G-5: protect more users from non-deterministic queries, as suggested in
https://github.com/datamapper/dm-core/issues/69 "Add ONLY_FULL_GROUP_BY to
sql_mode in do_mysql"
"ONLY_FULL_GROUP_BY,STRICT_ALL_TABLES' is the absolute minimum we’d recommend"
M-5: enable ONLY_FULL_GROUP_BY by default (users would still be able to
turn it off).

G-6: reduce our future work.
M-6: by enabling the mode by default in the future, we will avoid that users
file bug reports about non-deterministic results like BUG#68254. See M-5.

G-7: improve our compatibility with the SQL standard.
M-7: add this mode to sql_mode=ansi.
PREWL: trunk before this worklog is pushed
POSTWL: trunk after this worklog is pushed
ON: only_full_group_by mode on
OFF: only_full_group_by mode off

Functional requirements.

F-1: queries accepted by PREWL+ON should be accepted by POSTWL+ON, unless it can
be proven that it was a bug in PREWL+ON to accept them.

F-2: more queries should be accepted by POSTWL+ON than by PREWL+ON, following
the prescriptions of SQL 2011. The essence of those prescriptions is to accept a
query when the selected expressions are guaranteed uniquely determined inside a
group. In detail, these queries should be accepted:
F-2.1 ("primary key"):
select t1.pk, t1.a+t1.c from t1 group by t1.pk;
Same holds if the primary key is multi-column, as long as all its columns are in
the GROUP list. Same holds if this is a unique key over only non-nullable columns.
F-2.2 ("equality in WHERE"):
select t1.a, t2.b*5 from t1 where t1.a=t2.b group by t1.a;
F-2.3 ("equality in join condition"):
cross join, inner join, natural join; outer join (with restrictions)
F-2.4 ("views and derived tables"):
select d.x,d.y from (select t2.a as x, t2.a*5 as y from t2) d
group by d.x; 
F-2.5: all combinations of the above, of any complexity, for example:
select d.x,d.y from (select t2.pk as x, t2.a as y from t2) d
group by d.x; 
(= F-2.1 + F-2.4)
A copy of the relevant SQL2011 section can be provided on demand.

F-3: a new function any_value(arg) is be introduced. It has the same type
and return value as its argument, and is not checked by only_full_group_by. It
will allow people to force MySQL to accept queries which MySQL thinks should be
rejected. Example:
select t1.b from t1 group by t1.a; where t1.a is not indexed:
nothing guarantees determinism, MysQL will reject this, user can force acception:
select any_value(t1.b) from t1 group by t1.a;
Specification of any_value(arg):
* in grouped query, returns one value freely chosen among the group
* in implicitely grouped query (with aggregation), returns
one value freely chosen among the relation or NULL if relation is empty
* if used in ORDER BY in query with DISTINCT, returns one value freely chosen in
the group of distinct expressions.
* if used in a non-aggregated context, like
  select * from t where any_value(t.col)=3;
  select any_value(t.col) from t;
any_value() returns the value of its argument, like if it would be
* any_value() does not make a query aggregated, unlike real aggregate functions
  select any_value(t.col) from t;
is not an aggregated query;
select any_value(t.col),sum(t.col2) from t;

F-4: encourage users to use only_full_group_by: add only_full_group_by to the
default value of @@sql_mode. Add it to sql_mode=ansi.

F-5: only_full_group_by should allow aliases of selected expressions to be used
in HAVING condition.

F-6: bugs listed in HLD should be fixed.

Nonfunctional requirements.

NF-1: PREWL+OFF and POSTWL+OFF should run all queries at identical speed.

NF-2: PREWL+OFF and POSTWL+ON should run queries without DISTINCT, GROUP BY,
aggregates at identical speed.

NF-3: queries with DISTINCT, GROUP BY, aggregates should get a performance
penalty lower than 2% in POSTWL+ON compared to PREWL+ON (here I envision the
possibility that the re-factoring, by having dedicated "item tree walks" instead
of piggybacking on fix_fields (), could make the query slower).
Behavior of ONLY_FULL_GROUP_BY mode should be improved.

== Implement F-2) make it less strict about selected/order expressions ==

Currently this mode requires that every selected expression is either:
- identical to one group expression,
- or a function of aggregates, outer references, literals, group columns; for
example, it allows:
select a+1 from t1 group by a+1; (selected expr == group expr)
select a+b from t1 group by b,a; (a+b function of group columns)
Recent versions of the SQL standard introduce the optional feature ''functional
dependencies'' (T301):
- if this feature is not supported, the standard prescribes a behavior which is
that of only_full_group_by above.
- if this feature is supported,  the standard prescribes that more queries
should be allowed, for example:
select a from t1 group by pk2,pk1;
where (pk1,pk2) is the primary key (or unique not null constraint) of the table.
And also:
select t2.a from t1,t2 where t2.a=t1.a group by t1.a;
In those two queries, the value of the selected expression is constant over a group.
The standard has several pages describing exactly what functional dependencies
should be recognized, including those in outer joins.

The case about primary keys was mentioned in one blog in the community
( http://rpbouman.blogspot.co.uk/2007/05/debunking-group-by-myths.html ).
The case about equalities in WHERE was not. It could be useful if one table
is split into two (say, the user has split table "customer" : moved rarely used
parts of the row in a separate table "customer2", to make the frequently used
parts a smaller table "customer1"):
select customer1.id, customer2.address, sum(invoice.pricepaid)
from customer1, customer2, invoice
where customer1.id=customer2.id
and customer1.id=invoice.customer_id
group by customer1.id;
Assuming that customer1.id and customer2.id are primary keys, by looking at the
WHERE we can see that each group has a unique value of customer2.id and thus a
unique value of customer2.address, so the query is valid according to the standard.

The workarounds which people can use today are:
- add the functionally dependent fields to the GROUP BY list; drawback: it
prevents the Optimizer from doing GROUP BY with a simple index scan (adds a
filesort or temporary table), making the query slower.
- Put the functionally dependent field inside a dummy MAX(), making it look like
an aggregate; drawback: makes the query a little slower (needs to manage the
aggregate), and if the schema changes (some field is made non-unique), the
functional dependency goes away (randomness possibly comes back), but the MAX()
keeps the query legal, which is probably not what the user would have wanted.

Implementation of functional dependency recognition: see F-2.* .

According to the public documentation:
- Oracle, SQL Server, Sybase do not have T301
- PostgreSQL recognizes functional dependencies on a primary key.

References to public documentation:
- Oracle 11, see "Restrictions on the Select List" in
- SQL Server 12:
"Each table or view column in any nonaggregate expression in the