MySQL :: MySQL 5.7: only_full_group_by Improved, Recognizing Functional Dependencies, Enabled by Default!

Here’s an account of recent work which has kept me busy and excited for a few months. For those unfamiliar with the only_full_group_by sql mode, let me provide some context. I’ll use the world database, which can be downloaded from this MySQL Documentation page. You can find details on how to install the database on the same page. We have a table of countries and a table of languages spoken in each country. It’s a 1-to-N relationship: a country can have many languages so CountryLanguage.CountryCode is a foreign key referencing Country.Code:

mysql> desc Country;
+----------------+---------------------------------------------------------------------------------------+------+-----+---------+-------+
| Field          | Type         | Null | Key | Default | Extra |
+----------------+---------------------------------------------------------------------------------------+------+-----+---------+-------+
| Code           | char(3)      | NO   | PRI |         |       |
| Name           | char(52)     | NO   |     |         |       |
...

mysql> desc CountryLanguage;
+-------------+---------------+------+-----+---------+-------+
| Field       | Type          | Null | Key | Default | Extra |
+-------------+---------------+------+-----+---------+-------+
| CountryCode | char(3)       | NO   | PRI |         |       |
| Language    | char(30)      | NO   | PRI |         |       |
... (I omitted non-relevant columns)

mysql> desc Country;

+----------------+---------------------------------------------------------------------------------------+------+-----+---------+-------+

+----------------+---------------------------------------------------------------------------------------+------+-----+---------+-------+

| Code | char(3) | NO | PRI | | |

...

mysql> desc CountryLanguage;

+-------------+---------------+------+-----+---------+-------+

+-------------+---------------+------+-----+---------+-------+

| CountryCode | char(3) | NO | PRI | | |

| Language | char(30) | NO | PRI | | |

... (I omitted non-relevant columns)

Here are two sample rows from the Country table:

+------+----------------------------------------------+
| Code | Name                                         |
+------+----------------------------------------------+
| BEL  | Belgium                                      |
| FRA  | France                                       |

+------+----------------------------------------------+

| Code | Name |

+------+----------------------------------------------+

| BEL | Belgium |

| FRA | France |

And corresponding rows in CountryLanguage:

+-------------+------------+
| CountryCode | Language   |
+-------------+------------+
| BEL         | Arabic     |
| BEL         | Dutch      |
| BEL         | French     |
| BEL         | German     |
| BEL         | Italian    |
| BEL         | Turkish    |
| FRA         | Arabic     |
| FRA         | French     |
| FRA         | Italian    |
| FRA         | Portuguese |
| FRA         | Spanish    |
| FRA         | Turkish    |

+-------------+------------+

| CountryCode | Language |

+-------------+------------+

| BEL | Arabic |

| BEL | Dutch |

| BEL | French |

| BEL | German |

| BEL | Italian |

| BEL | Turkish |

| FRA | Arabic |

| FRA | French |

| FRA | Italian |

| FRA | Portuguese |

| FRA | Spanish |

| FRA | Turkish |

Consider this grouped SELECT:

SELECT cl.Language, cl.CountryCode
FROM CountryLanguage AS cl
GROUP BY cl.Language;

SELECT cl.Language, cl.CountryCode

FROM CountryLanguage AS cl

GROUP BY cl.Language;

This SELECT groups by “Language”, so the first thing it has to do is “segregate” rows by language:

+----------+-------------+
| Language | CountryCode |
+----------+-------------+
| French   | BEL         |
| French   | FRA         |
...
-- group for French ends here ---
| Spanish  | ABW         |
| Spanish  | AND         |
...
-- group for Spanish ends here --
...

+----------+-------------+

| Language | CountryCode |

+----------+-------------+

| French | BEL |

| French | FRA |

...

-- group for French ends here ---

| Spanish | ABW |

| Spanish | AND |

...

-- group for Spanish ends here --

...

Then it must collapse (“aggregate”) all rows of the same group to form one row; this row should have, according to the columns listed in the SELECT, a value for “Language” and a value for “CountryCode”. For the “French” group, the value for “Language” is obviously “French”; but what is the value for “CountryCode”: “BEL” or “FRA” or something else? That’s non-deterministic, and that means the query is likely nonsensical.

The SQL-92 Standard solved this non-determinism by requiring that all selected columns be part of the GROUP BY clause. So it would then reject the example query above.

Historically, MySQL took a different route: it allowed the query above, and picked one value of “CountryCode” among the group. So you may obtain
(French, BEL), (Spanish, ABW)
or
(French, FRA), (Spanish, AND)
depending on MySQL’s implementation details, the access paths chosen to read the table, and subject to change in future versions; so in practice it was simply unpredictable.

In other words, users were allowed to execute non-standard SQL queries and it was up to them to think and decide whether the query was non-sensical or not. That made it far too easy for users to make mistakes that resulted in incorrect data presented to the user/application. Imagine that the user meant this reasonable query:

SELECT cl.Language, COUNT(cl.CountryCode)
FROM CountryLanguage AS cl
GROUP BY cl.Language;

SELECT cl.Language, COUNT(cl.CountryCode)

FROM CountryLanguage AS cl

GROUP BY cl.Language;

tin order to know how many countries speak each language. Alas he/she forgot the COUNT word, writing just “(cl.CountryCode)”, and the query accidentally became the bad one which we saw earlier where we produce incorrect and unpredictable results. MySQL didn’t raise any error or warning here, which is not being helpful at all in this scenario.

Users wary of this “dangerous permissive behaviour” would prefer MySQL to complain here, that’s why the only_full_group_by flag was added to the sql_mode server variable years ago. When this sql_mode was set, then MySQL rejected the bad query by implementing the behaviour prescribed by the SQL-92 standard. Here’s a test to demonstrate this behavior in MySQL 5.6:

mysql> SELECT cl.Language, cl.CountryCode FROM CountryLanguage AS cl GROUP BY cl.Language;
[ nonsensical random result here, no error ]

mysql> SET SQL_MODE='only_full_group_by';

mysql> SELECT cl.Language, cl.CountryCode FROM CountryLanguage AS cl GROUP BY cl.Language;
ERROR 1055 (42000): 'test.cl.CountryCode' isn't in GROUP BY

mysql> SELECT cl.Language, cl.CountryCode FROM CountryLanguage AS cl GROUP BY cl.Language;

[ nonsensical random result here, no error ]

mysql> SET SQL_MODE='only_full_group_by';

mysql> SELECT cl.Language, cl.CountryCode FROM CountryLanguage AS cl GROUP BY cl.Language;

ERROR 1055 (42000): 'test.cl.CountryCode' isn't in GROUP BY

See: MySQL 5.6 requires that all SELECTed columns be in the GROUP BY clause; “CountryCode” is not, so an error message signals it.

Using this new safe GROUP BY mode was rightfully advertised in the manual, and by users such as http://mechanics.flite.com/blog/2013/02/12/why-i-use-only-full-group-by/
or http://codeascraft.etsy.com/2013/03/19/the-perils-of-sql_mode/.

Still, this safe mode wasn’t enabled by default. I wasn’t around at that time (as it was more than twelve years ago), but I can speculate that there were the following obstacles:

Certain non-standard queries can be perfectly reasonable; for example this one, which wants a list of country names and how many languages are spoken in each country:

Valid SELECT

MySQL

SELECT co.Code, co.Name, COUNT(cl.Language) FROM Country AS co JOIN CountryLanguage AS cl ON cl.CountryCode = co.Code GROUP BY co.Code;

1

2

3

4

5

SELECT co.Code, co.Name,
COUNT(cl.Language)
FROM Country AS co JOIN CountryLanguage AS cl
ON cl.CountryCode = co.Code
GROUP BY co.Code;

In a group of rows of the same co.Code value, all rows have the same co.Name because co.Code is a primary key of the Country table. So co.Name is deterministic, but it’s not in the GROUP BY clause so SQL-92 and 5.6’s only_full_group_by mode rejected it.
So enabling the mode would make some users’ life unduly difficult (the kind of user who “knows what they are doing”); they would have to put all selected columns in the GROUP BY clause, thus making this clause more complex and thus increasing the chances that MySQL switches from an index scan to a table scan plus a filesort and a temporary table, thus making the query far slower.
The Standard itself evolved to accomodate these reasonable queries! Starting from SQL-99, the concept of “functional dependency” was introduced; a column B is functionally dependent on columns {A1, A2} in a result set if all rows which share the same value of A1 and the same value of A2 necessarily share the same value of B. Then SQL-99 accordingly modifies the constraints on GROUP BY: by default it keeps the same rules as in SQL-92, but it allows a DBMS to optionally use a relaxed rule, specifically that all columns in the SELECT list merely need to be functionally dependent on the GROUP BY columns. That is known as “SQL optional feature T301” and it guarantees that inside a group of rows, all selected columns have identical values, thus making the query’s result deterministic. That is exactly the case of the last query we showed where “co.Name” is determined by “co.Code”. Of course the old SQL-92 rule satisfies the relaxed rule: if all selected columns are part of the GROUP BY clause then they are functionally dependent on the GROUP BY columns. So we can say that the relaxed rule allows all SQL-92 legal queries, plus some new ones, while preserving determinism.

That was well explained by my former colleague Roland Bouman in
http://rpbouman.blogspot.co.uk/2007/05/debunking-group-by-myths.html .

For us then, it was a problem. The safe mode, only_full_group_by, was too restrictive and thus we could not make it the default. But that also meant that our innocent users were left prey to the potentially dangerous default mode. Once in a while a user would come to us with a strange result and we would have to explain, once again, that it’s a non-deterministic query and she/he should try only_full_group_by.
Another headache was that our QA team, who runs many thousands of automatically generated queries every day, was also experiencing this non-determinism; query result differences between, say, MySQL 5.5 and 5.6, would be flagged as regressions, though it was just a non-deterministic query with a GROUP BY. To avoid these false positives, QA would be running with only_full_group_by, which alas limited them to queries which were only a subset of what our users were allowed to do with the default permissive setting. So we were losing some testing depth, which was obviously not desirable.

Good news — those days are now over!

In MySQL 5.7.5, I have made only_full_group_by detect functional dependencies, which can be inferred not only from keys’ definitions as in the example above, but also from column equalities in the WHERE clause or [LEFT] JOIN ON clauses and from selected expressions in a view’s or derived table’s body. It was not an easy job, and it took a while to get it right, especially with outer joins and with views. Along the way, I even found a couple of small bugs in the relevant section of the SQL-2011 Standard, and thanks to the help from a colleague who has access to the Standard’s redaction committee, those bugs should be corrected in the next version of the SQL Standard.

The different types of functional dependencies are best explained with examples; so together with my colleague Paul DuBois we put them in a new section within the 5.7 manual. I really encourage you to read it as it shows well when columns are determined by other columns and when they are not, in spite of sometimes deceptive appearances! We used the world database there too, and our queries are understandable — so please check it out!

And, for a higher-level view, this is the master page which describes the new behaviour of only_full_group_by: http://dev.mysql.com/doc/refman/5.7/en/group-by-handling.html .

Because this mode is now smarter than before, there is very little reason not to use it. So I have made it a default in 5.7.5 and later. This means that some old queries might return an error when upgrading. Should this happen, please check them; quite possibly they are non-deterministic and should be corrected (remember the COUNT() example earlier in this post). If they are deterministic, and MySQL simply missed it, or if you really want non-determinism, a new function called ANY_VALUE has been added to force MySQL to accept the intentionally/knowingly non-deterministic query. Of course, you can also take the more brutal approach and turn only_full_group_by off in your session or globally.

Several prominent DMBSs that I know continue using the SQL-92 rule, while another one detects functional dependencies due to keys only. I think MySQL is now the DBMS with the most complete implementation, inferring functional dependencies from keys, equalities, and in views. We don’t have everything that’s described in SQL-2011, but we’re not far.

Implementing this feature was also an occasion to refactor the relevant code. It used to be scattered in several functions, accumulating pieces of information as resolution of the query progressed (in Item::fix_fields() calls), and drawing a conclusion in a final phase of resolution. Consolidating all this into one single entity of code, which looks only at the fully resolved query, made things clearer and automatically fixed an obscure bug with views. The main code file is sql/aggregate_check.h and it has an extensive Doxygen multi-page comment to explain the logic.

Here’s another eliminated annoyance: in only_full_group_by mode, using an alias of a SELECT list element in the HAVING clause was forbidden. That was known as http://bugs.mysql.com/bug.php?id=51058, which I have also now fixed as part of this work. You can now do this in only_full_group_by mode:

SELECT cl.Language, COUNT(cl.CountryCode) AS C
FROM CountryLanguage AS cl
GROUP BY cl.Language
HAVING C >= 5;  # Show countries with at least 5 languages

SELECT cl.Language, COUNT(cl.CountryCode) AS C

FROM CountryLanguage AS cl

GROUP BY cl.Language

HAVING C >= 5; # Show countries with at least 5 languages

That’s far more readable than copying the full COUNT() expression into the HAVING clause and it’s also a bit faster to execute.

There was one last fix done as part of this Worklog. That is having to do with some rare non-grouped queries of the form SELECT DISTINCT ... ORDER BY also being non-deterministic, but this post is already long so I’ll discuss that in a subsequent post. Stay tuned!

That’s all for now. I really hope that this new feature makes users’ lives easier and improves their experience with MySQL! We also look forward to your feedback on this new feature! If you have any questions or encounter any bugs, please do let us know by opening a support ticket or filing a bug. As always, THANK YOU for using MySQL!