MySQL 8.4.2
Source Code Documentation
Aggregate checks of ONLY_FULL_GROUP_BY

Checks for some semantic constraints on queries using GROUP BY, or aggregate functions, or DISTINCT (ONLY_FULL_GROUP_BY). More...

Classes

class  mem_root_deque< Element_type >
 A (partial) implementation of std::deque allocating its blocks on a MEM_ROOT. More...
 
class  Distinct_check
 Checks for queries which have DISTINCT. More...
 
class  Group_check
 Checks for queries which have GROUP BY or aggregate functions. More...
 

Functions

bool Distinct_check::check_query (THD *thd)
 Rejects the query if it has a combination of DISTINCT and ORDER BY which could lead to randomly ordered results. More...
 
bool Group_check::check_query (THD *thd)
 Rejects the query if it does aggregation or grouping, and expressions in its SELECT list, ORDER BY clause, HAVING condition, or window functions may vary inside a group (are not "group-invariant"). More...
 
bool Group_check::check_expression (THD *thd, Item *expr, bool in_select_list)
 Validates one expression (this forms one step of check_query()). More...
 
bool Group_check::is_fd_on_source (Item *item)
 Tells if 'item' is functionally dependent ("FD") on source columns. More...
 
void Group_check::add_to_fd (Item *item, bool local_column, bool add_to_mat_table=true)
 
void Group_check::find_group_in_fd (Item *item)
 This function must be called every time we discover an item which is FD on source columns, or add a bit to whole_tables_fd; it maintains group_in_fd. More...
 
ItemGroup_check::select_expression (uint idx)
 
void Group_check::add_to_source_of_mat_table (Item_field *item_field, Table_ref *tl)
 If we just added a column of a materialized table to 'fd', we record this fact in a new Group_check (mat_gc) for the query expression underlying that table. More...
 
bool Group_check::is_in_fd (Item *item)
 is_in_fd() is low-level compared to is_fd_on_source(). More...
 
bool Group_check::is_in_fd_of_underlying (Item_ident *item)
 See if we can derive a FD from a column which has an underlying expression. More...
 
ItemGroup_check::get_fd_equal (Item *item)
 
void Group_check::analyze_conjunct (Item *cond, Item *conjunct, table_map weak_tables, bool weak_side_upwards)
 Searches for equality-based functional dependences in an AND-ed part of a condition (a conjunct). More...
 
void Group_check::analyze_scalar_eq (Item *cond, Item *left_item, Item *right_item, table_map weak_tables, bool weak_side_upwards)
 Helper function. More...
 
void Group_check::find_fd_in_cond (Item *cond, table_map weak_tables, bool weak_side_upwards)
 Searches for equality-based functional dependencies in a condition. More...
 
void Group_check::find_fd_in_joined_table (mem_root_deque< Table_ref * > *join_list)
 Searches for equality-based functional dependencies in the condition of a join nest, and recursively in all contained join nests. More...
 
void Group_check::to_opt_trace (THD *thd)
 Writes "check information" to the optimizer trace. More...
 
void Group_check::to_opt_trace2 (Opt_trace_context *ctx, Opt_trace_object *parent)
 Utility function for to_opt_trace(), as we need recursion in children Group_checks. More...
 
bool Group_check::do_ident_check (Item_ident *i, table_map tm, enum enum_ident_check type)
 Does one low-level check on one Item_ident. More...
 

Variables

static const enum_walk walk_options
 We need to search for items inside subqueries, in case subqueries contain outer references to tables of a query block having DISTINCT or GROUP BY. More...
 

Detailed Description

Checks for some semantic constraints on queries using GROUP BY, or aggregate functions, or DISTINCT (ONLY_FULL_GROUP_BY).

We call "aggregation" the operation of taking a group of rows and replacing it with a single row. There are three types of aggregation: DISTINCT, implicit grouping, explicit grouping.

This text describes MySQL's checks (why certain queries are rejected) and the rationale behind them.

References:

DISTINCT

DISTINCT: all identical rows in the result of SELECT are "aggregated" to one single row - duplicates are eliminated. If the result of the SELECT without DISTINCT is

1 2
3 4
1 2

then the result with DISTINCT is

1 2
3 4

Here is a problematic query. Assume we have a table T which contains three columns C1,C2,C3 and has the following rows:

C1 C2 C3
1  2  A
3  4  B
1  2  C

and we do "SELECT DISTINCT C1, C2 FROM T ORDER BY C3". Because we want the result to be ordered, we have to first eliminate duplicates then order the result. When we eliminate duplicates, should we keep the first or the third row? This arbitrary choice will influence the retained value of C3, which will influence ordering. In the end, we have arbitrary ordering, which is problematic. To prevent this, if, in a query block 'sl', an ORDER BY expression is not the same expression as one in the SELECT list of 'sl' and contains a column which:

then 'sl' should not have DISTINCT. This rule makes the query above invalid. Class Check_distinct implements this rule.

Class Check_distinct implements a second rule, specific of set functions in ORDER BY (a non-standard feature). Consider these queries labelled (1), (2), (3), with DISTINCT and a set function in ORDER BY:

SELECT DISTINCT MIN(C2) FROM T GROUP BY C1 ORDER BY MIN(C3);
SELECT DISTINCT MIN(C2) FROM T GROUP BY C1 ORDER BY MIN(C2);
SELECT DISTINCT C1, MIN(C2) FROM T GROUP BY C1 ORDER BY MIN(C3);

(1) is a random-order query, (2) and (3) are not. MySQL has traditionally been permissive, we want to allow (2) and (3).

So the second rule is that Check_distinct rejects a query if it has DISTINCT and a set function in ORDER BY which is not the same expression as one in the SELECT list. This rejects (1) and allows (2). It would reject (3); luckily, before Check_distinct checks, DISTINCT is removed (it is redundant with GROUP BY) and thus the query is not processed by Check_distinct; the second rule is thus by-passed and (3) is correctly accepted.

The implementation of Check_distinct works like this: if the query has DISTINCT, examine each element of the ORDER BY list: if the element is not the same expression as one in the SELECT list, then examine parts of the element (using Item::walk()), to spot offending set functions or columns.

Implicit grouping

A query with set functions without GROUP BY can be seen as having GROUP BY() i.e. the set of grouping expressions is empty, all rows are part of a single group and are replaced with a single row. So it is just a sub-case of the next section.

Explicit grouping (GROUP BY)

MySQL is more permissive than the standard, it allows to group by any expression; it does not require that every element of the GROUP BY clause should be a column of one table of the FROM clause.

Here is a problematic query, using a table T with columns C1, C2, C3:

C1 C2 C3
1  2  A
3  4  B
1  2  C

SELECT C1, C3 FROM T GROUP BY C1;

we first form groups, in each group the value of C1 must be the same for all rows. Consider the group made of the first and third row. We have to produce one row out of it, and this row must have a value for C3 because C3 is in the SELECT list. Among those two rows, which value of C3 should we choose? This is arbitrary, and will give a random result.

To prevent this, in a query with GROUP BY or aggregates (known as "a grouped query"), any column referenced by an expression in the SELECT list or HAVING condition and belonging to one table of the FROM clause, should be one of the group columns (enforced by only_full_group_by in MySQL 5.6 and older) or, if functional dependencies are supported (as in MySQL 5.7), should be functionally dependent on the group columns. In the table T above, C3 is obviously not functionally dependent on {C1,C2}: the values of C1 and C2 for a row do not uniquely determine the value of C3. In other words, if two rows have the same value in C1 and the same value in C2 they do not necessarily have the same value in C3. So this rule correctly rejects the query above.

Note that NULL is treated as one value: two NULLs are considered equal.

In WL#2489, we have implemented the optional standard feature T301 "functional dependencies" almost entirely.

Here are the functional dependencies which we recognize.

Key-based, in a base table

A key in this text is a unique constraint made of only non-NULLable columns. For example, a primary key. Considering a base table T, if two rows have the same values of all columns of a key of T they are actually one single row, so: { all columns of key} -> { T.* } (notation: what is on the right of the arrow is functionally dependent on what is on the left).

Generated-column-based, in a base table

Considering a base table T, a generated column is functionally dependent on the set of columns it references (the so-called parametric columns). Note that the SQL standard doesn't allow a parametric column to be a generated column, but MySQL does.

Equality-based, in the result of an inner join

Considering two tables T1 and T2, a condition C, and the result R of an inner join T1 JOIN T2 ON C. Note that T1/T2 are not necessarily base tables. For example, they can also be views, or derived tables, or the results of some joins; in the rest of text, we use the vague term "table" for those various objects.

Note that C may only refer to columns of T1 or T2 (outer references are forbidden by MySQL in join conditions).

Assuming that C is a conjunction (i.e. is made of one or more conditions, "conjuncts", chained together with AND):

Equality-based, in the result of an outer join

Considering the result R of TS LEFT JOIN TW ON C. Assuming that C is a conjunction. TS is said to be the strong table, TW is said to be the weak table (the one which might be NULL-complemented in the result of this join). To make this really clear, note that, if we have "t1 left join (t2 left join t3 on C) on D":

If C is deterministic and one conjunct is of the form TW.A=constant or TW.A=TS.B, then DJS -> {TW.A} holds in R, where DJS is the set of all columns of TS referenced by C. Proof: consider in R two rows r1 and r2 which have the same values of DJS columns. Consider r1. There are two possibilities when r1 was built from a row of TS:

If one conjunct is of the form TW.A=TW.B then {TW.A} -> {TW.B} holds in R Proof: consider r1 and r2 two rows in R having the same value of TW.A. Two cases are possible:

If any conjunct references column T1.A and is not true if T1.A is NULL (e.g. conjunct is T1.A > 3), and T1 is part of TW, we can replace T1 with (SELECT * FROM T1 WHERE T1.A IS NOT NULL) AS T1. We do not do such query rewriting, but a consequence is that we can and do treat T1.A as "known not nullable in T1". Proof: if we do this replacement, it means we remove from T1 some rows where T1.A is NULL. For the sake of the proof, let's broaden this to "we remove from and/or add to T1 rows where T1.A is NULL". By this operation,

Equality-based, in the result of a WHERE clause

Same rule as the result of an inner join. Additionally, the rule is extended to T1.A=outer_reference, because an outer reference is a constant during one execution of this query.

Moreoever, if any conjunct references column T1.A and is not true if T1.A is NULL (e.g. conjunct is T1.A > 3), we can treat T1.A as not nullable in T1. Proof: same as in previous section OUTEREQ. We can then derive FD information from a unique index on a nullable column; consider: create table t1(a int null, b int null, unique(a)); While select a,b from t1 group by a; is invalid (because two null values of 'a' go in same group), select a,b from t1 where 'a' is not null group by a; is valid: 'a' can be treated as non-nullable in t1, so the unique index is like "unique not null", so 'a' determines 'b'.

Below we examine how functional dependencies in a table propagate to its containing join nest.

Considering the result R of TS LEFT JOIN TW ON C.

All functional dependencies in TS are also functional dependencies in R. Proof: trivial. The same is not necessarily true for TW. Let's define a "NULL-friendly functional dependency" (NFFD) as a functional dependency between two sets A and B, which has two properties:

All NFFDs in TW are also NFFDs in R. Proof: consider an NFFD A -> B in TW, and r1 and r2 two rows in R having the same values of A columns. Two cases are possible:

The concept of an NFFD is Guilhem's invention. It was felt it was necessary, to have easy propagation of FDs from TW to R. It was preferred to the alternative, simpler rule which says that a functional dependency A-> B in TW is also a functional dependency in R if A contains a non-nullable column. There are two points to note:

Considering the result R of T1 JOIN T2 ON C.

All [NULL-friendly] functional dependencies in T1 are also [NULL-friendly] functional dependencies in R. the same is true for T2. Proof: trivial.

Summary rules for propagating FDs

All NULL-friendly functional dependencies propagate freely through join nests all the way up to the result of the WHERE clause. The same is true for ordinary functional dependencies except if there are weak tables along the propagation path between the table where the dependency holds and the result of the WHERE clause; in other words, except if the table where the dependency holds belongs to some embedding join nest which is on the weak side of an outer join.

Which functional dependencies are NULL-friendly

A key-based functional dependency A -> B in the base table is NULL-friendly, because, by definition, there can never be a NULL value in any column of A.

A functional dependency A -> B in a base table, between parametric columns A and a generated column B, is not NULL-friendly; for more details, see Functional dependencies in a view or a derived table .

A functional dependency A->B in the result of T1 JOIN T2 ON C, if based on equality of two columns, is NULL-friendly. Indeed, A is not empty and if there was some NULL in A, it would not match the equality in C and thus it would not exist in the result, absurd. If the equality is rather column=constant, A is empty, the dependency is not NULL-friendly. However, in our implementation, function simplify_joins() moves inner-join conditions to the embedding outer-join nest's join condition, or to the WHERE clause. Because our analysis of functional dependencies happens after simplify_joins(), when we analyze T1 JOIN T2 it is guaranteed to have no condition, and this paragraph is irrelevant.

A functional dependency in the result of TS LEFT JOIN TW ON C, if based on equality of two columns, is NULL-friendly. Proof: let's consider, in turn, the two types of equality-based functional dependencies which exist in this result R. Let r1 be a row of R.

A functional dependency in the result of a WHERE clause, if based on equality of two columns, is NULL-friendly. If based on T1.A=constant, it is not, as it has an empty set of source columns.

Summary: all functional dependencies which we have seen so far are NULL-friendly, except those inferred from TW.A=constant in an outer join condition or in a WHERE clause, and those about generated columns.

Thus, in the query with T1-T2-T3 previously seen, T3.PK->T3.A is NULL-friendly and propagates, query is accepted.

In our implementation, we take special care of TW.A=constant in an outer join condition: we infer a functional dependency DJS->TW.A from such equality only if one of these requirements are met:

Either of these two conditions is satisfied in most practical cases. For example, it's very common to have an equality between a strong-side column and a weak-side column as a conjunct in the outer join condition (like, "ON strong.pk = weak.foreign_key AND ..." or "ON strong.foreign_key = weak.pk AND ..."); this satisfies the second condition. It's also common to have outer joins only left-deep ("SELECT ... T1 LEFT JOIN T2 ON ... LEFT JOIN T3 ON ..." is left-deep); this satisfies the first condition. Note that the dependency found from TW.A=TS.B in an outer join condition always satisfies the second condition.

T1.A=constant in a WHERE clause is exterior to any join nest so does not need to propagate, so does not need to be NULL-friendly.

Functional dependencies in a view or a derived table

In the rest of this text, we will use the term "view" for "view or derived table". A view can be merged or materialized, in MySQL. Consider a view V defined by a query expression. If the query expression contains UNION or ROLLUP (which is theoretically based on UNION) there are no functional dependencies in this view. So let's assume that the query expression is a query specification (let's note it VS):

CREATE VIEW V AS SELECT [DISTINCT] VE1 AS C1, VE2 AS C2, ... FROM ... WHERE ...
[GROUP BY ...] [HAVING ...] [ORDER BY ...]

If {VE1, VE2, VE3} are columns of tables of the FROM clause, and {VE1, VE2} -> {VE3} has been deduced from rules in the previous sections [and is NULL-friendly], then {C1, C2} -> { C3 } holds in the view [and is NULL-friendly].

If {VE1, VE2} are columns of tables of the FROM clause, and VE3 is a deterministic expression depending only on VE1 and VE2, then {C1, C2} -> { C3 } in the view. It is not always NULL-friendly, for example: VE3 could be COALESCE(VE1,VE2,3): if VE1 (C1) and VE2 (C2) are NULL, VE3 (C3) is 3: not NULL. Another example: VE3 could be a literal; {}->{C3}, the left set is empty. The same examples apply to a generated column in a base table - it is like a merged view's expression. For example, in a base table T which has a generated column C3 AS COALESCE(C1,C2,3): {C1, C2} -> { C3 } holds in T but is not NULL-friendly.

If VS is a grouped query (which, in MySQL, implies that the view is materialized), then in the result of the grouping there is a functional dependency from the set of all group expressions to the set of all selected expressions (otherwise, this query expression would not have passed its own only_full_group_by validation - in the implementation we validate each query expression separately). Thus, if all group expressions of VS are in the select list of VS, for example they are VE1 and VE2, then {C1, C2} -> {V.*}. It is not NULL-friendly, for example: VE3 is COUNT(1): if the result of the WHERE clause contains a row with group expressions VE1 and VE2 equal to NULL, VE3 is not NULL.

It's possible to infer functional dependencies from equality conditions in HAVING, but we have not implemented it yet.

Because some functional dependencies above are not NULL-friendly, they exist in the view, but may not freely propagate in the result of join nests containing the view. This includes examples just given in paragraphs above, and the case of T1.A=constant in the WHERE clause of VS.

Thus, when we know of a functional dependency A -> B in the query expression of a view, we deduce from it a functional dependency in the view only if:

The above is fine for materialized views. For merged views, we cannot study the query expression of the view, it has been merged (and scattered), so we use a different rule:

Implementation note: used_tables_for_level() vs

used_tables()

Function Documentation

◆ add_to_fd()

void Group_check::add_to_fd ( Item item,
bool  local_column,
bool  add_to_mat_table = true 
)
private

◆ add_to_source_of_mat_table()

void Group_check::add_to_source_of_mat_table ( Item_field item_field,
Table_ref tl 
)
private

If we just added a column of a materialized table to 'fd', we record this fact in a new Group_check (mat_gc) for the query expression underlying that table.

This can later help us derive new functional dependencies in our Group_check. For example: select d.a, d.b from (select t.a*2 as a, t.a as b from t) group by d.b; When we add d.b to 'fd', in this function we create mat_gc, see that d.b is built from a column of t (t.b), we can say that "t.b is determined", so we add t.b to mat_gc.fd. Later, when we wonder if d.a is functionally dependent on d.b, we process d.a in is_in_fd_of_underlying(): we analyze 2*t.a in the context of mat_gc: 2*t.a is FD on t.a, we conclude that d.a is indeed FD on d.b.

Parameters
item_fieldcolumn of 'tl', just added to 'fd'
tlmat table

◆ analyze_conjunct()

void Group_check::analyze_conjunct ( Item cond,
Item conjunct,
table_map  weak_tables,
bool  weak_side_upwards 
)
private

Searches for equality-based functional dependences in an AND-ed part of a condition (a conjunct).

Search for columns which are known-not-nullable due to the conjunct.

Parameters
condcomplete condition
conjunctone AND-ed part of 'cond'
weak_tablesIf condition is WHERE, it's 0. Otherwise it is the map of weak tables in the join nest which owns the condition.
weak_side_upwardsIf condition is WHERE it's false. Otherwise it is true if the join nest owning this condition is embedded in the weak side of some parent outer join (no matter how far up the parent is).

◆ analyze_scalar_eq()

void Group_check::analyze_scalar_eq ( Item cond,
Item left_item,
Item right_item,
table_map  weak_tables,
bool  weak_side_upwards 
)
private

Helper function.

See also
analyze_conjunct().

◆ check_expression()

bool Group_check::check_expression ( THD thd,
Item expr,
bool  in_select_list 
)
private

Validates one expression (this forms one step of check_query()).

Parameters
thdcurrent thread
exprexpression
in_select_listwhether this expression is coming from the SELECT list.

◆ check_query() [1/2]

bool Distinct_check::check_query ( THD thd)

Rejects the query if it has a combination of DISTINCT and ORDER BY which could lead to randomly ordered results.

More precisely: if, in a query block 'sl', an ORDER BY expression

  • is not the same expression as one in the SELECT list of 'sl' (1)
  • and contains a column which: – is of a table whose qualifying query block is 'sl' (2) – and is not in the SELECT list of 'sl' (3) then 'sl' should not have DISTINCT.
Returns
true if rejected (my_error() is called)

◆ check_query() [2/2]

bool Group_check::check_query ( THD thd)

Rejects the query if it does aggregation or grouping, and expressions in its SELECT list, ORDER BY clause, HAVING condition, or window functions may vary inside a group (are not "group-invariant").

◆ do_ident_check()

bool Group_check::do_ident_check ( Item_ident i,
table_map  tm,
enum enum_ident_check  type 
)
private

Does one low-level check on one Item_ident.

Called by Item_ident walk processors.

Parameters
iitem to check
tmmap of strong tables, if type==CHECK_STRONG_SIDE_COLUMN
typecheck to do:
  • CHECK_GROUP for Item_ident::aggregate_check_group()
  • CHECK_STRONG_SIDE_COLUMN for Item_ident::is_strong_side_column_not_in_fd()
  • CHECK_COLUMN for Item_ident::is_column_not_in_fd()
Returns
false if success

◆ find_fd_in_cond()

void Group_check::find_fd_in_cond ( Item cond,
table_map  weak_tables,
bool  weak_side_upwards 
)
private

Searches for equality-based functional dependencies in a condition.

Parameters
condcondition: a WHERE condition or JOIN condition.
weak_tablesIf condition is WHERE, it's 0. Otherwise it is the map of weak tables in the join nest which owns the condition.
weak_side_upwardsIf condition is WHERE it's false. Otherwise it is true if the join nest owning this condition is embedded in the right side of some parent left join.

◆ find_fd_in_joined_table()

void Group_check::find_fd_in_joined_table ( mem_root_deque< Table_ref * > *  join_list)
private

Searches for equality-based functional dependencies in the condition of a join nest, and recursively in all contained join nests.

Parameters
join_listmembers of the join nest

◆ find_group_in_fd()

void Group_check::find_group_in_fd ( Item item)
private

This function must be called every time we discover an item which is FD on source columns, or add a bit to whole_tables_fd; it maintains group_in_fd.

Parameters
itemitem which is FD; if NULL, means that we instead added a bit to whole_tables_fd.

◆ get_fd_equal()

Item * Group_check::get_fd_equal ( Item item)
private
Returns
an element of 'fd' array equal to 'item', or nullptr if not found.
Parameters
itemItem to search for.

◆ is_fd_on_source()

bool Group_check::is_fd_on_source ( Item item)
private

Tells if 'item' is functionally dependent ("FD") on source columns.

Source columns are:

  • if !is_child(), the GROUP BY columns
  • if is_child(), columns of the result of the query expression under 'table' which are themselves part of 'fd' of the parent Group_check.

We recognize most FDs imposed by SQL2011 (optional feature T301)

We build sets, named En, by induction. A "column" is defined as base table / view / derived table's column.

E1 = {source columns} (=group columns, if this is a master Group_check; empty set otherwise).

En is a set of columns of the result of the WHERE clause of 'select' which are functionally dependent on E1. If is_child(), En columns might rather be of the result of the GROUP BY clause (if there is one), but that's an unimportant detail, ignored further down.

Given En, build En+1:

  • let En+1= En
  • for each {pk/unique index of some table T} found in En, add T.* to En+1 (actually, we rather add T's map bit to the table_map whole_tables_fd).

Then build En+2, by adding y, for each x=y in AND parts of WHERE/ON where x is in En+1 or is constant, and y is a column not in En+1.

When we meet columns of views or derived tables, we additionally search for FDs in their query expression, which can then give FDs in our query. If En+2==En, this is the end of the induction. Otherwise we loop.

As we build En, we check if 'item' is in it.

Parameters
itemItem to consider; either a column local to 'select', or a set function whose aggregation query is 'select'
Returns
true if 'item' is functionally dependent on source columns.

◆ is_in_fd()

bool Group_check::is_in_fd ( Item item)
private

is_in_fd() is low-level compared to is_fd_on_source().

The former only searches through built FD information; the latter builds this information and calls the former to search in it.

Parameters
itemItem to consider; either a column local to 'select', or a set function whose aggregation query is 'select'
Returns
true if the expression is FD on the source.

◆ is_in_fd_of_underlying()

bool Group_check::is_in_fd_of_underlying ( Item_ident item)
private

See if we can derive a FD from a column which has an underlying expression.

For a generated column, see if we can derive a FD from its expression. For a column of a view or derived table, see if we can derive a FD from the underlying query block.

Parameters
itemcolumn
Returns
true if this column is FD on source

◆ select_expression()

Item * Group_check::select_expression ( uint  idx)
private
Returns
the idx-th expression in the SELECT list of our query block.

◆ to_opt_trace()

void Group_check::to_opt_trace ( THD thd)

Writes "check information" to the optimizer trace.

◆ to_opt_trace2()

void Group_check::to_opt_trace2 ( Opt_trace_context ctx,
Opt_trace_object parent 
)
private

Utility function for to_opt_trace(), as we need recursion in children Group_checks.

to_opt_trace() only writes a one-time header.

Variable Documentation

◆ walk_options

const enum_walk walk_options
static
Initial value:

We need to search for items inside subqueries, in case subqueries contain outer references to tables of a query block having DISTINCT or GROUP BY.

We also need to sometimes skip parts of item trees, so the walk processor must be called prefix (to enable skipping) and postfix (to disable skipping).