WL#13520: Transform correlated scalar subqueries

Affects: Server-8.0 — Status: Complete

Description
Requirements
High Level Architecture
Low Level Design

Basic idea

With correlation, to avoid materializing this several times in a derived table, we can instead materialize a derived table once which adds grouping on the column t2.a and then outer join on the lifted predicate (t1.a = derived.a) in order to select the right group to match up with the outer row.

In the example below we also need to check if COUNT(*) for the relevant group does not exceed 1. So we reject if cardinality > 1.

 SELECT * FROM t1 WHERE (SELECT a FROM t2 WHERE t2.a=t1.a) > 0;

 ->

 (with non-strict grouping and a built in assert to signal cardinality error):
 SELECT t1.* FROM
   t1 LEFT OUTER JOIN
   (SELECT a, COUNT(*) AS cnt FROM t2 GROUP BY a) AS derived
   ON t1.a = derived.a AND 
      REJECT_IF((cnt > 1),
             "ERROR 1242 (21000): Subquery returns more than 1 row")
   WHERE derived.a > 0

where the REJECT_IF is modeled as a boolean function Item_func_reject_if: it performs the comparison and raises the error if the cardinality check fails. Note that the check needs to happen as part of evaluating the JOIN condition or the WHERE condition, before evaluating the lifted predicate.

If the subquery already has explicit grouping, the extra grouping is just added to the end of the grouping list.

FR#1. A correlated field can only be present in the subquery's WHERE clause, i.e. not in the select list, join clause, order by clause, group by list or having clause. Nor can there be correlated fields inside a derived table in the subquery's FROM list.

FR#2. The WHERE clause may contain one or more predicates, combined with AND. If the WHERE clause contains an OR clause, it cannot be transformed.

FR#3. At least one of the WHERE clause predicates must be "eligible for transformation".

FR#4. None of the WHERE clause predicates must reject transformation.

FR#5. A WHERE clause predicate must be an equality predicate to be eligible for transformation. No other predicates, including other comparison predicates, in particular the <=> operator, are eligible for transformation.

FR#6. A WHERE clause predicate that only contains inner references is not eligible for transformation, as it can be evaluated before the grouping.

FR#7. A WHERE clause predicate that only contains outer references is eligible for transformation, even though it can be lifted up to the outer query block. This is made possible by adding the cardinality check without grouping in the derived table.

FR#8. To be eligible, a WHERE clause predicate must have one operand that contains only inner references and one operand that contains only outer references. If the predicate is not eligible due to this rule, it rejects transformation of the query.

FR#9. In an eligible WHERE clause predicate, each operand should be a simple column reference.

FR#10. The subquery cannot contain a LIMIT or OFFSET clause

FR#11. A correlated field can not be contained in an aggregate function's list of arguments.

FR#12. A correlated field must be resolved in the query block directly containing the subquery being considered for transformation.

FR#13. A correlated field cannot be present in a nested scalar subquery in the WHERE clause.

FR#14. The subquery cannot contain a set operation(UNION)

FR#15. A COUNT aggregate function, if contained in the select list element of the subquery, must be top-level, i.e. it can not be part of an expression.

FR#16. The subquery cannot contain any window functions.

FR#17. The subquery must not contain an aggregate function which aggregates in a query block outer to the subquery.

FR#18. The subquery A under consideration must not contain a subquery B which (directly or indirectly) contains an aggregate function which aggregates in subquery A. (This requirement may be redundant, as it would signal another subquery which cannot be transformed.)

FR#19. The subquery can be part of SELECT list, WHERE condition OR an HAVING condition. But it cannot be part of a JOIN condition.

"Outer aggregation"

If the correlated expression is an aggregate from outer query block, the transformation is not easy, so such queries are excluded from transformation. For example:

 SELECT 
   COUNT(*),
   a,
   (SELECT MIN(m) FROM t2 WHERE m = COUNT(*))  
 FROM t1 GROUP BY a;

The inner COUNT(*) is aggregated in the outer query due to its position in the WHERE clause: filtering occurs before grouping.

We could possibly re-write the above to

 SELECT outer.c, outer.a, inner.min_m
 FROM (SELECT COUNT(*) AS c, a
    FROM t1
    GROUP BY a) AS outer
   LEFT JOIN
   (SELECT MIN(m) AS min_m, COUNT(*) AS inner_count
    FROM t2
    GROUP BY m) AS inner
   ON outer.c = inner.m
 WHERE REJECT_IF(inner_count > 1);

i.e the outer grouping must be performed before joining with the grouped subquery. For now, we avoid doing the above.

Similarly, if such an outer aggregation is present in a subquery nested in the subquery under consideration, we also skip transformation.

DISTINCT

A special case is if the subquery contains DISTINCT on a select expression containing only fields we group on:

SELECT * FROM t1 WHERE (SELECT DISTINCT a FROM t2 WHERE t2.a=t1.a) > 0;

->

SELECT t1.* FROM
   t1 LEFT OUTER JOIN
   (SELECT a FROM t2 GROUP BY a) AS derived
   ON t1.a = derived.a
   WHERE derived.a > 0;

Since we group on a, we do not need to aggregate COUNT, since the original statement had the DISTINCT on that value, we can just eliminate it.

Correlated expression's position in subquery

The correlation must be located in the subquery's WHERE clause.

A special case: COUNT

An implicit COUNT selected never gives a NULL value even for an empty result set, but at the position of the inner table of a left outer join, it will. So, the transformation will not be correct unless we compensate for this. Consider

CREATE TABLE t1 (id INT);
CREATE TABLE t2 (id INT);
INSERT INTO t1 VALUES (1), (2);
INSERT INTO t2 VALUES (1);

and the query

SELECT t1.id, ( SELECT COUNT(t.id)
                  FROM t2 AS t
                  WHERE t.id = t1.id ) AS c FROM t1;
+------+------+
| id   | c    |
+------+------+
|    1 |    1 |
|    2 |    0 |
+------+------+

If this is transformed as suggested above:

SELECT t1.id AS id,
       derived_1_2.`COUNT(t.id)` AS c
FROM t1 LEFT JOIN
     (SELECT COUNT(t.id) AS `COUNT(t.id)`,
             t.id AS id
      FROM t2 t
      GROUP BY t.id) derived_1_2
      ON derived_1_2.id = t1.id;

it would yield the wrong result:

+------+------+
| id   | c    |
+------+------+
|    1 |    1 |
|    2 | NULL |
+------+------+

So, instead we wrap it in a COALESCE:

SELECT t1.id AS id,
       COALESCE(derived_1_2.`COUNT(t.id)`, 0) AS c
FROM t1 LEFT JOIN
     (SELECT COUNT(t.id) AS `COUNT(t.id)`,
             t.id AS id
      FROM t2 t
      GROUP BY t.id) derived_1_2
      ON derived_1_2.id = t1.id;

If the COUNT is selected as part of an expression in the scalar subquery, the situation gets more complicated: we would need to move the expression to the outer query, replacing the count with the coalesce, and just keep the basic count as a selected item in the derived table. (not yet supported).

For other aggregate functions, a coalesce is not needed since they yield null on empty result sets, so this restriction does not apply. For all aggregates we know that for each group there is only a single row, so no run-time cardinality checking is added.

Subquery in JOIN clause

When a scalar subquery is located in the JOIN clause, the derived tables has be joined to the outer table before the original inner joinee. Now, if we have correlation, this will give us the following transformation:

SELECT COUNT(*)
FROM t1 a
     JOIN
     t1 outr
     ON a.a= (SELECT count(*)
              FROM t1 inr
              WHERE inr.a = outr.a);


->

 SELECT COUNT(0) AS `COUNT(*)`
 FROM t1 a
      LEFT JOIN
      ( SELECT COUNT(0) AS `count(*)`,   <---- the new derived table
              inr.a AS a
        FROM t1 inr
        GROUP BY inr.a) derived_1_2
      ON derived_1_2.a = outr.a
      JOIN t1 outr
      WHERE (a.a = COALESCE(derived_1_2.`count(*)`,0));

If we try to hand this query to MySQL, we see:

ERROR 1054 (42S22): Unknown column 'mysql.outr.a' in 'on clause'

out.r used in the JOIN clause we constructed is not visible, because it comes from the second joinee table. Since this is a plain inner join, we would rearrange the original query order to:

SELECT COUNT(*)
FROM t1 outr
     JOIN
     t1 a
     ON a.a= (SELECT count(*)
              FROM t1 inr
              WHERE inr.a = outr.a);

to give a valid transform:

SELECT COUNT(0) AS `COUNT(*)`
 FROM t1 outr
      LEFT JOIN
      ( SELECT COUNT(0) AS `count(*)`,   
              inr.a AS a
        FROM t1 inr
        GROUP BY inr.a) derived_1_2
      ON derived_1_2.a = outr.a
      JOIN t1 a
      WHERE (a.a = COALESCE(derived_1_2.`count(*)`,0));

If the join clause is part of an outer join, we cannot reorder and the transformation is impossible.

Currently, we do not attempt the reordering for the inner join case either, meaning we do not support transforming a scalar subquery in a JOIN clause if it is correlated.

Window functions

The transform is not attempted if the subquery contains window functions

ORDER BY + LIMIT/OFFSET

Consider the query:

CREATE TABLE t1 (id INT, contract_id INT, datestamp DATETIME);
INSERT INTO t1 VALUES
       (1,2,'2006-09-18 09:07:53'), (2,3,'2006-09-18 09:07:53'),
       (3,4,'2006-09-18 09:07:53'), (4,10,'2008-09-18 09:07:53'),
       (5,7,'2006-09-18 09:07:53'), (6,5,'2006-09-18 09:07:53'),
       (7,9,'2006-09-18 09:07:53'), (8,10,'2006-09-18 09:07:53'),
       (9,10,'2010-09-18 09:07:53'), (10,6,'2014-09-18 09:07:53');
CREATE TABLE t2 (id INT);
INSERT INTO t2 VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (10); 

SELECT (SELECT datestamp
        FROM t1
        WHERE contract_id = t2.id
        ORDER BY datestamp ASC
        LIMIT 1) AS subq
FROM t2;

+---------------------+
| subq                |
+---------------------+
| NULL                |
| 2006-09-18 09:07:53 |
| 2006-09-18 09:07:53 |
| 2006-09-18 09:07:53 |
| 2006-09-18 09:07:53 |
| 2014-09-18 09:07:53 |
| 2006-09-18 09:07:53 |
| NULL                |
| 2006-09-18 09:07:53 |
| 2006-09-18 09:07:53 |
+---------------------+
10 rows in set (0.01 sec)

Blindly using the transform outlined above gives:

SELECT derived_1_2.datestamp AS subq
FROM t2 LEFT JOIN
     ( SELECT t1.datestamp AS datestamp,
              t1.contract_id AS contract_id,
              COUNT(0) AS Name_exp_3
       FROM t1
       GROUP BY t1.contract_id
       ORDER BY t1.datestamp 
       LIMIT 1) derived_1_2
     ON (derived_1_2.contract_id = t2.id) AND
        (derived_1_2.Name_exp_3 reject_if > 1)

This will give the wrong answer. After the grouping on contract_id, it is indeterministic to order by datestamp, and the LIMIT 1 yield only one group:

+---------------------+
| subq                |
+---------------------+
| NULL                |
| 2006-09-18 09:07:53 |
| NULL                |
| NULL                |
| NULL                |
| NULL                |
| NULL                |
| NULL                |
| NULL                |
| NULL                |
+---------------------+
10 rows in set (0.00 sec)

What about also grouping on the order by expression?

SELECT derived_1_2.datestamp AS subq
FROM t2
     LEFT JOIN
     ( SELECT t1.datestamp AS datestamp,
              t1.contract_id AS contract_id
       FROM t1
       GROUP BY t1.contract_id, t1.datestamp
       ORDER BY t1.datestamp) derived_1_2 

     ON derived_1_2.contract_id = t2.id;

+---------------------+
| subq                |
+---------------------+
| NULL                |
| 2006-09-18 09:07:53 |
| 2006-09-18 09:07:53 |
| 2006-09-18 09:07:53 |
| 2006-09-18 09:07:53 |
| 2014-09-18 09:07:53 |
| 2006-09-18 09:07:53 |
| NULL                |
| 2006-09-18 09:07:53 |
| 2006-09-18 09:07:53 |
| 2008-09-18 09:07:53 |
| 2010-09-18 09:07:53 | <-- should have been 2006
+---------------------+
12 rows in set (0.00 sec)

Better, but the last row is wrong. We have multiple rows in the derived table that match the ON predicate, so we get an indeterministic value for datestamp, again, cf. inner query:

mysql> SELECT t1.datestamp AS datestamp, 
              t1.contract_id AS contract_id
       FROM t1  
       GROUP BY t1.contract_id, t1.datestamp
       ORDER BY t1.datestamp DESC;
+---------------------+-------------+
| datestamp           | contract_id |
+---------------------+-------------+
| 2014-09-18 09:07:53 |           6 |
| 2010-09-18 09:07:53 |          10 | <--- not this
| 2008-09-18 09:07:53 |          10 | <--- not this
| 2006-09-18 09:07:53 |           2 |
| 2006-09-18 09:07:53 |           3 |
| 2006-09-18 09:07:53 |           4 |
| 2006-09-18 09:07:53 |           7 |
| 2006-09-18 09:07:53 |           5 |
| 2006-09-18 09:07:53 |           9 |
| 2006-09-18 09:07:53 |          10 | <--- we want this one
+---------------------+-------------+
10 rows in set (0.01 sec)

If we order by DESC, we get the correct result, but we cannot rely on that.

The right transform turns out to be:

SELECT derived_1_2.datestamp AS subq
FROM t2
     LEFT JOIN
     ( SELECT MIN(datestamp) OVER () AS datestamp,
              t1.contract_id AS contract_id
       FROM t1
       GROUP BY contract_id, datestamp) derived_1_2
     ON derived_1_2.contract_id = t2.id;

i.e. we effectively transform the ORDER BY ASC + LIMIT 1 to MIN. But we do not transform correlated subqueries with LIMIT/OFFSET in this WL.

For the case when OFFSET is present, transformation with window functions is a bit different.

For the following query:

SELECT (SELECT datestamp
        FROM t1
        WHERE contract_id = t2.id
        ORDER BY datestamp ASC
        LIMIT 1 OFFSET 3) AS subq
FROM t2;

right transformation would be :

SELECT derived_1_2.datestamp AS subq
FROM t2
     LEFT JOIN
     ( SELECT NTH_VALUE(datestamp, 3) OVER (PARTITION BY contract_id
                                            ORDER BY datestamp) AS 
              datestamp,
              t1.contract_id AS contract_id,
              ROW_NUMBER() OVER(PARTITION BY contract_id
                                ORDER BY datestamp) as row_num,
              COUNT(*) OVER(PARTITION BY contract_id) as count
       FROM t1) derived_1_2
     ON derived_1_2.contract_id = t2.id and row_num = 3;

For the case of LIMIT N OFFSET M, we could change the join condition to ROW_NUMBER() >= M and REJECT_IF(count > (M+1))

Without using window functions, we could re-write query having LIMIT 1 by re-writing the query following way:

SELECT derived_1_2.datestamp AS subq
FROM t2 LEFT JOIN  (SELECT min(t1.datestamp) AS datestamp,
                           t1.contract_id AS contract_id,
                           COUNT(0) AS Name_exp_3
                    FROM t1
                    GROUP BY t1.contract_id) derived_1_2
     ON derived_1_2.contract_id = t2.id and REJECT_IF(Name_exp_3 > 1);

With arbitrary LIMIT, we need to carry the COUNT for each group and reject when cardinality is greater than 1 - so we just ignore the LIMIT in the grouping, but reject groups where MIN(COUNT(*), LIMIT) > 1. And LIMIT 0 must be handled specially, as the subquery will only give NULL values.

A new item has been introduced to trigger the necessary cardinality run-time check:

 class Item_func_reject_if : public Item_bool_func

it's val_int method looks like this for the primary engine:

longlong Item_func_reject_if::val_int() {
  longlong result = args[0]->val_int();
  if (result == 1) {
    my_error(ER_SUBQUERY_NO_1_ROW, MYF(0));
  }
  return !result;
}

that is, the expectation is that if the comparison yields true, the method should raise the run-time error ER_SUBQUERY_NO_1_ROW.

Any secondary engine will see this operator in the query plan and should implement similar semantics. For example this query:

SELECT * FROM t1 WHERE (SELECT a FROM t2 WHERE t2.a = t1.a) > 0;

will be transformed to

SELECT t1.a AS a
FROM t1 JOIN
     ( SELECT t2.a AS a,
              COUNT(0) AS Name_exp_2
       FROM t2
       GROUP BY t2.a) derived_1_2
WHERE ((t1.a = derived_1_2.a) AND
       (derived_1_2.a > 0) AND
       REJECT_IF(derived_1_2.Name_exp_2 > 1))

The last predicate in the where contains the Item_func_reject_if operator. If, during execution, the secondary engine sees "reject_if" evaluate to false, it should send back ER_SUBQUERY_NO_1_ROW to the MySQL server.