WL#13520: Transform correlated scalar subqueries
Basic idea
With correlation, to avoid materializing this several times in a derived table, we can instead materialize a derived table once which adds grouping on the column t2.a and then outer join on the lifted predicate (t1.a = derived.a) in order to select the right group to match up with the outer row.
In the example below we also need to check if COUNT(*) for the relevant group does not exceed 1. So we reject if cardinality > 1.
SELECT * FROM t1 WHERE (SELECT a FROM t2 WHERE t2.a=t1.a) > 0;
->
(with non-strict grouping and a built in assert to signal cardinality error):
SELECT t1.* FROM
t1 LEFT OUTER JOIN
(SELECT a, COUNT(*) AS cnt FROM t2 GROUP BY a) AS derived
ON t1.a = derived.a AND
REJECT_IF((cnt > 1),
"ERROR 1242 (21000): Subquery returns more than 1 row")
WHERE derived.a > 0
where the REJECT_IF is modeled as a boolean function Item_func_reject_if: it performs the comparison and raises the error if the cardinality check fails. Note that the check needs to happen as part of evaluating the JOIN condition or the WHERE condition, before evaluating the lifted predicate.
If the subquery already has explicit grouping, the extra grouping is just added to the end of the grouping list.
FR#1. A correlated field can only be present in the subquery's WHERE clause, i.e. not in the select list, join clause, order by clause, group by list or having clause. Nor can there be correlated fields inside a derived table in the subquery's FROM list.
FR#2. The WHERE clause may contain one or more predicates, combined with AND. If the WHERE clause contains an OR clause, it cannot be transformed.
FR#3. At least one of the WHERE clause predicates must be "eligible for transformation".
FR#4. None of the WHERE clause predicates must reject transformation.
FR#5. A WHERE clause predicate must be an equality predicate to be eligible for transformation. No other predicates, including other comparison predicates, in particular the <=> operator, are eligible for transformation.
FR#6. A WHERE clause predicate that only contains inner references is not eligible for transformation, as it can be evaluated before the grouping.
FR#7. A WHERE clause predicate that only contains outer references is eligible for transformation, even though it can be lifted up to the outer query block. This is made possible by adding the cardinality check without grouping in the derived table.
FR#8. To be eligible, a WHERE clause predicate must have one operand that contains only inner references and one operand that contains only outer references. If the predicate is not eligible due to this rule, it rejects transformation of the query.
FR#9. In an eligible WHERE clause predicate, each operand should be a simple column reference.
FR#10. The subquery cannot contain a LIMIT or OFFSET clause
FR#11. A correlated field can not be contained in an aggregate function's list of arguments.
FR#12. A correlated field must be resolved in the query block directly containing the subquery being considered for transformation.
FR#13. A correlated field cannot be present in a nested scalar subquery in the WHERE clause.
FR#14. The subquery cannot contain a set operation(UNION)
FR#15. A COUNT aggregate function, if contained in the select list element of the subquery, must be top-level, i.e. it can not be part of an expression.
FR#16. The subquery cannot contain any window functions.
FR#17. The subquery must not contain an aggregate function which aggregates in a query block outer to the subquery.
FR#18. The subquery A under consideration must not contain a subquery B which (directly or indirectly) contains an aggregate function which aggregates in subquery A. (This requirement may be redundant, as it would signal another subquery which cannot be transformed.)
FR#19. The subquery can be part of SELECT list, WHERE condition OR an HAVING condition. But it cannot be part of a JOIN condition.
"Outer aggregation"
If the correlated expression is an aggregate from outer query block, the transformation is not easy, so such queries are excluded from transformation. For example:
SELECT
COUNT(*),
a,
(SELECT MIN(m) FROM t2 WHERE m = COUNT(*))
FROM t1 GROUP BY a;
The inner COUNT(*) is aggregated in the outer query due to its position in the WHERE clause: filtering occurs before grouping.
We could possibly re-write the above to
SELECT outer.c, outer.a, inner.min_m
FROM (SELECT COUNT(*) AS c, a
FROM t1
GROUP BY a) AS outer
LEFT JOIN
(SELECT MIN(m) AS min_m, COUNT(*) AS inner_count
FROM t2
GROUP BY m) AS inner
ON outer.c = inner.m
WHERE REJECT_IF(inner_count > 1);
i.e the outer grouping must be performed before joining with the grouped subquery. For now, we avoid doing the above.
Similarly, if such an outer aggregation is present in a subquery nested in the subquery under consideration, we also skip transformation.
DISTINCT
A special case is if the subquery contains DISTINCT on a select expression containing only fields we group on:
SELECT * FROM t1 WHERE (SELECT DISTINCT a FROM t2 WHERE t2.a=t1.a) > 0;
->
SELECT t1.* FROM
t1 LEFT OUTER JOIN
(SELECT a FROM t2 GROUP BY a) AS derived
ON t1.a = derived.a
WHERE derived.a > 0;
Since we group on a, we do not need to aggregate COUNT, since the original statement had the DISTINCT on that value, we can just eliminate it.
Correlated expression's position in subquery
The correlation must be located in the subquery's WHERE clause.
A special case: COUNT
An implicit COUNT selected never gives a NULL value even for an empty result set, but at the position of the inner table of a left outer join, it will. So, the transformation will not be correct unless we compensate for this. Consider
CREATE TABLE t1 (id INT);
CREATE TABLE t2 (id INT);
INSERT INTO t1 VALUES (1), (2);
INSERT INTO t2 VALUES (1);
and the query
SELECT t1.id, ( SELECT COUNT(t.id)
FROM t2 AS t
WHERE t.id = t1.id ) AS c FROM t1;
+------+------+
| id | c |
+------+------+
| 1 | 1 |
| 2 | 0 |
+------+------+
If this is transformed as suggested above:
SELECT t1.id AS id,
derived_1_2.`COUNT(t.id)` AS c
FROM t1 LEFT JOIN
(SELECT COUNT(t.id) AS `COUNT(t.id)`,
t.id AS id
FROM t2 t
GROUP BY t.id) derived_1_2
ON derived_1_2.id = t1.id;
it would yield the wrong result:
+------+------+
| id | c |
+------+------+
| 1 | 1 |
| 2 | NULL |
+------+------+
So, instead we wrap it in a COALESCE:
SELECT t1.id AS id,
COALESCE(derived_1_2.`COUNT(t.id)`, 0) AS c
FROM t1 LEFT JOIN
(SELECT COUNT(t.id) AS `COUNT(t.id)`,
t.id AS id
FROM t2 t
GROUP BY t.id) derived_1_2
ON derived_1_2.id = t1.id;
If the COUNT is selected as part of an expression in the scalar subquery, the situation gets more complicated: we would need to move the expression to the outer query, replacing the count with the coalesce, and just keep the basic count as a selected item in the derived table. (not yet supported).
For other aggregate functions, a coalesce is not needed since they yield null on empty result sets, so this restriction does not apply. For all aggregates we know that for each group there is only a single row, so no run-time cardinality checking is added.
Subquery in JOIN clause
When a scalar subquery is located in the JOIN clause, the derived tables has be joined to the outer table before the original inner joinee. Now, if we have correlation, this will give us the following transformation:
SELECT COUNT(*)
FROM t1 a
JOIN
t1 outr
ON a.a= (SELECT count(*)
FROM t1 inr
WHERE inr.a = outr.a);
->
SELECT COUNT(0) AS `COUNT(*)`
FROM t1 a
LEFT JOIN
( SELECT COUNT(0) AS `count(*)`, <---- the new derived table
inr.a AS a
FROM t1 inr
GROUP BY inr.a) derived_1_2
ON derived_1_2.a = outr.a
JOIN t1 outr
WHERE (a.a = COALESCE(derived_1_2.`count(*)`,0));
If we try to hand this query to MySQL, we see:
ERROR 1054 (42S22): Unknown column 'mysql.outr.a' in 'on clause'
out.r used in the JOIN clause we constructed is not visible, because it comes from the second joinee table. Since this is a plain inner join, we would rearrange the original query order to:
SELECT COUNT(*)
FROM t1 outr
JOIN
t1 a
ON a.a= (SELECT count(*)
FROM t1 inr
WHERE inr.a = outr.a);
to give a valid transform:
SELECT COUNT(0) AS `COUNT(*)`
FROM t1 outr
LEFT JOIN
( SELECT COUNT(0) AS `count(*)`,
inr.a AS a
FROM t1 inr
GROUP BY inr.a) derived_1_2
ON derived_1_2.a = outr.a
JOIN t1 a
WHERE (a.a = COALESCE(derived_1_2.`count(*)`,0));
If the join clause is part of an outer join, we cannot reorder and the transformation is impossible.
Currently, we do not attempt the reordering for the inner join case either, meaning we do not support transforming a scalar subquery in a JOIN clause if it is correlated.
Window functions
The transform is not attempted if the subquery contains window functions
ORDER BY + LIMIT/OFFSET
Consider the query:
CREATE TABLE t1 (id INT, contract_id INT, datestamp DATETIME);
INSERT INTO t1 VALUES
(1,2,'2006-09-18 09:07:53'), (2,3,'2006-09-18 09:07:53'),
(3,4,'2006-09-18 09:07:53'), (4,10,'2008-09-18 09:07:53'),
(5,7,'2006-09-18 09:07:53'), (6,5,'2006-09-18 09:07:53'),
(7,9,'2006-09-18 09:07:53'), (8,10,'2006-09-18 09:07:53'),
(9,10,'2010-09-18 09:07:53'), (10,6,'2014-09-18 09:07:53');
CREATE TABLE t2 (id INT);
INSERT INTO t2 VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (10);
SELECT (SELECT datestamp
FROM t1
WHERE contract_id = t2.id
ORDER BY datestamp ASC
LIMIT 1) AS subq
FROM t2;
+---------------------+
| subq |
+---------------------+
| NULL |
| 2006-09-18 09:07:53 |
| 2006-09-18 09:07:53 |
| 2006-09-18 09:07:53 |
| 2006-09-18 09:07:53 |
| 2014-09-18 09:07:53 |
| 2006-09-18 09:07:53 |
| NULL |
| 2006-09-18 09:07:53 |
| 2006-09-18 09:07:53 |
+---------------------+
10 rows in set (0.01 sec)
Blindly using the transform outlined above gives:
SELECT derived_1_2.datestamp AS subq
FROM t2 LEFT JOIN
( SELECT t1.datestamp AS datestamp,
t1.contract_id AS contract_id,
COUNT(0) AS Name_exp_3
FROM t1
GROUP BY t1.contract_id
ORDER BY t1.datestamp
LIMIT 1) derived_1_2
ON (derived_1_2.contract_id = t2.id) AND
(derived_1_2.Name_exp_3 reject_if > 1)
This will give the wrong answer. After the grouping on contract_id, it is indeterministic to order by datestamp, and the LIMIT 1 yield only one group:
+---------------------+
| subq |
+---------------------+
| NULL |
| 2006-09-18 09:07:53 |
| NULL |
| NULL |
| NULL |
| NULL |
| NULL |
| NULL |
| NULL |
| NULL |
+---------------------+
10 rows in set (0.00 sec)
What about also grouping on the order by expression?
SELECT derived_1_2.datestamp AS subq
FROM t2
LEFT JOIN
( SELECT t1.datestamp AS datestamp,
t1.contract_id AS contract_id
FROM t1
GROUP BY t1.contract_id, t1.datestamp
ORDER BY t1.datestamp) derived_1_2
ON derived_1_2.contract_id = t2.id;
+---------------------+
| subq |
+---------------------+
| NULL |
| 2006-09-18 09:07:53 |
| 2006-09-18 09:07:53 |
| 2006-09-18 09:07:53 |
| 2006-09-18 09:07:53 |
| 2014-09-18 09:07:53 |
| 2006-09-18 09:07:53 |
| NULL |
| 2006-09-18 09:07:53 |
| 2006-09-18 09:07:53 |
| 2008-09-18 09:07:53 |
| 2010-09-18 09:07:53 | <-- should have been 2006
+---------------------+
12 rows in set (0.00 sec)
Better, but the last row is wrong. We have multiple rows in the derived table that match the ON predicate, so we get an indeterministic value for datestamp, again, cf. inner query:
mysql> SELECT t1.datestamp AS datestamp,
t1.contract_id AS contract_id
FROM t1
GROUP BY t1.contract_id, t1.datestamp
ORDER BY t1.datestamp DESC;
+---------------------+-------------+
| datestamp | contract_id |
+---------------------+-------------+
| 2014-09-18 09:07:53 | 6 |
| 2010-09-18 09:07:53 | 10 | <--- not this
| 2008-09-18 09:07:53 | 10 | <--- not this
| 2006-09-18 09:07:53 | 2 |
| 2006-09-18 09:07:53 | 3 |
| 2006-09-18 09:07:53 | 4 |
| 2006-09-18 09:07:53 | 7 |
| 2006-09-18 09:07:53 | 5 |
| 2006-09-18 09:07:53 | 9 |
| 2006-09-18 09:07:53 | 10 | <--- we want this one
+---------------------+-------------+
10 rows in set (0.01 sec)
If we order by DESC, we get the correct result, but we cannot rely on that.
The right transform turns out to be:
SELECT derived_1_2.datestamp AS subq
FROM t2
LEFT JOIN
( SELECT MIN(datestamp) OVER () AS datestamp,
t1.contract_id AS contract_id
FROM t1
GROUP BY contract_id, datestamp) derived_1_2
ON derived_1_2.contract_id = t2.id;
i.e. we effectively transform the ORDER BY ASC + LIMIT 1 to MIN. But we do not transform correlated subqueries with LIMIT/OFFSET in this WL.
For the case when OFFSET is present, transformation with window functions is a bit different.
For the following query:
SELECT (SELECT datestamp
FROM t1
WHERE contract_id = t2.id
ORDER BY datestamp ASC
LIMIT 1 OFFSET 3) AS subq
FROM t2;
right transformation would be :
SELECT derived_1_2.datestamp AS subq
FROM t2
LEFT JOIN
( SELECT NTH_VALUE(datestamp, 3) OVER (PARTITION BY contract_id
ORDER BY datestamp) AS
datestamp,
t1.contract_id AS contract_id,
ROW_NUMBER() OVER(PARTITION BY contract_id
ORDER BY datestamp) as row_num,
COUNT(*) OVER(PARTITION BY contract_id) as count
FROM t1) derived_1_2
ON derived_1_2.contract_id = t2.id and row_num = 3;
For the case of LIMIT N OFFSET M, we could change the join condition to ROW_NUMBER() >= M and REJECT_IF(count > (M+1))
Without using window functions, we could re-write query having LIMIT 1 by re-writing the query following way:
SELECT derived_1_2.datestamp AS subq
FROM t2 LEFT JOIN (SELECT min(t1.datestamp) AS datestamp,
t1.contract_id AS contract_id,
COUNT(0) AS Name_exp_3
FROM t1
GROUP BY t1.contract_id) derived_1_2
ON derived_1_2.contract_id = t2.id and REJECT_IF(Name_exp_3 > 1);
With arbitrary LIMIT, we need to carry the COUNT for each group and reject when cardinality is greater than 1 - so we just ignore the LIMIT in the grouping, but reject groups where MIN(COUNT(*), LIMIT) > 1. And LIMIT 0 must be handled specially, as the subquery will only give NULL values.
class Item_func_reject_if : public Item_bool_func
it's val_int method looks like this for the primary engine:
longlong Item_func_reject_if::val_int() {
longlong result = args[0]->val_int();
if (result == 1) {
my_error(ER_SUBQUERY_NO_1_ROW, MYF(0));
}
return !result;
}
that is, the expectation is that if the comparison yields true, the method should raise the run-time error ER_SUBQUERY_NO_1_ROW.
Any secondary engine will see this operator in the query plan and should implement similar semantics. For example this query:
SELECT * FROM t1 WHERE (SELECT a FROM t2 WHERE t2.a = t1.a) > 0;
will be transformed to
SELECT t1.a AS a
FROM t1 JOIN
( SELECT t2.a AS a,
COUNT(0) AS Name_exp_2
FROM t2
GROUP BY t2.a) derived_1_2
WHERE ((t1.a = derived_1_2.a) AND
(derived_1_2.a > 0) AND
REJECT_IF(derived_1_2.Name_exp_2 > 1))
The last predicate in the where contains the Item_func_reject_if operator. If, during execution, the secondary engine sees "reject_if" evaluate to false, it should send back ER_SUBQUERY_NO_1_ROW to the MySQL server.