MySQL :: MySQL 8.0 Labs: [Recursive] Common Table Expressions in MySQL (CTEs)

[Update of April 2017: the features discussed here are now also available in the official release 8.0.1 of MySQL.]

[Note: this is the first post in a series; there is also a second post].

The MySQL development team just published a Labs release of the MySQL Server (available under “MySQL Server 8.0.0 Optimizer”).
A prominent feature of this release, which I developed, is [Recursive] Common Table Expressions, also known as

[recursive] CTE,
[recursive] subquery factoring,
WITH [RECURSIVE] clause.

Three years ago I had shown a way to emulate CTEs in a blog post, but what we have now in MySQL is the real thing, no ersatz!

So this is the first of several blog posts which will tour all details about this new feature.

A derived table is a subquery in the FROM clause, in bold font below:

SELECT … FROM (SELECT …) AS derived_table;

Derived tables exist in MySQL since long. It’s accurate to say that non-recursive CTEs are “improved derived tables”. Here is a first example:

WITH cte1(txt) AS (SELECT "This "),
     cte2(txt) AS (SELECT CONCAT(cte1.txt,"is a ") FROM cte1),
     cte3(txt) AS (SELECT "nice query" UNION
                   SELECT "query that rocks" UNION
                   SELECT "query"),
     cte4(txt) AS (SELECT concat(cte2.txt, cte3.txt) FROM cte2, cte3)
SELECT MAX(txt), MIN(txt) FROM cte4;

+----------------------------+----------------------+
| MAX(txt)                   | MIN(txt)             |
+----------------------------+----------------------+
| This is a query that rocks | This is a nice query |
+----------------------------+----------------------+
1 row in set (0,00 sec)

WITH cte1(txt) AS (SELECT "This "),

cte2(txt) AS (SELECT CONCAT(cte1.txt,"is a ") FROM cte1),

cte3(txt) AS (SELECT "nice query" UNION

SELECT "query that rocks" UNION

SELECT "query"),

cte4(txt) AS (SELECT concat(cte2.txt, cte3.txt) FROM cte2, cte3)

SELECT MAX(txt), MIN(txt) FROM cte4;

+----------------------------+----------------------+

| MAX(txt) | MIN(txt) |

+----------------------------+----------------------+

| This is a query that rocks | This is a nice query |

+----------------------------+----------------------+

1 row in set (0,00 sec)

We have a CTE cte4, which is built from other CTEs cte3 and cte2 which itself is built from cte1. As one reader of this post (Vasiliy) commented, the same can be achieved with derived tables:

SELECT MAX(txt), MIN(txt)
FROM
(
  SELECT concat(cte2.txt, cte3.txt) as txt
  FROM
  (
    SELECT CONCAT(cte1.txt,'is a ') as txt
    FROM
    (
      SELECT 'This ' as txt
    ) as cte1
  ) as cte2,
  (
    SELECT 'nice query' as txt
    UNION
    SELECT 'query that rocks'
    UNION
    SELECT 'query'
  ) as cte3
) as cte4;

SELECT MAX(txt), MIN(txt)

FROM

(

SELECT concat(cte2.txt, cte3.txt) as txt

FROM

(

SELECT CONCAT(cte1.txt,'is a ') as txt

FROM

(

SELECT 'This ' as txt

) as cte1

) as cte2,

(

SELECT 'nice query' as txt

UNION

SELECT 'query that rocks'

UNION

SELECT 'query'

) as cte3

) as cte4;

However if you read both queries, the CTE-based one flows more nicely as it defines table after table, “linearly”, and we see the sentences forming as we read from top to bottom, whereas the derived-table-based query looks “tree-like” and “turned inside-out” (see how the first word, “This “, is deeply nested). CTEs help writing a readable query, which is a win for future maintenance.

Due to the limited size of a blog post, I’ll cover non-recursive CTEs more in depth in a next post – I hope the example above has wet your appetite. Today, I’ll rather look at recursive CTEs which, in my humble opinion, are even more innovative to traditional SQL than
non-recursive CTEs are.

A recursive CTE is a set of rows which is built iteratively: from an initial set of rows, a process derives new rows, which grow the set, and those new rows are fed into the process again, producing more rows, and so on, until the process produces no more rows.

The simplest possible syntax is to include this

WITH RECURSIVE cte_name AS
(
  SELECT ...      <-- specifies initial set
  UNION ALL
  SELECT ...      <-- specifies how to derive new rows
)

WITH RECURSIVE cte_name AS

(

SELECT ... <-- specifies initial set

UNION ALL

SELECT ... <-- specifies how to derive new rows

)

into your SELECT, INSERT, UPDATE, DELETE statement, or inside any SELECT subquery.

Let’s walk through a first example, producing integers from 1 to 10:

WITH RECURSIVE my_cte AS
(
  SELECT 1 AS n
  UNION ALL
  SELECT 1+n FROM my_cte WHERE n<10
)
SELECT * FROM my_cte;

+------+
| n    |
+------+
|    1 |
|    2 |
|    3 |
|    4 |
|    5 |
|    6 |
|    7 |
|    8 |
|    9 |
|   10 |
+------+
10 rows in set (0,00 sec)

WITH RECURSIVE my_cte AS

(

SELECT 1 AS n

UNION ALL

SELECT 1+n FROM my_cte WHERE n<10

)

SELECT * FROM my_cte;

+------+

| n |

+------+

| 1 |

| 2 |

| 3 |

| 4 |

| 5 |

| 6 |

| 7 |

| 8 |

| 9 |

| 10 |

+------+

10 rows in set (0,00 sec)

If we compare it to the syntax rule, we see that:

the CTE is named my_cte
the CTE’s definition is included as prefix of SELECT statement SELECT * FROM my_cte
the initial set is SELECT 1 AS n so it’s a row containing “1”. The meta-data of this row is used to define the column of my_cte: therefore, my_cte has one column of type INT (the type of “1”), named n due to AS n.
the process which takes rows in input and produces new rows in output, is described by
SELECT 1+n FROM my_cte WHERE n<10
which means: take rows in my_cte which match n<10, and for each of them,
produce a row with an incremented n.
the definition of my_cte is clearly recursive, as it contains a reference to my_cte (in the FROM clause of the second SELECT). That’s how you can distinguish a recursive CTE from a non-recursive one.

So MySQL will take these steps:

iteration 0: create the initial set of rows, noted S0: S0={1},
iteration 1: run process on result of iteration 0 (on S0): this produces new set S1={1+1}={2},
iteration 2: run process on result of iteration 1 (on S1): this produces S2={1+2}={3},
and so on,
iteration 9: run process on result of iteration 8 (on S8): produces S9={1+9}={10},
iteration 10: run process on result of iteration 9 (on S9): no row of S9 matches n<10, so produce nothing, and that triggers the termination of the loop,
the final result in my_cte, which the outer SELECT will see, is the union of S0, S1, … and S9: {1,2,…,9,10}.

The initial set of rows is sometimes referred to as “the non-recursive SELECT”, “the anchor SELECT” or “the seed SELECT”. The SELECT describing the row production process is “the recursive SELECT”; it is “recursive” because it reads my_cte.

Here is a more complete syntax:

WITH RECURSIVE cte_name [list of column names ] AS
(
  SELECT ...      <-- specifies initial set
  UNION ALL
  SELECT ...      <-- specifies initial set
  UNION ALL
  ...
  SELECT ...      <-- specifies how to derive new rows
  UNION ALL
  SELECT ...      <-- specifies how to derive new rows
  ...
)
[, any number of other CTE definitions ]

WITH RECURSIVE cte_name [list of column names ] AS

(

SELECT ... <-- specifies initial set

UNION ALL

SELECT ... <-- specifies initial set

UNION ALL

...

SELECT ... <-- specifies how to derive new rows

UNION ALL

SELECT ... <-- specifies how to derive new rows

...

)

[, any number of other CTE definitions ]

Thus:

you can define the initial set as a union of several SELECTs
you can define the process to produce new rows as several SELECTs whose result is
union-ed at every iteration.
this CTE definition can be followed by another CTE definition, which may use the first CTE.
in a single WITH clause you can mix non-recursive CTEs and recursive CTEs (just note that as long as your clause has at least one recursive CTE it must start with the words WITH RECURSIVE).
the CTE’s column names, instead of being specified with AS aliases in the first SELECT, can be put right after the CTE’s name: cte_name(n).

Let’s take our 1-to-10 example, reduce it to 1-to-6 to save screen space, name the column using the my_cte(n) syntax, and use the result of my_cte to create a table:

USE test;
CREATE TABLE numbers
  WITH RECURSIVE my_cte(n) AS
  (
    SELECT 1
    UNION ALL
    SELECT 1+n FROM my_cte WHERE n<6
  )
  SELECT * FROM my_cte;
Query OK, 6 rows affected (0,40 sec)

SELECT * FROM numbers;
+------+
| n    |
+------+
|    1 |
|    2 |
|    3 |
|    4 |
|    5 |
|    6 |
+------+
6 rows in set (0,00 sec

USE test;

CREATE TABLE numbers

WITH RECURSIVE my_cte(n) AS

(

SELECT 1

UNION ALL

SELECT 1+n FROM my_cte WHERE n<6

)

SELECT * FROM my_cte;

Query OK, 6 rows affected (0,40 sec)

SELECT * FROM numbers;

+------+

| n |

+------+

| 1 |

| 2 |

| 3 |

| 4 |

| 5 |

| 6 |

+------+

6 rows in set (0,00 sec

The CTE can also be used in INSERT (and REPLACE):

INSERT INTO numbers
WITH RECURSIVE my_cte(n) AS
(
  SELECT 1
  UNION ALL
  SELECT 1+n FROM my_cte WHERE n<6
)
SELECT * FROM my_cte;
Query OK, 6 rows affected (0,12 sec)

SELECT * FROM numbers;
+------+
| n    |
+------+
|    1 |
|    2 |
|    3 |
|    4 |
|    5 |
|    6 |
|    1 |
|    2 |
|    3 |
|    4 |
|    5 |
|    6 |
+------+
12 rows in set (0,00 sec)

INSERT INTO numbers

WITH RECURSIVE my_cte(n) AS

(

SELECT 1

UNION ALL

SELECT 1+n FROM my_cte WHERE n<6

)

SELECT * FROM my_cte;

Query OK, 6 rows affected (0,12 sec)

SELECT * FROM numbers;

+------+

| n |

+------+

| 1 |

| 2 |

| 3 |

| 4 |

| 5 |

| 6 |

| 1 |

| 2 |

| 3 |

| 4 |

| 5 |

| 6 |

+------+

12 rows in set (0,00 sec)

In UPDATE (single-table and multi-table):

WITH RECURSIVE my_cte(n) AS
(
  SELECT 1
  UNION ALL
  SELECT 1+n FROM my_cte WHERE n<6
)
UPDATE numbers, my_cte
# Change to 0...
SET numbers.n=0
# ... the numbers which are squares, i.e. 1 and 4
WHERE numbers.n=my_cte.n*my_cte.n;
Query OK, 4 rows affected (0,01 sec)

SELECT * FROM numbers;
+------+
| n    |
+------+
|    0 |
|    2 |
|    3 |
|    0 |
|    5 |
|    6 |
|    0 |
|    2 |
|    3 |
|    0 |
|    5 |
|    6 |
+------+
12 rows in set (0,00 sec)

WITH RECURSIVE my_cte(n) AS

(

SELECT 1

UNION ALL

SELECT 1+n FROM my_cte WHERE n<6

)

UPDATE numbers, my_cte

# Change to 0...

SET numbers.n=0

# ... the numbers which are squares, i.e. 1 and 4

WHERE numbers.n=my_cte.n*my_cte.n;

Query OK, 4 rows affected (0,01 sec)

SELECT * FROM numbers;

+------+

| n |

+------+

| 0 |

| 2 |

| 3 |

| 0 |

| 5 |

| 6 |

| 0 |

| 2 |

| 3 |

| 0 |

| 5 |

| 6 |

+------+

12 rows in set (0,00 sec)

And in DELETE (single-table and multi-table); notice how the CTE is referenced from a subquery:

WITH RECURSIVE my_cte(n) AS
(
  SELECT 1
  UNION ALL
  SELECT 1+n FROM my_cte WHERE n<6
)
DELETE FROM numbers
# delete the numbers greater than the average of 1,...,6 (=3.5)
WHERE numbers.n > (SELECT AVG(n) FROM my_cte);
Query OK, 4 rows affected (0,01 sec)

SELECT * FROM numbers;
+------+
| n    |
+------+
|    0 |
|    2 |
|    3 |
|    0 |
|    0 |
|    2 |
|    3 |
|    0 |
+------+
8 rows in set (0,00 sec)

WITH RECURSIVE my_cte(n) AS

(

SELECT 1

UNION ALL

SELECT 1+n FROM my_cte WHERE n<6

)

DELETE FROM numbers

# delete the numbers greater than the average of 1,...,6 (=3.5)

WHERE numbers.n > (SELECT AVG(n) FROM my_cte);

Query OK, 4 rows affected (0,01 sec)

SELECT * FROM numbers;

+------+

| n |

+------+

| 0 |

| 2 |

| 3 |

| 0 |

| 2 |

| 3 |

| 0 |

+------+

8 rows in set (0,00 sec)

And in a similar DELETE, where the CTE is defined in the subquery itself, not anymore in front of DELETE:

DELETE FROM numbers 
WHERE numbers.n >
  (
    WITH RECURSIVE my_cte(n) AS
    (
      SELECT 1
      UNION ALL
      SELECT 1+n FROM my_cte WHERE n<6
    )
    # Half the average is 3.5/2=1.75
    SELECT AVG(n)/2 FROM my_cte
  );
Query OK, 4 rows affected (0,07 sec)

SELECT * FROM numbers;
+------+
| n    |
+------+
|    0 |
|    0 |
|    0 |
|    0 |
+------+
4 rows in set (0,00 sec)

DELETE FROM numbers

WHERE numbers.n >

(

WITH RECURSIVE my_cte(n) AS

(

SELECT 1

UNION ALL

SELECT 1+n FROM my_cte WHERE n<6

)

# Half the average is 3.5/2=1.75

SELECT AVG(n)/2 FROM my_cte

);

Query OK, 4 rows affected (0,07 sec)

SELECT * FROM numbers;

+------+

| n |

+------+

| 0 |

+------+

4 rows in set (0,00 sec)

Pretty flexible, isn’t it?

A word of caution: what would happen if I forgot to type the WHERE n<6 clause? It looks like the query would run forever, as it would always produce new rows (no reason to stop at 6, anymore). But, starting from MySQL 8.0.3, there is an upper bound on the number of allowed iterations (1000), so the query would stop with an error when it has done 1000 iterations; and that bound can be increased or decreased by setting variable @@cte_max_recursion_depth. In older versions, I would have to rely on other means: if using the mysql command-line client, I could stop that by typing Control-C; with another client, I would open another session and use KILL ; or I could put this at the start of my session:

SET max_execution_time = 10000;

1	SET max_execution_time = 10000;

so that the runaway query aborts automatically after 10 seconds, if the WHERE clause wasn’t correct.

That’s all for today. In next posts, I will continue on this topic, but there is already enough here to allow you to experiment with the feature. Before you do, please take note of two things which can save your time:

If you use a recursive CTE to create longer and longer strings (for example by using CONCAT() in the recursive SELECT), remember that the CTE’s column type is determined based on the non-recursive SELECTs only: so, in the non-recursive SELECTs, make the column wide enough with CAST (… AS CHAR(<length>)), or your long strings may not fit.
Inside the recursive CTE definition (the part in AS (…)), some syntax constraints must be respected (the reason for which will become clear in next posts):
- all SELECTs must be connected with UNION ALL or UNION DISTINCT ¹
- a recursive SELECT mustn’t contain GROUP BY, aggregate functions
  (like SUM), ORDER BY, LIMIT, DISTINCT (this rule doesn’t apply to the non-recursive/anchor/seed SELECT)
- a recursive SELECT must reference the CTE only once and only in its
  FROM clause, not in any subquery. Of course, it can additionally reference other tables than the CTE and join them with the CTE, which can be very useful to build hierarchies (for example, if we have a table of bosses and employees and want to answer the question “who reports directly or indirectly to Mrs. X?”). If used in a JOIN like this, the CTE must not be on the right side of a LEFT JOIN.

That’s it! And, as always,

select unhex("5468616E6B20796F7520666F722063686F6F73696E67204D7953514C21") as final_words;
+-------------------------------+
| final_words                   |
+-------------------------------+
| Thank you for choosing MySQL! |
+-------------------------------+

select unhex("5468616E6B20796F7520666F722063686F6F73696E67204D7953514C21") as final_words;

+-------------------------------+

| final_words |

+-------------------------------+

| Thank you for choosing MySQL! |

+-------------------------------+

1. MySQL 8.0.1 supports both types of unions; the older Labs release supported only UNION ALL.