[Update of April 2017: the features discussed here are now also available in the official release 8.0.1 of MySQL. And this series of posts has been continued, here.]
Here is the third in a series of posts about CTEs, a new feature of MySQL 8.0, available in this Labs release. In the first post we had explored the new SQL syntax, and in the second we had applied it to generating series.
Today we’ll look into the most famous use of recursive CTEs: manipulating hierarchical data. That is, data in SQL tables representing a hierarchy like
- CEO -> VP -> manager -> employee
- item -> sub-item -> sub-sub-item
- parent -> son -> grand-son
- email or forum thread (question -> reply -> reply to the reply)
- town -> region -> state
These hierarchies can be arbitrarily deep: that may be a large corporation, or a complex assembly, a numerous family, a heated email debate… From that data, a lot of interesting questions can be answered:
- Human Resources is trying to spot over- or under-used VPs, they ask: how many employees directly or indirectly reporting to each VP?
- VP is considering dropping product X from the catalog, she asks: what’s the cost of all items necessary to make the assembly X ?
- Genealogy: how many people are descendants of Mrs X?
- MySQL engineers are trying to spot which feature is a recurring problem to users, they ask : what are the topics in the “MySQL help forum” which had more than 20 replies in the last month?
All these questions can be answered by exploring the hierarchical data, accumulating information as we explore it, and presenting summary information to the requester. How to do that in practice, heavily depends on the chosen data model. There are at least two famous data models to represent hierarchies:
- Adjacency list
- Nested set
My former colleague Mike Hillyer has a nice explanation of the two models : check them out in paragraphs “Introduction”, “the Adjacency List Model” and “The Nested Set Model” here. It was written at the times of MySQL 4.1 but it still holds.
In his blog Mike has two points:
- arbitrarily-deep hierarchies are difficult to handle with traditional SQL: either a stored procedure must be written (with WHILE loops and recursive procedure calls, to find an item, then find its sub-items, then find sub-sub-items of the sub-items), or a maximum depth N must be assumed and then a N-table JOIN can be used.
- the nested set model is an easier-to-use alternative.
However, as we will shortly see, the advent of recursive CTEs puts the adjacency list model back on stage.
To prove my point, what I’m going to do now is to go through Mike’s post, review each problem he solved with the nested set model, and show how a recursive CTE makes it possible to solve the same problem with the adjacency list model. I’ll be using the same table data as Mike: table category, with columns category_id, name, parent : every row is a category which has a unique ID, a name, and the ID of its parent category. For example:
1
2
3
4
5
6
7
|
+-------------+----------------------+--------+ | category_id | name | parent | +-------------+----------------------+--------+ | 1 | ELECTRONICS | NULL | | 2 | TELEVISIONS | 1 | | 3 | TUBE | 2 | etc ... |
“Tube” (ID 3) belongs to parent having ID 2 which is “Televisions”; “Televisions” (ID 2) belongs to parent having ID 1 which is “Electronics”; “Electronics” (ID 1) is the top category as it has no parent (NULL)1.
I really encourage you to check the tree diagram in Mike’s “Introduction” section, it makes it clear how every category relates to each other.
Now, let’s go.
Retrieving a Full Tree
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
|
WITH RECURSIVE cte AS ( # seed SELECT SELECT category_id, name FROM category WHERE parent IS NULL UNION ALL # recursive SELECT SELECT c.category_id, c.name FROM category c JOIN cte ON cte.category_id=c.parent # find children ) SELECT name FROM cte; +-------------+----------------------+ | category_id | name | +-------------+----------------------+ | 1 | ELECTRONICS | | 2 | TELEVISIONS | | 6 | PORTABLE ELECTRONICS | | 3 | TUBE | | 4 | LCD | | 5 | PLASMA | | 7 | MP3 PLAYERS | | 9 | CD PLAYERS | | 10 | 2 WAY RADIOS | | 8 | FLASH | +-------------+----------------------+ 10 rows in set (0,00 sec) |
What this does is: start with the top category (which we grab with WHERE parent IS NULL), it is “Electronics”, then find all categories which have it as parent: that gives us the categories right below “Electronics”, those at “depth 1” in the tree (“Televisions”, “Portable Electronics”). Then iterate: find all categories which have the “depth 1” categories as parent; that gives us the categories at “depth 2”. And so on: at depth 3 there is only “Flash”, and the recursive SELECT finds nothing at depth 4: so it returns no new rows, and the CTE is done.
You may notice that the query’s result is the same as in Mike’s blog (it better be!), though not in the same order. Mike used ORDER BY node.lft which achieves a so-called depth-first order: you see the children of “Televisions” (“Tube”, …) before the siblings of “Television” (“Portable Electronics”). The order of my result is apparently breadth-first (siblings before children); however, as usual in SQL, if you don’t write ORDER BY, no order is guaranteed. If we want an order, let’s request it:
- to get breadth-first order: add an integer column “depth”, increment it as you dive deeper, and order by it:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
|
WITH RECURSIVE cte AS ( SELECT category_id, name, 0 AS depth FROM category WHERE parent IS NULL UNION ALL SELECT c.category_id, c.name, cte.depth+1 FROM category c JOIN cte ON cte.category_id=c.parent ) SELECT * FROM cte ORDER BY depth; +-------------+----------------------+-------+ | category_id | name | depth | +-------------+----------------------+-------+ | 1 | ELECTRONICS | 0 | | 2 | TELEVISIONS | 1 | | 6 | PORTABLE ELECTRONICS | 1 | | 7 | MP3 PLAYERS | 2 | | 9 | CD PLAYERS | 2 | | 10 | 2 WAY RADIOS | 2 | | 3 | TUBE | 2 | | 4 | LCD | 2 | | 5 | PLASMA | 2 | | 8 | FLASH | 3 | +-------------+----------------------+-------+ |
- to get depth-first order: add a character-string column “path”, extend it as you dive down, and order by it:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
|
WITH RECURSIVE cte AS ( SELECT category_id, name, CAST(category_id AS CHAR(200)) AS path FROM category WHERE parent IS NULL UNION ALL SELECT c.category_id, c.name, CONCAT(cte.path, ",", c.category_id) FROM category c JOIN cte ON cte.category_id=c.parent ) SELECT * FROM cte ORDER BY path; +-------------+----------------------+---------+ | category_id | name | path | +-------------+----------------------+---------+ | 1 | ELECTRONICS | 1 | | 2 | TELEVISIONS | 1,2 | | 3 | TUBE | 1,2,3 | | 4 | LCD | 1,2,4 | | 5 | PLASMA | 1,2,5 | | 6 | PORTABLE ELECTRONICS | 1,6 | | 10 | 2 WAY RADIOS | 1,6,10 | | 7 | MP3 PLAYERS | 1,6,7 | | 8 | FLASH | 1,6,7,8 | | 9 | CD PLAYERS | 1,6,9 | +-------------+----------------------+---------+ |
I had already mentioned the necessity of CAST here : the type of the path column is determined from the anchor SELECT, and it has to be long enough to fit the path of the deepest leaf in the tree: CAST ensures that.
Finding all the leaf nodes
No change here, the adjacency list model already allowed to do that, no need for CTEs: just find rows which ID is not the parent of another row; a LEFT JOIN or a subquery with a NOT EXISTS or NOT IN predicate can do that; as Mike has given the LEFT JOIN solution I’ll give the subquery one, just for fun:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
|
SELECT category_id, name FROM category WHERE category_id NOT IN # IDs of all parents: (SELECT parent FROM category WHERE parent IS NOT NULL); +-------------+--------------+ | category_id | name | +-------------+--------------+ | 3 | TUBE | | 4 | LCD | | 5 | PLASMA | | 8 | FLASH | | 9 | CD PLAYERS | | 10 | 2 WAY RADIOS | +-------------+--------------+ 6 rows in set (0,00 sec) |
Retrieving a Single Path
Say, the path of “Flash”: start from “Flash” and go to its parent, then to the parent of its parent, up to the top:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
|
WITH RECURSIVE cte AS ( SELECT name, parent FROM category WHERE name='FLASH' UNION ALL SELECT c.name, c.parent FROM category c JOIN cte ON c.category_id=cte.parent # find parent ) SELECT * FROM cte; +----------------------+--------+ | name | parent | +----------------------+--------+ | FLASH | 7 | | MP3 PLAYERS | 6 | | PORTABLE ELECTRONICS | 1 | | ELECTRONICS | NULL | +----------------------+--------+ 4 rows in set (0,00 sec) |
The order of rows is from child to parent; if the opposite order is desired, add a depth and order by it:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
|
WITH RECURSIVE cte AS ( SELECT name, parent, 0 as depth FROM category WHERE name='FLASH' UNION ALL SELECT c.name, c.parent, cte.depth-1 FROM category c JOIN cte ON c.category_id=cte.parent ) SELECT * FROM cte ORDER BY depth; +----------------------+--------+-------+ | name | parent | depth | +----------------------+--------+-------+ | ELECTRONICS | NULL | -3 | | PORTABLE ELECTRONICS | 1 | -2 | | MP3 PLAYERS | 6 | -1 | | FLASH | 7 | 0 | +----------------------+--------+-------+ |
Finding the Depth of the Nodes
We want one item, then right under it we want its children indented to the right. So we want depth-first order (children before siblings): a path column is needed. The amount of necessary indentation is easily calculated, by adding a depth column (it would also be possible, instead, to count the number of commas in path and use that as depth):
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
|
WITH RECURSIVE cte AS ( SELECT category_id, CAST(name AS CHAR(200)) AS name, CAST(category_id AS CHAR(200)) AS path, 0 as depth FROM category WHERE parent IS NULL UNION ALL SELECT c.category_id, CONCAT(REPEAT(' ', cte.depth+1), c.name), # indentation CONCAT(cte.path, ",", c.category_id), cte.depth+1 FROM category c JOIN cte ON cte.category_id=c.parent ) SELECT * FROM cte ORDER BY path; +-------------+-----------------------+---------+-------+ | category_id | name | path | depth | +-------------+-----------------------+---------+-------+ | 1 | ELECTRONICS | 1 | 0 | | 2 | TELEVISIONS | 1,2 | 1 | | 3 | TUBE | 1,2,3 | 2 | | 4 | LCD | 1,2,4 | 2 | | 5 | PLASMA | 1,2,5 | 2 | | 6 | PORTABLE ELECTRONICS | 1,6 | 1 | | 10 | 2 WAY RADIOS | 1,6,10 | 2 | | 7 | MP3 PLAYERS | 1,6,7 | 2 | | 8 | FLASH | 1,6,7,8 | 3 | | 9 | CD PLAYERS | 1,6,9 | 2 | +-------------+-----------------------+---------+-------+ 10 rows in set (0,00 sec) |
Depth of a sub-tree
We want to see the sub-tree of a certain category, and to know the depth of every sub-category. Start from the sub-tree’s root and add a depth column. Depth-first order is wanted, so add a path column:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
|
WITH RECURSIVE cte AS ( SELECT category_id, name, CAST(category_id AS CHAR(200)) AS path, 0 as depth FROM category WHERE name='PORTABLE ELECTRONICS' # sub-tree root UNION ALL SELECT c.category_id, c.name, CONCAT(cte.path, ",", c.category_id), cte.depth+1 FROM category c JOIN cte ON cte.category_id=c.parent ) SELECT * FROM cte ORDER BY path; +-------------+----------------------+-------+-------+ | category_id | name | path | depth | +-------------+----------------------+-------+-------+ | 6 | PORTABLE ELECTRONICS | 6 | 0 | | 10 | 2 WAY RADIOS | 6,10 | 1 | | 7 | MP3 PLAYERS | 6,7 | 1 | | 8 | FLASH | 6,7,8 | 2 | | 9 | CD PLAYERS | 6,9 | 1 | +-------------+----------------------+-------+-------+ 5 rows in set (0,00 sec) |
Find the immediate subordinates of a node
Same query as above, but limit the dive to those which have depth=0; no need for a path as they’re all on the same level:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
|
WITH RECURSIVE cte AS ( SELECT category_id, name, 0 as depth FROM category WHERE name='PORTABLE ELECTRONICS' UNION ALL SELECT c.category_id, c.name, cte.depth+1 FROM category c JOIN cte ON cte.category_id=c.parent WHERE cte.depth=0 ) SELECT * FROM cte; +-------------+----------------------+-------+ | category_id | name | depth | +-------------+----------------------+-------+ | 6 | PORTABLE ELECTRONICS | 0 | | 7 | MP3 PLAYERS | 1 | | 9 | CD PLAYERS | 1 | | 10 | 2 WAY RADIOS | 1 | +-------------+----------------------+-------+ 4 rows in set (0,00 sec) |
Aggregate functions in a nested set
Adding the same product table as Mike, we want to know how many products a category has; including indirect containment. So we first get categories which directly contain a product, then we join those categories with their parents, to reflect that parents also contain the said product. In the result of the CTE, we have one row per “container and (in)directly contained product”, and we segment that per container, with GROUP BY:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
|
WITH RECURSIVE cte AS ( SELECT c.category_id, c.name AS cat_name, c.parent, p.name AS prod_name FROM category c JOIN product p ON c.category_id=p.category_id UNION ALL SELECT c.category_id, c.name, c.parent, cte.prod_name FROM cte JOIN category c ON c.category_id=cte.parent ) SELECT cat_name, COUNT(*) AS prod_in_cat FROM cte GROUP BY cat_name; +----------------------+-------------+ | cat_name | prod_in_cat | +----------------------+-------------+ | 2 WAY RADIOS | 1 | | CD PLAYERS | 2 | | ELECTRONICS | 10 | | FLASH | 1 | | LCD | 1 | | MP3 PLAYERS | 2 | | PLASMA | 2 | | PORTABLE ELECTRONICS | 5 | | TELEVISIONS | 5 | | TUBE | 2 | +----------------------+-------------+ 10 rows in set (0,01 sec) |
Conclusion
Today with recursive CTEs it is totally possible to answer complicated questions about a hierarchy represented in the adjacency list model: depth, sub-tree, immediate children, breadth-first order, depth-first order…
Does that mean that the nested set model is dead? It is not. If you compare Mike’s queries with mine, his are often shorter. On the other hand, queries to maintain the nested set model in the event of adding/deleting a node, have to be multi-statement ones and must use concurrency control (you can see them in the “Adding new nodes” and “Deleting nodes” section in Mike’s blog). Whereas in the adjacency list model, adding is one short INSERT. The easier insertions, harder reads of that model is easy to explain: in that model, only the direct containment is represented in the table (one row points to its parent). In the nested set model, indirect containment is also represented (with the numeric intervals lft and rgt, any two rows are easily put into relation): in other words, the nested set model puts more information right into the table; it makes reading easier as more information is readily available, and it makes writing harder as more information has to be computed and saved at writing time.
Food for thought…
That was a long blog… it is appropriate to say:
1
2
3
4
5
6
|
select uncompress(x'1F000000789C0BC948CCCB56A8CC2F5548CB2F52284A4D4CC9CC4B5728C9C82C56484B2C520400B5BD0B27') as final_words; +---------------------------------+ | final_words | +---------------------------------+ | Thank you for reading this far! | +---------------------------------+ |
1. SQL authors tend to use the words adjacency list model for such table so I’ll use them too throughout this post, for consistency. But a real adjacency list model would contain one row per node, and in that row, the list of all adjacent nodes – nodes which are directly linked to this node. So, the row of TELEVISIONS would list both its parent ELECTRONICS and its children (TUBE, etc). In our table the row of TELEVISIONS lists only its parent, so our table is actually a collection of edges, it’s in fact using the edge list model. Thanks Peter Brawley for pointing this out. More information here.