WL#8652: Lateral derived tables

Affects: Server-8.0 — Status: Complete

Description
Requirements
High Level Architecture
Low Level Design

WHAT
====

Support SQL:99 feature "lateral derived tables".

WHY
===

SQL Server has it (named CROSS APPLY), PG too, Oracle, DB2, and it's
not too hard to implement. It gives "one more SQL feature" to MySQL 8.0.
Ref: https://bugs.mysql.com/bug.php?id=78930 

WHAT, EXACTLY
=============

See slides 3-11 in
https://www.slideshare.net/MarkusWinand/modern-sql
and/or read my examples below.

If I have a table of salespeople (one row describes a salesperson),
and a table of all_sales (one row describes a sale: salesperson,
customer, amount, date), and I want to know, for each salesperson, the
biggest deal ever (size & customer), I can try two approaches.

1) First approach: for each salesperson, calculate the maximum (gives the
size) and then find the customer which provided this maximum.
In MySQL this can be done like this:

select salespeople.name,
       -- find max deal size for this salesperson
       ( select max(amount) as amount from all_sales
         where all_sales.salesperson_id = salespeople.id
       )
       as amount,
       -- find customer for this max size
       ( select customer_name
         from all_sales where all_sales.salesperson_id = salespeople.id
         and all_sales.amount =
              -- the max size, again
              ( select max(amount) as amount from all_sales
                where all_sales.salesperson_id = salespeople.id
              )
       )
       as customer_name
from
salespeople;

That calculates the max size twice per salesperson, once in the first
subquery and once in the second, which is inefficient. We would like
to calculate it once per salesperson and "cache" it.

Usually to cache results we can use a derived table, so we try:

select salespeople.name,
       max_sale.amount,
       customer_of_max_sale.customer_name
from
salespeople,
-- calculate max size and cache it in transient tmp table "max_sale"
( select max(amount) as amount from all_sales
  where all_sales.salesperson_id = salespeople.id
)
as max_sale,
-- find customer, reusing cached max size
( select customer_name
  from all_sales where all_sales.salesperson_id = salespeople.id
  and all_sales.amount =
  -- the cached max size
  max_sale.amount
)
as customer_of_max_sale;

This is illegal in SQL92 because a derived table cannot depend on
other tables in the same FROM clause (it has to be constant over the
query's duration).
In SQL99 it becomes legal if one adds the keyword LATERAL:

select salespeople.name,
       max_sale.amount,
       customer_of_max_sale.customer_name
from
salespeople,
-- calculate max size and cache it in transient tmp table "max_sale"
LATERAL
( select max(amount) as amount from all_sales
  where all_sales.salesperson_id = salespeople.id
)
as max_sale,
-- find customer, reusing cached max size
LATERAL
( select customer_name
  from all_sales where all_sales.salesperson_id = salespeople.id
  and all_sales.amount = max_sale.amount
)
as customer_of_max_sale;

which means "this derived table depends on previous tables on its
side" (which are... lateral).

2) Second approach: another solution to achieve our goal could be:

select salespeople.name,
       -- find max size and customer in one go
       ( select amount, customer_name from all_sales
         where all_sales.salesperson_id = salespeople.id
         order by amount desc limit 1
       )
from
salespeople;

it is efficient but illegal as a subquery in the SELECT list may not
return more than one column, in SQL.

We can rewrite to a derived table, which can have two columns:

select salespeople.name, max_sale.amount, max_sale.customer_name
from
salespeople,
-- find max size and customer in one go
( select amount, customer_name from all_sales
  where all_sales.salesperson_id = salespeople.id
  order by amount desc limit 1
)
as max_sale;

but then our derived table is dependent on the salespeople table,
which requires support for LATERAL.

In short, LATERAL is the efficient solution to all drawbacks in the
two tried approaches.

F-1 Support LATERAL keyword, reserved in SQL:99, in front of a derived table's
definition. A derived table with a LATERAL keyword can only occur in FROM,
either in the comma-separated list of tables, or inside a JOIN part:
SELECT ... FROM
t1, ...,
LATERAL (SELECT ... WHERE t1.a=...) AS alias,
t2 LEFT JOIN (t3 JOIN LATERAL (SELECT ...) AS alias2 ON ...) ON ...;
There can be several derived tables with a LATERAL keyword for each.

F-2 if a derived table is prefixed with the LATERAL keyword, in its definition
it can reference columns from tables which precede it in the FROM clause.

F-3 if a derived table is prefixed with the LATERAL keyword, it is re-evaluated
so that its content is always up-to-date with the referenced outer columns. For
example, in F-1, the content of 'alias' has to change every time a new row of t1
is processed by the top query.

F-4 if a lateral derived table 'alias' depends on table t1, it is forbidden that
there be a LEFT JOIN with 'alias' on left and t1 on right (or a symmetric RIGHT
JOIN). For example these should return error:
SELECT ... FROM
t1 RIGHT JOIN LATERAL (SELECT ... WHERE t1.a=...) AS alias ON 1;
SELECT ... FROM
LATERAL (SELECT ... WHERE t1.a=...) AS alias LEFT JOIN t1 ON 1;

F-5 This feature should not affect views and CTEs, only derived tables: there is
no SQL syntax to specify that a view/CTE is "lateral"; but using LATERAL for a
derived table inside the definition of a CTE/view is ok.

F-6 in EXPLAIN and in optimizer trace it should be visible at which stage in
execution a lateral derived table is rematerialized.

F-7 in optimizer trace the specific cost calculations for a lateral derived
table should be visible.

NF-1 depending on its definition, a lateral derived table may be merged in outer
query or materialized into a tmp table.

NF-2 if a lateral derived table turns out, after analysis, to not contain
references to other tables, the query should execute as fast as if the LATERAL
word had not been specified, and as fast as MySQL 8.0.

NF-3 in agreement with the SQL standard, a table function has an implicit
LATERAL, so it behaves as in previous 8.0 versions. However, in agreement with
the standard, the LATERAL word isn't allowed before JSON_TABLE, even though it's
implicit.

==Syntax specification==

The syntax of a lateral derived table is the same as a non-lateral derived table
except that the keyword LATERAL is specified before the derived table specification.

Example:

SELECT
  (
   SELECT 1
   FROM t1 JOIN LATERAL (SELECT *
                        FROM t2
                        WHERE t1.x = t2.x AND t2.y = t0.z
                       ) lateral_derived_table
       ON true
  ) FROM t0;

Notice the WHERE clause, which references a column from the table t1 in the same
query block as where the lateral derived table is defined.

Note that a derived table 'dt' may have two types of outer references:
1) outer ref to a table in the same FROM clause as 'dt', if 'dt' has LATERAL;
this is the case of t1.x. We'll call them "FROM-clause outer refs".
2) outer ref to a table in any query outer to the query which owns the FROM
clause where 'dt' is; this is the case of t0.z. We'll call them "non-FROM-clause
outer refs", and supporting them is being done in WL#461.

==Limitations==

* Per the SQL standard, if a lateral derived table is in the right operand of a
JOIN clause and contains a reference to the left operand, then the join
operation must be an INNER JOIN, CROSS JOIN or LEFT OUTER JOIN.

* If a lateral derived table 'dt' references an aggregate function, this
function's aggregation query cannot be that which owns the FROM clause where
'dt' is.

==Implementation==

* For a lateral derived table, the outer name resolution context of the subquery
must be the context of the outer query block, so that search in FROM-clause
tables is possible; for a non-lateral derived table, the outer context must be
the outer query block of the outer query block.

* After resolving a lateral derived table, check whether it contains FROM-clause
outer refs. It is doesn't, consider it's not lateral.

* For a lateral derived table, make sure that the join type with respect to
tables of FROM-clause outer refs are correct (inner join and left join allowed).

* Both materialization and merging can be used to evaluate lateral derived
tables. The same rules as for non-lateral derived tables apply when deciding
whether to merge or materialize.

* A lateral derived table is marked as "dependent on" the tables of its
FROM-clause outer refs (by setting TABLE_LIST::dep_tables); this makes sure that
it is put at a right place in the optimizer's plan.

* A lateral derived table cannot use join buffering, as it depends on the row of
a previous table.

* A lateral derived table cannot be marked as "const" by the optimizer.

* The lateral derived table is evaluated for each row of tables it has
references to.

* For a materialized lateral derived, make sure to empty and materialize it just
before it is read by the join executor (in QEP_TAB::prepare_scan()). For a
materialized derived table 'dt' having only non-FROM-clause outer references,
empty it only when the subquery owning 'dt' starts executing, and then
materialize (thus, once per subquery execution) it in QEP_TAB::prepare_scan().
For a materialized derived table having no outer ref, never empty it, and
materialize it (thus, once for all) in QEP_TAB::prepare_scan().

DT means "derived table".

Parsing
=======

LATERAL reserved word is added.
PT_derived_table gets a new bool argument, lateral=true/false.

Resolution
==========

* If it's LATERAL, a DT has an object of type Lateral, accessible through
TABLE_LIST::derived_unit()::m_lateral . It contains table maps of FROM-clause
outer refs.

* If it's lateral, name resolution cannot search beyond the DT; for that we
reuse the existing SELECT_LEX::end_lateral_table member.

* The logic in Item_field::fix_outer_field() and Item_ref::fix_fields() assumes
that when searching for an outer reference, all the way from the reference up to
the qualifying query is made of subqueries having Item_subselects, which isn't
true anymore as some subqueries may be DTs. They also assume that all subqueries
along the way should be searched in, which is not true anymore because if we
meet, along the way, a non-lateral DT, and have to search above it, then we
shouldn't search "right above it" (as it's not lateral) but "above of above".
Thus those functions have to be modified. They update m_lateral with maps. They
call mark_as_dependent() which marks SELECT_LEX_UNITs with UNCACHEABLE_DEPENDENT.

* If after resolution, m_lateral's map is still 0, m_lateral is set to nullptr,
the DT isn't LATERAL anymore (exception: we don't do this if
context_analysis_only). So:
** a DT with FROM-clause refs has m_lateral!=nullptr and a non-zero m_lateral's
map, and its unit has UNCACHEABLE_DEPENDENT
** a DT with only non-FROM-clause refs has m_lateral==nullptr and its unit has
UNCACHEABLE_DEPENDENT
** a DT with no outer ref has m_lateral==nullptr and its unit does not have
UNCACHEABLE_DEPENDENT

* TABLE_LIST::resolve_derived() cannot reset allow_sum_func to 0 anymore, as a
DT may contain an aggregate function dependent on an outer query block. Thus
allow_sum_func is reset to 0 only in LEX::reset().

* As a lateral DT dt2 may have a ref to another lateral DT dt1 (its left
neighbour in FROM), before resolving dt2 we need to have resolved dt1 _and_
merged it or set up for materialization, so that we know if the ref in dt2 is,
for example, a column of a materialized tmp table (practically: without this,
find_field_in_table() would crash). So, SELECT_LEX::resolve_derived() is changed
to process each table entirely before moving to the next.

* If a lateral DT is merged, we set m_lateral==nullptr (as m_lateral is now
useless). If it is set up for materialization we add to its dep_tables the maps
in m_lateral.

* If a DT is in a subquery which is merged up (e.g. by a semijoin):
** if the DT contains a FROM-clause ref, this ref's map may change, so
fix_tables_after_pullout() updates the maps in DT's m_lateral. DT's dep_tables
is also updated.
** If the DT contains a non-FROM-clause ref, and this ref belongs to the query
block which we're merging in, then this ref becomes a FROM-clause ref; to notice
this, a re-processing of all clauses of the DT is necessary, so the code in
Item_subselect::fix_after_pullout() is moved to
SELECT_LEX_UNIT::fix_after_pullout(), so that both Item_subselect and DT can use
it. After this reprocessing, if the ref has become to FROM-clause, the DT
becomes lateral and is given a m_lateral object with the right maps.

* Independently of this WL, I saw mark_select_range_as_dependent() was unused so
deleted it.

* The check that rejects "T1 RIGHT JOIN JSON_TABLE(T1.a,...)", is reused for
lateral DTs and changed: it used to be a loop over all tables in
SELECT_LEX::prepare(), now it's done in TABLE_LIST::resolve_derived() and
TABLE_LIST::setup_table_function() (calls to check_right_lateral_join()).

Optimization
============

* In SELECT_LEX_UNIT::optimize(), if this unit contains an outer ref, set its
estimated row count to at least 2, otherwise the DT containing this unit would
be treated as constant and evaluated once for all.

* Consider
select * from t1, t2, lateral (select t1.x from...) dt;
and assume that 'dt' is materialized and the execution plan is t1-t2-dt.
We do not want to empty and rematerialize 'dt' for every combined row of t1-t2
(so, card(t1)*card(t2) times), while it is necessary only when we read a new row
of t1 (i.e. card(t1) times). This is implemented (see the Execution paragraph).
Thus, plan t1-t2-dt is be more optimal than t2-t1-dt, because the former will do
card(t1) rematerializations vs card(t2)*card(t1) for the latter. So we take this
into account in the planner.
Moreover, plan t1-t2-dt is more optimal than t1-t2_with_join_buffer-dt,
because the latter will shuffle rows of t1 so when reaching the stage of 'dt'
the rows will be presented "sorted by t2", so will cause again card(t2)*card(t1)
rematerializations. So we take this into account in the planner.
Another similar scenario: consider
select * from t1, lateral (select t1.x from...) dt1,
                  lateral (select t1.y from...) dt2;

* To achieve the above, in greedy search we maintain a bitmap,
JOIN::deps_of_remaining_lateral_derived_tables. It contains the map of all
tables which are dependencies of not-yet-in-the-partial-plan lateral DTs.
In greedy search, when we have a join prefix and are considering adding any
table to it: if the said map tells that there is a lateral DT which is not in
the prefix but one of its referenced tables is, prevent join buffering on the
to-be-added table (done in best_access_path()).
In greedy search, when we have a join prefix and are considering adding a
lateral DT to it: add to the cost of DT a "cost of materialization", which is
the cost of a single materialization (available in the DT's underlying JOIN
final plan) multiplied by the number of rows output by the last-in-plan table
which DT references (available in a POSITION structure).

* Once the final plan is decided, for example t1-t2-t3-t4-dt-t5, where 'dt'
depends on t1 and t3, in make_join_readinfo() we go through all tables, and for
each DT we identify the dependency of 'dt' which is last in plan (here: t3), and
we add to a bitmap t3->lateral_derived_tables_depend_on_me, the bit of 'dt'. We
also make sure that setup_join_buffering() doesn't enable join buffering when
forbidden.

Execution
=========

* In class SELECT_LEX_UNIT, a function is added which empties all CTEs defined
in the WITH clause of this unit which have UNCACHEABLE_DEPENDENT. This function
is called by SELECT_LEX_UNIT::execute().

* A function is added which empties all non-CTE non-view derived tables of a
FROM clause which have UNCACHEABLE_DEPENDENT; it is called by JOIN::reset().

* Because subselect_indexsubquery_engine::exec() is some sort of (shortcut)
subquery execution, it also needs to call the two functions above.

* Various places which called delete_all_rows() and extra(HA_RESET_STATE) on an
internal tmp table, are replaced with a single function TABLE::tmp_empty().

* While derived tables with non-FROM-clause outer refs are properly handled by
the points above, for a derived table 'dt' with FROM-clause outer refs
additional logic is necessary. Each lateral DT has a boolean member
"rematerialize". Each time we read a new row from a table T, we set
"rematerialize" to true for each table of T->lateral_derived_tables_depend_on_me
 . In QEP_TAB::prepare_scan() of a DT there pre-exists logic (added for
JSON_TABLE) which empties and re-fills the DT if "rematerialize" is true; right
after rematerialization we set "rematerialize" to false. So the lateral DT will
remain unchanged until a new row of its last dependency is read.

EXPLAIN, optimizer trace
========================

In EXPLAIN, on the EXPLAIN row of T, we list "Rematerialize (X,...)" where X is
any lateral DT whose rematerialization is triggered when a new row of T is read;
EXPLAIN uses T->lateral_derived_tables_depend_on_me for that. Optimizer trace
shows similar information.