WL#6057: Make semijoin work with single-table UPDATE/DELETE

Affects: Server-8.0 — Status: Complete

Description
Requirements
High Level Architecture
Low Level Design

These:
 UPDATE t1 SET x=y WHERE z IN (SELECT * FROM t2);
 DELETE FROM t1 WHERE z IN (SELECT * FROM t2);
do not use semijoin, whereas
 SELECT x FROM t1 WHERE z IN (SELECT * FROM t2);
does.
So it happens that the SELECT is fast (because it uses a semijoin technique)
whereas the UPDATE/DELETE is slow (because it uses an older technique, known as 
IN-to-EXISTS).

The initial reason for the limitation is the absence of a JOIN object in
UPDATE/DELETE.

Multi-table UPDATE/DELETE can use a semijoin because they have a JOIN object, 
which goes through optimization, including planning, with a cost-based choice of 
semijoin strategy, and creation of iterators capable of implementing the chosen
strategy; single-table UPDATE/DELETE, on the other hand, have an implementation
where it is hard-coded that there's a single table, no iterators...

Everything stated above is also true for subquery materialization (WL#1110).
Except for the reason why multi-table UPDATE/DELETE can use it: it's again
because they have a JOIN object, which contains an estimate of how many
matching rows the updated table has, and that estimate is accessed by
the subquery to do a cost-based choice of materialization versus IN-to-EXISTS.
Single-table UPDATE/DELETE don't have #rows information in a JOIN object.

Solution: when a single-table UPDATE/DELETE is getting prepared, if:
- it has a [NOT] IN/EXISTS(subquery) predicate
- and it has no ORDER BY and no LIMIT (*),
- and the target table doesn't support read-before-write-removal (**)
- and semijoin or subquery materialization are allowed (based on the hints in 
the subquery and on @@optimizer_switch),
then there is a potential win to use the multi-table UPDATE/DELETE code path so 
we do it.

If the user knows that, for his query, semijoin would bring more performance 
than semi-consistent-read or read-before-write-removal do,
and MySQL doesn't convert to multi-table per the conservative
rules above, then the user can still rewrite the query to multi-table 
explicitely.

User requests:
https://bugs.mysql.com/bug.php?id=35794
https://bugs.mysql.com/bug.php?id=96423

(*) these clauses are not supported by multi-table UPDATE/DELETE
(**) that is a performance feature of some storage engines which is not 
supported by multi-table UPDATE/DELETE; NDB supports it;
not InnoDB/MyISAM/MEMORY. The principle is (quoting from sql/handler.h):
  "Read before write removal may be used for storage engines which support
  write without previous read of the row to be updated. Handler returning
  this flag must implement start_read_removal() and end_read_removal().
  The handler may return "fake" rows constructed from the key of the row
  asked for. This is used to optimize UPDATE and DELETE by reducing the
  number of round-trips between handler and storage engine.
  Example:
  UPDATE a=1 WHERE pk IN (@)"
It isn't a priority to add this feature to multi-table UPDATE when a
semijoin is used: indeed, there will be at least two tables in the query,
with an equality condition between the two tables' columns. Say we have
t1 SEMIJOIN t2 ON t1.pk=t2.fk, and plan is t1-t2 (firstmatch). Read-removal
can't work for t1 (we need to reads rows of t1, we don't know its values of
t1.pk from the query). If plan is t2(loosescan)-t1, read-removal may work
for t1, in theory. But it's not implemented. If we did, we should also
implement it in multi-table UPDATE for joins (not only semijoins).

As part of this WL, support for semi-consistent read is added to multi-table 
UPDATE. Semi-consistent read is a performance feature, supported by InnoDB in 
isolation levels weaker than REPEATABLE READ; quoting from handler.h:
"   Normally, when running UPDATE or DELETE queries, we need to wait for other
    transactions to release their locks on a given row before we can read it and
    potentially update it. However, in READ UNCOMMITTED and READ COMMITTED, we
    can ignore these locks if we don't intend to modify the row (e.g., because
    it failed a WHERE). This is signaled through enabling "semi-consistent
    read"[...]
    If semi-consistent read is enabled, and we are in READ UNCOMMITTED or READ
    COMMITTED, the storage engine is permitted to return rows that are locked
    and thus un-updatable. If the optimizer doesn't want the row, e.g., because
    it got filtered out, it can call unlock_row() as usual. However, if it
    intends to update the row, it needs to call was_semi_consistent_read()
    before doing so. If was_semi_consistent_read() returns false, the row was
    never locked to begin with and can be updated as usual. However, if it
    returns 1, it was read optimistically, must be discarded (ie., do not try to
    update the row) and must be re-read with locking enabled. The next read call
    after was_semi_consistent_read() will automatically re-read the same row,
    this time with locking enabled."
In trunk, this feature is enabled only in single-table UPDATE. Thus, when this
WL converts a single-table UPDATE to multi-table UPDATE (due to the presence of
a subquery), it risks making certain  queries slower. The conservative way would 
be to not do the conversion if semi-consistent read were a possibility for this
query; but that can only be tested with "if updating an InnoDB table, and we're
in READ COMMITTED isolation mode", which many queries may match. Thus, this
conservative approach would seriously decrease the potential value of the WL.
Therefore, semi-consistent read is instead implemented in multi-table UPDATE, in 
this WL.

"Single code path for multi-table and single-table UPDATE (DELETE)" is a 
bigger-size change which could be a follow-up, later.

F1 - if a single-table UPDATE/DELETE has a [NOT] IN/EXISTS(subquery) predicate,
and has no ORDER BY and no LIMIT, and the target table doesn't support read-
before-write-removal (*), and semijoin or subquery materialization are allowed
(based on the hints in the subquery and on @@optimizer_switch), then the
statement should be internally converted to a multi-table one.

(*) defined in the "high-level description"

F2 - when the conversion happens, the subqueries should become eligible for 
semijoin and subquery materialization, on top of IN-to-EXISTS,
just like for a SELECT statement.

F3 - when the conversion happens, it should be visible in the optimizer trace: 
for a multi-table statement there is a "join_optimization" object
in the trace, whereas there is none for a single-table statement. Note that
the conversion is currently also visible in EXPLAIN FORMAT=TREE, as there a
single-table statement reports
""
while a multi-table one reports a full plan.

NF1 - performance of UPDATE/DELETE queries: if such query doesn't use any 
subquery, performance should not change; if it uses a subquery, performance may 
improve (that's the WL's goal); it may also degrade but only marginally then 
(maximum -5%) (if it is a single-table UPDATE/DELETE, now converted to multi-
table, and the multi-table code path is less efficient, does a wrong cost-based 
planning...). Some queries have been proved to improve significantly already (see 
the numbers in "QA Notes").