WL#6057: Make semijoin work with single-table UPDATE/DELETE
Affects: Server-8.0
—
Status: Complete
These: UPDATE t1 SET x=y WHERE z IN (SELECT * FROM t2); DELETE FROM t1 WHERE z IN (SELECT * FROM t2); do not use semijoin, whereas SELECT x FROM t1 WHERE z IN (SELECT * FROM t2); does. So it happens that the SELECT is fast (because it uses a semijoin technique) whereas the UPDATE/DELETE is slow (because it uses an older technique, known as IN-to-EXISTS). The initial reason for the limitation is the absence of a JOIN object in UPDATE/DELETE. Multi-table UPDATE/DELETE can use a semijoin because they have a JOIN object, which goes through optimization, including planning, with a cost-based choice of semijoin strategy, and creation of iterators capable of implementing the chosen strategy; single-table UPDATE/DELETE, on the other hand, have an implementation where it is hard-coded that there's a single table, no iterators... Everything stated above is also true for subquery materialization (WL#1110). Except for the reason why multi-table UPDATE/DELETE can use it: it's again because they have a JOIN object, which contains an estimate of how many matching rows the updated table has, and that estimate is accessed by the subquery to do a cost-based choice of materialization versus IN-to-EXISTS. Single-table UPDATE/DELETE don't have #rows information in a JOIN object. Solution: when a single-table UPDATE/DELETE is getting prepared, if: - it has a [NOT] IN/EXISTS(subquery) predicate - and it has no ORDER BY and no LIMIT (*), - and the target table doesn't support read-before-write-removal (**) - and semijoin or subquery materialization are allowed (based on the hints in the subquery and on @@optimizer_switch), then there is a potential win to use the multi-table UPDATE/DELETE code path so we do it. If the user knows that, for his query, semijoin would bring more performance than semi-consistent-read or read-before-write-removal do, and MySQL doesn't convert to multi-table per the conservative rules above, then the user can still rewrite the query to multi-table explicitely. User requests: https://bugs.mysql.com/bug.php?id=35794 https://bugs.mysql.com/bug.php?id=96423 (*) these clauses are not supported by multi-table UPDATE/DELETE (**) that is a performance feature of some storage engines which is not supported by multi-table UPDATE/DELETE; NDB supports it; not InnoDB/MyISAM/MEMORY. The principle is (quoting from sql/handler.h): "Read before write removal may be used for storage engines which support write without previous read of the row to be updated. Handler returning this flag must implement start_read_removal() and end_read_removal(). The handler may return "fake" rows constructed from the key of the row asked for. This is used to optimize UPDATE and DELETE by reducing the number of round-trips between handler and storage engine. Example: UPDATE a=1 WHERE pk IN (@)" It isn't a priority to add this feature to multi-table UPDATE when a semijoin is used: indeed, there will be at least two tables in the query, with an equality condition between the two tables' columns. Say we have t1 SEMIJOIN t2 ON t1.pk=t2.fk, and plan is t1-t2 (firstmatch). Read-removal can't work for t1 (we need to reads rows of t1, we don't know its values of t1.pk from the query). If plan is t2(loosescan)-t1, read-removal may work for t1, in theory. But it's not implemented. If we did, we should also implement it in multi-table UPDATE for joins (not only semijoins). As part of this WL, support for semi-consistent read is added to multi-table UPDATE. Semi-consistent read is a performance feature, supported by InnoDB in isolation levels weaker than REPEATABLE READ; quoting from handler.h: " Normally, when running UPDATE or DELETE queries, we need to wait for other transactions to release their locks on a given row before we can read it and potentially update it. However, in READ UNCOMMITTED and READ COMMITTED, we can ignore these locks if we don't intend to modify the row (e.g., because it failed a WHERE). This is signaled through enabling "semi-consistent read"[...] If semi-consistent read is enabled, and we are in READ UNCOMMITTED or READ COMMITTED, the storage engine is permitted to return rows that are locked and thus un-updatable. If the optimizer doesn't want the row, e.g., because it got filtered out, it can call unlock_row() as usual. However, if it intends to update the row, it needs to call was_semi_consistent_read() before doing so. If was_semi_consistent_read() returns false, the row was never locked to begin with and can be updated as usual. However, if it returns 1, it was read optimistically, must be discarded (ie., do not try to update the row) and must be re-read with locking enabled. The next read call after was_semi_consistent_read() will automatically re-read the same row, this time with locking enabled." In trunk, this feature is enabled only in single-table UPDATE. Thus, when this WL converts a single-table UPDATE to multi-table UPDATE (due to the presence of a subquery), it risks making certain queries slower. The conservative way would be to not do the conversion if semi-consistent read were a possibility for this query; but that can only be tested with "if updating an InnoDB table, and we're in READ COMMITTED isolation mode", which many queries may match. Thus, this conservative approach would seriously decrease the potential value of the WL. Therefore, semi-consistent read is instead implemented in multi-table UPDATE, in this WL. "Single code path for multi-table and single-table UPDATE (DELETE)" is a bigger-size change which could be a follow-up, later.
F1 - if a single-table UPDATE/DELETE has a [NOT] IN/EXISTS(subquery) predicate, and has no ORDER BY and no LIMIT, and the target table doesn't support read- before-write-removal (*), and semijoin or subquery materialization are allowed (based on the hints in the subquery and on @@optimizer_switch), then the statement should be internally converted to a multi-table one. (*) defined in the "high-level description" F2 - when the conversion happens, the subqueries should become eligible for semijoin and subquery materialization, on top of IN-to-EXISTS, just like for a SELECT statement. F3 - when the conversion happens, it should be visible in the optimizer trace: for a multi-table statement there is a "join_optimization" object in the trace, whereas there is none for a single-table statement. Note that the conversion is currently also visible in EXPLAIN FORMAT=TREE, as there a single-table statement reports "" while a multi-table one reports a full plan. NF1 - performance of UPDATE/DELETE queries: if such query doesn't use any subquery, performance should not change; if it uses a subquery, performance may improve (that's the WL's goal); it may also degrade but only marginally then (maximum -5%) (if it is a single-table UPDATE/DELETE, now converted to multi- table, and the multi-table code path is less efficient, does a wrong cost-based planning...). Some queries have been proved to improve significantly already (see the numbers in "QA Notes").
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.