WL#2980: Subquery optimization: Semijoin

Affects: Server-6.0   —   Status: Complete

The optimizer should handle certain subqueries so that   
they run as quickly as joins. The optimizer may choose 
the driving table, may use semijoins, may skip index 
keys for some index structures. The EXPLAIN statement 
will explain. 
   
Example:
SELECT * FROM t1 WHERE t1.key in (SELECT t2.key FROM t2)
->
SELECT * FROM t1, t2(distinct t2.key) WHERE t1.key=t2.key
 
Note: As agreed during the Athens Dev-MTX meeting,
Igor will write the specification for this item
(it may then be split into two separate tasks).
The current estimate is that it will take one
man-month to write the specification.


Note (added by Trudy Pelzer, 2006-05-15):
Per Brian Aker, the subquery optimization must handle
statements of the type:
UPDATE foo SET a=10 WHERE a in (SELECT a FROM foo,bar WHERE foo.b=20 AND
foo.a=17 AND bar.c=foo.d)
Brian and Igor have discussed this requirement.

 
 
WL#2980: Subquery optimization: Semijoin: HLS
*********************************************


1. Task scope 
2. NL-SJ applicability criteria
2.1 No issues with NULL comparisons
3. Subquery to semijoin conversion
4. Automatic satisfaction
5. Duplicate outer rows elimination strategy
6. First match strategy
7. Inner-to-outer strategy
8. Nested-Loops Semi-Join implementation
8.1 Possible join orders
8.1.1 Interleaving between semijoins
8.1.2 Relationship with outer joins
8.1.3 The check to be used to determine if we can add table
8.2 Executioner adjustments
8.3 Join optimizer adjustments 
9. EXPLAIN changes
10. To be covered in the LLD
11. Other notes





1. Task scope 
=============

Under certain conditions, a SELECT with IN/=ANY subquery at top level in the 
WHERE/ON clause can be executed by running Nested-Loops SemiJoin (further 
NL-SJ). That is, a select with subquery
   
  SELECT ...
  FROM ot1 ... otN 
  WHERE (oe1, ..., oeN) IN (SELECT ie1, ... ieN FROM it1, ... itN WHERE ...)

is converted into select with a semi-join
  
  SELECT ... FROM 
    (ot1, ..., otN) SJ (it1, ..., itK) ON sj_cond(ot*, it*)

which can be evaluated using NL-join code with the exception that we'll need
to satisfy this requirement:
 
    for any row combination of outer tables we 
    may not generate more than one row combination      (INNER-UNIQ-REQ)
    of inner tables.

To satisfy the requirement we will use the following strategies:

 1. AutomaticSatisfaction: prove that the query's equalities and unique/primary
    keys on the queried tables will guarantee that the requirement is met.

 2. DuplicateOuterRowsElimination: use a temporary table to weed out the 
    duplicate combinations of outer table rows.

 3. FirstMatch: in certain join orders we can avoid producing duplicate
    row combinations by short-cutting the join execution.
  
 4. DistinctStreamProduction: Scan the inner (subquery's) tables in a way that
    produces at most one match per each combination of outer rows.

The implementation will be structured as follows:

 - Subquery predicate to semi-join nest (i.e. TABLE_LIST) converter. This is
   used by all strategies
 - Procedure to pull the tables out of semi-join nests (AutomaticSatisfaction
   implementation)
 - Adjustments in the join optimizer to account for use of strategies #2-#4 on
   the execution phase.
 - Adjustments in the join executioner for strategies #2-#4.


2. NL-SJ applicability criteria
===============================
(WL#3740 has a copy of this section contents)

A subquery predicate can be replaced with semi-join if the following
conditions are met:

SUBQ-TO-SJ-CRITERIA:
  1. Subquery predicate has form 
       (...)  IN  (SELECT ...)   
       (...) =ANY (SELECT ...)    

  2. Subquery is a single SELECT (no UNIONs)

  3. Subquery does not have GROUP BY or ORDER BY

  4. Subquery does not use aggregate functions

  5. #tables-in-parent-query + #tables-in-subquery <= MAX_TABLES

  6. Subquery predicate is at the AND-top-level of ON/WHERE clause

Note that the subquery may have DISTINCT or LIMIT clauses.

2.1 No issues with NULL comparisons
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A note about NULL values: The subuqery predicate is required to be at top-level 
of WHERE/ON (where we don't distinguish between FALSE and NULL). The subquery 
predicate is not a NOT IN, so we will not need any special handling for NULL 
values.


3. Subquery to semijoin conversion
==================================
This is covered in WL#3740.


4. Automatic satisfaction
=========================
This is covered in WL#3740.

 
5. Duplicate outer rows elimination strategy
============================================

This is covered in WL#3741.


6. First match strategy
=======================

This is covered in WL#3750.


7. Inner-to-outer strategy
==========================

This is covered in WL#3751.



8. Nested-Loops Semi-Join implementation
========================================

8.1 Possible join orders
------------------------

8.1.1 Interleaving between semijoins
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Interleaving between semijoins is ok. Tables of one semi-join are considered
"unrelated" tables for the other. 
TODO: need to say more here? it's "intuitively clear"..


8.1.2 Relationship with outer joins
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tables within an outer join nest must remain within that nest. this rule has
priority over all semijoin rules.


8.1.3 The check to be used to determine if we can add table
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Definitions of acceptable join orders were already given.

For JoinShortcutting, we can just use table dependencies.

For DistinctStream, we'll have to keep track in which part of the pattern we
are. Since interleaving is allowed, we'll need to do it for each semi-join.

A candidate implementation:
Have an array of ints to show in which pattern we are.
We'll need to
 - notice when we start atX tables (and set some flag so best_access_path only
   chooses appropriate access methods)
 - Notice when we end atX tables
 - prevent appearance of outer tables before the inner.

For each semijoin we have a kind of 4-state DFA. For each of the states it 
will have
 - table_map of tables that may be added to partial join_order
 - next_state(table#) function
.


8.2 Executioner adjustments
---------------------------
We will need to add the "jump-outs" to certain JOIN_TABs. The above sections
have a description of what kind of "jump outs" and where to.

Nothing more to cover here for the HLS.


8.3 Join optimizer adjustments 
------------------------------
{ A check if given table can be used to extend given partial join order }
{ Modifications to join cost calculations}
{   - the fanout changes }


9. EXPLAIN changes
==================
After all
 * Semi-joined tables show up in the upper SELECT.
 * EXPLAIN EXTENDED + SHOW WARNINGS shows the semijoin structure (this 
   will give idea about which tables were pulled out of the semi-join).

 * Temporary table use is indicated by "Start temporary" and "End temporary" 
   in the "Extra" column. Tables that were not pulled out (see previous entry) 
   and are in the [Start temporary; End temporary] range will have their rowid 
   in the temp table.

 * Join shortcutting is indicated by "FirstMatch(tableX)" in the "Extra" column.

TODO: Add the WL#3750 part here.


10. To be covered in the LLD
============================
 * TODO: what do we do if the parent and/or subquery's SELECT has a
   STRAIGHT_JOIN attribute(s)?

 * TODO: How does the conversion relate to prepared statements? In order for
   conversions to be valid we need to rely on table DDLs not to be changed
   (e.g. "automatic satisfaction" needs UNIQUE key to remain UNIQUE)

11. Other notes
===============
We can use a temporary table to run anti-semijoin to handle NOT IN.