WL#3751: Subquery optimization: Semijoin: Inside-out strategy

Affects: Server-6.0   —   Status: Complete

This is a subtask of WL#2980 "Subquery optimization: Semijoin".

This WL entry covers implementation of Inside-out semi-join execution strategy.

1. General idea
1.1 Limits on handling correlated subqueries
1.2 Definition of InsideOut strategy
2. Acceptable join orders
2.1 Prefix of outer tables is ok
2.2 No correlated tables in the suffix
2.3 Table ordering within the sj-nest range 
2.3.1 SJ-nest range, ordered enumeration part
2.3.2 SJ-nest range, FirstMatch part
3. Ordered streams and duplicate skipping
3.1 Required orderings
3.1.1 Constant parts removal effects
3.1.2 Equality propagation effects
3.2 Ordered scans
3.3 Access methods that produce ordered output
3.4 Ordered stream production summary
4. Executioner adjustments
4.1 Join shortcutting
4.2 Ordered scan adjustments
4.2.1 Removing duplicates from ordered stream
4.3 EXPLAIN output changes
5. Join optimizer adjustments
6. Things that will not be implemented
6.1 best_access_path() adjustments
6.1.1 Idea about how to handle best_access_path()
6.2 No accounting for functional deps when ordering
6.3 Not taking advantage of cross-join sj-nest ranges
6.4 Not passing info about distinct stream into MRR interface

1. General idea

Here is one more way to satisfy INNER-UNIQ-REQ: Consider a case of one
inner table and uncorrelated subquery:

  ot1 SJ it1 ON (it1.a = ot1.a)

We'll be able to use join order (it1, ot1) if we can make access to table
it1 to produce a distinct stream of {it1.a}. The execution will proceed 
according to this swimlane diagram:

      it1     |     ot1
  ===== ot1.a=it1.a=c1 group =====
   it1.rowX1  |
   it1.rowX2  |
   it1.rowX3 -|------------------ pick one row and get the value of c1
              |  ot1.rowX1
              |  ot1.rowX2
              |  ... (other matching rows, no repetitions
              |       because this is a single scan through table ot1)

  ===== ot1.a=it1.a=c2 group =====
   it1.rowY1  |
   it1.rowY2  |
   it1.rowY3 -|----------------- pick one row and get the value of c2
              |  ot1.rowX1
              |  ot1.rowX2
              |  ... (other rows, no intersection
              |       with the ot1.a=c1 group because 
              |       this is the ot1.a=c2 group, and no repetitions)
     ...      |

It is apparent that this execution will enumerate all rows from ot1 and will
not produce any INNER-UNIQ-REQ violations.

The approach relies on that it doesn't matter which it1's row in the group we
pick (e.g. it1.rowX1 or it1.rowX2). This means it works only if the subquery
is uncorrelated (or correlated only wrt ot1.a). 

1.1 Limits on handling correlated subqueries

It seems that it is not possible to use this approach to handle correlated 
subqueries when the outer correlated tables follow the subquery's inner
tables. Consider this diagram:

      it1     |     ot1
  --- ot1.a=it1.a=c1 group ---
   it1.rowX1  |
              |    ot1.rowP1 (corr. check fails) \
              |    ot1.rowP2 (corr check ok)      |- (*)
              |    ot1.rowP3 (corr check ok)     /
   it1.rowX2  |
              |    ot1.rowP1 (corr. check ok)    \    
              |    ot1.rowP2 (corr check fails)   |-  (**)
              |    ot1.rowP3 (corr check ok)     /    
   it1.rowX3  |
     ...      |      ... 

The first outer table scan (marked (*)) does not cause problems. The subquery
is correlated, so we will have to scan table ot1 for every other it1.rowX{i} 
as well. This is necessary as otherwise we might miss matching row 
combinations like

  {it1.rowX2, t1.rowP1}

But if we continue and do scan (**), we will not be able to tell which ot1's
rows have already been passed to output by scan (*).

From this we conclude that InsideOut strategy cannot handle a correlated 
subquery if there is an outer correlated table that is located after the
inner tables in the join order.

A practically-important special case is when the only source of the
correlation is one or several equalities which are AND-parts of the 
subquery's WHERE clause. In this case it is possible to decorrelate the 
subquery by converting the equalities to IN-equalities. This case will not be
handled within the scope of this WL entry.

1.2 Definition of InsideOut strategy

Suppose we have a semi-join

   (ot1 ... otN) SJ (it1 ... itN) ON (oe1=ie1 AND oe2=ie2 AND ...)

Executing the semi-join with NL-join algorithm will produce INNER-UNIQ-REQ
violations. Each violation is a pair

   (ot1.rowX, ..., otN.rowX, it_rows1)
   (ot1.rowX, ..., otN.rowX, it_rows2)

such that

   ie1(it_rows1) = ie1(it_rows2) AND
   ieK(it_rows2) = ieK(it_rows2) AND

That is, the set of all row combinations of inner tables is partitioned into
subsets. Each partition has its unique value of (ie1, ... ieN) tuple. Any two
members of the same partition will form an INNER-UNIQ-REQ violation. Two
members of different partitions won't form a violation.

We'll avoid producing INNER-UNIQ-REQ violations if we pick one row combination
from each partition. This can be achieved if we enumerate rows in a way that
row combinations get grouped by their partition, i.e. by the value of

  (ie1, ..., ieN).

One way to do such grouping is to enumerate row combinations in a way that the
output is ordered by

  some-permutation-of(ie1, ..., ieN).

Row combination orderings that allow for removal of INNER-UNIQ-REQ violations
will be called "InsideOut orderings".

2. Acceptable join orders

Throughout this section we'll assume that the join order consists of three 

  the_prefix     the_sj_nest       the_suffix
  (ntX|otX)*    (itX|ntX|otX)*     (ntX|otX)*

In subsequent sections we'll establish the allowed/required properties of 
the parts of the join order.

For this WL entry we will follow the approach of WL#3741/WL#3750 and assume
that every "semi-join range" in the join order is handled by only one
strategy. (A "semi-join range" is a range in join order that contains all
sj-inner tables of one or several semi-joins, and no sj-inner tables from any
other semi-joins. Semi-join ranges cannot be nested within one another).

It is possible to modify semi-join range definition so ranges contain one
another and are served by different strategies but this will not be
implemented at this phase.

2.1 Prefix of outer tables is ok
Suppose access to inner tables produces an InsideOut ordering of 

  some-permutation-of(ie1, ie2, ..., ieN). 

We make no assumptions about the_prefix, so there is no warranty that the 
entire result of join execution will be in InsideOut order. However, for 
each row combination of the_prefix, row combinations of (inner_tbls, 
the_suffix) will be enumerated in InsideOut order:
                   inner_tbls.row_x1  \
                   inner_tbls.row_x2   >-- those go in InsideOut order
                   inner_tbls.row_x3  /

                   inner_tbls.row_y1  \
                   inner_tbls.row_y2   >-- those go in InsideOut order
                   inner_tbls.row_y3  /


This will prevent INNER-UNIQ-REQ violations from being formed within one
the_prefix.rowX group. the_prefix.rowX != the_prefix.rowY, so a pair of row
combinations from different value groups cannot form an INNER-UNIQ-REQ
violation either.

Also, the subquery can be correlated wrt tables located in the_prefix: when 
all the outer references in the correlated condition are bound, we can pick 
any matching inner_tbls row combination within a (ie1, ..., ieN) value group.

If some of oe_i refer to tables in the_prefix, then those oe_i, and their 
matching ie_i remain constants throughout the InsideOut "pass". Because of 
that they can be removed from InsideOut ordering list: 
  any-permutation-of(ie1, ie2, ie3, ..., ieN) 


  any-permutation-of(ie_i1, ..., ie_ik).
When all outer tables are in the prefix, the requirement of InsideOut ordering
becomes a confluent "ORDER BY (empty-list)", i.e. ordering is no longer
required. The execution algorithm gets reduced to use of FirstMatch strategy.

2.2 No correlated tables in the suffix

As shown above in section "Limits on handling correlated subqueries",
InsideOut strategy cannot be used when the_suffix contains correlated tables.
Outer tables that are only referred from IN-equality can be present in the

2.3 Table ordering within the sj-nest range 

To recall, we're looking at the join order as the concatenation of three 

  the_prefix     the_sj_nest        the_suffix
  (ntX|otX)*    (itX|ntX|otX)*     (ntX|otX)*

the_sj_nest part is divided into two parts:

  1. Ordered enumeration part
  2. FirstMatch part 

The first part "is responsible" for enumerating row combinations in a way that
IN-expression value groups are not interleaved with each other.

The second part is "responsible" for picking one row combination from the 
IN-expression value group.

We will assume that the first part is minimal, i.e. removal of the last table
from the first part will cause the part to stop producing the required 


  SELECT * FROM ot1, ... 
    WHERE ot1.col IN (SELECT ie1.col FROM ie1, ie2 WHERE pred(ie1,ie2))

  Possible join order:

   it1, it2, ot1, ... 
    ^    ^   ^-------^-- suffix
    |    +-------------- the_sj_nest, FirstMatch part
    +------------------- the_sj_nest, Ordered enumeration part

2.3.1 SJ-nest range, ordered enumeration part
This will be covered further in the "Ordered streams and duplicate skipping"

2.3.2 SJ-nest range, FirstMatch part
We can reuse the results of WL#3741:
 * none of the tables must use join buffering
 * interleaving of inner and unrelated outer tables is ok.

In addition, for outer tables we can note that 
 * the subquery is uncorrelated,
 * for every "invocation" of the FirstMatch the values of all IN-expressions
   are constant.
which means that access to the outer tables does not depend on the inner
tables that are in the FirstMatch part. This means that outer tables have the 
same properties as unrelated tables, and we can treat them in the same way.

To sum up:

  Within the FirstMatch part, any interleaving of outer, unrelated, and inner
  tables is ok. The executioner code should work like WL#3741, treating outer
  tables as unrelated.

3. Ordered streams and duplicate skipping

3.1 Required orderings

3.1.1 Constant parts removal effects
As has been mentioned earlier, the required ordering is

  any-permutation-of(ie1, ..., ieN)

If there are outer tables that are located before the sj-nest, some of these 
expressions can be treated as constants and removed from the list. The 
remaining expressions give us the "nonconfluent" required ordering:
  any-permutation-of(iex1, ... iexN)

Within the scope of this WL we use only one method to produce orderings - do index 
scans, so each iex{i} needs to be it_i.keyXpartY.

  It is not possible to handle the case when ie_x are arbitrary functions. For 
  example, for ie1=x+y, producing a distinct stream of pairs (x,y) will not
  allow to produce a distinct stream of (x+y)
  It might be possible to handle injective functions as they have this
     (x1,y1,...) != (x1,x2,...) =>  func(x1, y2) != func(x2,y2)
  However this doesn't seem to be practically useful so we will not implement

3.1.2 Equality propagation effects
The set of allowed orderings becomes broader when we take into account 
equality propagation. The set becomes:

  non_constant_IN_components = {
   equ1 = (tbl1.keyX1partY1 , ...,tblK.keyXKpartYK),
   equN = (tbl1.keyX1partY1 , ...,tblK.keyXKpartYK),

and we must provide ordering on

3.2 Ordered scans
Ordered enumeration is performed by using index-based access methods:
 * range
 * index
 * ref (but not ref_or_null)

At first, it seems that NL-join execution can do ordered enumeration on
orderings that span multiple tables. However, we're only interested in
orderings that cannot be reduced to single-table orderings by means of

 - equality propagation (handled in previous section)

 - use of "constant table" property - all such constant sj-inner tables
   will be pulled out of the sj-nest (handled in the section before the 

 - use of "locally-constant table" property, when the table is accessed 
   using eq_ref(prefix-tables) and therefore is constant within the given 
   join order suffix. Tables like those will not be considered sj-inner.

Taking those considerations into account, lets try to see if it is possible
to perform ordered enumeration for multi-table ordering: suppose we need an 
ordering on

  order= ( t1.key, t2.key )

and the join order is (t1,t2). As stated above, t1.key is not a unique index,
 and there are different values of t1.key. Then the enumeration will proceed 
in a way like this:

   t1.key   t2.key
  'bar'       1




which shows that the rows will not be enumerated in (t1.key, t2.key) order.

The only scenario where we can have multi-table ordering is when table t2 is
accessed using eq_ref(prefix-tables, t1). This case will not be handled in
scope of this WL entry. The limitation will be denoted as "No accounting for
functional deps between the inner tables when resolving the ordering".

(Note: I have no proof that it is not possible to produce a multi-table 
ordering with some combination of inner and outer tables, like "it1 ot1 it2".
Thorough investigation will not be done in scope of this WL due to lack of

3.3 Access methods that produce ordered output
By definition, a range scan over
   index(kp1, ..., kpN)

produces (or, in case of MRR, can be set to produce) a stream that is ordered

  (kp1, kp2, ..., kpN).

In addition, if several of the first components are fixed (this can be checked
by either analyzing the ranges or analyzing the KEYUSE array), then the output
is ordered by the remaining suffix:
  (kp_n, kp_n+1, ... kpN)

The situation is completely analogous for "ref" and "index" scans.

Following the action taken by ORDER BY code (in test_if_skip_sort_order), we 
will ignore the fact that some special cases of index_merge access can produce
ordered output.

Due to lack of time, we will not implement no-duplicates range scan. We chose 
to omit non-duplicates range access because 
 - use of InsideOut strategy together with range access for the InsideOut
   table is a relatively uncommon case.
 - "range" will use join buffering by default, so we'll have to make a choice
   between range+join buffering+duplicate elimination and InsideOut strategy +
   no-duplicates range + no join buffering. This is not immediately trivial.

We will handle a case where the InsideOut scan is a duplicate-removal scan of
the entire index. This case will be evaluated as a confluent "ref" scan on 0

3.4 Ordered stream production summary
Summing up the results of the previous sections: 
 - ordering (i.e. grouping) is produced by exactly one inner table IGT
 - IGT must not use join buffering
 - IGT's access method must be "ref", which uses all non-constant components 
   of the ordering as its prefix, or its infix, where the prefix of the key 
   is fixed by "keypartX=const" constraints imposed by IGT's access method.

  Table that produces the ordering (IGT) will be referred to as 
  "InsideOut table". An appropriate scan of that table will be 
   referred to as "InsideOut scan".

4. Executioner adjustments

4.1 Join shortcutting
As mentioned in the previous sections, InsideOut strategy will be used
together with FirstMatch in certain cases. In that cases it will be
neccessary to set up appropriate join shortcutting. It will work in the
same way as WL#3741's shortcutting.

4.2 Ordered scan adjustments
As has been shown in earlier sections, grouping/ordering will be provided by
one table.  There are two possible cases:

1. The first table row in the value group is guaranteed to be the match. 
   This is true when the sj-nest range has no FirstMatch part (see preceding
   sections for explanation), and the entire table condition is guaranteed 
   to be true by the table's access method.

   In this case the engine can be instructed to produce a distinct stream by
   passing a corresponding MRR interface flag. This will not be done within
   the scope of this WL entry.

2. We need to enumerate rows (or row combinations) in the value group until 
   we find a matching row (combination). When we've found it, we'll need to 
   jump over the remainder of the group.

4.2.1 Removing duplicates from ordered stream
The "jump" will be performed at SQL-layer level. Jumping will be done by 
skipping the unneded rows.

Another possible option would be to use true jump calls 
   handler->index_read(..., HA_READ_AFTER_KEY)

but we will not do it as this approach has proved to be slow (TODO: link!).

4.3 EXPLAIN output changes
Use of FirstMatch strategy will be reflected in the same way as it was
reflected in WL#3741.

Use of "jump over to next value group" will be indicated by displaying 


in the Extra column, where m and n are the appropriate keypart numbers.

5. Join optimizer adjustments

(See the note in "Things that will not be implemented" section below)
For now, we'll settle for this simplistic apporach: when best_access_path() 
considers access methods for the first inner table it1 for which it could use 
InsideOut strategy:

          +------------ The "InsideOut table"
          |    +=======+---- this will be the "FirstMatch tail"
   ....  it1  it2 ... itN

We will assume that #rows fanout and access cost of table it will be


times lower than the values obtained by current best_access_path() code.

The special case of uncorrelated one-inner-table semi-join will be treated in
a special way. In this case we will perform a loose index scan, so the cost and 
#rows will be calculated accordingly (details to be covered in the LLD)

6. Things that will not be implemented

6.1 best_access_path() adjustments
At the moment, best_access_path() picks the cheapest available access method.
For join optimization this is ok, because choice of access method has no
effect on the optimization of the rest of the join order.

This property is what prevents from taking ORDER BY into account by the join

Introduction of InsideOut strategy poses a similar challenge. Applicability of
InsideOut depends on scanning certain tables in certain orders, so the cheapest
table access method is not necessarily the best one now.

[according to discussion with Igor:]
This problem should be solved using this approach: 

 1. Find the cheapest join order without taking any orderings into account
    (that's what current code does)
 2. Find the cheapest join order that allows to 
     - for ORDER BY case, skip sorting
     - for InsideOut case, use the InsideOut strategy

then compare the two and pick the best one.

In order to do the comparison we'll need to calculate the cost of InsideOut 
and the alternate strategy. Since we don't have any of those we will not
implement this in full scope in this WL.

6.1.1 Idea about how to handle best_access_path()
Make best_access_path() (and other functions as needed) return some flag to 
inform the join optimizer that it needs to "fork" the search at this point.
The first best_access_path() call will ignore the presense of sj-nest.
The second call (the one after fork) will pick the loose index scan access

NOTE: Perhaps this could be used immediately for 1-table semi-joins. Here
we'll have immediate cost difference due to use of loose index scan.

6.2 No accounting for functional deps when ordering
We will not give any special treatment to cases where one inner table is 
functionally dependent on the other and that allows to simplify the required
ordering (but not in a way that equality propagation would allow).
(search above for "No accounting for functional deps when resolving the 
ordering" to see this in the context)

6.3 Not taking advantage of cross-join sj-nest ranges
In certain cases, a semi-join range (as "range in the join order") can
contain a cross join, i.e. it will have tables

  ita_1, ... ita_K,    itb_1, .. itb_n

where the two collections are either from two different subqueries or they are
from the same subquery that was a cross join (that is, it calculated a 
full cartesian product).

It is possible to take advantage of the information about the cross join, but 
we will not do that within the scope of this WL entry.
6.4 Not passing info about distinct stream into MRR interface
In certain cases, it is possible to use MRR interface flag to inform the 
table engine that the SQL layer is interested in only one record (or index
tuple) from the (t.keypart1, ..., t.keypartN) value group. Table engines 
that batch or re-order index reads could benefit from this information.
This will not be done within the scope of this WL entry.

1. Code-level work overview
2. InsideOut table scan detection in best_access_path()
2.1 Maintaning cur_emb_sj_nests
3. Plan refinement phase fixes 
4. Join executioner adjustments
5. Milestones

1. Code-level work overview

Implementation of InsideOut strategy will consist of:

Join preparation phase
  - No work is required, as InsideOut will reuse the semi-join structures 
    created by other semi-join WLs. We might need to do some (simple) 
    preprocessing for work done in the other phases, like saving the
    separate bitmap of outer non-trivially correlated tables.

Join optimizer
  - best_access_path() needs to be adjusted to use modified cost and #rows 
    fanout values for "jumping" index scans.

Plan refinement
  - setup_semijoin_dups_elimination() needs to be modified to include the code
    that will detect when InsideOut or a combination of InsideOut+FirstMatch 
    can be used and make according query plan setups.

Join executioner
  - Add code to skip over records that have duplicate value of (ie1, ... ieN).

  - EXPLAIN output should show "LooseScan(m..n)" when necessary.

2. InsideOut table scan detection in best_access_path()

The main design consideration is to make the check in best_access_path() as
lightweight as possible as join optimizer may invoke it lots of times.

When a subquery is converted to semi-join its IN-equalities are injected into
parent join's WHERE/ON clause. We will mark those equalities by setting their
Item::name to a pre-defined const char* pointer.

The 'ref' optimizer will detect the marked equalities and mark their
corresponding KEYUSE elements. Then best_access_path() code will detect those
and construct the InsideOut scan.

Some definitions: for a subquery 

  (oe1, ... oeN) IN (SELECT ie1, ... ieN ...) 
and given join order prefix JP:

 * an equality "oe_i=ie_i" is called "bound" if JP contains all tables needed
   to evaluate ie_i.

 * an equality "oe_i=ie_i" is called "handled" if the InsideOut scan will
   produce a distinct stream of tuples (ie_... , ie_i ,...).
  in_equality_map bound_equalities=   0;
  in_equality_map handled_equalities= 0;
  key_part_map distinct_scan_on_this= 0;

    Discover the bound equalites. We need to do this, if
      1. The next table is an SJ-inner table, and
      2. It is the first table from that semijoin, and
      3. We're not within a semi-join range (i.e. all semi-joins either have
         all or none of their tables in join_table_map). (See the subsequent
         section on how cur_emb_sj_nests is maintained)
      3. All correlation references from this sj-nest are bound
  if (table->emb_sj_nest &&                                        // (1)
      !(table->emb_sj_nest->map & cur_table_map) &&                // (2)
      !cur_emb_sj_nests &&                                         // (3)
      bitmap_covers(cur_table_map, table->emb_sj_nest->corr_map))  // (4)
    /* This table is an InsideOut scan candidate */

      Walk over IN-equalites and see which are bound.
      This step has an additional merit over KEYUSE-based bind detection 
      because it handles ie_i's that are not keyparts.
    for (each element e in emb_sj_nest->IN_equalities)
        Q: should this take into account equality propagation and how?
        A: If e->outer_side is an Item_field, walk over the equality
           class and see if there is an element that is bound?
        (this is an optional feature)
      if (e->outer_side->used_tables())
        bound_equalities.set_bit(e's number);

  while (...) /* best_access_path()'s KEYUSE scan loop */
    /* Start of handling the next index */
    handled_equalities= 0;
      if (keyuse->in_equality_num) 
          This KEYUSE represents an equality

            it.keyXpartY = outer_expr
          which was created from an IN-equality.
        if (bitmap_covers(cur_table_map, keyuse->val->used_tables())
    if (((handled_equalities | bound_equalities) cover all IN-equalities) && 
        index access follows pattern of 
        ^(constant_part|subq_used_part)*, subq_used_part+, anything-else)
      InsideOut strategy is applicable, adujust index scan costs;

2.1 Maintaning cur_emb_sj_nests
We will maintain a bitmap of semi-join nests that have only some of their 
tables in the join order prefix. 

  if (tbl->emb_sj_nest)
    cur_emb_sj_nests |= tbl->emb_sj_nest->sj_inner_tables;
    // remove the sj_nest if all of its SJ-inner tables are in cur_table_map
    if (bitmap_covers(cur_table_map, tbl->emb_sj_nest->sj_inner_tables))
      cur_emb_sj_nests &= ~tbl->emb_sj_nest->sj_inner_tables;

  if (tbl->emb_sj_nest)
    // If we're removing the last SJ-inner table, remove the sj-nest 
    if ((cur_table_map & tbl->emb_sj_nest->sj_inner_tables) == tbl->map)
      cur_emb_sj_nests &=~tbl->emb_sj_nest->sj_inner_tables;

3. Plan refinement phase fixes 

setup_semijoin_dups_elimination() should make choices according to the 
following rules:
 - InsideOut|FirstMatch should be preferred over DuplicateElimination
   (note that this is not always the optimal choice - in some cases 
    using temptable + cheap scan may be cheaper than an expensive InsideOut
    scan (TODO: prove or refute this))
 - There is no need to ever make a choice between InsideOut and FirstMatch
   strategies: in every case there is an optional "head" handled by InsideOut 
   followed by the FirstMatch "tail"

4. Join executioner adjustments

Consider a join execution swimlane diagram (partial copy from the HLS):

    it1       |    it2     |  itN
              |            |
  ==== ot1.a=it1.a=c1 group =====
              |            |
   it1.rowX1  |  it2.rowX  |
              |  it2.rowY  |
      (1)     |            |
   it1.rowX2  |            |
              |  it2.rowY  |
              |  it2.rowZ  | it2.rowX (2)
     (3)      |            |
              |            |

(1) here we return to table it1 without having found a matching row
combination of SJ-inner tables, so we must continue scanning this value
(2) we've found a match. When we return to location (3) we must skip over to
the next value group.

Implementing this functionality calls for use of a flag. The current proposal
is that when we start scanning it1's group, we set 

  itN->status= STATUS_GARBAGE

when we receive a request to continue scanning it1, we check itN->status. If
there is anything other than STATUS_GARBAGE, we continue.
(TODO: need to check if this really works or we have to introduce a new,
separate flag)

5. Milestones