WL#3741: Subquery optimization: Semijoin: Duplicate elimination strategy

Affects: Server-6.0   —   Status: Complete

This is a subtask of WL#2980 "Subquery optimization: Semijoin".

It includes "Duplicate Elimination" semi-join execution strategy, where we use
NL-join execution algorithm and a temporary table with unique key/constraint to
weed out duplicate outer row combinations.

This WL covers both adjustments needed for join optimizer and for join executioner.

This WL is to be done after WL#3740. 
When this WL is complete we will have the minimal working semi-join subquery
implementation (i.e. always correct results, no errors for any valid subqueries).

1. Description
2. Temptable use minimization
2.1 Functionally dependent tables
2.2 Prefix of the outer tables
2.3 Suffix of the uncorrelated outer tables
2.4 Combining prefix and suffix
2.5 Case of several semi-joins
2.5.1 On semi-joins "boundaries"
2.5.2 Temptable use in case of several semi-joins
3. Accounting for use of temptables in the join optimization
3.1 The details
4. Changes in the join executioner
5. EXPLAIN output

1. Description
The idea of the Duplicate Elimination strategy is as follows: 

Execute semi-join as a regular nested-loops join. We will get duplicate row 
combinations of outer tables. They can be identified by comparing rowids. We
can use temporary table with unique constraint (or key) to weed out the 
duplicate combination of outer rows.

Codewise, NL-join algorithm will need this adjustment:

when we get a matching row combination:

    rowid_tuple= {ot1.rowid, ... otN.rowid};
    if (!temp_table.insert(rowid_tuple))
       pass row combination to output;
      /* This is a duplicate */

This strategy allows to handle several semi-joins at once, and it works for
any join order.

A note about outer joins: NULL-complemented rows do not naturally have
rowids, so we'll need to assign them some $invalid_rowid value. Two rows with
rowid=$invalid_rowid will be considered indentical.

This will allow us to handle arbitrary nested outer joins. Also the subquery
predicates could be located anywhere, either in the WHERE or in some nested
join's ON clause (this is "intuitively clear" when one looks at outer joins
algorithm but atm I have difficulty writing some kind of a proof).

NOTE: Temporary table lookups become expensive when the table overflows to
disk. In certain cases we could use a better approach: collect row
combinations in a Unique object, and then scan the Unique and join it with the
rest of the tables. This will not be implemented as a part of this WL entry
(at the moment we don't plan this in other WL entries either as we beleive
other subquery optimizations have more priority)

2. Temptable use minimization

We can reduce the cost of temptable use. We can

 - make the unique key shorter
 - make the table contain fewer records (by emptying it in the course of join
 - make fewer temptable lookups by moving WEED-DUPS-OUT call into the "middle"
   of the join.

if we use one of the approaches described in the subsequent sections.

2.1 Functionally dependent tables

If an outer table is not within an outer join nest, and
  - is a constant/system table, or
  - is accessed via eq_ref(some-other-table)
or, if we have a semi-join nest with not-exists optimization (i.e. only the
NULL-complemented row combination will pass the WHERE clause)

then we do not need to store its rowid in the temptable.

This approach shows how this WL interplays with WL#3740. WL#3740 pulls the
tables out semi-join nests. That action does not "make things worse" for
duplicate elimination strategy - we will not need to add pulled-out tables'
rowids into the temptable tuple.

NOTE: if there are several tables that are in a circular functional dependency
we can make a choice which of the tables will have rowid in the temptable. The
choice will affect the cost of temp table use. We'll want to pick
 - table with shortest rowid (so the temptable tuple is smaller)
 - the table for which the E(partial join result cardinality) is the lowest
   (so we'll have fewer WEED-DUPS-OUT calls, see subsequent 2.x sections)
We will not make any such optimizations and will always ue the first table in
the join order.

2.2 Prefix of the outer tables
If the join order has a prefix of outer tables, then we can exclude the
prefix tables from duplicate rows elimination.

 - Prefix table rowids don't need to be in the temptable.
 - We'll need to empty the temptable when transitioning from one prefix row
   combination to another.

This is apparent from this diagram:

  ot1  | ...  | otN  | it1
       |      |      |
  rowX | rowX | rowX | row1
       |      |      | ...
       |      |      | rowN
                      eof, go left
  rowY | rowY | rowY |

The point (*) partitions the time axis into two parts. The part above has
(ot1,...otN) = (ot1.rowX,...,otN.rowX). NL-execution guarantee that this row
combination won't be encountered anywhere below the point (*).

At location (*) we will need to place this code:

for every new prefix row combination:


2.3 Suffix of the uncorrelated outer tables
Suppose we have a join order of

 oit1 ...  oit1N   nt1  ...  ntK 

where nt_i are the outer tables that are not referred to from within the
subquery predicate.  Suppose we're executing NL-join with this join order.

At location (*) we will be getting duplicate row combinations. All of the
tables that subquery predicate referred to are before the (*), so each
duplicate row combination is a match, and consequently it doesn't matter
which of the duplicates we pick.  The "tail" after the (*) has only outer 
tables and therefore will not produce duplicates.

Therefore we can place WEED-DUPS-OUT after the last subquery table or the last
table that was referred from the subquery predicate.

2.4 Combining prefix and suffix
The two previous rules can be combined together. This gives us:


Given a join order and a semi-join nest S, we can find the first and last
  first_tbl(S) := first inner table
  last_tbl(S)  := the latest of the
                  - last inner table
                  - last correlated outer table

The range between those two tables will need to be "covered" by a temptable:
we'll need to
 - put RESET-TMP-TABLE before the first_tbl(S),
 - put WEED-DUPS-OUT after the last_tbl(S)

Note: we can get a special case where there are no inner tables in the
[first_tbl(S_i), last_tbl(S_i)) range. In that case, temporary table has
zero-length tuple, which corresponds to FirstMatch strategy from WL#3750.

2.5 Case of several semi-joins

Our first observation is that the bounds between semi-joins are not really
set in concrete:

2.5.1 On semi-joins "boundaries"

Subqueries that can be converted to semi-joins can be merged together into one
subquery that is a cross-join of the involved subqueries. For example, the 

   ot1 ... otN
   (oeA1, ...) IN (SELECT ieA1,... FROM itA1, ... itAk) AND 
   (oeB1, ...) IN (SELECT ieB1,... FROM ieB1, ... ieBk) AND

can be converted to

    ot1 ... otN
    (oeA1, ...oeB1, ...) IN (SELECT ieA1,... ieB1, ...
                               (itA1, ... itAk) inner join
                               (ieB1, ... ieBk)
    AND subq_where;

The reverse is also possible: if a subquery is (or can be converted to) a
cross-join then it can be split into two. 

2.5.2 Temptable use in case of several semi-joins
The case of several semi-join nests is handled as follows:
 - For each semi-join nest S_i, find the interval 
      [first_tbl(S_i), last_tbl(S_i)].

 - A semi-join whose interval doesn't have interesection with any other is 
   covered as described in SJ-TEMP-TABLE-USE.

 - If several semi-joins have intervals with non-zero intersection then we
   can think of them as one "interleaved" semi-join. For that semi-join, 
     first_tbl(S_sum) = min(first_tbl(S_i))
     last_tbl(S_sum)  = max(last_tbl(S_i) )
   and then it is covered as described in SJ-TEMP-TABLE-USE.
   (todo: saner wording)

It is apparent that once every interval is "covered" (either on itself or 
together with others), then the duplicate outer row combinations will not be

3. Accounting for use of temptables in the join optimization

WEED-DUPS-OUT code "removes" the fanout produced by the subquery tables, so
we'll need to adjust the size of the partial join output.

We do not have a formula for the cost of using the temporary table. We could
use the following hack:
 * Assume we'll have to write all temporary table contents to disk, using
   random disk seeks.
   We'll have to do this write the number of times we call RESET-TMP-TABLE.

with the hack, the cost of temptable will be
 - directly proportional to its key length.
 - directly proportional to E(#rows in temp table) 
   ( this is not true. there is O(#rows * log(#rows)) part.

the cost will be taken into account "immediately", e.g. if we have a semi-join

ot1, ot2, ot3 SJ (it1, it2) on sj_cond(ot1, ot2, ot3, it1, it2)

and a partial join order of 

  ot1 sj1 ot2 
then the cost of it will include cost of use of temptable with {ot2.rowid} key.

3.1 The details
when we position a TABLE in the join order we can tell whether we've just put
all of the semi-join tables into it.

4. Changes in the join executioner
We'll need those changes in the join executioner:

 * Add code to create the temptables
 * Insert WEED-DUPS-OUT check
 * Insert RESET-TMP-TABLE check
 * Add code to handle the special case of zero-length temptable
 * Add code to clean up the temptables after parent join execution is
   finished (if the parent join is inside a subselect itself we will probably
   not want to create/delete temp. tables on each execution? or do we?)

The details will be covered in the LLD.

5. EXPLAIN output

Temptable use shall be visible in EXPLAIN output.

The "Extra" column will show
 * "Start temporary" for the table before which we have RESET-TMP-TABLE code.
 * "End temporary" for the table after which we have WEED-DUPS-OUT code.

If the temporary table has zero-length row, i.e. we're actually using join
shortcutting the "Extra" column will show:

 * "Shortcut to $table", where $table is what is in the "table" column for the
   table we'll shortcut to.

EXPLAIN EXTENDED ...; SHOW WARNINGS will show the semi-join operation. The
operation will be shown as

  ... outer_tables  semi join (inner_tables) ...

where inner_tables shows the tables that were not pulled out. Note that
semi-join operation here lacks the ON condition. This is because semi-join's
ON condition is merged into parent's WHERE. 
Q: is that OK? Or we should save and display the semi-join's ON? 

TODO: check how this will work with BKA.

1. Data structures
2. Join optimizer adjustments
3. Setting temptable use
3.1 Rules for inclusion of table rowid into temptable
3.2 Finding out covered ranges
3.2.1 Data structure
3.2.2 Temporary table assignment
4. Temptable operations
3.1 Place to create/drop temporary table
3.2 Temptable properties
3.5 Getting table ROWIDs
3.5.1 prepare_for_position() for smart access methods
3.5.2 ROWIDs and join buffering
3.5.3 ROWIDs and filesort()
4. Join executioner changes
5. Schedule
6. Not scheduled functionality
6.1 Fanout adjustments
6.2 Temptable use cost
6.3 Join optimizer fixes

1. Data structures

WL#3740 code provides this WL with the following data structures:

 - The TABLE_LIST join nest tree contains SJ-nests. There may be several
   SJ-nests that contain or are contained within OJ-nests, but SJ-nests do not
   contain one another.

 - Some tables in SJ-nests are actually pulled out of them. WL#3740 code does
   calculate which tables are pulled out but does not save this information.

We'll need to use this structure to
 - make adjustments to join cost/cardinality
 - setup temptable use, position the WEED-DUPS-OUT and RESET-TMP-TABLE calls.

2. Join optimizer adjustments
In the scheduled functionality, join optimizer will execute unaffected by
presense of SJ-nests.

3. Setting temptable use

This is done at plan refinement stage, when the order of the tables is already
known. Inclusion of table's rowid into temptable is governed by the following

3.1 Rules for inclusion of table rowid into temptable

The rules do not depend on join order (*) and are as follows:

1. SJ-inner tables that were pulled are not included.
2. SJ-inner tables are not included.

3. SJ-outer OJ-outer const tables are not included, as there is one possible
   rowid value anyway.
   (the check is: table->map & const_table_map & !table->embedding)

4. Tables functionally dependent on other included tables are not included.
   This is harder to check because of circular dependencies.
   For now we will use this simple but incomplete criterion:

     If the table is accessed with eq_ref access dependent on sj-outer, 
     constant, or include tables, then the table is not included.

   (i.e. (*) will actually be not true in this implementation. But that is

It is possible to have a more comprehensive check but we put this off to the
unscheduled part of this WL.

3.2 Finding out covered ranges

3.2.1 Data structure
class JOIN will have a list of SJ_TMP_TABLE stuctures. 

  A temporary table that is used to weed out semi-join duplicates
  uint end_idx;

  /* Array of pointers to tables that should be "used" */
  JOIN_TAB **tables;

3.2.2 Temporary table assignment
We determine temporary table contents in one pass over the join order. The
code for the pass is as follows:

  SJ_TMP_TABLE *cur_tmp_table= NULL;
  /* A bitmap of sj-nests we're currently in */
  table_map emb_sj_map= 0;

  for (each join tab join_tab)
    table= join_tab->table;
    if (emb_sj_map)
        Check if the partial join order now contains all tables of all 
        leaving coverage of all semi-joins.
      if (bitmap_covers(cur_map | table->bit, emb_sj_map))
          Ok, current table is the last table that was missing in the set of
          tables that should be in this temptable.
           - Save this temp table.
        if (included(join_tab) 
          cur_tmp_table.end_idx= join_tab;
        emb_sj_map= 0;
        cur_table= new SJ_TMP_TABLE(); 
          The current table is either
           - an sj-inner table. A table is contained within only one sj-nest,
             that sj-nest is "served" by the temptable that we've just 
             created, or
           - a correlated outer table. This table cannot be the first table of
             sj-nest's range.
          For both cases, we're certain that this table is not the 1st table
          of some sj-nest, and further steps in this loop are not needed.

      If we're entering the coverage of another SJ-nest, record that we need
      all of its tables
    emb_sj_map |= get_emb_sj(table)->tables_map;

    if (included(join_tab))
      /* Record that this table is included in the temptable */
      cur_tmp_table.end_idx= join_tab->idx;

After the pass we have an array of SJ_TMP_TABLE structures that describe the
needed temptables.

4. Temptable operations

3.1 Place to create/drop temporary table
Temp table is to be created before the join execution. If the join is in a
(non-flattened) subselect, we should avoid re-creating the temp. table for
every subselect execution (Q: does this matter? A: yes, since similar
arrangements are already made for other optimizations
AA: no, just save and reuse the info.)
TODO: see what is done to other tmp tables.

3.2 Temptable properties
There is no point in creating multi-part key. In fact, that is not always
possible since max # of keyparts is 16 while max # tables in a join is 64.
Therefore we will use the

The table will have the following structure:

  CREATE TABLE t (`rowids` BINARY(N) NOT NULL, UNIQUE(`rowids`));

N is the length of the rowid tuple:

  N= null_bytes + SUM(rowid_length)

  null_bytes = ceil(#(tables inner wrt some outer join) / 8)

The tuple format is

   [null-bytes] [rowid1] [rowid2] ...

for $invalid_rowid rowid, the null bit is set and the rowid bytes a bzero'ed.

3.5 Getting table ROWIDs
We'll need to be able to get rowids of tables that are "served" by a
temptable. The general rule is that we'll need to call

for all "served" tables.

This doesn't work in all cases, the exceptions are covered in the subsequent

3.5.1 prepare_for_position() for smart access methods
We'll need some some other solution for index_merge/DS-MRR and similar methods
which clone/reinitialize the table handlers during the scan. This will be done
as follows:

First, introduce 

   bool st_table::save_rowid;
and have post- set it accordingly. 

Then call file->prepare_for_position(), no matter
which access method we're using. This will serve the "simple" access methods.

"Smart" access methods (the two known are DS-MRR and index_merge) will check
st_table::need_rowid themselves.

Note: checking "if this is a null-complemented row" is easy: just look at

3.5.2 ROWIDs and join buffering
Join buffering accumulates table rows in a buffer and then iterates over the
buffer. When we iterate over the buffer, calling handler->position() will not
return the rowid of current row in the buffer. 

If we use join buffering within the [RESET-TMP-TABLE; WEED-DUPS-OUT] range, we
will have to store {table_row, rowid} sequence in the buffer. 

If several tables use join buffers, we will need to arrange so that rowids are
copied from one buffer to another.
(psergey-todo: will this happen automatically if we set some 'length' member? 
               check it.)

3.5.3 ROWIDs and filesort()
filesort() makes rowids unavailable just like join buffering does. Current 
code may perform filesort:

 1. After the 1st non-const table
 2. After the last table

(we intend to perform filesort at arbitrary location in the join order but
that is not done yet)

In its current state, filesort will not make any rowids unavailable. 

In case #1, the 1st table is the outer table, because parent select can't have
"ORDER BY subquery_table.col" 
(psergey-todo: check if this can be "arranged" by equality propagation?)

In case #2 there is no problem: WEED-DUPS-OUT should be placed before the
filesort, not after.

  when/if we need handle rowid loss in filesort, the solution is same as with
  join buffering: save the rowids as "addon" fields.

4. Join executioner changes

5. Schedule

START:                                                           19.04 midday

T1:  Code the temp table detection                               8 hrs  

MS1: EXPLAINs now work as specified, SELECTs produce             20.04 midday 
     incorrect results with extra row combinations.

T2:  Implement temp. table creation/deletion                     16 hrs 

MS2: SELECTs that use "simple" access methods (ALL, index, and 
     ref, w/o use of join buffering) now produce correct results 25.04 midday

T3:  Handle rowid saving for range/index_merge access methods    8 hrs

MS3: Queries with DS-MRR-range and index_merge produce correct   26.04 midday
     results now.

T4:  Handle join buffering.                                      16 hrs

MS4: Now all queries produce correct results.                    28.04 evening

T5:  Fix some unexpected problems                                16 hrs

MS5: Now all queries produce correct results.                    30.04 evening

6. Not scheduled functionality

This will not be implemented. It will either be implemented later or
superceded by code in other WLs.

6.1 Fanout adjustments
WEED-DUPS-OUT code removes all of the fanout produced by the tables within the
semi-join nest. This should be taken into account in prev_record_reads().

6.2 Temptable use cost

We assume we read and write this much:

bytes_to_write = 
  fanout(start_tbl) *                     // Cycles of temp table use
  SUM(rowid_length) *                     // temp. table row length
  fanount(end_tbl) / fanout (start_tbl)   // #rows in temp table


  use_cost = cost_of_read_and_write( SUM(rowid_length) * fanout(end_tbl) );

where cost_of_read_and_write() assumes random disk seeks (4K = 1 seek)

Q: is this adequate or we need to come up with a better model that accounts
for index lookups, etc?

6.3 Join optimizer fixes

Before the optimization:

Introduce this variable: 

  /* Pointer to embedding sj-nest, or NULL if the table is not in a sj-nest */
  TABLE_LIST *JOIN_TAB::emb_sj_nest 

and set it for every table in the join.

 * Let us accumulate a bitmap of tables that are in the partial join order
   and are within a nested join nest. 

 * Let prev_record_reads() ignore these tables when calculating the partial
   join result size.


In order to account for cost of temptable use, we'll need to track the use of
temp table while we're in the join optimizer.