WL#3740: Subquery optimization: Semijoin: Pull-out of inner tables

Affects: Server-6.0   —   Status: Complete

This is a subtask of WL#2980 "Subquery optimization: Semijoin".

It includes:
 - Conversion of subquery IN predicates to semi-join nests (i.e. TABLE_LISTs)
 - Elimination of nested semi-join nests.
 - A procedure that pulls out tables out of semi-join nests when this is 
   possible.

1. Task scope
2. SJ applicability criteria
2.1 MAX_TABLES limitation
3. Subquery to semijoin conversion
3.1 Conversion steps and their location
3.2 The conversion procedure
3.1 Nested semi-join nests removal
3.2 Treatment of semi-join nests by other optimizations
4. The Pull-out procedure



1. Task scope
=============
This WL entry covers:

 - Conversion of subquery IN predicates to semi-join nests (i.e. TABLE_LISTs)
 - Elimination of nested semi-join nests.
 - A procedure that pulls out tables out of semi-join nests when this is
   possible.

After this WL is implemented, the code will be in this state:

 - Subqueries that can't be executed with SJ will be handled as before;
 - Subqueries that can be executed by SJ and allow to eliminate all
   semi-join nests (by pulling out all of the tables) will be handled
   correctly.
 - Subqueries that can be executed with SJ but do not allow for all tables
   to be pulled out will not be executed, a query failure will be reported
   (without any special error message).

That is, this WL will leave the subquery implementation in a state where some
subqueries are not handled. The next task, WL#3741, will restore server's
ability to handle all kinds of subqueries.


2. SJ applicability criteria
============================
(WL#2980 has a copy of this section contents)

A subquery predicate is replaced with semi-join nest if the following
conditions are met:

SUBQ-TO-SJ-CRITERIA:
  1. Subquery predicate is of the form 
       (...)  IN  (SELECT ...)   
       (...) =ANY (SELECT ...)    

  2. Subquery is a single SELECT (no UNIONs)

  3. Subquery does not have GROUP BY or ORDER BY

  4. Subquery does not use aggregate functions or HAVING

  5. Subquery predicate is at the AND-top-level of ON/WHERE clause
  
  6. #tables-in-parent-query + #tables-in-subquery < MAX_TABLES

Note that the subquery may have DISTINCT or LIMIT clauses.

A more precise wording for condition #5 is:

  The WHERE/ON clause can be converted to "subquery_predicate AND remainder"

while the check will be performed as follows:
   
  Subquery is at the AND-top-level of the clause if we can reach it by starting
  from the root Item* and descending into the arguments of Item_cond_and() 
  functions.


2.1 MAX_TABLES limitation
~~~~~~~~~~~~~~~~~~~~~~~~~
The condition #6 is:
 
  #tables-in-parent-query + #tables-in-subquery < MAX_TABLES

The reason for this limitation is that we must not end up in a situation
where the total number of tables in a join is >= MAX_TABLES. If there are
several candidate subqueries and the total number of tables is >= MAX_TABLES
we'll have to make a choice which of the subqueries to convert to semi-joins.

If the "children" subqueries have "grandchildren" subqueries we may also have
a choice between pulling grandchild into the child or child into the parent.

We are not able to make an educated choice due to lack of information, in
particular we don't know subquery's output cardinality. We'll settle for this
set of preferences, in the order of their importance:
  
Parents/children
  1. Pull the most deeply nested subqueries first;

Siblings 
  2.1. Prefer correlated subqueries over uncorrelated;
  2.2. Prefer subqueries that have greater number of outer tables;

The preferences place some restrictions on how the subqueries to semi-join
nests conversion will need to be organized.
 - The transformation must proceed as a depth-first traversal.
 - In order to make the choice we will first need to run JOIN::prepare() for
   all "sibling" subqueries, then collect and compare all candidates, and then
   perform the transformation.

3. Subquery to semijoin conversion
==================================

3.1 Conversion steps and their location
---------------------------------------

Currently the execution proceeds according to this scenario:

  parent_join->prepare() 
  {
    where->fix_fields() 
    {
      Item_subselect->fix_fields() 
      {
        engine->prepare()
        {
          child_join->prepare()
          {
            // Preparation part #1 (fix_fields(), etc)
            [row|single_value]_select_transformer(child_join)
            {
              Inject predicates into the child select,
              convert IN to EXISTS
            }            
            [ select_lex->fix_prepare_information();       
             // GROUP BY and PROCEDURE fix-ups;          ](**)
          }
        }
      }
    }
  }

  parent_join->optimize();

  parent_join->exec()
  {
    ...
    subq_predicate->val_int()
    {
      child_join->optimize();
      child_join->exec();
    }
  }


The two notable properties are
 1. IN->EXISTS conversion is performed "early"
 2. Subqery's optimization is performed lazily

In order to use the strategy described in section 2.1 we must return out of
child_join->prepare() without doing the IN->EXISTS conversion. Since
subqueries are processed recursively, this applies to parent_join->prepare()
as well.

The actions marked (**) seem to have a prerequisite that we either have
perfomed IN->EXISTS conversion or have decided not to do it at all, so they
will have to be delayed as well.

This gives us the following strategy:

  parent_join->prepare()
  {
    where->fix_fields()
    {
      Item_subselect->fix_fields()
      {
        engine->prepare()
        {
          child_join->prepare()
          {
            // Preparation part #1 (fix_fields(), etc)
            if (this is a subquery that meets SUBQ-TO-SJ-CRITERIA #1-#5)
            {
              mark it for further processing // see below
            }
            else
            {
              [row|single_value]_select_transformer(child_join);
              select_lex->fix_prepare_information();
              // GROUP BY and PROCEDURE fix-ups;
            }
          }
        }
      }
    }
  }

  if (this a top-level-select)
  {
    fix_subqueries()
    {
      // 1. fix children
      for each marked subquery S
        S->fix_subqueries();
      
      // 2. Pick which subqueries to convert, and convert as many as we can
      // Materialization-check:
      // 3. Not coded as part of this WL: Pick which subqueries are to be 
      //    handled via Materialization.
      // 4. Finalize join->prepare() for the subqueries that we haven't
      // converted:
      for every marked subquery that wasn't processed
      {
        select_lex->fix_prepare_information(); 
        // GROUP BY and PROCEDURE fix-ups;
      }
    }
  }
  parent_join->optimize();
  parent_join->exec();
  ...

Things to note:
 - The above is not completely in line with WL#1110 
   "Subquery optimization: Materialization". Timour's intent was to have the 
   "Materialization-check" (see above) at the beginning of JOIN::optimize() 
    and we've moved it out and ahead (but didn't move it over anything 
    significant?)
 - In the current code subqueries are optimized lazily (if the subquery is
   never invoked it is never optimized). For subqueries executed by SJ this
   property will be lost as such subqueries are optimized/executed together
   with their parent select.
 - Conversion to semi-join nests will occur recursively, in depth-first
   fashion.

3.2 The conversion procedure
----------------------------

In order to turn subquery predicate into a semi-join nest we'll need to do
the following:

 * In the parent's WHERE/ON clause, replace Item_subselect with Item_int(1).
   An alternative approach is to remove the Item_subselect item, i.e:
     - replace it with NULL if the said item is the complete WHERE/ON clause,
       or
     - remove it from the list of arguments of its parent Item_cond_and and 
       remove the Item_cond_and if it becomes "unary" AND.
   We'll make a choice between the two approaches in LLD or at coding time.
  
 * Walk through child's tables and adjust table->map

 * Adjust the parent_join->tables counter

 * Create a TABLE_LIST join nest that describes the semi-join (its exact
   definition will be given in the LLD)

 * Attach this TABLE_LIST as a child of parent select's global TABLE_LIST or
   of the appropriate outer join nest.

 * Move the subquery's WHERE into semi-join's ON condition.
   - For subquery's WHERE/ON conditions, recalculate all attributes that 
     involve table bitmaps:
      = call update_used_tables()
      = do something to update not_null_tables()
      = adjust other attributes as necessary (to be covered in the LLD)

 * Insert the IN equalities into semi-join's ON condition, i.e for
      (oe1 .. oeN) IN (SELECT ie1, ... ieN)
   create
      oe1=ie1 AND ... AND oeN=ieN
   and insert it into the semi-join's ON condition.
   
   Note: the question whether semijoin's ON condition should be in
   TABLE_LIST::on_expr is to be resolved in the LLD. 
    - Pro:    this is the "natural place"
    - Contra: all over the code the (on_expr!=NULL) check is used to check if
	      this is an outer join.

 * Do nothing for "Item_refs".
   When a subquery is located in the WHERE clause, references "out of the
   subquery" are Item_direct_ref objects which are just call redirectors, and
   consequently can be simply ignored. 
   (The reason for their creation is that we must have 
      Item_direct_ref(
        Item_field(outer_tbl.col)).used_tables() == OUTER_REF_TABLE_BIT
    while
      Item_field(outer_tbl.col).used_tables() = outer_tbl.map
   ).

 * (TODO: if any of the execution strategies will need more information to be
    saved we could cache it here. One candidate is the is_correlated flag)

The above list is not inclusive. Details to be covered in the LLD.

Once these actions are performed we can discard the child st_select_*/JOIN
and operate only on the structures that are in the parent join.


3.1 Nested semi-join nests removal
----------------------------------

We do not need to keep nested semi-join TABLE_LISTs to produce correct query
results. This comes from the fact that for some nested semi-join

  ot_i SJ (it_k SJ iit_j ON sj_cond2) ON sj_cond1

the result is defined as
  
  all row combinations of ot_i
    such that exists a matching row combination it_k
      such that exists a matching row combination iit_j

which is the same as

  all row combinations of ot_i
    such that exist a matching row combination {it_k, iit_j}.

Consequently, we can make simplify_joins() to remove the nested semi-joins.

NOTE: This flattening will destroy e.g. information about whether the 
      grandchild subselect was uncorrelated. We don't need it now but may
      need it for some further optimizations.


3.2 Treatment of semi-join nests by other optimizations
-------------------------------------------------------

This WL code will cause semi-join nests to be seen by the following 
optimizations:

Equality Propagation
  optimize_cond() can use semi-join nest's ON condition together with parent
  select's WHERE condition.

Partition Pruning
  For an inner table, its semi-join nest's ON condition must be used.

Aggregate Functions optimization, opt_sum_query call and code around it.
  It seems we don't need to do anything here.

Constant table detection
  Like with outer joins, constant table detection should detect constant inner
  tables. The pull-out procedure will pull them out of the semijoin.

Ref-analysis
  update_ref_and_keys() should process semi-join's ON expression. It is ok to
  create candidate ref accesses that mix inner and outer tables (e.g. for 
  outer_tbl.key =inner_tbl.key two KEYUSE elements should be created).


4. The Pull-out procedure
=========================

Consider a semi-join

  (ot1 ... otN) SJ (it1 , ... itK) ON sj_cond

Suppose we execute it as inner join. We will violate INNER-UNIQ-REQ when we
construct two row combinations

  (ot1.row1 ... otN.row1, it1.rowX, it2.row1, it3.row1  ...)
  (ot1.row1 ... otN.row1, it1.rowY, it2.row1, it3.row1, ...)

that have the same rows for outer tables but different rows for the inner
table. We can guarantee that this won't happen if there is a functional
dependency between the inner table and the outer tables:

   it1.row = func(ot1.row, ... otN.row)

We can detect and use those two cases of functional dependency:

  * the inner table has at most one row
  * the inner table can be accessed using eq_ref(consts,outer-tables).

Presense of functional dependency allows us to pull the table out of the 
semi-join nest.

The pull-out can be performed once we've detected the constant tables and 
ran update_ref_and_keys(). These two actions are performed inside
make_join_statistics():
 
  make_join_statistics()
  {
    ...
    update_ref_and_keys();
    ...
    do {
      for each non-const table
        if (can make table constant)
          const_tables |= table;
    } while(const_tables was widened);
    ...
    // do range analysis
  }

We can use analogous approach to pull tables out of the semi-join nests:

  for (each semi-join nest)
  {
    // Action #1: Mark the constant tables to be pulled out
    pulled_tables= {};
    for (each table T in semi-join nest)
    {
      if (T has at most one row)
      {
        pulled_tables.insert(T);
      }
    }
    
    // Action #2: Find which tables we can pull out based on
    // update_ref_and_keys() data. Note that pulling one table out
    // can allow us to pull out some other tables too.
    do {
      pulled_a_table= FALSE;
      for each possible ref access to some inner table T
      {
        if (ref access is eq_ref &&
            all used tables are outer tables)
        {
          pulled_a_table=TRUE;
          pulled_tables.insert(T)
        }
      }
    } while (pulled_a_table);

    // Remove pulled_tables from the join nest;
    if (we've removed all of the tables)
      destroy the nest;
    else
    {
      //we're not capable of executing semi-join nests yet
      report query failure;
    }
  }

NOTE: 
It is possible that the above may be accomplished by another strategy: create 
a graph of functional dependencies, topologically sort the graph, and then
retrieve all const tables in one pass.
This is potentially more efficient/elegant but it is harder to do:
 - "standard" topological sort requires the graph to be acyclic. Functional 
    dependency graph is not (and there seems to be no trivial way to remove
    the cycles).
 -  The graph edges are actually "multi-edges", they have one source and
    several destinations.
This seems to make building the graph too complicated and not worth it?


1. SJ applicability check
1.1 Finding out if we at the AND-top-level of ON/WHERE
2. Conversion
3. Interplay with other optimizations
4. Nested semi-joins removal
5. Table pull-out
6. Testcases
7. Schedule
8. Relationship with Prepared Statements
8.1 subquery to semi-join nest conversion
8.2 Nested semi-joins flattening
8.3 Table pullout
8.3.1 Implementation of per-execution table pull-out





1. SJ applicability check
=========================

SUBQ-TO-SJ-CRITERIA (copying from the HLS) are:

  1. Subquery predicate is of the form (...) IN/=ANY (SELECT ...)
  2. Subquery is a single SELECT (no UNIONs)
  3. Subquery does not have GROUP BY or ORDER BY
  4. Subquery does not use aggregate functions or HAVING
  5. Subquery predicate is at the AND-top-level of ON/WHERE clause
  6. #tables-in-parent-query + #tables-in-subquery < MAX_TABLES

These conditions will be checked as follows:
#1: check if the subquery is an Item_in_subselect
#2: test(master_unit()->children == 1)
#3: single 'if'
#4: single 'if'

1.1 Finding out if we at the AND-top-level of ON/WHERE
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The condition #5 is:

  Subquery predicate is at the AND-top-level of ON/WHERE clause:

Currently it is not possible to determine out of item->fix_fields() if we're
at the AND-top-level or not.

We can check this as follows:
 * Add THD::marker= 0;

 * In setup_conds(), for WHERE/ON conditions do:
     thd->marker= MAGIC_NO;
     cond->fix_fields();
     thd->marker= 0;

 * In Item_cond_and:
   - Save thd->marker into local variable and restore after each 
     arg_i->fix_fields() call.
  
 * In Item_cond_or::fix_fields() and Item_func::fix_fields():
     thd->marker= 0;

 * In Item_subselect::fix_fields(): 
    if (thd->marker == MAGIC_NO)
    { we're at AND-top-level of the WHERE or ON }

This will also check condition #6.

We'll collect a list of Item_subselects that are to be converted to
semi-joins. The list will be made of
in class JOIN:      Item_subselect** sj_subqs;
in Item_subselect:  Item** next; 

We'll store pointers-to-pointers as we will need to replace Item_subselect
with Item_int(1) later on.

Startegy for the satisfying #6 is described in the HLD.


2. Conversion
=============

The subquery predicate will be converted to Item_int(1).

When wrapping the tables into semi-join nest, we'll need to keep the following
invariants:
 
 * The semi-join nest will have
    sj_nest->nested_join = {NESTED_JOIN structure with the list of the
                            children};
    sj_nest->join_list= { table list this semi-join is in. This is either
			  embedding->nested_join->list or 
			  select_lex->top_level_list };
    sj_nest->next_leaf= NULL ; // nest can't be leaf

 * When we insert a semi-join nest into the subquery we'll need to make sure
   that all leaves (i.e. base tables) are connected via TABLE_LIST::next_leaf
   chain.
 * ::next_local is same as ::next_leaf but views are considered leaves.
 * ::next_global - it seems we don't need to update this.

NOTE  Look at information_schema and see what's with the Gluh's in_subquery
      flag.

Re semi-join condition we have those considerations:
 - semi-join nests need to be distinguishable from outer join nests and 
   from inner join nests. 
 - it seems there is no merit in placing semi-join's ON condition into the 
   sj_nest->on_expr, as on_expr is used to identify outer joins.

 - Igor's idea is to add semi-joins' ON condition into the WHERE (and use some
   flag in TABLE_LIST to tell semi-join nests from inner join nests). 
   This makes dealing with other optimizations simpler. On the other hand we
   might need semi-join's ON condition for some forthcoming optimizations.
   

3. Interplay with other optimizations
=====================================

Make those optimizations treat semi-join nest's ON conditions as if they were
AND-parts of the WHERE:
 * Equality Propagation
 * Ref-analysis
 * Constant table detection

Verify that those optimizations "properly ignore" semi-join nests:
 * Aggregate Functions optimization


4. Nested semi-joins removal
============================

We can first do all the outer->inner conversions (and we will not miss
anything due to semi-join nest bounds), and then, at "Flatten nested joins"
phase we can remove the nested semi-joins.


5. Table pull-out
=================
See HLS, nothing to add here.


6. Testcases
============
Create a set of testcases that demonstrate that subqueries that are fully
pulled-out are executed correctly.
Query EXPLAINSs should show that the tables are moved into the upper SELECT.

Create a testcase with a subquery that can't be fully pulled out into the
parent and show that it produces an error.


7. Schedule
===========

2007-03-05 StartOfDay:
  Implement the SJ applicability check                    8 hrs
  Implement the subquery to semi-join nest conversion    
  procedure.                                              12 hrs

2007-03-09 EndOfDay:
MS1: subquery predicates are converted to [possibly nested] semi-join
  nests.

  Implement flattening of nested semi-joins.              12 hrs
  Adjust all "other optimizations" (see LLD text) to 
  handle the semi-join nests.                             12 hrs

2007-03-19 EndOfDay:
MS2: semi-join nests are flattened, i.e. everything except the pull-out
     prodcedure is done.

  Implement PullOut procedure                              4 hrs
  Create a set of testcases (actually some of testcases   12 hrs
  will probably be created for earlier milestones but 
  the delivery is in this milestone)

2007-03-23 EndOfDay
MS3: WL entry complete, the code is in the state as described in "Task scope"
     section of the HLD.

  Address some unforeseen problems that will pop up       16 hrs 

2007-03-29 EndOfDay
MS4: WL entry complete


8. Relationship with Prepared Statements
========================================

8.1 subquery to semi-join nest conversion
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
(See WL#3813 for a general contract between PS runtime and the optimizer)

The conditions listed in SUBQ-TO-SJ-CRITERIA remain true (or false) for the
entire lifetime of the statement, so subquery predicate to semi-join nest
conversion should be done once per statement execution.

To achieve that, we'll need 
 - Block search/conversion attempts on non-1st execution
 - Make sure the TABLE_LIST allocations and Item tree changes are performed on
   statement's arena.

8.2 Nested semi-joins flattening
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To be done on per-statement arena.


8.3 Table pullout
~~~~~~~~~~~~~~~~~

1. According to the WL#3813 rule, we can pull out a table permanently if
   it is functionally dependent on outer tables.  This is because we can count
   on corresponding UNIQUE/PK index not to be dropped.

2. We can't permanently pull out a 1-row constant table, because it may have
   more than one row on a subsequent execution.

There are also cases when a table can be pulled with rule #1 once a table with
rule #2 was pulled out. It looks like there is little merit in having distinct
"permanent" and "per-execution" pull-out phases.  A simpler solution of one
single phase that is invoked for every execution seems to be more appropriate.

8.3.1 Implementation of per-execution table pull-out
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Don't really pull the tables out, leave the SJ-nest as is. 

// Join flattening:
  for each sj-nest
  {
    Reset the bitmap of pulled-out tables;
    Accumulate a bitmap of tables that are pulled out
    if (!all tables are pulled out)
    {
      abort query with an error;
    }
  }


// Join optimizer, plan refinement:
  Ignore the SJ-nests when processing outer join nests
   

EOF