WL#6042: Split JOIN_TAB structure in two parts: optimizer's and executor's

Status: Complete

GOAL: make easier the maintenance of the Optimizer code, reduce memory usage, and open future possibility for more reduced memory usage.

MEANS: by separation of concerns: split the JOIN_TAB data structure into one structure used in optimization ("the optimization plan") and one used in execution ("the execution plan"); get rid of redundant allocations (of SQL_SELECTs and JOIN_TABs). In execution, the optimization structure will not be used, so its memory storage can then be reclaimed (in a future WL).

User Documentation: Code refactoring. No user documentation required.

No functional requirement. Non-functional requirement: all queries should work as before, same results, no performance decrease. This is a refactoring task. A few query plans might change, which is acceptable if it can be proven that the new plans are no worse.

Tree at my:trunk-wl6042.

  1. As it is a pure refactoring, there is no user-visible change, no documentation in the manual is needed.
  2. create a class named QEP_tab in sql_select.h
  3. right before make_join_readinfo(), allocate an array of QEP_TAB, stored in a new member JOIN::qep. One QEP_TAB per JOIN_TAB.
  4. in class declarations, move members which are used only in execution from JOIN_TAB to QEP_TAB, thus JOIN_TAB::member becomes QEP_TAB::member.
  5. at end of make_tmp_tables_info(), which is really the end of optimization, make JOIN_TABs unavailable (i.e. join->join_tab=join->best_ref=NULL, table->reginfo.join_tab= NULL, trash content of JOIN_TABs)
  6. in execution and EXPLAIN, work exclusively with QEP_TABs, as JOIN_TABs are unavailable; this achieves separation of concerns.
  7. JOIN_TAB members needed both before make_join_readinfo() and after make_tmp_tables_info() (like "JOIN_TAB::quick") need to be available to JOIN_TAB and QEP_tab; either put them in a shared structure referenced by JOIN_TAB and QEP_TAB (getters/setters pointing to the shared structure); or duplicate them in JOIN_TAB and QEP_TAB (getters/setters ensuring consistency of copies). In the prototype, sharing has been implemented.
  8. in the current trunk we allocate 2 arrays of JOIN_TABs: one before greedy search, and one after greedy search. Get rid of the second allocation, because the second array is more or less a shuffled version of the first array; we just need to keep track of the shuffling done by get_best_combination(), which is doable with the JOIN::best_ref array. So: before get_best_combination(), work with JOIN::join_tab (as in trunk); at the end of get_best_combination(): set JOIN::join_tab to NULL, write the new order in JOIN::best_ref; after get_best_combination(): access JOIN_TABs through JOIN::best_ref. This reduces memory usage and structure copying.
  9. eliminate the SQL_SELECT class, putting all its members into QEP_TAB (some can even be removed as they are redundant, like "SQL_SELECT::cond"). This implies that single-table UPDATE/DELETE will use QEP_TAB. Creation of multiple SQL_SELECTs as in trunk will be avoided, reducing memory usage and complexity for the reader.
  10. For single-table UPDATE/DELETE which need a QEP_TAB, and don't need any JOIN_TAB, create a RAII class QEP_TAB_standalone: it wraps a QEP_TAB, and in its destructor it cleans up this QEP_TAB.
  11. During optimization, it is decided to use a certain condition/QUICK_SELECT_I for filtering. In certain query, a first phase of execution (like filesort) uses this filtering and produces a result, then a second phase uses this result, so does not need the filtering. Simply deleting the condition/QUICK_SELECT_I in that phase is not a solution - if EXPLAIN CONNECTION runs it will mislead the reader on the plan chosen in the optimization phase. So: distinguish between condition/QUICK_SELECT_I chosen in the optimization phase (stored in new members m_condition_optim, m_quick_optim, stored in QEP_TAB), and condition/QUICK_SELECT_I used in the current phase of execution (m_condition, m_quick). Same idea for index reads (m_keyread_optim).


SQL_SELECT is removed

Its functionality moves to QEP_TAB. Let's look at its members:

 QUICK_SELECT_I *quick;	///< Non-NULL if quick-select used.

removed. Now only JOIN_TAB::quick is used.

 Item		*cond;		///< Where condition.

removed. JOIN_TAB::m_condition is used.

 Item		*icp_cond;	///< Conditions pushed to index.

removed. table->file->pushed_idx_cond is used.

 TABLE	*head;

removed. QEP_TAB has a table already.

 IO_CACHE file;		///< Positions to used records.

removed. Was used in one single case: when UPDATE does a filesort. Now this case simply reuses table->sort.io_cache.

 ha_rows records;		///< Records in use if read from file.
 double read_time;		///< Time to read rows.

test_quick_select, which was the only user of them. Made them local variables of this function.

 key_map needed_reg;		///< Possible quick keys after prev tables.

removed. Now JOIN_TAB::needed_reg is used.

 table_map const_tables,read_tables;

removed. Computed on the fly in test_quick_select()

 bool	free_cond;

removed. Was not used.

   Used for QS_DYNAMIC_RANGE, i.e., "Range checked for each record".
   Used by optimizer tracing to decide whether or not dynamic range
   analysis of this select has been traced already. If optimizer
   trace option DYNAMIC_RANGE is enabled, range analysis will be
   traced with different ranges for every record to the left of this
   table in the join. If disabled, range analysis will only be traced
   for the first range.
 bool traced_before;

moved to QEP_TAB.

 void set_quick(QUICK_SELECT_I *new_quick);

moved to QEP_TAB.

 bool check_quick(THD *thd, bool force_quick_range, ha_rows limit)
   key_map tmp(key_map::ALL_BITS);
   return test_quick_select(thd, tmp, 0, limit, force_quick_range,
                            ORDER::ORDER_NOT_RELEVANT) < 0;

Removed. Replaced with calls to test_quick_select.

 inline bool skip_record(THD *thd, bool *skip_record)
   *skip_record= cond ? cond->val_int() == FALSE : FALSE;
   return thd->is_error();

moved to QEP_TAB.

 int test_quick_select(THD *thd, key_map keys, table_map prev_tables,
                       ha_rows limit, bool force_quick_range,
                       const ORDER::enum_order interesting_order);

Made free function, taking QEP_TAB in argument. It might be a good idea to make it a member function of QEP_TAB.

Shared members of JOIN_TAB

Those are shared between JOIN_TAB and QEP_TAB. There is a class QEP_shared which contains the shared members. There is a class QEP_shared_owner, representing the owner of a QEP_shared. JOIN_TAB and QEP_TAB inherit from QEP_shared_owner. QEP_shared contains members moved from JOIN_TAB. Certain such members, like first_sj_inner_tab, were of type JOIN_TAB*. I replaced them with type ptidx (Plan Table Index).

   This represents the index of a JOIN_TAB/QEP_TAB in an array.
   It is signed, because:
   - firstmatch_return may be -1 (it can happen that the first table of the
   plan uses FirstMatch: SELECT ... WHERE literal IN (SELECT ...)).
   - it must hold the invalid value NOPTIDX (which means "no
   JOIN_TAB/QEP_TAB", equivalent of NULL pointer); this invalid value must
   itself be different from -1, to distinguish "FirstMatch to
   before-first-table" (firstmatch_return==-1) from "No FirstMatch"
typedef int8 ptidx;
#define NOPTIDX (-2)

   Holds members common to JOIN_TAB and QEP_TAB.

   Reviewers: so this is the "sharing" implementation. The "duplication"
   implementation is possible too, by changing getters/setters (taking
   measures so that copies stay in sync).
class QEP_shared : public Sql_alloc
    For each member, if there is no written explanation of why it's shared,
    it's simply because the member is set before QEP_TABs exist, and used in
    execution/EXPLAIN (when only QEP_TABs exist).
    I have sometimes modified execution code, to reduce the number of shared
    members; without this effort, these would have been shared too:
    use_quick, keyuse, rowcount.

  JOIN	*m_join;

     Index of structure in array:
     - NOPTIDX if before get_best_combination()
     - index of pointer to this JOIN_TAB, in JOIN::best_ref array
     - index of this QEP_TAB, in JOIN::qep array
  ptidx  m_idx;

  /// Corresponding table. Might be an internal temporary one.
  TABLE *m_table;

  /// Points into best_positions array. Includes cost info.
  POSITION      *m_position;

    semijoin-related members.

    Struct needed for materialization of semi-join. Set for a materialized
    temporary table, and NULL for all other join_tabs (except when
    materialization is in progress, @see join_materialize_semijoin()).
  Semijoin_mat_exec *m_sj_mat_exec;

    Boundaries of semijoin inner tables around this table. Valid only once
    final QEP has been chosen. Depending on the strategy, they may define an
    interval (all tables inside are inner of a semijoin) or
    not. last_sj_inner is not set for Duplicates Weedout.
  ptidx m_first_sj_inner, m_last_sj_inner;

    outer-join-related members.
  ptidx m_first_inner;   ///< first inner table for including outer join
  ptidx m_last_inner;    ///< last table table for embedding outer join
  ptidx m_first_upper;   ///< first inner table for embedding outer join

     Used to do index-based look up based on a key value.
     Used when we read constant tables, in misc optimization (like
     remove_const()), and in execution.
  TABLE_REF	m_ref;

  /// ID of index used for index scan or semijoin LooseScan
  uint		m_index;

  /// Type of chosen access method (scan, etc).
  enum join_type m_type;

     condition to be applied for filtering rows in execution.
  Item          *m_condition;

     All keys with can be used.
     Used by add_key_field() (optimization time) and execution of dynamic
     range (join_init_quick_record()), and EXPLAIN.
  key_map       m_keys;

     Either #rows in the table or 1 for const table.
     Used in optimization, and also in execution for FOUND_ROWS().
  ha_rows	m_records;

     Non-NULL if quick-select used.
     Filled in optimization, used in execution to find rows, and in EXPLAIN.
  QUICK_SELECT_I *m_quick;

    Maps below are shared because of dynamic range: in execution, it needs to
    know the prefix tables, to find the possible QUICK methods.

    The set of all tables available in the join prefix for this table,
    including the table handled by this JOIN_TAB.
  table_map     prefix_tables_map;
    The set of tables added for this table, compared to the previous table
    in the join prefix.
  table_map     added_tables_map;

A QEP_shared_owner contains a pointer to a QEP_shared (m_qs) and offers accessors which do simple forwarding, like:

 ptidx first_sj_inner() const { return m_qs->first_sj_inner(); }
 void set_first_inner(ptidx i) { return m_qs->set_first_inner(i); }


JOIN_TAB members not listed in the previous section have moved to QEP_TAB.

class QEP_TAB : public Sql_alloc, public QEP_shared_owner
 /* Variables for semi-join duplicate elimination */
 SJ_TMP_TABLE *flush_weedout_table;
 SJ_TMP_TABLE *check_weed_out_table;

   If set, means we should stop join enumeration after we've got the first
   match and return to the specified join tab. May point to
   join->join_tab[-1] which means stop join execution after the first
 ptidx firstmatch_return;

   Length of key tuple (depends on #keyparts used) to store in loosescan_buf.
   If zero, means that loosescan is not used.
 uint loosescan_key_len;

 /* Buffer to save index tuple to be able to skip duplicates */
 uchar *loosescan_buf;

   If doing a LooseScan, this QEP is the first (i.e.  "driving")
   QEP_TAB, and match_tab points to the last QEP_TAB handled by the strategy.
   match_tab->found_match should be checked to see if the current value group
   had a match.
   If doing a FirstMatch, check this QEP_TAB to see if there is a match.
   Unless the FirstMatch performs a "split jump", this is equal to the
   current QEP_TAB.
 ptidx match_tab;

   Used by FirstMatch and LooseScan. TRUE <=> there is a matching
   record combination
 bool found_match;
 bool found;         /**< true after all matches or null complement*/
 bool not_null_compl;/**< true before null complement is added    */

 ptidx first_unmatched; /**< used for optimization purposes only   */

 /// For a materializable derived or SJ table: true if has been materialized
 bool materialized;

 READ_RECORD::Setup_func materialize_table;
    Initialize table for reading and fetch the first row from the table. If
    table is a materialized derived one, function must materialize it with
 READ_RECORD::Setup_func read_first_record;
 Next_select_func next_select;
 READ_RECORD read_record;
   The following two fields are used for a [NOT] IN subquery if it is
   executed by an alternative full table scan when the left operand of
   the subquery predicate is evaluated to NULL.
 READ_RECORD::Setup_func save_read_first_record;/* to save read_first_record */
 READ_RECORD::Read_func save_read_record;/* to save read_record.read_record */

 // join-cache-related members
    Reviewers: I deleted used_fields,used_fieldlength,used_blobs,
    used_rowid_fields, I found they didn't have to be in QEP/JOIN_TABs.
 bool          used_null_fields;
 bool          used_uneven_bit_fields;

   Used by DuplicateElimination. tab->table->ref must have the rowid
   whenever we have a current record. copy_current_rowid needed because
   we cannot bind to the rowid buffer before the table has been opened.
 bool keep_current_rowid;
 st_cache_field *copy_current_rowid;

 /** TRUE <=> remove duplicates on this table. */
 bool distinct;

 bool not_used_in_distinct;

 /// Index condition for BKA access join
 Item *cache_idx_cond;

 /** HAVING condition for checking prior saving a record into tmp table*/
 Item *having;

 QEP_operation *op;

 /* Tmp table info */
 Temp_table_param *tmp_table_param;

 /* Sorting related info */
 Filesort *filesort;

   List of topmost expressions in the select list. The *next* JOIN TAB
   in the plan should use it to obtain correct values. Same applicable to
   all_fields. These lists are needed because after tmp tables functions
   will be turned to fields. These variables are pointing to
   tmp_fields_list[123]. Valid only for tmp tables and the last non-tmp
   table in the query plan.
   @see JOIN::make_tmp_tables_info()
 List<Item> *fields;
 /** List of all expressions in the select list */
 List<Item> *all_fields;
   Pointer to the ref array slice which to switch to before sending
   records. Valid only for tmp tables.
 Ref_ptr_array *ref_array;

 /** Number of records saved in tmp table */
 ha_rows send_records;

   Used for QS_DYNAMIC_RANGE, i.e., "Range checked for each record".
   Used by optimizer tracing to decide whether or not dynamic range
   analysis of this select has been traced already. If optimizer
   trace option DYNAMIC_RANGE is enabled, range analysis will be
   traced with different ranges for every record to the left of this
   table in the join. If disabled, range analysis will only be traced
   for the first range.
 bool quick_traced_before;

 /// @See m_quick_optim
 Item          *m_condition_optim;

    m_quick is the quick "to be used at this stage of execution".
    It can happen that filesort uses the quick (produced by the optimizer) to
    produce a sorted result, then the read of this result has to be done
    without "quick", so we must reset m_quick to NULL, but we want to delay
    freeing of m_quick or it would close the filesort's result and the table
    In that case, we move m_quick to m_quick_optim (=> delay deletion), reset
    m_quick to NULL (read of filesort's result will be without quick); if
    this is a subquery which is later executed a second time,
    QEP_TAB::reset() will restore the quick from m_quick_optim into m_quick.
    quick_optim stands for "the quick decided by the optimizer".
 QUICK_SELECT_I *m_quick_optim;

    TRUE <=> only index is going to be read for this table.
    Reviewers: this was introduced by the WL "explain connection" as
    "use_keyread", then
    was filed. Here I rename it to make it clear that it's "the optimizer's
    decision" (just like quick_optim or condition_optim), and we don't need
    anymore to update it every time we update keyread in the optimization
    phase; just update it once for all when optimization is finishing. This
    change, imho, partially solves the bug above.
    However, it remains, as in trunk, that certain functions like
    join_read_first (execution function!) set keyread to true, which is bad -
    EXPLAIN does not execute so does not see this decision, and what
    if EXPLAIN CONNECTION runs before join_read_first - it won't see this
    decision. Decision should be made and recorded by the optimizer only.
    This strange design is the reason why
    opt_explain.cc has to test both keyread_optim and keyread, in an
    attempt to make EXPLAIN CONNECTION reliable; but that doesn't solve the
    problem of normal EXPLAIN, so there are some differences between EXPLAIN
    and execution, like in main.explain_for_connection_rqg_trad
    @@ -3,3 +3,3 @@
1	PRIMARY	alias2	NULL	ALL	NULL	NULL	NULL	NULL	20	100.00	Using join buffer (Block Nested Loop)
-1	PRIMARY	alias3	NULL	ALL	col_varchar_key	NULL	NULL	NULL	20	33.33	Range checked for each record (index map: 0x20)
+1	PRIMARY	alias3	NULL	range	col_varchar_key	col_varchar_key	4	NULL	5	100.00	Range checked for each record (index map: 0x20); Using index

   This could probably be fixed in the scope of the above report.
   I imagine that the setting of
   keyread needs to be taken out of join_read_first, to the optimization
   phase (like, that of UPDATE which misses it currently), then such phase should
   call set_keyread_optim() (like UPDATE calls set_quick_optim) and EXPLAIN
   should read only keyread_optim. Execution would still be free to turn
   keyread off for certain parts of execution like create_sort_index(), but
   that would not affect EXPLAIN. In other words: same solution as
 bool m_keyread_optim;



What remains in the class:

typedef struct st_join_table : public Sql_alloc, public QEP_shared_owner
  Key_use       *m_keyuse;        /**< pointer to first used key               */

     Pointer to the associated join condition:
     - if this is a table with position==NULL (e.g. internal sort/group
     temporary table), pointer is NULL
     - otherwise, pointer is the address of some TABLE_LIST::m_join_cond.
     Thus, TABLE_LIST::m_join_cond and *JOIN_TAB::m_join_cond_ref are the same
     thing (changing one changes the other; thus, optimizations made on the
     second are reflected in SELECT_LEX::print_table_array() which uses the
  Item          **m_join_cond_ref;
  COND_EQUAL    *cond_equal;    /**< multiple equalities for the on expression*/
  double	worst_seeks;
  /** Keys with constant part. Subset of keys. */
  key_map	const_keys;
  key_map	checked_keys;			/**< Keys checked */
  key_map	needed_reg;

    Used to avoid repeated range analysis for the same key in
    test_if_skip_sort_order(). This would otherwise happen if the best
    range access plan found for a key is turned down.
    quick_order_tested is cleared every time the select condition for
    this JOIN_TAB changes since a new condition may give another plan
    and cost from range analysis.
  key_map       quick_order_tested;

    Number of records that will be scanned (yes scanned, not returned) by the
    best 'independent' access method, i.e. table scan or QUICK_*_SELECT)
  ha_rows       found_records;
    Cost of accessing the table using "ALL" or range/index_merge access
    method (but not 'index' for some reason), i.e. this matches method which
    E(#records) is in found_records.
  ha_rows       read_time;
    The set of tables that this table depends on. Used for outer join and
    straight join dependencies.
  table_map     dependent;
    The set of tables that are referenced by key from this table.
  table_map     key_dependent;
  uint		used_fieldlength;
  enum quick_type use_quick;

    Join buffering strategy.
    After optimization it contains chosen join buffering strategy (if any).
  uint          m_use_join_cache;

  /* SemiJoinDuplicateElimination variables: */
    Embedding SJ-nest (may be not the direct parent), or NULL if none.
    This variable holds the result of table pullout.
  TABLE_LIST    *emb_sj_nest;

  /* NestedOuterJoins: Bitmap of nested joins this table is part of */
  nested_join_map embedding_map;

  /** Flags from SE's MRR implementation, to be used by JOIN_CACHE */
  uint join_cache_flags;

  /** TRUE <=> AM will scan backward */
  bool reversed_access;

Changes to SELECT code

  1. Before get_best_combination(): very little changes. init_planner_arrays() takes care to allocate 2 nodes for JOIN_TABs of sort/group internal tmp tables. It allocates both JOIN_TABs and QEP_shared-s for the JOIN_TABs. JOIN_TABs are accessed through JOIN::join_tab.
  2. In get_best_combination(): don't memcpy old JOIN_TABs to new JOIN_TABs; just fill JOIN::best_ref with JOIN_TABs in plan order.
  3. After get_best_combination(): JOIN_TABs are accessed through JOIN::best_ref.
  4. Right before make_join_readinfo(): QEP_TABs are allocated, pointing to the QEP_shared of their JOIN_TAB.
  5. In make_join_readinfo() and make_tmp_tables_info(), QEP_TABs are filled with execution information. At end of make_tmp_tables_info(), JOIN_TABs are trashed. QEP_shared are not trashed, as QEP_TAB will need them in execution/EXPLAIN.

Changes to single-table UPDATE/DELETE code

They now use a single QEP_TAB (on the stack) instead of multiple memroot-allocated SQL_SELECTs. They don't use any JOIN_TAB. The QEP_TAB is inside an object of type QEP_TAB_standalone:

   Use this class when you need a QEP_TAB not connected to any JOIN_TAB.
class QEP_TAB_standalone : public Sql_alloc
  QEP_TAB_standalone() { m_qt.set_qs(&m_qs); }
  ~QEP_TAB_standalone() { m_qt.qepcleanup(); }
  /// @returns access to the QEP_TAB
  QEP_TAB &as_QEP_TAB() { return m_qt; }
  QEP_shared m_qs;
  QEP_TAB m_qt;

Changes to range optimizer

There used to be JOIN_TAB::quick and SQL_SELECT::quick, the latter is gone (no more juggling back and forth, as described in wl#6602).

set_quick() used to free the old quick object, change JOIN_TAB::type, which sometimes forced the caller to do "juggling" to block this internal behaviour. Now, set_quick() sets the quick, period. The caller is responsible for freeing the old quick, and changing type.

When a filesort was done for a table, the quick+condition pair of JOIN_TAB::select used to move to JOIN_TAB::filesort::select; this forced other code (like EXPLAIN) to consult both places to find "quick". It also required a bool "own_select" for filesort to know that it had to free "select". Now this is gone. See *_optim members described earlier. filesort code now uses qep_tab instead of pair SQL_SELECT+TABLE.

Changes related to EXPLAIN

EXPLAIN uses QEP_TABs only, JOIN_TABs are not available (alternative: keep JOIN_TABs around, but ensure with getters/setters that they are read-only; does not prevent people from adding execution members to JOIN_TABs in the future, though).

EXPLAIN consults only {quick,condition}_optim, never {quick,condition} which can change during execution without holding a mutex.

Explain_table_base and Modification_plan use QEP_TAB instead of SQL_SELECT.

JOIN_TAB::use_quick not being available to EXPLAIN (because JOIN_TAB is not available and I didn't want to put use_quick in QEP_shared), EXPLAIN tests, instead of use_quick, if QEP_TAB uses join_init_quick_record (new function dynamic_range()).

For the same reason, to not consult JOIN_TAB::use_join_cache, a function JOIN_CACHE::cache_type() has been added.

For the same reason, to not consult JOIN_TAB::rowcount, POSITION::rows_fetched is used instead. rowcount was set by some strange code in test_if_skip_order(), such code is rewritten (more correctly) and done in make_join_readinfo().

Misc changes

  1. When we create Item_func_trig_cond, QEP_TAB does not exist; but when we evaluate the Item, QEP_TAB is needed (because JOIN_TAB::{not_null_compl,found} moved to it); thus the ctor of the Item has been changed: instead of a JOIN_TAB* it gets JOIN* and index of table. It can thus locate QEP_TAB in execution.
  2. JOIN_TAB::{info,packed_info} are gone, I rewrote code to do without them (whoo-hoo).
  3. In select_lex::cleanup_all_joins(bool full): "full" was always false, I removed it for simplification.
  4. JOIN::cleanup(true) was only called from JOIN::destroy(); JOIN::cleanup(bool) was written like: "if (full) x else y", so I merged JOIN::cleanup(true) (the "x" part) into JOIN::destroy(), JOIN::cleanup(false) (the "y" part) becomes JOIN::cleanup(no argument). This is one step in the direction of "less complexity in cleanup functions" (join_free, cleanup, destroy...).

Size of structures

A comparison using release non-debug binary on Linux 64-bit.

  1. Trunk: sizeof(JOIN_TAB)=768
  2. WL6042 tree: JOIN_TAB 144, QEP_TAB 368, QEP_shared 200, total (per table): 710. The reduction is due to the use of one-byte index instead of pointer, like in first_sj_inner.

Size of binary:

  1. trunk: 130872657
  2. WL: 131183741, increase is 311 kB (0.23%). If we then unshare "JOIN *join" (duplicate it in QEP_TAB and JOIN_TAB): 306 kB. Unshare "TABLE *table": 281 kB. table() is often invoked in code; by unsharing it we save a dereference of m_qs. If we duplicate everything we save on code size, but increase on structure size.

Hard to predict what will be best for performance: benchmarks to be done by QA.