WL#6042: Split JOIN_TAB structure in two parts: optimizer's and executor's
GOAL: make easier the maintenance of the Optimizer code, reduce memory usage, and open future possibility for more reduced memory usage.
MEANS: by separation of concerns: split the JOIN_TAB data structure into one structure used in optimization ("the optimization plan") and one used in execution ("the execution plan"); get rid of redundant allocations (of SQL_SELECTs and JOIN_TABs). In execution, the optimization structure will not be used, so its memory storage can then be reclaimed (in a future WL).
User Documentation: Code refactoring. No user documentation required.
No functional requirement. Non-functional requirement: all queries should work as before, same results, no performance decrease. This is a refactoring task. A few query plans might change, which is acceptable if it can be proven that the new plans are no worse.
Tree at my:trunk-wl6042.
- As it is a pure refactoring, there is no user-visible change, no documentation in the manual is needed.
- create a class named QEP_tab in sql_select.h
- right before make_join_readinfo(), allocate an array of QEP_TAB, stored in a new member JOIN::qep. One QEP_TAB per JOIN_TAB.
- in class declarations, move members which are used only in execution from JOIN_TAB to QEP_TAB, thus JOIN_TAB::member becomes QEP_TAB::member.
- at end of make_tmp_tables_info(), which is really the end of optimization, make JOIN_TABs unavailable (i.e. join->join_tab=join->best_ref=NULL, table->reginfo.join_tab= NULL, trash content of JOIN_TABs)
- in execution and EXPLAIN, work exclusively with QEP_TABs, as JOIN_TABs are unavailable; this achieves separation of concerns.
- JOIN_TAB members needed both before make_join_readinfo() and after make_tmp_tables_info() (like "JOIN_TAB::quick") need to be available to JOIN_TAB and QEP_tab; either put them in a shared structure referenced by JOIN_TAB and QEP_TAB (getters/setters pointing to the shared structure); or duplicate them in JOIN_TAB and QEP_TAB (getters/setters ensuring consistency of copies). In the prototype, sharing has been implemented.
- in the current trunk we allocate 2 arrays of JOIN_TABs: one before greedy search, and one after greedy search. Get rid of the second allocation, because the second array is more or less a shuffled version of the first array; we just need to keep track of the shuffling done by get_best_combination(), which is doable with the JOIN::best_ref array. So: before get_best_combination(), work with JOIN::join_tab (as in trunk); at the end of get_best_combination(): set JOIN::join_tab to NULL, write the new order in JOIN::best_ref; after get_best_combination(): access JOIN_TABs through JOIN::best_ref. This reduces memory usage and structure copying.
- eliminate the SQL_SELECT class, putting all its members into QEP_TAB (some can even be removed as they are redundant, like "SQL_SELECT::cond"). This implies that single-table UPDATE/DELETE will use QEP_TAB. Creation of multiple SQL_SELECTs as in trunk will be avoided, reducing memory usage and complexity for the reader.
- For single-table UPDATE/DELETE which need a QEP_TAB, and don't need any JOIN_TAB, create a RAII class QEP_TAB_standalone: it wraps a QEP_TAB, and in its destructor it cleans up this QEP_TAB.
- During optimization, it is decided to use a certain condition/QUICK_SELECT_I for filtering. In certain query, a first phase of execution (like filesort) uses this filtering and produces a result, then a second phase uses this result, so does not need the filtering. Simply deleting the condition/QUICK_SELECT_I in that phase is not a solution - if EXPLAIN CONNECTION runs it will mislead the reader on the plan chosen in the optimization phase. So: distinguish between condition/QUICK_SELECT_I chosen in the optimization phase (stored in new members m_condition_optim, m_quick_optim, stored in QEP_TAB), and condition/QUICK_SELECT_I used in the current phase of execution (m_condition, m_quick). Same idea for index reads (m_keyread_optim).
Contents |
SQL_SELECT is removed
Its functionality moves to QEP_TAB. Let's look at its members:
QUICK_SELECT_I *quick; ///< Non-NULL if quick-select used.
removed. Now only JOIN_TAB::quick is used.
Item *cond; ///< Where condition.
removed. JOIN_TAB::m_condition is used.
Item *icp_cond; ///< Conditions pushed to index.
removed. table->file->pushed_idx_cond is used.
TABLE *head;
removed. QEP_TAB has a table already.
IO_CACHE file; ///< Positions to used records.
removed. Was used in one single case: when UPDATE does a filesort. Now this case simply reuses table->sort.io_cache.
ha_rows records; ///< Records in use if read from file. double read_time; ///< Time to read rows.
test_quick_select, which was the only user of them. Made them local variables of this function.
key_map needed_reg; ///< Possible quick keys after prev tables.
removed. Now JOIN_TAB::needed_reg is used.
table_map const_tables,read_tables;
removed. Computed on the fly in test_quick_select()
bool free_cond;
removed. Was not used.
/** Used for QS_DYNAMIC_RANGE, i.e., "Range checked for each record". Used by optimizer tracing to decide whether or not dynamic range analysis of this select has been traced already. If optimizer trace option DYNAMIC_RANGE is enabled, range analysis will be traced with different ranges for every record to the left of this table in the join. If disabled, range analysis will only be traced for the first range. */ bool traced_before;
moved to QEP_TAB.
void set_quick(QUICK_SELECT_I *new_quick);
moved to QEP_TAB.
bool check_quick(THD *thd, bool force_quick_range, ha_rows limit) { key_map tmp(key_map::ALL_BITS); return test_quick_select(thd, tmp, 0, limit, force_quick_range, ORDER::ORDER_NOT_RELEVANT) < 0; }
Removed. Replaced with calls to test_quick_select.
inline bool skip_record(THD *thd, bool *skip_record) { *skip_record= cond ? cond->val_int() == FALSE : FALSE; return thd->is_error(); }
moved to QEP_TAB.
int test_quick_select(THD *thd, key_map keys, table_map prev_tables, ha_rows limit, bool force_quick_range, const ORDER::enum_order interesting_order);
Made free function, taking QEP_TAB in argument. It might be a good idea to make it a member function of QEP_TAB.
Those are shared between JOIN_TAB and QEP_TAB. There is a class QEP_shared which contains the shared members. There is a class QEP_shared_owner, representing the owner of a QEP_shared. JOIN_TAB and QEP_TAB inherit from QEP_shared_owner. QEP_shared contains members moved from JOIN_TAB. Certain such members, like first_sj_inner_tab, were of type JOIN_TAB*. I replaced them with type ptidx (Plan Table Index).
/** This represents the index of a JOIN_TAB/QEP_TAB in an array. It is signed, because: - firstmatch_return may be -1 (it can happen that the first table of the plan uses FirstMatch: SELECT ... WHERE literal IN (SELECT ...)). - it must hold the invalid value NOPTIDX (which means "no JOIN_TAB/QEP_TAB", equivalent of NULL pointer); this invalid value must itself be different from -1, to distinguish "FirstMatch to before-first-table" (firstmatch_return==-1) from "No FirstMatch" (firstmatch_return==NOPTIDX). */ typedef int8 ptidx; #define NOPTIDX (-2) /** Holds members common to JOIN_TAB and QEP_TAB. Reviewers: so this is the "sharing" implementation. The "duplication" implementation is possible too, by changing getters/setters (taking measures so that copies stay in sync). */ class QEP_shared : public Sql_alloc { /* For each member, if there is no written explanation of why it's shared, it's simply because the member is set before QEP_TABs exist, and used in execution/EXPLAIN (when only QEP_TABs exist). I have sometimes modified execution code, to reduce the number of shared members; without this effort, these would have been shared too: use_quick, keyuse, rowcount. */ JOIN *m_join; /** Index of structure in array: - NOPTIDX if before get_best_combination() - index of pointer to this JOIN_TAB, in JOIN::best_ref array - index of this QEP_TAB, in JOIN::qep array */ ptidx m_idx; /// Corresponding table. Might be an internal temporary one. TABLE *m_table; /// Points into best_positions array. Includes cost info. POSITION *m_position; /* semijoin-related members. */ /** Struct needed for materialization of semi-join. Set for a materialized temporary table, and NULL for all other join_tabs (except when materialization is in progress, @see join_materialize_semijoin()). */ Semijoin_mat_exec *m_sj_mat_exec; /** Boundaries of semijoin inner tables around this table. Valid only once final QEP has been chosen. Depending on the strategy, they may define an interval (all tables inside are inner of a semijoin) or not. last_sj_inner is not set for Duplicates Weedout. */ ptidx m_first_sj_inner, m_last_sj_inner; /* outer-join-related members. */ ptidx m_first_inner; ///< first inner table for including outer join ptidx m_last_inner; ///< last table table for embedding outer join ptidx m_first_upper; ///< first inner table for embedding outer join /** Used to do index-based look up based on a key value. Used when we read constant tables, in misc optimization (like remove_const()), and in execution. */ TABLE_REF m_ref; /// ID of index used for index scan or semijoin LooseScan uint m_index; /// Type of chosen access method (scan, etc). enum join_type m_type; /** condition to be applied for filtering rows in execution. */ Item *m_condition; /** All keys with can be used. Used by add_key_field() (optimization time) and execution of dynamic range (join_init_quick_record()), and EXPLAIN. */ key_map m_keys; /** Either #rows in the table or 1 for const table. Used in optimization, and also in execution for FOUND_ROWS(). */ ha_rows m_records; /** Non-NULL if quick-select used. Filled in optimization, used in execution to find rows, and in EXPLAIN. */ QUICK_SELECT_I *m_quick; /* Maps below are shared because of dynamic range: in execution, it needs to know the prefix tables, to find the possible QUICK methods. */ /** The set of all tables available in the join prefix for this table, including the table handled by this JOIN_TAB. */ table_map prefix_tables_map; /** The set of tables added for this table, compared to the previous table in the join prefix. */ table_map added_tables_map; };
A QEP_shared_owner contains a pointer to a QEP_shared (m_qs) and offers accessors which do simple forwarding, like:
ptidx first_sj_inner() const { return m_qs->first_sj_inner(); } void set_first_inner(ptidx i) { return m_qs->set_first_inner(i); }
Class QEP_TAB
JOIN_TAB members not listed in the previous section have moved to QEP_TAB.
class QEP_TAB : public Sql_alloc, public QEP_shared_owner { /* Variables for semi-join duplicate elimination */ SJ_TMP_TABLE *flush_weedout_table; SJ_TMP_TABLE *check_weed_out_table; /* If set, means we should stop join enumeration after we've got the first match and return to the specified join tab. May point to join->join_tab[-1] which means stop join execution after the first match. */ ptidx firstmatch_return; /* Length of key tuple (depends on #keyparts used) to store in loosescan_buf. If zero, means that loosescan is not used. */ uint loosescan_key_len; /* Buffer to save index tuple to be able to skip duplicates */ uchar *loosescan_buf; /* If doing a LooseScan, this QEP is the first (i.e. "driving") QEP_TAB, and match_tab points to the last QEP_TAB handled by the strategy. match_tab->found_match should be checked to see if the current value group had a match. If doing a FirstMatch, check this QEP_TAB to see if there is a match. Unless the FirstMatch performs a "split jump", this is equal to the current QEP_TAB. */ ptidx match_tab; /* Used by FirstMatch and LooseScan. TRUE <=> there is a matching record combination */ bool found_match; bool found; /**< true after all matches or null complement*/ bool not_null_compl;/**< true before null complement is added */ ptidx first_unmatched; /**< used for optimization purposes only */ /// For a materializable derived or SJ table: true if has been materialized bool materialized; READ_RECORD::Setup_func materialize_table; /** Initialize table for reading and fetch the first row from the table. If table is a materialized derived one, function must materialize it with prepare_scan(). */ READ_RECORD::Setup_func read_first_record; Next_select_func next_select; READ_RECORD read_record; /* The following two fields are used for a [NOT] IN subquery if it is executed by an alternative full table scan when the left operand of the subquery predicate is evaluated to NULL. */ READ_RECORD::Setup_func save_read_first_record;/* to save read_first_record */ READ_RECORD::Read_func save_read_record;/* to save read_record.read_record */ // join-cache-related members /** Reviewers: I deleted used_fields,used_fieldlength,used_blobs, used_rowid_fields, I found they didn't have to be in QEP/JOIN_TABs. */ bool used_null_fields; bool used_uneven_bit_fields; /* Used by DuplicateElimination. tab->table->ref must have the rowid whenever we have a current record. copy_current_rowid needed because we cannot bind to the rowid buffer before the table has been opened. */ bool keep_current_rowid; st_cache_field *copy_current_rowid; /** TRUE <=> remove duplicates on this table. */ bool distinct; bool not_used_in_distinct; /// Index condition for BKA access join Item *cache_idx_cond; /** HAVING condition for checking prior saving a record into tmp table*/ Item *having; QEP_operation *op; /* Tmp table info */ Temp_table_param *tmp_table_param; /* Sorting related info */ Filesort *filesort; /** List of topmost expressions in the select list. The *next* JOIN TAB in the plan should use it to obtain correct values. Same applicable to all_fields. These lists are needed because after tmp tables functions will be turned to fields. These variables are pointing to tmp_fields_list[123]. Valid only for tmp tables and the last non-tmp table in the query plan. @see JOIN::make_tmp_tables_info() */ List<Item> *fields; /** List of all expressions in the select list */ List<Item> *all_fields; /* Pointer to the ref array slice which to switch to before sending records. Valid only for tmp tables. */ Ref_ptr_array *ref_array; /** Number of records saved in tmp table */ ha_rows send_records; /** Used for QS_DYNAMIC_RANGE, i.e., "Range checked for each record". Used by optimizer tracing to decide whether or not dynamic range analysis of this select has been traced already. If optimizer trace option DYNAMIC_RANGE is enabled, range analysis will be traced with different ranges for every record to the left of this table in the join. If disabled, range analysis will only be traced for the first range. */ bool quick_traced_before; /// @See m_quick_optim Item *m_condition_optim; /** m_quick is the quick "to be used at this stage of execution". It can happen that filesort uses the quick (produced by the optimizer) to produce a sorted result, then the read of this result has to be done without "quick", so we must reset m_quick to NULL, but we want to delay freeing of m_quick or it would close the filesort's result and the table prematurely. In that case, we move m_quick to m_quick_optim (=> delay deletion), reset m_quick to NULL (read of filesort's result will be without quick); if this is a subquery which is later executed a second time, QEP_TAB::reset() will restore the quick from m_quick_optim into m_quick. quick_optim stands for "the quick decided by the optimizer". */ QUICK_SELECT_I *m_quick_optim; /** TRUE <=> only index is going to be read for this table. Reviewers: this was introduced by the WL "explain connection" as "use_keyread", then Bug #17620039 REFACTOR TABLE::KEY_READ VS JOIN_TAB::USE_KEYREAD was filed. Here I rename it to make it clear that it's "the optimizer's decision" (just like quick_optim or condition_optim), and we don't need anymore to update it every time we update keyread in the optimization phase; just update it once for all when optimization is finishing. This change, imho, partially solves the bug above. However, it remains, as in trunk, that certain functions like join_read_first (execution function!) set keyread to true, which is bad - EXPLAIN does not execute so does not see this decision, and what if EXPLAIN CONNECTION runs before join_read_first - it won't see this decision. Decision should be made and recorded by the optimizer only. This strange design is the reason why opt_explain.cc has to test both keyread_optim and keyread, in an attempt to make EXPLAIN CONNECTION reliable; but that doesn't solve the problem of normal EXPLAIN, so there are some differences between EXPLAIN and execution, like in main.explain_for_connection_rqg_trad @@ -3,3 +3,3 @@ 1 PRIMARY alias2 NULL ALL NULL NULL NULL NULL 20 100.00 Using join buffer (Block Nested Loop) -1 PRIMARY alias3 NULL ALL col_varchar_key NULL NULL NULL 20 33.33 Range checked for each record (index map: 0x20) +1 PRIMARY alias3 NULL range col_varchar_key col_varchar_key 4 NULL 5 100.00 Range checked for each record (index map: 0x20); Using index This could probably be fixed in the scope of the above report. I imagine that the setting of keyread needs to be taken out of join_read_first, to the optimization phase (like, that of UPDATE which misses it currently), then such phase should call set_keyread_optim() (like UPDATE calls set_quick_optim) and EXPLAIN should read only keyread_optim. Execution would still be free to turn keyread off for certain parts of execution like create_sort_index(), but that would not affect EXPLAIN. In other words: same solution as quick_optim/condition_optim. */ bool m_keyread_optim; };
Class JOIN_TAB
What remains in the class:
typedef struct st_join_table : public Sql_alloc, public QEP_shared_owner { Key_use *m_keyuse; /**< pointer to first used key */ /** Pointer to the associated join condition: - if this is a table with position==NULL (e.g. internal sort/group temporary table), pointer is NULL - otherwise, pointer is the address of some TABLE_LIST::m_join_cond. Thus, TABLE_LIST::m_join_cond and *JOIN_TAB::m_join_cond_ref are the same thing (changing one changes the other; thus, optimizations made on the second are reflected in SELECT_LEX::print_table_array() which uses the first). */ Item **m_join_cond_ref; COND_EQUAL *cond_equal; /**< multiple equalities for the on expression*/ double worst_seeks; /** Keys with constant part. Subset of keys. */ key_map const_keys; key_map checked_keys; /**< Keys checked */ key_map needed_reg; /** Used to avoid repeated range analysis for the same key in test_if_skip_sort_order(). This would otherwise happen if the best range access plan found for a key is turned down. quick_order_tested is cleared every time the select condition for this JOIN_TAB changes since a new condition may give another plan and cost from range analysis. */ key_map quick_order_tested; /* Number of records that will be scanned (yes scanned, not returned) by the best 'independent' access method, i.e. table scan or QUICK_*_SELECT) */ ha_rows found_records; /* Cost of accessing the table using "ALL" or range/index_merge access method (but not 'index' for some reason), i.e. this matches method which E(#records) is in found_records. */ ha_rows read_time; /** The set of tables that this table depends on. Used for outer join and straight join dependencies. */ table_map dependent; /** The set of tables that are referenced by key from this table. */ table_map key_dependent; uint used_fieldlength; enum quick_type use_quick; /** Join buffering strategy. After optimization it contains chosen join buffering strategy (if any). */ uint m_use_join_cache; /* SemiJoinDuplicateElimination variables: */ /* Embedding SJ-nest (may be not the direct parent), or NULL if none. This variable holds the result of table pullout. */ TABLE_LIST *emb_sj_nest; /* NestedOuterJoins: Bitmap of nested joins this table is part of */ nested_join_map embedding_map; /** Flags from SE's MRR implementation, to be used by JOIN_CACHE */ uint join_cache_flags; /** TRUE <=> AM will scan backward */ bool reversed_access; }
Changes to SELECT code
- Before get_best_combination(): very little changes. init_planner_arrays() takes care to allocate 2 nodes for JOIN_TABs of sort/group internal tmp tables. It allocates both JOIN_TABs and QEP_shared-s for the JOIN_TABs. JOIN_TABs are accessed through JOIN::join_tab.
- In get_best_combination(): don't memcpy old JOIN_TABs to new JOIN_TABs; just fill JOIN::best_ref with JOIN_TABs in plan order.
- After get_best_combination(): JOIN_TABs are accessed through JOIN::best_ref.
- Right before make_join_readinfo(): QEP_TABs are allocated, pointing to the QEP_shared of their JOIN_TAB.
- In make_join_readinfo() and make_tmp_tables_info(), QEP_TABs are filled with execution information. At end of make_tmp_tables_info(), JOIN_TABs are trashed. QEP_shared are not trashed, as QEP_TAB will need them in execution/EXPLAIN.
Changes to single-table UPDATE/DELETE code
They now use a single QEP_TAB (on the stack) instead of multiple memroot-allocated SQL_SELECTs. They don't use any JOIN_TAB. The QEP_TAB is inside an object of type QEP_TAB_standalone:
/** Use this class when you need a QEP_TAB not connected to any JOIN_TAB. */ class QEP_TAB_standalone : public Sql_alloc { public: QEP_TAB_standalone() { m_qt.set_qs(&m_qs); } ~QEP_TAB_standalone() { m_qt.qepcleanup(); } /// @returns access to the QEP_TAB QEP_TAB &as_QEP_TAB() { return m_qt; } private: QEP_shared m_qs; QEP_TAB m_qt; };
Changes to range optimizer
There used to be JOIN_TAB::quick and SQL_SELECT::quick, the latter is gone (no more juggling back and forth, as described in wl#6602).
set_quick() used to free the old quick object, change JOIN_TAB::type, which sometimes forced the caller to do "juggling" to block this internal behaviour. Now, set_quick() sets the quick, period. The caller is responsible for freeing the old quick, and changing type.
When a filesort was done for a table, the quick+condition pair of JOIN_TAB::select used to move to JOIN_TAB::filesort::select; this forced other code (like EXPLAIN) to consult both places to find "quick". It also required a bool "own_select" for filesort to know that it had to free "select". Now this is gone. See *_optim members described earlier. filesort code now uses qep_tab instead of pair SQL_SELECT+TABLE.
EXPLAIN uses QEP_TABs only, JOIN_TABs are not available (alternative: keep JOIN_TABs around, but ensure with getters/setters that they are read-only; does not prevent people from adding execution members to JOIN_TABs in the future, though).
EXPLAIN consults only {quick,condition}_optim, never {quick,condition} which can change during execution without holding a mutex.
Explain_table_base and Modification_plan use QEP_TAB instead of SQL_SELECT.
JOIN_TAB::use_quick not being available to EXPLAIN (because JOIN_TAB is not available and I didn't want to put use_quick in QEP_shared), EXPLAIN tests, instead of use_quick, if QEP_TAB uses join_init_quick_record (new function dynamic_range()).
For the same reason, to not consult JOIN_TAB::use_join_cache, a function JOIN_CACHE::cache_type() has been added.
For the same reason, to not consult JOIN_TAB::rowcount, POSITION::rows_fetched is used instead. rowcount was set by some strange code in test_if_skip_order(), such code is rewritten (more correctly) and done in make_join_readinfo().
Misc changes
- When we create Item_func_trig_cond, QEP_TAB does not exist; but when we evaluate the Item, QEP_TAB is needed (because JOIN_TAB::{not_null_compl,found} moved to it); thus the ctor of the Item has been changed: instead of a JOIN_TAB* it gets JOIN* and index of table. It can thus locate QEP_TAB in execution.
- JOIN_TAB::{info,packed_info} are gone, I rewrote code to do without them (whoo-hoo).
- In select_lex::cleanup_all_joins(bool full): "full" was always false, I removed it for simplification.
- JOIN::cleanup(true) was only called from JOIN::destroy(); JOIN::cleanup(bool) was written like: "if (full) x else y", so I merged JOIN::cleanup(true) (the "x" part) into JOIN::destroy(), JOIN::cleanup(false) (the "y" part) becomes JOIN::cleanup(no argument). This is one step in the direction of "less complexity in cleanup functions" (join_free, cleanup, destroy...).
Size of structures
A comparison using release non-debug binary on Linux 64-bit.
- Trunk: sizeof(JOIN_TAB)=768
- WL6042 tree: JOIN_TAB 144, QEP_TAB 368, QEP_shared 200, total (per table): 710. The reduction is due to the use of one-byte index instead of pointer, like in first_sj_inner.
Size of binary:
- trunk: 130872657
- WL: 131183741, increase is 311 kB (0.23%). If we then unshare "JOIN *join" (duplicate it in QEP_TAB and JOIN_TAB): 306 kB. Unshare "TABLE *table": 281 kB. table() is often invoked in code; by unsharing it we save a dereference of m_qs. If we duplicate everything we save on code size, but increase on structure size.
Hard to predict what will be best for performance: benchmarks to be done by QA.