WL#7093: Optimizer provides InnoDB with a bigger buffer
Affects: Server-8.0
Status: Complete
In order to reduce the cost of latching and B-tree navigation, InnoDB
uses a small internal buffer to fetch records in batches for scans. It
would be more beneficial if the optimizer could provide a bigger
buffer and specify how many records it is going to read. The goal of
this worklog is to introduce a new data structure and handler API
through which the storage engine can be given a bigger buffer.

The implementation of the new record buffer, the extensions to the
handler API, and the support for the buffer in the executor will be
independent of the storage engine. In this worklog, only InnoDB will
be extended to use the new record buffer. Other storage engines may
benefit from starting to use the new record buffer, but that is
outside the scope of this worklog.

In SELECT queries, a buffer will be provided for each table and index
scan that is believed to read more than one row (with some exceptions
detailed below). The bigger buffer will primarily improve the
performance of queries that perform scans which read a large number
of rows. Some example queries that should benefit from it:

# If t is estimated to contain more than one row, a buffer is
# allocated before the table is scanned. If the number of rows in the
# table is large, the bigger buffer will improve the performance.
SELECT * FROM t;

# pk is a primary key. If the predicate is believed to match more
# than one row, a buffer is allocated before the primary key index is
# scanned. The performance benefit is larger the more rows that match
# the predicate.
SELECT * FROM t WHERE pk BETWEEN 1000 AND 10000;

# pk is the primary key in both t1 and t2. A buffer will be allocated
# for the scan of the outer table in the join (whichever table the
# optimizer chooses) if that table has more than one row. No buffer
# is allocated for the inner table, since that table will be accessed
# using multiple single-row eq_ref scans, which don't benefit from a
# buffer. The buffer on the outer table could have a positive effect
# on the performance, but probably not much, since the many scans of
# the inner table are likely to dominate the cost.
SELECT * FROM t1, t2 WHERE t1.pk = t2.pk;

When is a buffer not allocated?

If the executor believes a buffer cannot be used, or that it won't
help performance, no buffer is allocated. The executor will not
allocate a buffer in these cases:

- If the access type is not one of ref, ref_or_null, index_merge,
  range, index or ALL.
- If the scan is estimated to return no more than one row (see the
  example after these lists).
- If the scan is a loose index scan, which typically doesn't read
  many consecutive rows.
- If the scan is against an internally generated temporary table.
- If the storage engine reports that it won't use the buffer.

InnoDB will report that it won't use the buffer in the following
cases:

- If the scan is not read-only.
- If the scan accesses BLOB-based columns (such as BLOB, TEXT, JSON,
  GEOMETRY).
- If it is a full-text query.
- If the table does not have a user-defined primary key or a unique
  constraint that could be used as a primary key.
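For contrast, here is an additional, hypothetical example (not from
the worklog's own list) of a query that gets no buffer, because the
scan is estimated to return at most one row:

# pk is the primary key of t. The equality predicate matches at most
# one row, so the scan is estimated to return no more than one row,
# and no buffer is allocated.
SELECT * FROM t WHERE pk = 1000;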
In the cases where a buffer is not allocated, the performance should
be unaffected. In the cases where a buffer is allocated, the
performance should generally be the same as or higher than before.
There are, however, cases where performance could be hurt:

- If the estimates used by the optimizer are too high, performance
  could suffer because an unnecessarily big buffer is allocated. This
  could be particularly bad for queries that use a low LIMIT, since
  they might perform unnecessary work filling the buffer with rows
  that are discarded later.

- When fetching bigger batches of rows, InnoDB will hold latches for
  a longer time, which could hurt multi-threaded performance.
Functional requirements:

F-1: If there is not enough memory to allocate a record buffer, an
     out-of-memory error should be raised and the statement should be
     aborted.

Non-functional requirements:

NF-1: The performance of queries with large scans should improve.

NF-2: No significant negative performance impact on any queries.
This worklog implements the following main functionality:

1. A new record buffer that will be allocated by the executor and
   used by the storage engine to retrieve batches of records.

2. New handler API functions that the executor can use to check if
   the storage engine wants a record buffer, and to provide the
   record buffer to the storage engine.

3. Changes in the executor that will determine if a record buffer
   should be used and how big it should be, and that allocate the
   record buffer and give it to the storage engine.

4. Changes to InnoDB to use the new record buffer.

Record buffer
=============

A new Record_buffer class is introduced. It contains information
about how many rows to fetch in each batch, and a uchar buffer large
enough to hold a batch of records.

The new buffer might be significantly larger than InnoDB's current
prefetch cache. We don't want to waste time filling a big buffer with
records that fall outside of the requested range, so we want the
storage engine to compare the records against the end-range condition
before adding them to the buffer, and to stop filling the buffer as
soon as the end of the range is reached. To aid this, the
Record_buffer class also has a flag that tells if the end of the
range was reached while the buffer was being filled.

The buffer will be represented by a class that looks like this:

class Record_buffer {
  /// The maximum number of records that can be stored in the buffer.
  ha_rows m_max_records;
  /// The number of bytes available for each record.
  size_t m_record_size;
  /// The number of records currently stored in the buffer.
  ha_rows m_count = 0;
  /// The @c uchar buffer that holds the records.
  uchar *m_buffer;
  /// Tells if end-of-range was found while filling the buffer.
  bool m_out_of_range = false;

public:
  /// Constructor.
  Record_buffer(ha_rows records, size_t record_size, uchar *buffer);

  /**
    Helper function that calculates the size of the buffer argument
    to give to the constructor.
  */
  static constexpr size_t buffer_size(ha_rows records,
                                      size_t record_size);

  /// Get the maximum number of records that can be stored in the
  /// buffer.
  ha_rows max_records() const;

  /// Get the amount of space allocated for each record in the
  /// buffer.
  size_t record_size() const;

  /// Get the number of records currently stored in the buffer.
  ha_rows records() const;

  /// Get the buffer for the record at a given position.
  uchar *record(ha_rows pos) const;

  /**
    Add a new record at the end of the buffer.
    @return the uchar buffer of the added record
  */
  uchar *add_record();

  /// Remove the record that was last added to the buffer.
  void remove_last();

  /// Clear the buffer. Remove all rows. The out-of-range flag is
  /// preserved.
  void clear();

  /// Reset the buffer. Remove all rows and clear the out-of-range
  /// flag.
  void reset();

  /// Set whether the end of the range was reached while filling the
  /// buffer.
  void set_out_of_range(bool val);

  /// Check if the end of the range was reached while filling the
  /// buffer.
  bool is_out_of_range() const;
};
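To illustrate how this interface fits together, here is a minimal
usage sketch. The sizes are arbitrary example values, and
fill_next_record() is a hypothetical fetch function standing in for
the storage engine's row conversion; in the server, the backing
memory actually comes from the execution MEM_ROOT rather than
operator new (see "Use of the record buffer in the server" below).

// Minimal sketch of setting up and filling a Record_buffer.
constexpr ha_rows max_rows = 100;  // capacity of the buffer
constexpr size_t rec_size = 256;   // bytes reserved per record

// buffer_size() returns the number of bytes of backing storage the
// constructor expects for this capacity and record size.
uchar *backing =
    new uchar[Record_buffer::buffer_size(max_rows, rec_size)];
Record_buffer buf(max_rows, rec_size, backing);

// The producer (the storage engine) appends records until the
// buffer is full or the end of the range is reached.
while (buf.records() < buf.max_records()) {
  uchar *slot = buf.add_record();  // space for the next record
  if (!fill_next_record(slot)) {   // hypothetical fetch function
    buf.remove_last();             // no record was written
    buf.set_out_of_range(true);    // signal end-of-range
    break;
  }
}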
Handler API for supporting record buffer
========================================

A new function is added to the handler class for passing the buffer
from the executor to the storage engine:

/**
  Set a record buffer that the storage engine can use for multi-row
  reads. The buffer has to be provided prior to the first read from
  an index or a table.

  @param buffer the buffer to use for multi-row reads
*/
void ha_set_record_buffer(Record_buffer *buffer);

The storage engine may retrieve the buffer using another new function
added to the handler class:

/**
  Get the record buffer that was set with ha_set_record_buffer().

  @return the buffer to use for multi-row reads, or nullptr if there
  is none
*/
Record_buffer *ha_get_record_buffer() const;

For now, the Record_buffer will not be used for returning rows from
the storage engine to the executor. The executor will still use calls
such as ha_index_next() and ha_rnd_next() to read the next row into
TABLE::record[0]. The storage engine should implement this as a
simple copy from the Record_buffer to TABLE::record[0], if it has
records in the buffer.

The optimizer will ask the handler if the storage engine wants a
Record_buffer to be pushed down:

/**
  Does this handler want to get a Record_buffer for multi-row reads
  via the ha_set_record_buffer() function? And if so, what is the
  maximum number of records to allocate space for in the buffer?

  @param[out] max_rows gets set to the maximum number of records to
              allocate space for in the buffer if the function
              returns true

  @retval true if the handler would like a Record_buffer
  @retval false if the handler does not want a Record_buffer
*/
bool ha_is_record_buffer_wanted(ha_rows *const max_rows) const;

Storage engines that support use of a Record_buffer override the
virtual handler::is_record_buffer_wanted() function and return true
if they want a buffer for the scan. They also tell the caller, via
the output parameter, the maximum number of rows to allocate space
for in the buffer. InnoDB is the only storage engine that requests a
buffer using this function at the moment. It sets the maximum number
of rows to 100. This is probably not the ideal maximum size; it is
set this low for now to be conservative. In the future it could be
made cleverer.

Use of the record buffer in the server
======================================

Each TABLE instance will have a Record_buffer instance. It is
initially empty, but the executor may choose to replace it with a
non-empty one before a scan is started (that is, right after
ha_index_init() or ha_rnd_init() is called, and before the first row
is read with ha_index_first() or a similar function). The buffer is
allocated on the execution MEM_ROOT, using THD::alloc(), and it is
freed when the statement has completed. The executor decides the size
of the buffer based on the number of rows it expects to be retrieved.
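Putting these calls together, the executor-side handshake could look
roughly like the sketch below. It is simplified: the function name
give_buffer_to_handler() is a placeholder (the real logic lives in
set_record_buffer() in sql_executor.cc, described in the low-level
design), and expected_rows and rec_size stand in for the estimates
computed there.

// Sketch only: a simplified stand-in for the executor-side logic.
static bool give_buffer_to_handler(THD *thd, TABLE *table,
                                   ha_rows expected_rows) {
  ha_rows max_rows;
  if (!table->file->ha_is_record_buffer_wanted(&max_rows) ||
      expected_rows <= 1)
    return false;  // no buffer for this scan

  // Never allocate space for more rows than the engine asked for.
  const ha_rows rows = std::min(expected_rows, max_rows);
  const size_t rec_size = 256;  // placeholder: read-set prefix size

  // Allocate on the execution MEM_ROOT; freed with the statement.
  uchar *mem = static_cast<uchar *>(
      thd->alloc(Record_buffer::buffer_size(rows, rec_size)));
  if (mem == nullptr) return true;  // OOM aborts the statement (F-1)

  Record_buffer *buffer =
      new (thd->mem_root) Record_buffer(rows, rec_size, mem);
  table->file->ha_set_record_buffer(buffer);
  return false;
}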
sql/record_buffer.h:
====================

This file contains the definition of the Record_buffer class
specified in the high-level specification.

sql/table.h:
============

A new member, m_record_buffer, is added to the TABLE class. This
member holds the record buffer that might be pushed down to the
storage engine.

sql/sql_executor.cc:
====================

set_record_buffer() allocates the record buffer and gives it to the
handler via handler::ha_set_record_buffer(). It first determines if
it should create a buffer, using the following criteria:

1) The storage engine must report that it can and will use a buffer
   for the scan, through the handler::ha_is_record_buffer_wanted()
   function. (Only InnoDB reports that it wants a buffer currently.)

2) The scan must be expected to fetch more than one row.

3) The buffer is not used for temporary tables for now.

4) The buffer is not used for loose index scans for now.

It uses the st_position::rows_fetched variable to find the number of
rows. If it is a scan of the outer table in a join, and the query
block has a LIMIT clause, it also takes the fanout into account in
order to see how many of those rows it will take before the limit is
reached, and possibly reduces the buffer size accordingly. Also, it
limits the size of the buffer to the maximum size reported by the
storage engine through handler::ha_is_record_buffer_wanted().

set_record_buffer() also calculates how much space it needs for each
record in the buffer. It inspects the read set to see which of the
columns are accessed, and only sets aside space to hold the prefix of
the columns that covers all the columns in the read set. If the scan
is an index merge, the prefix is extended to cover all primary key
columns too, even if they are not in the read set, since an index
merge may reopen the scan later with primary key columns added for
positioning.

The size of the buffer will not be allowed to exceed the compile-time
constant MAX_RECORD_BUFFER_SIZE, which is set to 128KB. If necessary,
the maximum number of rows in the buffer will be reduced to make the
total size of the buffer not exceed that limit. (A sketch of this
sizing logic follows the list of call sites below.)

The set_record_buffer() function is called at the start of scans that
have join_type equal to JT_REF, JT_REF_OR_NULL, JT_INDEX_SCAN,
JT_ALL, JT_RANGE or JT_INDEX_MERGE. This is achieved by placing calls
to set_record_buffer() in:

- join_read_always_key (for JT_REF forward scans, and for
  JT_REF_OR_NULL)
- join_read_last_key (for JT_REF backward scans)
- join_init_read_record (for JT_ALL, JT_RANGE, JT_INDEX_MERGE)
- join_read_first (for JT_INDEX_SCAN forward scans)
- join_read_last (for JT_INDEX_SCAN backward scans)
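As a rough illustration of the sizing rules above, the row count
could be clamped along these lines. This is a sketch only; the names
expected_rows, max_rows_wanted and record_size are placeholders for
the quantities described above, not the actual variable names.

// expected_rows comes from st_position::rows_fetched (possibly
// reduced for LIMIT/fanout), max_rows_wanted from
// ha_is_record_buffer_wanted(), and record_size is the per-record
// prefix size computed from the read set.
ha_rows rows = std::min(expected_rows, max_rows_wanted);
if (Record_buffer::buffer_size(rows, record_size) >
    MAX_RECORD_BUFFER_SIZE)
  rows = MAX_RECORD_BUFFER_SIZE / record_size;  // shrink to fit cap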
sql/handler.h:
==============

Three new functions, as specified in the high-level specification:

- handler::ha_set_record_buffer
- handler::ha_get_record_buffer
- handler::ha_is_record_buffer_wanted

The handler class gets a new member, m_record_buffer, which holds the
Record_buffer.

sql/handler.cc:
===============

handler::ha_index_end(), handler::ha_rnd_end() and
handler::ha_reset() make the handler forget the record buffer, so
that a stale, freed buffer will not be used when the handler is
re-initialized.

handler::set_end_range() clears the out-of-range flag in the record
buffer, so that it doesn't stick around in multi-range reads.

handler::compare_key_in_buffer() is a new function which checks if
the end-range condition has been met. It performs essentially the
same check as the existing compare_key_icp() function, with the
following differences:

- It is able to read a row from an arbitrary memory location. This is
  because we want to compare the row which is in the record buffer,
  whereas the ICP variant always compares the row in
  table->record[0].

- It sets handler::in_range_check_pushed_down to avoid doing
  redundant range checks in the handler. In the (existing) ICP case
  this can be done up front when the end range is set, because it is
  known that the end range will be evaluated by the SE if it is
  pushed down. But we don't know if the Record_buffer that is pushed
  down will actually be used by the SE, so we don't know if the end
  range that was pushed with the buffer will be evaluated. The flag
  is therefore set here to tell the handler that the end range in
  fact was evaluated.

storage/innobase/handler/ha_innodb.cc:
======================================

ha_innobase::open() sets a pointer to the handler in the prebuilt
data structure, so that it can be used to access the record buffer in
row_search_mvcc().

ha_innobase::is_record_buffer_wanted() overrides
handler::is_record_buffer_wanted() and reports if InnoDB thinks it
can use a buffer for this scan. Essentially, it checks whether or not
the row_search_mvcc() function is likely to go into this branch:

	/* Decide whether to prefetch extra rows.
	At this point, the clustered index record is protected
	by a page latch that was acquired when pcur was positioned.
	The latch will not be released until mtr_commit(&mtr). */

	if ((match_mode == ROW_SEL_EXACT
	     || prebuilt->n_rows_fetched >= MYSQL_FETCH_CACHE_THRESHOLD)
	    && prebuilt->select_lock_type == LOCK_NONE
	    && !prebuilt->m_no_prefetch
	    && !prebuilt->templ_contains_blob
	    && !prebuilt->templ_contains_fixed_point
	    && !prebuilt->clust_index_was_generated
	    && !prebuilt->used_in_HANDLER
	    && !prebuilt->innodb_api
	    && prebuilt->template_type != ROW_MYSQL_DUMMY_TEMPLATE
	    && !prebuilt->in_fts_query) {

storage/innobase/handler/ha_innopart.cc:
========================================

ha_innopart::open() sets a pointer to the handler in the prebuilt
data structure, just like ha_innobase::open() does.

storage/innobase/include/row0mysql.h:
=====================================

The row_prebuilt_t struct gets a new member variable,
m_mysql_handler, which points to the handler that is performing the
scan. row_prebuilt_t already has a pointer to the TABLE, and the
TABLE has a pointer to the handler, but that path could not be used
to get the handler, for two reasons:

a) row_update_for_mysql_using_upd_graph() sets
   prebuilt->m_mysql_table to NULL when performing cascading
   deletes/updates, and doesn't restore it, so reused handlers could
   have NULL. Introduced by the fix for bug#22469130. This might be a
   bug.

b) Index merge clones the handler for each index. All the handlers
   share the same TABLE, and in the clones
   prebuilt->m_mysql_table->file points to the parent handler. We
   need to be able to access the handler currently in use, rather
   than the parent handler, so we cannot use m_mysql_table.

Since the handler is now always available from the prebuilt struct,
the type of row_prebuilt_t::idx_cond is changed to bool. It used to
be a pointer to the handler, or nullptr to indicate that ICP was not
to be used.

The row_prebuilt_t struct gets a new member function,
can_prefetch_records(), which returns whether or not the scan can use
a record buffer or prefetch cache. This allows
ha_innobase::is_record_buffer_wanted() and row_search_mvcc() to use
the same logic to determine whether a record buffer should be used
(see the sketch below).

storage/innobase/row/row0mysql.cc:
==================================

Added the definition of row_prebuilt_t::can_prefetch_records().
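For illustration, can_prefetch_records() can be thought of as a
consolidation of the conditions from the row_search_mvcc() branch
quoted above, roughly along these lines (a sketch; the actual
function body may differ):

	/* Sketch: the prefetch conditions gathered in one place so
	that is_record_buffer_wanted() and row_search_mvcc() agree.
	The match_mode/n_rows_fetched test stays at the call site in
	row_search_mvcc(), since it depends on scan progress. */
	bool row_prebuilt_t::can_prefetch_records() const {
		return select_lock_type == LOCK_NONE  /* read-only */
		       && !m_no_prefetch
		       && !templ_contains_blob       /* no BLOBs */
		       && !templ_contains_fixed_point
		       && !clust_index_was_generated /* user PK */
		       && !used_in_HANDLER
		       && !innodb_api
		       && template_type != ROW_MYSQL_DUMMY_TEMPLATE
		       && !in_fts_query;             /* no fulltext */
	}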
storage/innobase/row/row0sel.cc:
================================

row_sel_dequeue_cached_row_for_mysql() and row_sel_fetch_last_buf()
are changed to get a buffer for a row from the server-provided
Record_buffer if there is one. Otherwise, they get it from InnoDB's
own prefetch cache.

row_sel_enqueue_cache_row_for_mysql() only copies the actually
accessed prefix of the read row into record[0], instead of the full
record length. Since Record_buffer only allocates space for a prefix
that covers the accessed columns, copying the full record length from
the record buffer might read data from outside the buffer.

row_search_idx_cond_check(), used by ICP, now sets the end-of-range
flag in the Record_buffer when it detects end-of-range, if the server
has provided a buffer. This helps stop the scan sooner after the end
of the range has been detected.

row_search_mvcc():

- An existing check for returning early when end-of-range was
  detected is broken (its condition could never be true), causing one
  extra row to be fetched and discarded after end-of-range has been
  reached. It is fixed by using Record_buffer::is_out_of_range() to
  detect the situation. (This of course only fixes the broken check
  if the optimizer has pushed down a buffer. The problem is still
  there if ICP pushes down an end range and InnoDB's prefetch cache
  is used instead of the Record_buffer.)

- The part of the function which fills InnoDB's prefetch cache is
  changed so that if there is a Record_buffer provided by the server,
  that buffer is used instead, and so that the Record_buffer is
  populated on the first call to row_search_mvcc(). (InnoDB's
  prefetch cache is only populated after row_search_mvcc() has been
  called MYSQL_FETCH_CACHE_THRESHOLD times.)

- When filling the Record_buffer, and ICP is not in use, the end of
  range condition is evaluated, if there is one, and filling the
  buffer stops as soon as the end of the range is reached. (ICP
  already has code for evaluating the end range condition, so no
  extra code is needed for that case. A sketch of the fill loop
  follows below.)
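To make the last point concrete, the fill logic can be pictured
roughly as follows. This is a simplified sketch, not the actual
row_search_mvcc() code: copy_row_to() and
record_matches_end_range() are placeholders for the real row
conversion and for the end-range check done via
handler::compare_key_in_buffer().

	/* Simplified sketch of filling the pushed-down buffer. */
	Record_buffer *buf =
		prebuilt->m_mysql_handler->ha_get_record_buffer();

	while (buf != nullptr
	       && buf->records() < buf->max_records()) {
		uchar *slot = buf->add_record();
		if (!copy_row_to(slot)) {  /* next InnoDB row */
			buf->remove_last();
			break;             /* scan is exhausted */
		}
		if (!record_matches_end_range(slot)) {
			buf->remove_last();          /* past range */
			buf->set_out_of_range(true); /* stop early */
			break;
		}
	}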