WL#7093: Optimizer provides InnoDB with a bigger buffer

Affects: Server-8.0   —   Status: Complete

In order to reduce the cost of latching and B-tree navigation, InnoDB
uses a small internal buffer to fetch records in batches for scans. It
would be more beneficial if the optimizer would provide a bigger
buffer and specify how many records it is going to read.

The goal of this WL is to introduce a new data structure and handler
API for providing the storage engine a bigger buffer.

The implementation of the new record buffer, the extensions to the
handler API, and the support for the buffer in the executor will be
independent of the storage engine. In this worklog, only InnoDB will
be extended to use the new record buffer. Other storage engines may
benefit from starting to use the new record buffer, but that is
outside of the scope of this worklog.

In SELECT queries, a buffer will be provided for each table and index
scan that is believed to read more than one row (with some exceptions
detailed below).

The bigger buffer will primarily improve the performance of queries
that perform scans which read a large number of rows. Some example
queries that should benefit from it:

# If t is estimated to contain more than one row, a buffer is
# allocated before the table is scanned. If the number of rows in the
# table is large, the bigger buffer will improve the performance.

# pk is a primary key. If the predicate is believed to match more than
# one row, a buffer is allocated before the primary key index is
# scanned. The performance benefit is larger the more rows that match
# the predicate.

# pk is primary key in t1 and t2. A buffer will be allocated for the
# scan of the outer table in the join (whichever table the optimizer
# chooses) if that table has more than one row. No buffer is allocated
# for the inner table, since that table will be accessed using
# multiple single-row eq_ref scans which don't benefit from a buffer.
# The buffer on the outer table could have a positive effect on the
# performance, but probably not much, since the many scans of the
# inner table are likely to dominate the cost.
SELECT * FROM t1, t2 WHERE t1.pk = t2.pk;

When is a buffer not allocated? If the executor believes a buffer
cannot be used, or that it won't help improving the performance, a
buffer will not be allocated.

The executor will not allocate a buffer in these cases:

- If the access type is not one of ref, ref_or_null, index_merge,
  range, index or ALL.

- If the scan is estimated to return no more than one row.

- If the scan is a loose index scan, which typically doesn't read many
  consecutive rows.

- If the scan is against an internally generated temporary table.

- If the storage engine reports that it won't use the buffer.

InnoDB will report that it won't use the buffer in the following

- If the scan is not read-only.

- If the scan accesses BLOB-based columns (such as BLOB, TEXT, JSON,

- If it is a fulltext query.

- If the table does not have a user-defined primary key or a unique
  constraint that could be used as a primary key.

In the cases where a buffer is not allocated, the performance should
be unaffected.

In the cases where a buffer is allocated, the performance should
generally be the same or higher than before. There could be cases
where performance is hurt, though. For example:

- If the estimates used by the optimizer are too high, performance
  could be affected negatively because an unnecessarily big buffer is
  allocated. This could be particularly bad for queries that use a low
  LIMIT, since they might perform unnecessary work filling the buffer
  with rows that are discarded later.

- When fetching bigger batches of rows, InnoDB will hold latches for a
  longer time, which could hurt multi-threaded performance.
Functional requirements:

F-1: If there is not enough memory to allocate a record buffer, an out-of-memory
error should be raised and the statement should be aborted.

Non-functional requirements:

NF-1: The performance of queries with large scans should improve.

NF-2: No significant negative performance impact on any queries.
This worklog implements the following main functionality:

1. A new record buffer that will be allocated by the executor and used
   by the storage engine to retrieve batches of records.

2. New handler API functions that the executor can use to check if the
   storage engine wants a record buffer, and to provide the record
   buffer to the storage engine.

3. Changes in the executor that will determine if a record buffer
   should be used and how big it should be, and allocate the record
   buffer and give it to the storage engine.

4. Changes to InnoDB to use the new record buffer.

Record buffer

A new Record_buffer class is introduced. It contains information about
how many rows to fetch in each batch, and a uchar buffer large enough
to hold a batch of records.

The new buffer might be significantly larger than InnoDB's current
prefetch cache. We don't want to waste time filling a big buffer with
records that fall outside of the requested range, so we want the
storage engine to compare the records against the end-range condition
before they add them to the buffer, and stop filling the buffer as
soon as the end of the range is reached. To aid this, the
Record_buffer class also has a flag that tells if the end of the range
was reached while the buffer was being filled.

The buffer will be represented by a class that looks like this:

class Record_buffer
  /// The maximum number of records that can be stored in the buffer.
  ha_rows m_max_records;
  /// The number of bytes available for each record.
  size_t m_record_size;
  /// The number of records currently stored in the buffer.
  ha_rows m_count= 0;
  /// The @c uchar buffer that holds the records.
  uchar *m_buffer;
  /// Tells if end-of-range was found while filling the buffer.
  bool m_out_of_range= false;
  /// Constructor.
  Record_buffer(ha_rows records, size_t record_size, uchar *buffer);
    Helper function that calculates the size of the buffer argument
    to give to the constructor.
  static constexpr size_t buffer_size(ha_rows records, size_t record_size);
  /// Get the maximum number of records that can be stored in the buffer.
  ha_rows max_records() const;
  /// Get the amount of space allocated for each record in the buffer.
  size_t record_size() const;
  /// Get the number of records currently stored in the buffer.
  ha_rows records() const;
  /// Get the buffer for the record at a given position.
  uchar *record(ha_rows pos) const;
    Add a new record at the end of the buffer.
    @return the uchar buffer of the added record
  uchar *add_record();
  /// Remove the record that was last added to the buffer.
  void remove_last();
  /// Clear the buffer. Remove all rows. The out-of-range flag is preserved.
  void clear();
  /// Reset the buffer. Remove all rows and clear the out-of-range flag.
  void reset();
  /// Set whether the end of the range was reached while filling the buffer.
  void set_out_of_range(bool val);
  /// Check if the end of the range was reached while filling the buffer.
  bool is_out_of_range() const;

Handler API for supporting record buffer

A new function is added to the handler class for passing the buffer
from the optimizer to the storage engine:

    Set a record buffer that the storage engine can use for multi-row reads.
    The buffer has to be provided prior to the first read from an index or a

    @param buffer the buffer to use for multi-row reads
  void ha_set_record_buffer(Record_buffer *buffer);

The storage engine may retrieve the buffer using another new function
added to the handler class:

    Get the record buffer that was set with ha_set_record_buffer().

    @return the buffer to use for multi-row reads, or nullptr if there is none
  Record_buffer *ha_get_record_buffer() const;

For now, the Record_buffer will not be used for returning rows from
the storage engine to the executor. The executor will still use calls
such as ha_index_next() and ha_rnd_next() to read the next row into
TABLE::record[0]. The storage engine should implement this as a simple
copy from the Record_buffer to TABLE::record[0], if it has records in
the buffer.

The optimizer will ask the handler if the storage engine wants a
Record_buffer to be pushed down.

    Does this handler want to get a Record_buffer for multi-row reads
    via the ha_set_record_buffer() function? And if so, what is the
    maximum number of records to allocate space for in the buffer?

    @param[out] max_rows  gets set to the maximum number of records to
                          allocate space for in the buffer if the function
                          returns true

    @retval true   if the handler would like a Record_buffer
    @retval false  if the handler does not want a Record_buffer
  bool ha_is_record_buffer_wanted(ha_rows *const max_rows) const;

Storage engines that support use of a Record_buffer override the
virtual handler::is_record_buffer_wanted() function and return true
if they want a buffer for the scan. They also tell the caller via the
output parameter the maximum number of rows to allocate space for in
the buffer.

InnoDB is the only storage engine that requests a buffer using this
function at the moment. It sets the maximum number of rows to 100.
This is probably not the ideal maximum size, it is set this low for
now to be conservative. In the future it could be made cleverer.

Use of the record buffer in the server

Each TABLE instance will have a Record_buffer instance. It is
initially empty, but the optimizer may choose to replace it with a
non-empty one before a scan is started (that is, right after
ha_index_init() or ha_rnd_init() is called, and before the first row
is read with ha_index_first() or similar function). The buffer is
allocated in the execution memroot, using THD::alloc(), and it is
freed when the statement has completed. The optimizer decides the size
of the buffer based on the number of rows it expects to be retrieved.

This file contains the definition of the Record_buffer class specified
in the HLS.


A new member, m_record_buffer, is added to the TABLE class. This
member holds the record buffer that might be pushed down to the
storage engine.


set_record_buffer() allocates the record buffer and gives it to the
handler via handler::ha_set_record_buffer(). It first determines if it
should create a buffer, using the following criteria:

1) The storage engine must report that it can and will use a buffer
for the scan, through the handler::ha_is_record_buffer_wanted()
function. (Only InnoDB reports that it wants a buffer currently.)

2) The scan must be expected to fetch more than one row.

3) The buffer is not used for temporary tables for now.

4) The buffer is not used for loose index scan for now.

It uses the st_position::rows_fetched variable to find the number of
rows. If it is a scan of the outer table in a join, and the query
block has a LIMIT clause, it also takes the fanout into account in
order to see how many of those rows it will take before the limit is
reached, and possibly reduce the buffer size accordingly. Also, it
limits the size of the buffer to the maximum size reported by the
storage engine through handler::ha_is_record_buffer_wanted().

set_record_buffer() also calculates how much space it needs for each
record in the buffer. It inspects the read set to see which of the
columns are accessed, and only sets aside space to hold the prefix of
the columns that covers all the columns in the read set. If the scan
is an index merge, the prefix is extended to cover all primary key
columns too, even if they are not in the read_set, since an index
merge may reopen the scan later with primary key columns added for

The size of the buffer will not be allowed to exceed the compile-time
constant MAX_RECORD_BUFFER_SIZE, which is set to 128KB. If necessary,
the maximum number of rows in the buffer will be reduced to make the
total size of the buffer not exceed that limit.

The set_record_buffer() function is called at the start of scans that
have join_type equal to JT_REF, JT_REF_OR_NULL, JT_INDEX_SCAN, JT_ALL,
JT_RANGE or JT_INDEX_MERGE. This is achieved by placing calls to
set_record_buffer() in:

- join_read_always_key (for JT_REF forward scans, and for JT_REF_OR_NULL)
- join_read_last_key (for JT_REF backward scans)
- join_init_read_record (for JT_ALL, JT_RANGE, JT_INDEX_MERGE)
- join_read_first (for JT_INDEX_SCAN forward scans)
- join_read_last (for JT_INDEX_SCAN backward scans)


Three new functions, as specified in the HLS:

- handler::ha_set_record_buffer
- handler::ha_get_record_buffer
- handler::ha_is_record_buffer_wanted

The handler class gets a new member, m_record_buffer, which holds the


handler::ha_index_end(), handler::ha_rnd_end() and handler::ha_reset()
make the handler forget the record buffer, so that a stale, freed
buffer will not be used when the handler is re-initialized.

handler::set_end_range() clears the out-of-range flag in the record
buffer, so that it doesn't stick around in multi-range reads.

handler::compare_key_in_buffer() is a new function which checks if the
end range condition has been met. It performs essentially the same
check as the existing compare_key_icp() function, with the following

- It is able to read a row from an arbitrary memory location. This is
  because we want to compare the row which is in the record buffer,
  whereas the ICP variant always compares the row in table->record[0].

- It sets the handler::in_range_check_pushed_down to avoid doing
  redundant range checks in the handler. In the (existing) ICP case
  this can be done up front when the end range is set, because it is
  known that the end range will be evaluated by the SE if it is pushed
  down. But we don't know if the Record_buffer that is pushed down,
  will actually be used by the SE, so we don't know if the end range
  that was pushed with the buffer will be evaluated. Set the flag to
  tell the handler that the end range in fact was evaluated.


ha_innobase::open() sets a pointer to the handler in the prebuilt data
structure, so that it can be used to access the record buffer in

ha_innobase::is_record_buffer_wanted() overrides
handler::is_record_buffer_wanted() and reports if InnoDB thinks it
can use a buffer for this scan. Essentially, it checks whether or not
the row_search_mvcc() function is likely to go into this branch:

	/* Decide whether to prefetch extra rows.
	At this point, the clustered index record is protected
	by a page latch that was acquired when pcur was positioned.
	The latch will not be released until mtr_commit(&mtr). */

	if ((match_mode == ROW_SEL_EXACT
	     || prebuilt->n_rows_fetched >= MYSQL_FETCH_CACHE_THRESHOLD)
	    && prebuilt->select_lock_type == LOCK_NONE
	    && !prebuilt->m_no_prefetch
	    && !prebuilt->templ_contains_blob
	    && !prebuilt->templ_contains_fixed_point
	    && !prebuilt->clust_index_was_generated
	    && !prebuilt->used_in_HANDLER
	    && !prebuilt->innodb_api
	    && prebuilt->template_type != ROW_MYSQL_DUMMY_TEMPLATE
	    && !prebuilt->in_fts_query) {


ha_innopart::open() sets a pointer to the handler in the prebuilt data
structure, just like ha_innobase::open() does.


The row_prebuilt_t struct gets a new member variable, m_mysql_handler,
which points to the handler that is performing the scan.

row_prebuilt_t already has a pointer to the TABLE, and the TABLE has a
pointer to the handler. This could however not be used to get the
handler, because

a) row_update_for_mysql_using_upd_graph() sets prebuilt->m_mysql_table
   to NULL when performing cascading deletes/updates, and doesn't
   restore it, so reused handlers could have NULL. Introduced by the
   fix for bug#22469130. This might be a bug.

b) Index merge clones the handler for each index. All the handlers
   share the same TABLE, and in the clones
   prebuilt->m_mysql_table->file points to the parent handler. We need
   to be able to access the handler currently in use, rather than the
   parent handler, so we cannot use m_mysql_table.

Since the handler is now always available from the prebuilt struct,
the type of row_prebuilt_t::idx_cond is changed to bool. It used to be
a pointer to the handler, or nullptr to indicate that ICP was not to
be used.

The row_prebuilt_t struct gets a new member function,
can_prefetch_records(), which returns whether or not the scan can use
a record buffer or prefetch cache. This allows
ha_innobase::is_record_buffer_wanted() and row_search_mvcc() to use
the same logic to determine whether a record buffer should be used.


Added definition of row_prebuilt_t::can_prefetch_records().


row_sel_dequeue_cached_row_for_mysql() and row_sel_fetch_last_buf()
are changed to get a buffer for a row from the server-provided
Record_buffer if there is one. Otherwise, it gets it from InnoDB's own
prefetch cache.

row_sel_enqueue_cache_row_for_mysql() only copies the actually
accessed prefix of the read row into record[0], instead of the full
record length. Since Record_buffer only allocates space for a prefix
that covers the accessed columns, copying the full record length from
the record buffer might read data from outside the buffer.

row_search_idx_cond_check(), used by ICP, now sets the end-of-range
flag in the Record_buffer when it detects end-of-range, if the server
has provided a buffer. This helps stopping the scan sooner after the
end range has been detected.


- An existing check for returning early when the end-of-range was
  detected, is broken (its condition could never be true), causing one
  extra row to be fetched and discarded after end-of-range has been
  reached. It is fixed by using Record_buffer::is_out_of_range() to
  detect the situation. (This of course only fixes the broken check if
  the optimizer has pushed down a buffer. The problem is still there
  if ICP pushes down an end-range, and InnoDB's prefetch cache is used
  instead of the Record_buffer.)

- The part of the function which fills InnoDB's prefetch cache is
  changed so that if there is a Record_buffer provided by the server,
  that buffer is used instead, and so that the Record_buffer is
  populated on the first call to row_search_mvcc(). (InnoDB's prefetch
  cache is only populated after row_search_mvcc() has been called

- When filling the Record_buffer, and ICP is not in use, evaluate the
  end of range condition, if there is one. Stop filling the buffer if
  the end of the range is reached. (ICP already has code for
  evaluating the end range condition, so no extra code is needed for
  that case.)