WL#7170: InnoDB buffer estimates for tables and indexes

Affects: Server-8.0 — Status: Complete

This is part of an ongoing effort to make the Optimizer make more informed decisions, supplying it with valuable statistical information.

When the Optimizer chooses which access method to use, it can benefit if it knows whether data is stored in memory or has to be read from disk. WL#7168 describes a new API that will be implemented for the handler class to get estimates for how much of a table or an index that is currently available in a main memory buffer.

This worklog is to implement the necessary support in InnoDB for providing these estimates to the handler.

Functional requirements:

It adds a new INFORMATION_SCHEMA table that visualizes the number of cached pages per index. This WL would be fully functional even without this table which is added only to enable users and developers to peek at the contents of the InnoDB container that has the number of cached pages per index.

Non-Functional requirements:

InnoDB must collect information about how many pages of each index are cached into the buffer pool and ship this information to the Optimizer when requested.

This WL implements the InnoDB side of WL#7168.

For informational purposes it extends the interface :

 SHOW CREATE TABLE information_schema.innodb_cached_indexes;
 Table	Create Table
 INNODB_CACHED_INDEXES	CREATE TEMPORARY TABLE `INNODB_CACHED_INDEXES` (
   `INDEX_ID` bigint(21) unsigned NOT NULL DEFAULT '0',
   `N_CACHED_PAGES` bigint(21) unsigned NOT NULL DEFAULT '0'
 ) ENGINE=MEMORY DEFAULT CHARSET=utf8

The table provides an overview of how many pages from each index are cached in the InnoDB buffer pool. The index_id can be resolved via information_schema.innodb_sys_indexes like this:

 SELECT
 tables.name AS table_name,
 indexes.name AS index_name,
 cached.n_cached_pages AS n_cached_pages
 FROM
 information_schema.innodb_cached_indexes AS cached,
 information_schema.innodb_sys_indexes AS indexes,
 information_schema.innodb_sys_tables AS tables
 WHERE
 cached.index_id = indexes.index_id
 AND indexes.table_id = tables.table_id;

In order to be able to return how many pages of a given index are cached in the buffer pool InnoDB must collect this statistical data on the fly as pages are added and removed to/from the buffer pool. Full scanning of the whole buffer pool when requested to supply the data for a given index is also possible, but this would have a huge negative impact on performance and thus is not considered.

So this WL adds a global storage that contains a set of tuples (index_id, number_of_cached_pages). index_id alone is globally unique, but this may change in the future where the globally unique identifier may become (space_id, index_id). The tuple that corresponds to a given index gets an increment whenever a page enters the buffer pool and a decrement when a page is removed from the buffer pool. Only leaf pages are to be counted. Blob pages are ignored because they are irrelevant to the Optimizer.

Unfortunately our buffer pool works in such a way that when a page is added to it (buf_pool_t::LRU) then the actual data of the page is not available yet (buf_block_t::frame). The id of the index is stored there and can only be retrieved when the frame is available. Thus we must hook the "page added to the buffer pool" trigger elsewhere. There are two places that cause a page to enter the buffer pool:

When an existent page is read from disk. I chose to plant the increment in buf_page_monitor(), called from buf_page_io_complete().

When a new page is created (as a result from an INSERT). I chose to plant the increment in btr_page_create().

Removal of a page from the buffer pool can happen on two occasions too:

When a page is not accessed for some time and is pushed away from the LRU by other pages. Then it is either flushed to the disk if there are modifications to it or just evicted.

When a page is deleted (as a result from DELETE).

Fortunately both of these call buf_LRU_block_remove_hashed() and this is where I have planted the decrement call.

Because the stats storage is global it would normally require a protection by a latch to deal with the concurrent access from multiple threads. But this would impose a huge bottleneck and thus I have implemented a basic lock-free hash table to store the data in a manner that it can be accessed and modified by multiple threads concurrently without them having to wait for each other.

We do not want to introduce another latch bottleneck (a global latch that too many threads fight for) while the desire is to split the existing ones or get rid of them entirely. With the number of (logical) CPUs in today's machines increasing and a tendency to have more and more (logical) CPUs on a machine, I believe the way forward is to use algorithms that do not need to access common data or if they do - then they do that in a (lock free) manner that does not block other threads.

The hash table consists of a fixed-size array of (key, val) tuples. The location of a given key's tuple is calculated using hash function + modulus the array size from the key. If that cell in the array is occupied then a search is made to the right, looking for a free slot. Growing is implemented by appending a new (bigger) array to the hash and each get/insert have to search the two (or more) arrays. This is suboptimal, but is only a temporary state before the first (smaller) array gets removed. Removing of an old array is implemented by first migrating all of its elements to the next (bigger) array, then waiting for all its readers to go away and freeing the memory it occupies.

Another alternatives to implementing a lock free hash table that were considered are: 1. std::map + mutex 2. std::map + rwlock 3. std::unordered_map + mutex 4. tbb::concurrent_hash_map (uses latches internally, not a lock free one) 5. mysys' LF_HASH

Surprisingly 2. was slower than 1. for 50% read, 50% write workload even though it is supposed to let readers to concurrently access the table. So the rwlock variants were discarded.

1., 3. and 4. were performance compared to the newly written lock free hash table and, as expected they are a few _magnitudes_ slower on many-CPU machines when lots of threads try to access a common resource. On normal machines with just a few CPUs the lock free hash is again faster than the rest, not a few magnitudes, but still faster.

The mysys' LF_HASH imposes the calling _thread_ to maintain an opaque object - a set of "pins" and pass it to the hash's functions every time. This renders it too unflexible to use. A perf comparison showed that the newly written lock free hash is about 2x faster than LF_HASH.

1., 3. and 4. can be tested and performance compared by enabling them in the unit test ut0lock_free_hash-t.cc. E.g. enable one of "#define TEST_STD_MAP 1", "#define TEST_STD_UNORDERED_MAP 1", "#define TEST_TBB 1", recompile and re-run the unit test.

This WL also adds a new INFORMATION_SCHEMA table 'innodb_cached_indexes' which can be useful to peek at the status of the n-cached-pages-per-index container. For each index in the system it uses get(index_id) to retrieve the number of cached pages. Nothing special about this.