WL#14100: InnoDB: Faster truncate/drop table space
Affects: Server-8.0
Status: Complete
When a tablespace is truncated or dropped, InnoDB scans the LRU and FLUSH (dirty page) lists and moves every page belonging to the truncated/dropped tablespace from the LRU/FLUSH list to the buffer pool FREE list. With large buffer pools (where "large" can be anything > 32G), this full list scan can slow user threads to a crawl. It gets worse: since InnoDB doesn't know how many pages from a tablespace are in the buffer pool, it will scan the full list even if the table is otherwise empty or has only one page in the buffer pool (or none).

This problem is exacerbated by the use of a pool of temporary tablespaces; on connection disconnect it can cause long stalls. Since the truncate of temporary tablespaces reuses the same tablespace ID, we cannot use the DROP/CREATE trick available to other tablespaces (see BUG#98869).

To solve this problem we generalize the solution from WL#11819 so that it can be applied to all tablespaces, not just UNDO. On tablespace drop or truncation we mark the tablespace as deleted, delete its file, and report back to the user that the operation is finished. Afterwards, we lazily free all pages in the buffer pool that reference this tablespace, as we encounter them one by one. There are several blogs out there describing the pain and the general problem.

Note: This doesn't fix the problem of physically unlinking a large file taking a very long time. That requires a separate fix and will need to be coordinated with the Runtime team.
NF-1: Should fix the stall in user threads after dropping a large tablespace with a > 32G buffer pool.
NF-2: Should fix the stall when truncating a temporary tablespace.
NF-3: Should fix the stall in user threads when AHI is enabled and the user drops a tablespace that has many pages referenced from the AHI.
NF-4: sizeof(buf_block_t) and sizeof(buf_page_t) should not increase.
Scanning large LRU and FLUSH lists to remove pages from dropped/truncated tablespaces is not necessary if we can remove the pages lazily and get rid of the AHI entries lazily too. The idea is to increment a version number in the tablespace's in-memory data structure (fil_space_t) when the tablespace is dropped or truncated; this is referred to as DVER (Drop/Truncate version number) from now on. For a page write to make it to the persistent store, the page must have a version number >= DVER. Pages with a version < DVER will be freed.

Handling reads is trickier. When a caller tries to fetch a page from the buffer pool we need to detect whether the page is stale. If the page is stale then we free the page and force a disk read. This stale-page check applies to all reads, so it must be very cheap; otherwise it will cost us on each page fetch from the buffer pool. To reduce the stale-page check overhead we store a pointer to the fil_space_t instance in buf_page_t.

The pointer to the fil_space_t instance embedded in buf_page_t is reference counted: each time we set buf_page_t::m_space we increment a reference count in fil_space_t. This is required for temporary and undo tablespaces but not for user tablespaces. For user tablespaces we do a drop/create, which changes the tablespace ID, so there can be no reads of dropped/truncated user tablespaces. Undo and temporary tablespace IDs come from a smaller pool and can in theory be reused; therefore we must ensure that no page from a dropped/truncated undo or temporary tablespace remains in the buffer pool. Additionally, temporary tablespaces reuse the tablespace ID on truncate (which is very frequent), so we must ensure that any stale pages are lazily purged from the buffer pool before the old instance's memory can be freed.

We also have to handle the impact on flushing when dealing with stale pages. This is explained in the low level design.
Additions to fil_space_t:

  struct fil_space_t {
    ...
    /** Version number of the instance, not persistent. Every time we
    truncate or delete we bump up the version number. */
    lsn_t m_version{};

    /** Reference count of how many pages point to this instance. An
    instance cannot be deleted if the reference count is greater than
    zero. The only exception is shutdown. */
    std::atomic_int m_n_ref_count{};
  };

Additions to buf_page_t:

  struct buf_page_t {
    ...
    /** Pointer to the fil_space_t instance that the page belongs to. */
    fil_space_t *m_space{};

    /** Version number of the page, used to check for stale pages. This
    value is "inherited" from m_space->m_version when we init a page. */
    uint32_t m_version{};
  };

Rearranged the buf_page_t data members to get rid of holes and changed the type of several enums used in buf_block_t and buf_page_t to uint8_t. This change was done to preserve the size of these two core data structures:

1. sizeof(buf_block_t) = 368 bytes
2. sizeof(buf_page_t) = 128 bytes

A new error code, DB_PAGE_IS_STALE, signals that a page is stale; it is returned by Fil_shard::do_io() when there is an attempt to write a stale page to the persistent store.

Stale page detection is done in Buf_fetch_normal::get() and Buf_fetch_other::get(). A page is stale if:

  if (buf_page_t::m_version == fil_space_t::m_version) {
    return false;
  }
  ut_a(fil_space_t::m_version > buf_page_t::m_version);
  return true;

Ramifications for flushing: when scanning the LRU and FLUSH lists for candidate pages to write to disk, we cannot free directly from the LRU list because it is tricky to determine the precise state of a page in the flushing pipeline. We rely on Fil_shard::do_io() to do the check and return DB_PAGE_IS_STALE to the caller on such writes. The caller of Fil_shard::do_io(), which is the doublewrite buffer code, handles the freeing of the page for this use case. When scanning the FLUSH list we can remove the pages during the candidate-page selection phase itself.
Ramifications for non-redo-logged tablespaces, such as the temporary tablespaces: we create the new tablespace using direct calls to low-level buffer pool functions. These pages must explicitly inherit the latest version number from fil_space_t::m_version, otherwise their contents will be discarded on write. There is no change in latching rules and no change to page state.

New function that frees a stale page:

  /** Free a stale page.
  @param[in,out] buf_pool   Buffer pool the page belongs to.
  @param[in,out] hash_lock  Hash lock covering the fetch from the hash table.
  @param[in,out] bpage      Page to free. */
  static void buf_free_stale_page(buf_pool_t *buf_pool, rw_lock_t *hash_lock,
                                  buf_page_t *bpage) noexcept {
    ut_ad(bpage->is_stale());

    /* The hash lock is lower in latch order than the LRU list mutex; we have
    to release it in order to acquire the LRU mutex. To prevent other threads
    from freeing the stale block while we do so, we increase the fix count:
    no other thread can free the page after we release the hash lock. */
    buf_block_fix(bpage);
    rw_lock_s_unlock(hash_lock);

    auto *block_mutex = buf_page_get_mutex(bpage);

    mutex_enter(&buf_pool->LRU_list_mutex);

    /* Prepare to free; we own the LRU mutex. */
    buf_block_unfix(bpage);

    mutex_enter(block_mutex);

    /* The page can't be freed because we acquired the LRU mutex before the
    page unfix. */
    const auto io_type = buf_page_get_io_fix_unlocked(bpage);

    if (io_type != BUF_IO_WRITE) {
      if (bpage->is_dirty()) {
        ut_a(io_type == BUF_IO_NONE);
        buf_flush_remove(bpage);
      }

      auto success = buf_LRU_free_page(bpage, true);
      ut_a(success);
    } else {
      mutex_exit(block_mutex);
      mutex_exit(&buf_pool->LRU_list_mutex);
    }
  }
Copyright (c) 2000, 2024, Oracle and/or its affiliates. All rights reserved.