WL#14100: InnoDB: Faster truncate/drop table space
Affects: Server-8.0
Status: Complete
When a tablespace is truncated or dropped, InnoDB scans the LRU and FLUSH (dirty page) lists and moves every page belonging to the truncated/dropped tablespace from the LRU/FLUSH list to the buffer pool FREE list. With large buffer pools (where "large" can be anything > 32G), this full list scan can slow user threads to a crawl. It gets worse: since InnoDB doesn't know how many pages from a tablespace are in the buffer pool, it will scan the full list even if the table is otherwise empty or has only one page in the buffer pool (or none).

This problem is exacerbated by the use of a pool of temporary tablespaces; on connection disconnect it can cause long stalls. Since the truncate of temporary tablespaces reuses the same tablespace ID, we cannot use the DROP/CREATE trick available to other tablespaces (see BUG#98869).

To solve this problem we generalize the solution from WL#11819 so that it can be applied to all tablespaces, not just UNDO. On tablespace drop or truncation we mark the tablespace as deleted, delete its file, and report back to the user that the operation is finished. Afterwards, we lazily free all pages in the buffer pool that reference this tablespace, as we encounter them one by one. There are several blogs out there describing the pain and the general problem.

Note: This doesn't fix the problem of physically unlinking a large file taking a very long time. That requires a separate fix and will need to be coordinated with the Runtime team.
NF-1: Should fix the stall in user threads after dropping a large tablespace with a > 32G buffer pool.
NF-2: Should fix the stall when truncating a temporary tablespace.
NF-3: Should fix the stall in user threads when AHI is enabled and the user drops a tablespace that has many pages referenced from the AHI.
NF-4: sizeof(buf_block_t) and sizeof(buf_page_t) should not increase.
Scanning large LRU and FLUSH lists to remove pages from dropped/truncated tablespaces is not necessary if we can remove the pages lazily and get rid of the AHI entries lazily too. The idea is to increment a version number in the tablespace's in-memory data structure (fil_space_t) when the tablespace is dropped or truncated; this is referred to as DVER (Drop/Truncate version number) from now on. For a page write to make it to the persistent store, the page must have a version number >= DVER. Pages with a version < DVER will be freed.

Handling reads is trickier. When a caller tries to fetch a page from the buffer pool we need to detect whether the page is stale. If the page is stale then we free the page and force a disk read. This stale-page check applies to all reads, so it must be very cheap; otherwise it will cost us on each page fetch from the buffer pool. To reduce the stale-page check overhead we store a pointer to the fil_space_t instance in buf_page_t.

The pointer to the fil_space_t instance embedded in buf_page_t is reference counted: each time we set buf_page_t::m_space we increment a reference count in fil_space_t. This is required for temporary and undo tablespaces but not for user tablespaces. For user tablespaces we do a drop/create, which changes the tablespace ID, so there can be no reads of dropped/truncated user tablespaces. Undo and temporary tablespace IDs come from a smaller pool and can in theory be reused; therefore we must ensure that no page from a dropped/truncated undo or temporary tablespace remains in the buffer pool. Additionally, temporary tablespaces reuse the tablespace ID on truncate (which is very frequent), so we must ensure that any stale pages are lazily purged from the buffer pool before the old instance's memory can be freed.

We also have to handle the impact on flushing when dealing with stale pages. This is explained in the low level design.
Additions to fil_space_t:

  struct fil_space_t {
    ...
    /** Version number of the instance, not persistent. Every time we
    truncate or delete we bump up the version number. */
    lsn_t m_version{};

    /** Reference count of how many pages point to this instance. An
    instance cannot be deleted if the reference count is greater than
    zero. The only exception is shutdown. */
    std::atomic_int m_n_ref_count{};
  };

Additions to buf_page_t:

  struct buf_page_t {
    ...
    /** Pointer to the fil_space_t instance that the page belongs to. */
    fil_space_t *m_space{};

    /** Version number of the page, used to check for stale pages. This
    value is "inherited" from m_space->m_version when we init a page. */
    uint32_t m_version{};
  };

Rearranged the buf_page_t data members to get rid of holes and changed the type of several enums used in buf_block_t and buf_page_t to uint8_t. This change was done to preserve the size of these two core data structures:

1. sizeof(buf_block_t) = 368 bytes
2. sizeof(buf_page_t) = 128 bytes

A new error code, DB_PAGE_IS_STALE, signals that a page is stale; it is returned by Fil_shard::do_io() when there is an attempt to write a stale page to the persistent store.

Stale page detection is done in Buf_fetch_normal::get() and Buf_fetch_other::get(). A page is stale if:

  if (buf_page_t::m_version == fil_space_t::m_version) {
    return false;
  }
  ut_a(fil_space_t::m_version > buf_page_t::m_version);
  return true;

Ramifications for flushing: when scanning the LRU and FLUSH lists for candidate pages to write to disk, we cannot free directly from the LRU list because it is tricky to determine the precise state of a page in the flushing pipeline. We rely on Fil_shard::do_io() to do the check and return DB_PAGE_IS_STALE to the caller on such writes. The caller of Fil_shard::do_io(), which is the doublewrite buffer code, handles the freeing of the page for this use case. When scanning the FLUSH list we can remove the pages during the candidate-page selection phase itself.
Ramifications for non-redo-logged tablespaces, such as the temporary tablespaces: we create the new tablespace using direct calls to low-level buffer pool functions. These pages must explicitly inherit the latest version number from fil_space_t::m_version, otherwise their contents will be discarded on write. There is no change in latching rules and no change to page state.

New function that frees a stale page:

  /** Free a stale page.
  @param[in,out] buf_pool   Buffer pool the page belongs to.
  @param[in,out] hash_lock  Hash lock covering the fetch from the hash table.
  @param[in,out] bpage      Page to free. */
  static void buf_free_stale_page(buf_pool_t *buf_pool, rw_lock_t *hash_lock,
                                  buf_page_t *bpage) noexcept {
    ut_ad(bpage->is_stale());

    /* The hash lock is lower in latch order than the LRU list mutex; we have
    to release it in order to acquire the LRU mutex. To prevent other threads
    from freeing the stale block while we do so, we increase the fix count:
    no other thread can free the page after we release the hash lock. */
    buf_block_fix(bpage);
    rw_lock_s_unlock(hash_lock);

    auto *block_mutex = buf_page_get_mutex(bpage);

    mutex_enter(&buf_pool->LRU_list_mutex);

    /* Prepare to free; we own the LRU mutex. */
    buf_block_unfix(bpage);

    mutex_enter(block_mutex);

    /* The page can't be freed because we acquired the LRU mutex before the
    page unfix. */
    const auto io_type = buf_page_get_io_fix_unlocked(bpage);

    if (io_type != BUF_IO_WRITE) {
      if (bpage->is_dirty()) {
        ut_a(io_type == BUF_IO_NONE);
        buf_flush_remove(bpage);
      }

      auto success = buf_LRU_free_page(bpage, true);
      ut_a(success);
    } else {
      mutex_exit(block_mutex);
      mutex_exit(&buf_pool->LRU_list_mutex);
    }
  }
Copyright (c) 2000, 2024, Oracle and/or its affiliates. All rights reserved.