WL#5655: InnoDB: Separate file to ensure atomic writes
Affects: Server-8.0
Status: Complete
Currently, the InnoDB doublewrite buffer lives in the system tablespace. Splitting the doublewrite buffer into a separate file would benefit users of solid-state storage (SSD); see the notes in BUG#56283 for details. In addition, contention on the doublewrite mutex is currently a bottleneck for IO throughput and increases write latency.
F1  - More flexibility on placement of the doublewrite buffer pages
NF1 - Reduce write latency and increase throughput
NF2 - When the doublewrite buffer is switched on, the buffer_flush_avg_time
      should not increase by more than 20%
NF3 - Remove the doublewrite mutex
Current design:
===============
* The dblwr buffer is used as a guard against torn pages - to ensure atomic
  writes.
* 128 pages in the system tablespace are allocated at a fixed location for
  dblwr.
* 120 pages are used for batch flushes and 8 pages are used for single page
  flushes.
* Batch flush (flush_list or LRU list flush):
  * The calling thread is almost always the page_cleaner thread (apart from
    sync checkpoint flushing).
  * Pages are copied to the dblwr buffer in memory.
  * The dblwr buffer is written to the system tablespace as one big
    synchronous IO.
  * The system tablespace is flushed to disk, if O_DIRECT_NO_FSYNC is not set.
  * Pages are written to the datafiles.
  * The calling thread waits till all IO requests are completed.
  * The calling thread flushes all datafiles to disk, if O_DIRECT_NO_FSYNC is
    not set.
  * The calling thread declares the dblwr buffer reusable.
* Single page flush:
  * Always happens in a user thread and is a last resort when we are unable
    to find a replaceable victim in the LRU.
  * A free slot is reserved in the 8 dedicated pages; if no slot is found,
    wait for IO to complete and a slot to become available.
  * In the case of uncompressed pages the page is written directly to the
    dblwr area on disk (no in-memory copy).
  * The rest of the steps are the same as for batch flushing.

Proposed design:
================
* Create a single doublewrite file; allow the user to create more.
* Split the file into segments:
  * Flush list segments
  * LRU flush segments
  * Single page segments (these only exist in LRU flush doublewrite files)
* Each doublewrite segment within the file will be of a fixed size.
* Create a request queue per buffer pool instance.
* Write request handling (a sketch follows the Recovery section below):
  1. Select a request queue (depending on the flush type).
  2. The requestor thread posts the IO request to the request queue.
  3. If there are no more slots available in the request queue, flush the
     queue to the doublewrite file and go to #2.
* Sync IO requests:
  * The requestor thread reserves a slot from the single page flush segment.
  * The requestor thread then writes the page to the data file.
* When writing the data to the doublewrite file segment, copy the pages to a
  common buffer and write that single buffer to disk. This minimizes write
  system calls.
* We will reserve space in the system tablespace but not use it. This is to
  maintain backward compatibility in file layouts.

Proposed changes:
=================
The three distinct areas to cover are:
* Upgrade/downgrade
* Recovery
* New configuration

Upgrade/Downgrade:
==================
* No change to the existing system tablespace format.
* A clean shutdown is required, so that we don't have to handle the case of
  recovery from the old configuration.

Recovery:
=========
* Recovery will work as it does now, with one difference: instead of reading
  from the doublewrite buffer in the system tablespace we will read from the
  doublewrite file.
* Create the doublewrite file on startup based on the new configuration
  parameters.
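A minimal sketch of the write request handling steps above, assuming the
mpmc_bq queue proposed later in this worklog; select_queue() and
flush_to_dblwr_file() are hypothetical helpers used for illustration only:

  /* Post a batch flush IO request for a page. */
  void post_write_request(buf_page_t *bpage, buf_flush_t flush_type) {
    /* 1. Select a request queue depending on the flush type and the
    buffer pool instance that owns the page. */
    mpmc_bq<buf_page_t *> &queue = select_queue(bpage, flush_type);

    /* 2. Post the IO request to the request queue. */
    while (!queue.enqueue(bpage)) {
      /* 3. No free slots: flush the queue contents to the doublewrite
      file, then go back to step 2 and retry. */
      flush_to_dblwr_file(queue);
    }
  }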
New configuration:
==================
Introduce the following new configuration parameters. The user experience
should be better with the default settings; changing the defaults should be
rare.

--innodb-doublewrite=on|off
--innodb-doublewrite-batch-size
--innodb-doublewrite-dir=/path/name
--innodb-doublewrite-files=Number of files to use (it will be at most the
  number of buffer pool instances x 2)
--innodb-doublewrite-pages=N_PAGES (rounded up to a power of 2)

The size of a flush list doublewrite file will be:

  SRV_PAGE_SIZE * doublewrite-pages bytes

The size of an LRU list doublewrite file will be:

  SRV_PAGE_SIZE * (doublewrite-pages + (512 / buffer-pool-instances)) bytes

where 512 is the total number of slots reserved for single page flushes. The
name of the file will be #ib_$srv_page_size.dblwr.

The defaults for the above configuration parameters will be:

--innodb-doublewrite-dir=innodb_data_home_dir
--innodb-doublewrite-files=0 (defaults to number of buffer pool instances x 2)
--innodb-doublewrite-pages=0 (defaults to innodb-write-io-threads)
--innodb-doublewrite-batch-size=0 (defaults to innodb-doublewrite-pages)

The minimum values will be:

--innodb-doublewrite-pages=8
--innodb-doublewrite-files=1
--innodb-doublewrite-batch-size=0 (defaults to innodb-doublewrite-pages)

The maximum values will be:

--innodb-doublewrite-pages=1024
--innodb-doublewrite-files=128 (capped at the number of buffer pool
  instances x 2)
--innodb-doublewrite-batch-size=the maximum value of innodb-doublewrite-pages

The above configuration parameters will not be dynamic, except for
innodb-doublewrite-batch-size, which is a global variable.

If the --innodb-doublewrite-dir value has a prefix of '.', '#' or '/' then we
will not prefix a '#' to the value; otherwise we will prefix a '#' to avoid
conflicts with potential schema names. If the user circumvents this rule by
using, say, "./dblwr", then we will assume that is the setting the user wants
to force for --innodb-doublewrite-dir.
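As a worked example of the sizing formulas above (values chosen purely for
illustration): with SRV_PAGE_SIZE = 16384, doublewrite-pages = 8 and 8 buffer
pool instances, a flush list doublewrite file is 16384 * 8 = 131072 bytes
(128 KiB), and an LRU list doublewrite file is
16384 * (8 + 512 / 8) = 16384 * 72 = 1179648 bytes (1152 KiB).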
The core data structure required to make this work is a lock free bounded
queue (mpmc_bq: a multi-producer, multi-consumer bounded queue). The proposed
interface is:

/* The type T must be trivially copyable. */
template <typename T>
class mpmc_bq {
 public:
  /** Constructor
  @param[in] n_elems Capacity of the queue, must be a power of 2 */
  explicit mpmc_bq(size_t n_elems);

  /** Destructor */
  ~mpmc_bq();

  /** Enqueue an element.
  @param[in] data Element to queue
  @return true if the element was queued successfully */
  bool enqueue(T const &data);

  /** Dequeue an element
  @param[out] data Element to fetch from the queue
  @return true if the element was dequeued successfully */
  bool dequeue(T &data);

  /** @return the capacity of the queue */
  size_t capacity() const;
};

Note: There is no requirement for this queue to ensure strict FIFO.
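A minimal, self-contained usage sketch of the proposed interface, using int
as the trivially copyable element type:

  mpmc_bq<int> queue(8); /* Capacity must be a power of 2 */

  int value = 42;

  /* enqueue() returns false instead of blocking when the bounded
  queue is full; the caller decides how to react, e.g. drain it. */
  if (!queue.enqueue(value)) {
    /* Queue full. */
  }

  int out;

  /* Drain whatever is queued; strict FIFO ordering is not
  guaranteed, which is acceptable for doublewrite batching. */
  while (queue.dequeue(out)) {
    /* Process the dequeued element here. */
  }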
Introduce the dblwr, dblwr::v1 and dblwr::recv namespaces:

namespace dblwr {

/** Doublewrite IO configuration. */
struct Config {
  /** Constructor
  @param[in] file      File handle
  @param[in] name      File name
  @param[in] sys_space true if in the system tablespace */
  Config(os_file_t file, const std::string &name, bool sys_space)
      : m_file(file), m_name(name), m_size(-1), m_sys_space(sys_space) {
    m_fsync_required =
        srv_unix_file_flush_method != SRV_UNIX_O_DIRECT_NO_FSYNC;
  }

  /** Open file handle */
  os_file_t m_file;

  /** File name */
  std::string m_name;

  /** Size of the file in bytes */
  os_offset_t m_size;

  /** true if it is the system tablespace */
  bool m_sys_space;

  /** true if fsync is required */
  bool m_fsync_required;
};

/** Creates the doublewrite buffer persistent store.
@param[in] config Doublewrite configuration
@return true if successful, false if not. */
bool create(const Config &config) MY_ATTRIBUTE((warn_unused_result));

/** Reset the doublewrite file */
void reset_file();

/** Start the background thread and create the instance.
@param[in] dir_name Doublewrite files directory, or NULL
@return DB_SUCCESS or error code */
dberr_t open(const char *dir_name) MY_ATTRIBUTE((warn_unused_result));

/** Shutdown the background thread and destroy the instance */
void close();

/** Writes a page to the doublewrite buffer on disk, syncs it, then
writes the page to the datafile.
@param[in] bpage Buffer block to write
@param[in] sync  true if sync IO requested
@return DB_SUCCESS or error code */
dberr_t write(buf_page_t *bpage, bool sync);

/** Toggle the doublewrite buffer
@param[in] value Current value */
void toggle(bool value);

/** Notify when a page write completes */
void io_complete(void *ptr);

}  // namespace dblwr

namespace dblwr {
namespace v1 {

/** Determines if a page number is located inside the doublewrite buffer.
@param[in] page_no Page number to check
@return true if the page is inside the two extents */
bool is_inside(ulint page_no) MY_ATTRIBUTE((warn_unused_result));

}  // namespace v1
}  // namespace dblwr

namespace dblwr {
namespace recv {

class pages_t;

/** Load the doublewrite buffer pages; if this is an upgrade from < 4.1
then update the format of the pages in the dblwr extents.
@param[in]     config Doublewrite configuration
@param[in,out] pages  For storing the doublewrite pages
@return DB_SUCCESS or error code */
dberr_t load_pages(const Config &config, pages_t *pages)
    MY_ATTRIBUTE((warn_unused_result));

/** Recover doublewrite buffer pages
@param[in] pages Pages from the doublewrite buffer */
void recover(const pages_t *pages);

/** Find a doublewrite copy of a page.
@param[in] pages    Pages read from the doublewrite buffer
@param[in] space_id Tablespace identifier
@param[in] page_no  Page number
@return page frame
@retval NULL if no page was found */
const byte *find_page(const pages_t *pages, ulint space_id, ulint page_no);

/** Create the recovery dblwr data structures
@param[out] pages Pointer to the newly created instance */
void create(pages_t *&pages);

/** Free the recovery dblwr data structures
@param[out] pages Instance to free */
void destroy(pages_t *&pages);

}  // namespace recv
}  // namespace dblwr

The handling of single page flushes differs from the other (batch) write
requests. Single page flushes are usually sync flushes: first select a slot
from the LRU doublewrite file, write the page to that slot, sync it and then
do the IO completion. All of this is done by the user thread that initiated
the single page flush, in order to free up a page for reuse.

Batch flushes are written to a latch free queue. When the queue fills up, the
queue contents are copied to a buffer in memory and then that buffer is
written using a single sync IO call to the doublewrite file. The file is then
fsync(ed), if required, and the pages are submitted for writing to the data
files. Once all the pages in the batch are written to the data files, the
tablespaces are synced to disk.
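A sketch of the batch drain path just described, assuming the mpmc_bq queue
above; MAX_BATCH, page_frame(), write_to_dblwr_file(),
fsync_dblwr_file_if_required() and submit_data_file_write() are hypothetical
names used for illustration only:

  /* Flush a full request queue: one buffered write to the doublewrite
  file, then submit the page writes to the data files. */
  void flush_batch(mpmc_bq<buf_page_t *> &queue, byte *batch_buf) {
    buf_page_t *pages[MAX_BATCH];
    ulint n_pages = 0;
    buf_page_t *bpage;

    /* Copy the queued pages into one common buffer in memory. */
    while (n_pages < MAX_BATCH && queue.dequeue(bpage)) {
      memcpy(batch_buf + n_pages * SRV_PAGE_SIZE, page_frame(bpage),
             SRV_PAGE_SIZE);
      pages[n_pages++] = bpage;
    }

    /* One sync IO call writes the whole batch to the doublewrite
    file; this minimizes write system calls. */
    write_to_dblwr_file(batch_buf, n_pages);

    /* fsync the doublewrite file, if O_DIRECT_NO_FSYNC is not set. */
    fsync_dblwr_file_if_required();

    /* Submit the page writes to the data files; once they all
    complete, the tablespaces are synced to disk. */
    for (ulint i = 0; i < n_pages; ++i) {
      submit_data_file_write(pages[i]);
    }
  }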