WL#5655: InnoDB: Separate file to ensure atomic writes

Affects: Server-8.0 — Status: Complete


Currently, the InnoDB doublewrite buffer lives in the system tablespace.
Splitting the doublewrite buffer into a separate file would benefit users of
solid-state storage (SSD). See the notes in BUG#56283 for some details.

Contention on the doublewrite mutex is currently a bottleneck for our IO
throughput and increases write latency.


F1 - More flexibility on placement of the doublewrite buffer pages

NF1 - Reduce write latency and increase throughput

NF2 - When the doublewrite buffer is switched on, the
      buffer_flush_avg_time should not increase by more than 20%

NF3 - Remove the doublewrite mutex

Current design:
===============
* dblwr buffer is used as a guard against torn pages - to ensure atomic writes
* 128 pages in system tablespace are allocated at fixed location for dblwr
* 120 pages are used for batch flushes and 8 pages are used for single page 
flushes
* batch flush (flush_list or LRU list flush)
  * Calling thread is almost always page_cleaner thread (apart from sync 
checkpoint flushing)
  * pages are copied to dblwr buffer in memory
  * dblwr buffer is written to system tablespace as one big synchronous IO
  * system tablespace is flushed to disk, if O_DIRECT_NO_FSYNC is not set
  * pages are written to datafiles
  * calling thread waits till all IO requests are completed
  * calling thread flushes all datafiles to disk, if O_DIRECT_NO_FSYNC is not set
  * calling thread declares dblwr buffer reusable
* Single page flush
  * Always happens in a user thread and is a last resort when we are unable to 
find a replaceable victim in the LRU
  * A free slot is reserved in the 8 dedicated pages; if no slot is found, wait 
for IO to complete and a slot to become available
  * For uncompressed pages, the page is written directly to the dblwr on disk 
(no in-memory copy)
  * The remaining steps are the same as for batch flushing

Proposed design:
================
* Create a single double write file and allow the user to create more.
* Split the file into segments
  * Flush list segments
  * LRU flush segments
  * Single page segments (these only exist in LRU flush double write files).
* Each double write segment within the file will be of a fixed size
* Create a request queue per buffer pool instance
* Write request handling
  1. Select a request queue (depending on the flush type)
  2. Requestor thread will post the IO request to the request queue
  3. If no more slots are available in the request queue, flush the queue to 
the double write file and go to step 2
* Sync IO requests
  * Requestor thread will reserve a slot from the single page flush segment
  * The requestor thread will then write the page to the data file
* When writing the data to the double write file segment copy the pages to a 
common buffer and write that single buffer to disk. This minimizes write system 
calls.
* We will reserve space in the system tablespace but not use it. This is to 
maintain backward compatibility in file layouts.
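
The write request handling steps above can be sketched as follows. This is a
minimal single-threaded illustration, not the proposed implementation: the
names PageRequest, RequestQueue and BufferPoolQueues are hypothetical, a plain
vector stands in for the lock free queue, the double write file is simulated
by counters, and keeping one queue per flush type inside each buffer pool
instance is an assumption about how "select a request queue" would work.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical request type; the worklog does not define it.
struct PageRequest {
  std::uint32_t space_id;
  std::uint32_t page_no;
};

enum class FlushType { FLUSH_LIST, LRU };

// Bounded request queue. A plain vector stands in for the lock free queue,
// so this sketch is single-threaded only.
class RequestQueue {
 public:
  explicit RequestQueue(std::size_t capacity) : m_capacity(capacity) {
    assert(capacity > 0);
    m_pending.reserve(capacity);
  }

  // Step 2: post a request. Returns false when no slot is free.
  bool post(const PageRequest &req) {
    if (m_pending.size() == m_capacity) return false;
    m_pending.push_back(req);
    return true;
  }

  // Step 3: write every pending page to the double write file in one batch.
  // The file is simulated by two counters.
  void flush_to_dblwr_file() {
    if (m_pending.empty()) return;
    ++m_batches_written;
    m_pages_written += m_pending.size();
    m_pending.clear();
  }

  std::size_t pending() const { return m_pending.size(); }
  std::size_t batches_written() const { return m_batches_written; }
  std::size_t pages_written() const { return m_pages_written; }

 private:
  std::size_t m_capacity;
  std::vector<PageRequest> m_pending;
  std::size_t m_batches_written = 0;
  std::size_t m_pages_written = 0;
};

// Queues of one buffer pool instance, one queue per flush type (assumption).
struct BufferPoolQueues {
  explicit BufferPoolQueues(std::size_t capacity)
      : flush_list(capacity), lru(capacity) {}

  // Step 1: select the queue that matches the flush type.
  RequestQueue &select(FlushType type) {
    return type == FlushType::FLUSH_LIST ? flush_list : lru;
  }

  // Steps 2 and 3: post the request, flushing the queue whenever it is full.
  void write(FlushType type, const PageRequest &req) {
    RequestQueue &queue = select(type);
    while (!queue.post(req)) {
      queue.flush_to_dblwr_file();
    }
  }

  RequestQueue flush_list;
  RequestQueue lru;
};
```

With a capacity of 4, posting nine flush list requests triggers two batch
writes of four pages each and leaves one request pending.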

Proposed Changes:
=================

Three distinct areas to cover are:

* Upgrade/downgrade
* Recovery
* New configuration

Upgrade/Downgrade:
==================

* No change to existing system tablespace format

* Clean shutdown required, so that we don't have to handle the case of recovery
  from the old configuration

Recovery:
=========
* Recovery will work as it does now, with one difference: instead of reading from 
the double write buffer in the system tablespace, we will read from the double 
write file.

* Create the double write file on start based on the new configuration parameters

New configuration:
==================
Introduce the following configuration parameters. The default settings should 
give the best user experience; changing the defaults should be rare.

  --innodb-doublewrite=on|off
  --innodb-doublewrite-batch-size
  --innodb-doublewrite-dir=/path/name
  --innodb-doublewrite-files=N_FILES (number of files to use, at most the number 
of buffer pool instances * 2)
  --innodb-doublewrite-pages=N_PAGES (rounded up to a power of 2)
 

Size of a flush list double write file will be:
   * SRV_PAGE_SIZE * doublewrite-pages bytes.

Size of an LRU list double write file will be:
   * SRV_PAGE_SIZE * (doublewrite-pages + (512 / buffer-pool-instances))
     Where 512 is the total number of slots reserved for single page flushes.
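
As a worked example of the two size formulas above, the following sketch
computes both file sizes. The function names are hypothetical, and the use of
integer division for 512 / buffer-pool-instances is an assumption, since the
text does not say how a non-divisible instance count is handled.

```cpp
#include <cassert>
#include <cstdint>

// Total number of slots reserved for single page flushes (from the text).
constexpr std::uint64_t SINGLE_PAGE_SLOTS = 512;

// Size in bytes of a flush list double write file.
std::uint64_t flush_list_file_size(std::uint64_t page_size,
                                   std::uint64_t dblwr_pages) {
  return page_size * dblwr_pages;
}

// Size in bytes of an LRU list double write file.
// Assumption: the per-instance share of single page slots is truncated.
std::uint64_t lru_file_size(std::uint64_t page_size, std::uint64_t dblwr_pages,
                            std::uint64_t buf_pool_instances) {
  assert(buf_pool_instances > 0);
  return page_size * (dblwr_pages + SINGLE_PAGE_SLOTS / buf_pool_instances);
}
```

For a 16384 byte page size, 128 doublewrite pages and 8 buffer pool
instances, this gives 2097152 bytes for a flush list file and
16384 * (128 + 64) = 3145728 bytes for an LRU list file.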

Name of the file will be #ib_$srv_page_size.dblwr.

The defaults for the above configuration parameters will be:

   --innodb-doublewrite-dir=innodb_data_home_dir
   --innodb-doublewrite-files=0 (default to number of buffer pool instances x 2)
   --innodb-doublewrite-pages=0 (default to innodb-write-io-threads)
   --innodb-doublewrite-batch-size=0 (default to innodb-doublewrite-pages)

The minimum value will be:
   --innodb-doublewrite-pages=8
   --innodb-doublewrite-files=1
   --innodb-doublewrite-batch-size=0 (default to innodb-doublewrite-pages)

The maximum value will be:
   --innodb-doublewrite-pages=1024
   --innodb-doublewrite-files=128 (capped at the number of buffer pool instances x 
2)
   --innodb-doublewrite-batch-size=(max value of innodb-doublewrite-pages)

   
The above configuration parameters will not be dynamic, except 
innodb-doublewrite-batch-size, which is a global variable.
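
The defaults, minimums and maximums above could be resolved roughly as in the
sketch below. The names DblwrSettings and resolve_settings are hypothetical,
and the order of operations (apply the default for 0, clamp to the allowed
range, then round the page count up to a power of 2) is an assumption; the
text lists the limits but not how they are combined.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

struct DblwrSettings {
  std::uint32_t files;
  std::uint32_t pages;
  std::uint32_t batch_size;
};

constexpr std::uint32_t MIN_PAGES = 8;
constexpr std::uint32_t MAX_PAGES = 1024;
constexpr std::uint32_t MIN_FILES = 1;
constexpr std::uint32_t MAX_FILES = 128;

// Smallest power of 2 that is >= n (n is at most MAX_PAGES here).
std::uint32_t round_up_pow2(std::uint32_t n) {
  std::uint32_t p = 1;
  while (p < n) p <<= 1;
  return p;
}

// A value of 0 means "use the default" for all three parameters.
DblwrSettings resolve_settings(std::uint32_t files, std::uint32_t pages,
                               std::uint32_t batch_size,
                               std::uint32_t buf_pool_instances,
                               std::uint32_t write_io_threads) {
  assert(buf_pool_instances > 0);
  DblwrSettings s;

  // Files: default and cap are both tied to the buffer pool instance count.
  const std::uint32_t files_cap =
      std::min<std::uint32_t>(MAX_FILES, buf_pool_instances * 2);
  if (files == 0) files = buf_pool_instances * 2;
  s.files = std::min<std::uint32_t>(std::max<std::uint32_t>(files, MIN_FILES),
                                    files_cap);

  // Pages: default to innodb-write-io-threads, clamp, round up to 2^k.
  if (pages == 0) pages = write_io_threads;
  pages = std::min<std::uint32_t>(std::max<std::uint32_t>(pages, MIN_PAGES),
                                  MAX_PAGES);
  s.pages = round_up_pow2(pages);

  // Batch size: default to the resolved page count.
  if (batch_size == 0) batch_size = s.pages;
  s.batch_size = std::min<std::uint32_t>(batch_size, MAX_PAGES);
  return s;
}
```

Under these assumptions, all-zero settings with 8 buffer pool instances and 4
write IO threads resolve to 16 files, 8 pages and a batch size of 8.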

If the --innodb-doublewrite-dir value has a prefix of '.', '#' or '/' then we will 
not prefix a '#' to the value. Otherwise we will prefix a '#' to avoid clashes 
with potential schema names.

If the user circumvents this rule by using, say, "./dblwr", then we will assume 
that this is the value the user wants for --innodb-doublewrite-dir.
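
The prefix rule can be illustrated with a small helper. The function name
dblwr_dir_name is hypothetical, and returning an empty value unchanged (so
that the default directory applies) is an assumption.

```cpp
#include <string>

// Apply the '#' prefix rule to a --innodb-doublewrite-dir value.
std::string dblwr_dir_name(const std::string &value) {
  // Assumption: an empty value means "use the default directory".
  if (value.empty()) return value;

  const char first = value.front();
  if (first == '.' || first == '#' || first == '/') {
    return value;  // The user chose the location explicitly, keep it as is.
  }
  // A bare name could clash with a schema name, so prefix it with '#'.
  return "#" + value;
}
```

A bare "dblwr" becomes "#dblwr", while "./dblwr", "/var/dblwr" and "#dblwr"
are left untouched.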

The core data structure required to make this work is a lock free bounded queue. 
The element type T must be trivially copyable. The proposed interface is:

template <typename T>
class mpmc_bq {
public:

        /** Constructor
        @param[in] n_elems Capacity of the queue, must be a power of 2 */
        explicit mpmc_bq(size_t n_elems);

        /** Destructor */
        ~mpmc_bq();

        /** Enqueue an element.
        @param[in] data    Element to queue
        @return true if the element was queued successfully */
        bool enqueue(T const& data);

        /** Dequeue an element
        @param[out] data   Element to fetch from the queue
        @return true if the element was dequeued successfully */
        bool dequeue(T& data);

        /** @return the capacity of the queue */
        size_t capacity() const;
};

Note: There is no requirement for this queue to ensure strict FIFO.
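
One possible way to implement this interface is a sequence-numbered ring
buffer in the style of Dmitry Vyukov's bounded MPMC queue. The worklog does
not say which algorithm is intended, so the sketch below is an assumption. It
additionally requires T to be default constructible and the capacity to be at
least 2, and it relies on the smart pointer instead of an explicit destructor.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <memory>
#include <type_traits>

template <typename T>
class mpmc_bq {
  static_assert(std::is_trivially_copyable<T>::value,
                "T must be trivially copyable");

 public:
  explicit mpmc_bq(std::size_t n_elems)
      : m_cells(new Cell[n_elems]), m_capacity(n_elems),
        m_enqueue_pos(0), m_dequeue_pos(0) {
    // Capacity must be a power of 2 so that "& (capacity - 1)" wraps.
    assert(n_elems >= 2 && (n_elems & (n_elems - 1)) == 0);
    for (std::size_t i = 0; i < n_elems; ++i) {
      m_cells[i].m_seq.store(i, std::memory_order_relaxed);
    }
  }

  bool enqueue(T const &data) {
    std::size_t pos = m_enqueue_pos.load(std::memory_order_relaxed);
    for (;;) {
      Cell &cell = m_cells[pos & (m_capacity - 1)];
      const std::size_t seq = cell.m_seq.load(std::memory_order_acquire);
      const std::ptrdiff_t diff = static_cast<std::ptrdiff_t>(seq) -
                                  static_cast<std::ptrdiff_t>(pos);
      if (diff == 0) {
        // Cell is free for this position, try to claim it.
        if (m_enqueue_pos.compare_exchange_weak(pos, pos + 1,
                                                std::memory_order_relaxed)) {
          cell.m_data = data;
          cell.m_seq.store(pos + 1, std::memory_order_release);
          return true;
        }
      } else if (diff < 0) {
        return false;  // Queue is full.
      } else {
        pos = m_enqueue_pos.load(std::memory_order_relaxed);
      }
    }
  }

  bool dequeue(T &data) {
    std::size_t pos = m_dequeue_pos.load(std::memory_order_relaxed);
    for (;;) {
      Cell &cell = m_cells[pos & (m_capacity - 1)];
      const std::size_t seq = cell.m_seq.load(std::memory_order_acquire);
      const std::ptrdiff_t diff = static_cast<std::ptrdiff_t>(seq) -
                                  static_cast<std::ptrdiff_t>(pos + 1);
      if (diff == 0) {
        // Cell holds data for this position, try to claim it.
        if (m_dequeue_pos.compare_exchange_weak(pos, pos + 1,
                                                std::memory_order_relaxed)) {
          data = cell.m_data;
          // Mark the cell as free for the producer one lap ahead.
          cell.m_seq.store(pos + m_capacity, std::memory_order_release);
          return true;
        }
      } else if (diff < 0) {
        return false;  // Queue is empty.
      } else {
        pos = m_dequeue_pos.load(std::memory_order_relaxed);
      }
    }
  }

  std::size_t capacity() const { return m_capacity; }

 private:
  struct Cell {
    std::atomic<std::size_t> m_seq;
    T m_data;
  };

  std::unique_ptr<Cell[]> m_cells;
  const std::size_t m_capacity;
  std::atomic<std::size_t> m_enqueue_pos;
  std::atomic<std::size_t> m_dequeue_pos;
};
```

Both operations fail fast instead of blocking, which matches the write
request handling: a failed enqueue is the signal to flush the queue to the
double write file and retry.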

Introduce dblwr, dblwr::v1 and dblwr::recv namespaces:

namespace dblwr {

        /** Doublewrite IO configuration. */
        struct Config {
                /** Constructor
                @param[in]      file            File handle
                @param[in]      name            File name
                @param[in]      sys_space       true if in system space */
                Config(os_file_t file, const std::string& name, bool sys_space)
                        :
                        m_file(file),
                        m_name(name),
                        m_size(-1),
                        m_sys_space(sys_space)
                {
                        m_fsync_required =
                                srv_unix_file_flush_method
                                != SRV_UNIX_O_DIRECT_NO_FSYNC;
                }

                /** Open file handle */
                os_file_t       m_file;

                /** File name */
                std::string     m_name;

                /** Size of the file in bytes */
                os_offset_t     m_size;

                /** true if it is the system tablespace */
                bool            m_sys_space;
                
                /** true if fsync required */
                bool            m_fsync_required;
        };

        /** Creates the doublewrite buffer persistent store.
        @param[in]      config          Doublewrite configuration
        @return true if successful, false if not. */
        bool create(const Config& config)
                MY_ATTRIBUTE((warn_unused_result));

        /** Reset the double write file */
        void reset_file();

        /** Startup the background thread and create the instance.
        @param[in]      dir_name        Doublewrite files directory, or NULL
        @return DB_SUCCESS or error code */
        dberr_t open(const char* dir_name)
                MY_ATTRIBUTE((warn_unused_result));

        /** Shutdown the background thread and destroy the instance */
        void close();   
                
        /** Writes a page to the doublewrite buffer on disk, syncs it,
        then writes the page to the datafile.
        @param[in]      bpage           buffer block to write
        @param[in]      sync            true if sync IO requested
        @return DB_SUCCESS or error code */
        dberr_t write(
                buf_page_t*     bpage,
                bool            sync);

        /** Toggle the doublewrite buffer
        @param[in]      value           Current value */
        void toggle(bool value);

        /** Notify when a page write completes */
        void io_complete(void* ptr);

} // dblwr

namespace dblwr { namespace v1 {

        /** Determines if a page number is located inside the
        doublewrite buffer.
        @param[in]      page_no         Page number to check
        @return true if the page is inside the two extents */
        bool is_inside(ulint page_no)
                MY_ATTRIBUTE((warn_unused_result));

} } // dblwr::v1

namespace dblwr { namespace recv {

        class pages_t;

        /** Load the doublewrite buffer pages, if this is an upgrade from
        < 4.1 then update the format of the pages in the dblwr extents.
        @param[in]      config          Doublewrite configuration
        @param[in,out]  pages           For storing the doublewrite pages
        @return DB_SUCCESS or error code */
        dberr_t load_pages(
                const Config&   config,
                pages_t*        pages)
                MY_ATTRIBUTE((warn_unused_result));

        /** Recover double write buffer pages
        @param[in]      pages           Pages from the doublewrite buffer */
        void recover(const pages_t* pages);

        /** Find a doublewrite copy of a page.
        @param[in]      pages           Pages read from the doublewrite buffer
        @param[in]      space_id        tablespace identifier
        @param[in]      page_no         page number
        @return page frame
        @retval NULL if no page was found */
        const byte* find_page(
                const pages_t*  pages,
                ulint           space_id,
                ulint           page_no);

        /** Create the recovery dblwr data structures
        @param[out]     pages           Pointer to newly created instance */
        void create(pages_t*& pages);

        /** Free the recovery dblwr data structures
        @param[in,out]  pages           Instance to free */
        void destroy(pages_t*& pages);

} } // dblwr::recv
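
The worklog leaves pages_t opaque. Purely as an illustration of how
find_page() could behave, the sketch below keeps the loaded doublewrite
copies in a map keyed by (space_id, page_no). The container choice, the add()
method, the fixed-width integer types and the rule that a later copy of the
same page replaces an earlier one are all assumptions.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using byte = unsigned char;

// Hypothetical stand-in for dblwr::recv::pages_t.
class pages_t {
 public:
  // Remember a page copy read from the double write file.
  void add(std::uint32_t space_id, std::uint32_t page_no,
           const std::vector<byte> &frame) {
    m_pages[std::make_pair(space_id, page_no)] = frame;
  }

  // Return the stored frame, or nullptr if the page was not in the file.
  const byte *find(std::uint32_t space_id, std::uint32_t page_no) const {
    auto it = m_pages.find(std::make_pair(space_id, page_no));
    return it == m_pages.end() ? nullptr : it->second.data();
  }

 private:
  std::map<std::pair<std::uint32_t, std::uint32_t>, std::vector<byte>> m_pages;
};

// Mirrors the shape of find_page() above: NULL when no copy was found.
const byte *find_page(const pages_t *pages, std::uint32_t space_id,
                      std::uint32_t page_no) {
  assert(pages != nullptr);
  return pages->find(space_id, page_no);
}
```

During recovery, a non-NULL result would be the copy used to restore a torn
page in the data file.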


Single page flush handling differs from that of the other (batch) write 
requests. Single page flushes are usually sync flushes. First select a slot from 
the LRU double write file, write the page to that slot, sync it and then do the IO 
completion. All this is done by the user thread that initiated the single page 
flush in order to free up a page to use.
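
Slot selection for single page flushes could look like the sketch below. The
class SinglePageSegment and the use of one atomic flag per slot are
assumptions made for illustration; the worklog only says that a slot is
reserved and later released at IO completion.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <memory>

// Returned by reserve() when every slot is in use.
constexpr std::size_t NO_SLOT = static_cast<std::size_t>(-1);

// Hypothetical single page flush segment of an LRU double write file.
class SinglePageSegment {
 public:
  explicit SinglePageSegment(std::size_t n_slots)
      : m_n_slots(n_slots), m_in_use(new std::atomic<bool>[n_slots]) {
    for (std::size_t i = 0; i < n_slots; ++i) {
      m_in_use[i].store(false, std::memory_order_relaxed);
    }
  }

  // Claim the first free slot without taking a mutex.
  std::size_t reserve() {
    for (std::size_t i = 0; i < m_n_slots; ++i) {
      bool expected = false;
      if (m_in_use[i].compare_exchange_strong(expected, true,
                                              std::memory_order_acquire)) {
        return i;
      }
    }
    return NO_SLOT;  // Caller must wait for an IO completion and retry.
  }

  // Called at IO completion, once the page is safely written.
  void release(std::size_t slot) {
    assert(slot < m_n_slots);
    m_in_use[slot].store(false, std::memory_order_release);
  }

 private:
  std::size_t m_n_slots;
  std::unique_ptr<std::atomic<bool>[]> m_in_use;
};
```

The user thread would call reserve(), write and sync the page in the reserved
slot, and call release() as part of IO completion.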

Batch flushes are written to a latch free queue. When the queue fills up, the 
queue contents are copied to a buffer in memory and the buffer is then written 
to the double write file using a single sync IO call. The file is then fsynced 
(if required) and the pages are submitted for writing to the data files. Once 
all the pages in the batch have been written to the data files, the tablespaces 
are synced to disk.
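
The ordering of a batch flush can be sketched as follows. Everything here is
simulated in memory: the tiny PAGE_SIZE, the Page and BatchResult types and
the step names are hypothetical, and no real file IO is performed. The point
is only to show the copy into one contiguous buffer and the order of the
steps.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

constexpr std::size_t PAGE_SIZE = 16;  // Tiny page, for illustration only.

struct Page {
  unsigned char frame[PAGE_SIZE];
};

struct BatchResult {
  std::vector<unsigned char> dblwr_image;  // Bytes of the single dblwr write.
  std::vector<std::string> steps;          // Steps in the order they ran.
};

BatchResult flush_batch(const std::vector<Page> &batch, bool fsync_required) {
  BatchResult result;

  // 1. Copy all queued pages into one contiguous in-memory buffer.
  std::vector<unsigned char> buffer(batch.size() * PAGE_SIZE);
  for (std::size_t i = 0; i < batch.size(); ++i) {
    std::memcpy(buffer.data() + i * PAGE_SIZE, batch[i].frame, PAGE_SIZE);
  }
  result.steps.push_back("copy_to_buffer");

  // 2. Write the buffer to the double write file with one sync IO call.
  result.dblwr_image = buffer;
  result.steps.push_back("write_dblwr_file");

  // 3. Flush the double write file, unless the flush method makes it unneeded.
  if (fsync_required) {
    result.steps.push_back("fsync_dblwr_file");
  }

  // 4. Only now submit the pages for writing to the data files.
  result.steps.push_back("write_data_files");

  // 5. After all page writes complete, sync the tablespaces.
  result.steps.push_back("sync_tablespaces");
  return result;
}
```

The data file writes must not start before step 3 has finished, since the
doublewrite copy is what makes a torn data page recoverable.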