WL#7277: InnoDB: Bulk Load for Create Index

Status: Complete

Description
Requirements
High Level Architecture
Low Level Design

Currently we create an index by repeatedly inserting records. It's very slow
when we have a large collection of records. Bulk loading can be done much
more efficiently.

We sort the records using index key and then insert (exploiting sorted order of
the records). The basic idea of bulk load is we builds an index from bottom up
(also known as sorted index build).

Index entries first go into the leaf page, and when the leaf page is filled up
we add the corresponding search-key entry in to the parent non-leaf node and
start filling next leaf page. Filling records into leaf page is fast, and we
have much less splitting in the building process. Also this approach help in
maintaining good locality and page once pinned and unpinned can be flushed to
disk as it would not be pinned again for build phase.

Functional requirements:
F-1: InnoDB should provide bulk-load path for index build or rebuild which in
turn should acclerate build process (w.r.t to traditional insert at a time
approach).

F-2: Fill factor is supported for future growth of a index.

F-3: There could be differences in the stats (if compared with normal build) as
different algorithm is used for population of index. Though we don't expect
follow-up workload to take toll from this. (like follow-up query or random
insert workload).

Non-Functional requirements:
NF-1: Implicit requirements: No new SQL needed, work on all  platforms, do not 
break replication, backup, partitioning, FK, or any  other exiting features.

NF-2: No change in semantics expected.

Problem Statement:
------------------

Speed up index build by bulking load the data.


When we build or rebuild an index, we usually have three phases.

Phase-1: Generate Initial Runs
Scan cluster index, generate index entries and add it to sort buffer. When 
sort buffer is full, we sort the entries, and write/append the buffer to a
temporary intermediate file.

Phase-2: Merge Sort
Now we have one or more runs in the file, so we run merge sort to get complete
file sorted (in turn all the entries are now sorted).
Phase-1 and Phase-2 is a common external sort alogrithm.

Phase-3: Index Build
Once we have all index entries in sorted order, entries are inserted into B-tree
using normal insert apis. Obivious optimization here should have been using
sorted build. This WL addresses this requirement and enables sorted build path
instead of normal insert apis.

-------------------------------------------------------------------------------

Let's understand current implementation in InnoDB:

- STEP-1: Optimistic insert - Open a B-tree cursor to find the insert position,
and try  to insert the tuple in the page.
If insert fails (page can't accomodate new record as page has already reached
its capacity) try pessimistic insert.

- STEP-2: Pessimistic insert - Open a B-tree cursor and do necessary split &
merge to find a proper space for the tuple.

Current algorithm has following shortcomings:
  - For each entry, position to insert is searched which is costly process even
    though each search is bounded by log(n) complexity.
  - Current algorithm is top-down which would result in exhaustive 
    split and merge.

New proposed algorithm try to address these issues using bottom-up build
technique and thus avoids/reduces split and merge. (For compressed tables splits
will occur.)

-------------------------------------------------------------------------------

High Level Approach Description:
--------------------------------

- Bottom-up Index Build
=======================
Since tuples are ordered, we allocate a leaf page and append tuples to it. Once 
it is full (fill factor is dictated by external configuration variable), we
append a node pointer of the page to its parent page and same process is then
followed for all the non-leaf page which in turn can lead to insert upto root.
Next tuple would then go to new page following same semantics as described 
above.

We hold references to the rightmost pages at each level in the B-tree.
All inserts goes to these pages (including node-pointer inserted to non-leaf).
If a page has no space left for a new tuple, we allocate a new sibling page
releasing reference to the old page after updating the non-leaf pages as
needed. The newly pinned siblings will now act as rightmost page and default
location for upcoming new tuples.

- Redo Logging Disabled
=======================
REDO logging is selectively turned off and instead checkpoint is done to ensure
that build index can withstand crash/failure. Checkpoint will force writing up
of all the dirty pages to disk. During the progress of bulk load, we signal
page cleaner to flush dirty pages periodically, so the checkpoint could be short
and fast.

(Note: page-cleaner thread by default would do it but would trigger the action
only if non-dirty pages fall below some set threshold. In this case we would
like to flush the pages immediately reducing overhead of checkpoint and also
parallelize IO and CPU activities.)

There are 2 main action that needs redo logging:
1. Page-Allocation: REDO logging is turned-on as usual for this action except if
complete table is being rebuild.
2. Page-Modification: REDO logging is turned-off for this. On crash there is no
need to restore page content as anyways index is half-build and will not be
updated in metadata.

- Fill Factor
=============

The fill-factor value determines percentage of space on page to be filled with 
data, reserving the remainder on each page as free space for future growth. For 
example, specifying a fill-factor value of 80 means that 20 percent of each page
will be left empty. We don't keep exactly 20 percent free space, and the actual 
free percentage varies, greater than 20, or less than 20. It would be 
interpreted as hint and not a hard assurance.

Fill factor applies to both leaf-level page and non-leaf page, but doesn't apply 
to external page for TEXT/BLOB in B-tree. Fill-factor concept is newly being
introduced and use of it is currently restricted only for sorted build.

It is controlled using a global configuration parameter named
"INNODB_FILL_FACTOR".
Agreeable limits: 10 <= fill-factor <= 100

Note: it could be better that we support fill factor per index in syntax. e.g.
CREATE INDEX ... FILLFACTOR=80. We need a separate WL to address it cooperating 
with other teams. As Marko mentioned, fill factor should be an attribute in 
dd.indexes.options string.

- Compressed Table
==================
In existing approach record is inserted to both uncompressed and compressed
page. When the modification log of the compressed page is filled up, we 
recompress the compressed page, and split the page if compression fails.

In bulk load, we append records only to uncompressed pages. when a uncompressed 
page is filled up, we compress the page. If compression fails, we split the page
into two pages, and recompress until the compression succeeds.

We keep certain padding space in a page, which is determined by adaptive 
padding.

- Fulltext Index
================
Current implementation use SQL to insert a record into fulltext index. We will 
use bulk load to speedup fulltext index build as well.

This section lists more detail and actual code changes.

We have two classes: PageBulk and BtrBulk.

PageBulk only handle operations inside the page. A PageBulk object has its own 
mini-transaction, heap and page properties including block, page, heap top, free
space, etc. BtrBulk is responsible for maintaining the B-tree structure, and it 
holds a PageBulk object for each level.

-------------------------------------------------------------------------------

Bottom-up Index Build:
======================

dberr_t	BtrBulk::insert(dtuple_t*	tuple, ulint	level)
Insert tuple to needed level of the index tree.
- First check if there is enough space for the insert.
- If so, we insert the tuple in the page, otherwise we allocate a sibling 
page for the insert, meanwhile we commit the page.

pseudo code:
  Get converted record size.

  if (space is not available)
  {
     pin/get new page with a new mtr.
     commit cur page with its mtr.
     set new page as cur page.
  }

  Convert tuple to record.
  Insert the record to cur page.

  if (blob record)
  {
     store the blob record.
  }

dberr_t BtrBulk::pageCommit(PageBulk* page_bulk, PageBulk* next_page_bulk, bool 
insert_father)

- We first set page header(heap top, n_recs, etc) and page tailer(dir/slot), and 
check whether we need to compress the page. if compression fails, we split the 
page. If needed, we insert a node pointer to its parent page.

- Fill Factor
=============
Fill factor is effective for uncompressed page.

We set reserved empty space specified by fill factor in PageBulk::init.
m_reserved_space = UNIV_PAGE_SIZE * (100 - innobase_index_fill_factor) / 100;

And we check the empty space in PageBulk::spaceAvailable.
m_page_zip == NULL && m_free_space - required_space < m_reserved_space

- Adaptive Padding
==================
We keep padding space for compressed page, so as to ensure that page-compression
succeed in most of the cases. Fill factor can not guarantee we can insert more
rows into a compressed page because of compression.

We set padding space in PageBulk::init.
m_padding_space = UNIV_PAGE_SIZE - 
dict_index_zip_pad_optimal_page_size(m_index);

And we check the padding in PageBulk::sapceAvailable.
m_page_zip != NULL && m_free_space - required_space < m_padding_space

- Redo Logging Disabled
=======================
void BtrBulk::logFreeCheck()
When a leaf page is full and we move to a sibling leaf page, we do log free 
check, and signal to page cleaner once we commit 500 new dirty pages.

pseudo code:
  Release all block locks by committing mtrs.
  log free check.
  if (we generate new dirty pages is more than flush threshold)
  {
     signal to page cleaner;
  }
  Reacquire all blocks by restarting mtrs.

- Compressed Table
==================
We have four functions for the page split:
rec_t*	PageBulk::getSplitRec();
Get the split record in the page.

void PageBulk::copyIn(rec_t*	split_rec);
Copy the records from split record to last record to the new page.

void PageBulk::copyOut(rec_t*	split_rec);
Remove records from split record to last record in the page.

dberr_t BtrBulk::pageSplit(PageBulk* page_bulk, PageBulk* next_page_bulk)
We do page split only when page compression fails.

Pseudo code:

  Get split record from the split page.
  Pin/get a new page.
  Copy the upper half records to the new page.
  Remove the copied records from the split page.
  commit the split page(node pointer inserted).
  commit the new page(node pointer inserted).

- Blob Handling
===============
A new blob operation 'BTR_STORE_INSERT_BULK' added.

In btr_store_big_rec_extern_fields, we disable redo logging for blob mtr. for 
compressed table, modification to uncompressed page is skipped( 
'page_zip_write_blob_ptr' is skipped here), because we will generate the 
compressed page in pageCommit.

- Fulltext Index
================
We add BtrBulk instance in fts_psort_insert_t(fulltext insert context), and 
initialize it in row_fts_merge_insert, then use a new function below to create a 
tuple and do insert.

dberr_t
row_merge_write_fts_node(
/*=====================*/
        fts_psort_insert_t*
                        ins_ctx,        /*!< in: insert context */
        fts_string_t*   word,           /*!< in: word string */
        fts_node_t*     node)           /*!< in: node columns */