WL#8960: InnoDB: Partial Fetch and Update of BLOB

Affects: Server-8.0   —   Status: Complete   —   Priority: Medium

Clustered index records can contain externally stored fields or large
objects (LOB), which are stored outside of the clustered index pages.
These LOBs are always operated in full.  Either the full LOB is read
or the full LOB is modified.  This means that even if there is a need
to fetch only a few bytes from an LOB, the complete LOB must be read.
Likewise, even to modify a small portion of an LOB, the complete LOB
must be modified.

This worklog provides the following:

*  Fetch a small portion of LOB.
*  Modify a small portion of LOB.
Functional:
===========

F-1: Insert operation now not only stores LOB data, but also builds an
     LOB index, which will help for quick random access of LOB data.
     (Note: This is an internal feature and is not user visible.)

F-2: Full fetch operation:  Fetching a full LOB (or a JSON document)
     will happen via the LOB index.  This is not an externally
     visible feature.  Suffice to know that all read operations on LOB
     that are stored in the new format will happen via LOB index.  No
     specific test case is needed for this. 

F-3: Update operation:  When a JSON document is updated, by modifying the
     length of total JSON document, then a full update will be performed.
     Optimizer decides when to go for partial updates.  So please refer
     to WL#8963 "Support for partial update of JSON in the optimizer"
     for examples.  

F-4: Partial update operation: When a JSON object is updated, without
     modifying the total length, only the modified portion must be replaced
     with new data. Optimizer decides when to go for partial updates. 
     So please refer to WL#8963 "Support for partial update of JSON
     in the optimizer" for examples.

F-5: Partial Fetch: When a portion of JSON object is fetched, for example
     one key-value pair, then no need to fetch the full JSON document.
     This funcionality is available in InnoDB, but is not yet used by
     server layer.  (Note:  Optimizer doesn't use this feature now.  So,
     it cannot be tested.)

F-6: MVCC must work correctly on partially updated JSON documents.

F-7: Rollback must work correctly on partially updated JSON documents.

F-8: Rollback to savepoint must work correctly on partially updated JSON
     documents.

F-9: LOBs are either stored in old-format or new format.  If LOBs are stored
     in old format, then it will work as before.  Partial updates will not
     happen on them.  When a row containing old LOBs is updated, then LOB
     format will become new format.  Then on this we can do partial update. 

Non-Functional:
===============

NF-1: Space performance: More meta data will be stored per LOB.  The extra
      space used needs to be reasonable.  For small LOBs the space overhead
      could be more.  (QA Note:  We need to check if information regarding
      number of pages used by LOB is available.  If yes, this information
      can be used.  Otherwise the ibd file size can be used to get a rough
      idea about extra space usage.)

NF-2: Time performance: More meta data needs to be maintained per LOB.
      While time taken for various LOB operations are expected to increase,
      the impact shouldn't be user visible.
LOB index (for uncompressed LOB):
=================================

The LOB index is a file based list of index_entry_t objects.  The base node of
this list is available in the first page of the LOB.  One half of the first
page is used to store the entries of LOB index (metadata) and other half is
used to store the actual LOB data.

Once the first LOB page contains no more space for LOB index entries, the more
LOB index pages are used.  These LOB index pages contains only the LOB index
entries and will not contain any LOB data.

Each LOB index entry is represented by the index_entry_t struct.  It contains
the following information:

.  Pointer to previous entry
.  Pointer to next entry.
.  Base node of the list of older versions of this entry.
.  The transaction that created this entry.
.  The transaction undo number to identify the statement that created this
   entry.  This is used while doing rollback to savepoint.
.  The page number where LOB data is available.
.  The amount of data available in the page.

Given the LOB reference, we can access the first page of LOB which contains
the head of LOB index list.  From this base node we can access all the LOB
index entries.

ZLOB index (for compressed LOB):
================================

The ZLOB index for compressed LOB is similar to the LOB index for
uncompressed LOB.  It is a file based list of z_index_entry_t objects.
The base node of this list is available in the first page of the LOB.
Each zlib stream will be represented by one z_index_entry_t
objects.

A fixed part of the first page is allocated for use of ZLOB index.  Once this
space is used, then more ZLOB index pages are allocated and used for storing
the ZLOB index entries.

Each ZLOB index entry contains the following information:

.  Pointer to previous entry
.  Pointer to next entry.
.  Base node of the list of older versions of this entry.
.  The transaction that created this entry.
.  The transaction undo number to identify the statement that created this
   entry.  This is used while doing rollback to savepoint.
.  The page number where zlib stream data is available.
.  The fragment ID (if the page is a fragment page).
.  The amount of data available in the stream (compressed length).
.  The amount of data available in the stream (uncompressed length).

Given the LOB reference, we can access the first page of ZLOB which contains
the head of ZLOB index list.  From this base node we can access all the ZLOB
index entries by traversing the next entry pointer.

How uncompressed LOB is stored?
===============================

The clustered index record contains the LOB reference as usual.  It points to
the first page of LOB (of type FIL_PAGE_TYPE_LOB_FIRST).  The first page of LOB
is divided into two parts - the first part contains the LOB index and the second
part contains the LOB data.

The LOB index is a list of index_entry_t objects stored in the first part of
the first page (FIL_PAGE_TYPE_LOB_FIRST) and index pages of type
FIL_PAGE_TYPE_LOB_INDEX.

The actual data is stored in the lob pages of type
FIL_PAGE_TYPE_LOB_DATA.  The uncompressed LOB is stored in many LOB
pages of type FIL_PAGE_TYPE_LOB_DATA.  These pages are not needed to
be linked to one another.  The order in which they need to be accessed
is determined by the LOB index.

How compressed LOB is stored?
=============================

The basic idea is that a single compressed LOB is stored as a list of zlib
streams.  Information about each of these zlib streams are captured and stored
in the ZLOB index. 

The first page of compressed LOB is of type FIL_PAGE_TYPE_ZLOB_FIRST.  This
first page is divided into 3 parts.  One part is used to store the LOB index,
the second part is used to store the frag page list, and the third part is used
to store the actual LOB data (compressed zlib stream).

The pages of type FIL_PAGE_TYPE_ZLOB_DATA is used to store zlib streams. Each
page of type FIL_PAGE_TYPE_ZLOB_DATA stores a portion of a single zlib stream.

The pages of type FIL_PAGE_TYPE_ZLOB_FRAG is used to store fragments of zlib
streams. Each page of type FIL_PAGE_TYPE_ZLOB_FRAG stores a small portion of
zlib stream from many different zlib streams.

One single LOB is stored as a list of zlib streams.  Each zlib stream is
represented by one z_index_entry_t object.  The list of z_index_entry_t
objects are stored in first part of FIL_PAGE_TYPE_ZLOB_FIRST and
FIL_PAGE_TYPE_ZLOB_INDEX.

The pages of type FIL_PAGE_TYPE_ZLOB_FRAG_ENTRY contains a list of
z_frag_entry_t objects.  Each of these entry describe one fragment.

Motivation for fragment pages for compressed LOB:
=================================================

In the case of compressed LOB, the LOB is stored in the form of
multiple zlib streams.  Each zlib stream can span multiple pages.  So
for each zlib stream, it is possible that space is wasted in the last
pages.  Since many such pages could exist for a large LOB, the
fragment pages are introduced.  These fragment pages can store a small
portion of data from different zlib streams (but all belonging to the
same LOB).  This is just a space saving optimization.

Page Types:
===========

This worklog introduces 8 new page types:

For storing uncompressed LOB:
FIL_PAGE_TYPE_LOB_INDEX
FIL_PAGE_TYPE_LOB_DATA
FIL_PAGE_TYPE_LOB_FIRST

For storing compressed LOB:
FIL_PAGE_TYPE_ZLOB_FIRST
FIL_PAGE_TYPE_ZLOB_DATA
FIL_PAGE_TYPE_ZLOB_INDEX
FIL_PAGE_TYPE_ZLOB_FRAG
FIL_PAGE_TYPE_ZLOB_FRAG_ENTRY

Accessing Full LOBs (uncompressed):
===================================

When there is a request to access the full LOB, the steps are as follows:

*  The first page of the LOB is accessed.  This page contains the base node
   of the list of index_entry_t objects.
*  The base node of the list of index_entry_t objects is accessed.
*  The first node of the list of index_entry_t objects is accessed.
*  Each index_entry_t object contains information about one uncompressed
   LOB page.  It also contains transaction information.
*  Trx id is used to check if the fetching trx can read this entry.
*  Data associated with the current index_entry_t object is accessed.
*  Then the next index_entry_t object is accessed and the steps are
   repeated till the end of the list is encountered.

Accessing Full LOBs (compressed):
=================================

When there is a request to access the full LOB, the steps are as follows:

*  The first page of the LOB is accessed.  This page contains the base node
   of the list of z_index_entry_t objects.
*  The base node of the list of z_index_entry_t objects is accessed.
*  The first node of the list of z_index_entry_t objects is accessed.
*  Each z_index_entry_t object contains information about one zlib stream.
   It also contains transaction information.
*  Trx id is used to check if the fetching trx can read this entry.
*  Data associated with the current z_index_entry_t object is accessed.
*  Then the next z_index_entry_t object is accessed and the steps are
   repeated till the end of the list is encountered.

Partial Update of uncompressed LOBs:
====================================

To partially update an LOB the following information is provided:

*  The offset from where the LOB is modified.
*  The length of the LOB to be modified.
*  The new data.

Note:  Currently only equal length partial update is allowed at server
layer (even though code at InnoDB level allows unequal). Deleting from
the middle of LOB is not allowed at server layer (even though InnoDB
can handle it.)  I think this limitation is for performance reasons.

The steps for partial update of uncompressed LOB is as follows:

* Through the first LOB page, access the LOB index list.
* Using the LOB index list, skip LOB data till we reach the
  offset where the modify operation begins.
* For each entry in the LOB index list, determine if the entry
  must be modified or not.
* If an entry needs to be modified, create a new entry and replace
  the older entry. The older entry is added to the previous versions
  list (which will be later used for MVCC).
* Like this modify as many entries as needed till the requested
  modification is complete.

The important point to note here is that each LOB index entry node
provides information about one single LOB data page.

Partial Update of Compressed LOBs:
==================================

To partially update a compressed LOB the following information is provided:

*  The offset from where the LOB is modified.
*  The length of the LOB to be modified.
*  The new data.

Note:  Currently only equal length partial update is allowed at server
layer (even though code at InnoDB level allows unequal).

The steps for partial update of compressed LOB is as follows:

Note: The important point to note here is that each compressed LOB
index entry node provides information about one single zlib stream
(which can span multiple LOB data pages).

* Through the first LOB page, access the LOB index list.
* Using the LOB index list, skip LOB data till we reach the
  offset where the modify operation begins.
* For each entry in the LOB index list, determine if the entry
  must be modified or not.
* If an entry needs to be modified, create a new entry and replace
  the older entry. The older entry is added to the previous versions
  list (which will be later used for MVCC).
* Like this modify as many entries as needed till the requested
  modification is complete.

Handler Interface:
==================

The handler interface is modified as follows to enable partial update
of LOBs:

int handler::ha_update_row(const uchar *old_data, uchar *new_data,
                           Binary_diffs *diffs);


The 3rd argument (diffs) is the newly added argument.  It is the update
vector containing the list of partially modified LOBs.  The storage engine
can use it or ignore it.

MVCC (using LOB version number):
===============================

With this worklog each LOB contains both LOB index and the LOB data.  Also,
each LOB contains data from different transactions.  So given the LOB
reference, we will have an LOB containing data from different transactions.
While reading an LOB, the LOB version number stored in the LOB reference
will help to read the correct version of LOB.  The read view is not used as
was initially planned.

An example:

Let E1, E2, E3 ... En be the LOB index entries.
Let E1 have x older versions.  They are E11, E12, E13 ... E1x.
Let E2 have y older versions.  They are E21, E22, E23 ... E2y.
Let E3 have z older versions.  They are E31, E32, E33 ... E3z.
Let En have m older versions.  They are En1, En2, En3 ... Enm.

Then the LOB index is arranged as follows:

BEGIN
↓
E1 → E11 → E12 -> E13 -> E1x
↓
E2 → E21 → E22 -> E23 -> E2y
↓
E3 → E31 → E32 -> E33 -> E3z
↓
En → En1 → En2 -> En3 -> Enm
↓
END

A reading trx will look at the LOB version to which each of those
entries are part of.  Each entry in the LOB index will tell to which
LOB version it belongs.  Using this version information correct
version of LOB is accessed.

Purging of LOB:
==============

After this worklog, many versions of clustered index record can point to
the same LOB. The LOB will be purged as part of clustered index record.
The clustered index record will contain the trx id.  So using this trx id
we will purge only those LOB index entries that are equal to this trx id.

Each LOB also stores the trx id that last modified the LOB.  This
information is stored in the header of first page of LOB (both 
compressed and uncompressed).  If this trx id is getting purged, then
the complete LOB will be purged.

During rollback, the trx id will be equal to the last modified trx id of LOB.
This is treated specially and will check only for equality.

Ownership and Inheritance Flags:
================================

This worklog does not modify the semantics of the ownership and inheritance
flags.  These flags are used in different situation.  Let us look at an
example scenario to clarify.

Suppose there is a clustered index record and it has an LOB.  If the key of the
clustered index record is modified without modifying the LOB, then the LOB will
not be copied, but the older clust rec will dis-own the LOB, and the newer
clust rec will mark that it has inherited the LOB.  

While purging the older clustered index record the LOB will not be purged,
because it is disowned.  When this operation is rolled back, then the
inheritance flag will be used to avoid deleting the LOB.

The purge uses the ownership flag and the rollback uses both ownership flag
and the inheritance flag.  These situations are not affected by this worklog.

Older LOBs:
===========

Partial update is applicable only for newly created LOBs which
contains the LOB index.  There is no option to create an LOB index on
older LOBs.  The older LOBs will behave as usual.  All new inserts
will create LOBs in newer format along with LOB index.

Will we use LOB Index for updates if entire BLOB is updated?
============================================================

If entire LOB is updated, then it is equal to inserting a new LOB.
While inserting new LOB, LOB index is automatically built.

Redo Logging and Crash Recovery:
================================

All LOB operations are redo logged.  And all LOB operations are done
as part of an operation on the clustered index record.  So either the
LOB operation will be completed by applying the redo logs, or they
will be rolled back during recovery by means of the rollback operation
of the clustered index record. So LOB operations are crash safe.

For partial updates, the redo log is generated only for the modified
portion of the LOB.

Concurrency Control:
====================

When an LOB is modified, the transaction will be holding an exclusive
lock on the clustered index record.  The operating thread will also
be holding an x-latch on the clustered index record page.  This is
needed because the LOB reference will be updated as needed.

Since the LOB is always accessed only as part of a clustered index
record, the concurrency mechanism that protects a clustered index
record also protects the associated LOB.  Logically they need to be
considered together.

First page of LOB:
==================

The LOB reference stored in a clustered index record points to the first
page of LOB.  This page is represented by struct first_page_t for
uncompressed LOB and struct z_first_page_t for compressed LOB.

Index for Uncompressed LOB (LOB Index):
=======================================

The LOB index is a file-based linked list containing meta data
about each LOB data page.  Each node of this list is called as
an entry.  The LOB index entry is defined by the data structure
index_entry_t.

The data structure index_entry_t contains the following members:

* Link to the previous node.
* Link to the next node.
* Base node of version list.
* Transaction id of trx that created this entry.
* Transaction undo number to be used for savepoints.
* The LOB page number (P1) containing LOB data.
* The amount of LOB data (in bytes) available in P1.

The base node of this LOB index list is available in the first
page of an LOB at offset OFFSET_INDEX_LIST.

Index for compressed LOB (ZLOB Index):
======================================

The compressed LOB is stored as a list of zlib streams.  The LOB index is a
file-based linked list containing meta data about each zlib stream.  Each
node of this list is called as an entry.  The LOB index entry is defined by
the data structure z_index_entry_t.

The data structure z_index_entry_t contains the following members:

* Link to the previous node.
* Link to the next node.
* Base node of version list.
* Transaction id of trx that created this entry.
* Transaction undo number to be used for savepoints.
* The LOB page number (P1) where the zlib stream begins.
* The amount of compressed LOB data (in bytes) available in P1.
* The amount of uncompressed LOB data (in bytes) available in P1.
* The fragment ID pointing data in the fragment page.

The base node of this LOB index list is available in the first
page (z_first_page_t) of an LOB at offset OFFSET_INDEX_LIST.