WL#8960: InnoDB: Partial Fetch and Update of BLOB
Affects: Server-8.0
—
Status: Complete
Clustered index records can contain externally stored fields or large objects (LOB), which are stored outside of the clustered index pages. These LOBs are always operated in full. Either the full LOB is read or the full LOB is modified. This means that even if there is a need to fetch only a few bytes from an LOB, the complete LOB must be read. Likewise, even to modify a small portion of an LOB, the complete LOB must be modified. This worklog provides the following: * Fetch a small portion of LOB. * Modify a small portion of LOB.
Functional: =========== F-1: Insert operation now not only stores LOB data, but also builds an LOB index, which will help for quick random access of LOB data. (Note: This is an internal feature and is not user visible.) F-2: Full fetch operation: Fetching a full LOB (or a JSON document) will happen via the LOB index. This is not an externally visible feature. Suffice to know that all read operations on LOB that are stored in the new format will happen via LOB index. No specific test case is needed for this. F-3: Update operation: When a JSON document is updated, by modifying the length of total JSON document, then a full update will be performed. Optimizer decides when to go for partial updates. So please refer to WL#8963 "Support for partial update of JSON in the optimizer" for examples. F-4: Partial update operation: When a JSON object is updated, without modifying the total length, only the modified portion must be replaced with new data. Optimizer decides when to go for partial updates. So please refer to WL#8963 "Support for partial update of JSON in the optimizer" for examples. F-5: Partial Fetch: When a portion of JSON object is fetched, for example one key-value pair, then no need to fetch the full JSON document. This funcionality is available in InnoDB, but is not yet used by server layer. (Note: Optimizer doesn't use this feature now. So, it cannot be tested.) F-6: MVCC must work correctly on partially updated JSON documents. F-7: Rollback must work correctly on partially updated JSON documents. F-8: Rollback to savepoint must work correctly on partially updated JSON documents. F-9: LOBs are either stored in old-format or new format. If LOBs are stored in old format, then it will work as before. Partial updates will not happen on them. When a row containing old LOBs is updated, then LOB format will become new format. Then on this we can do partial update. Non-Functional: =============== NF-1: Space performance: More meta data will be stored per LOB. The extra space used needs to be reasonable. For small LOBs the space overhead could be more. (QA Note: We need to check if information regarding number of pages used by LOB is available. If yes, this information can be used. Otherwise the ibd file size can be used to get a rough idea about extra space usage.) NF-2: Time performance: More meta data needs to be maintained per LOB. While time taken for various LOB operations are expected to increase, the impact shouldn't be user visible.
WL#8963: Support for partial update of JSON in the optimizer
WL#8985: InnoDB: Refactor compressed BLOB code to facilitate partial fetch/update
WL#9141: InnoDB: Refactor uncompressed BLOB code to facilitate partial fetch/update
WL#9263: InnoDB: Enable partial access of LOB using multiple zlib streams
WL#8985: InnoDB: Refactor compressed BLOB code to facilitate partial fetch/update
WL#9141: InnoDB: Refactor uncompressed BLOB code to facilitate partial fetch/update
WL#9263: InnoDB: Enable partial access of LOB using multiple zlib streams
LOB index (for uncompressed LOB): ================================= The LOB index is a file based list of index_entry_t objects. The base node of this list is available in the first page of the LOB. One half of the first page is used to store the entries of LOB index (metadata) and other half is used to store the actual LOB data. Once the first LOB page contains no more space for LOB index entries, the more LOB index pages are used. These LOB index pages contains only the LOB index entries and will not contain any LOB data. Each LOB index entry is represented by the index_entry_t struct. It contains the following information: . Pointer to previous entry . Pointer to next entry. . Base node of the list of older versions of this entry. . The transaction that created this entry. . The transaction undo number to identify the statement that created this entry. This is used while doing rollback to savepoint. . The page number where LOB data is available. . The amount of data available in the page. Given the LOB reference, we can access the first page of LOB which contains the head of LOB index list. From this base node we can access all the LOB index entries. ZLOB index (for compressed LOB): ================================ The ZLOB index for compressed LOB is similar to the LOB index for uncompressed LOB. It is a file based list of z_index_entry_t objects. The base node of this list is available in the first page of the LOB. Each zlib stream will be represented by one z_index_entry_t objects. A fixed part of the first page is allocated for use of ZLOB index. Once this space is used, then more ZLOB index pages are allocated and used for storing the ZLOB index entries. Each ZLOB index entry contains the following information: . Pointer to previous entry . Pointer to next entry. . Base node of the list of older versions of this entry. . The transaction that created this entry. . The transaction undo number to identify the statement that created this entry. This is used while doing rollback to savepoint. . The page number where zlib stream data is available. . The fragment ID (if the page is a fragment page). . The amount of data available in the stream (compressed length). . The amount of data available in the stream (uncompressed length). Given the LOB reference, we can access the first page of ZLOB which contains the head of ZLOB index list. From this base node we can access all the ZLOB index entries by traversing the next entry pointer. How uncompressed LOB is stored? =============================== The clustered index record contains the LOB reference as usual. It points to the first page of LOB (of type FIL_PAGE_TYPE_LOB_FIRST). The first page of LOB is divided into two parts - the first part contains the LOB index and the second part contains the LOB data. The LOB index is a list of index_entry_t objects stored in the first part of the first page (FIL_PAGE_TYPE_LOB_FIRST) and index pages of type FIL_PAGE_TYPE_LOB_INDEX. The actual data is stored in the lob pages of type FIL_PAGE_TYPE_LOB_DATA. The uncompressed LOB is stored in many LOB pages of type FIL_PAGE_TYPE_LOB_DATA. These pages are not needed to be linked to one another. The order in which they need to be accessed is determined by the LOB index. How compressed LOB is stored? ============================= The basic idea is that a single compressed LOB is stored as a list of zlib streams. Information about each of these zlib streams are captured and stored in the ZLOB index. The first page of compressed LOB is of type FIL_PAGE_TYPE_ZLOB_FIRST. This first page is divided into 3 parts. One part is used to store the LOB index, the second part is used to store the frag page list, and the third part is used to store the actual LOB data (compressed zlib stream). The pages of type FIL_PAGE_TYPE_ZLOB_DATA is used to store zlib streams. Each page of type FIL_PAGE_TYPE_ZLOB_DATA stores a portion of a single zlib stream. The pages of type FIL_PAGE_TYPE_ZLOB_FRAG is used to store fragments of zlib streams. Each page of type FIL_PAGE_TYPE_ZLOB_FRAG stores a small portion of zlib stream from many different zlib streams. One single LOB is stored as a list of zlib streams. Each zlib stream is represented by one z_index_entry_t object. The list of z_index_entry_t objects are stored in first part of FIL_PAGE_TYPE_ZLOB_FIRST and FIL_PAGE_TYPE_ZLOB_INDEX. The pages of type FIL_PAGE_TYPE_ZLOB_FRAG_ENTRY contains a list of z_frag_entry_t objects. Each of these entry describe one fragment. Motivation for fragment pages for compressed LOB: ================================================= In the case of compressed LOB, the LOB is stored in the form of multiple zlib streams. Each zlib stream can span multiple pages. So for each zlib stream, it is possible that space is wasted in the last pages. Since many such pages could exist for a large LOB, the fragment pages are introduced. These fragment pages can store a small portion of data from different zlib streams (but all belonging to the same LOB). This is just a space saving optimization. Page Types: =========== This worklog introduces 8 new page types: For storing uncompressed LOB: FIL_PAGE_TYPE_LOB_INDEX FIL_PAGE_TYPE_LOB_DATA FIL_PAGE_TYPE_LOB_FIRST For storing compressed LOB: FIL_PAGE_TYPE_ZLOB_FIRST FIL_PAGE_TYPE_ZLOB_DATA FIL_PAGE_TYPE_ZLOB_INDEX FIL_PAGE_TYPE_ZLOB_FRAG FIL_PAGE_TYPE_ZLOB_FRAG_ENTRY Accessing Full LOBs (uncompressed): =================================== When there is a request to access the full LOB, the steps are as follows: * The first page of the LOB is accessed. This page contains the base node of the list of index_entry_t objects. * The base node of the list of index_entry_t objects is accessed. * The first node of the list of index_entry_t objects is accessed. * Each index_entry_t object contains information about one uncompressed LOB page. It also contains transaction information. * Trx id is used to check if the fetching trx can read this entry. * Data associated with the current index_entry_t object is accessed. * Then the next index_entry_t object is accessed and the steps are repeated till the end of the list is encountered. Accessing Full LOBs (compressed): ================================= When there is a request to access the full LOB, the steps are as follows: * The first page of the LOB is accessed. This page contains the base node of the list of z_index_entry_t objects. * The base node of the list of z_index_entry_t objects is accessed. * The first node of the list of z_index_entry_t objects is accessed. * Each z_index_entry_t object contains information about one zlib stream. It also contains transaction information. * Trx id is used to check if the fetching trx can read this entry. * Data associated with the current z_index_entry_t object is accessed. * Then the next z_index_entry_t object is accessed and the steps are repeated till the end of the list is encountered. Partial Update of uncompressed LOBs: ==================================== To partially update an LOB the following information is provided: * The offset from where the LOB is modified. * The length of the LOB to be modified. * The new data. Note: Currently only equal length partial update is allowed at server layer (even though code at InnoDB level allows unequal). Deleting from the middle of LOB is not allowed at server layer (even though InnoDB can handle it.) I think this limitation is for performance reasons. The steps for partial update of uncompressed LOB is as follows: * Through the first LOB page, access the LOB index list. * Using the LOB index list, skip LOB data till we reach the offset where the modify operation begins. * For each entry in the LOB index list, determine if the entry must be modified or not. * If an entry needs to be modified, create a new entry and replace the older entry. The older entry is added to the previous versions list (which will be later used for MVCC). * Like this modify as many entries as needed till the requested modification is complete. The important point to note here is that each LOB index entry node provides information about one single LOB data page. Partial Update of Compressed LOBs: ================================== To partially update a compressed LOB the following information is provided: * The offset from where the LOB is modified. * The length of the LOB to be modified. * The new data. Note: Currently only equal length partial update is allowed at server layer (even though code at InnoDB level allows unequal). The steps for partial update of compressed LOB is as follows: Note: The important point to note here is that each compressed LOB index entry node provides information about one single zlib stream (which can span multiple LOB data pages). * Through the first LOB page, access the LOB index list. * Using the LOB index list, skip LOB data till we reach the offset where the modify operation begins. * For each entry in the LOB index list, determine if the entry must be modified or not. * If an entry needs to be modified, create a new entry and replace the older entry. The older entry is added to the previous versions list (which will be later used for MVCC). * Like this modify as many entries as needed till the requested modification is complete. Handler Interface: ================== The handler interface is modified as follows to enable partial update of LOBs: int handler::ha_update_row(const uchar *old_data, uchar *new_data, Binary_diffs *diffs); The 3rd argument (diffs) is the newly added argument. It is the update vector containing the list of partially modified LOBs. The storage engine can use it or ignore it. MVCC (using LOB version number): =============================== With this worklog each LOB contains both LOB index and the LOB data. Also, each LOB contains data from different transactions. So given the LOB reference, we will have an LOB containing data from different transactions. While reading an LOB, the LOB version number stored in the LOB reference will help to read the correct version of LOB. The read view is not used as was initially planned. An example: Let E1, E2, E3 ... En be the LOB index entries. Let E1 have x older versions. They are E11, E12, E13 ... E1x. Let E2 have y older versions. They are E21, E22, E23 ... E2y. Let E3 have z older versions. They are E31, E32, E33 ... E3z. Let En have m older versions. They are En1, En2, En3 ... Enm. Then the LOB index is arranged as follows: BEGIN ↓ E1 → E11 → E12 -> E13 -> E1x ↓ E2 → E21 → E22 -> E23 -> E2y ↓ E3 → E31 → E32 -> E33 -> E3z ↓ En → En1 → En2 -> En3 -> Enm ↓ END A reading trx will look at the LOB version to which each of those entries are part of. Each entry in the LOB index will tell to which LOB version it belongs. Using this version information correct version of LOB is accessed. Purging of LOB: ============== After this worklog, many versions of clustered index record can point to the same LOB. The LOB will be purged as part of clustered index record. The clustered index record will contain the trx id. So using this trx id we will purge only those LOB index entries that are equal to this trx id. Each LOB also stores the trx id that last modified the LOB. This information is stored in the header of first page of LOB (both compressed and uncompressed). If this trx id is getting purged, then the complete LOB will be purged. During rollback, the trx id will be equal to the last modified trx id of LOB. This is treated specially and will check only for equality. Ownership and Inheritance Flags: ================================ This worklog does not modify the semantics of the ownership and inheritance flags. These flags are used in different situation. Let us look at an example scenario to clarify. Suppose there is a clustered index record and it has an LOB. If the key of the clustered index record is modified without modifying the LOB, then the LOB will not be copied, but the older clust rec will dis-own the LOB, and the newer clust rec will mark that it has inherited the LOB. While purging the older clustered index record the LOB will not be purged, because it is disowned. When this operation is rolled back, then the inheritance flag will be used to avoid deleting the LOB. The purge uses the ownership flag and the rollback uses both ownership flag and the inheritance flag. These situations are not affected by this worklog. Older LOBs: =========== Partial update is applicable only for newly created LOBs which contains the LOB index. There is no option to create an LOB index on older LOBs. The older LOBs will behave as usual. All new inserts will create LOBs in newer format along with LOB index. Will we use LOB Index for updates if entire BLOB is updated? ============================================================ If entire LOB is updated, then it is equal to inserting a new LOB. While inserting new LOB, LOB index is automatically built. Redo Logging and Crash Recovery: ================================ All LOB operations are redo logged. And all LOB operations are done as part of an operation on the clustered index record. So either the LOB operation will be completed by applying the redo logs, or they will be rolled back during recovery by means of the rollback operation of the clustered index record. So LOB operations are crash safe. For partial updates, the redo log is generated only for the modified portion of the LOB. Concurrency Control: ==================== When an LOB is modified, the transaction will be holding an exclusive lock on the clustered index record. The operating thread will also be holding an x-latch on the clustered index record page. This is needed because the LOB reference will be updated as needed. Since the LOB is always accessed only as part of a clustered index record, the concurrency mechanism that protects a clustered index record also protects the associated LOB. Logically they need to be considered together.
First page of LOB: ================== The LOB reference stored in a clustered index record points to the first page of LOB. This page is represented by struct first_page_t for uncompressed LOB and struct z_first_page_t for compressed LOB. Index for Uncompressed LOB (LOB Index): ======================================= The LOB index is a file-based linked list containing meta data about each LOB data page. Each node of this list is called as an entry. The LOB index entry is defined by the data structure index_entry_t. The data structure index_entry_t contains the following members: * Link to the previous node. * Link to the next node. * Base node of version list. * Transaction id of trx that created this entry. * Transaction undo number to be used for savepoints. * The LOB page number (P1) containing LOB data. * The amount of LOB data (in bytes) available in P1. The base node of this LOB index list is available in the first page of an LOB at offset OFFSET_INDEX_LIST. Index for compressed LOB (ZLOB Index): ====================================== The compressed LOB is stored as a list of zlib streams. The LOB index is a file-based linked list containing meta data about each zlib stream. Each node of this list is called as an entry. The LOB index entry is defined by the data structure z_index_entry_t. The data structure z_index_entry_t contains the following members: * Link to the previous node. * Link to the next node. * Base node of version list. * Transaction id of trx that created this entry. * Transaction undo number to be used for savepoints. * The LOB page number (P1) where the zlib stream begins. * The amount of compressed LOB data (in bytes) available in P1. * The amount of uncompressed LOB data (in bytes) available in P1. * The fragment ID pointing data in the fragment page. The base node of this LOB index list is available in the first page (z_first_page_t) of an LOB at offset OFFSET_INDEX_LIST.
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.