WL#7066: External tool to extract InnoDB tablespace dictionary information

Affects: Server-8.0   —   Status: Complete

We need an external tool for extracting tablespace dictionary information from
an InnoDB tablespace.

Although we do not do copy-on-write, we will make a best effort to keep
reads from the tool reliable.

The main objective of this tool is to read the tablespace dictionary
by direct access to the tablespace file.

F-1: Should be able to read the tablespace dictionary from tablespace files
     on a running server
F-2: Should not modify any tablespace files
     (and should not modify the tablespace dictionary inside *.ibd files)
F-3: Should be able to read the tablespace dictionary when the server is
     offline

This tool assumes the tablespace dictionary pages are in a sane state
(the server flushes the SDI pages as soon as possible).

Remember, a tablespace can be a single-table tablespace (t1.ibd), a
multi-table tablespace, or the InnoDB system tablespace (ibdata*).

Proposed name of this tool: ibd2sdi

Primary objective of this tool:
===============================
To read all the serialized metadata in a tablespace

   The tool should be able to effectively do a READ UNCOMMITTED of the
   tablespace dictionary from the specified single-file tablespace:

   ibd2sdi --read=C t1.ibd
   ibd2sdi --read=C ibdata1

   The above commands should dump the Serialized Dictionary Information
   (SDI) for the tablespaces.

   The schema of SDI table records is id<8 bytes>, type<4 bytes>, data.
   We will not parse or interpret the contents of the 'data' field;
   i.e., we do only a byte-by-byte comparison.
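
   As a minimal illustration, the record could be modeled as below. The
   struct and helper are hypothetical, not the tool's actual code; only
   the field widths follow the schema above.

     #include <cstdint>
     #include <vector>

     // Hypothetical in-memory model of one SDI record: the key is
     // (id: 8 bytes, type: 4 bytes) and 'data' is an opaque blob.
     struct SdiRecord {
       uint64_t id;                // 8-byte id
       uint32_t type;              // 4-byte type
       std::vector<uint8_t> data;  // opaque; never parsed, only compared
     };

     // Records from the two SDI copies match only if all three fields
     // are identical; std::vector compares element by element.
     bool records_equal(const SdiRecord &a, const SdiRecord &b) {
       return a.id == b.id && a.type == b.type && a.data == b.data;
     }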

   The argument C specifies which copy of the information to retrieve
   (copy 0 or copy 1). InnoDB writes either zero or two copies of the SDI.

   The command will ONLY read the specified tablespace. There will not be
   any redo log apply or any access to the undo log; this is a 'best effort'.
   When the server is running, it is preferred to use the INFORMATION_SCHEMA
   to retrieve information about tablespaces. Most of the time, this tool
   should work while the MySQL server is running. During DDL (or ROLLBACK or
   undo log purge related to the SDI), there may be a short time window when
   this tool fails to read one copy of the information.

   A multi-file tablespace is supported, but multiple tablespaces are not:

   ibd2sdi --read=C ibdata*                    - supported
   ibd2sdi --read=C foo.ibd ibdata* bar.ibd    - not supported

   The files of a multi-file tablespace should be specified in strict order
   of ascending page number. If two successive files have the same space_id,
   then the latter file must start with a page_no that is 1 + the
   last_page_no of the previous file. The first file of a tablespace always
   has to start with page number 0 (unless page 0 is corrupted).

How the SDI generated from this tool should be used
===================================================
    If a table in a tablespace is corrupted, the SDI should be extracted
    into a file and the entry for the corrupted table removed from that
    file.

    With the edited SDI file and the .ibd file, the tablespace containing
    the corrupted table can then be imported.

How will the read be done?
==========================

We maintain two copies of the SDI table information and store the
root page numbers in pages 1 & 2.

First, check whether page 0 is corrupted:

IF page 0 is corrupted
----------------------
Since page 0 stores the page_size used by the InnoDB tablespace, we would
not know the page size.

To determine the page size, we read 60 pages (or the actual number of
pages, if fewer than 60) with each supported innodb_page_size (uncompressed
and compressed), trying to find pages that satisfy any supported checksum.

Once we find a page with a valid checksum for a given page size, we can
read pages 1 & 2. We also print a warning about the detected corruption of
page 0.

a. The page_size is determined by the above procedure
b. Read the index root page numbers from pages 1 & 2 (the page corruption
   check will be done on pages 1 & 2 as well)

IF page 0 is not corrupted
--------------------------
a. Read the page_size from the FSP header (page 0)
b. Read the flag which indicates whether the tablespace dictionary exists
c. Read the index root page numbers
d. Also read the index root page numbers from pages 1 & 2
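
As an illustration of steps (a) and (b), the flags could be read as below.
The offsets reflect the on-disk layout as commonly documented (FIL_PAGE_DATA
= 38 is the start of the FSP header, with the space flags 16 bytes into it);
treat them as assumptions to verify against the server sources, not as
authoritative values.

   #include <cstddef>
   #include <cstdint>

   // InnoDB stores integers big-endian on disk.
   static uint32_t read_be32(const uint8_t *p) {
     return (uint32_t(p[0]) << 24) | (uint32_t(p[1]) << 16) |
            (uint32_t(p[2]) << 8) | uint32_t(p[3]);
   }

   constexpr size_t FIL_PAGE_DATA = 38;                    // FSP header start
   constexpr size_t FSP_SPACE_FLAGS = FIL_PAGE_DATA + 16;  // tablespace flags

   // Both the page size and the SDI flag are decoded from these flags.
   uint32_t read_fsp_flags(const uint8_t *page0) {
     return read_be32(page0 + FSP_SPACE_FLAGS);
   }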

Output format:
==============
The tool will write the SDI records (id, type, data) in a JSON-like format.

Example:
[
 ["ibd2sdi", {"type":%u,"id":%lu}],
 [SDI blob 1, not ending in comma]
],
[
 ["ibd2sdi", {"type":%u,"id":%lu}],
 [SDI blob 2, not ending in comma]
],


Options:
========
--read=C (C = 0 or 1)

  Explained above. This lets the user specify explicitly which copy to
  read. C must be either 0 or 1.

--dump-file=

  Dump the tablespace SDI into the file specified by the user. Without a
  filename, output defaults to stdout.

--skip-data

  Skip the 'data' field when parsing the fields of the SDI table. This
  option prints only the primary keys of the stored SDI records.

  For example: (Sample output)
  [
   ["ibd2sdi", {"type":1,"id":1}]
  ],
  [
   ["ibd2sdi", {"type":2,"id":2}]
  ],

--id=X
--type=Y

  When an explicit id or type is provided, dump only the matching records.
  Note: the tool always scans all records even when these options are
  given; it does not perform searches in the SDI B-tree.
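
  A hedged sketch of that filter, applied per record during the full scan
  (the names are illustrative, not the tool's code):

     #include <cstdint>

     // Mirrors --id/--type: a record is dumped only if it passes every
     // filter the user supplied; there is no search by key.
     bool record_selected(uint64_t id, uint32_t type,
                          bool has_id, uint64_t match_id,
                          bool has_type, uint32_t match_type) {
       if (has_id && id != match_id) return false;
       if (has_type && type != match_type) return false;
       return true;
     }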

--checksum-algorithm=[innodb,crc32,none]

  The algorithm used to verify the checksum of the pages read. By default,
  we use non-strict verification, i.e. we check against all checksum
  algorithms: none, crc32, innodb.
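
  A minimal sketch of the non-strict default, assuming the real checksum
  routines are supplied as predicates (the tool itself uses InnoDB's
  implementations):

     #include <cstddef>
     #include <cstdint>
     #include <functional>
     #include <vector>

     using PageCheck = std::function<bool(const uint8_t *, size_t)>;

     // Non-strict verification: accept the page if ANY of the supplied
     // checksum predicates (none, crc32, innodb) validates it. A strict
     // mode would run only the single algorithm the user named.
     bool verify_non_strict(const uint8_t *page, size_t page_size,
                            const std::vector<PageCheck> &checks) {
       for (const auto &check : checks)
         if (check(page, page_size)) return true;
       return false;
     }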

--no-check
  Skip checksum verification on pages read

Without the option "--read=C":
===============================

When the user does not specify which copy to read, the tool reads *both*
copies and tries to match them.

The matching is done by parsing the fields in the record, i.e. extracting
the id, type & blob (the blob may span multiple pages) and then doing a
byte-by-byte comparison.

If copies match:
----------------
We will write one copy into the file (if provided by the user).

If copies don't match:
----------------------
Write nothing to the file (if provided) and return a nonzero exit status.
Dump both copies to different files so that the user can compare the
mismatched records.

Sample output from tool:
./ibd2sdi t1.ibd.mismatch 
[ERROR] ibd2sdi: Corruption detected when comparing records.
[ERROR] ibd2sdi: Dumping SDI Copy: 0 into file: /tmp/ib_sdi_90_copy_0_GzcqtH.
[ERROR] ibd2sdi: Dumping SDI Copy: 1 into file: /tmp/ib_sdi_90_copy_1_5bhhgk.
[ERROR] ibd2sdi: Please compare the above files to find the difference between
two SDI copies.

The files will be located in the system tmp directory. The naming pattern
is ib_sdi_<space_id>_copy_<copy_num>_<unique suffix>.

Handle Page Corruption
======================
The tool should handle corrupted pages. It should not go into an infinite
loop or assert; it should always exit gracefully.

The following scenarios will be handled:
----------------------------------------
1. Broken page links
2. Page with 0s
3. Checksum failures
4. Corrupted records (records pointing to 0x0, or to an offset that makes
   the record chain a cycle; see the sketch below)
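
For scenario 4, a minimal sketch of one defense against cycles. The offset
encoding and chain termination are simplified here; the real InnoDB record
format differs:

   #include <cstddef>
   #include <cstdint>
   #include <unordered_set>

   // Walk a page's record chain via next-record offsets without ever
   // looping forever: an out-of-bounds offset or a revisited offset
   // means the chain is corrupted.
   bool walk_record_chain(const uint8_t *page, size_t page_size,
                          uint16_t first_offset) {
     std::unordered_set<uint16_t> seen;
     uint16_t offset = first_offset;
     while (offset != 0) {  // 0 terminates this simplified chain
       if (offset + 1u >= page_size || !seen.insert(offset).second)
         return false;  // corrupt: points outside the page, or a cycle
       // Simplified: a 2-byte big-endian next-record offset is read
       // from the current record position.
       offset = uint16_t((page[offset] << 8) | page[offset + 1]);
     }
     return true;  // chain terminated cleanly
   }
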
STEP 1: Create tablespaces map
------------------------------
1. Read all files on command line

2. Create a map of tablespaces (we restrict to only one tablespace for now)

3. Read space_id from first page of every file to determine the
   datafiles of a tablespace

4. In case of multi-file tablespaces, verify that
   first_page_num_of_data_file = last_page_num_of_previous_data_file + 1

5. If the above validation fails, we do not add such a tablespace to the map

6. Now we have a map of tablespaces. For each tablespace we have determined
   the vector of datafiles, the page_size, the space_id, etc.
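
A sketch of this map construction (the names are illustrative; the
continuity rule from item 4 is what rejects wrongly ordered datafiles):

   #include <cstdint>
   #include <map>
   #include <string>
   #include <vector>

   struct Datafile {
     std::string path;
     uint32_t first_page_no;  // page number of the file's first page
     uint32_t last_page_no;   // page number of the file's last page
   };

   struct Tablespace {
     uint32_t space_id = 0;
     std::vector<Datafile> files;  // ascending page-number order
   };

   // Add a datafile to its tablespace, enforcing continuity: the first
   // file of a space starts at page 0, and each later file starts
   // exactly where the previous one ended, plus one.
   bool add_datafile(std::map<uint32_t, Tablespace> &spaces,
                     uint32_t space_id, const Datafile &file) {
     auto it = spaces.find(space_id);
     if (it == spaces.end()) {
       if (file.first_page_no != 0) return false;  // must start at page 0
       Tablespace ts;
       ts.space_id = space_id;
       ts.files.push_back(file);
       spaces.emplace(space_id, std::move(ts));
       return true;
     }
     if (file.first_page_no != it->second.files.back().last_page_no + 1)
       return false;  // gap or overlap: reject this tablespace
     it->second.files.push_back(file);
     return true;
   }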

STEP 2: Fetch Root page numbers
--------------------------------
   For each tablespace (currently one)

1. Check if the tablespace flags (FSP_FLAGS) say we have SDI in the
   tablespace. Also determine the page_size from FSP_FLAGS. If the page
   size is invalid, read N (say 60) pages and calculate checksums using
   the different page sizes; determine the best page_size based on the
   checksum failure count (determine_page_size(); see the sketch after
   this list)

2. Even if FSP_FLAGS says we do not have SDI in the tablespace, we
   double-check by reading the root page numbers stored at
   FIL_PAGE_FILE_FLUSH_LSN on pages 1 & 2

3. If we find valid root page numbers at pages 1 & 2, we proceed

4. Match the root page numbers found in pages 1 & 2. If they mismatch,
   we try to determine the better root page number by reading
   FIL_PAGE_TYPE. If we cannot determine it even after this step, we
   abort. (determine_good_root_page_no())

5. If the user chooses to read only one copy, we set the other copy
   number to -1 to avoid processing it. Otherwise, we process both
   copies
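
A hedged sketch of determine_page_size(), referenced in item 1 (the
candidate list and the buffer handling are illustrative):

   #include <cstddef>
   #include <cstdint>
   #include <functional>
   #include <vector>

   // Candidate innodb_page_size values to probe.
   static const std::vector<size_t> kCandidates = {4096, 8192, 16384,
                                                   32768, 65536};

   // Given a buffer holding the first N pages of the file and a
   // checksum predicate, pick the candidate page size with the fewest
   // checksum failures. Returns 0 if the buffer fits no candidate.
   size_t determine_page_size(
       const uint8_t *buf, size_t buf_len,
       const std::function<bool(const uint8_t *, size_t)> &passes) {
     size_t best = 0;
     size_t best_failures = SIZE_MAX;
     for (size_t candidate : kCandidates) {
       if (buf_len < candidate) continue;  // not even one full page
       size_t failures = 0;
       for (size_t off = 0; off + candidate <= buf_len; off += candidate)
         if (!passes(buf + off, candidate)) ++failures;
       if (failures < best_failures) {
         best_failures = failures;
         best = candidate;
       }
     }
     return best;
   }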

STEP 3: Reach Level Zero
------------------------
1. Based on user input, we either match both copies:
   dump_all_recs_in_leaf_level_compare(), or process only a single copy:
   dump_all_recs_in_leaf_level()

   The first step is to reach the leaf level (level zero),
   reach_to_level_zero():
     a. Read the current level of the root page; if 0, return
     b. If the level of the root page is not zero:
        * Fetch the leftmost child page number (fetch INFIMUM, go to the
          next-record offset from INFIMUM)
        * In this node-pointer record, read the child page_num
     c. Read the child page and find its level
     d. Repeat the above procedure until level zero is reached
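
A compact sketch of this descent, with page access abstracted so the logic
stands alone (the accessor names are hypothetical):

   #include <cstdint>
   #include <functional>

   // Page access layer supplied by the caller, so the descent logic is
   // independent of the on-disk record format.
   struct PageOps {
     std::function<uint16_t(uint32_t)> level;           // B-tree level of page N
     std::function<uint32_t(uint32_t)> leftmost_child;  // child page number in
                                                        // INFIMUM's successor
   };

   // Follow the leftmost node pointer down from the root until a page
   // at level zero (the leaf level) is reached.
   uint32_t reach_to_level_zero(uint32_t root_page_no, const PageOps &ops) {
     uint32_t page_no = root_page_no;
     while (ops.level(page_no) != 0)
       page_no = ops.leftmost_child(page_no);
     return page_no;
   }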


STEP 4: Process & Compare Records
----------------------------------
1. After reaching the leftmost leaf-level page, get the first user record
   for both copies.

2. Extract (id, type, data) from the record, compare both records, and
   write to the file (if provided by the user): check_and_dump_record()
   i. If the data is stored externally, fetch it by following the blob
      chain: copy_uncompressed_blob()/copy_compressed_blob()

3. Get the next record; if we reach the end of the page (the SUPREMUM
   record), fetch the next page

4. Repeat until all records are processed in both copies.

5. The procedure for fetching from a single copy is the same, except that
   the comparison step is skipped.
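
Putting steps 1-5 together, the paired walk could be sketched as below.
Record iteration is abstracted, and external blobs are assumed to have
been inlined already by the blob-copy step:

   #include <cstdint>
   #include <functional>
   #include <optional>
   #include <vector>

   struct SdiRecord {
     uint64_t id;
     uint32_t type;
     std::vector<uint8_t> data;  // opaque blob, already de-externalized
   };

   // Yields the next record of one copy's leaf level, or nullopt once
   // the SUPREMUM of the last page has been passed.
   using RecordCursor = std::function<std::optional<SdiRecord>()>;

   // Walk both copies in lockstep; any divergence means the copies do
   // not match and nothing should be written to the dump file.
   bool compare_copies(RecordCursor next0, RecordCursor next1) {
     for (;;) {
       std::optional<SdiRecord> r0 = next0();
       std::optional<SdiRecord> r1 = next1();
       if (!r0 && !r1) return true;   // both exhausted: copies match
       if (!r0 || !r1) return false;  // one copy has extra records
       if (r0->id != r1->id || r0->type != r1->type || r0->data != r1->data)
         return false;                // field mismatch between copies
       // A real run would dump the matched record here.
     }
   }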