WL#7066: External tool to extract InnoDB tablespace dictionary information
Affects: Server-8.0
Status: Complete
We need an external tool for extracting tablespace dictionary information from an InnoDB tablespace. Although we don't do copy-on-write, we will make a best effort to keep reads from the tool reliable. The main objective of this tool is to read the tablespace dictionary by direct access to the tablespace file.
F-1: Should be able to read the tablespace dictionary from tablespace files on a running server
F-2: Should not modify any tablespace files (and should not modify the tablespace dictionary inside *.ibd files)
F-3: Should be able to read the tablespace dictionary when the server is offline
This tool assumes the tablespace dictionary pages are in a sane state (the server flushes the SDI pages as soon as possible). Remember, a tablespace can be a single-table tablespace (t1.ibd), a multi-table tablespace, or the InnoDB system tablespace (ibdata*).

Proposed name of this tool: ibd2sdi

Primary objective of this tool:
===============================
To read all the serialized metadata in a tablespace. The tool should be able to effectively do a READ UNCOMMITTED of the tablespace dictionary from the specified single-file tablespace:

ibd2sdi --read=C t1.ibd
ibd2sdi --read=C ibdata1

The above commands dump the Serialized Dictionary Information (SDI) for the tablespaces. The schema of SDI table records is id <8 bytes>, type <4 bytes>, data. We will not parse or interpret the contents of the 'data' field, i.e. we do only a byte-by-byte comparison.

The argument C specifies which copy of the information to retrieve. InnoDB can write N=0 or N=2 copies. The command will ONLY read the specified tablespace. There will not be any redo log apply or any access to the undo log. This is a 'best effort'. When the server is running, it is preferable to use the INFORMATION_SCHEMA to retrieve information about tablespaces. Most of the time, this tool should work while the MySQL server is running. During DDL (or ROLLBACK or undo log purge related to the SDI), there may be a short time window when this tool fails to read one copy of the information.

A multi-file tablespace is supported, but multiple tablespaces are not:

ibd2sdi --read=C ibdata*                  - supported
ibd2sdi --read=C foo.ibd ibdata* bar.ibd  - not supported

The files of a multi-file tablespace should be specified in strict order of ascending page number. If two successive files have the same space_id, then the latter file must start with the page number that is 1 + last_page_no of the previous file. The first file of a tablespace always has to start with page number 0 (unless page 0 is corrupted).
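As an illustration of the record schema above (id <8 bytes>, type <4 bytes>, data), the key fields can be split off with fixed-width big-endian reads, which is how InnoDB stores integers on disk. This is a hypothetical Python sketch for exposition only; `parse_sdi_key` is not part of the tool:

```python
import struct

def parse_sdi_key(rec: bytes):
    """Split an SDI record into its fixed-width key fields.

    Layout assumed from the spec: id <8 bytes>, type <4 bytes>,
    then the 'data' field. Integers are read big-endian. 'data'
    is treated as an opaque byte string and never interpreted;
    copies are only ever compared byte by byte.
    """
    sdi_id = struct.unpack(">Q", rec[0:8])[0]
    sdi_type = struct.unpack(">I", rec[8:12])[0]
    data = rec[12:]
    return sdi_id, sdi_type, data
```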
How the SDI generated from this tool should be used
===================================================
If a table in a tablespace is corrupted, the SDI should be extracted into a file and the corrupted table's entry removed from the SDI file. With the edited SDI and the .ibd file, the tablespace containing the corrupted table can then be imported.

How will the read be done?
==========================
We maintain two copies of the SDI table information and store the root page numbers in pages 1 & 2. Check whether page 0 is corrupted:

IF page 0 is corrupted
----------------------
Since page 0 stores the page size used by the InnoDB tablespace, we would not know the page size. To determine it, we read 60 pages (or the actual number of pages, if fewer) with each supported innodb_page_size (uncompressed and compressed), trying to find pages that satisfy any supported checksum. Once we find pages with a valid checksum, and hence a valid page size, we can read pages 1 & 2. We also print a warning about the detected page corruption.

a. page_size is determined by doing the above calculations
b. Read the index root page numbers from page 1 & page 2 (the page corruption check will be done on pages 1 & 2 as well)

IF page 0 is not corrupted
--------------------------
a. Read page_size from the FSP header (page 0)
b. Read the flag which indicates whether the tablespace dictionary exists or not
c. Read the index root page numbers
d. Also read the index root page numbers from page 1 & page 2

Output format:
==============
The tool will write the SDI records id, type, data in a JSON-like format. Example:

[
 ["ibd2sdi", {"type":%u,"id":%lu}],
 [SDI blob 1, not ending in comma]
],
[
 ["ibd2sdi", {"type":%u,"id":%lu}],
 [SDI blob 2, not ending in comma]
],

Options:
========
--read=C (C = 0 or 1)
Explained above. This allows the user to specify the explicit copy to read. Must be either 0 or 1.

--dump-file=<file>
Dump the tablespace SDI into the file passed by the user. Without a filename, it defaults to stdout.
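Stepping back to the page-size probe described under "IF page 0 is corrupted": the selection loop can be sketched as below. This is an illustrative sketch, not the tool's implementation; `checksum_ok` stands in for the real none/crc32/innodb checksum verification, and the size list assumes the standard uncompressed innodb_page_size values:

```python
# Assumed uncompressed innodb_page_size values (illustrative).
SUPPORTED_PAGE_SIZES = [4096, 8192, 16384, 32768, 65536]

def determine_page_size(buf, checksum_ok, max_pages=60):
    """Probe up to `max_pages` pages of the raw file bytes `buf`
    with each candidate page size and pick the size with the
    fewest checksum failures.

    `checksum_ok(page_bytes)` is a caller-supplied predicate
    standing in for the real checksum algorithms."""
    best_size, best_failures = None, None
    for size in SUPPORTED_PAGE_SIZES:
        n = min(max_pages, len(buf) // size)
        if n == 0:
            continue  # file smaller than one page of this size
        failures = sum(
            0 if checksum_ok(buf[i * size:(i + 1) * size]) else 1
            for i in range(n)
        )
        if best_failures is None or failures < best_failures:
            best_size, best_failures = size, failures
    return best_size
```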
--skip-data
Skip the 'data' field when parsing the fields of the SDI table. This option is used to print only the primary keys of the stored SDI. Sample output:

[
 ["ibd2sdi", {"type":1,"id":1}]
],
[
 ["ibd2sdi", {"type":2,"id":2}]
],

--id=X --type=Y
When an explicit id or type is provided, dump only the matching records. Note: the tool will always scan all records even when these options are provided. It will not perform searches in the SDI B-tree.

--checksum-algorithm=[innodb,crc32,none]
Algorithm used to verify the checksum of the pages read. By default, we use a non-strict checksum, i.e. check against all checksum algorithms: none, crc32, innodb.

--no-check
Skip checksum verification on pages read.

Without the option "--read=C":
==============================
When the user does not specify the copy to read, the tool reads *both* copies and tries to match them. The matching is done by parsing the fields in each record, i.e. extracting id, type & blob (the blob may be read from multiple pages), and then doing a byte-by-byte comparison.

If copies match:
----------------
We write one copy into the file (if provided by the user).

If copies don't match:
----------------------
Write nothing to the file (if provided) and return a nonzero exit status. Dump both copies to different files so that the user can compare the mismatched records. Sample output from the tool:

./ibd2sdi t1.ibd.mismatch
[ERROR] ibd2sdi: Corruption detected when comparing records.
[ERROR] ibd2sdi: Dumping SDI Copy: 0 into file: /tmp/ib_sdi_90_copy_0_GzcqtH.
[ERROR] ibd2sdi: Dumping SDI Copy: 1 into file: /tmp/ib_sdi_90_copy_1_5bhhgk.
[ERROR] ibd2sdi: Please compare the above files to find the difference between the two SDI copies.

The files will be located in the system tmp directory. The naming pattern is ib_sdi + space_id + copy_num + uniqtext.

Handle Page Corruption
======================
The tool should handle corrupted pages. It must not go into an infinite loop or assert; it should always exit gracefully.
The following scenarios will be handled:
----------------------------------------
1. Broken page links
2. Pages full of 0s
3. Checksum failures
4. Corrupted records (records pointing to 0x0, or pointing to an offset that makes the record chain a cycle)
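Scenario 4 can be guarded against by remembering the record offsets already visited while walking a page's record list, so a cycle or a null pointer is detected instead of looping forever. A hypothetical Python sketch, where `get_next_offset` stands in for reading the next-record pointer from a record header:

```python
def records_in_page(get_next_offset, infimum_offset, supremum_offset):
    """Walk the singly linked record chain of a page from INFIMUM
    to SUPREMUM, guarding against the corruptions listed above:
    a next pointer of 0, or a pointer looping back to an
    already-visited record.

    Returns the user-record offsets in chain order, or raises
    ValueError on corruption (the tool would report the page as
    corrupted and exit gracefully rather than loop or assert)."""
    seen = set()
    offsets = []
    off = infimum_offset
    while off != supremum_offset:
        if off == 0 or off in seen:
            raise ValueError("corrupted record chain at offset %d" % off)
        seen.add(off)
        if off != infimum_offset:  # INFIMUM itself is not a user record
            offsets.append(off)
        off = get_next_offset(off)
    return offsets
```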
STEP 1: Create tablespaces map
------------------------------
1. Read all files on the command line
2. Create a map of tablespaces (we restrict it to only one tablespace for now)
3. Read space_id from the first page of every file to determine the datafiles of a tablespace
4. In case of multi-file tablespaces, verify that first_page_num_of_data_file = last_page_num_of_last_data_file + 1
5. If the above validation fails, we don't add such a tablespace to the map
6. Now we have a map of tablespaces. For each tablespace we have determined the vector of datafiles, page_size, space_id, etc.

STEP 2: Fetch root page numbers
-------------------------------
For each tablespace (currently one):
1. Check whether the tablespace flags (FSP_FLAGS) say we have SDI in the tablespace. Also determine page_size from FSP_FLAGS. If the page size is invalid, read N (let's say 60) pages and calculate checksums using the different page sizes. Determine the best page_size based on the checksum failure count. (determine_page_size())
2. Even if FSP_FLAGS says we don't have SDI in the tablespace, we double-check by reading the root page numbers stored at FIL_PAGE_FILE_FLUSH_LSN on page 1 & page 2
3. If we find valid root page numbers on pages 1 & 2, we proceed
4. Match the root page numbers found on pages 1 & 2. If they mismatch, we try to determine the better root page number by reading FIL_PAGE_TYPE. If we can't determine it even after this step, we abort. (determine_good_root_page_no())
5. If the user chooses to read only one copy, we set the other copy number to -1 to avoid processing it. Otherwise we process both copies

STEP 3: Reach Level Zero
------------------------
1. Based on user input, we either match both copies -> dump_all_recs_in_leaf_level_compare(), or process only a single copy -> dump_all_recs_in_leaf_level()
The first step is to reach the leaf level (level zero), reach_to_level_zero():
a. Read the current level of the root page; if 0, return
b. If the level of the root page is not zero:
   * Fetch the left-most child page number (fetch INFIMUM, go to the next-rec offset from INFIMUM)
   * In this node page record, read the child page_num
c. Read the child page and find its level
d. Repeat the above procedure until level zero is reached

STEP 4: Process & Compare Records
---------------------------------
1. After reaching the left-most leaf-level page, get the first user record for both copies.
2. Extract (id, type, data) from each record, compare both records, and write to the file (if provided by the user): check_and_dump_record()
   i. if data is stored externally, fetch it by following the blob chain: copy_uncompressed_blob()/copy_compressed_blob()
3. Get the next record; if the record reaches the end of the page (SUPREMUM record), fetch the next page
4. Repeat until all records are processed in both copies.
5. The procedure for fetching from a single copy is the same, except that the comparison is not done.
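The descent of STEP 3 and the copy comparison of STEP 4 can be sketched as below. This is an illustrative Python sketch under assumed helper names, not the tool's C++ code: `read_page` stands in for the real page reader and the record tuples stand for the extracted (id, type, data) fields:

```python
def reach_level_zero(read_page, root_page_no):
    """Descend from the root to the left-most leaf page (level 0),
    as in reach_to_level_zero().

    `read_page(page_no)` is a stand-in that returns
    (level, leftmost_child_page_no); the child number is ignored
    once level 0 is reached."""
    page_no = root_page_no
    level, child = read_page(page_no)
    while level != 0:
        page_no = child
        level, child = read_page(page_no)
    return page_no

def compare_sdi_copies(copy0, copy1):
    """Byte-by-byte comparison of the (id, type, data) tuples read
    from the leaf levels of the two copies (STEP 4).

    Returns the indexes of mismatched records; an empty list means
    the copies match. On mismatch the tool would dump both copies
    to separate files and exit with a nonzero status."""
    mismatches = [i for i, (a, b) in enumerate(zip(copy0, copy1))
                  if a != b]
    if len(copy0) != len(copy1):
        # one copy has extra trailing records
        mismatches.append(min(len(copy0), len(copy1)))
    return mismatches
```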
Copyright (c) 2000, 2024, Oracle and/or its affiliates. All rights reserved.