WL#9538: InnoDB_New_DD: Integrating InnoDB SDI with new DD
Affects: Server-8.0
—
Status: Complete
Main objective of this WL is to integrate new DD with InnoDB new dictionary branch. This WL will handle the below cases: 1. Removal of one SDI copy in InnoDB tablespaces 2. Fix the integration issues. Make sure SDI in InnoDB tablespace is updated after ALTER. 3. On upgrade from 5.7, SDI is created for tablespaces
Functional: ----------- 1. SDI should be present in all InnoDB tablespaces except temporary tablespace(ibtmp*) and undo tablespaces 2. SDI should be updated after ALTERs 3. ibd2sdi should be able extract SDI from InnoDB tablespace when server is offline 4. SDI stored in tablespace is in JSON format and written by Server. So the output from ibd2sdi should be JSON parseable. Non-Functional: --------------- 1. ALTER performance shouldn't be degraded because of SDI.
Reasons for changing design:
1. The original requirement of two copies is not met
The original requirement from support is that, we should have two copies
of SDI and these two copies should be far away in .ibd file. So even if one
SDI is corrupted because of bad disk(page corruption), there are more
chances to recover if the SDI copies are far.
This far requirement cannot be met. So there is equal chance of
corrupting both copies.
[Proposed Change]: Remove one copy
2. Increase in .ibd size
Each copy will create one root page. Assuming 16K page size,
32K size increase per .ibd file.
This might look OK. But with 1 million tables, the overhead is 20G for SDI
pages.
For file per table tablespaces, there will be only one entry
(or at max 2-3 entries). So even one SDI index page is overkill on
storage space.
We cannot go for further optimization because the SDI storage promised is
"transactional". To be transactional, undo logging & redo logging is
necessary.
And undo logging is not possible without an index page. So, we have to
live with one SDI index page even for file_per_table tablespaces.
[Proposed Change]: One copy will reduce the size but still 10G overhead.
3. Trying to be too perfect
Perfect here means, even after a DDL crash, the aim is to have one copy of
SDI as it was before DDL and another copy of SDI after WL.
We realized that the main purpose of SDI is to extract data when we only
have an .ibd. So as long as we can get the schema for clustered index,
we can extract clustered index data and rebuild other indexes.
For file per table, when a clustered index is changed, we will always have
intermediate .ibd. This intermediate ibd will contain new SDI.
[Proposed Change]: See below
4. Flushing tricks assumed in design
It is assumed that we could flush one copy of SDI before trx commit and
another after trx commit.
This is not straight forward. We need more tweaks in flushing to achieve
this. Lesser the tweaks, the better
[Proposed Change]: Since the changes to SDI pages are undo & redo logged, there
is no need to flush SDI pages during ALTER. But it is required only for
those cases where undo & redo is not available. SDI will not be consistent
without applying undo & redo.
For version 1, we will not do any extra flushing of SDI pages. Undo & redo
has to be available to make SDI consistent.
5. No version for SDI
One mistake in existing design is that, there is no version tag for on-disk
SDI. So it is hard to change formats of SDI.
[Proposed Change]: Since one copy is removed, we will use that 4 bytes as
version number at FIL_PAGE_FILE_FLUSH_LSN offset on pages 1 & 2.
Summary(tasks in this WL)
-------------------------
1. Remove one SDI copy. Remove all code which deals with two copies.
2. Use 4 bytes of FIL_PAGE_FILE_FLUSH_LSN offset on pages 1 & 2 as version 1
3. SDI updates should be done in the same trx used for DDL. There will be only
single transacation that covers entire DDL.
For integration purpose, we will
create a internal transaction to do SDI changes.
4. InnoDB will compress the SDI and then store it. This will reduce the size
of the SDI to be stored. This applies to uncompressed tablespaces as well.
ibd2sdi will decompress the SDI first.
To handle compression, we need two extra fields compressed_len and
ucompressed_len. So the schema of SDI table becomes
CREATE TABLE SDI(
type INT UNSIGNED,
id BIGINT UNSIGNED,
uncompressed_len INT UNSIGNED,
compressed_len INT UNSIGNED,
compressed_data BLOB NOT NULL,
PRIMARY KEY (type,id));
Compression library used is zlib.
5. Enable SDI in mysql tablespace (mysql.ibd)
6. API changes:
------------
i) Change dd::sdi_key_t from (id, type) to (type, id)
After this change, objects of same type will be stored together.
In future, accessing objects of same type could be made faster,
because they are stored continuously
Since the format of SDI is different, ibd2sdi tool should handle this.
ii)Change return types of SDI APIs from bool to uint32_t. This allows APIs
to return more error code. Zero on success, Non-zero on failure
iii)Add dd::Table* to SDI APIs
Unlike other SE's which stores SDI in files, InnoDB stores SDI in
tablespaces.
So a dd::Table object can be serialized and stored only after it has
valid se_private_id. This is because of presence of valid se_private_id
indicates table is created, so a tablespace (implicit) is also created.
Similary for partitions, a valid se_private_id of all partitions indicate
the tablespaces for partitions are created.
So sdi_set() can be called twice for same dd::Table object. Once without
valid se_private_id and later with valid se_private_id.
SE is responsible for this check and it should avoid storing sdi if
dd::Table or Partitions do not have valid se_private_id.
Thus, to do these checks, dd::Table* is added to SDI APIs
sdi_set() & sdi_delete()
Side notes:
------------
For partitions, the SDI is stored as a whole(includes all partition info).
For example, if table has 10 partitions, and lets say each partition goes
to a separate tablespace(file_per_table), then SDI will be stored 10 times
in all the partitioned tablespaces.
SDI is with all partitions info, stored in each tablespace.
Another example:
If there is table with 10 partitions, 4 of them in one tablespace
(general tablespace) and remaining 6 as file_per_table tablespaces.
In this case, there will be 7 sdi_set(). Once per each tablespace of each
partitions. For tablespace where it is already stored, duplicate stores
are avoided.
In the above case, where 4 partitions are in one general tablespace,
there will only one sdi_set(table_blob_with_all_partiton_info);
instead of 4.
Copyright (c) 2000, 2025, Oracle Corporation and/or its affiliates. All rights reserved.