WL#8619: InnoDB: Provide offline database portability without ISL files

Affects: Server-8.0   —   Status: Complete   —   Priority: Medium

When data sets are transferred to a different location, we cannot rely on persistent data files (such as data dictionary tables or the InnoDB redo log) to contain file paths that need to be rewritten.

The ultimate 'source of truth' about where the tablespace can be found should be the data dictionary. But if a database is moved to a new location while offline, the DD needs to be updated at startup.

The *.isl files that were introduced in 5.6 and removed in 8.0 provided a limited way to rename data files during server startup. There could be a large number of these tiny files, and the files could introduce an inconsistency. For any given tablespace file, we could have multiple conflicting sources of information in MySQL 5.6 and MySQL 5.7:

1. The file name in the data dictionary

for example, '/foo/test/t1.ibd'

2. The contents of an ISL file in the data directory

for example, 'test/t1.isl' contains '/bar/test/t1.ibd'

3. A data file exists in the data directory

for example, 'test/t1.ibd' exists

There are no clearly defined semantics for resolving these conflicts and adjusting the data dictionary. The situation is particularly tricky when recovering from a crash (or restoring a hot backup), because the data dictionary is not available for reading or writing.

To avoid such unclear situations with multiple redundant copies, we will use a new start-up option called --innodb-directories that provides directory locations to scan for datafiles.

Using --innodb-directories=dir, files can be relocated while no engine is using them. Users will be able to move files while the server is offline, no matter how the server went offline (shut down, forcibly killed, or crashed).

The directories listed in --innodb-directories are scanned for filenames ending in '.ibd'. An in-memory mapping is created. Then during normal startup using the tablespaces in the DD, the scanned filenames are used instead of the filenames in the DD and the DD will be updated to refer to these new filepaths.

This solution has the advantage that it does not add any new way of providing a datafile name. Anything could be written into an ISL file and thus InnoDB v5.7 needed to validate that file name. But this solution puts into the mapping, file filenames that are in the data dictionary or have been found on the file system through scanning directories.

So we need no special measures to maintain portability between Windows and the rest of the computing world. If someone renames a file while moving it to a scan directory, they should avoid ‘funny’ characters that are prohibited or have a special meaning on some other file system or operating system unless they have no intention of ever moving the file there.

It will be up to the user to ensure proper replication of DDL statements when replicating between Windows and other systems. This affects statements like;

   CREATE TABLE…DATA DIRECTORY
   CREATE TABLE…INDEX DIRECTORY
   CREATE TABLESPACE . . . ADD DATAFILE

When starting up with --innodb-directories, InnoDB will search for all filenames in those directories that end with '.ibd'. It will read the header of those files and find the space_id. It will collect unique tuples of (space_id,filename). If two different files have the same (space_id,first_page_no) then startup will fail with a message logging the duplicate file names.

Once we have a full set of (space_id, filename) tuples, this mapping is used in the normal process of opening files from either log-based discovery or dictionary based discovery so that the scanned filepaths are used.

If the redo log was processed correctly, then InnoDB would update the contents of the data dictionary whenever an entry in the file mapping disagrees with the data dictionary.

FR1: Portability when avoiding absolute pathnames in SQL

If no absolute pathnames have been used in CREATE TABLESPACE…ADD DATAFILE or TABLE…DATA DIRECTORY, the following shall hold:

FR1.1: The server must start successfully, even if data files were moved and datadir adjusted accordingly.

FR1.2: The server must successfully perform crash recovery, even if the server was killed before restarting.

FR1.3: Both crash recovery and normal startup shall succeed after moving files between Microsoft Windows and POSIX-compliant systems.

If absolute path names were used or, files have been relocated to a different directory outside of --datadir then those locations must be set in --innodb-directories.

FR2: Portability requirements

FR2.1: When the server is shut down or killed, ibd and system tablespace files can be moved, and the server restarted with --innodb-directories="dir1<;dir2;...;dirN>", so that the server will reflect the new locations. Wildcard expressions are not accepted.

FR2.2 Implicitly created undo tablespace files named "undo_001", etc must be located in --innodb-undo-directory, just as they have always been. They can be move while offline, as before, as long as -innodb-undo-directory is updated to reflect the new location.

FR3: Validation and Impact of --innodb-directories

FR3.1: Startup will be aborted if the directory list provided in --innodb-directories cannot be parsed according the rules for this setting.

Main Changes

The redo log version number has been incremented from 2 to 3. Redo log before this WL cannot be parsed in version 3. Uses have to do a clean shutdown to upgrade.

Following redo log type handling has been removed

  1. MLOG_FILE_CREATE2
  2. MLOG_FILE_RENAME2
  3. MLOG_FILE_OPEN

New redo log codes are:

  1. MLOG_FILE_CREATE
  2. MLOG_FILE_RENAME

InnoDB must be able to;

  1. Scan all directories in --innodb-directories on every startup.
  2. The following paths "--innodb-home-directory;--innodb-undo-directory;--datadir" are always appended to --innodb-directories. This full list of directories, including all sudirectories found within each, is consolidated by deleting duplicates and each unique directory remaining is searched for ".ibd" files.
  3. Implicitly created undo tablespaces must still be located only in --innodb-undo-directory.
  4. The contents of dd::Tablespace_file will be updated if the file name scanned and opened is different.

File Discovery

Because all InnoDB data file names used by the MySQL server will be persistently stored inside the data dictionary tables, all InnoDB files that belong to an instance can be found by querying the Global DD view INFORMATION_SCHEMA.FILES.

Previous rules for where an IBD file can be location will be enforced during startup when the scanned list of IBD files is compared to the DD. There are really just two:

  1. File-per-table IBD files cannot be moved into a directory that does not match the schema name.
  2. General Tablespace IBD files cannot be moved into any location under the --datadir

Portability to a Different OS

When a database is moved from one OS to another with different file naming conventions (posix compliant vs Windows), the new file names will be built using the directory paths found in --innodb-directories. Paths from the other OS found in the mapping file or the dd will be replaced with what is found during the scan. Thus --innodb-directories is essential to porting a database to another system unless all paths used are relative paths to the default data directory settings.

This simple class tracks the table space ID to physical filename mapping.It's a light weight front end to a std::unordered_map.

class Tablespace_files { public:

       using Names = std::vector>;
       using Paths = std::unordered_map;
       /** Default constructor */
       Tablespace_files() : m_paths() { }
       /** Add a space ID to filename mapping.
       @param[in]      space_id        Tablespace ID
       @param[in]      name            File name.
       @return number of files that map to the space ID */
       size_t add(space_id_t space_id, const std::string& name);
       /** Get the file names that map to a space ID
       @param[in]      space_id        Tablespace ID
       @return the filenames that map to space id */
       Names* find(space_id_t space_id);
       /** Remove the entry for the space ID.
       @param[in]      space_id        Tablespace ID mapping to remove */
       void erase(space_id_t space_id);

private:

       /** Mapping from tablespace ID to filenames. */
       Paths              m_paths;

};

Since directories contain files the actually mapping is:

class Tablespace_dirs {

   std::string path to directory scanned.
   std::vector of Tablespace_files found in the subtree.

};

When the server is started up, walk the directories listed in innodb-directories, innodb-data-home-dir and --datadir. Look for files that end in .ibd or have the prefix "undo". Read the first n pages to determine the table space ID and if it's a valid table space add it to the Tablespace_files class. If duplicates are found then complain and exit. The user must disambiguate by moving the conflicting file(s) out of the way. Files names are converted to the real path when comparing to get around symlinks. When changing to the new path, use a path that is close to the original path see:

/** Lookup the tablespace ID and return the path to the file. The idea is to
make the path as close to the original path as possible. If we can't find a
match then use the real path that was found during scanning.
@param[in]      old_path        Old/Original path name, could be relative
@param[in]      new_path        Real path found during the scan
@return path name that is closer to the old_path or new_path as is */
std::string fil_make_relative_path(const char* old_path, const std::string& new_path);

Additional changes:

  1. Remove redo logging of file open. This was introduced in WL#9499 as a stop gap arrangement.
  2. Fix the fil_sys mutex performance issue by sharing it. The shards are hashed on the table space ID. The mutex coverage rules have not changed, except that a table space is now covered by its shard mutex instead of the fil sys mutex.
  3. New class File_shard that is split from file_sys.

class Fil_shard {

...
private:
       /** Fil_shard ID */
       const size_t    m_id;
       /** Tablespace instances hashed on the space id */
       Spaces          m_spaces;
       /** Tablespace instances hashed on the space name */
       Names           m_names;
       /** Base node for the LRU list of the most recently used open
       files with no pending I/O's; if we start an I/O on the file,
       we first remove it from this list, and return it to the start
       of the list when the I/O ends; log files and the system
       tablespace are not put to this list: they are opened after
       the startup, and kept open until shutdown */
       File_list       m_LRU;
       /** Base node for the list of those tablespaces whose files
       contain unflushed writes; those spaces have at least one file
       where modification_counter > flush_counter */
       Space_list      m_unflushed_spaces;
       /** When we write to a file we increment this by one */
       int64_t         m_modification_counter;
       /** Number of files currently open */
       static std::atomic_size_t       s_n_open;
       /** ID of shard that has reserved the open slot. */
       static std::atomic_size_t       s_open_slot;

};

The only new rule for mutexes is that when updating the shared data members of Fil_sys all the Fil_shard mutexes must be locked in ascending order. When looking up a table space instance from a table space name all the shard mutexes must be acquired. Since we don't know which shard the table space could be in. There are new latching rules added to track violation of out of order mutex acquires.

The DD table space paths are updated in:

/** Discover all InnoDB tablespaces.
@param[in,out]  thd     thread handle
@retval true    on error
@retval false   on success */
static MY_ATTRIBUTE((warn_unused_result))
bool  boot_tablespaces(THD* thd)
Pseudo code is as follows:
foreach (table space : DD::tablespace_file) {
  if (is_ibd_tablespace(space) {
    switch (lookup(space.id, in: space.name, out: scanned_name)) {
    case MISSING:
      print_missing_error_message;
    case DELETED:
      print_deleted_error_message;
      break;
    case MOVED:
      update_dd_tablespace(space, scanned_name);
      break;
    case MATCHED:
      break;
    }
  }
}