WL#2910: Implement Disk-based Indexes for MySQL Cluster
Status: Assigned — Priority: Medium
Implement Disk-based Indexes for MySQL Cluster A user today can specify a table to store it's columns on disk by default by adding the option STORAGE DISK as a table characteristic. The user can also add this option as an column characteristic. In the current MySQL Cluster code if the table characteristic is DISK and a column is indexed, this column will be converted to a MEMORY column. However if the column characteristic is explicitly specified as DISK then indexing this column will fail since MySQL Cluster currently doesn't support disk indexes. The change proposed in this worklog to add disk indexes in MySQL Cluster will always remove this error since it will be possible to create disk-based indexes in MySQL Cluster. The question is then whether to retain the default behaviour in the current MySQL Cluster version. The proposal here is to add a config variable that specifies whether the old behaviour is desirable. This would mean that a disk index can only be created on columns where the column is explicitly specified as DISK. It's important to have this config variable to be able to handle backwards compatibility. It is however not a good idea to use this as the default behaviour of MySQL Cluster. The default behaviour should be to create a disk index whenever any of the columns are stored on DISK. This means that if the user specified the table to be STORAGE DISK, then all indexes on the table will be disk indexes unless the user have explicitly overridden the default on specific columns using STORAGE MEMORY for all columns of an index. This means that also the primary key index can be a disk index. So as an example: CREATE TABLE t1 (a int primary key) STORAGE DISK; will create a table with one primary key index placed on disk and the column a will also be stored on disk. CREATE TABLE t1 (a int) STORAGE DISK; will create a table with two columns, one hidden column which the primary key index is based on and the a column which isn't indexed at all. Both the table and the primary key index will be disk-based. CREATE TABLE t1 (a int STORAGE MEMORY primary key, b int STORAGE DISK, index a(b)); will create a table with one traditional memory based hash index on the column a, the column a will also be memory based, the column will be stored on disk and also the column b will be disk-based. Current NDB ordered indexes are not recoverable, they're recreated at restart instead. Disk indexes will only support being recoverable, thus the parameter to set setLogging(FALSE) will be ignored. Also disk indexes will always be an ordered index, thus specifying UniqueHashIndex will create a unique index but it will still be an ordered B-Tree index. As part of this worklog we won't attempt to solve the problem for the MySQL optimizer to handle the new disk indexes. It will use the same method as is currently used and only adapt this to provide some information to the MySQL optimizer. Another worklog will have to handle this in a proper manner.
A new block DBINX will be added to NDB Kernel. This block is intended to implement a disk index which can be used for both unique and non-unique use cases. When creating a table it can be discovered that the table contains a disk-based column in the primary key. If this happens, the table will use a disk-based index primary key index implemented by DBINX. When creating an ordered index a special index table is created and we discover whether it is a disk index in the same manner as for normal tables.
The changes required in the MySQL layer is only the new configuration variable required to handle backwards compatibility and some minor changes in the NDB handler to make it possible to create indexes with fields stored on disk. This can be done by small changes in the NDB handler and should require no changes in the interface to create a table in the NDB API since the NDB Kernel will use a very simple logic whereby an index is made into a disk index if any of the columns it's based on is stored on disk. Performing a LQHKEYREQ request: Definition 1: ACC will be used as Lock Manager Definition 2: The actual disk-based index will be implemented in a new block called DBINX. Definition 3: Each table requires at least a disk-based index for the primary key of the table. Definition 4: Usage of Change Buffering is choosen based on an algorithm, the basic idea is that when accesses often are disk-based (cache misses) then we use change buffering and when most page accesses are found in page cache then there is little need of change buffering. Corollary 1: Definition 1 means that upon access of ACC using ACCKEYREQ operations of type ZREAD, ZUPDATE, ZDELETE, ZWRITE cannot expect to find a record in ACC even if the record exists. They need to insert this record if it doesn't exist. This means that operation type in ACCKEYREQ must be mapped as follows: ZREAD => ZWRITE ZUPDATE => ZWRITE ZDELETE => ZWRITE ZWRITE and ZINSERT retain their operation type. The lock type flag will however not change, thus we can access ACC using the combination ZWRITE with Read lock Corollary 2: Upon commit or abort of an operation we need to remove the entry in ACC unless it has a lock queue. This will work fine if Commit views the operation as a delete operation. For aborts it will work fine if it's viewed as an abort of an INSERT operation. With those simple modifications ACC should be reusable for use as a LOCK manager. Corollary 3: All locked tuples will be present in ACC and in a memory based shadow table stored in TUP, this table needs to store at a minimum the primary key of the table plus a reference to the disk record (which is NULL until the Record Reference (RR) have been read from the disk index). The reason is to avoid complicating the design of ACC by reintroducing storage of primary keys in ACC. The reason for storing the RR in this shadow table is a small optimisation to avoid the need to look up in the disk index while the record is already locked by another operation. Corollary 4: When statistics shows that we should perform change buffering we do that by storing a part of the table in the shadow table normally used to hold the primary key used by ACC. Lemma 1: Corollary 4 and the fact that we don't store all fields in the shadow table and dependent on the state, we will store different fields. These facts together means that all fields possible to store in shadow table must be part of metadata of shadow table. It also becomes clear that all fields except the primary key must be representable as Non-existent. This is not the same as NULLable since a field can be Existent and NULL. Non-existent fields have an unknown value, Existent fields have a known value which could be NULL.
Copyright (c) 2000, 2015, Oracle Corporation and/or its affiliates. All rights reserved.