WL#2910: Implement Disk-based Indexes for MySQL Cluster

Status: Assigned

A user today can specify that a table store its columns on disk by default
by adding the option STORAGE DISK as a table characteristic. The user can also
add this option as a column characteristic.

In the current MySQL Cluster code, if the table characteristic is DISK and a
column is indexed, this column will be converted to a MEMORY column. However,
if the column characteristic is explicitly specified as DISK, then indexing
this column will fail, since MySQL Cluster currently doesn't support disk
indexes.

The change proposed in this worklog, adding disk indexes to MySQL Cluster,
removes this error entirely, since it will then be possible to create
disk-based indexes in MySQL Cluster.

The question is then whether to retain the default behaviour of the current
MySQL Cluster version. The proposal here is to add a config variable that
specifies whether the old behaviour is desired. With this variable set, a
disk index can only be created on columns that are explicitly specified as
DISK. It's important to have this config variable in order to handle
backwards compatibility. It is, however, not a good idea to use this as the
default behaviour of MySQL Cluster.

The default behaviour should be to create a disk index whenever any of the
indexed columns is stored on DISK. This means that if the user specified the
table to be STORAGE DISK, then all indexes on the table will be disk indexes
unless the user has explicitly overridden the default by specifying
STORAGE MEMORY for all columns of an index.

This means that even the primary key index can be a disk index.

So as an example:

CREATE TABLE t1 (a int primary key) STORAGE DISK;
will create a table with one primary key index placed on disk; the column
a will also be stored on disk.

CREATE TABLE t1 (a int) STORAGE DISK;
will create a table with two columns: one hidden column, which the primary
key index is based on, and the column a, which isn't indexed at all. Both
the table and the primary key index will be disk-based.

CREATE TABLE t1 (a int STORAGE MEMORY primary key,
                 b int STORAGE DISK,
                 index a(b));
will create a table with one traditional memory-based hash index on the
column a; the column a will also be memory-based. The index on the column b
will be disk-based, and the column b itself will be stored on disk.

Current NDB ordered indexes are not recoverable; they are instead recreated
at restart. Disk indexes will only support being recoverable, so the
setLogging(FALSE) parameter will be ignored. Also, a disk index will always
be an ordered index, so specifying UniqueHashIndex will create a unique
index, but it will still be an ordered B-Tree index.
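
As a sketch of what this means at the NDB API level, the fragment below uses
the existing NdbDictionary calls; the ignore-setLogging behaviour is what
this worklog proposes, and the table and column names are made up:

#include <NdbApi.hpp>

// Sketch: create an ordered index on a STORAGE DISK column through the
// existing NdbDictionary API. Under this worklog the kernel would make
// the index a recoverable disk-based B-Tree and ignore setLogging(false).
int create_disk_index(Ndb *ndb)
{
  NdbDictionary::Dictionary *dict = ndb->getDictionary();

  NdbDictionary::Index idx("idx_b");
  idx.setTable("t1");
  idx.setType(NdbDictionary::Index::OrderedIndex);
  idx.addColumnName("b");  // b is stored on disk
  idx.setLogging(false);   // ignored for disk indexes per this worklog

  return dict->createIndex(idx);  // 0 on success
}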

As part of this worklog we won't attempt to solve the problem of making the
MySQL optimizer handle the new disk indexes. The optimizer will use the same
method as is currently used, adapted only to provide it with some information
about disk indexes. Another worklog will have to handle this in a proper
manner.

A new block, DBINX, will be added to the NDB Kernel. This block is intended
to implement a disk index which can be used for both unique and non-unique
use cases.

When creating a table it can be discovered that the table contains a
disk-based column in the primary key. If this happens, the table will use a
disk-based primary key index implemented by DBINX.

When creating an ordered index, a special index table is created, and we
discover whether it is a disk index in the same manner as for normal tables.

The changes required in the MySQL layer are only the new configuration
variable needed to handle backwards compatibility and some minor changes in
the NDB handler to make it possible to create indexes on fields stored on
disk. This can be done with small changes in the NDB handler and should
require no changes to the table-creation interface in the NDB API, since
the NDB Kernel will use a very simple rule whereby an index is made into a
disk index if any of the columns it is based on is stored on disk.
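
A minimal sketch of that rule follows; the struct names and fields are
hypothetical stand-ins, not actual NDB Kernel data structures:

// Hypothetical sketch of the kernel-side rule: an index becomes a disk
// index if any of its key columns is stored on disk. ColumnDef and
// IndexDef are illustrative, not real NDB Kernel types.
struct ColumnDef {
  bool storage_disk;  // true if the column has STORAGE DISK
};

struct IndexDef {
  const ColumnDef *key_columns;
  unsigned key_column_count;
};

static bool index_is_disk_based(const IndexDef &idx)
{
  for (unsigned i = 0; i < idx.key_column_count; i++) {
    if (idx.key_columns[i].storage_disk)
      return true;   // one disk column is enough
  }
  return false;      // all key columns in memory => memory index
}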

Performing a LQHKEYREQ request:
Definition 1: ACC will be used as Lock Manager
Definition 2: The actual disk-based index will be implemented in
a new block called DBINX.
Definition 3: Each table requires at least a disk-based index for
the primary key of the table.
Definition 4: Usage of Change Buffering is chosen based on an algorithm;
the basic idea is that when accesses are often disk-based (cache misses)
we use change buffering, and when most page accesses are found in the
page cache there is little need for change buffering.
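
One possible shape of that heuristic is sketched below; the statistics
structure and the fifty-percent threshold are assumptions for illustration,
not part of this worklog:

// Illustrative sketch of Definition 4: track the recent page-cache hit
// rate for the index and buffer changes when most accesses miss the
// cache. IndexAccessStats and the threshold are assumptions.
struct IndexAccessStats {
  unsigned long long cache_hits;
  unsigned long long cache_misses;
};

static bool use_change_buffering(const IndexAccessStats &s)
{
  unsigned long long total = s.cache_hits + s.cache_misses;
  if (total == 0)
    return false;  // no statistics yet: access pages directly
  // Buffer changes when fewer than half of the accesses hit the cache.
  return s.cache_hits * 2 < total;
}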

Corollary 1:
Definition 1 means that upon accessing ACC using ACCKEYREQ,
operations of type ZREAD, ZUPDATE, ZDELETE and ZWRITE cannot expect
to find a record in ACC even if the record exists. They need to
insert this record if it doesn't exist. This means that the operation
type in ACCKEYREQ must be mapped as follows:
ZREAD => ZWRITE
ZUPDATE => ZWRITE
ZDELETE => ZWRITE

ZWRITE and ZINSERT retain their operation type.
The lock type flag does not change, however, so we can access
ACC using, for example, the combination of ZWRITE with a read lock.
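
Expressed as code, the mapping could look like the following sketch; the
Zxxx operation types exist in the NDB Kernel, but the enum values and the
helper function itself are illustrative:

// Sketch of the Corollary 1 operation-type mapping. The lock-type
// flag is passed through untouched; only the operation type changes.
enum AccOpType { ZREAD, ZUPDATE, ZINSERT, ZDELETE, ZWRITE };

static AccOpType map_acc_op_type(AccOpType op)
{
  switch (op) {
  case ZREAD:
  case ZUPDATE:
  case ZDELETE:
    return ZWRITE;  // must be able to insert the ACC entry if missing
  default:
    return op;      // ZWRITE and ZINSERT retain their operation type
  }
}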

Corollary 2:
Upon commit or abort of an operation we need to remove the
entry in ACC unless it has a lock queue. This will work fine
if commit views the operation as a delete operation. For aborts
it will work fine if the operation is viewed as an abort of an
INSERT operation. With these simple modifications ACC should be
reusable as a lock manager.
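
A compressed sketch of that release rule, with a hypothetical AccEntry type
standing in for ACC's internal state:

// Hypothetical sketch of Corollary 2. On commit the entry is removed
// as if it were deleted; on abort it is undone as if an INSERT were
// aborted. Either way the entry survives only while a lock queue
// remains, so the next waiter can take it over.
struct AccEntry {
  bool has_lock_queue;  // other operations queued on this lock?
  bool present;         // entry currently stored in ACC
};

static void release_acc_entry(AccEntry &entry)
{
  if (entry.has_lock_queue)
    return;               // the next queued operation keeps the entry
  entry.present = false;  // no waiters: remove the entry from ACC
}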

Corollary 3: All locked tuples will be present in ACC and in
a memory-based shadow table stored in TUP. This table needs
to store, at a minimum, the primary key of the table plus a
reference to the disk record (which is NULL until the Record
Reference (RR) has been read from the disk index). The reason
is to avoid complicating the design of ACC by reintroducing
storage of primary keys in ACC. The reason for storing the RR
in this shadow table is a small optimisation to avoid the need
to look up the disk index while the record is already locked
by another operation.
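
A sketch of what a shadow-table row might hold follows; the types are
illustrative, not actual TUP structures:

// Illustrative shadow-table row from Corollary 3. The RR starts out
// unknown and is cached once it has been read from the disk index, so
// a later operation on the same locked record can skip that lookup.
struct ShadowRow {
  const unsigned char *primary_key;  // packed primary key of the table
  unsigned pk_length;                // length of the packed key in bytes
  bool rr_known;                     // false until the RR has been read
  unsigned long long record_ref;     // disk Record Reference (RR)
};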

Corollary 4: When statistics show that we should perform change
buffering, we do so by storing a part of the table in the
shadow table normally used to hold the primary key for ACC.

Lemma 1: Corollary 4, together with the facts that we don't store all
fields in the shadow table and that, depending on the state, we store
different fields, means that all fields that can possibly be stored in
the shadow table must be part of the shadow table's metadata. It also
becomes clear that all fields except the primary key must be
representable as Non-existent. This is not the same as NULLable, since
a field can be Existent and NULL. Non-existent fields have an unknown
value; Existent fields have a known value, which may be NULL.
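
The distinction can be captured with a tri-state field representation,
sketched below with illustrative types:

// Sketch of Lemma 1: every non-primary-key field in the shadow table
// is tri-state. Non-existent means the value is unknown and must be
// fetched from disk; NULL is a known value. Types are illustrative.
enum FieldState {
  FIELD_NON_EXISTENT,  // value unknown (not buffered in the shadow table)
  FIELD_NULL,          // value known to be NULL
  FIELD_VALUE          // value known and stored below
};

struct ShadowField {
  FieldState state;
  const unsigned char *value;  // valid only when state == FIELD_VALUE
  unsigned length;             // valid only when state == FIELD_VALUE
};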