WL#3600: NDB schema transactions

Status: In-Progress

Description
High Level Architecture
Low Level Design

WL#3600 HLD

NDB schema transactions and operations
--------------------------------------

Schema here is synonym for metadata or dictionary data.

Metadata contains definitions of tables, indexes, etc.  For
various reasons, NDB cannot store this information in database
tables.  Instead it is stored in special structures on each
database node.

Metadata operations include create, alter, and drop.

Metadata operations (unlike data operations) in NDB are not
done within a transaction.  Normal errors and node failures are
handled poorly or not at all in many cases.  These problems
lead to corrupted databases seen at customer sites.

This wl# implements metadata transactions and common structure
for metadata operations.  Old operations are converted and
normal error handling is fixed.  The new structure is then used
to handle node failures.

An added benefit is that metadata transactions are much faster
than individual operations.  NDB API, and also MySQL level can
make good use of this.

NDB schema transactions and operations
--------------------------------------

Why
---
DBDICT is the NDB block (software module) which handles
dictionary data (metadata).  For various reasons, metadata
cannot be stored in tables, but is instead stored in DBDICT
structures which are identical on every node.  There is
a DBDICT master node (same as "usual" master) which controls
each metadata operation (create, alter, drop).

Originally there was only tables and attributes.  Over the years
new types have been added, such as blobs, indexes, triggers,
and events.  These additions share little infrastructure or
methodology.  Often they require a sequence of external (NDB API)
or internal (DBDICT) operations.  For example:

- create table and its blob tables (external)

- create index and its index table (index is kind of table) and
  its index triggers, and build the index entries (internal)

- drop table and its blob tables and indexes and events (both)

This has led to increasing problems:

1) No schema transactions.  If the api (mysqld) is stopped in
the middle of an operation sequence, an inconsistent state
is left in DBDICT.  In particular with ALTER TABLE we have seen
corrupted databases which cannot be restarted.

2) Normal failure handling is poor in many cases.

3) Some operations handle slave-NF (non-master node failure).
Others do not and may leave the nodes inconsistent or the master
"busy" forever.

4) No operation handles master-NF.  This is much more complex
than slave-NF and impossible to fix before there is a common
infrastructure for schema transactions and operations.

5) Adding new operations is "ad-hoc" and duplicates code.
For example "add node+table reorg" will be controlled by DBDICT
and uses about 20 separate old and new operation types.

6) Originally NDB was planned to have max 20 tables (good enough
for an HLR).  Current installations have hundreds of metadata
objects.  DBDICT is slow due to having to sync each operation to
disk separately.  Schema transactions allow batching of disk I/O.
This is faster, for same reasons that a disk based dbms like
Oracle is much faster with large transactions.

7) Concurrent operations are not allowed.  This is only an
inconvenience to users.

What
----
This wl# includes:

A. implement schema transactions and operations
B. convert existing operations

These fix most problems.  Following from the list are not fixed
by the general design:

C. master-NF handling (prio: critical)
D. concurrent operations (prio: medium)

They may become separate wl# entries later.

How
---
A. implement schema transactions and operations

1) Schema operations are always done within a schema transaction
controlled by the user.  For example:

  NDB API <--> DBDICT
    -> begin schema trans
    <- CONF : trans started
    -> create table
    <- CONF : operation added (returns table id,version)
    -> end schema trans (commit=true)
    <- CONF : trans completed

2) Each request (begin trans / add op / end trans) is received
by master.  The request is distributed to all participants
(i.e. master and slave nodes) before a CONF is sent.

3) Each operation received is parsed by master.  Objects are
allocated or looked up, and variable data is packed into pages
(sections) attached to the operation.  Everything is distributed
to participants and slaves repeat the parse step.  Objects are
created in memory in un-committed state and can be referred to by
subsequent operations in the same transaction.

4) Any sub-operations (such as create blob tables) are added
after the operation.  The master sends operation requests to
itself and repeats step 3) for each.

5) After each 3)-4) operation CONF is sent to user.  At the end,
identical transactions and operation sequences have been created
in memory on all participants.

6) At end schema trans (commit=true), the operation sequence
is executed in several phases 8).  Each phase iterates over all
operations.  The master handles one operation at a time and
orders the participants to execute it and to report back.
This is a simple ping-pong protocol (where master plays all
tables simultaneously).  Special cases in 12)-14).

7) The "local" operation on the participant can be complex.
Often it needs to access other blocks or to do disk I/O.
In general it will only access blocks on the local node.
An exception is hash index build where rows scanned locally
are inserted on random nodes.

8) Initial phase is the "parse" 3)-5) which builds up operations.
Main execution phases are prepare, commit, and complete.  This is
normal 2-phase commit stuff.

9) Prepare phase. [todo]

10) Commit phase. [todo]

11) Complete phase.  Discussed under C.

12) Special phases.  A "simple phase" does not iterate over
operations.  A "master phase" is run only on master.

13) More phases can be added easily.  For example pending disk
write of schema file can be separated from prepare and commit
phases into simple (non-master) phases.

14) The execution of an operation can be repeated a number of
times.  Each iteration is synchronized by the master.  For
example drop table iterates over NDB blocks in both prepare
and commit phases.

15) On error before commit, the sequence of actions done so far
is done backwards in "abort mode".  Abort in prepare phase can
call abort method for an operation twice: a) abort prepare
(external changes) b) abort parse (in-memory DICT changes).

Error after commit start is by design not handled.  If a node
cannot commit, it must crash.

16) A schema query "can see own (uncommitted) transaction".
This is an essential feature.  It is also useful in the beginning
to allow NDB API run operation sequences before they are moved
to DBDICT.

17) SR (system restart) and NR (node restart) use internal
transaction for each index activation and build.  For NR the
trans is local i.e. only this node is involved.

B. convert existing operations

This should be a simple but tedious rewrite.

C. master-NF handling

[ todo ]

D. concurrent operations

[ todo ]

NDB schema transactions and operations
--------------------------------------

Only things which are not obvious from the HLS.

NDB API schema transactions
---------------------------

- transaction is represented by NdbDictionary::Dictionary
  int beginSchemaTrans();
  int endSchemaTrans(bool commit);

- all ops done within begin / end schema trans
  uses transId (of api) and transKey (of DICT)
  the transId is extra check

- one trans at a time per Ndb / Dictionary

- stand-alone (old-style) ops use implicit trans
  implemented (on high level) in NdbDictionary.cpp

- DICT does not fix missing begin / end trans (so far, anyway)

- no operation record on API side (can use counter etc)

- schema queries must be "able to see own transaction"
  done by adding and checking transKey
  fragment info is nasty since DIH is unmodified in PARSE phase
  needs to be "re-factored" from DIH

Transaction record
------------------

    struct SchemaTrans;
    typedef Ptr SchemaTransPtr;
    SchemaTransPtr trans_ptr;

Kept in DLHashTable with key "trans_key".
Only one instance is allowed now.

Operation record
----------------

Consists of 2 records, the second one specific to operation type.
Use CreateIndex as example.

1)  struct SchemaOp;
    typedef Ptr SchemaOpPtr;
    SchemaOpPtr op_ptr;

Kept in DLHashTable with key "op_key".

2)  struct CreateIndexData;
    typedef Ptr CreateIndexDataPtr;
    CreateIndexDataPtr data_ptr (or, createIndexPtr);

Kept in simple ArrayPool.

The two are always seized and released together.  Two records are
needed since inheritance does not work with our record pools.

Method pointers
---------------

Static info for each operation type.

    struct OpInfo {
      const char m_opType[4]; // e.g. CTa for CreateTable
      // internal signal
      Uint32 m_impl_req_gsn;
      Uint32 m_impl_req_length;
      // helpers
      bool (Dbdict::*m_seize)(SchemaOpPtr&);
      void (Dbdict::*m_release)(SchemaOpPtr&);
      // parse phase
      void (Dbdict::*m_parse)(Signal* signal, SchemaOpPtr, ErrorInfo&);
      bool (Dbdict::*m_subOps)(Signal*, SchemaOpPtr);
      void (Dbdict::*m_reply)(Signal*, SchemaOpPtr);
      // run phases
      void (Dbdict::*m_prepare)(Signal*, SchemaOpPtr);
      void (Dbdict::*m_commit)(Signal*, SchemaOpPtr);
      void (Dbdict::*m_complete)(Signal*, SchemaOpPtr);
      // abort
      void (Dbdict::*m_abortParse)(Signal*, SchemaOpPtr);
      void (Dbdict::*m_abortPrepare)(Signal*, SchemaOpPtr);
    };

Polymorphism
------------

Data record inherits from OpData

    struct OpData {
      const OpInfo& m_op_info;
      Uint32* const m_impl_req_data; // internal sig
      Uint32 m_obj_ptr_i; // object operated on
    };
    struct CreateTriggerData : public OpData {...

Data record provides following "reflection" info which allows
single template function to seize/find/release the 2 records

    static ArrayPool& getPool(Dbdict* dict);
    static const OpInfo g_opInfo;

The static g_opInfo is member in base class OpData.  This allows
generic code to access OpInfo.

External vs internal signal
---------------------------

For each operation type there is external (or, client) signal and
internal signal.  This is used to clearly separate the protocols.
It does add some redundancy and code overhead.

- external: REQ enters black-box protocol, CONF returns.  Sender
  is NDB API, another block, or DICT master recursively to itself.

- internal: distributed within DICT and locally to other blocks.

Operation records are _always_ created by external signals.
Signal names are in App S.

In PARSE phase, the internal REQ is constructed by master and
piggy-backed on SCHEMA_TRANS_REQ to others.  Signal sections may
follow along or new ones may be created.

- internal signal is similar to external signal
  instead of transId, transKey it has opKey (or senderData)
  create signals add the newly allocated object id

- some internal signals are never sent but GSN_ is reserved anyway

- internal signal is stored in data record.  It is a convenient
  place to hold the "essential" info about the request

  struct DropIndexData : public OpData {
    DropIndxImplReq m_request;
    ...

Additional notes about external vs internal:

- object Version is extra verification between client and kernel.
  During PARSE it is sent (in internal sig) to participants to
  verify that all agree.  It is not sent/used by other blocks.

  sidenote: NDB confuses 2 meanings for Version, it is 1) part of
  object id 2) version of ALTERed object.  Should remove 1).

- client CREATE req has no object id (it does not exist!).  The
  corresponding internal sig does have object id.  BACKUP and SUMA
  create triggers in TUP, therefore acting as DICT and using
  internal signal CREATE_TRIG_IMPL_REQ.

Parse
-----

Using DropTable as example.

coordinator-master:

- receive client request
- call common method:

  SchemaOpPtr op_ptr;
  checkClientReq(signal, op_ptr, error);

  - check basics
  - find trans_ptr from req->transKey
  - seize op_ptr for master (allocates new op_key)
  - add op_ptr to trans_ptr
  - save clientRef, clientData in op_ptr
    for recursive ops this is DICT, not original client

- copy (parts of) external signal (client request) to internal signal
  which is stored under dropTablePtr.p->m_impl_req (aka m_request)

- call OpInfo::m_parse == dropTable_parse

  - check request fully
  - find or allocate the object
  - fill in data record and missing parts of m_impl_req
  - consume signal sections
  - master releases sections after sending parse request

- optionally make new signal sections (only CreateTable)

- save signal sections to DICT memory

- call schemaTrans_sendParseReq
  
  - send SCHEMA_TRANS_IMPL_REQ [PARSE] to all participants
  - piggy-back m_impl_req on SCHEMA_TRANS_REQ
  - any signal sections follow along

participant:

- op_key is both in SCHEMA_TRANS_REQ and in impl_req->senderData

participant-master:
  - find op_ptr
  - do _not_ call dropTable_parse
    this branch just provides a sanity check and a reply signal

participant-slave:
  - seize op_ptr with same op_key
  - un-piggy-back impl_req to normal Signal* position
  - copy Signal impl_req to dropTablePtr.p->m_impl_req

  - call OpInfo::m_parse == dropTable_parse

    - check request fully
    - check any data added my master (e.g. object id)
    - fill in data record (m_impl_req was done above)
    - save signal sections to DICT memory

- release signal sections

- call schemaTrans_sendConf or schemaTrans_sendRef

This procedure performs same checks in master and slave and
creates identical operation records.

Sections / pages
----------------
[todo]

Sub-operations
--------------

An operation can internally create additional operations.
These are put after the main op because new ops are always added
at end of operations list.  Like any op, a sub-op is created by
an external signal (master to itself).

An operation has a depth.  Top level depth is 0 and direct sub-ops
have depth one more than parent op.  The depth is equivalent to
making a tree structure on operations.

Information flows from main op to sub-ops (obviously) but the
reverse is also true.  CreateTrigger from AlterIndex creates
triggers which must be connected to the index.  There are a number
of ways to do this.  Choose E.

A. Make sub-op aware of its parent's requirements, so that it
   can fix things.  This is undesirable.

B. Do it in external signal CONF.  This is ideal except that
   the CONF is only received by master.  A second "sync" phase
   is required to have identical info on all nodes.

C. Make the main op pick up the extra info from the sub-op
   via tree walk.  The info may still be stored under main op.

D. Run the CONF code (which is currently a callback) on all nodes.
   This requires some piggy-backing.

E. Like A. but specify callback to run on all nodes.  Put callback
   number in requestInfo.  It is index to table of callbacks.
   Callback data is op_ptr.p->op_key.

Sub-operations before
---------------------

Sometimes a sub-operation must be run before main operation.
Hash index build is the example:

  create constraint - build index (main op) - drop constraint

The sub-op is NOT placed before the main op since this obfuscates
the tree structure.  Instead the main op is duplicated:

  build index (main op) -
    create constraint - build index (real build) - drop constraint

Other cases do not need a full sub-op but use the repeat feature.
E.g. create table takes lock (in master) on repeat 0 and releases
it on last repeat.  Non-master nodes do nothing here but are
synchronized with master automatically.

Dict objects
------------

This is list of all objects.  Object name is unique.
Connection to schema op allows generic code:

    SchemaOp  -->  OpData             -->  DictObject
                     | inherit               ^ ptr_i
                     v                       |
                   CreateTableData    -->  TableRecord

Abort mode
----------

Schema transaction proceeds in following main stages:

1 client sends tx begin
2 client sends parse requests
3 client sends tx end (commit)
4 DICT does prepare
5 DICT does commit

Abort in stage 1-2 only sends a REF to client.  The rollback is
run when client sends tx end (abort) or implicitly when client
disconnects.  Either way, client is responsible for abort.

Abort in stage 1-2 may be direct response to REQ, or may come from
another node, or from a recursive failed sub-operation parse.

Abort in stage 3-4 does complete rollback before returning error.
This is because no more client interaction is possible.  There
are no separate execute (commit) and close signals.

Abort is not allowed once commit is started.  A node which cannot
commit must crash itself.

Transaction state is determined by:

- requestType SchemaTransImplReq::RequestType

- abort mode
  0 = normal run, no abort
  1 = error in current step, becomes 2 after current step
  2 = aborting (similar to normal run but in abort mode)

In abort mode the phases and operations which have been run
so far, are run backwards in "abort mode".  The same operation
records are used so the exact state is known.  For example,
AlterTrigger knows state of local blocks (TC, TUP).  Each step
is repeated as requested.

A node which cannot abort must crash itself.

Running backwards from 3-4 means that abort code for an op may
be run twice, in general:

abort prepare: undo external changes (disk, other blocks)
abort parse: undo in-memory DICT changes

Error inserts are listed in App E.

Abort of table ops
------------------

Tables are special.  For historical reasons, create and drop
are different protocols.  Abort of create table in prepare phase
jumps into drop table protocol, using a drop table record.

Internal transactions
---------------------

This is DICT acting as client to schema transaction.

- client side record in TxHandle
- has same role as NdbDictInterface::Tx in NDB API
- kept in DLHashTable with key tx_key
- creates a normal schema trans (no special preperties)

tx_key shares id-space with op_key and trans_key.  This is
used in callbacks to find right object type.

Internal trans is used by:
- SR to activate and build (non-logged) indexes
- NR to activate indexes locally (see local transactions)
- API to run single ops (see simple transactions)

Local transactions
------------------

These are schema transactions where one node is the coordinator
and the only participant.

Simple transactions
-------------------

Client does not create a schema trans.  Client sends a single
operation request.  The op creates an internal trans.  The op
and its sub-ops are run within the internal trans.

This results in fewer api-kernel roundtrips.

Request info
------------

All REQ signals have Uint32 requestInfo.  This has a standard
format defined in DictSignal.hpp.

In particular there are requestFlags (last 16 bits) and they
are divided into global flags (first 8) and local flags (last 8).
Global flags are passed recursively to all involved nodes and
objects.  Local flags apply only to the object of the REQ.

DictSignal::RF_LOCAL_TRANS
this is a global flag set on all involved objects

DictSignal::RF_SIMPLE_TRANS
this is a local flag (sub-op must not start another trans)

AlterIndex and AlterTrigger
---------------------------

Triggers can be explicitly online/offline (enabled/disabled).

There is no user-level concept of "offlined index".  AlterIndex
is forced by SR / NR which create the index table (by reading
from file, not via normal protocols) in "offline" state.  That is,
it has no associated index in TC and triggers in TUP.

To online trigger: create triggers in all blocks (TUP).  To
offline drop them all.

To online index: create triggers in DICT and index in TC and
triggers in TUP.  To offline drop them all.  Online is used at
SR / NR.  Offline is used as sub-operation of drop index.

Error codes
-----------

Remove translations from Dbdict.hpp.  Translate in NDB API if
needed.

TODO
----

- finish table ops conversion
- fill in missing abort cases
- able to drop "broken" objects (yes, they should not happen)
- API FAIL => trans abort
- fragmentation info of PARSEd object (DIH unmodified)
- fix drop etc to happen at right time (prepare vs commit)
- build hash index in prepare phase
- template and simplify what can be
- decide which objects have versions in which signals
- minimize redundant info in CONF/REF sigs
- alloc trigger id from schemafile
- clean up connection to DictObject
- verify rolling upgrade from 5.1

merge
- to 5.1-telco

next round
- convert event ops
- convert file and tablespace ops
- unify code for permanent objects (schemafile, tablefile)

applications
- create table: subops to create blobs, (maybe) events
- drop table: subops to drop blobs, indexes, events

later
- master NF

Annoyances and side-issues
--------------------------

Name randomness table/tab index/indx trigger/trig etc.

Dbdict.cpp is almost 20,000 lines.  Split it up.

Dbdict.hpp method declarations should be put after the data
declarations that they belong to.

Inline constructors and placement new should be used everywhere
as a simple way to avoid missing some initialization.

Sometimes members have "m_" prefix, sometimes not.  Dbdict.hpp
(major) structs should have it.  Signal classes should not.

Signals classes should be PODs.  PODs allow member and static
methods (far as I know).  Move common methods to static methods
somewhere else e.g. trigger_definitions.h.

Index and trigger external REQ should include object versions.

Index and trigger CONF/REF signals have redundant fields.

Should have separate IndexRecord and make indexes "more equal"
with tables.

Signal number/name/printer declared in too many places.
Put printer in signal class (POD allows static methods).
Have single list of all signal info.

Error codes should be inherited from one master file.

App S. Signal names
-------------------

Have to use old signal names.  Exception (*) adds underscore.
Each name implies 3 signals with REQ/CONF/REF appended.

transaction		external		internal
-----------		--------		--------
begin tx		SCHEMA_TRANS_BEGIN_
internal tx					SCHEMA_TRANS_IMPL_
end tx			SCHEMA_TRANS_END_

operation               external                internal
---------               --------                --------
CreateTable             CREATE_TABLE_           CREATE_TAB_
DropTable               DROP_TABLE_             DROP_TAB_
AlterTable              ALTER_TABLE_            ALTER_TAB_

CreateIndex             CREATE_INDX_            CREATE_INDX_IMPL_
DropIndex               DROP_INDX_              DROP_INDX_IMPL_
AlterIndex              ALTER_INDX_             ALTER_INDX_IMPL_
BuildIndex              BUILDINDX              *BUILD_INDX_IMPL_

CreateTrigger           CREATE_TRIG_            CREATE_TRIG_IMPL_
DropTrigger             DROP_TRIG_              DROP_TRIG_IMPL_
AlterTrigger            ALTER_TRIG_             ALTER_TRIG_IMPL_

App E. Error inserts
--------------------

6101: Fail seize of schema trans (master)

  api
    SCHEMA_TRANS_BEGIN_REQ
      dict/m
    SCHEMA_TRANS_BEGIN_REF (error 780)
  api

6101: Fail seize of schema trans (slave)

  api
    SCHEMA_TRANS_BEGIN_REQ
      dict/m
        SCHEMA_TRANS_IMPL_REQ (begin)
          dict/p
        SCHEMA_TRANS_IMPL_REF (error 780)
      dict/m
    SCHEMA_TRANS_BEGIN_REF (error 780)
  api

6111: Fail seize op in create table (master)

No operation was created.  As always, client is responsible
for tx end (which could happen via API fail).

  api
    CREATE_TABLE_REQ
      dict/m
    CREATE_TABLE_REF (error 783)
  api
    SCHEMA_TRANS_END_REQ (abort)
      dict/m
        SCHEMA_TRANS_IMPL_REQ (end)
          dict/p
        SCHEMA_TRANS_IMPL_CONF
      dict/m
    SCHEMA_TRANS_END_CONF
  api

6111: Fail seize op in create table (slave)

Operation op0 was created and parsed on master.
Seize op failed on some slave.

  api
    CREATE_TABLE_REQ
      dict/m
        SCHEMA_TRANS_IMPL_REQ (parse)
          dict/s
        SCHEMA_TRANS_IMPL_REF (error 783)
      dict/m
    CREATE_TABLE_REF
  api
    SCHEMA_TRANS_END_REQ (abort)
      dict/m
        SCHEMA_TRANS_IMPL_REQ (abort parse op0)
          dict/p (not all)
        SCHEMA_TRANS_IMPL_CONF
      dict/m
        SCHEMA_TRANS_IMPL_REQ (end)
          dict/p
        SCHEMA_TRANS_IMPL_CONF
      dict/m
    SCHEMA_TRANS_END_CONF
  api

6112: Fail parse in create table (master)

Operation op0 was created on master but not on slaves.

  api
    CREATE_TABLE_REQ
      dict/m
    CREATE_TABLE_REF (error 9999)
  api
    SCHEMA_TRANS_END_REQ (abort)
      dict/m
        SCHEMA_TRANS_IMPL_REQ (abort parse op0)
          dict/p (only master)
        SCHEMA_TRANS_IMPL_CONF
      dict/m
        SCHEMA_TRANS_IMPL_REQ (end)
          dict/p
        SCHEMA_TRANS_IMPL_CONF
      dict/m
    SCHEMA_TRANS_END_CONF
  api

6112: Fail parse in create table (slave)

  api
    CREATE_TABLE_REQ
      dict/m
        SCHEMA_TRANS_IMPL_REQ (parse)
          dict/s
        SCHEMA_TRANS_IMPL_REF (error 9999)
      dict/m
    CREATE_TABLE_REF
  api
    SCHEMA_TRANS_END_REQ (abort)
      dict/m
        SCHEMA_TRANS_IMPL_REQ (abort parse op0)
          dict/p (not all)
        SCHEMA_TRANS_IMPL_CONF
      dict/m
        SCHEMA_TRANS_IMPL_REQ (end)
          dict/p
        SCHEMA_TRANS_IMPL_CONF
      dict/m
    SCHEMA_TRANS_END_CONF
  api

6113: Fail prepare in create table (master or slave)

This is regular coordinator-participant code running
after client has sent tx end.

  api
    SCHEMA_TRANS_END_REQ (commit)
      dict/m
        SCHEMA_TRANS_IMPL_REQ (prepare)
          dict/p
        SCHEMA_TRANS_IMPL_REF (error 9999)
      dict/m
        SCHEMA_TRANS_IMPL_REQ (abort prepare op0)
          dict/p
        SCHEMA_TRANS_END_CONF
      dict/m
        SCHEMA_TRANS_IMPL_REQ (abort parse op0)
          dict/p
        SCHEMA_TRANS_END_CONF
      dict/m
    SCHEMA_TRANS_END_REF (error 9999)
  api

vim: set et sw=2: