WL#4675: Unique Index Operation TC cleanup

Status: Assigned   —   Priority: Medium

The current implementation of unique key operations in the Transaction
Coordinator (TC) component of NDB Cluster can be improved.

The current implementation implements unique key operations by semi-externally
invoking two separate primary key operations on the index table and the base table.

This should be modified so that unique key operations are handled alongside
primary key operations in the TCKEYREQ handler.

By aligning unique key and primary key operation processing, this work should
alleviate problems with :
  - Transaction (APIConnectRecord) state handling
  - Savepoint alignment between linked table operations
  - Error handling scenarios

Other aspects of unique key indices should not be affected : 
  - NDBAPI interface to unique indexes
  - Unique index maintenance through triggers
  - Unique index (re)build.
General implementation goals
G1) Bring implementation of primary key access (TCKEYREQ) and unique index
access (TCINDXREQ) closer together, ideally into the same handler function.
etc. signals
    (Support for them must be maintained for backwards compatibility)
  b) Avoid duplication of code for 
      - transaction state handling (e.g. CS_* states)
      - packed send of TCINDXCONF signals
      - commit ack marker handling
  c) Try to avoid complexities around 2 PK requests using a common savepoint

G2) Maintain current optimised primary key (TCKEYREQ) flow
G3) Simplify TC code where possible

Limitations and Restrictions
L1) Unique index maintenance should be mostly unaffected, including : 
    - Unique index definition/alter
    - Unique index (re)build
    - Unique index maintenance (insert/delete from triggers etc.)

L2) Current unique index implementation limitations (e.g. limitations on locking
modes etc.) will be maintained. 

L3) TC block and maybe NDBAPI will be affected by this WL.  No other areas
should be affected.
Current implementation notes
  NdbIndexOperation is very similar to NdbOperation, with just different signal
numbers for the request and response signals.

  TCINDXREQ handling :
    The TCINDXREQ handler stores the received TCINDXREQ in a special pool and
then uses the received KeyInfo (From INDXKEYINFO signals) to build a TCKEYREQ on
the Index table, which reads the special $PK column from the index table.  
    This 'index table read' TCKEYREQ contains most of the request flags from the
original TCINDXREQ, as well as some special indicators to inform TC that it's
the first part of a unique index lookup.  Additionally the TC block is set as
the client, so that the resulting TRANSID_AI information will be sent back from
TUP to TC.  

  TCKEYREQ handling : 
    This first TCKEYREQ is directly executed, and results in an LQHKEYREQ being
sent to the involved LQH block, which may be on the same, or another node.
    << LQH, ACC, TUP >> for index table read
    When the LQHKEYCONF returns, a TCKEYCONF is 'sent' to the TC block(self).

  TCINDXREQ handling : 
    When the TC block has received all of the TRANSID_AI from TUP for the index
table read, as well as the TCKEYCONF, it builds a second TCKEYREQ, on the base
table, with the received primary key and fragment identity as key, and the
AttrInfo (from IndxAttrInfo signals) as AttrInfo.  This TCKEYREQ contains a
special indicator to indicate that it is the second part of a unique index
lookup operation.  The client reference is set to be the original NDBAPI id. 
The TCKEYREQ is executed directly.  

  TCKEYREQ handling : 
    This second TCKEYREQ results in an LQHKEYREQ being sent to the involved LQH
block, which may be on the same, or another node.
    << LQH, ACC, TUP >> for base table operation
    When the LQHKEYCONF returns, a TCKEYCONF is 'sent' to the TC block(self).
  TCINDXREQ handling : 
    When the TCKEYCONF is received, a TCINDXCONF is sent to the NDBAPI client
and some operation-related resources can be freed.

Observations and discussion

Avoiding duplication

The original design attempts to keep unique index and primary key access
separated by having unique index access be a client of primary key access. 
Unfortunately, this abstraction 'leaks' as the primary key access handling has
to be aware that an index operation is being processed in certain cases.  Also,
connection (transaction) state checking functionality is duplicated despite
having the same requirements for both operation types.  The implementation
should be refactored to minimise the code where the operation types differ to
avoid this duplication.  The main technique for doing this is to have both PK
and UK access handled via the current TCKEYREQ handler.

Provide PK access to index table

The Optimize table functionality added to mysql-5.1-telco-6.3 supports
optimization of unique indexes.  This functionality requires the ability to be
able to perform a table scan on a unique index, and perform scan-takeover update
operations on the tuples returned to allow a unique index table to be optimized
(i.e. to send TCKEYREQ requests).  This suggests that there is at least one
requirement to provide access to the index table as-is.  e.g. TCKEYREQ on an
index table should operate on it as a standalone table, TCINDXREQ should use it
as an access method to it's base table.
Suggestion from Jonas O : Have a unified signal handler, but if the GSN is
TCINDXREQ, perform indirect access, otherwise perform direct (PK) access.

Interaction between operations in a batch

The behaviour of the system when multiple operations in a batch access the same
tuple(s) via the same or different access methods is not well defined.  For
example, a PK delete in the same batch as a unique index read may result in
success for both operations, deadlock, or a failing read operation.  

Locking behaviour

Note that unique indexes do not currently support unlocked access.  Within
NDBAPI (storage/ndb/src/ndbapi/NdbIndexOperation.cpp), simple,
dirty/last-committed reads are mapped to shared reads.
(Note, does NdbRecord also perform this limited mapping? No.)  This restrictive
mapping should be done in the kernel rather than relying on the API.

Interactions with index maintenance

Index maintenance is done via triggers.  Immediate triggers are fired in TUP
when changes are made by an operation, and the fact that triggers were fired is
returned in the LQHKEYCONF.  This tells TC that the operation must wait until
the fired triggers have completed before continuing with processing the
operation.  To suspend an operation in this way, saveTriggeringOpState() is
called from execLQHKEYCONF().  This method simply saves the received LQHKEYCONF
in a buffer.
It is assumed that all triggered operations will complete with an LQHKEYCONF or
LQHKEYREF signal being processed.  These handlers will check whether the
triggering operation has completed all of its triggers, and if so, will restart
it (which is done by (re)sending it's LQHKEYCONF with the numberOfFiredTriggers
set to 0).
Immediate triggers are currently used for unique index maintenance and table
reorganisation.  This behaviour should be unaffected by this WL.
For testing, note that primary key operations on unique-indexed tables can
result in triggers being fired, and unique key operations on unique-indexed
tables can result in triggers being fired.  It is not (currently) expected that
the first operation on the index table (read only) can fire triggers.

Long signals and backwards compatibility

WL4258 added long TCINDXREQ signals to mysql-5.1-telco-6.4 while maintaining
support for short TCINDXREQ signals from API nodes at previous releases.  This
WL should maintain support for short and long variants of TCINDXREQ.  (Is
TCINDXREQ sent internally at all?)
Note that TUP does not currently send long TRANSID_AI to kernel blocks, and so
does not send long TRANSID_AI to TC as a result of the index table read.  There
are no plans to change this as part of this WL.

Simplifying interactions with Savepoints etc.

Unique index operations require that the initial index table access and the
second base table occur within the same savepoint.  The current savepoint
identifier is incremented in TC when a TCKEYREQ with the exec flag set is
encountered.  One way to simplify having the first and second operations have
the same savepoint is to create them both at the start.  The first operation
takes the KeyInfo from the TCINDXREQ, and gets generated ATTRINFO.  The second
takes the ATTRINFO from the TCINDXREQ, and waits for TRANSID_AI to get it's
KEYINFO.  This means that by the time the savepoint is incremented, all
operations in the previous savepoint have been defined, but not all will have
received all of their KEYINFO/ATTRINFO yet.  This aligns with 'normal' primary
key access. 

Proposed Implementation Modifications
1) execTCKEYREQ() modified to check received GSN, and for GSN==TCINDXREQ:
   - Perform unique index access specific checks
   - Create index table and base table access operations from received request,
with each marked as such.  The index table access operation also contains a
reference to the base table access operation.
     The index table access operation should :
       - Read with a read lock
       - Have the client 'return address' set to the TC block's id.
   - Place any received KeyInfo (either inline or attached in a long section)
into the index table access operation(1).
   - Place any received AttrInfo (either inline or attached in a long section)
into the base table access operation(2).
   - Generate the AttrInfo to read the index table's $PK column.
   - Continue processing the index table access operation as normal (if all
KeyInfo received, send LQHKEYREQ etc..)

execKEYINFO() and execATTRINFO() handlers.

3) execKEYINFO() handler modified to detect case where operation is (1) and add
KeyInfo to it (no work here?).

4) execATTRINFO() handler modified to add received ATTRINFO for operation(1) to
operation (2).

5) execTRANSID_AI() handler modified to add received TRANSID_AI as KeyInfo to
  - Specifically, AttributeHeader(s) must be removed if present, and the
fragmentid must be set. 

6) execATTRINFO() execLQHKEYCONF() and execTRANSID_AI() handlers modified so
that when all required ATTRINFO, TRANSID_AI and LQHKEYCONF signals have been
received for operation (2), it is executed as normal with :
   - At least read lock for read operations
   - client set to original NDBAPI client reference

7) execLQHKEYREF() modified to perform cleanup of both operations if required,
and return TCKEYREF() to client as expected.

8) Remove old definitions, code paths etc.

New implementation notes
Having both operations defined upfront helps with Savepoint issues and also
provides buffering for state information (received AttrInfo, fragId of base
table key etc.) that is currently placed in index operation specific buffers. 
The second operation can be considered as a slow-to-start PK op which is fed
keyinfo from the first, but otherwise looks like a standard operation.

The normal post-execution operation cleanup mechanisms which release the
CacheRecord and buffered key and attrinfo can continue to work in the normal way.

Implementation approach
The intended approach has two steps :
  1) modify the current TCKEYREQ handler to also handle TCINDXREQ requests in
the 'new' way.
  2) remove existing code, types, structures involved in handling TCINDXREQ in
the 'old' way.

It is hoped that this makes it easier to compare old and new implementations as
development progresses, and avoids a big-bang approach which may result in extra

Testing approach