The current implementation of unique key operations in the Transaction Coordinator (TC) component of NDB Cluster can be improved. The current implementation implements unique key operations by semi-externally invoking two separate primary key operations on the index table and the base table. This should be modified so that unique key operations are handled alongside primary key operations in the TCKEYREQ handler. By aligning unique key and primary key operation processing, this work should alleviate problems with : - Transaction (APIConnectRecord) state handling - Savepoint alignment between linked table operations - Error handling scenarios Other aspects of unique key indices should not be affected : - NDBAPI interface to unique indexes - Unique index maintenance through triggers - Unique index (re)build.
General implementation goals ---------------------------- G1) Bring implementation of primary key access (TCKEYREQ) and unique index access (TCINDXREQ) closer together, ideally into the same handler function. a) Remove *need* for separate TCINDXREQ, INDXKEYINFO, INDXATTRINFO TCINDXCONF etc. signals (Support for them must be maintained for backwards compatibility) b) Avoid duplication of code for - transaction state handling (e.g. CS_* states) - packed send of TCINDXCONF signals - commit ack marker handling c) Try to avoid complexities around 2 PK requests using a common savepoint G2) Maintain current optimised primary key (TCKEYREQ) flow G3) Simplify TC code where possible Limitations and Restrictions ---------------------------- L1) Unique index maintenance should be mostly unaffected, including : - Unique index definition/alter - Unique index (re)build - Unique index maintenance (insert/delete from triggers etc.) L2) Current unique index implementation limitations (e.g. limitations on locking modes etc.) will be maintained. L3) TC block and maybe NDBAPI will be affected by this WL. No other areas should be affected.
Current implementation notes ---------------------------- NDBAPI NdbIndexOperation is very similar to NdbOperation, with just different signal numbers for the request and response signals. TC BLOCK TCINDXREQ handling : The TCINDXREQ handler stores the received TCINDXREQ in a special pool and then uses the received KeyInfo (From INDXKEYINFO signals) to build a TCKEYREQ on the Index table, which reads the special $PK column from the index table. This 'index table read' TCKEYREQ contains most of the request flags from the original TCINDXREQ, as well as some special indicators to inform TC that it's the first part of a unique index lookup. Additionally the TC block is set as the client, so that the resulting TRANSID_AI information will be sent back from TUP to TC. TCKEYREQ handling : This first TCKEYREQ is directly executed, and results in an LQHKEYREQ being sent to the involved LQH block, which may be on the same, or another node. << LQH, ACC, TUP >> for index table read When the LQHKEYCONF returns, a TCKEYCONF is 'sent' to the TC block(self). TCINDXREQ handling : When the TC block has received all of the TRANSID_AI from TUP for the index table read, as well as the TCKEYCONF, it builds a second TCKEYREQ, on the base table, with the received primary key and fragment identity as key, and the AttrInfo (from IndxAttrInfo signals) as AttrInfo. This TCKEYREQ contains a special indicator to indicate that it is the second part of a unique index lookup operation. The client reference is set to be the original NDBAPI id. The TCKEYREQ is executed directly. TCKEYREQ handling : This second TCKEYREQ results in an LQHKEYREQ being sent to the involved LQH block, which may be on the same, or another node. << LQH, ACC, TUP >> for base table operation When the LQHKEYCONF returns, a TCKEYCONF is 'sent' to the TC block(self). TCINDXREQ handling : When the TCKEYCONF is received, a TCINDXCONF is sent to the NDBAPI client and some operation-related resources can be freed. Observations and discussion --------------------------- Avoiding duplication The original design attempts to keep unique index and primary key access separated by having unique index access be a client of primary key access. Unfortunately, this abstraction 'leaks' as the primary key access handling has to be aware that an index operation is being processed in certain cases. Also, connection (transaction) state checking functionality is duplicated despite having the same requirements for both operation types. The implementation should be refactored to minimise the code where the operation types differ to avoid this duplication. The main technique for doing this is to have both PK and UK access handled via the current TCKEYREQ handler. Provide PK access to index table The Optimize table functionality added to mysql-5.1-telco-6.3 supports optimization of unique indexes. This functionality requires the ability to be able to perform a table scan on a unique index, and perform scan-takeover update operations on the tuples returned to allow a unique index table to be optimized (i.e. to send TCKEYREQ requests). This suggests that there is at least one requirement to provide access to the index table as-is. e.g. TCKEYREQ on an index table should operate on it as a standalone table, TCINDXREQ should use it as an access method to it's base table. Suggestion from Jonas O : Have a unified signal handler, but if the GSN is TCINDXREQ, perform indirect access, otherwise perform direct (PK) access. Interaction between operations in a batch The behaviour of the system when multiple operations in a batch access the same tuple(s) via the same or different access methods is not well defined. For example, a PK delete in the same batch as a unique index read may result in success for both operations, deadlock, or a failing read operation. Locking behaviour Note that unique indexes do not currently support unlocked access. Within NDBAPI (storage/ndb/src/ndbapi/NdbIndexOperation.cpp), simple, dirty/last-committed reads are mapped to shared reads. (Note, does NdbRecord also perform this limited mapping? No.) This restrictive mapping should be done in the kernel rather than relying on the API. Interactions with index maintenance Index maintenance is done via triggers. Immediate triggers are fired in TUP when changes are made by an operation, and the fact that triggers were fired is returned in the LQHKEYCONF. This tells TC that the operation must wait until the fired triggers have completed before continuing with processing the operation. To suspend an operation in this way, saveTriggeringOpState() is called from execLQHKEYCONF(). This method simply saves the received LQHKEYCONF in a buffer. It is assumed that all triggered operations will complete with an LQHKEYCONF or LQHKEYREF signal being processed. These handlers will check whether the triggering operation has completed all of its triggers, and if so, will restart it (which is done by (re)sending it's LQHKEYCONF with the numberOfFiredTriggers set to 0). Immediate triggers are currently used for unique index maintenance and table reorganisation. This behaviour should be unaffected by this WL. For testing, note that primary key operations on unique-indexed tables can result in triggers being fired, and unique key operations on unique-indexed tables can result in triggers being fired. It is not (currently) expected that the first operation on the index table (read only) can fire triggers. Long signals and backwards compatibility WL4258 added long TCINDXREQ signals to mysql-5.1-telco-6.4 while maintaining support for short TCINDXREQ signals from API nodes at previous releases. This WL should maintain support for short and long variants of TCINDXREQ. (Is TCINDXREQ sent internally at all?) Note that TUP does not currently send long TRANSID_AI to kernel blocks, and so does not send long TRANSID_AI to TC as a result of the index table read. There are no plans to change this as part of this WL. Simplifying interactions with Savepoints etc. Unique index operations require that the initial index table access and the second base table occur within the same savepoint. The current savepoint identifier is incremented in TC when a TCKEYREQ with the exec flag set is encountered. One way to simplify having the first and second operations have the same savepoint is to create them both at the start. The first operation takes the KeyInfo from the TCINDXREQ, and gets generated ATTRINFO. The second takes the ATTRINFO from the TCINDXREQ, and waits for TRANSID_AI to get it's KEYINFO. This means that by the time the savepoint is incremented, all operations in the previous savepoint have been defined, but not all will have received all of their KEYINFO/ATTRINFO yet. This aligns with 'normal' primary key access. Proposed Implementation Modifications ------------------------------------- 1) execTCKEYREQ() modified to check received GSN, and for GSN==TCINDXREQ: - Perform unique index access specific checks - Create index table and base table access operations from received request, with each marked as such. The index table access operation also contains a reference to the base table access operation. The index table access operation should : - Read with a read lock - Have the client 'return address' set to the TC block's id. - Place any received KeyInfo (either inline or attached in a long section) into the index table access operation(1). - Place any received AttrInfo (either inline or attached in a long section) into the base table access operation(2). - Generate the AttrInfo to read the index table's $PK column. - Continue processing the index table access operation as normal (if all KeyInfo received, send LQHKEYREQ etc..) 2) TCINDXREQ, INDXKEYINFO and INDXATTRINFO modified to call execTCKEYREQ(), execKEYINFO() and execATTRINFO() handlers. 3) execKEYINFO() handler modified to detect case where operation is (1) and add KeyInfo to it (no work here?). 4) execATTRINFO() handler modified to add received ATTRINFO for operation(1) to operation (2). 5) execTRANSID_AI() handler modified to add received TRANSID_AI as KeyInfo to operation(2) - Specifically, AttributeHeader(s) must be removed if present, and the fragmentid must be set. 6) execATTRINFO() execLQHKEYCONF() and execTRANSID_AI() handlers modified so that when all required ATTRINFO, TRANSID_AI and LQHKEYCONF signals have been received for operation (2), it is executed as normal with : - At least read lock for read operations - client set to original NDBAPI client reference 7) execLQHKEYREF() modified to perform cleanup of both operations if required, and return TCKEYREF() to client as expected. 8) Remove old definitions, code paths etc. New implementation notes ------------------------ Having both operations defined upfront helps with Savepoint issues and also provides buffering for state information (received AttrInfo, fragId of base table key etc.) that is currently placed in index operation specific buffers. The second operation can be considered as a slow-to-start PK op which is fed keyinfo from the first, but otherwise looks like a standard operation. The normal post-execution operation cleanup mechanisms which release the CacheRecord and buffered key and attrinfo can continue to work in the normal way. Implementation approach ----------------------- The intended approach has two steps : 1) modify the current TCKEYREQ handler to also handle TCINDXREQ requests in the 'new' way. 2) remove existing code, types, structures involved in handling TCINDXREQ in the 'old' way. It is hoped that this makes it easier to compare old and new implementations as development progresses, and avoids a big-bang approach which may result in extra debugging. Testing approach ----------------