WL#1498: Multithreaded DB-nodes

Status: Assigned

Implement multithreaded VM to make better use of SMP machines

Goals:
* Improved scalability on SMP machines
* Unaltered scalability/performance on UP-machines

---

Goals until Dec 2007:

- 2 ndb threads (maybe 1 transporter thread)
- fully functional, including:
  - crash dump
  - "safe" job scheduling
  - not many global locks
- "optimized" / performance characteristics evaluated

ToDo:

 - Check CMVMI: does it need a transporter lock? (Related to detecting a
   failed node; we must guarantee that after receiving a node failure
   report we will never see another signal from that node.)

 - Decide on and implement how to handle job buffer overrun.
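
   One candidate policy, as a sketch only (the point of the task is to
   decide this; thr_job_queue, try_push() and the retry budget are all
   made-up names): a writer that finds the buffer full yields until the
   consumer drains it, and crashes loudly rather than deadlocking silently.

     #include <sched.h>
     #include <stdlib.h>

     struct thr_job_queue;                          /* assumed queue type */
     struct Signal;
     bool try_push(thr_job_queue*, const Signal*);  /* assumed insert op */

     bool push_with_backoff(thr_job_queue* q, const Signal* sig)
     {
       for (unsigned i = 0; i < 1000000; i++)       /* assumed retry budget */
       {
         if (try_push(q, sig))
           return true;
         sched_yield();     /* let the consumer run and drain the buffer */
       }
       abort();             /* overrun never resolved: crash, don't hang */
     }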

 - Fix cross-thread EXECUTE_DIRECT (in DBDIH). There are more
   EXECUTE_DIRECT calls in DBDIH to fix, and one in ndbfs/AsyncFile.cpp
   that may or may not need fixing. All of the following call sites at
   least need to be checked (see the sketch after the list):

src/kernel/blocks/dbdict/Dbdict.cpp:  EXECUTE_DIRECT(SUMA, GSN_ALTER_TAB_REQ, signal,
src/kernel/blocks/dbdict/Dbdict.cpp:    EXECUTE_DIRECT(SUMA, GSN_DROP_TAB_CONF, signal,
src/kernel/blocks/dbdih/DbdihMain.cpp:    EXECUTE_DIRECT(LGMAN, GSN_SUB_GCP_COMPLETE_REP, signal,
src/kernel/blocks/dbdih/DbdihMain.cpp:      EXECUTE_DIRECT(SUMA, GSN_SUB_GCP_COMPLETE_REP, signal,
src/kernel/blocks/dbdih/DbdihMain.cpp:    EXECUTE_DIRECT(DBACC, GSN_DUMP_STATE_ORD, signal, 2);
src/kernel/blocks/dbdih/DbdihMain.cpp:    EXECUTE_DIRECT(DBTUP, GSN_DUMP_STATE_ORD, signal, 2);
src/kernel/blocks/dblqh/DblqhMain.cpp:      EXECUTE_DIRECT(refToBlock(atcBlockref), GSN_LQHKEYCONF,
src/kernel/blocks/dblqh/DblqhMain.cpp:    EXECUTE_DIRECT(DBDIH, GSN_GCP_SAVEREQ, signal, signal->getLength());
src/kernel/blocks/backup/Backup.cpp:      EXECUTE_DIRECT(DBDICT, GSN_BACKUP_FRAGMENT_REQ, signal, 2);
src/kernel/blocks/backup/Backup.cpp:    EXECUTE_DIRECT(DBDICT, GSN_BACKUP_FRAGMENT_REQ, signal, 2);
src/kernel/blocks/backup/Backup.cpp:      EXECUTE_DIRECT(DBDICT, GSN_BACKUP_FRAGMENT_REQ, signal, 2);
src/kernel/blocks/suma/Suma.cpp:  EXECUTE_DIRECT(DBDIH, GSN_CHECKNODEGROUPSREQ, signal,
src/kernel/blocks/suma/Suma.cpp:      EXECUTE_DIRECT(QMGR, GSN_API_FAILREQ, signal, 1);
src/kernel/blocks/suma/Suma.cpp:  EXECUTE_DIRECT(DBDIH, GSN_GETGCIREQ, signal, 3);

src/kernel/blocks/ndbfs/AsyncFile.cpp:      m_fs.EXECUTE_DIRECT(block, GSN_FSWRITEREQ, signal,
src/kernel/blocks/dbtux/DbtuxScan.cpp:      EXECUTE_DIRECT(blockNo, GSN_NEXT_SCANCONF, signal, signalLength);
src/kernel/blocks/dbtup/DbtupBuffer.cpp:    EXECUTE_DIRECT(block, GSN_TRANSID_AI, signal, 3 + ToutBufIndex);
src/kernel/blocks/dbtup/DbtupScan.cpp:      EXECUTE_DIRECT(blockNo, GSN_NEXT_SCANCONF, signal, signalLength);
src/kernel/blocks/dbtup/DbtupScan.cpp:      EXECUTE_DIRECT(blockNo, GSN_NEXT_SCANCONF, signal, 7);
src/kernel/blocks/dbtup/DbtupScan.cpp:  EXECUTE_DIRECT(blockNo, GSN_NEXT_SCANCONF, signal, 7);
src/kernel/blocks/dbtup/DbtupIndex.cpp:      EXECUTE_DIRECT(buildPtr.p->m_buildRef, GSN_TUX_MAINT_REQ,
src/kernel/blocks/dbtup/DbtupIndex.cpp:      EXECUTE_DIRECT(buildPtr.p->m_buildRef, GSN_TUX_MAINT_REQ,
src/kernel/blocks/dbtup/DbtupTrigger.cpp:      EXECUTE_DIRECT(receiverReference,
src/kernel/blocks/dbtup/DbtupTrigger.cpp:      EXECUTE_DIRECT(trigPtr->m_receiverBlock,
src/kernel/blocks/dbtup/DbtupTrigger.cpp:      EXECUTE_DIRECT(trigPtr->m_receiverBlock,
src/kernel/blocks/dblqh/DblqhMain.cpp:    EXECUTE_DIRECT(refToBlock(ref), GSN_KEYINFO20, signal,
src/kernel/blocks/dbacc/DbaccMain.cpp:      EXECUTE_DIRECT(blockNo, GSN_NEXT_SCANCONF, signal, 1);
src/kernel/blocks/dbacc/DbaccMain.cpp:  EXECUTE_DIRECT(blockNo, GSN_NEXT_SCANCONF, signal, 6);
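
   A minimal sketch of one way to make such call sites safe: keep the
   synchronous call when sender and receiver run in the same thread,
   otherwise fall back to the job buffers. execute_direct_mt(),
   get_thread_id() and m_thread_id are made-up names; sendSignal(),
   numberToRef(), getOwnNodeId() and JBB are the existing kernel API.

     void
     SimulatedBlock::execute_direct_mt(Uint32 block, Uint32 gsn,
                                       Signal* signal, Uint32 len)
     {
       if (get_thread_id(block) == m_thread_id)
       {
         /* Destination block runs in our thread: the classic
            synchronous call is still safe. */
         EXECUTE_DIRECT(block, gsn, signal, len);
       }
       else
       {
         /* Cross-thread: must go through the job buffers instead. */
         sendSignal(numberToRef(block, getOwnNodeId()), gsn,
                    signal, len, JBB);
       }
     }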

 - Write a document listing everywhere we take locks.

 - Check that crash handling is safe; we must _never_ (as far as possible)
   hang during crash handling, and should make sure we always exit the
   process (so that a restart can bring the node back into the cluster).
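
   One way to enforce the "never hang" part, sketched (the 60-second
   budget and begin_crash_handling() are assumptions): arm a hard
   deadline when crash handling starts, so that even a deadlocked crash
   path still ends in process exit and the node can be restarted.

     #include <csignal>
     #include <unistd.h>

     void begin_crash_handling()
     {
       /* Even if dump generation deadlocks, the alarm kills us. */
       std::signal(SIGALRM, [](int) { _exit(1); });
       alarm(60);           /* assumed crash-dump time budget */
       /* ... proceed with trace files / crash dump ... */
     }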

 - Change all use of pthread* stuff to wrappers in portlib.

 - Change spinlocks to real mutexes. This will greatly help when running
   ndbmtd on fewer cores than available threads, and should not hurt much,
   as there should be little contention (once the send lock is fixed).
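
   Something like the following keeps most of the spinlock fast path
   while degrading gracefully when threads outnumber cores (a sketch on
   top of portlib's NdbMutex; thr_lock, thr_lock_acquire() and
   SPIN_COUNT are illustrative):

     #include <NdbMutex.h>

     #define SPIN_COUNT 100   /* assumed tuning knob */

     struct thr_lock
     {
       NdbMutex* m_mutex;     /* created with NdbMutex_Create() */
     };

     static inline void
     thr_lock_acquire(struct thr_lock* lk)
     {
       /* Spin briefly: cheap when every thread has its own core... */
       for (int i = 0; i < SPIN_COUNT; i++)
         if (NdbMutex_Trylock(lk->m_mutex) == 0)
           return;
       /* ...but block in the OS when contended or oversubscribed. */
       NdbMutex_Lock(lk->m_mutex);
     }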

 - More benchmarks:
    - 2 and 4 nodes, api on other nodes
    - low load
    - medium load
    - maybe scans

 - Configurable minimum for the job buffer resource group.

 - Integrate thread creation with the stuff from the real-time patches.
   Configuration::addThreadId() and similar.

 - NUMA-awareness.
    - Allocate the job buffer for thrA->thrB from B-local memory. Then
      thrA can do async writes, and B can read from local memory.
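
   For the placement itself, libnuma's numa_alloc_onnode() would do
   (sketch; alloc_job_buffer() is made up, and reader_numa_node would
   come from wherever thrB is bound):

     #include <numa.h>        /* libnuma */
     #include <stdlib.h>

     /* Place the thrA->thrB job buffer on the reader's NUMA node:
        B's frequent reads become node-local, and A pays the remote
        cost only on its asynchronous writes. */
     void* alloc_job_buffer(size_t bytes, int reader_numa_node)
     {
       if (numa_available() < 0)
         return malloc(bytes);   /* no NUMA support: plain allocation */
       return numa_alloc_onnode(bytes, reader_numa_node);
     }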

 - One really cool thing would be to add a unit test program for mt.cpp:
   create X threads, each running one block, and send signals like crazy
   between them. This would help us find bugs when porting to non-x86.
   (And to verify that it can actually find something, we could
   *introduce* race conditions and check that the unit test crashes...
   only while testing the unit-test program itself, of course.)
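
   A rough standalone sketch of such a test program (std::thread and a
   toy lock-free ring stand in for the real threads and job buffers of
   mt.cpp; all names are made up). Each thread pushes a sequence of
   numbered "signals" to its neighbour and asserts that it receives its
   own input in order, so any race in the queue shows up as a crash:

     #include <atomic>
     #include <cassert>
     #include <cstdint>
     #include <cstdio>
     #include <thread>
     #include <vector>

     struct Ring      /* toy SPSC "job buffer": one writer, one reader */
     {
       static const uint32_t SZ = 1024;
       uint64_t buf[SZ];
       std::atomic<uint32_t> head{0}, tail{0};

       bool push(uint64_t sig)
       {
         uint32_t t = tail.load(std::memory_order_relaxed);
         if (t - head.load(std::memory_order_acquire) == SZ)
           return false;                                 /* full */
         buf[t % SZ] = sig;
         tail.store(t + 1, std::memory_order_release);   /* publish */
         return true;
       }
       bool pop(uint64_t* sig)
       {
         uint32_t h = head.load(std::memory_order_relaxed);
         if (h == tail.load(std::memory_order_acquire))
           return false;                                 /* empty */
         *sig = buf[h % SZ];
         head.store(h + 1, std::memory_order_release);   /* free slot */
         return true;
       }
     };

     int main()
     {
       const int NTHREADS = 4;              /* "X threads" */
       const uint64_t NSIGNALS = 1000000;
       std::vector<Ring> ring(NTHREADS);    /* ring[i]: thread i -> i+1 */
       std::vector<std::thread> thr;

       for (int i = 0; i < NTHREADS; i++)
         thr.emplace_back([&, i] {
           Ring& out = ring[i];
           Ring& in  = ring[(i + NTHREADS - 1) % NTHREADS];
           uint64_t sent = 0, got = 0, sig;
           while (sent < NSIGNALS || got < NSIGNALS)
           {
             if (sent < NSIGNALS && out.push(sent))
               sent++;
             if (got < NSIGNALS && in.pop(&sig))
             {
               assert(sig == got);  /* any reordering => crash, as wanted */
               got++;
             }
           }
         });

       for (auto& t : thr)
         t.join();
       printf("ok\n");
       return 0;
     }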

 - Split transporter buffers, avoid having to lock each sendSignal().
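
   The rough idea (a sketch; the sizes, lock helpers and real_send() are
   stand-ins): give every (thread, node) pair a private buffer that the
   owning thread fills without any lock, and take the per-node
   transporter lock only once per flushed batch.

     #include <cstdint>
     #include <cstring>

     static const int MAX_THREADS = 8;     /* assumed limits */
     static const int MAX_NODES   = 64;

     struct thr_send_buffer
     {
       unsigned char data[32 * 1024];      /* assumed batch size */
       uint32_t used;
     };

     /* Each buffer is written by exactly one thread => lock-free
        fast path. (Zero-initialized, being a global.) */
     static thr_send_buffer g_send_buf[MAX_THREADS][MAX_NODES];

     /* Stand-ins for the per-transporter lock and the real send path. */
     void lock_transporter(int node);
     void unlock_transporter(int node);
     void real_send(int node, const void* data, uint32_t len);

     void buffered_send(int thr, int node, const void* sig, uint32_t len)
     {
       thr_send_buffer& b = g_send_buf[thr][node];
       if (b.used + len > sizeof(b.data))
       {
         lock_transporter(node);           /* one lock per batch...    */
         real_send(node, b.data, b.used);  /* ...not per sendSignal()  */
         unlock_transporter(node);
         b.used = 0;
       }
       memcpy(b.data + b.used, sig, len);
       b.used += len;
     }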


Done:
 * Job handling: copy buffer pointers to the local stack before executing.

 * Change lock contention printouts: after a few contentions, only print
   counts occasionally (races on the counter are ignored; it doesn't matter).

 * Prio A signal handling: make the reader lock-free (still needs locking
   among writers).

 * Remove the 'reserve space' hack for STOP_FOR_CRASH; reserve the space
   for real instead (just use a static buffer, it will never be used
   anyway).

 * Unify prio A and prio B buffer handling, so the only difference is a
   single receive queue (checked more often) for prio A vs. a per-sender
   receive queue for prio B.
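
   Sketch of the resulting layout (names invented, MAX_THREADS assumed):

     static const int MAX_THREADS = 8;
     struct thr_job_queue { /* the signal queue type */ };

     struct thr_data    /* per-thread scheduler state */
     {
       thr_job_queue m_jba;                    /* prio A: one queue shared
                                                  by all senders; writers
                                                  lock among themselves,
                                                  reader polls it more
                                                  often */
       thr_job_queue m_in_queue[MAX_THREADS];  /* prio B: one queue per
                                                  sending thread => single
                                                  writer, no lock needed */
     };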

 * Fix signal dumps (no more prio A ...)

 * Fix crashes / instabilities.

 * DbCreate + DbAsyncGenerator benchmarks, get to 100% cpu (single-threaded).

 * Measure number of prio A vs. prio B vs. prio A timed signals sent
   (hopefully most are prio B). ToDo: enable for some ALL DUMP command.
     Result: 1/24 prio A, 99% of them from the receive thread
     (TC_COMMIT_ACK from API).

 * Locking around the global memory manager.

 * Use the global memory manager instead of malloc.

 * Do signal stats like for non-threaded scheduler (send info events).

 * I once added a thr_transporter_main for a thread running the
   transporter. I think that would still be nice, instead of having it
   inside thr_main (if that is still the name).

 * split most obvious contended locks (send lock)
   - send lock (only one I can think of now...)
     * introduce lock per transporter instead
     * implement TransporterRegistry::doSend(nodeId)

 * Make sure Watchdog is running before doing allocations ... and set Watchdog
   counter during allocations.
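
   For reference, the allocation loop pattern, roughly (sketch;
   touch_watchdog() stands in for however the counter is updated):

     #include <cstddef>

     void touch_watchdog();  /* stand-in: update the watchdog counter */

     /* Fault in a large allocation page by page, poking the watchdog
        along the way so a long (but healthy) startup allocation is not
        mistaken for a hang. */
     void touch_memory(unsigned char* base, size_t bytes, size_t page)
     {
       for (size_t off = 0; off < bytes; off += page)
       {
         base[off] = 0;                /* fault the page in */
         if ((off / page) % 1024 == 0)
           touch_watchdog();           /* alive: every 1024 pages */
       }
     }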