WL#5490: Separate send thread in Multithreaded NDB Data node
Status: Complete
In many benchmarks and workloads, the current design of multithreaded NDB Data nodes limits throughput to what the LQH threads, and in some cases the TC thread, can handle. One way to let each Data node deliver more throughput is to relieve those threads of all send activity towards the OS and offload that work to one or more separate send threads. Since separate send threads can also increase latency, this will be a configurable parameter which is off by default. The parameter is a number that specifies how many send threads to create.

When threads are locked to CPUs and not enough CPUs are provided to house the send threads, the separation of send threads will not be performed, since it can only be beneficial if there is a chance to use more CPUs to run the workload. The algorithm first assigns CPUs to the LQH threads, then the TC/Backup/Suma thread receives a CPU, and if one more is available the receiver thread also gets its own; any remaining CPUs can host send threads. This means that with 4 LQH threads, at least 7 CPUs are needed to use 1 send thread, and 8 CPUs are needed to support 2 send threads. If no locking to CPUs is configured, send threads will always be created as configured.
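A minimal sketch of the CPU allocation order described above, assuming a hypothetical helper (effective_send_threads and its parameters are illustrative names, not part of the actual patch):

    #include <algorithm>

    /*
     * Illustrative only: LQH threads are assigned CPUs first, then one CPU
     * for the TC/Backup/Suma thread, then one for the receiver thread.
     * Whatever remains can host send threads, capped by the configured count.
     */
    static unsigned
    effective_send_threads(unsigned locked_cpus,
                           unsigned lqh_threads,
                           unsigned configured_send_threads,
                           bool cpu_locking_enabled)
    {
      if (!cpu_locking_enabled)
      {
        /* Without CPU locking the configured count is always used. */
        return configured_send_threads;
      }
      const unsigned tc_backup_suma = 1;
      const unsigned receiver = 1;
      const unsigned reserved = lqh_threads + tc_backup_suma + receiver;
      if (locked_cpus <= reserved)
      {
        /* No spare CPU for a send thread: keep sending in block threads. */
        return 0;
      }
      return std::min(configured_send_threads, locked_cpus - reserved);
    }

With 4 LQH threads this reserves 6 CPUs, so 7 locked CPUs yield 1 send thread and 8 yield 2, matching the example above.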
All the code that handles sending in the multithreaded NDB kernel is located in mt.cpp in storage/ndb/src/kernel/vm. The send thread will only take care of the actual send call for one or more connections. If more than one send thread is allocated, each thread will take care of a subset of the connections to other nodes.

Currently every send is protected by a mutex on the node send buffer. This mutex is no longer needed to protect the actual send call, which is the main time consumer in the send handling. It is, however, still required to protect both the collection of send buffers into iovec's and the removal of send buffers after the send call has completed. Send buffers released in a send thread can go to a local list maintained by that send thread and protected by a new mutex; when a block thread needs more buffers for its free list, it takes them from the list maintained by the send thread (sketched below, after the list of changes).

The variable m_bytes can be checked to see whether there is data available to send on a node send buffer. It is protected by the mutex for updates, but it can most likely be read without protection just to check whether any outstanding sends remain before the send thread goes to sleep. The callback forceSend can be adapted to wait for the send thread and possibly even to expedite the sending in the send thread. It could also be handled directly in the method getWritePtr.

Major changes required:

mt_send_handle::getWritePtr
1) Remove the extra flush of the send buffer here when using send threads, since we are focusing on throughput and not on latency.
2) If no send buffer is available in the pool, get send buffers from the pool maintained by the send thread.
3) In this case it is unnecessary to call send_is_possible, since the send is done by another thread; we will, however, keep it, since it introduces another potential delay.
4) The callback forceSend will be a call that tries to expedite sending of this node send buffer in the send thread, since its buffers are overloaded.

trp_callback::reportSendLen
This method needs to be adapted since it is no longer executed in a block thread but in the send thread.
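The following is a minimal sketch of the locking scheme described above, not the actual mt.cpp code. All types and names (SendBuffer, NodeSendState, SendThread, do_os_send, unlink_sent_buffers) are hypothetical stand-ins; m_bytes is modelled as an atomic only so that the unlocked read is well-defined in the sketch.

    #include <atomic>
    #include <cstddef>
    #include <mutex>
    #include <vector>
    #include <sys/uio.h>   // struct iovec

    struct SendBuffer
    {
      const char* data;
      size_t      len;
      SendBuffer* next;
    };

    struct NodeSendState
    {
      std::mutex          m_mutex;     // protects the buffer chain and m_bytes updates
      SendBuffer*         m_first = nullptr;
      std::atomic<size_t> m_bytes{0};  // "anything to send?" indicator, dirty-readable
    };

    struct SendThread
    {
      std::mutex  m_free_mutex;        // the new mutex protecting the local free list
      SendBuffer* m_free_list = nullptr;

      /* Block threads call this when they need to refill their own free list. */
      SendBuffer* take_free_buffers()
      {
        std::lock_guard<std::mutex> guard(m_free_mutex);
        SendBuffer* list = m_free_list;
        m_free_list = nullptr;
        return list;
      }

      /* One iteration of send handling for one node.  In the real send thread
       * loop the same unlocked m_bytes check is what decides whether the
       * thread can go to sleep. */
      void handle_node(NodeSendState& node)
      {
        if (node.m_bytes.load(std::memory_order_relaxed) == 0)
          return;                                   // nothing outstanding, skip

        std::vector<iovec> iov;
        {
          /* Collecting send buffers into iovec's still requires the node mutex. */
          std::lock_guard<std::mutex> guard(node.m_mutex);
          for (SendBuffer* b = node.m_first; b != nullptr; b = b->next)
            iov.push_back(iovec{const_cast<char*>(b->data), b->len});
        }

        /* The actual send call, the main time consumer, runs without the mutex. */
        const size_t bytes_sent = do_os_send(iov);

        SendBuffer* released;
        {
          /* Removing completed send buffers again requires the node mutex. */
          std::lock_guard<std::mutex> guard(node.m_mutex);
          released = unlink_sent_buffers(node, bytes_sent);
          node.m_bytes.fetch_sub(bytes_sent, std::memory_order_relaxed);
        }

        /* Released buffers go to the send thread's local free list. */
        std::lock_guard<std::mutex> guard(m_free_mutex);
        while (released != nullptr)
        {
          SendBuffer* next = released->next;
          released->next = m_free_list;
          m_free_list = released;
          released = next;
        }
      }

    private:
      /* Placeholders for the real writev()-based send and list maintenance. */
      static size_t do_os_send(const std::vector<iovec>&) { return 0; }
      static SendBuffer* unlink_sent_buffers(NodeSendState&, size_t) { return nullptr; }
    };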
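A second sketch, again with hypothetical names (BlockThreadPool, SendThreadPool, take_all, force_send), illustrating how changes 2 and 4 from the list could look: the getWritePtr path falls back to the send thread's free list when the block thread's pool is empty, and forceSend only signals the send thread to expedite sending instead of sending itself.

    #include <condition_variable>
    #include <cstddef>
    #include <mutex>

    struct SendBuffer
    {
      SendBuffer*    next;
      unsigned char* data;
      size_t         size;
    };

    struct SendThreadPool
    {
      std::mutex              m_mutex;
      std::condition_variable m_wakeup;
      SendBuffer*             m_free_list = nullptr;
      bool                    m_force_send = false;

      /* Used by a block thread when its own free list is empty (change 2). */
      SendBuffer* take_all()
      {
        std::lock_guard<std::mutex> guard(m_mutex);
        SendBuffer* list = m_free_list;
        m_free_list = nullptr;
        return list;
      }

      /* forceSend no longer sends by itself: it asks the send thread to
       * expedite this node's send buffer (change 4). */
      void force_send()
      {
        {
          std::lock_guard<std::mutex> guard(m_mutex);
          m_force_send = true;
        }
        m_wakeup.notify_one();
      }
    };

    struct BlockThreadPool
    {
      SendBuffer* m_free_list = nullptr;

      unsigned char* get_write_ptr(SendThreadPool& send_pool)
      {
        if (m_free_list == nullptr)
        {
          /* Fall back to buffers released by the send thread. */
          m_free_list = send_pool.take_all();
          if (m_free_list == nullptr)
            return nullptr;  // caller handles out-of-buffers, e.g. via force_send()
        }
        SendBuffer* buf = m_free_list;
        m_free_list = buf->next;
        return buf->data;
      }
    };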