The current design of multithreaded NDB Data nodes limits throughput, in many benchmarks and workloads, to what the LQH threads can handle and in some cases to what the TC thread can handle. One method to let each Data node deliver more throughput is to offload any send activity towards the OS from those threads. The send activity is instead offloaded into one or more separate send threads. Since separate send threads can also increase latency, this will be a configurable parameter that is off by default. The parameter is a number that specifies how many send threads to create.

When locking to CPUs, if not enough CPUs are provided to house the send threads, the separation of send threads will not be performed, since it can only be beneficial if there is a chance to use more CPUs to run the workload. The algorithm first assigns CPUs to the LQH threads, then the TC/Backup/Suma thread receives a CPU, and if one more is available the receiver thread also gets its own. This means that if 4 LQH threads are used, at least 7 CPUs are needed to use 1 send thread, and with 8 CPUs we can support 2 send threads. If no locking to CPUs is configured, send threads will always be created as configured.
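The following is a minimal sketch of this allocation rule, not the actual configuration code; the function name compute_send_threads and its parameters are invented for illustration.

  /*
    Hypothetical illustration of the CPU allocation rule described above.
    Returns the number of send threads that will actually be created.
  */
  static unsigned
  compute_send_threads(unsigned num_lqh_threads,
                       unsigned num_cpus_locked,
                       unsigned configured_send_threads)
  {
    if (num_cpus_locked == 0)
    {
      /* No locking to CPUs configured: create send threads as configured. */
      return configured_send_threads;
    }
    /* One CPU per LQH thread, one for TC/Backup/Suma, one for the receiver. */
    const unsigned reserved = num_lqh_threads + 1 + 1;
    if (num_cpus_locked <= reserved)
    {
      /* Not enough CPUs left over: do not separate out send threads. */
      return 0;
    }
    const unsigned available = num_cpus_locked - reserved;
    return (available < configured_send_threads) ? available
                                                 : configured_send_threads;
  }

With 4 LQH threads the reserved count is 6, so 7 locked CPUs give 1 send thread and 8 locked CPUs give 2, as stated above.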
All the code that handles sending in the multithreaded NDB kernel is located in mt.cpp in storage/ndb/kernel/vm. The send thread will only take care of the actual send call for one or more connections. If more than one send thread is allocated, each thread takes care of a subset of the connections to other nodes.

Currently any send is protected by a mutex on the node send buffer. This mutex is no longer needed to protect the actual send call, which is the main time consumer in the send handling. It is however still required to protect both the collection of send buffers into iovecs and the removal of send buffers after completion of the send call. The send buffers released by the send thread can go onto a local list maintained by the send thread and protected by a new mutex; when a block thread needs more buffers in its free list, it gets them from the list maintained by the send thread. A sketch of this locking scheme is given after the list of changes below.

The variable m_bytes can be checked to see whether there is data available to send on a node send buffer. It is protected by the mutex for updates, but it can most likely be read without protection just to check whether any outstanding sends are needed before the send thread goes to sleep. The callback forceSend can be adapted to wait for the send thread, and possibly even to expedite the sending in the send thread. It could also be handled directly in the method getWritePtr.

Major changes required:

mt_send_handle::getWritePtr
1) Remove the extra flush of the send buffer here when using send threads, since we are focusing on throughput and not on latency.
2) If no send buffer is available in the pool, get send buffers from the pool maintained by the send thread.
3) For this case it is unnecessary to call send_is_possible since the send is done by another thread; we will however keep it, since it introduces another potential delay.
4) The callback forceSend will be a call that tries to expedite sending of this node send buffer in the send thread, since its buffers are overloaded.

trp_callback::reportSendLen
This method needs to be adapted since it is no longer executed in a block thread but in the send thread instead.
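The sketch below illustrates the locking scheme described above for one node send buffer handled by a send thread. It is not the mt.cpp implementation; the names SendBuffer, SendThread, collect_iovec, perform_send and release_pages are invented stand-ins for the real structures and helpers.

  #include <pthread.h>
  #include <sys/uio.h>
  #include <stddef.h>

  struct SendBuffer             /* per-node send buffer (hypothetical) */
  {
    pthread_mutex_t m_mutex;    /* existing per-node send buffer mutex */
    size_t m_bytes;             /* bytes queued; readable without lock */
    /* ... chained send buffer pages ... */
  };

  struct SendThread             /* state owned by one send thread (hypothetical) */
  {
    pthread_mutex_t m_buf_lock; /* new mutex for the local free-page list */
    /* ... list of pages released after completed sends ... */
  };

  /* Placeholder helpers standing in for the real buffer handling. */
  static int    collect_iovec(SendBuffer*, struct iovec*, int) { return 0; }
  static size_t perform_send(int /*fd*/, struct iovec*, int)   { return 0; }
  static void   release_pages(SendThread*, SendBuffer*, size_t) {}

  void
  handle_one_node(SendThread* self, SendBuffer* sb, int fd)
  {
    /* Unprotected read of m_bytes: only used to decide whether there is
       anything to send before doing the more expensive locked work.     */
    if (sb->m_bytes == 0)
      return;

    struct iovec iov[64];

    /* The mutex still protects collection of send buffers into iovecs.  */
    pthread_mutex_lock(&sb->m_mutex);
    const int cnt = collect_iovec(sb, iov, 64);
    pthread_mutex_unlock(&sb->m_mutex);

    /* The send call itself, the main time consumer, now runs without
       the node send buffer mutex held.                                  */
    const size_t sent = perform_send(fd, iov, cnt);

    /* The mutex is taken again to remove the buffers covered by the
       completed send call.                                              */
    pthread_mutex_lock(&sb->m_mutex);
    /* ... unlink pages that are now fully sent ... */
    pthread_mutex_unlock(&sb->m_mutex);

    /* Released pages go onto the send thread's local list, protected by
       the new mutex; block threads refill their free lists from here.   */
    pthread_mutex_lock(&self->m_buf_lock);
    release_pages(self, sb, sent);
    pthread_mutex_unlock(&self->m_buf_lock);
  }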