NDB Disk Data: When a log file group had more than 18 undo logs, it was not possible to restart the cluster. (Bug #251155785)
References: See also: Bug #28922609.
When a local checkpoint (LCP) was complete on all data nodes except one, and this node failed,
NDBdid not continue with the steps required to finish the LCP. This led to the following issues:
No new LCPs could be started.
Redo and Undo logs were not trimmed and so grew excessively large, causing an increase in times for recovery from disk. This led to write service failure, which eventually led to cluster shutdown when the head of the redo log met the tail. This placed a limit on cluster uptime.
Node restarts were no longer possible, due to the fact that a data node restart requires that the node's state be made durable on disk before it can provide redundancy when joining the cluster. For a cluster with two data nodes and two fragment replicas, this meant that a restart of the entire cluster (system restart) was required to fix the issue (this was not necessary for a cluster with two fragment replicas and four or more data nodes). (Bug #28728485, Bug #28698831)
References: See also: Bug #11757421.
It was possible in certain cases for nodes to hang during an initial restart. (Bug #28698831)
References: See also: Bug #27622643.
When tables with
BLOBcolumns were dropped and then re-created with a different number of
BLOBcolumns the event definitions for monitoring table changes could become inconsistent in certain error situations involving communication errors when the expected cleanup of the corresponding events was not performed. In particular, when the new versions of the tables had more
BLOBcolumns than the original tables, some events could be missing. (Bug #27072756)
When running a cluster with 4 or more data nodes under very high loads, data nodes could sometimes fail with Error 899 Rowid already allocated. (Bug #25960230)
When starting, a data node copies metadata, while a local checkpoint updates metadata. To avoid any conflict, any ongoing LCP activity is paused while metadata is being copied. An issue arose when a local checkpoint was paused on a given node, and another node that was also restarting checked for a complete LCP on this node; the check actually caused the LCP to be completed before copying of metadata was complete and so ended the pause prematurely. Now in such cases, the LCP completion check waits to complete a paused LCP until copying of metadata is finished and the pause ends as expected, within the LCP in which it began. (Bug #24827685)
Asynchronous disconnection of mysqld from the cluster caused any subsequent attempt to start an NDB API transaction to fail. If this occurred during a bulk delete operation, the SQL layer called
HA::end_bulk_delete(), whose implementation by
ha_ndbclusterassumed that a transaction had been started, and could fail if this was not the case. This problem is fixed by checking that the transaction pointer used by this method is set before referencing it. (Bug #20116393)