Important Change: It is now possible to divide a backup into slices and to restore these in parallel, using two new options implemented for the ndb_restore utility. Multiple instances of ndb_restore can each restore a subset of the backup of roughly the same size, which should help to reduce the time required to restore an NDB Cluster from backup.
--num-slices determines the number of slices into which the backup should be divided;
--slice-id provides the ID of the slice (0 to 1 less than the number of slices) to be restored by ndb_restore.
Up to 1024 slices are supported.
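As a rough illustration of how the two options fit together, the following Python sketch builds one ndb_restore command line per slice. Only --num-slices and --slice-id come from this note; the helper name build_slice_commands is hypothetical, and the node ID, backup ID, and backup path are placeholder values that must be adapted to the actual backup.

```python
MAX_SLICES = 1024  # upper limit stated in this release note


def build_slice_commands(num_slices, node_id, backup_id, backup_path):
    """Return one ndb_restore argument list per slice (hypothetical helper)."""
    if not 1 <= num_slices <= MAX_SLICES:
        raise ValueError(f"num_slices must be between 1 and {MAX_SLICES}")
    commands = []
    for slice_id in range(num_slices):  # slice IDs run from 0 to num_slices - 1
        commands.append([
            "ndb_restore",
            f"--nodeid={node_id}",            # placeholder data node ID
            f"--backupid={backup_id}",        # placeholder backup ID
            f"--backup-path={backup_path}",   # placeholder backup directory
            "--restore-data",
            f"--num-slices={num_slices}",
            f"--slice-id={slice_id}",
        ])
    return commands


if __name__ == "__main__":
    for argv in build_slice_commands(4, node_id=1, backup_id=10,
                                     backup_path="/backups/BACKUP-10"):
        print(" ".join(argv))
```

Each argument list could then be started as a separate process (for example with subprocess.Popen) so that the slices are restored concurrently. Every instance must be given the same --num-slices value.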
Incompatible Change: The minimum value for the RedoOverCommitCounter data node configuration parameter has been increased from 0 to 1. The minimum value for the RedoOverCommitLimit data node configuration parameter has also been increased from 0 to 1.
You should check the cluster global configuration file and make any necessary adjustments to values set for these parameters before upgrading. (Bug #29752703)
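One possible way to perform this check before upgrading is sketched below in Python. It assumes the global configuration file uses the usual INI-style layout; the function name find_low_values and the sample text are illustrative only. Because repeated section names are merged by the parser, a file that sets these parameters in several [ndbd] sections should still be reviewed by hand.

```python
import configparser

# New minimum values stated in this release note (option names lowercased,
# since configparser lowercases option names by default).
MINIMUMS = {"redoovercommitcounter": 1, "redoovercommitlimit": 1}


def find_low_values(config_text):
    """Return (section, parameter, value) for each setting below its new minimum."""
    # strict=False tolerates repeated section names such as several [ndbd]
    # sections; note that the parser merges them, keeping the last value.
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(config_text)
    problems = []
    for section in parser.sections():
        for name, minimum in MINIMUMS.items():
            if parser.has_option(section, name):
                value = int(parser.get(section, name))
                if value < minimum:
                    problems.append((section, name, value))
    return problems


# Illustrative configuration fragment, not taken from a real cluster.
SAMPLE = """
[ndbd default]
NoOfReplicas=2
RedoOverCommitCounter=0
RedoOverCommitLimit=20
"""

if __name__ == "__main__":
    for section, name, value in find_low_values(SAMPLE):
        print(f"[{section}] {name}={value} is below the new minimum")
```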
Microsoft Windows; NDB Disk Data: On Windows, restarting a data node other than the master when using Disk Data tables led to a failure in TSMAN. (Bug #97436, Bug #30484272)
An ndbrequire() introduced when implementing partial local checkpoints assumed that m_participatingLQH must be clear when receiving START_LCP_REQ, which is not necessarily true when the master fails after sending START_LCP_REQ and before handling any START_LCP_CONF signals. (Bug #30523457)
A local checkpoint sometimes hung when the master node failed while sending an LCP_COMPLETE_REP signal and it was sent to some nodes, but not all of them. (Bug #30520818)
If a transaction was aborted while getting a page from the disk page buffer and the disk system was overloaded, the transaction hung indefinitely. This could also cause restarts to hang and node failure handling to fail. (Bug #30397083, Bug #30360681)
References: See also: Bug #30152258.
When synchronizing extent pages it was possible for the current local checkpoint (LCP) to stall indefinitely if a CONTINUEB signal for handling the LCP was still outstanding when receiving the FSWRITECONF signal for the last page written in the extent synchronization page. The LCP could also be restarted if another page was written from the data pages. It was also possible that this issue caused PREP_LCP pages to be written at times when they should not have been. (Bug #30397083)
Data node failures with the error Another node failed during system restart... occurred during a partial restart. (Bug #30368622)
If a SYNC_EXTENT_PAGES_REQ signal was received by PGMAN while dropping a log file group as part of a partial local checkpoint, and thus dropping the page locked by this block for processing next, the LCP terminated due to trying to access the page after it had already been dropped. (Bug #30305315)
The wrong number of bytes was reported in the cluster log for a completed local checkpoint. (Bug #30274618)
References: See also: Bug #29942998.
The number of data bytes for the summary event written in the cluster log when a backup completed was truncated to 32 bits, so that there was a significant mismatch between the number of log records and the number of data records printed in the log for this event. (Bug #29942998)
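The effect of such a truncation can be shown with a small arithmetic sketch in Python. It assumes the truncation simply kept the low 32 bits of the byte count; the function name and the 5 GiB figure are illustrative and are not taken from the bug report.

```python
def truncate_to_32_bits(value):
    """Keep only the low 32 bits of a non-negative integer."""
    return value & 0xFFFFFFFF


if __name__ == "__main__":
    actual_bytes = 5 * 1024**3  # 5 GiB of backup data (illustrative figure)
    reported_bytes = truncate_to_32_bits(actual_bytes)
    print(f"actual:   {actual_bytes}")
    print(f"reported: {reported_bytes}")
```

Under this assumption, any byte count of 4 GiB or more is reported as its remainder modulo 2^32, while smaller counts are unaffected.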
Using 2 LDM threads on a 2-node cluster with 10 threads per node could result in a partition imbalance, such that one of the LDM threads on each node was the primary for zero fragments. Trying to restore a multi-threaded backup from this cluster failed because the data file for one LDM contained only the 12-byte data file header, which ndb_restore was unable to read. The same problem could occur in other cases, such as when taking a backup immediately after adding an empty node online.
It was found that this occurred when ODirect was enabled for an EOF backup data file write whose size was less than 512 bytes and the backup was in the STOPPING state. This normally occurs only for an aborted backup, but could also happen for a successful backup for which an LDM had no fragments. We fix the issue by introducing an additional check to ensure that writes are skipped only if the backup actually contains an error which should cause it to abort. (Bug #29892660)
References: See also: Bug #30371389.
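The condition described above can be modeled with a short Python sketch. This is a simplified illustration of the decision logic as stated in this note, not the actual NDB source code; the function names, parameters, and the constant name are hypothetical.

```python
SMALL_WRITE_LIMIT = 512  # byte threshold mentioned in this note


def skip_write_before_fix(odirect, is_eof, size, state):
    """Model of the old behavior: a small EOF write was skipped while STOPPING."""
    return odirect and is_eof and size < SMALL_WRITE_LIMIT and state == "STOPPING"


def skip_write_after_fix(odirect, is_eof, size, state, has_error):
    """Model of the fix: skip only if the backup also contains an error."""
    return skip_write_before_fix(odirect, is_eof, size, state) and has_error


if __name__ == "__main__":
    # A successful backup for an LDM with no fragments: only the 12-byte header.
    print(skip_write_before_fix(True, True, 12, "STOPPING"))
    print(skip_write_after_fix(True, True, 12, "STOPPING", has_error=False))
```

In this model, the header-only data file of a successful backup is no longer skipped, while the write for an aborted backup is still skipped.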
In some cases the SignalSender class, used as part of the implementation of ndb_mgmd and ndbinfo, buffered excessive numbers of unneeded API_REGCONF signals, leading to unnecessary consumption of memory. (Bug #29520353)
References: See also: Bug #20075747, Bug #29474136.
The setting for the BackupLogBufferSize configuration parameter was not honored. (Bug #29415012)
The maximum global checkpoint (GCP) commit lag and GCP save timeout are recalculated whenever a node shuts down, to take into account the change in number of data nodes. This could lead to the unintentional shutdown of a viable node when the threshold decreased below the previous value. (Bug #27664092)
References: See also: Bug #26364729.
A transaction which inserts a child row may run concurrently with a transaction which deletes the parent row for that child. One of the transactions should be aborted in this case, lest an orphaned child row result.
Before committing an insert on a child row, a read of the parent row is triggered to confirm that the parent exists. Similarly, before committing a delete on a parent row, a read or scan is performed to confirm that no child rows exist. When insert and delete transactions were run concurrently, their prepare and commit operations could interact in such a way that both transactions committed. This occurred because the triggered reads were performed using
LM_CommittedRead locks (see NdbOperation::LockMode), which are not strong enough to prevent such error scenarios.
This problem is fixed by using the stronger LM_SimpleRead lock mode for both triggered reads. The use of LM_SimpleRead locks ensures that at least one transaction aborts in every possible scenario involving transactions which concurrently insert into child rows and delete from parent rows. (Bug #22180583)