-
Important Change: It is now possible to divide a backup into slices and to restore these in parallel using two new options implemented for the ndb_restore utility, making it possible to employ multiple instances of ndb_restore to restore subsets of roughly the same size of the backup in parallel, which should help to reduce the length of time required to restore an NDB Cluster from backup.
The
--num-slices
option determines the number of slices into which the backup should be divided; --slice-id
provides the ID of the slice (0 to 1 less than the number of slices) to be restored by ndb_restore.
Up to 1024 slices are supported.
For more information, see the descriptions of the
--num-slices
and --slice-id
options. (Bug #30383937, WL #10691)
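As an illustration of how the two options fit together, the following minimal sketch builds one ndb_restore command line per slice, so that each can be started as a separate process. The helper name slice_restore_commands, the node ID, the backup ID, and the backup path are hypothetical values chosen for the example; --nodeid, --backupid, --restore-data, and --backup-path are existing ndb_restore options, but the combination appropriate for a given restore depends on the cluster.

```python
import shlex


def slice_restore_commands(num_slices, node_id, backup_id, backup_path):
    """Build one ndb_restore command line per slice of one node's backup.

    Hypothetical helper for illustration only; it builds strings and
    does not invoke ndb_restore.
    """
    # Up to 1024 slices are supported; at least one slice is needed.
    if not 1 <= num_slices <= 1024:
        raise ValueError("num_slices must be between 1 and 1024")
    commands = []
    for slice_id in range(num_slices):  # valid IDs are 0 .. num_slices - 1
        argv = [
            "ndb_restore",
            f"--nodeid={node_id}",
            f"--backupid={backup_id}",
            "--restore-data",
            f"--backup-path={backup_path}",
            f"--num-slices={num_slices}",
            f"--slice-id={slice_id}",
        ]
        commands.append(shlex.join(argv))
    return commands


if __name__ == "__main__":
    # Example values (node 1, backup 10, path) are assumptions.
    for cmd in slice_restore_commands(4, 1, 10, "/backups/BACKUP-10"):
        print(cmd)
```

Each printed command restores a different slice of the same backup; running them concurrently is what allows the restore to proceed in parallel.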
-
Incompatible Change: The minimum value for the
RedoOverCommitCounter
data node configuration parameter has been increased from 0 to 1. The minimum value for the RedoOverCommitLimit
data node configuration parameter has also been increased from 0 to 1.
You should check the cluster global configuration file and make any necessary adjustments to values set for these parameters before upgrading. (Bug #29752703)
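A pre-upgrade check for this change could be sketched as follows. The function name find_zero_settings is hypothetical, and the sketch assumes that settings appear in the global configuration file as simple "Name = value" lines; it compares parameter names case-insensitively as a precaution. It only reports candidate lines and does not modify the file.

```python
import re

# Parameters whose minimum value was raised from 0 to 1.
PARAMS = ("RedoOverCommitCounter", "RedoOverCommitLimit")

# Assumed line shape: Name = <integer>, optionally followed by a # comment.
_SETTING = re.compile(r"^\s*(\w+)\s*=\s*(\d+)\s*(?:#.*)?$")


def find_zero_settings(config_text):
    """Return (line_number, name) for each listed parameter set to 0."""
    wanted = {p.lower() for p in PARAMS}
    hits = []
    for lineno, line in enumerate(config_text.splitlines(), start=1):
        m = _SETTING.match(line)
        if m and m.group(1).lower() in wanted and int(m.group(2)) == 0:
            hits.append((lineno, m.group(1)))
    return hits


if __name__ == "__main__":
    # Hypothetical configuration fragment used only for the example.
    sample = "[ndbd default]\nRedoOverCommitCounter = 0\nRedoOverCommitLimit = 20\n"
    for lineno, name in find_zero_settings(sample):
        print(f"line {lineno}: {name} is 0; the minimum is now 1")
```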
Microsoft Windows; NDB Disk Data: On Windows, restarting a data node other than the master when using Disk Data tables led to a failure in
TSMAN
. (Bug #97436, Bug #30484272)
NDB Disk Data: Compatibility code for the Version 1 disk format used prior to the introduction of the Version 2 format in NDB 7.6 turned out not to be necessary, and is no longer used.
A faulty
ndbrequire()
introduced when implementing partial local checkpoints assumed that m_participatingLQH
must be clear when receiving START_LCP_REQ
, which is not necessarily true when a failure happens for the master after sending START_LCP_REQ
and before handling any START_LCP_CONF
signals. (Bug #30523457)
A local checkpoint sometimes hung when the master node failed while sending an
LCP_COMPLETE_REP
signal and it was sent to some nodes, but not all of them. (Bug #30520818)
Execution of ndb_restore
--rebuild-indexes
together with the --rewrite-database
and --exclude-missing-tables
options did not create indexes for any tables in the target database. (Bug #30411122)
When synchronizing extent pages it was possible for the current local checkpoint (LCP) to stall indefinitely if a
CONTINUEB
signal for handling the LCP was still outstanding when receiving the FSWRITECONF
signal for the last page written in the extent synchronization page. The LCP could also be restarted if another page was written from the data pages. It was also possible that this issue caused PREP_LCP
pages to be written at times when they should not have been. (Bug #30397083)
-
If a transaction was aborted while getting a page from the disk page buffer and the disk system was overloaded, the transaction hung indefinitely. This could also cause restarts to hang and node failure handling to fail. (Bug #30397083, Bug #30360681)
References: See also: Bug #30152258.
Data node failures with the error Another node failed during system restart... occurred during a partial restart. (Bug #30368622)
If a
SYNC_EXTENT_PAGES_REQ
signal was received by PGMAN
while dropping a log file group as part of a partial local checkpoint, and thus dropping the page locked by this block for processing next, the LCP terminated due to trying to access the page after it had already been dropped. (Bug #30305315)
-
The wrong number of bytes was reported in the cluster log for a completed local checkpoint. (Bug #30274618)
References: See also: Bug #29942998.
The number of data bytes for the summary event written in the cluster log when a backup completed was truncated to 32 bits, so that there was a significant mismatch between the number of log records and the number of data records printed in the log for this event. (Bug #29942998)
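The effect of truncating a byte count to 32 bits can be shown with a small arithmetic sketch. The function name and the 5 GiB figure are hypothetical; the code only demonstrates the truncation itself, not how the cluster log event is produced.

```python
def truncate_to_32_bits(value):
    """Keep only the low 32 bits of a non-negative integer."""
    return value & 0xFFFFFFFF


# Hypothetical backup size of 5 GiB, which does not fit in 32 bits.
data_bytes = 5 * 1024**3
reported = truncate_to_32_bits(data_bytes)

print(data_bytes)  # 5368709120
print(reported)    # 1073741824, that is, 1 GiB instead of 5 GiB
```

Any count of 4 GiB or more loses its high-order bits in this way, which is why the reported data bytes could differ so widely from the actual figure.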
-
Using 2 LDM threads on a 2-node cluster with 10 threads per node could result in a partition imbalance, such that one of the LDM threads on each node was the primary for zero fragments. Trying to restore a multi-threaded backup from this cluster failed because the datafile for one LDM contained only the 12-byte data file header, which ndb_restore was unable to read. The same problem could occur in other cases, such as when taking a backup immediately after adding an empty node online.
It was found that this occurred when
ODirect
was enabled for an EOF backup data file write whose size was less than 512 bytes and the backup was in the STOPPING
state. This normally occurs only for an aborted backup, but could also happen for a successful backup for which an LDM had no fragments. We fix the issue by introducing an additional check to ensure that writes are skipped only if the backup actually contains an error which should cause it to abort. (Bug #29892660)
References: See also: Bug #30371389.
-
In some cases the
SignalSender
class, used as part of the implementation of ndb_mgmd and ndbinfo
, buffered excessive numbers of unneeded SUB_GCP_COMPLETE_REP
and API_REGCONF
signals, leading to unnecessary consumption of memory. (Bug #29520353)
References: See also: Bug #20075747, Bug #29474136.
The setting for the
BackupLogBufferSize
configuration parameter was not honored. (Bug #29415012)
-
The maximum global checkpoint (GCP) commit lag and GCP save timeout are recalculated whenever a node shuts down, to take into account the change in number of data nodes. This could lead to the unintentional shutdown of a viable node when the threshold decreased below the previous value. (Bug #27664092)
References: See also: Bug #26364729.
-
A transaction which inserts a child row may run concurrently with a transaction which deletes the parent row for that child. One of the transactions should be aborted in this case, lest an orphaned child row result.
Before committing an insert on a child row, a read of the parent row is triggered to confirm that the parent exists. Similarly, before committing a delete on a parent row, a read or scan is performed to confirm that no child rows exist. When insert and delete transactions were run concurrently, their prepare and commit operations could interact in such a way that both transactions committed. This occurred because the triggered reads were performed using
LM_CommittedRead
locks (see NdbOperation::LockMode
), which are not strong enough to prevent such error scenarios.
This problem is fixed by using the stronger
LM_SimpleRead
lock mode for both triggered reads. The use of LM_SimpleRead
rather than LM_CommittedRead
locks ensures that at least one transaction aborts in every possible scenario involving transactions which concurrently insert into child rows and delete from parent rows. (Bug #22180583)
Concurrent
SELECT
and ALTER TABLE
statements on the same SQL node could sometimes block one another while waiting for locks to be released. (Bug #17812505, Bug #30383887)
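The interaction described in the entry for Bug #22180583 (concurrent insert of a child row and delete of its parent row) can be illustrated with a toy model. This is a sketch under stated assumptions and not NDB's implementation: the ToyStore class, the row names, and the transaction names are hypothetical; both triggered reads are assumed to run before either commit is applied; and a lock conflict is treated as an immediate abort, which is a simplification. A read with locking=False stands in for a committed read that ignores other transactions' locks, and locking=True stands in for a read that must respect them.

```python
class Conflict(Exception):
    """Raised when a locking read meets another transaction's exclusive lock."""


class ToyStore:
    def __init__(self):
        self.committed = {"parent:1"}  # rows visible as committed
        self.locks = {}                # row -> transaction holding an exclusive lock

    def read(self, txn, row, locking):
        holder = self.locks.get(row)
        if locking and holder is not None and holder != txn:
            raise Conflict(f"{txn} blocked on {row} by {holder}")
        return row in self.committed   # committed view only


def run(locking):
    store = ToyStore()
    # Prepare: each transaction locks the row it is about to change.
    store.locks["child:1"] = "insert_txn"
    store.locks["parent:1"] = "delete_txn"
    outcome = {}
    # Triggered read for the insert: the parent row must exist.
    try:
        parent_exists = store.read("insert_txn", "parent:1", locking)
        outcome["insert_txn"] = "commit" if parent_exists else "abort"
    except Conflict:
        outcome["insert_txn"] = "abort"
    if outcome["insert_txn"] == "abort":
        del store.locks["child:1"]     # an aborted transaction releases its lock
    # Triggered read for the delete: no child row may exist.
    try:
        child_exists = store.read("delete_txn", "child:1", locking)
        outcome["delete_txn"] = "abort" if child_exists else "commit"
    except Conflict:
        outcome["delete_txn"] = "abort"
    return outcome


print(run(locking=False))  # both commit: the child row is orphaned
print(run(locking=True))   # the insert aborts, the delete commits
```

In the model, the non-locking reads each see only committed state, so neither transaction notices the other; the locking read detects the conflicting exclusive lock, so at least one transaction aborts.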