Incompatible Change; NDB Disk Data: Due to changes in disk file formats, it is necessary to perform an
--initial restart of each data node when upgrading to or downgrading from this release.
Important Change; NDB Disk Data: NDB Cluster has improved node restart times and overall performance with larger data sets by implementing partial local checkpoints (LCPs). Prior to this release, an LCP always made a copy of the entire database.
NDB now supports LCPs that write individual records, so it is no longer strictly necessary for an LCP to write the entire database. Since, at recovery, it remains necessary to restore the database fully, the strategy is to save one fourth of all records at each LCP, as well as to write the records that have changed since the last LCP.
Two data node configuration parameters relating to this change are introduced in this release:
EnablePartialLcp (default true, or enabled) enables partial LCPs. When partial LCPs are enabled,
RecoveryWork controls the percentage of space given over to LCPs; it increases with the amount of work which must be performed on LCPs during restarts, as opposed to that performed during normal operations. Raising this value causes LCPs during normal operations to require writing fewer records, and so decreases the usual workload; raising it also means that restarts can take longer.

Important
Upgrading to NDB 7.6.4 or downgrading from this release requires purging and then re-creating the NDB data node file system, which means that an initial restart of each data node is needed. An initial node restart still requires a complete LCP; a partial LCP is not used for this purpose.
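The two parameters can be set in the cluster global configuration file; this is a minimal sketch only, and the values shown are illustrative rather than recommendations:

```ini
[ndbd default]
# Enable partial local checkpoints
EnablePartialLcp=true
# Percentage of space given over to LCPs; a higher value means
# fewer records written per LCP during normal operation, but
# potentially longer restarts (value shown is illustrative)
RecoveryWork=60
```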
A rolling restart or system restart is a normal part of an NDB software upgrade. When such a restart is performed as part of an upgrade to NDB 7.6.4 or later, any existing LCP files are checked for the presence of the LCP sysfile, which indicates that the existing data node file system was written using NDB 7.6.4 or later. If such a node file system exists but does not contain the sysfile, and if any data nodes are restarted without the --initial option, NDB causes the restart to fail with an appropriate error message. This detection can be performed only as part of an upgrade; it is not possible to do so as part of a downgrade to NDB 7.6.3 or earlier from a later release.
Exception: If there are no data node files—that is, in the event of a “clean” start or restart—using --initial is not required for a software upgrade, since this is already equivalent to an initial restart. (This aspect of restarts is unchanged from previous releases of NDB Cluster.)
This release also deprecates the data node configuration parameters BackupDataBufferSize, BackupWriteSize, and BackupMaxWriteSize; these are now subject to removal in a future NDB Cluster version. (Bug #27308632)
Important Change: Added the ndb_perror utility for obtaining information about NDB Cluster error codes. This tool replaces perror --ndb; the --ndb option for perror is now deprecated and raises a warning when used, and is subject to removal in a future NDB version.
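As a usage sketch (the error code chosen is arbitrary), the new utility is invoked with an NDB error code as its argument:

```shell
$ ndb_perror 321
```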
See ndb_perror — Obtain NDB Error Message Information, for more information. (Bug #81703, Bug #81704, Bug #23523869, Bug #23523926)
References: See also: Bug #26966826, Bug #88086.
It is now possible to specify a set of cores to be used for I/O threads performing offline multithreaded builds of ordered indexes, as opposed to normal I/O duties such as file I/O, compression, or decompression. “Offline” in this context refers to building of ordered indexes performed when the parent table is not being written to; such building takes place when an
NDB cluster performs a node or system restart, or as part of restoring a cluster from backup using ndb_restore --rebuild-indexes.
In addition, the default behavior for offline index build work is modified to use all cores available to ndbmtd, rather than limiting itself to the core reserved for the I/O thread. Doing so can improve restart and restore times, performance, availability, and the user experience.
This enhancement is implemented as follows:
The default value for BuildIndexThreads is changed from 0 to 128. This means that offline ordered index builds are now multithreaded by default.
The default value for TwoPassInitialNodeRestartCopy is changed from false to true. This means that an initial node restart first copies all data from a “live” node to the one that is starting—without creating any indexes—then builds ordered indexes offline, and finally synchronizes its data again with the live node; that is, it synchronizes twice and builds indexes offline between the two synchronizations. This causes an initial node restart to behave more like the normal restart of a node, and reduces the time required for building indexes.
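As a configuration sketch, the two new defaults written out explicitly in config.ini look like this (sites wanting the previous behavior would instead set 0 and false, respectively):

```ini
[ndbd default]
# New NDB 7.6.4 defaults, spelled out explicitly
BuildIndexThreads=128
TwoPassInitialNodeRestartCopy=true
```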
A new thread type (idxbld) is defined for the ThreadConfig configuration parameter, to allow locking of offline index build threads to specific CPUs.
NDB now distinguishes the thread types that are accessible to ThreadConfig by the following two criteria:
Whether the thread is an execution thread. Threads of types main, ldm, recv, rep, tc, and send are execution threads; thread types io, watchdog, and idxbld refer to other threads.
Whether the allocation of the thread to a given task is permanent or temporary. Currently all thread types except idxbld have permanent allocation.
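As a hedged sketch, the new idxbld type can be bound to CPUs in ThreadConfig like the other types; the thread counts and CPU numbers shown here are illustrative only:

```ini
ThreadConfig=main={cpubind=0},ldm={count=4,cpubind=1,2,3,4},idxbld={cpuset=5,6,7}
```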
For additional information, see the descriptions of the parameters in the Manual. (Bug #25835748, Bug #26928111)
Added the ODirectSyncFlag configuration parameter for data nodes. When enabled, the data node treats all completed filesystem writes to the redo log as though they had been performed using fsync.
This parameter has no effect if at least one of the following conditions is true:
ODirect is not enabled.
InitFragmentLogFiles is set to
Added the ndbinfo.error_messages table, which provides information about NDB Cluster errors, including error codes, status types, brief descriptions, and classifications. This makes it possible to obtain error information using SQL in the mysql client (or another MySQL client program), like this:
mysql> SELECT * FROM ndbinfo.error_messages WHERE error_code='321';
+------------+----------------------+-----------------+----------------------+
| error_code | error_description    | error_status    | error_classification |
+------------+----------------------+-----------------+----------------------+
|        321 | Invalid nodegroup id | Permanent error | Application error    |
+------------+----------------------+-----------------+----------------------+
1 row in set (0.00 sec)
ThreadConfig now has an additional nosend parameter that can be used to prevent a main, ldm, recv, rep, or tc thread from assisting the send threads, by setting this parameter to 1 for the given thread. By default, nosend is 0. It cannot be used with threads other than those of the types just listed.
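For example, a ThreadConfig setting such as the following keeps the tc threads from performing send assistance (the thread counts, and the choice of tc rather than another type, are illustrative):

```ini
ThreadConfig=ldm={count=4},tc={count=2,nosend=1}
```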
When executing a scan as a pushed join, all instances of DBSPJ were involved in the execution of a single query, and some of these received multiple requests from the same query. This situation is improved by enabling a single SPJ request to handle a set of root fragments to be scanned, so that only a single SPJ request is sent to each DBSPJ instance on each node. Since batch sizes are allocated per fragment, a multi-fragment scan can obtain a larger total batch size, allowing for some scheduling optimizations to be done within DBSPJ, which can scan a single fragment at a time (giving it the total batch size allocation), scan all fragments in parallel using smaller sub-batches, or some combination of the two.
Since the effect of this change is generally to require fewer SPJ requests and instances, performance of pushed-down joins should be improved in many cases.
As part of ongoing work to optimize bulk DDL performance by ndbmtd, it is now possible to obtain performance improvements by increasing the batch size for the bulk data parts of DDL operations which process all of the data in a fragment or set of fragments using a scan. Batch sizes are now made configurable for unique index builds, foreign key builds, and online reorganization, by setting the respective data node configuration parameters listed here:
For each of the parameters just listed, the default value is 64, the minimum is 16, and the maximum is 512.
Increasing the appropriate batch size or sizes can help amortize inter-thread and inter-node latencies and make use of more parallel resources (local and remote) to help scale DDL performance.
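As a hedged sketch, assuming the three parameters referred to above are MaxUIBuildBatchSize, MaxFKBuildBatchSize, and MaxReorgBuildBatchSize (and that doubling the default batch size suits the hardware), the settings might look like this in config.ini:

```ini
[ndbd default]
# Assumed parameter names; values doubled from the default of 64
MaxUIBuildBatchSize=128
MaxFKBuildBatchSize=128
MaxReorgBuildBatchSize=128
```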
Formerly, the data node LGMAN kernel block processed undo log records serially; now this is done in parallel. Previously, the rep thread, which hands off undo records to local data manager (LDM) threads, waited for an LDM to finish applying a record before fetching the next one; now the rep thread no longer waits, but proceeds immediately to the next record and LDM.
There are no user-visible changes in functionality directly associated with this work; this performance enhancement is part of the work being done in NDB 7.6 to improve undo log handling for partial local checkpoints.
When applying an undo log, the table ID and fragment ID are obtained from the page ID. This was done by reading the page from PGMAN using an extra PGMAN worker thread, but when applying the undo log it was necessary to read the page again.
This became very inefficient when using ODirect, since the page was not cached in the OS kernel.
Mapping from page ID to table ID and fragment ID is now done using information the extent header contains about the table IDs and fragment IDs of the pages used in a given extent. Since the extent pages are always present in the page cache, no extra disk reads are required to perform the mapping, and the information can be read using existing
Added the NODELOG DEBUG command in the ndb_mgm client to provide runtime control over data node debug logging. NODELOG DEBUG ON causes a data node to write extra debugging information to its node log, the same as if the node had been started with --verbose; NODELOG DEBUG OFF disables the extra logging.
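A usage sketch in the ndb_mgm client (node ID 5 is hypothetical; ALL addresses every data node):

```shell
ndb_mgm> 5 NODELOG DEBUG ON
ndb_mgm> ALL NODELOG DEBUG ON
ndb_mgm> ALL NODELOG DEBUG OFF
```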
Added the LocationDomainId configuration parameter for management, data, and API nodes. When using NDB Cluster in a cloud environment, you can set this parameter to assign a node to a given availability domain or availability zone. This can improve performance in the following ways:
If requested data is not found on the same node, reads can be directed to another node in the same availability domain.
Communication between nodes in different availability domains is guaranteed to use NDB transporters' WAN support without any further manual intervention.
The transporter's group number can be based on which availability domain is used, so that SQL and other API nodes also communicate with local data nodes in the same availability domain whenever possible.
The arbitrator can be selected from an availability domain in which no data nodes are present, or, if no such availability domain can be found, from a third availability domain.
This parameter takes an integer value between 0 and 16, with 0 being the default; using 0 is the same as leaving the parameter unset.
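A config.ini sketch assigning nodes to availability domains (the host names and domain IDs are hypothetical):

```ini
[ndbd]
HostName=host1
LocationDomainId=1

[ndbd]
HostName=host2
LocationDomainId=2

[mysqld]
HostName=host3
LocationDomainId=1
```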
References: See also: Bug #86615, Bug #26236320, Bug #26907833.
NDB Disk Data: An ALTER TABLE that switched the table storage format between DISK and MEMORY was always performed in place for all columns. This is not correct in the case of a column whose storage format is inherited from the table; such a column's storage type was not changed.
For example, this statement creates a table t1 whose column c2 uses in-memory storage, since the table does so implicitly:

CREATE TABLE t1 (c1 INT PRIMARY KEY, c2 INT) ENGINE NDB;
The ALTER TABLE statement shown here is expected to cause c2 to be stored on disk, but failed to do so:
ALTER TABLE t1 STORAGE DISK TABLESPACE ts1;
Similarly, an on-disk column that inherited its storage format from the table to which it belonged did not have the format changed by
ALTER TABLE ... STORAGE MEMORY.
These two cases are now performed as a copying alter, and the storage format of the affected column is now changed. (Bug #26764270)
Errors in parsing
NDB_TABLE modifiers could cause memory leaks. (Bug #26724559)
Added DUMP code 7027 to facilitate testing of issues relating to local checkpoints. For more information, see DUMP 7027. (Bug #26661468)
A previous fix intended to improve logging of node failure handling in the transaction coordinator included logging of transactions that could occur in normal operation, which made the resulting logs needlessly verbose. Such normal transactions are no longer written to the log in such cases. (Bug #26568782)
References: This issue is a regression of: Bug #26364729.
Due to a configuration file error, CPU locking capability was not available on builds for Linux platforms. (Bug #26378589)
DUMP codes used for the LGMAN kernel block were incorrectly assigned numbers in the range used for codes belonging to DBTUX. These have now been assigned symbolic constants and numbers in the proper range (10001, 10002, and 10003). (Bug #26365433)
Node failure handling in the DBTC kernel block consists of a number of tasks which execute concurrently, all of which must complete before TC node failure handling is complete. This fix extends logging coverage to record when each task completes and which tasks remain, and includes the following improvements:
Handling of interactions between GCP and node failure handling, in which TC takeover causes a GCP participant stall at the master TC, allowing it to extend the current GCI with any transactions that were taken over; the stall can begin and end in different GCP protocol states. Logging coverage is extended to cover all such scenarios, and debug logging is now more consistent and understandable to users.
Logging done by the QMGR block as it monitors the duration of node failure handling is performed more frequently. A warning log is now generated every 30 seconds (instead of 1 minute), and this now includes DBDIH block debug information (formerly this was written separately, and less often).
To reduce the space used, DBTC instance is shortened to
A new error code is added to assist testing.
During a restart,
DBLQH loads redo log part metadata for each redo log part it manages from one or more redo log files. Since each file has a limited capacity for metadata, the number of files which must be consulted depends on the size of the redo log part. These files are opened, read, and closed sequentially, but the closing of one file occurs concurrently with the opening of the next.
In cases where closing of the file was slow, it was possible for more than 4 files per redo log part to be open concurrently; since these files were opened using the
OM_WRITE_BUFFER option, more than 4 chunks of write buffer were allocated per part in such cases. The write buffer pool is not unlimited; if all redo log parts were in a similar state, the pool was exhausted, causing the data node to shut down.
This issue is resolved by avoiding the use of
OM_WRITE_BUFFER during metadata reload, so that any transient opening of more than 4 redo log files per log file part no longer leads to failure of the data node. (Bug #25965370)
Following TRUNCATE TABLE on an NDB table, the AUTO_INCREMENT ID was not reset on an SQL node not performing binary logging. (Bug #14845851)
A join entirely within the materialized part of a semijoin was not pushed even if it could have been. In addition,
EXPLAIN provided no information about why the join was not pushed. (Bug #88224, Bug #27022925)
References: See also: Bug #27067538.
When the duplicate weedout algorithm was used for evaluating a semijoin, the result had missing rows. (Bug #88117, Bug #26984919)
References: See also: Bug #87992, Bug #26926666.
A table used in a loose scan could be used as a child in a pushed join query, leading to possibly incorrect results. (Bug #87992, Bug #26926666)
When representing a materialized semijoin in the query plan, the MySQL optimizer inserted extra JOIN_TAB objects to represent access to the materialized subquery result. The join pushdown analyzer did not properly set up its internal data structures for these, leaving them uninitialized instead. This meant that later usage of any item objects referencing the materialized semijoin accessed an uninitialized tableno column when accessing a 64-bit tableno bitmask, possibly referring to a point beyond its end, leading to an unplanned shutdown of the SQL node. (Bug #87971, Bug #26919289)
In some cases, a SCAN_FRAGCONF signal was received after a SCAN_FRAGREQ with a close flag had already been sent, clearing the timer. When this occurred, the next SCAN_FRAGREF to arrive caused time tracking to fail. Now in such cases, a check for a cleared timer is performed prior to processing the SCAN_FRAGREF message. (Bug #87942, Bug #26908347)
While deleting an element in Dbacc, or moving it during hash table expansion or reduction, the method used (getLastAndRemove()) could return a reference to a removed element on a released page, which could later be referenced from the functions calling it. This was due to a change brought about by the implementation of dynamic index memory in NDB 7.6.2; previously, the page had always belonged to a single Dbacc instance, so accessing it was safe. This was no longer the case following the change: a page released in Dbacc could be placed directly into the global page pool, where any other thread could then allocate it.
Now we make sure that newly released pages in Dbacc are kept within the current Dbacc instance and not given over directly to the global page pool. In addition, the reference to a released page has been removed; the affected internal method now returns the last element by value, rather than by reference. (Bug #87932, Bug #26906640)
References: See also: Bug #87987, Bug #26925595.
The DBTC kernel block could receive a TCRELEASEREQ signal in a state for which it was unprepared. Now in such cases it responds with a TCRELEASECONF message, and subsequently behaves just as if the API connection had failed. (Bug #87838, Bug #26847666)
References: See also: Bug #20981491.
When a data node was configured for locking threads to CPUs, it failed during startup with Failed to lock tid.
This was a side effect of a fix for a previous issue, which disabled CPU locking based on the version of the available glibc. The specific glibc issue being guarded against is encountered only in response to an internal NDB API call (Ndb_UnlockCPU()) not used by data nodes (and which can be accessed only through internal API calls). The current fix enables CPU locking for data nodes and disables it only for the relevant API calls when an affected glibc version is used. (Bug #87683, Bug #26758939)
References: This issue is a regression of: Bug #86892, Bug #26378589.
ndb_top failed to build on platforms where the ncurses library did not define stdscr. Now these platforms require the tinfo library to be included. (Bug #87185, Bug #26524441)
On completion of a local checkpoint, every node sends an LCP_COMPLETE_REP signal to every other node in the cluster; a node does not consider the LCP complete until it has been notified that all other nodes have sent this signal. Due to a minor flaw in the LCP protocol, if this message was delayed from a node other than the master, it was possible to start the next LCP before one or more nodes had completed the ongoing one; this caused LCP_COMPLETE_REP signals from previous LCPs to become mixed up with such signals from the current LCP, which in turn led to node failures.
To fix this problem, we now ensure that the previous LCP is complete before responding to any TCGETOPSIZEREQ signal initiating a new LCP. (Bug #87184, Bug #26524096)
NDB Cluster did not compile successfully when the build used
WITH_UNIT_TESTS=OFF. (Bug #86881, Bug #26375985)
Recent improvements in local checkpoint handling that use
OM_CREATE to open files did not work correctly on Windows platforms, where the system tried to create a new file and failed if it already existed. (Bug #86776, Bug #26321303)
A potential hundredfold signal fan-out when sending a
START_FRAG_REQ signal could lead to a node failure due to a job buffer full error in start phase 5, while trying to perform a local checkpoint during a restart. (Bug #86675, Bug #26263397)
References: See also: Bug #26288247, Bug #26279522.
Compilation of NDB Cluster failed when using
-DWITHOUT_SERVER=1 to build only the client libraries. (Bug #85524, Bug #25741111)
The OM_SYNC flag is intended to make sure that all FSWRITEREQ signals used for a given file are synchronized, but it was ignored by platforms that do not support O_SYNC, meaning that this feature did not behave properly on those platforms. Now the synchronization flag is used on those platforms that do not support O_SYNC. (Bug #76975, Bug #21049554)