NDB Cluster APIs: Reuse of transaction IDs could occur when
Ndb objects were created and deleted concurrently. As part of this fix, the NDB API methods
unlock_ndb_objects are now declared as
const. (Bug #23709232)
NDB Cluster APIs: When the management server was restarted while running an MGM API application that continuously monitored events, subsequent events were not reported to the application, with timeouts being returned indefinitely instead of an error.
This occurred because sockets for event listeners were not closed when restarting mgmd. This is fixed by ensuring that event listener sockets are closed when the management server shuts down, causing applications using functions such as
ndb_logevent_get_next() to receive a read error following the restart. (Bug #19474782)
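The behavior described above suggests that a monitoring loop should treat a timeout and an error differently. The following is a minimal Python sketch, not MGM API code: monitor, get_next, and reconnect are hypothetical names, and get_next only imitates the return convention documented for ndb_logevent_get_next() (positive for an event, 0 for a timeout, negative for an error).

```python
def monitor(get_next, reconnect, max_polls):
    """Poll for events and reconnect when the listener reports an error.

    get_next() is a hypothetical stand-in that returns >0 when an event
    was read, 0 on timeout, and <0 on error.
    """
    events = 0
    reconnects = 0
    for _ in range(max_polls):
        rc = get_next()
        if rc > 0:
            events += 1          # an event was delivered
        elif rc == 0:
            continue             # timeout: nothing to report, keep polling
        else:
            reconnect()          # read error, e.g. after a management server restart
            reconnects += 1
    return events, reconnects


# Simulated sequence: two events, a timeout, a read error, then one more event.
results = iter([1, 1, 0, -1, 1])
log = []
outcome = monitor(lambda: next(results), lambda: log.append("reconnect"), max_polls=5)
```

With the fix, an application sees the error branch after a restart instead of an endless run of timeouts, so it has a chance to re-create its event handle.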
Passing a nonexistent node ID to
CREATE NODEGROUP led to random data node failures. (Bug #23748958)
DROP TABLE followed by a node shutdown and subsequent master takeover, with the containing local checkpoint not yet complete prior to the takeover, caused the LCP to be ignored and, in some cases, the data node to fail. (Bug #23735996)
References: See also: Bug #23288252.
Removed an invalid assertion to the effect that all cascading child scans are closed at the time API connection records are released following an abort of the main transaction. The assertion was invalid because closing of scans in such cases is by design asynchronous with respect to the main transaction, which means that subscans may well take some time to close after the main transaction is closed. (Bug #23709284)
A number of potential buffer overflow issues were found and fixed in the
NDB codebase. (Bug #23152979)
The SIGNAL_DROPPED_REP handler invoked in response to long message buffer exhaustion was defined in the
SPJ kernel block, but not actually used. This meant that the default handler from
SimulatedBlock was used instead in such cases, which shut down the data node. (Bug #23048816)
References: See also: Bug #23251145, Bug #23251423.
When a data node has insufficient redo buffer during a system restart, it does not participate in the restart until after the other nodes have started. After this, it performs a takeover of its fragments from the nodes in its node group that have already started; during this time, the cluster is already running and user activity is possible, including DML and DDL operations.
During a system restart, table creation is handled differently in the
DIH kernel block than normally, as this creation actually consists of reloading table definition data from disk on the master node. Thus,
DIH assumed that any table creation that occurred before all nodes had restarted must be related to the restart and thus always on the master node. However, during the takeover, table creation can occur on non-master nodes due to user activity; when this happened, the cluster underwent a forced shutdown.
Now an extra check is made during system restarts to detect in such cases whether the executing node is the master node, and use that information to determine whether the table creation is part of the restart proper, or is taking place during a subsequent takeover. (Bug #23028418)
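The decision described above can be sketched as a small classification function. This is an illustrative Python simplification, not the DIH implementation; the function name, its parameters, and the returned labels are all hypothetical.

```python
def classify_table_creation(system_restart_in_progress, node_is_master):
    """Classify a table creation request seen by a node (hypothetical model).

    Before the fix, the model would have treated every creation seen during
    a system restart as a reload on the master node.
    """
    if not system_restart_in_progress:
        return "normal"                 # ordinary DDL outside a restart
    if node_is_master:
        return "restart_reload"         # reloading table definitions from disk
    return "takeover_user_ddl"          # user activity while a late node takes over


during_restart_on_master = classify_table_creation(True, True)
during_takeover_on_other = classify_table_creation(True, False)
```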
ndb_restore set the
MAX_ROWS attribute for a table for which it had not been set prior to taking the backup. (Bug #22904640)
Whenever data nodes are added to or dropped from the cluster, the
NDB kernel's Event API is notified of this using a
SUB_GCP_COMPLETE_REP signal with either the
ADD (add) flag or
SUB (drop) flag set, as well as the number of nodes to add or drop; this allows
NDB to maintain a correct count of
SUB_GCP_COMPLETE_REP signals pending for every incomplete bucket. In addition to handling the bucket for the epoch associated with the addition or removal, it must also compensate for any later incomplete buckets associated with later epochs. Although it was possible to complete such buckets out of order, there was no handling of these, leading to a stall in event reception.
This fix adds detection and handling of such out-of-order bucket completion. (Bug #20402364)
References: See also: Bug #82424, Bug #24399450.
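The bookkeeping described above can be illustrated with a toy model. The Python sketch below is an assumption-laden simplification and not the NDB Event API code: the Buckets class, its methods, and the idea of holding a completed bucket until all earlier epochs are complete are hypothetical choices made for illustration.

```python
class Buckets:
    """Toy model: one bucket per epoch, each with a count of pending signals."""

    def __init__(self, epochs, nodes):
        self.pending = {e: nodes for e in epochs}  # epoch -> signals still expected
        self.delivered = []                        # epochs handed on, in order

    def change_nodes(self, epoch, delta):
        # Nodes added (delta > 0) or dropped (delta < 0): adjust this bucket
        # and every later incomplete bucket.
        for e in self.pending:
            if e >= epoch:
                self.pending[e] += delta
        self._deliver()

    def signal(self, epoch):
        # One completion signal arrived for this epoch.
        self.pending[epoch] -= 1
        self._deliver()

    def _deliver(self):
        # Hand on completed buckets strictly in epoch order; a later bucket
        # that completed early waits here instead of stalling the stream.
        for e in sorted(self.pending):
            if self.pending[e] > 0:
                break
            self.delivered.append(e)
            del self.pending[e]


b = Buckets(epochs=[10, 11], nodes=2)
b.change_nodes(10, -1)          # one node dropped as of epoch 10
b.signal(11)                    # epoch 11 completes before epoch 10
held = list(b.delivered)
b.signal(10)                    # epoch 10 completes; both are released in order
released = list(b.delivered)
```

In this model, epoch 11 is complete after its single remaining signal but is held until epoch 10 also completes, after which both are released in epoch order.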
When restoring a backup taken from a database containing tables that had foreign keys, ndb_restore disabled the foreign keys for data, but not for the logs. (Bug #83155, Bug #24736950)
The count displayed by the
c_exec column in the
ndbinfo.threadstat table was incomplete. (Bug #82635, Bug #24482218)
The internal function
ndbcluster_binlog_wait(), which provides a way to make sure that all events originating from a given thread arrive in the binary log, is used by
SHOW BINLOG EVENTS as well as when resetting the binary log. This function waits on an injector condition while the latest global epoch handled by
NDB is more recent than the epoch last committed in this session, which implies that this condition must be signalled whenever the binary log thread completes and updates a new latest global epoch. Inspection of the code revealed that this condition signalling was missing, and that, instead of being awakened whenever a new latest global epoch completes (roughly every 100 ms), client threads waited for the maximum timeout (1 second).
This fix adds the missing injector condition signalling, while also changing it to a condition broadcast to make sure that all client threads are alerted. (Bug #82630, Bug #24481551)
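The difference between a missing signal and a broadcast can be shown with a small Python sketch using threading.Condition. It is only an analogy for the fix described above: EpochTracker, advance, and wait_for_epoch are hypothetical names, and the wait predicate is simplified to "the handled epoch has reached the one the client needs".

```python
import threading


class EpochTracker:
    """Hypothetical stand-in for the injector condition and latest epoch."""

    def __init__(self):
        self._cond = threading.Condition()
        self._latest_handled = 0

    def advance(self, epoch):
        # Called by the (simulated) binary log thread when an epoch completes.
        with self._cond:
            self._latest_handled = epoch
            self._cond.notify_all()   # broadcast: wake every waiting client

    def wait_for_epoch(self, epoch, timeout=1.0):
        # Returns True once the epoch is handled, False if the timeout expires.
        with self._cond:
            return self._cond.wait_for(
                lambda: self._latest_handled >= epoch, timeout=timeout
            )


tracker = EpochTracker()
results = []
results_lock = threading.Lock()


def client():
    ok = tracker.wait_for_epoch(5, timeout=5.0)
    with results_lock:
        results.append(ok)


threads = [threading.Thread(target=client) for _ in range(3)]
for t in threads:
    t.start()
tracker.advance(5)
for t in threads:
    t.join()
```

Without the notify_all() call, each waiter in this sketch would sleep until its timeout expired before rechecking the predicate, which mirrors the 1-second delay described above; with notify() instead of notify_all(), only one of several waiting clients would be woken promptly.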
During a node restart, a fragment can be restored using information obtained from local checkpoints (LCPs); up to 2 restorable LCPs are retained at any given time. When an LCP is reported to the
DIH kernel block as completed, but the node fails before the last global checkpoint index written into this LCP has actually completed, the latest LCP is not restorable. Although it should be possible to use the older LCP, it was instead assumed that no LCP existed for the fragment, which slowed the restart process. Now in such cases, the older, restorable LCP is used, which should help decrease long node restart times. (Bug #81894, Bug #23602217)
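The fallback described above can be sketched as a selection over the retained LCPs. This Python example is a simplified illustration, not the actual restore logic: the function name and the representation of an LCP as an (id, highest GCI written) pair are assumptions.

```python
def pick_restorable_lcp(lcps, last_completed_gci):
    """Return the id of the newest restorable LCP, or None if there is none.

    lcps is a list of (lcp_id, max_gci_written) pairs ordered newest first.
    An LCP counts as restorable here only if the highest GCI written into it
    had completed before the node failed.
    """
    for lcp_id, max_gci_written in lcps:
        if max_gci_written <= last_completed_gci:
            return lcp_id
    return None


# The newest LCP (id 8) refers to GCI 120, which had not completed at the time
# of the failure, so the older LCP (id 7) is chosen instead of giving up.
chosen = pick_restorable_lcp([(8, 120), (7, 100)], last_completed_gci=110)
```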
While a mysqld was waiting to connect to the management server during initialization of the
NDB handler, it was not possible to shut down the mysqld. If the mysqld was not able to make the connection, it could become stuck at this point. This was due to an internal wait condition in the utility and index statistics threads that could go unmet indefinitely. This condition has been augmented with a maximum timeout of 1 second, which makes it more likely that these threads terminate themselves properly in such cases.
In addition, the connection thread waiting for the management server connection performed 2 sleeps in the case just described, instead of 1 sleep, as intended. (Bug #81585, Bug #23343673)
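The effect of bounding the wait can be sketched in Python with threading.Event. This is a generic illustration of a timed wait that rechecks a shutdown flag, not the mysqld code; wait_for_setup and both event names are hypothetical, and the demo uses much shorter intervals than the 1-second maximum mentioned above.

```python
import threading


def wait_for_setup(setup_done, shutdown_requested, max_wait=1.0):
    """Wait for setup to finish, waking at least every max_wait seconds
    so that a pending shutdown request is noticed."""
    while not setup_done.is_set():
        if shutdown_requested.is_set():
            return "shutdown"
        setup_done.wait(timeout=max_wait)   # bounded wait instead of waiting forever
    return "ready"


setup_done = threading.Event()            # never set: the connection is never made
shutdown_requested = threading.Event()
timer = threading.Timer(0.05, shutdown_requested.set)
timer.start()
state = wait_for_setup(setup_done, shutdown_requested, max_wait=0.02)
timer.join()
```

With an unbounded wait, the loop in this sketch would never observe the shutdown request if setup never completed.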
The list of deferred tree node lookup requests created when preparing to abort a
DBSPJ request was not cleared when this was complete, which could lead to deferred operations being started even after the
DBSPJ request aborted. (Bug #81355, Bug #23251423)
References: See also: Bug #23048816.
Error and abort handling in
Dbspj::execTRANSID_AI() was implemented such that its
abort() method was called before processing of the incoming signal was complete. Since this method sends signals to the LDM, this partly overwrote the contents of the signal which was later required by
execTRANSID_AI(). This could result in aborted
DBSPJ requests cleaning up their allocated resources too early, or not at all. (Bug #81353, Bug #23251145)
References: See also: Bug #23048816.
Several object constructors and similar functions in the
NDB codebase did not always perform sanity checks when creating new instances. These checks are now performed under such circumstances. (Bug #77408, Bug #21286722)
An internal call to
malloc() was not checked for
NULL. The function call was replaced with a direct write. (Bug #77375, Bug #21271194)