If your cluster experiences a complete outage, you can reconfigure it using dba.rebootClusterFromCompleteOutage(). This operation enables you to connect to one of the cluster's MySQL instances and use its metadata to recover the cluster.
A complete outage means that Group Replication has stopped on all member instances.
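For example, if you can still connect to an instance in SQL mode, one way to check whether Group Replication has stopped there is to query its member state (a minimal check using the standard performance_schema table):
-- An empty result, or members reporting OFFLINE or ERROR, indicates
-- Group Replication is not running from this instance's point of view.
SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE
FROM performance_schema.replication_group_members;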
Ensure all cluster members are started before running dba.rebootClusterFromCompleteOutage(). The command fails if any of the cluster members are unreachable.
This check is ignored if the cluster is INVALIDATED and is a member of a ClusterSet.
Connect to the most up-to-date instance and run the following command:
JS> var cluster = dba.rebootClusterFromCompleteOutage()
If all members have the same GTID set, the member to which you are currently connected becomes the primary. See Selecting a Primary with rebootClusterFromCompleteOutage.
The dba.rebootClusterFromCompleteOutage() operation follows these steps to ensure the cluster is correctly reconfigured:
- Cluster metadata and the cluster topology are retrieved from the current instance.
- If a cluster member is in RECOVERING or ERROR state, and all other members are OFFLINE or ERROR, dba.rebootClusterFromCompleteOutage() attempts to stop Group Replication on that member. If Group Replication fails to stop, the command stops and displays an error.
- The InnoDB Cluster metadata found on the instance which MySQL Shell is currently connected to is checked to see if it contains the GTID superset. If the currently connected instance does not contain the GTID superset, the operation aborts with that information. See GTID Superset. If the instance contains the GTID superset, the cluster is recovered based on the metadata stored in that instance.
- MySQL Shell checks which instances of the cluster are currently reachable and fails if any member is currently unreachable. Note: It is possible to bypass this check with the force option, which reboots the cluster using the remaining contactable members. See Force Option.
- Similarly, MySQL Shell detects instances which are currently not reachable. It is not possible to add former members to, or remove them from, the cluster as part of the dba.rebootClusterFromCompleteOutage() command if they are currently unreachable.
- If enabled on the primary instance of the cluster, while in single-primary mode, super_read_only is disabled.
To reboot the cluster, you must connect to the member with the GTID superset, that is, the instance that had applied the most transactions before the outage.
To determine which member has the GTID superset, do one of the following:
- Connect to an instance and run dba.rebootClusterFromCompleteOutage() with dryRun: true. The generated report returns information similar to the following:
Switching over to instance '127.0.0.1:4001' to be used as seed.
This indicates the member with the GTID superset. Running dba.rebootClusterFromCompleteOutage() against a member with a lower GTID set results in an error.
- Connect to each instance in turn and run the following in SQL mode:
SHOW VARIABLES LIKE 'gtid_executed';
The instance which has applied the largest GTID set of transactions contains the GTID superset.
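To compare the values returned by two members, you can use MySQL's built-in GTID_SUBSET() function; the GTID values below are placeholder examples:
-- Returns 1 if the first set is contained in the second; the member whose
-- set contains every other member's set holds the GTID superset.
SELECT GTID_SUBSET('3e11fa47-71ca-11e1-9e33-c80aa9429562:1-56',
                   '3e11fa47-71ca-11e1-9e33-c80aa9429562:1-61') AS a_subset_of_b;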
It is possible to override this behavior, and use an instance with a lower GTID set, by running dba.rebootClusterFromCompleteOutage() with the force option. This makes the selected member the primary and discards any transactions not included in the selected member's GTID set.
If this process fails, and the cluster metadata has become badly corrupted, you might need to drop the metadata and create the cluster again from scratch. You can drop the cluster metadata using dba.dropMetadataSchema().
The dba.dropMetadataSchema() method should only be used as a last resort, when it is not possible to restore the cluster. It cannot be undone.
If you are using MySQL Router with the cluster, when you drop the metadata, all current connections are dropped and new connections are forbidden. This causes a full outage.
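As a last-resort sketch, after connecting to one of the instances you could drop the metadata and then create the cluster again from scratch; the force option here confirms the drop without an interactive prompt:
JS> dba.dropMetadataSchema({force: true})  // irreversibly removes the InnoDB Cluster metadata
JS> var cluster = dba.createCluster("myCluster")  // recreate the cluster on this instance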
dba.rebootClusterFromCompleteOutage() has the following options:
- force: true | false (default): If true, the operation is executed even if some members of the Cluster cannot be reached, or the primary instance selected has a diverging or lower GTID set. See Force Option.
- dryRun: true | false (default): Performs all validations and steps of the command, but no changes are made. A report is displayed when finished. See Testing rebootClusterFromCompleteOutage.
- primary: Instance definition representing the instance that must be selected as the primary. See Selecting a Primary with rebootClusterFromCompleteOutage.
- switchCommunicationStack: mysql | xcom: The Group Replication protocol stack to be used by the Cluster after the reboot. See Section 7.5.9, “Configuring the Group Replication Communication Stack”.
- ipAllowList: The list of hosts allowed to connect to the instance for Group Replication traffic when using the XCOM protocol stack.
- localAddress: String value with the Group Replication local address to use instead of the automatically generated one when using the XCOM protocol stack.
The force option enables you to ignore the availability of Cluster members or GTID-set divergence in the selected member and reboot the Cluster.
For example, rebooting the Cluster myCluster:
JS> var cluster = dba.rebootClusterFromCompleteOutage("myCluster",{force: true})
The force option is not permitted in the following situations:
- The Cluster belongs to a ClusterSet and is INVALIDATED, or the primary Cluster is not in global status OK.
- The Cluster belongs to a ClusterSet, is the primary Cluster, and is INVALIDATED.
It is not possible to add or rejoin instances with rebootClusterFromCompleteOutage. If you used force to ignore unreachable members and reboot your Cluster, you must use cluster.rejoinInstance() to add the unreachable members to the Cluster.
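For example, a minimal sketch of a forced reboot followed by rejoining a member that was unreachable during the reboot (the address is a placeholder):
JS> var cluster = dba.rebootClusterFromCompleteOutage("myCluster",{force: true})
JS> cluster.rejoinInstance("127.0.0.1:4002")  // rejoin the member once it is reachable again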
You can define the Cluster primary in one of the following ways:
- Define the primary option in the dba.rebootClusterFromCompleteOutage() command. For example, rebooting the Cluster myCluster and setting the member running on the local machine, on port 4001, as the primary:
var cluster = dba.rebootClusterFromCompleteOutage("myCluster",{primary: "127.0.0.1:4001"})
- Use the primary option together with the force option on a Cluster member with a lower GTID set than another member.
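For example, a sketch combining both options to make the member on port 4002, assumed here to have a lower GTID set, the primary; any transactions missing from its GTID set are discarded:
var cluster = dba.rebootClusterFromCompleteOutage("myCluster",{primary: "127.0.0.1:4002", force: true})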
You can test the changes by using the dryRun option. This option validates the command and its options and generates a log of results. An exception is thrown if there is a problem with the proposed changes.
The following example shows a dry run of rebooting the Cluster myCluster, setting the primary to the local member running on port 4001, and the log message it returns:
JS > var cluster = dba.rebootClusterFromCompleteOutage("myCluster",{primary: "127.0.0.1:4001", dryRun: true})
NOTE: dryRun option was specified. Validations will be executed, but no changes will be applied.
Cluster instances: '127.0.0.1:4000' (OFFLINE), '127.0.0.1:4001' (OFFLINE), '127.0.0.1:4002' (OFFLINE)
Switching over to instance '127.0.0.1:4001' to be used as seed.
dryRun finished.
rebootClusterFromCompleteOutage performs the following checks and generates a warning if the Cluster does not meet the requirements:
- Confirms the Replica Cluster was not forcibly removed from the ClusterSet.
- Confirms the ClusterSet's primary Cluster is reachable.
- Checks the Cluster for errant transactions which are not View Change Log Events (VCLE). See How Distributed Recovery Works.
- Confirms the Cluster's executed transaction set (GTID_EXECUTED) is not empty.
The command automatically rejoins a Replica Cluster to the ClusterSet, ensuring the ClusterSet replication channel is configured for all Cluster members.
You can switch the communication stack during a dba.rebootClusterFromCompleteOutage() operation. For example:
js> dba.rebootClusterFromCompleteOutage("testcluster", {switchCommunicationStack: "mysql"})
Switching from the MYSQL protocol to XCOM requires an additional network address for the localAddress and may also require you to define ipAllowList values.
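For example, a sketch of switching to the XCOM stack; the localAddress and ipAllowList values are illustrative placeholders for your own network configuration:
js> dba.rebootClusterFromCompleteOutage("testcluster", {switchCommunicationStack: "xcom", localAddress: "198.51.100.10:49123", ipAllowList: "198.51.100.0/24,203.0.113.0/24"})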