WL#3464: Add replication event to denote gap in replication
Affects: Server-5.1
Status: Complete
RATIONALE
---------

Needed to solve BUG#21494.

DESCRIPTION
-----------

Introduce a way to denote that there has been an outage and that there is a
gap in the replication stream, typically by either adding a flag to an event
or by introducing a new event.

The indication can be used to force resynchronization, but an initial
implementation will just stop the slave, indicating that the master and the
slave are out of sync and need resynchronization.

We want the log after the gap to feel "clean". After the gap we should have a
log file that can easily be used with an initial snapshot, without needing to
execute SKIP EVENT or similar.

REQUIREMENT
-----------

An incident is something that occurs and that affects the contents of the
database, but which cannot easily be represented as a set of changes.
Examples of incidents are: server crashes, database resynchronization, (some)
software updates, and (some) hardware changes.

R1. It shall be possible to record an incident in the binary log.

R2. It shall be possible to record an incident through the injector
interface.

R3.1. When seeing a known incident, the slave shall stop with an indication
of the incident that occurred. In the future, other ways of handling known
incidents might be implemented, but for the current implementation, the slave
will stop.

R3.2. For *unknown* incident codes, the slave shall *always* stop.

R4. It shall be possible for the DBA to start the replication from just after
the incident when the appropriate actions have been taken to handle the
incident.

DECISIONS REGARDING THIS WL
---------------------------

- Elliot Murphy, Brian Aker, Calvin Sun and Trudy Pelzer agreed today
  (28 Aug 2006) that this task is needed for 5.1. (Reported by Trudy)
- Lars and Mats discussed this by email and we will most likely simply insert
  a Format_description_log_event instead of the gap event, possibly after
  rotating the log. There is no need to change the interface presented to
  NDB, so the Cluster team will not have to make any changes to their part.
- Lars and Mats discussed again 2007-01-25 and realized that a separate event
  might be best.

BUG SCENARIOS
-------------

We need a solution that fixes all of these problems:

1. An NDB node crashes and SUMA produces a gap in the cluster replication log
   sent to the event interface in the MySQL server.
2. The MySQL Server crashes and the event queue is lost. When the MySQL
   server is restarted it receives new events from SUMA, but it has lost the
   events that were stored in the event queue in the MySQL server.
3. Network outage. The replication log is lost in the connection between the
   NDB nodes and the MySQL server.

POSSIBLE SOLUTIONS
------------------

1. The slave stops with an error message when it receives GAP information
   from the master. The user needs to take some special action to restart
   the replication, e.g. skip event, START SLAVE WITH INITIAL START (a new
   command).
2. The slave stops with an error message but automatically skips the gap. A
   new START SLAVE command will resume the replication without further user
   action. Both Lars and Mats think this is a bad solution because it is too
   easy for the DBA to make mistakes.

SUGGESTED SOLUTION
------------------

If a gap error occurs, the user needs to execute SKIP EVENT.

IMPLEMENTATION
--------------

Possibilities:

I1. GAP info in separate event

I2. GAP info in rotate_event

I3. GAP info in format_description_log_event

Mats and Lars think that an "INCIDENT" event + a rotation log event seems
like the best way to indicate a gap. Then we get a fresh log after the gap
(R1) and we don't need to skip rotate events (I2).

In the incident event there should be:

1. A number saying what incident it conveys. For this case we think it should
   be "LOST_EVENTS".
2. An optional string with information about the nature of the incident. For
   this case it can be "MySQL Server crashed", "Cluster did not provide
   continuous replication log", or "Network outage between NDB node and
   MySQL server". (It is up to the cluster team to provide proper messages.)
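Requirements R3.1 and R3.2 can be illustrated with a minimal sketch. The
``IncidentRecord`` and ``SlaveDecision`` types and the functions below are
hypothetical names introduced for illustration only; they are not part of
the server. The sketch also assumes that any code outside the range between
``INCIDENT_NONE`` and ``INCIDENT_COUNT`` counts as unknown, which this
worklog does not state explicitly.

```cpp
#include <cassert>
#include <string>

// Incident codes as proposed in this worklog.
enum Incident {
  INCIDENT_NONE,         // No incident
  INCIDENT_LOST_EVENTS,  // Events were lost in the replication stream
  INCIDENT_COUNT         // Shall be last entry of the enumeration
};

// Hypothetical: the two pieces of information carried by an incident event.
struct IncidentRecord {
  int code;             // number saying what incident is conveyed
  std::string message;  // optional description, may be empty
};

// Hypothetical: what the slave does with an incident.
struct SlaveDecision {
  bool known;  // true if the code is one the slave recognizes
  bool stop;   // true if the SQL thread shall stop
};

// Assumption: a code is "known" only if it lies strictly between
// INCIDENT_NONE and INCIDENT_COUNT.
inline bool is_known_incident(int code) {
  return code > INCIDENT_NONE && code < INCIDENT_COUNT;
}

inline SlaveDecision decide_on_incident(const IncidentRecord &rec) {
  SlaveDecision d;
  d.known = is_known_incident(rec.code);
  // R3.1: known incidents stop the slave in the current implementation.
  // R3.2: unknown incidents always stop the slave.
  d.stop = true;
  return d;
}

// Small self-check showing the intended behaviour.
inline void example_usage() {
  IncidentRecord lost = {INCIDENT_LOST_EVENTS, "MySQL Server crashed"};
  IncidentRecord odd = {42, ""};
  assert(decide_on_incident(lost).known && decide_on_incident(lost).stop);
  assert(!decide_on_incident(odd).known && decide_on_incident(odd).stop);
}
```

Keeping ``known`` separate from ``stop`` leaves room for the future handling
of known incidents mentioned in R3.1, while unknown codes keep stopping the
slave unconditionally.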
# Behaviour

When the slave encounters a gap in the replication log, it shall stop with an
error message. The error message will be available in the output of SHOW
SLAVE STATUS. The format of the message is::

    The incident *symbol* occured on the master. Message: *message*

Where *symbol* is a symbol denoting the incident (the constant name without
the ``INCIDENT_`` prefix) and *message* is the message supplied.

# Administration

If the slave stops at an incident, the output of SHOW SLAVE STATUS will
indicate that the SQL thread has stopped due to an incident registered in the
replication stream. In order to restart the slave, the DBA has to issue the
following commands::

    slave> SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1;
    slave> START SLAVE;
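The message format described under Behaviour could be produced roughly as
follows. This is a sketch only: ``incident_symbol`` and
``format_incident_error`` are hypothetical helper names, and the ``UNKNOWN``
placeholder for unrecognized codes is an assumption. The spelling of the
message text follows the format given above.

```cpp
#include <cassert>
#include <string>

// Incident codes as proposed in this worklog.
enum Incident {
  INCIDENT_NONE,
  INCIDENT_LOST_EVENTS,
  INCIDENT_COUNT
};

// Returns the symbol for an incident: the constant name without the
// INCIDENT_ prefix. "UNKNOWN" is a placeholder assumed for this sketch.
inline const char *incident_symbol(Incident incident) {
  switch (incident) {
    case INCIDENT_NONE:
      return "NONE";
    case INCIDENT_LOST_EVENTS:
      return "LOST_EVENTS";
    default:
      return "UNKNOWN";
  }
}

// Builds the error text in the format specified for SHOW SLAVE STATUS.
inline std::string format_incident_error(Incident incident,
                                         const std::string &message) {
  return std::string("The incident ") + incident_symbol(incident) +
         " occured on the master. Message: " + message;
}

// Small self-check showing the intended behaviour.
inline void example_usage() {
  assert(format_incident_error(INCIDENT_LOST_EVENTS, "MySQL Server crashed") ==
         "The incident LOST_EVENTS occured on the master. "
         "Message: MySQL Server crashed");
}
```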
New enumeration ``Incident``
----------------------------------------

Synopsis::

    enum Incident {
      /* No incident */
      INCIDENT_NONE,

      /* There were events lost in the replication stream */
      INCIDENT_LOST_EVENTS,

      /* Shall be last event of the enumeration */
      INCIDENT_COUNT
    };

New class ``Incident_log_event``
-----------------------------------------

Introduce a new event to denote that something unexpected happened on the
master.

Synopsis::

    class Incident_log_event : public Log_event {
    public:
      Incident_log_event(Incident incident);
      Incident_log_event(Incident incident, LEX_STRING message);

    private:
      Incident m_incident;
      LEX_STRING m_message;
    };

New member functions in class ``injector``
------------------------------------------

The following functions are added to the ``injector`` class.

Synopsis::

    int record_incident(Incident incident);
    int record_incident(Incident incident, LEX_STRING message);
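To make the two members of ``Incident_log_event`` more concrete, the sketch
below packs an incident number and an optional message into a byte buffer
and reads them back. The layout used here (a 2-byte little-endian incident
number followed by a 1-byte length and the message bytes) is an assumption
chosen for illustration; this worklog does not specify the on-disk format.
``IncidentPayload``, ``encode_incident`` and ``decode_incident`` are
hypothetical names, and ``std::string`` stands in for ``LEX_STRING``.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Incident codes as proposed in this worklog.
enum Incident {
  INCIDENT_NONE,
  INCIDENT_LOST_EVENTS,
  INCIDENT_COUNT
};

// Hypothetical decoded form of an incident event body.
struct IncidentPayload {
  std::uint16_t code;
  std::string message;
};

// Assumed layout: [code low byte][code high byte][length][message bytes].
// Messages longer than 255 bytes are truncated in this sketch.
inline std::vector<std::uint8_t> encode_incident(Incident incident,
                                                 const std::string &message) {
  std::vector<std::uint8_t> out;
  const std::uint16_t code = static_cast<std::uint16_t>(incident);
  out.push_back(static_cast<std::uint8_t>(code & 0xFF));
  out.push_back(static_cast<std::uint8_t>(code >> 8));
  const std::size_t len = message.size() < 255 ? message.size() : 255;
  out.push_back(static_cast<std::uint8_t>(len));
  out.insert(out.end(), message.begin(),
             message.begin() + static_cast<std::ptrdiff_t>(len));
  return out;
}

// Returns false if the buffer is too short for the assumed layout.
inline bool decode_incident(const std::vector<std::uint8_t> &buf,
                            IncidentPayload *out) {
  if (buf.size() < 3) return false;
  const std::size_t len = buf[2];
  if (buf.size() < 3 + len) return false;
  out->code = static_cast<std::uint16_t>(buf[0] | (buf[1] << 8));
  out->message.assign(buf.begin() + 3,
                      buf.begin() + 3 + static_cast<std::ptrdiff_t>(len));
  return true;
}

// Small self-check: a record survives an encode/decode round trip.
inline void example_usage() {
  IncidentPayload p;
  assert(decode_incident(
      encode_incident(INCIDENT_LOST_EVENTS, "MySQL Server crashed"), &p));
  assert(p.code == INCIDENT_LOST_EVENTS);
  assert(p.message == "MySQL Server crashed");
}
```

A slave reading such a payload would use the decoded number to choose the
symbol for its error message and would report the decoded string as the
message text.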