WL#9834: MySQL GCS - Minor Refactoring: Information on nodes are duplicated in several modules and classes

Affects: Server-8.0   —   Status: Complete   —   Priority: Medium

In this WL, we will:

  . Eliminate some duplicated code related to a node's properties.

  . Improve the code to track a node's incarnation.

- User visible behavior

Every time a node joins the cluster, it is automatically assigned a new unique
identifier. This lets the system easily detect when a different incarnation of
a node, i.e. an instance with the same address but a different identifier, is
trying to rejoin the cluster. In that case, the attempt to rejoin is rejected
until every reference to the old incarnation is removed from the cluster: the
system must first expel the old incarnation, and only then can the new
incarnation rejoin.
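
The rule above can be illustrated with a minimal sketch. It is not the GCS
implementation: Node_record, Join_decision and check_join are hypothetical
names introduced only for this example, and the identifier is modeled as a
plain string.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified node record: only the two properties needed to
// tell incarnations apart.
struct Node_record {
  std::string address;  // e.g. "host1:33061"
  std::string uuid;     // assumption: regenerated every time the node joins
};

enum class Join_decision { ACCEPT, REJECT_OLD_INCARNATION_PRESENT };

// Decide whether a joining node may enter, given the members the cluster
// still references. A member with the same address but a different uuid is
// an old incarnation that has not been expelled yet.
inline Join_decision check_join(const std::vector<Node_record> &members,
                                const Node_record &joiner) {
  for (const Node_record &member : members) {
    if (member.address == joiner.address && member.uuid != joiner.uuid)
      return Join_decision::REJECT_OLD_INCARNATION_PRESENT;
  }
  return Join_decision::ACCEPT;
}

// Usage: the rejoin is rejected until the old incarnation is removed.
inline void check_join_example() {
  std::vector<Node_record> members = {{"host1:33061", "uuid-a"},
                                      {"host2:33061", "uuid-b"}};
  const Node_record rejoining{"host2:33061", "uuid-c"};
  assert(check_join(members, rejoining) ==
         Join_decision::REJECT_OLD_INCARNATION_PRESENT);

  members.pop_back();  // the old incarnation of host2 has been expelled
  assert(check_join(members, rejoining) == Join_decision::ACCEPT);
}
```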

- Addressed problems

Allowing a new incarnation to rejoin the cluster while the system still holds
references to an old incarnation may cause unexpected behavior and, in some
circumstances, inconsistencies. As an example of unexpected behavior, a node
could join the cluster and then be suddenly expelled from it even though there
is neither a network issue nor a problem with the member server.

The current patch is a stepping stone towards fixing possible inconsistency
issues that may happen when there is a majority loss and a node that died or
had its group replication plugin stopped becomes temporarily available again,
allowing the cluster to briefly make progress. Such a node suffers from amnesia
and may cast votes that contradict promises it previously made in unfinished
consensus rounds.

So, after a majority loss, users must be very careful about how they handle
the situation and must not allow nodes that have restarted the group
replication service to try to rejoin the cluster before the cluster is
reconfigured.

We have noticed that information on nodes is duplicated in several modules and
classes. This makes our lives hard because it becomes difficult to reason about
the code and improve MySQL GCS. For example,

class Gcs_xcom_control : public Gcs_control_interface {
  ...
  unsigned int m_local_member_id_hash;
  Gcs_xcom_group_member_information *m_local_node_info;
  unsigned int m_hash;
  node_list m_node_list_me;
  unsigned int m_uuid;
  ...
};

"m_uuid" was never used, and "m_local_member_id_hash" was used as an internal
UUID which stored the hash of the node's address. Unfortunately, this hash
could not be used to distinguish among different incarnations of the same node
because every incarnation has the same address and therefore the same hash.
Besides, a node's properties such as address, identifier, etc. were defined in
different modules and stored in different places:
Gcs_xcom_group_member_information, Gcs_xcom_nodes and Gcs_member_identifier
each stored a copy of an address.
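
The limitation of an address-derived hash can be shown with a small sketch.
Everything here is an assumption made for illustration: std::hash stands in
for whatever hash function GCS used, and Incarnation and same_incarnation are
hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>

// Stand-in for the old scheme: an identifier derived only from the address.
// std::hash is used purely for illustration; it is not the GCS hash function.
inline std::size_t address_hash(const std::string &address) {
  return std::hash<std::string>{}(address);
}

// Hypothetical per-incarnation identity: the address plus a value that is
// regenerated every time the node starts.
struct Incarnation {
  std::string address;
  std::uint64_t uuid;  // assumption: any value that is fresh on each start
};

// Two records describe the same incarnation only if both fields match.
inline bool same_incarnation(const Incarnation &a, const Incarnation &b) {
  return a.address == b.address && a.uuid == b.uuid;
}

inline void incarnation_example() {
  const Incarnation before_restart{"host1:33061", 1};
  const Incarnation after_restart{"host1:33061", 2};

  // The address hash cannot tell the two incarnations apart...
  assert(address_hash(before_restart.address) ==
         address_hash(after_restart.address));
  // ...but the regenerated identifier can.
  assert(!same_incarnation(before_restart, after_restart));
}
```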

Apparently this confusion stems from the use of Gcs_member_identifier to
represent a "node" where the only stored information is an address. However,
there should have been a distinction between an identifier and an address from
the beginning. In this WL, we will start fixing this by making it clear that a
node has several properties, which include its address, identifier, etc.

We will make the following changes:

1 - Remove some duplicated code that deals with adding and removing nodes
    from the system. Adding or removing a node requires some serialization
    code to create an XCOM object, and this code is duplicated.

2 - Separate the Gcs_uuid from the Gcs_member_identifier since they represent
    different ways of identifying a node. This will simplify the code because
    the Gcs_member_identifier is pervasive: we don't want to worry about when
    unique identifiers should or should not be created, and we don't want to
    create them when they are not really necessary.

3 - Rename the Gcs_xcom_group_member_information to Gcs_xcom_node_address
    because the only thing that it stores is an address.

4 - Create the notion of a node object, i.e. Gcs_xcom_node_information, which
    will encapsulate the following properties and methods:

  class Gcs_xcom_node_information {
   public:
    /**
      Gcs_xcom_node_information constructor.

      @param[in] member_id the member identifier
    */
    explicit Gcs_xcom_node_information(const std::string &member_id);

    /**
      Gcs_xcom_node_information constructor.

      @param[in] member_id the member's identifier
      @param[in] uuid the member's uuid
      @param[in] address the member's address
      @param[in] node_no the member's node number
      @param[in] alive whether the node is alive or not
    */
    explicit Gcs_xcom_node_information(const std::string &member_id,
                                       const Gcs_xcom_uuid &uuid,
                                       const std::string &address,
                                       const unsigned int node_no,
                                       const bool alive);

    virtual ~Gcs_xcom_node_information() {}

    /** Sets the timestamp to indicate the creation of the suspicion. */
    void set_timestamp(uint64_t ts);

    /** Gets the timestamp that indicates the creation of the suspicion. */
    uint64_t get_timestamp() const;

    /**
      Compares the object's timestamp with the received one, in order
      to check if the suspicion has timed out and the suspect node
      must be removed.

      @param[in] ts Provided timestamp
      @param[in] timeout Time interval for the suspicion to time out
    */
    bool has_timed_out(uint64_t ts, uint64_t timeout);

    /** @return the member identifier */
    const Gcs_member_identifier &get_member_identifier() const;

    /** @return the member address */
    const Gcs_xcom_address &get_member_address() const;

    /** @return the member uuid */
    const Gcs_xcom_uuid &get_member_uuid() const;

    /** Regenerate the member uuid. */
    void regenerate_member_uuid();

    /** Set the member node_no. */
    void set_node_no(unsigned int);

    /** Return the member node_no. */
    unsigned int get_node_no() const;

    /** Get whether the member is alive or not. */
    bool is_alive() const;
  };

  We can keep using the Gcs_member_identifier internally, as we already do,
  to validate whether a node has joined or left, etc. However, we should
  create a node abstraction that encompasses all of a node's properties.

5 - Make the Gcs_xcom_nodes handle a set of Gcs_xcom_node_information.

6 - Move all information on a node, i.e. Gcs_uuid, Gcs_xcom_nodes and
    Gcs_xcom_group_member_information, to a single module, specifically
    src/bindings/xcom/gcs_xcom_group_member_information.*. Note that we don't
    need to create a single class per module (i.e. file).

7 - Clean up the Gcs_xcom_control as follows:

  class Gcs_xcom_control : public Gcs_control_interface {
    ...
    Gcs_xcom_node_information *m_local_node_info;
    Gcs_xcom_node_address *m_local_node_address;
    ...
  };

  7.1 - Replace the Gcs_member_identifier *m_local_member_id with
        Gcs_xcom_node_information *m_local_node_info.

  7.2 - Rename the Gcs_xcom_group_member_information to Gcs_xcom_node_address.

  7.4 - Remove the m_node_list_me because the information to build it is
        already in Gcs_xcom_node_information.

  7.5 - Note that although the address is in Gcs_xcom_node_information
        too, we cannot remove m_local_node_address because it is required
        during the bootstrap. Maybe we will be able to remove it after
        refactoring the configuration classes as well.

8 - Incorporate the timeout from the Gcs_node_suspicion into the Gcs_xcom_node
    and remove the former.

    If we don't do this, we will have to pass the set of Gcs_xcom_node(s) to
    the Gcs_suspicions_manager because we need to extract the UUID of each
    node. Otherwise, we will not be able to expel them.
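
As a rough illustration of item 8, the sketch below keeps the suspicion
timestamp inside the node record itself, so that whoever walks the suspects
already has each node's UUID at hand when deciding whom to expel. Node_info
and uuids_to_expel are hypothetical names, the UUID is modeled as a string,
and the timeout rule (strictly more than "timeout" units elapsed) is an
assumption, not necessarily the rule used by GCS.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, reduced node record; not the real Gcs_xcom_node_information.
class Node_info {
 public:
  Node_info(std::string address, std::string uuid)
      : m_address(std::move(address)), m_uuid(std::move(uuid)) {}

  // Records when the suspicion on this node was created.
  void set_timestamp(std::uint64_t ts) { m_timestamp = ts; }
  std::uint64_t get_timestamp() const { return m_timestamp; }

  // Assumed rule: the suspicion has timed out once strictly more than
  // `timeout` units (same unit as the timestamps) have elapsed.
  bool has_timed_out(std::uint64_t now, std::uint64_t timeout) const {
    return now > m_timestamp && (now - m_timestamp) > timeout;
  }

  const std::string &get_address() const { return m_address; }
  const std::string &get_uuid() const { return m_uuid; }

 private:
  std::string m_address;
  std::string m_uuid;
  std::uint64_t m_timestamp = 0;
};

// Collects the UUIDs of the suspects whose suspicion has timed out. No
// separate suspicion object is needed because the node carries both the
// timestamp and the UUID.
inline std::vector<std::string> uuids_to_expel(
    const std::vector<Node_info> &suspects, std::uint64_t now,
    std::uint64_t timeout) {
  std::vector<std::string> result;
  for (const Node_info &suspect : suspects) {
    if (suspect.has_timed_out(now, timeout))
      result.push_back(suspect.get_uuid());
  }
  return result;
}

inline void suspicion_example() {
  Node_info old_suspect("host1:33061", "uuid-a");
  old_suspect.set_timestamp(100);
  Node_info recent_suspect("host2:33061", "uuid-b");
  recent_suspect.set_timestamp(190);

  const std::vector<Node_info> suspects = {old_suspect, recent_suspect};
  const std::vector<std::string> expel = uuids_to_expel(suspects, 200, 50);
  assert(expel.size() == 1 && expel[0] == "uuid-a");
}
```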

Note that the current proposal is just a stepping stone and there is more
work to be done to improve the code. However, we don't need to do everything
in one single patch and we cannot really change interfaces used by external
components such as Group Replication and our Java Wrapper right now.

Here are some ideas on what we should do in the future:

. Create a generic interface, maybe Gcs_uuid, and move the Gcs_xcom_uuid
  code there.

. Create a generic interface to represent a node, maybe Gcs_node_information,
  and move Gcs_xcom_node_information's properties to it and then make the
  Gcs_xcom_node_information inherit from it.

. Check whether it makes sense to create a generic interface to the

. Make the view report sets of Gcs_node_information instead of reporting

. Rename the get_local_member_identifier to get_local_node_information
  and make it report a Gcs_node_information.

. Expose these new interfaces to Group Replication and make it use them
  and distinguish between identifier, unique identifier and address.

. Find better names for some of these objects and methods.
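
To make the idea of a generic node interface more concrete, here is one
possible shape. It is only a sketch of a future direction: the class names
are prefixed with Sketch_ to stress that they are hypothetical, and the
choice of methods and of std::string for every property is an assumption.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical generic interface: callers see an identifier, a unique
// identifier and an address as three distinct properties.
class Sketch_node_information {
 public:
  virtual ~Sketch_node_information() = default;
  virtual const std::string &get_identifier() const = 0;
  virtual const std::string &get_unique_identifier() const = 0;
  virtual const std::string &get_address() const = 0;
};

// Hypothetical XCom binding that adds the binding-specific node number.
class Sketch_xcom_node_information : public Sketch_node_information {
 public:
  Sketch_xcom_node_information(std::string identifier, std::string uuid,
                               std::string address, unsigned int node_no)
      : m_identifier(std::move(identifier)),
        m_uuid(std::move(uuid)),
        m_address(std::move(address)),
        m_node_no(node_no) {}

  const std::string &get_identifier() const override { return m_identifier; }
  const std::string &get_unique_identifier() const override { return m_uuid; }
  const std::string &get_address() const override { return m_address; }
  unsigned int get_node_no() const { return m_node_no; }

 private:
  std::string m_identifier;
  std::string m_uuid;
  std::string m_address;
  unsigned int m_node_no;
};

// Code written against the generic interface can tell a new incarnation
// from the old one without knowing about XCom.
inline bool is_new_incarnation(const Sketch_node_information &old_node,
                               const Sketch_node_information &new_node) {
  return old_node.get_address() == new_node.get_address() &&
         old_node.get_unique_identifier() != new_node.get_unique_identifier();
}

inline void generic_interface_example() {
  const Sketch_xcom_node_information before("host1:33061", "uuid-a",
                                            "host1:33061", 0);
  const Sketch_xcom_node_information after("host1:33061", "uuid-b",
                                           "host1:33061", 0);
  assert(is_new_incarnation(before, after));
  assert(!is_new_incarnation(before, before));
}
```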