WL#10570: Provide logical diffs for partial update of JSON values

Affects: Server-8.0   —   Status: Complete   —   Priority: Medium

WL#8963 introduced physical diffs for partial update of JSON values. In a
replicated environment, it cannot be guaranteed that the layout of a JSON
document is exactly the same on the slave and the master, so the physical diffs
cannot be used to reduce the network I/O for row-based replication. This worklog
will provide logical diffs that row-based replication can send over the wire and
reapply on the slave.

This worklog will also add support for partial update with the JSON_REMOVE function, in addition to the JSON_SET and JSON_REPLACE functions which got support for partial update in WL#8963.
### Functional requirements

F-1: When an UPDATE statement updates a JSON column with a sequence of JSON_SET, JSON_REPLACE and/or JSON_REMOVE calls, and the input document of the JSON_SET/JSON_REPLACE/JSON_REMOVE calls is the target column of the update operation, a list of the changed paths and their new values should be collected and made available to row-based replication.

F-2: Provide a function (apply_json_diffs()) that takes a list of logical JSON diffs to a value in a JSON colum, and updates the value in the column.

F-3: The apply_json_diffs() function should signal to the caller if a logical JSON diff cannot be applied because the value on the replication slave is not the same as the original value on the replication master.

### Non-requirements

NR-1: The apply_json_diffs() function does not have to detect and report all inconsistencies between the master and the slave. (It should however report the inconsistencies that it does detect.)
## When to collect logical diffs

Logical diffs will be collected for the same kinds of statements that we collect binary diffs for in WL#8963. That is, updates of a JSON column using JSON_SET or JSON_REPLACE expressions whose input is the same as the target column of the update.

Additionally, in the case where binary diffs will be disabled because the replacement value is too big to fit in the space used by the replaced value, logical diffs will be collected anyway, since they can still be used to reduce the amount of binary log to send over the network.

For now, the collection of logical diffs will be enabled regardless of whether row-based replication is in use. WL#2955 will limit it to only enable the collection when row-based replication is used and configured in a way that will benefit from logical diffs.

**TBD**: Should the collection of logical diffs be enabled regardless of storage engine, or only when the storage engine supports partial update using binary diffs? Proposal: Only enable logical diffs when the storage engine supports partial update for now. This gives us fewer combinations to test in this round.

## Representation of logical diffs

A change of a JSON document is represented by an object of type Json_diff, which has the following interface:

    class Json_diff
    {
    public:
      /// Get the path of the changed sub-document.
      const Json_path &path() const;

      /// Which operation to perform on the path.
      enum_json_diff_operation operation() const;

      /// The new value in the path.
      Json_wrapper value() const;
    };

    /// Enum that describes what kind of operation a Json_diff object represents.
    enum class enum_json_diff_operation
    {
      /**
        The JSON value in the given path is replaced with a new value.
        It has the same effect as `JSON_REPLACE(col, path, value)`.
      */
      REPLACE,

      /**
        Add a new element at the given path.

        If the path specifies an array element, it has the same effect as
        `JSON_ARRAY_INSERT(col, path, value)`.

        If the path specifies an object member, it has the same effect as
        `JSON_INSERT(col, path, value)`.
      */
      INSERT,

      /**
        The JSON value at the given path is removed from an array or object.
        It has the same effect as `JSON_REMOVE(col, path)`.
      */
      REMOVE,
    };

Note that the diff operations don't have a 1-to-1 mapping to the JSON functions which support partial update. For example, JSON_SET supports partial update, but there is no SET diff operation (JSON_SET will create REPLACE diffs when updating a value on an existing path, or INSERT diffs when adding new object members or new array elements). Conversely, there is an INSERT diff operation, but JSON_INSERT does not get support for partial update in this worklog. The idea is that other JSON modification functions can be supported with these diff operations in the future, with no changes in the diff representation, so that we can extend the set of supported functions without worrying about old replication slaves not understanding the generated diffs. JSON_INSERT, JSON_ARRAY_INSERT and JSON_ARRAY_APPEND can all be implemented with the INSERT diff operation.

The Json_diff objects are collected into vectors (aliased as Json_diff_vector), one vector per column that is updated. Inside of handler::ha_update_row(), they can be retrieved from the TABLE object, using these new member functions:

    /**
      Is this a column that can be updated with partial update using JSON diffs?

      @param field  the column to check
      @return whether the column can be updated with JSON diffs
    */
    bool is_json_diff_column(const Field *field) const;

    /**
      Get the list of JSON diffs that have been collected for a given column in
      the current row, or `nullptr` if partial update cannot be used for that
      column.

      @param  field   the column to get JSON diffs for
      @return the list of JSON diffs for the column, or `nullptr` if the column
      cannot be updated using partial update
    */
    const Json_diff_vector *get_json_diffs(const Field_json *field) const;

## Application of diffs

A Json_diff_vector can be applied on the slave by using the new function apply_json_diffs():

    /**
      Apply a sequence of JSON diffs to the value stored in a JSON column.

      @param field  the column to update
      @param diffs  the diffs to apply
      @return an enum_json_diff_status value that tells if the diffs were
              applied successfully
    */
    enum_json_diff_status apply_json_diffs(Field_json *field,
                                           const Json_diff_vector *diffs);

The enum that is returned from apply_json_diffs() is defined as follows:

    /**
      The result of applying JSON diffs on a JSON value using apply_json_diffs().
    */
    enum class enum_json_diff_status
    {
      /**
         The JSON diffs were applied and the JSON value in the column was updated
         successfully.
      */
      SUCCESS,

      /**
        An error was raised while applying one of the diffs. The value in the
        column was not updated.
      */
      ERROR,

      /**
        One of the diffs was rejected. This could happen if the path specified in
        the diff does not exist in the JSON value, or if the diff is supposed to
        add a value at a given path, but there already is a value at the path.

        This return code would usually indicate that the replication slave where
        the diff is applied, is out of sync with the replication master where the
        diff was created.

        The value in the column was not updated, but no error was raised.
      */
      REJECTED,
    };

This function will update the representation of the JSON document in the Field_json argument as described by the supplied Json_diff_vector. It won't write the updated value to the table, the slave applier will have to call handler::ha_update_row() to make that happen.

It can collect binary diffs that the storage engine can use for doing a partial update in the table. To make this happen, the slave applier will have to properly set up partial update in the TABLE object. This includes the following steps:

* At start of update, mark the column as a partial update column using TABLE::mark_column_for_partial_update()
* At start of update, Set up partial update execution-time data structures using TABLE::setup_partial_update()
* Between each row, clear the state using TABLE::clear_partial_update_diffs()
* At end of update, clean up execution-time data structures using TABLE::cleanup_partial_update()

## Collection of logical diffs

The logical diffs are collected in vectors that live in the TABLE object of the updated table. Just like the vectors that hold the binary diffs, these vectors store the diffs in memory allocated on the execution MEM_ROOT. TABLE gets a new member function to add the diffs:

    /**
      Add a JSON diff describing a logical change to a JSON column in
      partial update.

      @param field      the column that is updated
      @param path       the JSON path that is changed
      @param operation  the operation to perform
      @param new_value  the new value in the path

      @throws std::bad_alloc if memory cannot be allocated
    */
    void add_json_diff(const Field_json *field,
                       const Json_seekable_path &path,
                       enum_json_diff_operation operation,
                       const Json_wrapper *new_value);

Item_func_json_set_replace, which is the class that implements the JSON_SET and JSON_REPLACE functions, calls TABLE::add_json_diff() while evaluating the JSON_SET/JSON_REPLACE expression to record the changes it makes to the document. The Item_func_json_remove class uses the function in a similar way to record which paths a JSON_REMOVE expression removes from a JSON value.

## Extending partial update support to JSON_REMOVE

Collecting physical diffs for JSON_REMOVE is done by using the mechanisms added in WL#8963. The following is needed:

- Item_func_json_remove needs to implement mark_for_partial_update() to indicate that it can be used for partial update.
- The Json_wrapper class needs a new remove_path() function that removes a specified path from a JSON document and collects binary and logical diffs. (This function is to JSON_REMOVE what Json_wrapper::attempt_partial_update() is to JSON_SET and JSON_REPLACE.)
- The json_binary::Value class needs a function to remove a path from a binary JSON document, called remove_in_shadow(). (This function is to JSON_REMOVE what json_binary::Value::update_in_shadow() is to JSON_SET and JSON_REPLACE.)

When an element is removed from a JSON array or a JSON object, and partial update using binary diffs is performed, the entry for the removed element is removed from the lookup table at the beginning of the binary representation of the JSON array/object. The space previously occupied by the removed element is made available for use by subsequent updates.

Example:

Take the JSON document { "a": "x", "b": "y", "c": "z" }, whose serialized
representation looks like the following:

            0x00 - type: JSONB_TYPE_SMALL_OBJECT
            0x03 - number of elements (low byte)
            0x00 - number of elements (high byte)
            0x22 - number of bytes (low byte)
            0x00 - number of bytes (high byte)
            0x19 - offset of key "a" (high byte)
            0x00 - offset of key "a" (low byte)
            0x01 - length of key "a" (high byte)
            0x00 - length of key "a" (low byte)
            0x1a - offset of key "b" (high byte)
            0x00 - offset of key "b" (low byte)
            0x01 - length of key "b" (high byte)
            0x00 - length of key "b" (low byte)
            0x1b - offset of key "c" (high byte)
            0x00 - offset of key "c" (low byte)
            0x01 - length of key "c" (high byte)
            0x00 - length of key "c" (low byte)
            0x0c - type of value "a": JSONB_TYPE_STRING
            0x1c - offset of value "a" (high byte)
            0x00 - offset of value "a" (low byte)
            0x0c - type of value "b": JSONB_TYPE_STRING
            0x1e - offset of value "b" (high byte)
            0x00 - offset of value "b" (low byte)
            0x0c - type of value "c": JSONB_TYPE_STRING
            0x20 - offset of value "c" (high byte)
            0x00 - offset of value "c" (low byte)
            0x61 - first key  ('a')
            0x62 - second key ('b')
            0x63 - third key  ('c')
            0x01 - length of value "a"
            0x78 - contents of value "a" ('x')
            0x01 - length of value "b"
            0x79 - contents of value "b" ('y')
            0x01 - length of value "c"
            0x7a - contents of value "c" ('z')

We remove the member with name 'b' from the document, using a statement such
as:

    UPDATE t SET j = JSON_REMOVE(j, '$.b')

This function will then remove the element by moving the key entries and
value entries that follow the removed member so that they overwrite the
existing entries, and the element count is decremented.

The resulting binary document will look like this:

            0x00 - type: JSONB_TYPE_SMALL_OBJECT
    CHANGED 0x02 - number of elements (low byte)
            0x00 - number of elements (high byte)
            0x22 - number of bytes (low byte)
            0x00 - number of bytes (high byte)
            0x19 - offset of key "a" (high byte)
            0x00 - offset of key "a" (low byte)
            0x01 - length of key "a" (high byte)
            0x00 - length of key "a" (low byte)
    CHANGED 0x1b - offset of key "c" (high byte)
    CHANGED 0x00 - offset of key "c" (low byte)
    CHANGED 0x01 - length of key "c" (high byte)
    CHANGED 0x00 - length of key "c" (low byte)
    CHANGED 0x0c - type of value "a": JSONB_TYPE_STRING
    CHANGED 0x1c - offset of value "a" (high byte)
    CHANGED 0x00 - offset of value "a" (low byte)
    CHANGED 0x0c - type of value "c": JSONB_TYPE_STRING
    CHANGED 0x20 - offset of value "c" (high byte)
    CHANGED 0x00 - offset of value "c" (low byte)
    (free)  0x00
    (free)  0x0c
    (free)  0x1e
    (free)  0x00
    (free)  0x0c
    (free)  0x20
    (free)  0x00
            0x61 - first key  ('a')
    (free)  0x62
            0x63 - third key  ('c')
            0x01 - length of value "a"
            0x78 - contents of value "a" ('x')
    (free)  0x01
    (free)  0x79
            0x01 - length of value "c"
            0x7a - contents of value "c" ('z')

Two binary diffs will be created. One diff changes the element count, and one
diff changes the key and value entries.
## sql/json_diff.h, sql/json_diff.cc

Implementation of the Json_diff class described in the HLS.

Implementation of the apply_json_diffs() function described in the HLS.

## sql/item_json_func.cc

Add code to collect JSON diffs in Item_func_json_set_replace::val_json().

Add code to collect JSON diffs and binary diffs in Item_func_json_remove::val_json().

Move some code from Item_func_json_set_replace to Item_json_func, so that it can be reused by Item_func_json_remove.

## sql/item_json_func.h

Add forwarding constructors for some of Item_json_func's subclasses to make it easier to use them in unit tests.

## sql/json_dom.cc, sql/json_dom.h

Make Json_wrapper::attempt_partial_update() collect JSON diffs that describe the changes.

Add Json_wrapper::remove_path() to remove a path from a JSON document while collecting binary diffs and JSON diffs.

## sql/json_binary.cc, sql/json_binary.h

Add json_binary::Value::remove_in_shadow(), which removes a path from the binary representation of a document in a copy of the original column value.

## sql/table.cc, sql/table.h

Add vectors that hold Json_diff objects to the Partial_update_info class.

Rename some functions to make it clearer whether they are used for binary diffs, JSON diffs, or for partial update in general.

Add the functions for collecting or retrieving JSON diffs, as described in the HLS.

## unittest/gunit/json_dom-t.cc, unittest/gunit/item_json_func-t.cc

Add test cases to verify that collection and application of JSON diffs works correctly.