WL#2955: RBR replication of partial JSON updates

Affects: Server-8.0   —   Status: Complete   —   Priority: Medium

==== Executive Summary ====

MySQL shall replicate small updates of big JSON documents more space
efficiently.  More precisely, when using RBR, we will write only
modified parts of JSON documents, instead of the whole JSON document.

==== User stories ====

U1. As a DBA, I do partial updates to JSON values, and I don't use
    replication. Then I enable (row-based) replication. That causes
    performance to drop and disk usage to grow. The reason is that
    the full JSON document is written to the binary log and replicated
    to slaves.  (Writes of full JSON documents were not done by the
    single server because InnoDB supports partial JSON updates.)

    So enabling replication may give a performance degradation.
    The effect will be reduced when we replicate partial JSON documents
    and user sets binlog_row_image=MINIMAL.

U2. As a DBA, I do partial updates to JSON documents, and I use
    statement-based replication. Then I switch to row-based
    replication. Then performance drops and the binary logs grow. The
    reason is that the full JSON document is written to the binary log
    and replicated to slaves. (Such writes of full JSON documents were
    not done by SBR.)

    So switching from SBR to RBR may give a performance degradation.
    The effect will be reduced if we replicate partial JSON documents
    and user sets binlog_row_image=MINIMAL.

U3. As a DBA, I use row-based replication, and I don't use JSON. Then
    I start to use JSON and do partial updates. Then my slave starts to
    fall behind / lag. The reason is that the slave needs to hand the
    full JSON document to the engine, as well as handle the full JSON
    document a couple of extra times in the replication pipeline. (The
    master makes the partial update more efficiently.)

    So doing partial JSON updates can cause slave lag.
    The effect will be reduced if we replicate partial JSON documents
    and user sets binlog_row_image=MINIMAL.

U4. As a DBA, I would like to use both replication and partial updates
    to JSON documents, but I cannot do that in 5.7 because it is too
    slow. I upgrade to 8.0 because it is supposed to optimize the
    partial JSON documents. I get disappointed because this did not help
    much. (The slave is equally slow, and the master is not so much
    faster because it writes full JSON documents to the binary log.)

    So the InnoDB optimization in 8.0 did not help much.  The positive
    effect will be more visible if we replicate partial JSON documents.

U5. As a power user, I have various scripts that mine the binary log,
    extract the after-image, and do interesting things. I do not want
    to update my scripts, and/or my scripts can only work if the full
    after-image is there.

    So partial JSON replication should be optional.

U6. Both optimizer and NDB are interested in using binary diff
    algorithms to implement partial blob replication for blob updates.
    This will make blob updates more efficient in InnoDB and NDB,
    respectiely (for roughly the same reasons as explained in U1-U4).

    So the user interface and the binary log format should be extensible
    so that we can add support for other data types in the future.
* Functional requirements

** Core feature.

 FR1. Add a new server option.

      If the user disables the option, the server must only write full
      JSON documents to the binary log, and never partial JSON
      updates.

      If the user enables the option, the server may write partial
      JSON updates to the after-image of RBR updates in the binary log
      when possible.  It may also write full JSON documents, e.g. in
      case the server cannot generate a partial JSON update, or if the
      partial JSON update would be bigger than the full document.

      Exceptions: when the binary log is disabled, or when using
      statement-based replication, or when using
      log_bin_use_v1_row_events, the option has no effect. The option
      only affects the after-image, and therefore when
      binlog_row_image=FULL, or when the JSON column is part of the
      PK, or when there is no PK, the before-image will contain the
      JSON column, and it is full.)

 FR2. The slave must decode the partial JSON updates and apply them
      correctly.

 FR3. When a binary log contains row events with partial JSON updates,
      mysqlbinlog must output BINLOG statements such that when the 8.0
      server applies them, the JSON documents shall be correctly
      updated.

 FR4. The partial JSON update must apply correctly on slave as long as
      the JSON documents are equal as seen from SQL, regardless of the
      low-level storage format. That is, even if the number of
      internal unused padding bytes in the InnoDB disk format differs
      between master and slave, the following must hold:
      - If the JSON documents are equal according to the SQL '='
        operator applied directly on the JSON documents, then this 
        equality must hold also after doing the
        update on the master and applying the resulting partial JSON
        update on the slave.
      - If the JSON documents are equal according to the SQL '='
        operator applied on the textual representation of the JSON
        documents, then this equality must hold also after doing the
        update on the master and applying the resulting partial JSON
        update on the slave.
      (But in both cases, the internal representation may be different
      both before and after applying the also after applying the
      change. Moreover, even if the internal representations are
      identical before the update, it is not a requirement that they
      are equal after the update.)

      In particular, if a JSON document is created on a single server,
      and subsequently updated using JSON functions that replace part
      of the document by a smaller document, and then the database is
      cloned to another server using mysqldump, then subsequent
      replication of partial JSON from the original server to the
      newly created server shall succeed.

 FR5. If master and slave have diverged so that they have different
      JSON documents, a partial update must not crash the server and
      must not produce invalid JSON. The partial update must either
      cause the applier thread to stop with an error, or update the
      JSON document on slave in an unpredictable way.

 FR6. When the slave has the option enabled, and applies a JSON diff,
      then:
       FR6.1. The slave shall generate JSON diffs in its binary log
       FR6.2. The slave shall use binary diffs to update the engine

** Usability and Compatibility

 FR7. mysqlbinlog -v shall output human-readable and complete
      information about partial JSON updates.

 FR8. When partial JSON is disabled, the binary log format must not
      change (therefore this does not block cross-version replication,
      neither OLD->NEW nor NEW->OLD).

 FR9. The new server option shall have a type that makes it easy to
      extend to other datatypes in the future.

FR10. When replicating from a NEW master having the option enabled, to
      an OLD slave having the option disabled, the OLD slave must stop
      with an error message and fail to apply the partial updates.

** Storage and performance

Def1. The *size of a partial JSON update* is equal to the total size
      of all JSON diff objects, and each JSON diff object shall be the
      size of the JSON path, plus the size of the binary
      representation of the inserted JSON (sub-)document, if any, plus
      a small number of bytes (bounded by a constant).

Def2. The *size of a full JSON update* is equal to the size of the
      binary representation of the full document, plus a small number
      of bytes (bounded by a constant).

FR11. When the feature is enabled, and the server is able to generate
      a partial JSON update for a row change, and the size of the
      partial JSON update is smaller than the size of the full JSON
      update ("Def1 < Def2"), then the binary log size shall decrease
      by size of the full JSON update and increase by the size of the
      partial JSON update.

FR12. When the feature is enabled, and the server is unable to
      generate a partial JSON update for a row change, or the size of
      the partial JSON update is greater than the size of the full
      JSON update ("Def1 > Def2"), then the binary log size shall
      change only by a small constant number of bytes (the difference
      between D2 and a raw binary format).

FR13. When the feature is disabled, the size of the binlog shall be
      unchanged.

FR14. When the feature is disabled, the memory usage shall not change.

FR15. When the feature is enabled, the throughput (on each of master
      and slave) shall decrease by at most 2% (as long as the
      increased memory usage does not cause the server to swap).

FR16. When the feature is disabled, the throughput shall not change.

* Non-functional requirements

NFR1. The disk format must be possible to extend to accomodate partial
      updates of other data types in the future.

* NON-requirements

 NR1. "If the slave has a different value for a JSON document,
      compared to the master, and the master does a partial JSON
      update, the update shall not take effect on the slave."

      This cannot be implemented in a practical and efficient way. So
      a partial JSON update applied on the wrong JSON document may
      corrupt the JSON document.

 NR2. "The binary log size shall never increase."

      When partial JSON updates cannot be computed, the binary log
      size will increase by a few bytes per JSON column; see FR11.

 NR3. "When partial JSON is enabled, all JSON updates *must* use the
      partial format.  When the server cannot generate partial JSON,
      it should use the JSON path '$' and the full value."

      The uniformity of this format might seem appealing.  However, in
      case the old value was NULL, applying any JSON function on it
      will produce the result NULL.  This could be handled by a
      special case in the decoder, which would replace updates at path
      '$' by a full write (without applying the JSON function).  But
      then the uniformity would be lost anyways.

 NR4. "When the feature is enabled, the memory usage for JSON updates
      shall increase by at most X% (on each of master and slave)."

      We do not provide such guarantees.  The size of the
      Json_diff_vector is probably unbounded (except the
      MAX_ALLOWED_PACKET limit), in case user executes statements
      like:

      CREATE TABLE t (j JSON);
      INSERT INTO t VALUES ('[1, 2, 3, ..., 1000]');
      UPDATE t SET j = JSON_SET(j, '$', '[1, 2, ..., 1000]',
                                   '$', '[1, 2, ..., 1000]',
                                   ...
                                   '$', '[1, 2, ..., 1000]'));

      And even disregarding this, the replication code uses 2-3 extra
      copies of the JSON value.  This is because of limitations in
      existing frameworks and has to be addressed separately.

 NR5. "When partial JSON is enabled, it shall be possible for user to
      predict if a given change will be written to the binary log in
      full format or partial format based on knowing the old JSON
      value and the operations applied to it."

      This property is not be guaranteed, since the server will decide
      to use the format (full or partial) that has the smallest size.
      And the size of a full object is not determined by the JSON
      value; in fact two JSON objects having the same textural
      representation may have different internal binary
      representations.
==== 1. SQL interface ====

We introduce a new SQL variable:

  NAME:
  binlog_row_value_options (variable)
  binlog-row-value-options (command line argument)

  SCOPE: command-line, global and session

  TYPE: set

  VALUES: 'PARTIAL_JSON'

  DEFAULT: ''

  DYNAMIC: yes

  RESTRICTIONS: only SUPER can set this, outside transactions, stored
  procedures, triggers, and functions.

  DESCRIPTION:
  Enables a space-efficient binary log format for updates that modify
  only a small portion of a data value. Currently, this is only
  implemented for updates of JSON documents. It only works when
  log-bin=1 and binlog_format=ROW or MIXED. It is also recommended to
  use binlog_row_image=NOBLOB or MINIMAL; with binlog_row_image=FULL,
  the benefit is smaller because the full JSON document will be stored
  in the before-image and the partial update in the after-image. This
  option overrides log_bin_use_v1_row_events, so even when
  log_bin_use_v1_row_events=1 it uses the new event format with
  partial JSON.

  WARNINGS:

   1. Suppose any of the following conditions holds:

      - user sets @@session.binlog_row_value_options=PARTIAL_JSON,
        when either @@session.binlog_format=STATEMENT, or the binlog
        is not open, or @@session.sql_log_bin=0;

      - user sets @@session.binlog_format=STATEMENT or
        @@session.sql_log_bin=0 when
        @@session.binlog_row_value_options=PARTIAL_JSON;

      - user sets @@global.binlog_row_value_options=PARTIAL_JSON or
        @@persist.binlog_row_value_options when either
        @@global.binlog_format=STATEMENT, or the binlog is not open;

      - user sets @@global.binlog_format=STATEMENT or
        @@persist.binlog_format=STATEMENT when
        @@global.binlog_row_value_options=PARTIAL_JSON;

      - when the server starts, the command line or configuration file
        or persisted configuration contains
        binlog-row-value-options=PARTIAL_JSON when either
        binlog-format=STATEMENT, or log-bin is not set;

      Then the following warning shall be generated:

        "When %.192s, the setting binlog_row_value_options=%.192s will
        be ignored and no partial updates will be written to the binary
        log."

      where the first %.192s is replaced by "binlog_format=STATEMENT"
      or "the binary log is disabled", and the second %.192s is
      replaced by "PARTIAL_JSON".

   2. Suppose any of the following conditions holds:

      - user sets @@session.binlog_row_value_options=PARTIAL_JSON or
        @@global.binlog_row_value_options=PARTIAL_JSON or
        @@persist.binlog_row_value_options=PARTIAL_JSON when
        @@global.log_bin_use_v1_row_events=1;

      - user sets @@global.log_bin_use_v1_row_events=1 or
        @@persist.log_bin_use_v1_row_events=1 when
        @@global.binlog_row_value_options=PARTIAL_JSON;

      - when the server starts, the command line or configuration file
        or persisted configuration contains
        binlog-row-value-options=PARTIAL_JSON and
        log-bin-use-v1-row-events=1;

      Then the following warning shall be generated:

        "When %.192s, the option log_bin_use_v1_row_events=1 will be
        ignored and row events will be written in new format to binary
        log."

      where %.192 is replaced by "binlog_row_value_options=PARTIAL_JSON".

   3. Suppose any of the following conditions holds:

      - user sets @@session.binlog_row_value_options=PARTIAL_JSON when
        @@session.binlog_row_image=FULL;

      - user sets @@session.binlog_row_image=FULL when
        @@session.binlog_row_value_options=PARTIAL_JSON;

      - user sets @@global.binlog_row_value_options=PARTIAL_JSON or
        @@persist.binlog_row_value_options=PARTIAL_JSON when
        @@global.binlog_row_image=FULL;

      - user sets @@global.binlog_row_image=FULL or
        @@persist.binlog_row_image=FULL when
        @@global.binlog_row_value_options=PARTIAL_JSON;

      - when the server starts, the command line or configuration file
        or persisted configuration contains binlog-row-image=FULL and
        binlog-row-value-options=PARTIAL_JSON;

      Then the following warning shall be generated:

        "When %.192s, the option binlog_row_value_options=%.192s will
        be used only for the after-image. Full values will be written
        in the before-image, so the saving in disk space due to
        binlog_row_value_options is limited to less than 50%%."

      where the first %.192s is replaced by "binlog_row_image=FULL"
      and the second %.192s is replaced by "PARTIAL_JSON".

  NOTES:
  Even if this is only two-state, we make it a set instead of a
  boolean so that it can easily be extended with e.g. PARTIAL_BLOB and
  PARTIAL_TEXT in the future (FR7).

==== 2. Logic to decide to use partial format ====

Update_rows_log_event will use a format that allows JSON documents to
be stored in partial format, whenever
binlog_row_value_options=PARTIAL_JSON.

Partial format is used when all the following holds:

 1. The binary log is open and enabled for the transaction
 2. binlog_image_value_options=PARTIAL_JSON (this is not the default)
 3. the statement is written in row format
 4. log_bin_use_v1_row_events=0
 5. The UPDATE statement has the form:
      SET column = f1(f2(...fn(column, ...), ... )...)
    where each of f1, f2, ..., fn is either JSON_SET or JSON_REPLACE.
 6. The resulting sequence of diffs has a smaller byte size than the
    full value, when serialized in the binlog.
 7. Only for the after-image (the before-image, if present, uses full
    format).

If any one of the above conditions does not hold, the full format is
used.

Notice that #6 above ensures that FR11 holds.  Also notice that #6 is
non-deterministic in the following sense.  Two JSON documents that
have equal string representations, may use different binary
representations internally, either due to having different data types
(e.g. timestamp vs string) or due to having different amounts of
internal padding (e.g. because the documents were created in different
ways).  Then if the same JSON operation is applied to both documents,
one of them may be written as a JSON diff and the other as a full
document.  So the user cannot in general predict whether full format
or partial format will be used (cf. NR5).

==== 3. Applier logic ====

The RBR applier for update events has the following control flow
(leaving out steps that are irrelevant to the present worklog):

 1. Decode the before-image.
 2. Set the table's read_set to those columns that need to be retrieved
    from the engine.
 3. (Tell the storage engine to) look up the before-image in the table.
 4. Decode the after-image.
 5. Set the engine's write_set to those columns that need to be stored
    in the engine.
 6. (Tell the storage engine to) write the after-image to the table.
 7. Write the row change to the binary log.  This includes those
    columns that are in the read_set and the write_set in the event
    data.

And now:
- Step 6 needs to write full values to the engine, because there is no
  interface to send a JSON diff to the engine.
- Therefore, step 4 needs to apply the JSON diff to the old version of
  the JSON document.
- Therefore, step 3 needs to retrieve JSON columns which are in
  partial format in the after-image from the engine.
- Therefore, step 2 needs to set include JSON columns which are in
  partial format in the after-image in the table's read_set.
- Therefore, step 1 needs to read the information about which columns
  are partial from the event data.

Therefore, we make the following changes in steps 1 and 3 of the
decoder:

 1. Decode the before-image, plus the information about which JSON
    columns are stored in partial format in the after-image.
 2. Set the bit for each partial column in the table's read_set.
 4. Decode the after-image.  For columns stored in partial format,
    decode the partial format correctly.  Apply the JSON diffs to the
    column data to produce the full after-image.

Notice that both the before-image decoder and the after-image decoder
need to know which JSON columns are stored in partial format.  As we
will see in the next section, this has implications for the storage
format.

Notice that in step 7, it uses the read_set to determine which columns
will be included in the before-image.  Therefore, the JSON columns
which were added to the read_set only because they are partial in the
after-image, need to be removed from the read_set.  Otherwise the
slave would store all partial JSON columns in full format in the
before-image in the slave's binary log.  Apparently this logic is
already implemented, so there is nothing to change in this
worklog. (???)

==== 4. Binary log format ====

A new event type is introduced. This event type differs from
Update_rows_log_event in the following ways:

F1. It has a new type code.
F2. It has a new layout for the header of the before-image.
F3. It has a new format for JSON columns.

(There is no change in Table_map_log_event, Write_rows_log_event, or
Delete_rows_log_event.)

We now desribe the format changes more precisely:

F1. The *old* type code is UPDATE_ROWS_EVENT=31 (or
    UPDATE_ROWS_EVENT_V1=24, in case log_bin_use_v1_row_events=1).

    The *new* type code is PARTIAL_UPDATE_ROWS_EVENT=39.

    (Internally, it uses the same class, it is only the type code that
    differs.)

    The new type code is used whenever
    binlog_row_value_options=PARTIAL_JSON and
    log_bin_use_v1_row_events=0.

F2. For UPDATE_ROWS_EVENT and UPDATE_ROWS_EVENT_V1, the format for a
    row is (before and after this worklog):

    +--------------+-------------+
    | before-image | after-image |
    +--------------+-------------+

    (and each image has the following format:
    +-----------+-------+-------+     +-------+
    | null_bits | col_1 | col_2 | ... | col_N |
    +-----------+-------+-------+     +-------+
    )

    For PARTIAL_UPDATE_ROWS_EVENT, we introduce a *shared-image*.  The
    shared-image appears between before-image and after-image:

    +--------------+--------------+-------------+
    | before-image | shared-image | after-image |
    +--------------+--------------+-------------+

    shared-image has one of the following structures:

    +-----------------+
    | value_options=0 |
    +-----------------+

    +-----------------+-----------------+
    | value_options=1 | partial_columns |
    +-----------------+-----------------+

    The new fields are:

    - value_options: A bitmap, stored as an integer in
      net_field_length() format. This bitmap currently has one bit,
      PARTIAL_JSON_UPDATES (the lowest bit), hence the integer value
      of the bitmap is 0 or 1.  When the bit is set to 0, all columns
      are stored in full format. When the bit is 1, each JSON column
      may be stored in either partial format or full format.  The
      exact format is specified per column in the partial_columns
      field.  (In case we implement partial formats for other data
      types in the future, or even other partial formats for JSON,
      then those modes can be enabled and disabled by adding more bits
      to value_options.)

      The encoder sets value_options=1 when
      @@session.binlog_row_value_options=PARTIAL_JSON and at least one
      column is stored in partial JSON format.  In case
      @@session.binlog_row_value_options='', or there are no JSON
      updates, or each JSON update either did not generate a JSON diff
      or generated a JSON diff that was bigger than the full value,
      the encoder sets value_options=0 and skips the partial_columns
      field.

    - partial_columns: A bitmap having one bit for each column in the
      table that has a type that might be partial according to
      value_options (i.e., in the current worklog, there is a bit for
      JSON columns; in the future, enabling other options in
      binlog_row_value_options may imply that we include columns
      having other types in partial_columns).  Each bit is 1 if the
      column is stored in partial format in the after-image, and 0 if
      the column is either stored in full format in the after-image,
      or NULL in the after-image, or not included in the after-image.

    Note that partial_columns appears *between* before-image and
    after-image.  The reason is that (as was noted at the end of the
    previous section), this information is needed by both the
    before-image decoder and the after-image decoder.  In the encoder,
    the information is only known in the after-image encoder.
    Therefore, we organize the encoder and decoder as follows:

    - The before-image encoder writes only the before-image.
    - The after-image encoder writes the shared-image and the after-image.
    - The before-image decoder reads both the before-image and the
      shared-image, but leaves the read position at the beginning of
      the shared-image.
    - The after-image decoder reads both the shared-image and the
      after-image.

F3. The format for JSON columns in full format is:

    +--------+-------+
    | length | value |
    +--------+-------+

    This is true before and after this worklog.

    The *new* format for *partial* JSON columns in the after-image is:

    +--------+--------+--------+     +--------+
    | length | diff_1 | diff_2 | ... | diff_N |
    +--------+--------+--------+     +--------+

    Each diff_i represents a single JSON diff. It has the following
    format:

    +-----------+-------------+------+    +-------------+------+
    | operation | path_length | path |  ( | data_length | data | )?
    +-----------+-------------+------+    +-------------+------+

    The fields are:

    - operation: a single byte containing the JSON diff operation.
      The possible values are defined by enum_json_diff_operation:

      REPLACE=0
      INSERT=1
      REMOVE=2

    - path_length: an unsigned integer in net_field_length() format.

    - path: a string of 'path_length' bytes containing the JSON path
      of the update.

    - data_length: an unsigned integer in net_field_length() format.

    - data: a string of 'data_length' bytes containing the JSON
      document that will be inserted at the position specified by
      'path'.

    data_length and data are omitted if and only if operation=REMOVE.

==== 5. Capture logic ====

JSON diffs are generated unconditionally.

Since we add the option that enables / disables JSON diffs in this
worklog, this worklog shall also prohibit JSON diffs from being
generated when they are not needed. I.e., JSON diffs shall only be
generated by the optimizer when all the following holds (according to
section 2 above):
- the binary log is open and enabled for the transaction
- binlog_image_value_options=PARTIAL_JSON (this is not the default)
- the statement is written in row format
- log_bin_use_v1_row_events=0

==== 6. mysqlbinlog output ====

mysqlbinlog shall output BINLOG 'base64' statements containing the
partial JSON update. (No code change is needed to make this work).

mysqlbinlog -v shall output readable JSON in the pseudo-SQL:

 1. Example:

     original document:
      {
        "a": "replace this string value by 7",
        "b": [ 0, "replace this string by bb" ],
        "c": "remove this key-value pair, including the key c",
        "d": [ "remove this string" ],
        "0": "insert the key-value pair e: ee in the top-level object",
        "f": [ "insert ff after this string", "and before this string" ],
        "1": "insert the key-value pair g: gg in the top-level object"
      }
    diff vector:
      (REPLACE, '$.a', '7')
      (REPLACE, '$.b[1]', '"bb"')
      (REMOVE, '$.c')
      (REMOVE, '$.d[0]')
      (INSERT, '$.e', '"ee"')
      (INSERT, '$.f[1]', '"ff"')
      (INSERT, '$.g', '"gg"')
    output (may also have been the original SQL that created the diff vector):
      ### UPDATE `test`.`t`
      ### WHERE
      ###   @1=1
      ### SET
      ###   @2=JSON_INSERT(
      ###      JSON_ARRAY_INSERT(
      ###      JSON_INSERT(
      ###      JSON_REMOVE(
      ###      JSON_REPLACE(@2, '$.a', 7,
      ###      '$.b[1]', 'bb'),
      ###      '$.c'
      ###      '$.d[0]'),
      ###      '$.e', 'ee'),
      ###      '$.f[3]', 'ff'),
      ###      '$.g', 'gg'),

    Notice that:
    - REPLACE maps to JSON_REPLACE, REMOVE maps to JSON_REMOVE, INSERT
      maps to JSON_ARRAY_INSERT if the last component of the path is
      an array index (i.e. '[number]') and to JSON_INSERT otherwise.
    - Subsequent JSON diffs with different operations are printed like
      nested function calls (hence the function names appear in
      reverse order, compared to the order of the diffs).
    - Subsequent JSON diffs with the same operation are printed as one
      function call.
    - Multiple JSON functions are separated by a newline and multiple
      JSON diffs are separated by a newline.  In order to keep a
      one-diff update on a single line, there is no newline separating
      the last JSON function from the first JSON diff.

 2. Grammar for the part after '###   @COL=':

    FUNC        -> FUNC_NAME '(' FUNC_OR_COL path VALUE_OPT ')'
    FUNC_NAME   -> 'JSON_REPLACE' | 'JSON_REMOVE | 'JSON_INSERT' |
                   'JSON_ARRAY_INSERT
    FUNC_OR_COL -> '@' column | NEWLINE FUNC NEWLINE
    VALUE_OPT   -> ( ', ' value )?
    NEWLINE     -> ',\n###      '

    (where VALUE_OPT must be present if and only if FUNC_NAME!='JSON_REMOVE')

==== 7. Safety checks ====

If the slave has diverged and has a different JSON document than the
master, applying a partial JSON update may either work and produce a
new JSON document that is syntactically valid but may not be what the
user expects / break application logic; or fail, in case any of the
paths doesn't exist on the slave.

The following safety checks will be used:

 1. The replication decoder will check that the file format is
    correct. If not, generate ER_SLAVE_CORRUPT_EVENT.

 2. When applying the JSON diff, the optimizer should generate an
    error if the JSON diff cannot be applied (e.g. if any path does
    not exist, or is not in correct format for a path, or if any data
    is not in correct JSON format). In this case, the replication code
    shall generate an error and rollback.

 3. It may be possible to apply a JSON diff to a JSON document that is
    different from that on the master. This may work by chance and
    there is no guarantee as to what the result is.

---- 7.1. Alternatives that were REJECTED ----

The following alternatives were considered, but rejected:

 4. REJECTED: On master, generate a checksum before the partial
    update. Include this checksum in the binary log event. On slave,
    verify the checksum before applying the delta.

    Notes:

    - We could compute the checksum of only the updated parts (and
      possibly a number of bytes of context before and after each
      updated part, to avoid degenerate cases when only a few bytes
      are changed).  That would require the engine to fetch the
      updated parts.

    - The checksum needs to be agnostic to the internal, binary format
      (due to FR4).  Therefore it needs to be aware of the nested JSON
      structure, so it can be a bit complicated to compute.

    - If we augment the engine to store the checksum persistently, we
      could use a checksum on the entire JSON documents that allows
      fast recomputation of partial updates.  Checksums with such
      characteristics are Adler and Fletcher (which have a weaker
      level of protection than checksums that do not allow fast
      recomputation of partial updates, but that may be ok for the
      present use case).  This adds complexity to the feature.

    - Checksums are probabilistic by nature, they do not provide
      guarantees.  So even if we implement checksums we need to handle
      the case that a partial JSON update is applied to the wrong
      JSON documents.

 5. REJECTED: On slave, after applying a delta on a JSON column,
    verify that the column still contains valid JSON data.  (And if
    this feature is extended to other data types that have any kind of
    structure, verify that the data is still in correct
    form. E.g. valid UTF8 for TEXT columns.)

    (This was more relevant in earlier version of this worklog where
    we considered using phisical, binary diffs. Since we use logical
    JSON diffs, this is not necessary.)

==== 8. Rolling upgrades/downgrades, and cross-version replication ====

Replication OLD->NEW will work always, because we do not change the
existing event format and do not remove the code to read it.

Replication NEW->OLD will work whenever binlog_row_value_options='',
because then we use the old format.

==== 9. Future extensions ====

The following are possible future extensions (not implemented in this
worklog):

 1. Partial updates of other big data types, such as BLOB, BINARY,
    VARBINARY, TEXT, etc. This would mainly be optimizer work. The
    present worklog will do most of the needed replication
    changes. However, replication may need to add logic to correctly
    handle lossy slave_type_conversions where the slave uses a shorter
    column width than the master.

 2. Support for JSON diffs corresponding to JSON_ARRAY_INSERT and
    JSON_ARRAY_APPEND.  The code in the current worklog would work for
    these types, if only the optimizer generated them, except that the
    current worklog explicitly forbids those types in the
    deserialization code (so that we don't have untested code).  So if
    optimizer implements support for these update types, we can just
    remove the checks.  We would also need a new flag in
    binlog_row_value_options to enable the two new types.

 3. Support for enabling partial JSON only for a certain type of
    tables, by using a syntax like
    binlog_row_value_options='attribute:PARTIAL_JSON'

 4. Reduce memory usage.  Currently, disregarding JSON, RBR stores
    each row in memory two times more than necessary.  The partial
    JSON we introduce here adds to this by writing partial JSON to 2-3
    temporary locations.  The reason is that our internal interfaces
    are inconsistent, so for instance JSON must be serialied to a
    String whereas RBR events must be stored in a char* buffer before
    being serialized to an IO_CACHE.
Step 1: Implement new option binlog_row_value_options

This adds the option and all the checks for compatibility with other
options.  In this patch, the option does not yet have any effect.  We
also add test cases.

Step 2: Implement logic to generate JSON diffs only when needed

In WL#10570, JSON diffs are generated unconditionally by the
optimizer.  This patch changes so that JSON diffs are only generated
when the option is enabled, and not inhibited by other options.

Step 3: Implement new event type and use it when the option is enabled

This only creates a new event type code and uses this type code when
the option is enabled.  It does not change the format of the event
payload.  It adds a test to verify that the correct type code is
generated on both master and slave.

Step 4: Refactor: make max_row_length a private member of Row_data_memory

This function is only used internally in class Row_data_memory, but
was defined as a global function in table.cc.  Moved it to a private
member of Row_data_memory.

This refactoring is needed to prove that max_row_length is not used
elsewhere (so when we modify it in a later patch we don't break any
protocol).  It is also improves encapsulation.

Step 5: Refactor: remove unused variables in pack_row

This is a small refactoring / cleanup, just removing a couple of
unused variables to make it easier to follow the logic and understand
the real changes that come in later patches.

Step 6: Refactor: localize the scope of loop variable in pack_row

A loop variable was declared at top level in pack_row.  We move it to
be declared only in the loop.  This makes it easier to edit the file
later.

Step 7: Refactor: move null-bit handling to new utility classes

When reading and writing a row in a row event, the logic for iterating
over fields was interleaved with low-level bit operations to maintain
a bitmap of null fields.  This made the code error-prone and hard to
understand and edit.  This refactoring encapsulates the bitmap
handling in utility classes, and simplifies pack_row / unpack_row
accordingly.

Step 8: Refactor: add const to integer decoder functions in pack.cc

Functions in mysys/pack.cc that read from a buffer did not declare the
buffer as const.  This patch makes net_field_length_size use a const
parameter and makes other functions use const internally.  Since these
functions are part of the ABI, we also have to update
include/mysql.h.pp.  (We do not const-ify pointers-to-pointers in
function declarations, since that breaks compilation on other places
that call the functions using non-const arguments.)

Step 9: Refactor: change Json_diff_vector from a type alias to a class

This is needed because we are going to extend Json_diff_vector with
more member functions.  It also simplifies some forward declarations.

Step 10: Refactor: do not overload global identifier TABLE in rpl_tblmap.h

Class table_mapping in rpl_tblmap.h is used both in mysqlbinlog and in
the server.  In the server, it maps numbers to TABLE objects.  In
mysqlbinlog, it maps numbers to Table_map_log_event objects.  This was
implemented by using the type name TABLE, and in mysqlbinlog use a
typedef that makes TABLE an alias for Table_map_log_event.

This worked by chance, because no header included by mysqlbinlog
happened to have a declaration of struct TABLE.  However, in this
worklog we will need log_event.cc to include rpl_record.h from
log_event.cc, and rpl_record.h has a forward declaration of struct
TABLE.  This would clash with a typedef that maps TABLE to
Table_map_log_event and generate a compilation error.

Therefore, we change rpl_tblmap.h so that it does not use the
identifier TABLE.  Instead, it uses the new typedef Tablemap_table
that maps to TABLE in the server and to Table_map_log_event in
mysqlbinlog.

While changing this file, also removed some other weirdness:

- The function free_table_map_log_event only called delete on its
  argument.  It was declared in a different header than the one for
  the source file that defines the function.  This was pointless;
  removed the function and called delete directly instead.

- Removed comment that looked like someone used the source files
  instead of writing a mail :-)

- Simplified class comment.

Step 11: Refactor: print more to the debug trace

Improve indentation and debug trace printouts.

Step 12: Refactor: remove unused variable Rows_log_event::m_master_reclength

There was a member variable Rows_log_event::m_master_reclength that
was set to a (strange) value which was never read.  Removed this.

Step 13: Refactor: simplify Rows_log_event::read_write_bitmaps_cmp

This member function was implemented only in the base class, but had a
switch that made it execute differently depending on the instance's
subclass.  Changed to use a pure virtual function in the base class
and implement the different logic in each subclass.

Step 14: Refactor: simplify pack_row

This moves code so that it appears in a more logical order, and
changes some debug printouts.  More precisely:
- Start with DBUG_ENTER, then initialize variables that will be used
  throughout the function, then print to debug trace, then do the
  actual encoding.
- Try to use more descriptive variable names.
- Print more meaningful things to the debug trace (less pointers, more
  table and column names, try to make more clear what the different
  fields mean).
- Simplify a comment.

Step 15: Implement encoder of new event format

Outline of the pipeline:

 1. In binlog.cc:Row_data_memory, take a new argument in the
    constructor having two 'data' pointers (this constructor is used
    for Update_rows_log_event and is invoked in
    binlog.cc:THD::binlog_update_row).  This the value of the new
    server option binlog_row_value_options.  Based on this variable,
    determine if Json diffs may be used, estimate how much memory will
    be used (using the new function
    json_diff.cc:Json_diff_vector::binary_length), decide if full
    format or partial format will be used, and adjust the allocated
    memory accordingly.

 2. In binlog.cc:THD::binlog_update_row, pass two new arguments to
    pack_row:
    - row_image_type, which specifies if this is a
      Write/Update/Delete, and if it is a before-image or
      after-image.
    - value_options, which contains the value of
      binlog_row_value_options for update after-images.

 3. In rpl_record.cc:pack_row, accept the two new arguments.  If this
    is an update after-image and the bit in value_options is set, then
    determine if any column will use partial format.  If any column
    will use partial format, write the value_options field, followed
    by the partial_bits, to the output.  Otherwise, just write
    value_options=0 to the output and skip the value_options.

 3. From rpl_record.cc:pack_row, invoke the new function
    rpl_record.cc:pack_field to write a single field.  If the column
    is JSON and this is the after-image of an Update and the bit in
    value_options is set, invoke the new function
    field.cc:Field_json::pack_diff.  Otherwise, or if
    field.cc:Field_json::pack_diff returned NULL, fall back to the
    usual non-diff writer.

 4. In Field_json::pack_diff, determine again if this field will be
    smaller in full format or in partial format.  If full format is
    smaller, just return NULL so that rpl_record.cc:pack_field will
    write the full format.  Otherwise, invoke the new function
    json_diff.cc:Json_diff_vector::write_binary.

 5. In json_diff.cc:Json_diff_vector::write_binary, write the length
    using 4 bytes, followed by all the diffs.  Write each diff using
    the new function json_diff.c:Json_diff::write_binary.

 6. In json_diff.c:Json_diff::write_binary, write a single diff to the
    output.

Step 16: Refactor: simplify unpack_row

Make some small simplficationsin rpl_record.cc:unpack_row:

- Order the logic more clearly: First DEBUG_ENTER, then initialize
  some variables used throughout the function, then print to the debug
  trace, then start to do the actual decoding.

- Use less pointer arithmetic.

- Improve some comments.

- Remove meaningless variable 'error'.

Step 17: Implement decoder of new format

The pipeline is now:

 1. Add a parameter to
    log_event.cc:Rows_log_event::unpack_current_row, which says if
    this is an after-image or not.  Set the parameter from all the
    callers in log_event.cc.

 2. Move Rows_log_event::unpack_current_row from log_event.h to
    log_event.cc and make it pass two new arguments to
    rpl_record.cc:unpack_row: row_image_type, which indicates if this
    is Write/Update/Delete and before-image or after-image, and
    has_value_options, which is true for Update events when
    binlog_row_value_options=PARTIAL_JSON.

 3. Make rpl_record.cc:unpack_row accept the two new parameters.

    First make a few small refactorings in rpl_record.cc:unpack_row:
    - Clarify some variable names and improve the comment for the
      function.
    - Remove comments about unpack_row being used by backup, having
      rli==NULL.  This may have been an intention at some point in
      time, perhaps in 5.1, but probably never was true.  And rli is
      unconditionally dereferenced in the main loop, so it cannot be
      NULL.  Instead assert that it is not NULL.  Also assert that
      other parameters are not NULL, as well as other preconditions.
    - Improve some debug trace printouts.
    - Return bool instead of int since the caller does not need to
      distinguish more than two different return statuses.

    Then implement the new logic:

    - When partial format is enabled, read partial_bits before the
      after-image (from within the main loop, as well as from the loop
      that consumes unused fields), and also read partial_bits after
      the before-image (after the main loop).  For the before-image,
      leave the read-position before the partial_bits.  Use the new
      auxiliary function start_partial_bits_reader to read the
      value_options and initialize the Bit_reader accordingly, in the
      two places (after before-image and before after-image).
    - In order to read the correct number of bits before the
      after-image, start_partial_bits_reader needs to know the number
      of JSON columns on the master.  This is known from the
      table_map_log_event via the table_def class.  For convenience
      (and reuse in the mysqlbinlog patch), we add a member function
      rpl_utility.cc:table_def::json_column_count.  This function also
      caches the computed column count, to speed up successive calls
      (e.g. for many-row updates).
    - For the before-image, set the corresponding bit in the table's
      read_set, for any column having a 1 in the partial_bits.  This
      tells the engine to fetch the blob from storage (later, when the
      engine is invoked).  The blob will be needed since we have to
      apply the diff on it.
    - Call an auxiliary function rpl_record.cc:unpack_field to read
      each field move some special case handling for blobs into this
      function too.

 4. In rpl_record.cc:unpack_field, call field.cc:Field_json::unpack_field
    for partial Json fields.

 5. Add new function field.cc:Field_json::unpack_field, which invokes
    the new function json_diff.cc:Json_diff_vector::read_binary to
    read the Json_diff_vector, and the pre-existing (since WL#10570)
    function apply_json_diffs to apply the diff.

    The Json_diff_vector uses a new MEM_ROOT rather than the one of
    the current_thd, because that allows memory to be freed for each
    value, which saves resources e.g. in case of many-row updates.

    Before apply_json_diff can be invoked, we need to call
    table->mark_column_for_partial_update and
    table->setup_partial_update, in order to enable the *slave* server
    to generate JSON diffs in the *slave's* binary log.

 6. Add the new function json_diff.cc:Json_diff_vector:read_binary.
    This function reads the length of the field, then iterates over
    the diffs, reads each diff in turn, constructs Json_path and
    Json_wrapper/Json_dom objects, and appends them to the
    Json_diff_vector.

    We implement the auxiliary function net_field_length_checked,
    which reads an integer in packed format (see mysys/pack.cc),
    checking for out-of-bounds conditions.

Step 18: Implement decoding and pretty-formatting of JSON diffs in mysqlbinlog

mysqlbinlog outputs row events in two forms:

- BINLOG statements that a server can apply.  There is nothing to
  change to make this work for the new event type.
- "Pseudo-SQL" that humans can read, in case mysqlbinlog is invoked
  with the -v flag.  This is what the present patch implements.

The pipeline in mysqlbinlog is:

 1. log_event.cc:Rows_log_event::print_verbose invokes
    log_event.cc:Rows_log_event::print_verbose_one_row with the new
    argument row_image_type, which indicates if this is a
    Write/Update/Delete and whether it is a before-image or
    after-image.

 2. In log_event.cc:log_event.cc:Rows_log_event::print_verbose_one_row
    we do two things:
    - Refactorings:
      - Use a Bit_reader to read the null bits, instead of using bit
        arithmetic.
      - Use safer boundary checks.  The code has a pointer to row data
        and a pointer to the end of the row data.  In C/C++, a pointer
        may point to the next byte after an allocated block of memory,
        but incrementing it further has an undefined result.  After
        reading the length of a field, the correct way to check that
        this length is not corrupt is to compare it with the end
        pointer minus the pointer to the read position.  (Before, it
        added the length to the read position and compared with the
        end pointer, but the read position plus the length is
        undefined.)
    - Implement the feature:
      - Read the value_options, if this is the after-image of a
        PARTIAL_UPDATE_ROWS_EVENT.
      - If value_options has the PARTIAL_JSON bit set, read the
        partial_bits.
      - Pass the partialness of the column as a parameter to
        log_event.cc:log_event_print_value.

 3. In the new function log_event_print_value, accept the new
    parameter, and in case the value is partial, call the new function
    log_event.cc:print_json_diff to parse and print the Json diffs.

 4. In the new function log_event.cc:print_json_diff, read, parse, and
    print all the diffs.

    The output has the form:
      JSON_<func>(
      JSON_<func>(
      ...
      JSON_<func>(@column, path[, value][,
                  path [,value][,
                  ...]]),
      ...
                  path[, value][,
                  path [,value][,
                  ...]]),
                  path[, value][,
                  path [,value][,
                  ...]])

    In this output format, the JSON_<func> functions appear in
    *reversed* order, whereas all the (path, value) pairs appear in
    order of appearance.  Therefore, we make two passes over the
    sequence of diffs:

     1. Read just the operations and store them in a vector.  Then
        print the operations in reverse order. Operations are printed
        using the new function log_event.cc:json_wrapper_to_string.

     2. Read the full diffs and output in the order of appearance.

 5. Add a new function log_event.cc:json_wrapper_to_string to print a
    Json_wrapper.  This ensures that the Json values are printed in
    the correct type.  JSON_<func> functions will convert SQL types to
    their JSON equivalents: for instance, the JSON function
    JSON_SET('[1, 2]', '$[0]', '[]') will set the 0th element of the
    JSON array to a string containing an open and closing square
    bracket, and not to an empty JSON array.  To account for this,
    different data types need different quoting, and to insert a JSON
    object or JSON array we need to cast the string to JSON first.

 6. To output JSON values with correct quoting for SQL strings, we use
    the existing my_b_write_quoted, but change it so that:
    - it uses a lookup table (computed only once) for simplicity and
      performance;
    - it prints common escapes such as \n, \\ in a more human-readable
      way.

Then, we add tests for printing JSON diffs.