WL#2955: RBR replication of partial JSON updates
Affects: Server-8.0
Status: Complete
==== Executive Summary ====
MySQL shall replicate small updates of big JSON documents more space
efficiently. More precisely, when using RBR, we will write only
modified parts of JSON documents, instead of the whole JSON document.
==== User stories ====
U1. As a DBA, I do partial updates to JSON values, and I don't use
replication. Then I enable (row-based) replication. That causes
performance to drop and disk usage to grow. The reason is that
the full JSON document is written to the binary log and replicated
to slaves. (Writes of full JSON documents were not done by the
single server because InnoDB supports partial JSON updates.)
     So enabling replication may cause a performance degradation.
     The effect will be reduced when we replicate partial JSON documents
     and the user sets binlog_row_image=MINIMAL.
U2. As a DBA, I do partial updates to JSON documents, and I use
statement-based replication. Then I switch to row-based
replication. Then performance drops and the binary logs grow. The
reason is that the full JSON document is written to the binary log
and replicated to slaves. (Such writes of full JSON documents were
not done by SBR.)
     So switching from SBR to RBR may cause a performance degradation.
     The effect will be reduced if we replicate partial JSON documents
     and the user sets binlog_row_image=MINIMAL.
U3. As a DBA, I use row-based replication, and I don't use JSON. Then
I start to use JSON and do partial updates. Then my slave starts to
fall behind / lag. The reason is that the slave needs to hand the
full JSON document to the engine, as well as handle the full JSON
document a couple of extra times in the replication pipeline. (The
master makes the partial update more efficiently.)
So doing partial JSON updates can cause slave lag.
The effect will be reduced if we replicate partial JSON documents
     and the user sets binlog_row_image=MINIMAL.
U4. As a DBA, I would like to use both replication and partial updates
to JSON documents, but I cannot do that in 5.7 because it is too
slow. I upgrade to 8.0 because it is supposed to optimize the
partial JSON documents. I get disappointed because this did not help
much. (The slave is equally slow, and the master is not so much
faster because it writes full JSON documents to the binary log.)
So the InnoDB optimization in 8.0 did not help much. The positive
effect will be more visible if we replicate partial JSON documents.
U5. As a power user, I have various scripts that mine the binary log,
extract the after-image, and do interesting things. I do not want
to update my scripts, and/or my scripts can only work if the full
after-image is there.
So partial JSON replication should be optional.
U6. Both optimizer and NDB are interested in using binary diff
algorithms to implement partial blob replication for blob updates.
This will make blob updates more efficient in InnoDB and NDB,
     respectively (for roughly the same reasons as explained in U1-U4).
So the user interface and the binary log format should be extensible
so that we can add support for other data types in the future.
* Functional requirements
** Core feature.
FR1. Add a new server option.
If the user disables the option, the server must only write full
JSON documents to the binary log, and never partial JSON
updates.
If the user enables the option, the server may write partial
JSON updates to the after-image of RBR updates in the binary log
when possible. It may also write full JSON documents, e.g. in
case the server cannot generate a partial JSON update, or if the
partial JSON update would be bigger than the full document.
      Exceptions: when the binary log is disabled, when using
      statement-based replication, or when using
      log_bin_use_v1_row_events, the option has no effect. The option
      only affects the after-image; therefore, when
      binlog_row_image=FULL, or when the JSON column is part of the
      PK, or when there is no PK, the before-image will contain the
      JSON column in full format.
FR2. The slave must decode the partial JSON updates and apply them
correctly.
FR3. When a binary log contains row events with partial JSON updates,
mysqlbinlog must output BINLOG statements such that when the 8.0
server applies them, the JSON documents shall be correctly
updated.
FR4. The partial JSON update must apply correctly on slave as long as
the JSON documents are equal as seen from SQL, regardless of the
low-level storage format. That is, even if the number of
internal unused padding bytes in the InnoDB disk format differs
between master and slave, the following must hold:
- If the JSON documents are equal according to the SQL '='
operator applied directly on the JSON documents, then this
equality must hold also after doing the
update on the master and applying the resulting partial JSON
update on the slave.
- If the JSON documents are equal according to the SQL '='
operator applied on the textual representation of the JSON
documents, then this equality must hold also after doing the
update on the master and applying the resulting partial JSON
update on the slave.
      (But in both cases, the internal representation may be different
      both before and after applying the change. Moreover, even if
      the internal representations are identical before the update,
      it is not a requirement that they are identical after the
      update.)
In particular, if a JSON document is created on a single server,
and subsequently updated using JSON functions that replace part
of the document by a smaller document, and then the database is
cloned to another server using mysqldump, then subsequent
replication of partial JSON from the original server to the
newly created server shall succeed.
FR5. If master and slave have diverged so that they have different
JSON documents, a partial update must not crash the server and
must not produce invalid JSON. The partial update must either
      cause the applier thread to stop with an error, or update the
      JSON document on the slave in an unpredictable way.
FR6. When the slave has the option enabled, and applies a JSON diff,
then:
FR6.1. The slave shall generate JSON diffs in its binary log
FR6.2. The slave shall use binary diffs to update the engine
** Usability and Compatibility
FR7. mysqlbinlog -v shall output human-readable and complete
information about partial JSON updates.
FR8. When partial JSON is disabled, the binary log format must not
change (therefore this does not block cross-version replication,
neither OLD->NEW nor NEW->OLD).
FR9. The new server option shall have a type that makes it easy to
extend to other datatypes in the future.
FR10. When replicating from a NEW master having the option enabled, to
an OLD slave having the option disabled, the OLD slave must stop
with an error message and fail to apply the partial updates.
** Storage and performance
Def1. The *size of a partial JSON update* is equal to the total size
of all JSON diff objects, and each JSON diff object shall be the
size of the JSON path, plus the size of the binary
representation of the inserted JSON (sub-)document, if any, plus
a small number of bytes (bounded by a constant).
Def2. The *size of a full JSON update* is equal to the size of the
binary representation of the full document, plus a small number
of bytes (bounded by a constant).
FR11. When the feature is enabled, and the server is able to generate
a partial JSON update for a row change, and the size of the
partial JSON update is smaller than the size of the full JSON
update ("Def1 < Def2"), then the binary log size shall decrease
by size of the full JSON update and increase by the size of the
partial JSON update.
FR12. When the feature is enabled, and the server is unable to
generate a partial JSON update for a row change, or the size of
the partial JSON update is greater than the size of the full
JSON update ("Def1 > Def2"), then the binary log size shall
change only by a small constant number of bytes (the difference
between D2 and a raw binary format).
FR13. When the feature is disabled, the size of the binlog shall be
unchanged.
FR14. When the feature is disabled, the memory usage shall not change.
FR15. When the feature is enabled, the throughput (on each of master
and slave) shall decrease by at most 2% (as long as the
increased memory usage does not cause the server to swap).
FR16. When the feature is disabled, the throughput shall not change.
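To illustrate Def1 and Def2 with concrete numbers (the figures are
hypothetical and only meant to show the orders of magnitude): suppose
a 1 MB JSON document stored in column j is updated with
JSON_REPLACE(j, '$.a', 7). The size of the partial JSON update is
roughly
  1 byte (operation)
  + 1 byte (path length) + 3 bytes (path '$.a')
  + 1 byte (data length) + a few bytes (binary representation of 7)
i.e. on the order of 10 bytes, whereas the size of the full JSON
update is roughly 1 MB. In this case FR11 applies, and the after-image
shrinks by a factor of roughly 100,000. If instead the whole document
is replaced, the diff is at least as big as the document, FR12
applies, and only a few bytes of overhead (in the format of section 4
below, essentially the value_options field) are added.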
* Non-functional requirements
NFR1. The disk format must be possible to extend to accommodate partial
      updates of other data types in the future.
* NON-requirements
NR1. "If the slave has a different value for a JSON document,
compared to the master, and the master does a partial JSON
update, the update shall not take effect on the slave."
This cannot be implemented in a practical and efficient way. So
a partial JSON update applied on the wrong JSON document may
corrupt the JSON document.
NR2. "The binary log size shall never increase."
When partial JSON updates cannot be computed, the binary log
size will increase by a few bytes per JSON column; see FR11.
NR3. "When partial JSON is enabled, all JSON updates *must* use the
partial format. When the server cannot generate partial JSON,
it should use the JSON path '$' and the full value."
The uniformity of this format might seem appealing. However, in
case the old value was NULL, applying any JSON function on it
will produce the result NULL. This could be handled by a
special case in the decoder, which would replace updates at path
'$' by a full write (without applying the JSON function). But
then the uniformity would be lost anyways.
NR4. "When the feature is enabled, the memory usage for JSON updates
shall increase by at most X% (on each of master and slave)."
We do not provide such guarantees. The size of the
Json_diff_vector is probably unbounded (except the
     MAX_ALLOWED_PACKET limit), in case the user executes statements
like:
CREATE TABLE t (j JSON);
INSERT INTO t VALUES ('[1, 2, 3, ..., 1000]');
UPDATE t SET j = JSON_SET(j, '$', '[1, 2, ..., 1000]',
'$', '[1, 2, ..., 1000]',
...
                   '$', '[1, 2, ..., 1000]');
And even disregarding this, the replication code uses 2-3 extra
copies of the JSON value. This is because of limitations in
existing frameworks and has to be addressed separately.
NR5. "When partial JSON is enabled, it shall be possible for user to
predict if a given change will be written to the binary log in
full format or partial format based on knowing the old JSON
value and the operations applied to it."
     This property is not guaranteed, since the server will decide
     to use the format (full or partial) that has the smallest size.
     And the size of a full object is not determined by the JSON
     value; in fact, two JSON objects having the same textual
     representation may have different internal binary
     representations.
==== 1. SQL interface ====
We introduce a new SQL variable:
NAME:
binlog_row_value_options (variable)
binlog-row-value-options (command line argument)
SCOPE: command-line, global and session
TYPE: set
VALUES: 'PARTIAL_JSON'
DEFAULT: ''
DYNAMIC: yes
RESTRICTIONS: requires SUPER; cannot be set inside transactions,
stored procedures, triggers, or functions.
DESCRIPTION:
Enables a space-efficient binary log format for updates that modify
only a small portion of a data value. Currently, this is only
implemented for updates of JSON documents. It only works when
log-bin=1 and binlog_format=ROW or MIXED. It is also recommended to
use binlog_row_image=NOBLOB or MINIMAL; with binlog_row_image=FULL,
the benefit is smaller because the full JSON document will be stored
in the before-image and the partial update in the after-image. This
option overrides log_bin_use_v1_row_events, so even when
log_bin_use_v1_row_events=1 it uses the new event format with
partial JSON.
WARNINGS:
1. Suppose any of the following conditions holds:
- user sets @@session.binlog_row_value_options=PARTIAL_JSON,
when either @@session.binlog_format=STATEMENT, or the binlog
is not open, or @@session.sql_log_bin=0;
- user sets @@session.binlog_format=STATEMENT or
@@session.sql_log_bin=0 when
@@session.binlog_row_value_options=PARTIAL_JSON;
- user sets @@global.binlog_row_value_options=PARTIAL_JSON or
@@persist.binlog_row_value_options when either
@@global.binlog_format=STATEMENT, or the binlog is not open;
- user sets @@global.binlog_format=STATEMENT or
@@persist.binlog_format=STATEMENT when
@@global.binlog_row_value_options=PARTIAL_JSON;
- when the server starts, the command line or configuration file
or persisted configuration contains
binlog-row-value-options=PARTIAL_JSON when either
binlog-format=STATEMENT, or log-bin is not set;
Then the following warning shall be generated:
"When %.192s, the setting binlog_row_value_options=%.192s will
be ignored and no partial updates will be written to the binary
log."
where the first %.192s is replaced by "binlog_format=STATEMENT"
or "the binary log is disabled", and the second %.192s is
replaced by "PARTIAL_JSON".
2. Suppose any of the following conditions holds:
- user sets @@session.binlog_row_value_options=PARTIAL_JSON or
@@global.binlog_row_value_options=PARTIAL_JSON or
@@persist.binlog_row_value_options=PARTIAL_JSON when
@@global.log_bin_use_v1_row_events=1;
- user sets @@global.log_bin_use_v1_row_events=1 or
@@persist.log_bin_use_v1_row_events=1 when
@@global.binlog_row_value_options=PARTIAL_JSON;
- when the server starts, the command line or configuration file
or persisted configuration contains
binlog-row-value-options=PARTIAL_JSON and
log-bin-use-v1-row-events=1;
Then the following warning shall be generated:
"When %.192s, the option log_bin_use_v1_row_events=1 will be
ignored and row events will be written in new format to binary
log."
   where %.192s is replaced by "binlog_row_value_options=PARTIAL_JSON".
3. Suppose any of the following conditions holds:
- user sets @@session.binlog_row_value_options=PARTIAL_JSON when
@@session.binlog_row_image=FULL;
- user sets @@session.binlog_row_image=FULL when
@@session.binlog_row_value_options=PARTIAL_JSON;
- user sets @@global.binlog_row_value_options=PARTIAL_JSON or
@@persist.binlog_row_value_options=PARTIAL_JSON when
@@global.binlog_row_image=FULL;
- user sets @@global.binlog_row_image=FULL or
@@persist.binlog_row_image=FULL when
@@global.binlog_row_value_options=PARTIAL_JSON;
- when the server starts, the command line or configuration file
or persisted configuration contains binlog-row-image=FULL and
binlog-row-value-options=PARTIAL_JSON;
Then the following warning shall be generated:
"When %.192s, the option binlog_row_value_options=%.192s will
be used only for the after-image. Full values will be written
in the before-image, so the saving in disk space due to
binlog_row_value_options is limited to less than 50%%."
where the first %.192s is replaced by "binlog_row_image=FULL"
and the second %.192s is replaced by "PARTIAL_JSON".
NOTES:
  Even though the option currently has only two states, we make it a
  set instead of a boolean so that it can easily be extended with
  e.g. PARTIAL_BLOB and PARTIAL_TEXT in the future (FR9).
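For illustration, a minimal session that enables the feature and
performs an update that is eligible for partial logging (the
statements are a sketch; they assume a server started with log-bin
and binlog_format=ROW, and a user with SUPER):

  SET @@session.binlog_row_value_options = 'PARTIAL_JSON';
  CREATE TABLE t (id INT PRIMARY KEY, doc JSON);
  INSERT INTO t VALUES (1, '{"a": "x", "b": "y"}');
  -- Eligible for partial format in the binary log (see section 2):
  UPDATE t SET doc = JSON_SET(doc, '$.a', 'z') WHERE id = 1;
  -- Disable again; subsequent updates are logged in full format:
  SET @@session.binlog_row_value_options = '';

If the session instead had binlog_format=STATEMENT, the first SET
would succeed but generate warning 1 above, and no partial updates
would be written.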
==== 2. Logic to decide to use partial format ====
Update_rows_log_event will use a format that allows JSON documents to
be stored in partial format, whenever
binlog_row_value_options=PARTIAL_JSON.
Partial format is used when all the following holds:
1. The binary log is open and enabled for the transaction
2. binlog_row_value_options=PARTIAL_JSON (this is not the default)
3. the statement is written in row format
4. log_bin_use_v1_row_events=0
5. The UPDATE statement has the form:
SET column = f1(f2(...fn(column, ...), ... )...)
   where each of f1, f2, ..., fn is either JSON_SET or JSON_REPLACE
   (see the examples at the end of this section).
6. The resulting sequence of diffs has a smaller byte size than the
full value, when serialized in the binlog.
7. The JSON value is in the after-image (the before-image, if
   present, always uses full format).
If any one of the above conditions does not hold, the full format is
used.
Notice that #6 above ensures that FR11 holds. Also notice that #6 is
non-deterministic in the following sense. Two JSON documents that
have equal string representations, may use different binary
representations internally, either due to having different data types
(e.g. timestamp vs string) or due to having different amounts of
internal padding (e.g. because the documents were created in different
ways). Then if the same JSON operation is applied to both documents,
one of them may be written as a JSON diff and the other as a full
document. So the user cannot in general predict whether full format
or partial format will be used (cf. NR5).
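To illustrate condition #5 (the statements are a sketch, assuming a
table t with a JSON column j):

  -- Qualifies: nested JSON_SET/JSON_REPLACE applied to the column itself:
  UPDATE t SET j = JSON_REPLACE(JSON_SET(j, '$.a', 1), '$.b', 2);
  -- Does not qualify: the function is not JSON_SET or JSON_REPLACE:
  UPDATE t SET j = JSON_MERGE_PATCH(j, '{"a": 1}');
  -- Does not qualify: the innermost argument is not the updated column:
  UPDATE t SET j = JSON_SET('{}', '$.a', 1);

Even for the first statement, full format is still used if condition
#6 fails, i.e. if the serialized diffs would not be smaller than the
serialized full value.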
==== 3. Applier logic ====
The RBR applier for update events has the following control flow
(leaving out steps that are irrelevant to the present worklog):
1. Decode the before-image.
2. Set the table's read_set to those columns that need to be retrieved
from the engine.
3. (Tell the storage engine to) look up the before-image in the table.
4. Decode the after-image.
5. Set the engine's write_set to those columns that need to be stored
in the engine.
6. (Tell the storage engine to) write the after-image to the table.
7. Write the row change to the binary log. This includes those
columns that are in the read_set and the write_set in the event
data.
And now:
- Step 6 needs to write full values to the engine, because there is no
interface to send a JSON diff to the engine.
- Therefore, step 4 needs to apply the JSON diff to the old version of
the JSON document.
- Therefore, step 3 needs to retrieve JSON columns which are in
partial format in the after-image from the engine.
- Therefore, step 2 needs to include JSON columns which are in
  partial format in the after-image in the table's read_set.
- Therefore, step 1 needs to read the information about which columns
are partial from the event data.
Therefore, we make the following changes in steps 1, 2, and 4 of the
applier:
1. Decode the before-image, plus the information about which JSON
columns are stored in partial format in the after-image.
2. Set the bit for each partial column in the table's read_set.
4. Decode the after-image. For columns stored in partial format,
decode the partial format correctly. Apply the JSON diffs to the
column data to produce the full after-image.
Notice that both the before-image decoder and the after-image decoder
need to know which JSON columns are stored in partial format. As we
will see in the next section, this has implications for the storage
format.
Notice that step 7 uses the read_set to determine which columns
will be included in the before-image. Therefore, the JSON columns
which were added to the read_set only because they are partial in
the after-image need to be removed from the read_set again before
step 7. Otherwise the slave would store all partial JSON columns in
full format in the before-image in the slave's binary log. This
logic appears to be implemented already, so there is nothing to
change in this worklog.
==== 4. Binary log format ====
A new event type is introduced. This event type differs from
Update_rows_log_event in the following ways:
F1. It has a new type code.
F2. It has a new layout for the header of the before-image.
F3. It has a new format for JSON columns.
(There is no change in Table_map_log_event, Write_rows_log_event, or
Delete_rows_log_event.)
We now describe the format changes more precisely:
F1. The *old* type code is UPDATE_ROWS_EVENT=31 (or
UPDATE_ROWS_EVENT_V1=24, in case log_bin_use_v1_row_events=1).
The *new* type code is PARTIAL_UPDATE_ROWS_EVENT=39.
(Internally, it uses the same class, it is only the type code that
differs.)
The new type code is used whenever
binlog_row_value_options=PARTIAL_JSON and
log_bin_use_v1_row_events=0.
F2. For UPDATE_ROWS_EVENT and UPDATE_ROWS_EVENT_V1, the format for a
row is (before and after this worklog):
+--------------+-------------+
| before-image | after-image |
+--------------+-------------+
(and each image has the following format:
+-----------+-------+-------+ +-------+
| null_bits | col_1 | col_2 | ... | col_N |
+-----------+-------+-------+ +-------+
)
For PARTIAL_UPDATE_ROWS_EVENT, we introduce a *shared-image*. The
shared-image appears between before-image and after-image:
+--------------+--------------+-------------+
| before-image | shared-image | after-image |
+--------------+--------------+-------------+
shared-image has one of the following structures:
+-----------------+
| value_options=0 |
+-----------------+
+-----------------+-----------------+
| value_options=1 | partial_columns |
+-----------------+-----------------+
The new fields are:
- value_options: A bitmap, stored as an integer in
net_field_length() format. This bitmap currently has one bit,
PARTIAL_JSON_UPDATES (the lowest bit), hence the integer value
of the bitmap is 0 or 1. When the bit is set to 0, all columns
are stored in full format. When the bit is 1, each JSON column
may be stored in either partial format or full format. The
exact format is specified per column in the partial_columns
field. (In case we implement partial formats for other data
types in the future, or even other partial formats for JSON,
then those modes can be enabled and disabled by adding more bits
to value_options.)
The encoder sets value_options=1 when
@@session.binlog_row_value_options=PARTIAL_JSON and at least one
column is stored in partial JSON format. In case
@@session.binlog_row_value_options='', or there are no JSON
updates, or each JSON update either did not generate a JSON diff
or generated a JSON diff that was bigger than the full value,
the encoder sets value_options=0 and skips the partial_columns
field.
- partial_columns: A bitmap having one bit for each column in the
table that has a type that might be partial according to
value_options (i.e., in the current worklog, there is a bit for
JSON columns; in the future, enabling other options in
binlog_row_value_options may imply that we include columns
having other types in partial_columns). Each bit is 1 if the
column is stored in partial format in the after-image, and 0 if
the column is either stored in full format in the after-image,
or NULL in the after-image, or not included in the after-image.
    Note that partial_columns appears *between* the before-image and
    the after-image. The reason is that (as noted at the end of the
    previous section) this information is needed by both the
    before-image decoder and the after-image decoder. On the encoding
    side, the information only becomes known while encoding the
    after-image.
Therefore, we organize the encoder and decoder as follows:
- The before-image encoder writes only the before-image.
- The after-image encoder writes the shared-image and the after-image.
- The before-image decoder reads both the before-image and the
shared-image, but leaves the read position at the beginning of
the shared-image.
- The after-image decoder reads both the shared-image and the
after-image.
F3. The format for JSON columns in full format is:
+--------+-------+
| length | value |
+--------+-------+
This is true before and after this worklog.
The *new* format for *partial* JSON columns in the after-image is:
+--------+--------+--------+ +--------+
| length | diff_1 | diff_2 | ... | diff_N |
+--------+--------+--------+ +--------+
Each diff_i represents a single JSON diff. It has the following
format:
+-----------+-------------+------+ +-------------+------+
| operation | path_length | path | ( | data_length | data | )?
+-----------+-------------+------+ +-------------+------+
The fields are:
- operation: a single byte containing the JSON diff operation.
The possible values are defined by enum_json_diff_operation:
REPLACE=0
INSERT=1
REMOVE=2
- path_length: an unsigned integer in net_field_length() format.
- path: a string of 'path_length' bytes containing the JSON path
of the update.
- data_length: an unsigned integer in net_field_length() format.
- data: a string of 'data_length' bytes containing the JSON
document that will be inserted at the position specified by
'path'.
data_length and data are omitted if and only if operation=REMOVE.
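As a worked example of the shared-image (F2) and diff (F3) formats
above (a hypothetical encoding, for illustration only): consider a
table whose only JSON column is updated by a diff vector containing
the single diff (REMOVE, '$.c'). Since at least one column is
partial, the encoder writes value_options=1 followed by
partial_columns, which here contains a single bit (one JSON column),
set to 1:

    +-----------------+---------------------+
    | value_options=1 | partial_columns=0b1 |
    +-----------------+---------------------+

In the after-image, the column is then stored in partial format, and
the diff itself occupies 5 bytes (net_field_length() stores values
below 251 in a single byte):

    +----------------+---------------+------------+
    | operation=0x02 | path_length=3 | path='$.c' |
    +----------------+---------------+------------+

Since operation=REMOVE, there is no data_length and no data. A
REPLACE or INSERT diff would additionally carry data_length followed
by the binary representation of the inserted (sub-)document.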
==== 5. Capture logic ====
After WL#10570, JSON diffs are generated unconditionally by the
optimizer. Since this worklog adds the option that enables / disables
JSON diffs, it shall also prevent JSON diffs from being generated
when they are not needed. I.e., JSON diffs shall only be generated by
the optimizer when all the following holds (according to section 2
above):
- the binary log is open and enabled for the transaction
- binlog_row_value_options=PARTIAL_JSON (this is not the default)
- the statement is written in row format
- log_bin_use_v1_row_events=0
==== 6. mysqlbinlog output ====
mysqlbinlog shall output BINLOG 'base64' statements containing the
partial JSON update. (No code change is needed to make this work).
mysqlbinlog -v shall output the partial update as human-readable
pseudo-SQL:
1. Example:
original document:
{
"a": "replace this string value by 7",
"b": [ 0, "replace this string by bb" ],
"c": "remove this key-value pair, including the key c",
"d": [ "remove this string" ],
"0": "insert the key-value pair e: ee in the top-level object",
"f": [ "insert ff after this string", "and before this string" ],
"1": "insert the key-value pair g: gg in the top-level object"
}
diff vector:
(REPLACE, '$.a', '7')
(REPLACE, '$.b[1]', '"bb"')
(REMOVE, '$.c')
(REMOVE, '$.d[0]')
(INSERT, '$.e', '"ee"')
(INSERT, '$.f[1]', '"ff"')
(INSERT, '$.g', '"gg"')
output (may also have been the original SQL that created the diff vector):
### UPDATE `test`.`t`
### WHERE
### @1=1
### SET
###   @2=JSON_INSERT(
###     JSON_ARRAY_INSERT(
###       JSON_INSERT(
###         JSON_REMOVE(
###           JSON_REPLACE(@2, '$.a', 7,
###             '$.b[1]', 'bb'),
###           '$.c',
###           '$.d[0]'),
###         '$.e', 'ee'),
###       '$.f[1]', 'ff'),
###     '$.g', 'gg')
Notice that:
- REPLACE maps to JSON_REPLACE, REMOVE maps to JSON_REMOVE, INSERT
maps to JSON_ARRAY_INSERT if the last component of the path is
an array index (i.e. '[number]') and to JSON_INSERT otherwise.
- Subsequent JSON diffs with different operations are printed like
nested function calls (hence the function names appear in
reverse order, compared to the order of the diffs).
- Subsequent JSON diffs with the same operation are printed as one
function call.
- Multiple JSON functions are separated by a newline and multiple
JSON diffs are separated by a newline. In order to keep a
one-diff update on a single line, there is no newline separating
the last JSON function from the first JSON diff.
2. Grammar for the part after '### @COL=':
FUNC -> FUNC_NAME '(' FUNC_OR_COL path VALUE_OPT ')'
   FUNC_NAME -> 'JSON_REPLACE' | 'JSON_REMOVE' | 'JSON_INSERT' |
                'JSON_ARRAY_INSERT'
FUNC_OR_COL -> '@' column | NEWLINE FUNC NEWLINE
VALUE_OPT -> ( ', ' value )?
NEWLINE -> ',\n### '
(where VALUE_OPT must be present if and only if FUNC_NAME!='JSON_REMOVE')
==== 7. Safety checks ====
If the slave has diverged and has a different JSON document than the
master, applying a partial JSON update may either work and produce a
new JSON document that is syntactically valid but may not be what the
user expects / break application logic; or fail, in case any of the
paths doesn't exist on the slave.
The following safety checks will be used:
1. The replication decoder will check that the event format is
   correct. If not, it generates ER_SLAVE_CORRUPT_EVENT.
2. When applying the JSON diff, the optimizer should generate an
   error if the JSON diff cannot be applied (e.g. if any path does
   not exist, any path is malformed, or any data is not in correct
   JSON format). In this case, the replication code shall generate
   an error and roll back.
3. It may be possible to apply a JSON diff to a JSON document that is
different from that on the master. This may work by chance and
there is no guarantee as to what the result is.
---- 7.1. Alternatives that were REJECTED ----
The following alternatives were considered, but rejected:
4. REJECTED: On master, generate a checksum before the partial
update. Include this checksum in the binary log event. On slave,
verify the checksum before applying the delta.
Notes:
- We could compute the checksum of only the updated parts (and
possibly a number of bytes of context before and after each
updated part, to avoid degenerate cases when only a few bytes
are changed). That would require the engine to fetch the
updated parts.
- The checksum needs to be agnostic to the internal, binary format
(due to FR4). Therefore it needs to be aware of the nested JSON
structure, so it can be a bit complicated to compute.
- If we augment the engine to store the checksum persistently, we
could use a checksum on the entire JSON documents that allows
fast recomputation of partial updates. Checksums with such
characteristics are Adler and Fletcher (which have a weaker
level of protection than checksums that do not allow fast
recomputation of partial updates, but that may be ok for the
present use case). This adds complexity to the feature.
   - Checksums are probabilistic by nature; they do not provide
     guarantees. So even if we implement checksums, we need to handle
     the case that a partial JSON update is applied to the wrong
     JSON document.
5. REJECTED: On slave, after applying a delta on a JSON column,
verify that the column still contains valid JSON data. (And if
this feature is extended to other data types that have any kind of
structure, verify that the data is still in correct
form. E.g. valid UTF8 for TEXT columns.)
   (This was more relevant in an earlier version of this worklog,
   where we considered using physical, binary diffs. Since we use
   logical JSON diffs, this is not necessary.)
==== 8. Rolling upgrades/downgrades, and cross-version replication ====
Replication OLD->NEW will always work, because we do not change the
existing event format and do not remove the code to read it.
Replication NEW->OLD will work whenever binlog_row_value_options='',
because then we use the old format.
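For example (illustrative), before connecting an OLD slave to a NEW
master, one can ensure that the old format is used:

  SET PERSIST binlog_row_value_options = '';

(SET PERSIST, or SET GLOBAL combined with an update of the
configuration file, makes the setting survive a restart.)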
==== 9. Future extensions ====
The following are possible future extensions (not implemented in this
worklog):
1. Partial updates of other big data types, such as BLOB, BINARY,
VARBINARY, TEXT, etc. This would mainly be optimizer work. The
present worklog will do most of the needed replication
changes. However, replication may need to add logic to correctly
handle lossy slave_type_conversions where the slave uses a shorter
column width than the master.
2. Support for JSON diffs corresponding to JSON_ARRAY_INSERT and
JSON_ARRAY_APPEND. The code in the current worklog would work for
these types, if only the optimizer generated them, except that the
current worklog explicitly forbids those types in the
deserialization code (so that we don't have untested code). So if
optimizer implements support for these update types, we can just
remove the checks. We would also need a new flag in
binlog_row_value_options to enable the two new types.
3. Support for enabling partial JSON only for a certain type of
tables, by using a syntax like
binlog_row_value_options='attribute:PARTIAL_JSON'
4. Reduce memory usage. Currently, disregarding JSON, RBR stores
each row in memory two times more than necessary. The partial
JSON we introduce here adds to this by writing partial JSON to 2-3
   temporary locations. The reason is that our internal interfaces
   are inconsistent: for instance, JSON must be serialized to a
   String, whereas RBR events must be stored in a char* buffer before
   being serialized to an IO_CACHE.
==== 10. Implementation plan ====
Step 1: Implement new option binlog_row_value_options
  This adds the option and all the checks for compatibility with
  other options. In this patch, the option does not yet have any
  effect. We also add test cases.
Step 2: Implement logic to generate JSON diffs only when needed
  In WL#10570, JSON diffs are generated unconditionally by the
  optimizer. This patch changes that, so that JSON diffs are only
  generated when the option is enabled and not inhibited by other
  options.
Step 3: Implement new event type and use it when the option is enabled
  This only creates a new event type code and uses this type code
  when the option is enabled. It does not change the format of the
  event payload. It adds a test to verify that the correct type code
  is generated on both master and slave.
Step 4: Refactor: make max_row_length a private member of Row_data_memory
  This function is only used internally in class Row_data_memory,
  but was defined as a global function in table.cc. Moved it to a
  private member of Row_data_memory. This refactoring is needed to
  prove that max_row_length is not used elsewhere (so when we modify
  it in a later patch we don't break any protocol). It also improves
  encapsulation.
Step 5: Refactor: remove unused variables in pack_row
  This is a small refactoring / cleanup, just removing a couple of
  unused variables to make it easier to follow the logic and
  understand the real changes that come in later patches.
Step 6: Refactor: localize the scope of loop variable in pack_row
  A loop variable was declared at the top level of pack_row. We move
  it so that it is declared only in the loop. This makes it easier
  to edit the file later.
Step 7: Refactor: move null-bit handling to new utility classes
  When reading and writing a row in a row event, the logic for
  iterating over fields was interleaved with low-level bit
  operations to maintain a bitmap of null fields. This made the code
  error-prone and hard to understand and edit. This refactoring
  encapsulates the bitmap handling in utility classes, and
  simplifies pack_row / unpack_row accordingly.
Step 8: Refactor: add const to integer decoder functions in pack.cc
  Functions in mysys/pack.cc that read from a buffer did not declare
  the buffer as const. This patch makes net_field_length_size use a
  const parameter and makes other functions use const internally.
  Since these functions are part of the ABI, we also have to update
  include/mysql.h.pp. (We do not const-ify pointers-to-pointers in
  function declarations, since that breaks compilation in other
  places that call the functions using non-const arguments.)
Step 9: Refactor: change Json_diff_vector from a type alias to a class
  This is needed because we are going to extend Json_diff_vector
  with more member functions. It also simplifies some forward
  declarations.
Step 10: Refactor: do not overload global identifier TABLE in rpl_tblmap.h
  Class table_mapping in rpl_tblmap.h is used both in mysqlbinlog
  and in the server. In the server, it maps numbers to TABLE
  objects. In mysqlbinlog, it maps numbers to Table_map_log_event
  objects. This was implemented by using the type name TABLE, and in
  mysqlbinlog using a typedef that makes TABLE an alias for
  Table_map_log_event. This worked by chance, because no header
  included by mysqlbinlog happened to have a declaration of struct
  TABLE. However, in this worklog we will need log_event.cc to
  include rpl_record.h, and rpl_record.h has a forward declaration
  of struct TABLE. This would clash with a typedef that maps TABLE
  to Table_map_log_event and generate a compilation error.
  Therefore, we change rpl_tblmap.h so that it does not use the
  identifier TABLE. Instead, it uses the new typedef Tablemap_table
  that maps to TABLE in the server and to Table_map_log_event in
  mysqlbinlog.
  While changing this file, we also removed some other weirdness:
  - The function free_table_map_log_event only called delete on its
    argument. It was declared in a different header than the one for
    the source file that defines the function. This was pointless;
    removed the function and called delete directly instead.
  - Removed a comment that looked like someone used the source files
    instead of writing a mail :-)
  - Simplified the class comment.
Step 11: Refactor: print more to the debug trace
  Improve indentation and debug trace printouts.
Step 12: Refactor: remove unused variable Rows_log_event::m_master_reclength
  There was a member variable Rows_log_event::m_master_reclength
  that was set to a (strange) value which was never read. Removed
  this.
Step 13: Refactor: simplify Rows_log_event::read_write_bitmaps_cmp
  This member function was implemented only in the base class, but
  had a switch that made it execute differently depending on the
  instance's subclass. Changed to use a pure virtual function in the
  base class and implement the different logic in each subclass.
Step 14: Refactor: simplify pack_row
  This moves code so that it appears in a more logical order, and
  changes some debug printouts. More precisely:
  - Start with DBUG_ENTER, then initialize variables that will be
    used throughout the function, then print to the debug trace,
    then do the actual encoding.
  - Try to use more descriptive variable names.
  - Print more meaningful things to the debug trace (fewer pointers;
    more table and column names; try to make more clear what the
    different fields mean).
  - Simplify a comment.
Step 15: Implement encoder of new event format
  Outline of the pipeline:
  1. In binlog.cc:Row_data_memory, add a new argument to the
     constructor that takes two 'data' pointers (this constructor is
     used for Update_rows_log_event and is invoked in
     binlog.cc:THD::binlog_update_row). The new argument holds the
     value of the new server option binlog_row_value_options. Based
     on this variable, determine if JSON diffs may be used, estimate
     how much memory will be used (using the new function
     json_diff.cc:Json_diff_vector::binary_length), decide if full
     format or partial format will be used, and adjust the allocated
     memory accordingly.
  2. In binlog.cc:THD::binlog_update_row, pass two new arguments to
     pack_row:
     - row_image_type, which specifies if this is a
       Write/Update/Delete, and if it is a before-image or
       after-image.
     - value_options, which contains the value of
       binlog_row_value_options for update after-images.
  3. In rpl_record.cc:pack_row, accept the two new arguments. If
     this is an update after-image and the bit in value_options is
     set, determine if any column will use partial format. If any
     column will use partial format, write the value_options field,
     followed by the partial_bits, to the output. Otherwise, just
     write value_options=0 to the output and skip the partial_bits.
  4. From rpl_record.cc:pack_row, invoke the new function
     rpl_record.cc:pack_field to write a single field. If the column
     is JSON and this is the after-image of an Update and the bit in
     value_options is set, invoke the new function
     field.cc:Field_json::pack_diff. Otherwise, or if
     field.cc:Field_json::pack_diff returned NULL, fall back to the
     usual non-diff writer.
  5. In Field_json::pack_diff, determine again if this field will be
     smaller in full format or in partial format. If full format is
     smaller, just return NULL so that rpl_record.cc:pack_field will
     write the full format. Otherwise, invoke the new function
     json_diff.cc:Json_diff_vector::write_binary.
  6. In json_diff.cc:Json_diff_vector::write_binary, write the
     length using 4 bytes, followed by all the diffs.
  7. In json_diff.cc:Json_diff::write_binary, write a single diff to
     the output.
Step 16: Refactor: simplify unpack_row
  Make some small simplifications in rpl_record.cc:unpack_row:
  - Order the logic more clearly: first DBUG_ENTER, then initialize
    some variables used throughout the function, then print to the
    debug trace, then start to do the actual decoding.
  - Use less pointer arithmetic.
  - Improve some comments.
  - Remove the meaningless variable 'error'.
Step 17: Implement decoder of new format
  The pipeline is now:
  1. Add a parameter to
     log_event.cc:Rows_log_event::unpack_current_row, which says if
     this is an after-image or not. Set the parameter from all the
     callers in log_event.cc.
  2. Move Rows_log_event::unpack_current_row from log_event.h to
     log_event.cc and make it pass two new arguments to
     rpl_record.cc:unpack_row: row_image_type, which indicates if
     this is Write/Update/Delete and before-image or after-image,
     and has_value_options, which is true for Update events when
     binlog_row_value_options=PARTIAL_JSON.
  3. Make rpl_record.cc:unpack_row accept the two new parameters.
     First make a few small refactorings in
     rpl_record.cc:unpack_row:
     - Clarify some variable names and improve the comment for the
       function.
     - Remove comments about unpack_row being used by backup, having
       rli==NULL. This may have been an intention at some point in
       time, perhaps in 5.1, but probably never was true. And rli is
       unconditionally dereferenced in the main loop, so it cannot
       be NULL. Instead assert that it is not NULL. Also assert that
       other parameters are not NULL, as well as other
       preconditions.
     - Improve some debug trace printouts.
     - Return bool instead of int, since the caller does not need to
       distinguish more than two different return statuses.
     Then implement the new logic:
     - When partial format is enabled, read partial_bits before the
       after-image (from within the main loop, as well as from the
       loop that consumes unused fields), and also read partial_bits
       after the before-image (after the main loop). For the
       before-image, leave the read position before the
       partial_bits. Use the new auxiliary function
       start_partial_bits_reader to read the value_options and
       initialize the Bit_reader accordingly, in the two places
       (after the before-image and before the after-image).
     - In order to read the correct number of bits before the
       after-image, start_partial_bits_reader needs to know the
       number of JSON columns on the master. This is known from the
       Table_map_log_event via the table_def class. For convenience
       (and reuse in the mysqlbinlog patch), we add a member
       function rpl_utility.cc:table_def::json_column_count. This
       function also caches the computed column count, to speed up
       successive calls (e.g. for many-row updates).
     - For the before-image, set the corresponding bit in the
       table's read_set for any column having a 1 in the
       partial_bits. This tells the engine to fetch the blob from
       storage (later, when the engine is invoked). The blob will be
       needed since we have to apply the diff on it.
     - Call an auxiliary function rpl_record.cc:unpack_field to read
       each field, and move some special-case handling for blobs
       into this function too.
  4. In rpl_record.cc:unpack_field, call
     field.cc:Field_json::unpack_field for partial JSON fields.
  5. Add the new function field.cc:Field_json::unpack_field, which
     invokes the new function
     json_diff.cc:Json_diff_vector::read_binary to read the
     Json_diff_vector, and the pre-existing (since WL#10570)
     function apply_json_diffs to apply the diffs. The
     Json_diff_vector uses a new MEM_ROOT rather than the one of
     current_thd, because that allows memory to be freed for each
     value, which saves resources e.g. in case of many-row updates.
     Before apply_json_diffs can be invoked, we need to call
     table->mark_column_for_partial_update and
     table->setup_partial_update, in order to enable the *slave*
     server to generate JSON diffs in the *slave's* binary log.
  6. Add the new function json_diff.cc:Json_diff_vector::read_binary.
     This function reads the length of the field, then iterates over
     the diffs, reads each diff in turn, constructs Json_path and
     Json_wrapper/Json_dom objects, and appends them to the
     Json_diff_vector. We implement the auxiliary function
     net_field_length_checked, which reads an integer in packed
     format (see mysys/pack.cc), checking for out-of-bounds
     conditions.
Step 18: Implement decoding and pretty-formatting of JSON diffs in mysqlbinlog
  mysqlbinlog outputs row events in two forms:
  - BINLOG statements that a server can apply. There is nothing to
    change to make this work for the new event type.
  - "Pseudo-SQL" that humans can read, in case mysqlbinlog is
    invoked with the -v flag. This is what the present patch
    implements.
  The pipeline in mysqlbinlog is:
  1. log_event.cc:Rows_log_event::print_verbose invokes
     log_event.cc:Rows_log_event::print_verbose_one_row with the new
     argument row_image_type, which indicates if this is a
     Write/Update/Delete and whether it is a before-image or
     after-image.
  2. In log_event.cc:Rows_log_event::print_verbose_one_row we do two
     things:
     - Refactorings:
       - Use a Bit_reader to read the null bits, instead of using
         bit arithmetic.
       - Use safer boundary checks. The code has a pointer to row
         data and a pointer to the end of the row data. In C/C++, a
         pointer may point to the next byte after an allocated block
         of memory, but incrementing it further has an undefined
         result. After reading the length of a field, the correct
         way to check that this length is not corrupt is to compare
         it with the end pointer minus the read position. (Before,
         it added the length to the read position and compared with
         the end pointer, but the read position plus the length is
         undefined.)
     - Implement the feature:
       - Read the value_options, if this is the after-image of a
         PARTIAL_UPDATE_ROWS_EVENT.
       - If value_options has the PARTIAL_JSON bit set, read the
         partial_bits.
       - Pass the partialness of the column as a parameter to
         log_event.cc:log_event_print_value.
  3. In log_event.cc:log_event_print_value, accept the new
     parameter, and in case the value is partial, call the new
     function log_event.cc:print_json_diff to parse and print the
     JSON diffs.
  4. In the new function log_event.cc:print_json_diff, read, parse,
     and print all the diffs. The output has the form:
       JSON_<OP_n>(
       ...
       JSON_<OP_2>(
       JSON_<OP_1>(@column,
       path[, value][, path[, value][, ...]]),
       ...
       path[, value][, path[, value][, ...]]),
       path[, value][, path[, value][, ...]])
     In this output format, the JSON_<OP> function names appear in
     *reversed* order, whereas all the (path, value) pairs appear in
     order of appearance. Therefore, we make two passes over the
     sequence of diffs:
     1. Read just the operations and store them in a vector. Then
        print the operations in reverse order.
     2. Read the full diffs and output them in order of appearance.
        Values are printed using the new function
        log_event.cc:json_wrapper_to_string.
  5. Add a new function log_event.cc:json_wrapper_to_string to print
     a Json_wrapper. This ensures that the JSON values are printed
     with the correct type. JSON functions convert SQL types to
     their JSON equivalents: for instance,
     JSON_SET('[1, 2]', '$[0]', '[]') sets the 0th element of the
     JSON array to a string containing an opening and a closing
     square bracket, and not to an empty JSON array. To account for
     this, different data types need different quoting, and to
     insert a JSON object or JSON array we need to cast the string
     to JSON first.
  6. To output JSON values with correct quoting for SQL strings, we
     use the existing my_b_write_quoted, but change it so that:
     - it uses a lookup table (computed only once) for simplicity
       and performance;
     - it prints common escapes such as \n, \\ in a more
       human-readable way.
  Then, we add tests for printing JSON diffs.