WL#2955: RBR replication of partial JSON updates
Affects: Server-8.0
Status: Complete
==== Executive Summary ====

MySQL shall replicate small updates of big JSON documents more space-efficiently. More precisely, when using RBR, we will write only the modified parts of a JSON document to the binary log, instead of the whole document.

==== User stories ====

U1. As a DBA, I do partial updates to JSON values, and I don't use replication. Then I enable (row-based) replication. That causes performance to drop and disk usage to grow. The reason is that the full JSON document is written to the binary log and replicated to slaves. (Writes of full JSON documents were not done by the single, non-replicating server, because InnoDB supports partial JSON updates.) So enabling replication may give a performance degradation. The effect will be reduced when we replicate partial JSON documents and the user sets binlog_row_image=MINIMAL.

U2. As a DBA, I do partial updates to JSON documents, and I use statement-based replication. Then I switch to row-based replication. Then performance drops and the binary logs grow. The reason is that the full JSON document is written to the binary log and replicated to slaves. (Such writes of full JSON documents were not done by SBR.) So switching from SBR to RBR may give a performance degradation. The effect will be reduced if we replicate partial JSON documents and the user sets binlog_row_image=MINIMAL.

U3. As a DBA, I use row-based replication, and I don't use JSON. Then I start to use JSON and do partial updates. Then my slave starts to fall behind / lag. The reason is that the slave needs to hand the full JSON document to the engine, and to handle the full JSON document a couple of extra times in the replication pipeline. (The master makes the partial update more efficiently.) So doing partial JSON updates can cause slave lag. The effect will be reduced if we replicate partial JSON documents and the user sets binlog_row_image=MINIMAL.

U4. As a DBA, I would like to use both replication and partial updates to JSON documents, but I cannot do that in 5.7 because it is too slow. I upgrade to 8.0 because it is supposed to optimize partial JSON updates. I am disappointed because this does not help much. (The slave is equally slow, and the master is not much faster, because it writes full JSON documents to the binary log.) So the InnoDB optimization in 8.0 did not help much. The positive effect will be more visible if we replicate partial JSON documents.

U5. As a power user, I have various scripts that mine the binary log, extract the after-image, and do interesting things. I do not want to update my scripts, and/or my scripts can only work if the full after-image is there. So partial JSON replication should be optional.

U6. Both the optimizer and NDB are interested in using binary diff algorithms to implement partial blob replication for blob updates. This will make blob updates more efficient in InnoDB and NDB, respectively (for roughly the same reasons as explained in U1-U4). So the user interface and the binary log format should be extensible, so that we can add support for other data types in the future.
* Functional requirements

** Core feature

FR1. Add a new server option. If the user disables the option, the server must only write full JSON documents to the binary log, and never partial JSON updates. If the user enables the option, the server may write partial JSON updates to the after-image of RBR updates in the binary log when possible. It may also write full JSON documents, e.g. in case the server cannot generate a partial JSON update, or if the partial JSON update would be bigger than the full document. Exceptions: when the binary log is disabled, when using statement-based replication, or when using log_bin_use_v1_row_events, the option has no effect. (The option only affects the after-image; therefore, when binlog_row_image=FULL, or when the JSON column is part of the PK, or when there is no PK, the before-image will contain the JSON column in full.)

FR2. The slave must decode partial JSON updates and apply them correctly.

FR3. When a binary log contains row events with partial JSON updates, mysqlbinlog must output BINLOG statements such that when the 8.0 server applies them, the JSON documents shall be correctly updated.

FR4. A partial JSON update must apply correctly on the slave as long as the JSON documents are equal as seen from SQL, regardless of the low-level storage format. That is, even if the number of internal unused padding bytes in the InnoDB disk format differs between master and slave, the following must hold:
- If the JSON documents are equal according to the SQL '=' operator applied directly on the JSON documents, then this equality must hold also after doing the update on the master and applying the resulting partial JSON update on the slave.
- If the JSON documents are equal according to the SQL '=' operator applied on the textual representation of the JSON documents, then this equality must hold also after doing the update on the master and applying the resulting partial JSON update on the slave.
(In both cases, the internal representation may be different both before and after applying the change. Moreover, even if the internal representations are identical before the update, it is not a requirement that they are equal after the update.) In particular, if a JSON document is created on a single server, subsequently updated using JSON functions that replace part of the document by a smaller document, and the database is then cloned to another server using mysqldump, then subsequent replication of partial JSON from the original server to the newly created server shall succeed.

FR5. If master and slave have diverged so that they have different JSON documents, a partial update must not crash the server and must not produce invalid JSON. The partial update must either cause the applier thread to stop with an error, or update the JSON document on the slave in an unpredictable way.

FR6. When the slave has the option enabled, and applies a JSON diff, then:
FR6.1. The slave shall generate JSON diffs in its binary log.
FR6.2. The slave shall use binary diffs to update the engine.

** Usability and Compatibility

FR7. mysqlbinlog -v shall output human-readable and complete information about partial JSON updates.

FR8. When partial JSON is disabled, the binary log format must not change (therefore this does not block cross-version replication, neither OLD->NEW nor NEW->OLD).

FR9. The new server option shall have a type that makes it easy to extend to other data types in the future.

FR10. When replicating from a NEW master having the option enabled to an OLD slave having the option disabled, the OLD slave must stop with an error message and fail to apply the partial updates.

** Storage and performance

Def1. The *size of a partial JSON update* is equal to the total size of all JSON diff objects, where each JSON diff object has the size of the JSON path, plus the size of the binary representation of the inserted JSON (sub-)document, if any, plus a small number of bytes (bounded by a constant).

Def2. The *size of a full JSON update* is equal to the size of the binary representation of the full document, plus a small number of bytes (bounded by a constant).

FR11. When the feature is enabled, and the server is able to generate a partial JSON update for a row change, and the size of the partial JSON update is smaller than the size of the full JSON update ("Def1 < Def2"), then the binary log size shall decrease by the size of the full JSON update and increase by the size of the partial JSON update.

FR12. When the feature is enabled, and the server is unable to generate a partial JSON update for a row change, or the size of the partial JSON update is greater than the size of the full JSON update ("Def1 > Def2"), then the binary log size shall change only by a small constant number of bytes (the difference between Def2 and a raw binary format).

FR13. When the feature is disabled, the size of the binlog shall be unchanged.

FR14. When the feature is disabled, the memory usage shall not change.

FR15. When the feature is enabled, the throughput (on each of master and slave) shall decrease by at most 2% (as long as the increased memory usage does not cause the server to swap).

FR16. When the feature is disabled, the throughput shall not change.

* Non-functional requirements

NFR1. The disk format must be possible to extend to accommodate partial updates of other data types in the future.

* NON-requirements

NR1. "If the slave has a different value for a JSON document, compared to the master, and the master does a partial JSON update, the update shall not take effect on the slave." This cannot be implemented in a practical and efficient way. So a partial JSON update applied on the wrong JSON document may corrupt the JSON document.

NR2. "The binary log size shall never increase." When partial JSON updates cannot be computed, the binary log size will increase by a few bytes per JSON column; see FR12.

NR3. "When partial JSON is enabled, all JSON updates *must* use the partial format. When the server cannot generate partial JSON, it should use the JSON path '$' and the full value." The uniformity of this format might seem appealing. However, in case the old value was NULL, applying any JSON function to it will produce the result NULL. This could be handled by a special case in the decoder, which would replace updates at path '$' by a full write (without applying the JSON function). But then the uniformity would be lost anyway.

NR4. "When the feature is enabled, the memory usage for JSON updates shall increase by at most X% (on each of master and slave)." We do not provide such guarantees. The size of the Json_diff_vector is probably unbounded (except by the MAX_ALLOWED_PACKET limit), in case the user executes statements like:

  CREATE TABLE t (j JSON);
  INSERT INTO t VALUES ('[1, 2, 3, ..., 1000]');
  UPDATE t SET j = JSON_SET(j,
    '$', '[1, 2, ..., 1000]',
    '$', '[1, 2, ..., 1000]',
    ...
    '$', '[1, 2, ..., 1000]');

And even disregarding this, the replication code uses 2-3 extra copies of the JSON value. This is because of limitations in existing frameworks and has to be addressed separately.

NR5. "When partial JSON is enabled, it shall be possible for the user to predict whether a given change will be written to the binary log in full format or partial format, based on knowing the old JSON value and the operations applied to it." This property is not guaranteed, since the server will choose whichever format (full or partial) has the smallest size. And the size of a full value is not determined by the JSON value alone; in fact, two JSON objects having the same textual representation may have different internal binary representations.
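To make the core feature concrete, here is a minimal SQL sketch (the table and data are hypothetical illustrations, not part of this worklog):

  -- Requires SUPER; must be set outside transactions, stored programs,
  -- triggers, and functions (see section 1 below).
  SET @@session.binlog_row_value_options = 'PARTIAL_JSON';
  SET @@session.binlog_row_image = 'MINIMAL';

  CREATE TABLE t (id INT PRIMARY KEY, j JSON);
  INSERT INTO t VALUES
    (1, '{"status": "new", "payload": "pretend this is a very big document"}');

  -- With the option enabled, only the diff (REPLACE, '$.status', '"paid"')
  -- needs to be written to the after-image, not the whole document,
  -- provided the serialized diff is smaller than the full value (FR11).
  UPDATE t SET j = JSON_SET(j, '$.status', 'paid') WHERE id = 1;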
==== 1. SQL interface ====

We introduce a new SQL variable:

NAME: binlog_row_value_options (variable)
      binlog-row-value-options (command-line argument)
SCOPE: command line, global, and session
TYPE: set
VALUES: 'PARTIAL_JSON'
DEFAULT: ''
DYNAMIC: yes
RESTRICTIONS: only SUPER can set this, and only outside transactions, stored procedures, triggers, and functions.
DESCRIPTION: Enables a space-efficient binary log format for updates that modify only a small portion of a data value. Currently, this is only implemented for updates of JSON documents. It only works when log-bin=1 and binlog_format=ROW or MIXED. It is also recommended to use binlog_row_image=NOBLOB or MINIMAL; with binlog_row_image=FULL, the benefit is smaller because the full JSON document will be stored in the before-image and the partial update in the after-image. This option overrides log_bin_use_v1_row_events, so even when log_bin_use_v1_row_events=1, the new event format with partial JSON is used.

WARNINGS:

1. Suppose any of the following conditions holds:
- the user sets @@session.binlog_row_value_options=PARTIAL_JSON when either @@session.binlog_format=STATEMENT, or the binlog is not open, or @@session.sql_log_bin=0;
- the user sets @@session.binlog_format=STATEMENT or @@session.sql_log_bin=0 when @@session.binlog_row_value_options=PARTIAL_JSON;
- the user sets @@global.binlog_row_value_options=PARTIAL_JSON or @@persist.binlog_row_value_options=PARTIAL_JSON when either @@global.binlog_format=STATEMENT, or the binlog is not open;
- the user sets @@global.binlog_format=STATEMENT or @@persist.binlog_format=STATEMENT when @@global.binlog_row_value_options=PARTIAL_JSON;
- when the server starts, the command line, configuration file, or persisted configuration contains binlog-row-value-options=PARTIAL_JSON when either binlog-format=STATEMENT, or log-bin is not set.
Then the following warning shall be generated: "When %.192s, the setting binlog_row_value_options=%.192s will be ignored and no partial updates will be written to the binary log." where the first %.192s is replaced by "binlog_format=STATEMENT" or "the binary log is disabled", and the second %.192s is replaced by "PARTIAL_JSON".

2. Suppose any of the following conditions holds:
- the user sets @@session.binlog_row_value_options=PARTIAL_JSON, @@global.binlog_row_value_options=PARTIAL_JSON, or @@persist.binlog_row_value_options=PARTIAL_JSON when @@global.log_bin_use_v1_row_events=1;
- the user sets @@global.log_bin_use_v1_row_events=1 or @@persist.log_bin_use_v1_row_events=1 when @@global.binlog_row_value_options=PARTIAL_JSON;
- when the server starts, the command line, configuration file, or persisted configuration contains binlog-row-value-options=PARTIAL_JSON and log-bin-use-v1-row-events=1.
Then the following warning shall be generated: "When %.192s, the option log_bin_use_v1_row_events=1 will be ignored and row events will be written in new format to binary log." where %.192s is replaced by "binlog_row_value_options=PARTIAL_JSON".

3. Suppose any of the following conditions holds:
- the user sets @@session.binlog_row_value_options=PARTIAL_JSON when @@session.binlog_row_image=FULL;
- the user sets @@session.binlog_row_image=FULL when @@session.binlog_row_value_options=PARTIAL_JSON;
- the user sets @@global.binlog_row_value_options=PARTIAL_JSON or @@persist.binlog_row_value_options=PARTIAL_JSON when @@global.binlog_row_image=FULL;
- the user sets @@global.binlog_row_image=FULL or @@persist.binlog_row_image=FULL when @@global.binlog_row_value_options=PARTIAL_JSON;
- when the server starts, the command line, configuration file, or persisted configuration contains binlog-row-image=FULL and binlog-row-value-options=PARTIAL_JSON.
Then the following warning shall be generated: "When %.192s, the option binlog_row_value_options=%.192s will be used only for the after-image. Full values will be written in the before-image, so the saving in disk space due to binlog_row_value_options is limited to less than 50%%." where the first %.192s is replaced by "binlog_row_image=FULL" and the second %.192s is replaced by "PARTIAL_JSON".

NOTES: Even though this is currently only two-state, we make it a set instead of a boolean, so that it can easily be extended with e.g. PARTIAL_BLOB and PARTIAL_TEXT in the future (FR9).
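For instance, the first warning can be triggered as follows (a hypothetical session sketch; the setting is accepted but has no effect):

  SET @@session.binlog_format = 'STATEMENT';
  SET @@session.binlog_row_value_options = 'PARTIAL_JSON';
  SHOW WARNINGS;
  -- When binlog_format=STATEMENT, the setting
  -- binlog_row_value_options=PARTIAL_JSON will be ignored and no partial
  -- updates will be written to the binary log.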
==== 2. Logic to decide to use partial format ====

Update_rows_log_event will use a format that allows JSON documents to be stored in partial format whenever binlog_row_value_options=PARTIAL_JSON. Partial format is used when all of the following hold:
1. The binary log is open and enabled for the transaction.
2. binlog_row_value_options=PARTIAL_JSON (this is not the default).
3. The statement is written in row format.
4. log_bin_use_v1_row_events=0.
5. The UPDATE statement has the form SET column = f1(f2(...fn(column, ...), ...)...), where each of f1, f2, ..., fn is JSON_SET, JSON_REPLACE, or JSON_REMOVE.
6. The resulting sequence of diffs has a smaller byte size than the full value, when serialized in the binlog.
7. Only the after-image uses partial format (the before-image, if present, uses full format).

If any one of the above conditions does not hold, the full format is used. An illustration follows this list.

Notice that #6 above ensures that FR11 holds. Also notice that #6 is non-deterministic in the following sense: two JSON documents that have equal string representations may use different binary representations internally, either due to having different data types (e.g. timestamp vs. string) or due to having different amounts of internal padding (e.g. because the documents were created in different ways). If the same JSON operation is applied to both documents, one of them may be written as a JSON diff and the other as a full document. So the user cannot in general predict whether full format or partial format will be used (cf. NR5).
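To make condition #5 concrete, here are hypothetical statements against a table t(i INT PRIMARY KEY, j JSON, j2 JSON):

  -- May use partial format: nested JSON_SET/JSON_REPLACE/JSON_REMOVE calls
  -- applied to the same column that is being assigned.
  UPDATE t SET j = JSON_REPLACE(JSON_SET(j, '$.a', 1), '$.b', 2);
  UPDATE t SET j = JSON_REMOVE(j, '$.c');

  -- Always uses full format:
  UPDATE t SET j = '{"a": 1}';                    -- not a JSON function
  UPDATE t SET j = JSON_SET(j2, '$.a', 1);        -- reads a different column
  UPDATE t SET j = JSON_ARRAY_APPEND(j, '$', 1);  -- function outside the allowed set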
==== 3. Applier logic ====

The RBR applier for update events has the following control flow (leaving out steps that are irrelevant to the present worklog):
1. Decode the before-image.
2. Set the table's read_set to those columns that need to be retrieved from the engine.
3. (Tell the storage engine to) look up the before-image in the table.
4. Decode the after-image.
5. Set the engine's write_set to those columns that need to be stored in the engine.
6. (Tell the storage engine to) write the after-image to the table.
7. Write the row change to the binary log. This includes those columns that are in the read_set and the write_set in the event data.

And now:
- Step 6 needs to write full values to the engine, because there is no interface to send a JSON diff to the engine.
- Therefore, step 4 needs to apply the JSON diff to the old version of the JSON document.
- Therefore, step 3 needs to retrieve from the engine those JSON columns that are in partial format in the after-image.
- Therefore, step 2 needs to include in the table's read_set those JSON columns that are in partial format in the after-image.
- Therefore, step 1 needs to read the information about which columns are partial from the event data.

Therefore, we make the following changes in steps 1, 2, and 4 of the applier:
1. Decode the before-image, plus the information about which JSON columns are stored in partial format in the after-image.
2. Set the bit for each partial column in the table's read_set.
4. Decode the after-image. For columns stored in partial format, decode the partial format correctly, and apply the JSON diffs to the column data to produce the full after-image.

Notice that both the before-image decoder and the after-image decoder need to know which JSON columns are stored in partial format. As we will see in the next section, this has implications for the storage format.

Notice also that step 7 uses the read_set to determine which columns will be included in the before-image. Therefore, the JSON columns that were added to the read_set only because they are partial in the after-image need to be removed from the read_set again. Otherwise, the slave would store all partial JSON columns in full format in the before-image in the slave's binary log. Apparently this logic is already implemented, so there is nothing to change in this worklog. (???)
==== 4. Binary log format ====

A new event type is introduced. This event type differs from Update_rows_log_event in the following ways:
F1. It has a new type code.
F2. It has a new layout for the header of the before-image.
F3. It has a new format for JSON columns.
(There is no change in Table_map_log_event, Write_rows_log_event, or Delete_rows_log_event.)

We now describe the format changes more precisely:

F1. The *old* type code is UPDATE_ROWS_EVENT=31 (or UPDATE_ROWS_EVENT_V1=24, in case log_bin_use_v1_row_events=1). The *new* type code is PARTIAL_UPDATE_ROWS_EVENT=39. (Internally, it uses the same class; only the type code differs.) The new type code is used whenever binlog_row_value_options=PARTIAL_JSON and log_bin_use_v1_row_events=0.

F2. For UPDATE_ROWS_EVENT and UPDATE_ROWS_EVENT_V1, the format for a row is (before and after this worklog):

  +--------------+-------------+
  | before-image | after-image |
  +--------------+-------------+

(and each image has the following format:

  +-----------+-------+-------+     +-------+
  | null_bits | col_1 | col_2 | ... | col_N |
  +-----------+-------+-------+     +-------+

)

For PARTIAL_UPDATE_ROWS_EVENT, we introduce a *shared-image*. The shared-image appears between before-image and after-image:

  +--------------+--------------+-------------+
  | before-image | shared-image | after-image |
  +--------------+--------------+-------------+

shared-image has one of the following structures:

  +-----------------+
  | value_options=0 |
  +-----------------+

  +-----------------+-----------------+
  | value_options=1 | partial_columns |
  +-----------------+-----------------+

The new fields are:

- value_options: A bitmap, stored as an integer in net_field_length() format. This bitmap currently has one bit, PARTIAL_JSON_UPDATES (the lowest bit), hence the integer value of the bitmap is 0 or 1. When the bit is 0, all columns are stored in full format. When the bit is 1, each JSON column may be stored in either partial format or full format; the exact format is specified per column in the partial_columns field. (In case we implement partial formats for other data types in the future, or even other partial formats for JSON, those modes can be enabled and disabled by adding more bits to value_options.) The encoder sets value_options=1 when @@session.binlog_row_value_options=PARTIAL_JSON and at least one column is stored in partial JSON format. In case @@session.binlog_row_value_options='', or there are no JSON updates, or each JSON update either did not generate a JSON diff or generated a JSON diff that was bigger than the full value, the encoder sets value_options=0 and skips the partial_columns field.

- partial_columns: A bitmap having one bit for each column in the table whose type can be partial according to value_options (i.e., in the current worklog, there is a bit for each JSON column; in the future, enabling other options in binlog_row_value_options may imply that columns of other types are included in partial_columns). Each bit is 1 if the column is stored in partial format in the after-image, and 0 if the column is stored in full format in the after-image, is NULL in the after-image, or is not included in the after-image.

Note that partial_columns appears *between* before-image and after-image. The reason is that (as noted at the end of the previous section) this information is needed by both the before-image decoder and the after-image decoder, whereas in the encoder it is only known to the after-image encoder. Therefore, we organize the encoder and decoder as follows:
- The before-image encoder writes only the before-image.
- The after-image encoder writes the shared-image and the after-image.
- The before-image decoder reads both the before-image and the shared-image, but leaves the read position at the beginning of the shared-image.
- The after-image decoder reads both the shared-image and the after-image.

F3. The format for JSON columns in full format is (before and after this worklog):

  +--------+-------+
  | length | value |
  +--------+-------+

The *new* format for *partial* JSON columns in the after-image is:

  +--------+--------+--------+     +--------+
  | length | diff_1 | diff_2 | ... | diff_N |
  +--------+--------+--------+     +--------+

Each diff_i represents a single JSON diff. It has the following format:

  +-----------+-------------+------+ +-------------+------+
  | operation | path_length | path | ( | data_length | data | )?
  +-----------+-------------+------+ +-------------+------+

The fields are:
- operation: a single byte containing the JSON diff operation. The possible values are defined by enum_json_diff_operation: REPLACE=0, INSERT=1, REMOVE=2.
- path_length: an unsigned integer in net_field_length() format.
- path: a string of path_length bytes containing the JSON path of the update.
- data_length: an unsigned integer in net_field_length() format.
- data: a string of data_length bytes containing the JSON document that will be inserted at the position specified by path.
data_length and data are omitted if and only if operation=REMOVE.
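For illustration, a hypothetical diff (REPLACE, '$.a', <new value>) would be laid out as follows (the data bytes are left abstract, since they use the binary JSON serialization):

  00           operation: REPLACE
  03           path_length: 3 (net_field_length() encodes values up to 250
               in a single byte; larger values use 3, 4, or 9 bytes)
  24 2E 61     path: '$.a'
  ...          data_length and data: the binary JSON serialization of the
               new value (absent entirely for REMOVE)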
==== 5. Capture logic ====

In WL#10570, JSON diffs are generated unconditionally. Since this worklog adds the option that enables/disables JSON diffs, it shall also prevent JSON diffs from being generated when they are not needed. That is, JSON diffs shall only be generated by the optimizer when all of the following hold (according to section 2 above):
- the binary log is open and enabled for the transaction;
- binlog_row_value_options=PARTIAL_JSON (this is not the default);
- the statement is written in row format;
- log_bin_use_v1_row_events=0.

==== 6. mysqlbinlog output ====

mysqlbinlog shall output BINLOG 'base64' statements containing the partial JSON update. (No code change is needed to make this work.)

mysqlbinlog -v shall output readable JSON in the pseudo-SQL:

1. Example:

original document:

  {
    "a": "replace this string value by 7",
    "b": [ 0, "replace this string by bb" ],
    "c": "remove this key-value pair, including the key c",
    "d": [ "remove this string" ],
    "0": "insert the key-value pair e: ee in the top-level object",
    "f": [ "insert ff after this string", "and before this string" ],
    "1": "insert the key-value pair g: gg in the top-level object"
  }

diff vector:

  (REPLACE, '$.a', '7')
  (REPLACE, '$.b[1]', '"bb"')
  (REMOVE, '$.c')
  (REMOVE, '$.d[0]')
  (INSERT, '$.e', '"ee"')
  (INSERT, '$.f[1]', '"ff"')
  (INSERT, '$.g', '"gg"')

output (this may also have been the original SQL that created the diff vector):

  ### UPDATE `test`.`t`
  ### WHERE
  ### @1=1
  ### SET
  ### @2=JSON_INSERT(
  ### JSON_ARRAY_INSERT(
  ### JSON_INSERT(
  ### JSON_REMOVE(
  ### JSON_REPLACE(@2, '$.a', 7,
  ### '$.b[1]', 'bb'),
  ### '$.c',
  ### '$.d[0]'),
  ### '$.e', 'ee'),
  ### '$.f[1]', 'ff'),
  ### '$.g', 'gg')

Notice that:
- REPLACE maps to JSON_REPLACE; REMOVE maps to JSON_REMOVE; INSERT maps to JSON_ARRAY_INSERT if the last component of the path is an array index (i.e. '[number]'), and to JSON_INSERT otherwise.
- Subsequent JSON diffs with different operations are printed like nested function calls (hence the function names appear in reverse order, compared to the order of the diffs).
- Subsequent JSON diffs with the same operation are printed as one function call.
- Multiple JSON functions are separated by a newline, and multiple JSON diffs are separated by a newline. In order to keep a one-diff update on a single line, there is no newline separating the last JSON function from the first JSON diff.

2. Grammar for the part after '### @COL=':

  FUNC -> FUNC_NAME '(' FUNC_OR_COL path VALUE_OPT ')'
  FUNC_NAME -> 'JSON_REPLACE' | 'JSON_REMOVE' | 'JSON_INSERT' | 'JSON_ARRAY_INSERT'
  FUNC_OR_COL -> '@' column | NEWLINE FUNC NEWLINE
  VALUE_OPT -> ( ', ' value )?
  NEWLINE -> ',\n### '

(where VALUE_OPT must be present if and only if FUNC_NAME != 'JSON_REMOVE')
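For example, following these rules, a hypothetical update consisting of the single diff (REPLACE, '$.a', '7') prints the whole function call on one line:

  ### UPDATE `test`.`t`
  ### WHERE
  ### @1=1
  ### SET
  ### @2=JSON_REPLACE(@2, '$.a', 7)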
==== 7. Safety checks ====

If the slave has diverged and has a different JSON document than the master, applying a partial JSON update may either work and produce a new JSON document that is syntactically valid but may not be what the user expects (and may break application logic), or fail, in case any of the paths does not exist on the slave. The following safety checks will be used:

1. The replication decoder will check that the event format is correct. If not, it generates ER_SLAVE_CORRUPT_EVENT.

2. When applying the JSON diff, the optimizer should generate an error if the JSON diff cannot be applied (e.g. if any path does not exist, or is not in correct format for a path, or if any data is not in correct JSON format). In this case, the replication code shall generate an error and roll back.

3. It may be possible to apply a JSON diff to a JSON document that is different from the one on the master. This may work by chance, and there is no guarantee as to what the result is.

---- 7.1. Alternatives that were REJECTED ----

The following alternatives were considered, but rejected:

4. REJECTED: On the master, generate a checksum before the partial update and include it in the binary log event. On the slave, verify the checksum before applying the delta. Notes:
- We could compute the checksum of only the updated parts (and possibly a number of bytes of context before and after each updated part, to avoid degenerate cases where only a few bytes are changed). That would require the engine to fetch the updated parts.
- The checksum needs to be agnostic to the internal, binary format (due to FR4). Therefore it needs to be aware of the nested JSON structure, so it can be a bit complicated to compute.
- If we augment the engine to store the checksum persistently, we could use a checksum over the entire JSON document that allows fast recomputation after partial updates. Checksums with such characteristics are Adler and Fletcher (which have a weaker level of protection than checksums that do not allow fast recomputation, but that may be acceptable for the present use case). This adds complexity to the feature.
- Checksums are probabilistic by nature; they do not provide guarantees. So even if we implement checksums, we need to handle the case where a partial JSON update is applied to the wrong JSON document.

5. REJECTED: On the slave, after applying a delta on a JSON column, verify that the column still contains valid JSON data. (And if this feature is extended to other data types that have any kind of structure, verify that the data is still in correct form, e.g. valid UTF-8 for TEXT columns.) (This was more relevant in an earlier version of this worklog, where we considered using physical, binary diffs. Since we use logical JSON diffs, this is not necessary.)

==== 8. Rolling upgrades/downgrades, and cross-version replication ====

Replication OLD->NEW will always work, because we do not change the existing event format and do not remove the code to read it. Replication NEW->OLD will work whenever binlog_row_value_options='', because then we use the old format.
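For instance, to replicate NEW->OLD, the DBA reverts the NEW master to the old format before connecting the OLD slave (a sketch; existing sessions keep their session-scoped value):

  SET @@global.binlog_row_value_options = '';

New sessions then write full JSON documents, which the OLD slave can apply.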
==== 9. Future extensions ====

The following are possible future extensions (not implemented in this worklog):

1. Partial updates of other big data types, such as BLOB, BINARY, VARBINARY, TEXT, etc. This would mainly be optimizer work; the present worklog does most of the needed replication changes. However, replication may need to add logic to correctly handle lossy slave_type_conversions where the slave uses a shorter column width than the master.

2. Support for JSON diffs corresponding to JSON_ARRAY_INSERT and JSON_ARRAY_APPEND. The code in the current worklog would work for these types if the optimizer generated them, except that the current worklog explicitly forbids them in the deserialization code (so that we don't have untested code). So if the optimizer implements support for these update types, we can just remove the checks. We would also need a new flag in binlog_row_value_options to enable the two new types.

3. Support for enabling partial JSON only for certain tables, using a syntax like binlog_row_value_options='attribute:PARTIAL_JSON'.

4. Reduce memory usage. Currently, disregarding JSON, RBR stores each row in memory two times more than necessary. The partial JSON we introduce here adds to this by writing partial JSON to 2-3 temporary locations. The reason is that our internal interfaces are inconsistent, so for instance JSON must be serialized to a String, whereas RBR events must be stored in a char* buffer before being serialized to an IO_CACHE.
Step 1: Implement new option binlog_row_value_options

This adds the option and all the checks for compatibility with other options. In this patch, the option does not yet have any effect. We also add test cases.

Step 2: Implement logic to generate JSON diffs only when needed

In WL#10570, JSON diffs are generated unconditionally by the optimizer. This patch changes that, so that JSON diffs are only generated when the option is enabled and not inhibited by other options.

Step 3: Implement new event type and use it when the option is enabled

This only creates a new event type code and uses this type code when the option is enabled. It does not change the format of the event payload. It adds a test to verify that the correct type code is generated on both master and slave.

Step 4: Refactor: make max_row_length a private member of Row_data_memory

This function is only used internally in class Row_data_memory, but was defined as a global function in table.cc. We move it to a private member of Row_data_memory. This refactoring is needed to prove that max_row_length is not used elsewhere (so that when we modify it in a later patch, we don't break any protocol). It also improves encapsulation.

Step 5: Refactor: remove unused variables in pack_row

This is a small refactoring/cleanup, just removing a couple of unused variables to make it easier to follow the logic and understand the real changes that come in later patches.

Step 6: Refactor: localize the scope of loop variable in pack_row

A loop variable was declared at the top level of pack_row. We move it so that it is declared only in the loop. This makes it easier to edit the file later.

Step 7: Refactor: move null-bit handling to new utility classes

When reading and writing a row in a row event, the logic for iterating over fields was interleaved with low-level bit operations to maintain a bitmap of null fields. This made the code error-prone and hard to understand and edit. This refactoring encapsulates the bitmap handling in utility classes and simplifies pack_row/unpack_row accordingly.

Step 8: Refactor: add const to integer decoder functions in pack.cc

Functions in mysys/pack.cc that read from a buffer did not declare the buffer as const. This patch makes net_field_length_size take a const parameter and makes the other functions use const internally. Since these functions are part of the ABI, we also have to update include/mysql.h.pp. (We do not const-ify pointers-to-pointers in function declarations, since that breaks compilation in other places that call the functions using non-const arguments.)

Step 9: Refactor: change Json_diff_vector from a type alias to a class

This is needed because we are going to extend Json_diff_vector with more member functions. It also simplifies some forward declarations.

Step 10: Refactor: do not overload global identifier TABLE in rpl_tblmap.h

Class table_mapping in rpl_tblmap.h is used both in mysqlbinlog and in the server. In the server, it maps numbers to TABLE objects; in mysqlbinlog, it maps numbers to Table_map_log_event objects. This was implemented by using the type name TABLE, with a typedef in mysqlbinlog that makes TABLE an alias for Table_map_log_event. This worked by chance, because no header included by mysqlbinlog happened to have a declaration of struct TABLE. However, in this worklog we will need log_event.cc to include rpl_record.h, and rpl_record.h has a forward declaration of struct TABLE. This would clash with a typedef that maps TABLE to Table_map_log_event and generate a compilation error.
Therefore, we change rpl_tblmap.h so that it does not use the identifier TABLE. Instead, it uses a new typedef, Tablemap_table, that maps to TABLE in the server and to Table_map_log_event in mysqlbinlog. While changing this file, we also remove some other oddities:
- The function free_table_map_log_event only called delete on its argument, and was declared in a different header than the one belonging to the source file that defines it. This was pointless; we remove the function and call delete directly instead.
- Remove a comment that looked like someone used the source files instead of writing a mail :-)
- Simplify the class comment.

Step 11: Refactor: print more to the debug trace

Improve indentation and debug trace printouts.

Step 12: Refactor: remove unused variable Rows_log_event::m_master_reclength

There was a member variable Rows_log_event::m_master_reclength that was set to a (strange) value which was never read. We remove it.

Step 13: Refactor: simplify Rows_log_event::read_write_bitmaps_cmp

This member function was implemented only in the base class, but had a switch that made it execute differently depending on the instance's subclass. We change it to a pure virtual function in the base class and implement the differing logic in each subclass.

Step 14: Refactor: simplify pack_row

This moves code so that it appears in a more logical order, and changes some debug printouts. More precisely:
- Start with DBUG_ENTER, then initialize variables that will be used throughout the function, then print to the debug trace, then do the actual encoding.
- Use more descriptive variable names.
- Print more meaningful things to the debug trace (fewer pointers; more table and column names; try to make clearer what the different fields mean).
- Simplify a comment.

Step 15: Implement encoder of new event format

Outline of the pipeline:
1. In binlog.cc:Row_data_memory, the constructor that takes two 'data' pointers (the constructor used for Update_rows_log_event, invoked from binlog.cc:THD::binlog_update_row) takes a new argument: the value of the new server option binlog_row_value_options. Based on this variable, it determines whether JSON diffs may be used, estimates how much memory will be used (using the new function json_diff.cc:Json_diff_vector::binary_length), decides whether full format or partial format will be used, and adjusts the allocated memory accordingly.
2. In binlog.cc:THD::binlog_update_row, pass two new arguments to pack_row:
   - row_image_type, which specifies whether this is a Write/Update/Delete, and whether it is a before-image or after-image;
   - value_options, which contains the value of binlog_row_value_options for update after-images.
3. In rpl_record.cc:pack_row, accept the two new arguments. If this is an update after-image and the bit in value_options is set, determine whether any column will use partial format. If so, write the value_options field, followed by the partial_bits, to the output. Otherwise, just write value_options=0 to the output and skip the partial_bits.
4. From rpl_record.cc:pack_row, invoke the new function rpl_record.cc:pack_field to write a single field. If the column is JSON, this is the after-image of an Update, and the bit in value_options is set, invoke the new function field.cc:Field_json::pack_diff. Otherwise, or if field.cc:Field_json::pack_diff returned NULL, fall back to the usual non-diff writer.
5. In Field_json::pack_diff, determine again whether this field will be smaller in full format or in partial format.
If full format is smaller, just return NULL so that rpl_record.cc:pack_field writes the full format. Otherwise, invoke the new function json_diff.cc:Json_diff_vector::write_binary.
6. In json_diff.cc:Json_diff_vector::write_binary, write the length using 4 bytes, followed by all the diffs, each written using the new function json_diff.cc:Json_diff::write_binary.
7. In json_diff.cc:Json_diff::write_binary, write a single diff to the output.

Step 16: Refactor: simplify unpack_row

Make some small simplifications in rpl_record.cc:unpack_row:
- Order the logic more clearly: first DBUG_ENTER, then initialize some variables used throughout the function, then print to the debug trace, then start the actual decoding.
- Use less pointer arithmetic.
- Improve some comments.
- Remove the meaningless variable 'error'.

Step 17: Implement decoder of new format

The pipeline is now:
1. Add a parameter to log_event.cc:Rows_log_event::unpack_current_row which says whether this is an after-image or not. Set the parameter from all the callers in log_event.cc.
2. Move Rows_log_event::unpack_current_row from log_event.h to log_event.cc and make it pass two new arguments to rpl_record.cc:unpack_row: row_image_type, which indicates whether this is Write/Update/Delete and before-image or after-image, and has_value_options, which is true for Update events when binlog_row_value_options=PARTIAL_JSON.
3. Make rpl_record.cc:unpack_row accept the two new parameters. First, make a few small refactorings in rpl_record.cc:unpack_row:
- Clarify some variable names and improve the comment for the function.
- Remove comments about unpack_row being used by backup, having rli==NULL. This may have been an intention at some point in time, perhaps in 5.1, but probably never was true. And rli is unconditionally dereferenced in the main loop, so it cannot be NULL. Instead, assert that it is not NULL. Also assert that other parameters are not NULL, as well as other preconditions.
- Improve some debug trace printouts.
- Return bool instead of int, since the caller does not need to distinguish more than two different return statuses.
Then implement the new logic:
- When partial format is enabled, read the partial_bits before the after-image (from within the main loop, as well as from the loop that consumes unused fields), and also read the partial_bits after the before-image (after the main loop). For the before-image, leave the read position before the partial_bits. Use the new auxiliary function start_partial_bits_reader to read the value_options and initialize the Bit_reader accordingly, in the two places (after the before-image and before the after-image).
- In order to read the correct number of bits before the after-image, start_partial_bits_reader needs to know the number of JSON columns on the master. This is known from the table_map_log_event, via the table_def class. For convenience (and for reuse in the mysqlbinlog patch), we add a member function rpl_utility.cc:table_def::json_column_count. This function also caches the computed column count, to speed up successive calls (e.g. for many-row updates).
- For the before-image, set the corresponding bit in the table's read_set for any column having a 1 in the partial_bits. This tells the engine to fetch the blob from storage (later, when the engine is invoked). The blob will be needed, since we have to apply the diff on it.
- Call an auxiliary function rpl_record.cc:unpack_field to read each field, and move some special-case handling for blobs into this function too.
4. In rpl_record.cc:unpack_field, call field.cc:Field_json::unpack_field for partial JSON fields.
5. Add the new function field.cc:Field_json::unpack_field, which invokes the new function json_diff.cc:Json_diff_vector::read_binary to read the Json_diff_vector, and the pre-existing (since WL#10570) function apply_json_diffs to apply the diffs. The Json_diff_vector uses a new MEM_ROOT rather than the one of current_thd, because that allows memory to be freed for each value, which saves resources e.g. in case of many-row updates. Before apply_json_diffs can be invoked, we need to call table->mark_column_for_partial_update and table->setup_partial_update, in order to enable the *slave* server to generate JSON diffs in the *slave's* binary log.
6. Add the new function json_diff.cc:Json_diff_vector::read_binary. This function reads the length of the field, then iterates over the diffs, reads each diff in turn, constructs Json_path and Json_wrapper/Json_dom objects, and appends them to the Json_diff_vector. We implement the auxiliary function net_field_length_checked, which reads an integer in packed format (see mysys/pack.cc), checking for out-of-bounds conditions.

Step 18: Implement decoding and pretty-formatting of JSON diffs in mysqlbinlog

mysqlbinlog outputs row events in two forms:
- BINLOG statements that a server can apply. Nothing needs to change to make this work for the new event type.
- "Pseudo-SQL" that humans can read, in case mysqlbinlog is invoked with the -v flag. This is what the present patch implements.

The pipeline in mysqlbinlog is:
1. log_event.cc:Rows_log_event::print_verbose invokes log_event.cc:Rows_log_event::print_verbose_one_row with the new argument row_image_type, which indicates whether this is a Write/Update/Delete and whether it is a before-image or after-image.
2. In log_event.cc:Rows_log_event::print_verbose_one_row, we do two things:
- Refactorings:
  - Use a Bit_reader to read the null bits, instead of using bit arithmetic.
  - Use safer boundary checks. The code has a pointer to the row data and a pointer to the end of the row data. In C/C++, a pointer may point to the next byte after an allocated block of memory, but incrementing it further has an undefined result. After reading the length of a field, the correct way to check that this length is not corrupt is to compare it with the end pointer minus the read-position pointer. (Before, the code added the length to the read position and compared with the end pointer, but the read position plus the length is undefined if it overshoots the end.)
- Implement the feature:
  - Read the value_options, if this is the after-image of a PARTIAL_UPDATE_ROWS_EVENT.
  - If value_options has the PARTIAL_JSON bit set, read the partial_bits.
  - Pass the partialness of the column as a parameter to log_event.cc:log_event_print_value.
3. In log_event.cc:log_event_print_value, accept the new parameter, and in case the value is partial, call the new function log_event.cc:print_json_diff to parse and print the JSON diffs.
4. In the new function log_event.cc:print_json_diff, read, parse, and print all the diffs. The output has the form:

  JSON_<opN>(
  ...
  JSON_<op2>(
  JSON_<op1>(@column,
  path[, value][, path[, value][, ...]]),
  ...
  path[, value][, path[, value][, ...]]),
  path[, value][, path[, value][, ...]])

In this output format, the JSON_<op> functions appear in *reversed* order, whereas all the (path, value) pairs appear in order of appearance. Therefore, we make two passes over the sequence of diffs:
   1. Read just the operations and store them in a vector.
Then print the operations in reverse order.
   2. Read the full diffs and output them in the order of appearance. Values are printed using the new function log_event.cc:json_wrapper_to_string.
5. Add a new function log_event.cc:json_wrapper_to_string to print a Json_wrapper. This ensures that JSON values are printed with the correct type. The JSON_ functions convert SQL types to their JSON equivalents: for instance, JSON_SET('[1, 2]', '$[0]', '[]') will set the 0th element of the JSON array to a string containing an opening and a closing square bracket, and not to an empty JSON array. To account for this, different data types need different quoting, and to insert a JSON object or JSON array we need to cast the string to JSON first.
6. To output JSON values with correct quoting for SQL strings, we use the existing my_b_write_quoted, but change it so that:
- it uses a lookup table (computed only once), for simplicity and performance;
- it prints common escapes such as \n and \\ in a more human-readable way.

Then, we add tests for printing JSON diffs.
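To illustrate the quoting issue in item 5 (hypothetical statements, not from the patch's test suite):

  SELECT JSON_SET('[1, 2]', '$[0]', '[]');
  -- -> ["[]", 2]   (the SQL string becomes a JSON string)
  SELECT JSON_SET('[1, 2]', '$[0]', CAST('[]' AS JSON));
  -- -> [[], 2]     (the cast yields an actual empty array)

Hence, when a printed diff value is a JSON object or array, the pseudo-SQL must wrap the quoted text in CAST(... AS JSON) to preserve the meaning.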