WL#5092: RBR: Options for writing partial or full row images in RBR events
Affects: Server-5.6
—
Status: Complete
Summary ======= Implement mysqld option so that the user is able to select whether to log minimal or full images on RBR events depending on the keys available in the master. When sending minimal rows, one can save binlog disk space, network resources and mysqld memory footprint. BACKGROUND ========== Before and After Images ----------------------- In row based replication row events may contain two copies for the row that they are changing. These are generally known as images. The first one, called *before image* (BI), contains data that existed on the row before it was actually changed. The second one, called *after image* (AI), contain the actual changes. Each, BI and AI, usage is confined to two different moments in the execution flow. The BI is used while the slave is searching for the row to be updated, while AI is used when replaying the changes in the row. Both have some restrictions: - BI: needs to hold a set of values that can be used by the slave to fetch the correct row. In other words, it should provide a set a values that *uniquely* identify the row to be changed; - AI: needs to hold values that are needed to replay all the changes, that were actually done during the original execution, in an identical set (meaning, same index structures, same engine, ...). Given that BI and AI have different usages, their usefulness can be mapped into the data modification row events: - Write_rows_log_event: *requires only AI*. There is no need for a BI because, we are adding a record, and not changing an existing one. The current implementation logs only the AI. - Delete_rows_log_event: *requires only BI*. There is no need for an AI because, the row ceases to exist, as it is removed. However, before removing it, one needs to find it, thence BI is required. - Update_rows_log_event: *requires both: AI and BI*. Both BI and AI, are required. The row needs to be found (BI comes to play) before being changed (AI comes to play). Summing up, BI must contain values that uniquely identifies rows, acting like a primary key equivalent (PKE), while AI must contain values that make possible changing the row according to the original execution. Primary Key Equivalent ---------------------- Tables contain an index structure - mysql calls indexes keys. This structure holds information on which keys are declared, and these range from plain keys (K), to unique keys (UK), or even a primary key (PK). PK - When it comes to logging row based events, PK plays an important role, as it covers the BI requirement of uniquely identifying a given row just by searching using the PK value. Thence, if replicating the master contents, one should be fine by just logging the PK column. The slave would then use this value to search the correct row. Technically, "A PRIMARY KEY is a unique index where all key columns must be defined as NOT NULL. If they are not explicitly declared as NOT NULL, MySQL declares them so implicitly (and silently). A table can have only one PRIMARY KEY." [1] UK - Unique keys share the same usefulness of the PK, except that they need to be declared without nullable parts. From the manual [1]: "[...] For all engines, a UNIQUE index allows multiple NULL values for columns that can contain NULL." In fact, if there is no PK declared in a table and an application requests one, MySQL "[...] returns the first UNIQUE index that has no NULL columns as the PRIMARY KEY." [1] K - Regular keys or nullable UK are of no particular interest when logging row events. Given that these are stripped from uniqueness property provided by UK and PK, these cannot be used to uniquely identify a row. There can be the case that a table does not declare any index at all, or all indexes are regular keys. In this case the master must ensure that the data provided in BI shall be enough to uniquely identify the row. As such, the alternative is to log the full row. This ensures that when searching (using the BI) and applying (using the AI) the next record to be fetched will be uniquely identified by all the fields in the BI. Should there be indistinguishable rows, searching and updating either one in any given order leads to a correct state. Consequently, a *Primary Key Equivalent* (PKE) is defined as: 1. If a PK exists, the PKE is equal to the PK. 2. Otherwise, if there exists a UK where all columns have the NOT NULL attribute, then that is the PKE (if there are more than one such UKs, then one is chosen arbitrarily). 3. Otherwise, the PKE is equal to the set of all columns. Hereafter, we will be considering to PK to be a subset of PKE and it shall map into items 1. and 2. of PKE definition. Furthermore, if no explicit primary key nor UK NOT NULL exists in the table, it is said that the table has no PK. PROBLEM STATEMENT ================= Given the definition of PKE, one can have a smaller set of columns to be logged (instead of the full row), and still be able to find (BI) and update a record (AI), while replaying the row event. This can be useful to reduce bandwidth (less network traffic) and storage (samller binlogs) usage as well as mysqld memory footprint. It becomes even more important if tables contain large blobs that do not need to be logged as part of the BI. Currently, in MySQL 5.1 GA, PKE is always assumed to be the full row, ie, the index structure is ignored. Consequently, AI and BI are always logged with all their columns. It should be possible for the user to configure this behavior such that he could request that BI and AI always log full rows, or a PK when available - for the BI - and changed columns only - for AI. SOLUTION ======== MySQL shall provide an option and the different configurations should be: - minimal: Means PKE in the before image and changed columns in after image - full: Means all columns in both before and after image - noblob: Works as full but avoids sending blobs when these are not needed. Blobs are still replicated if: 1. In AI, if they have been changed. 2. In BI, if they are part of PK. It shall be named: --binlog-row-image={minimal,noblob,full} DEFAULT VALUE SHALL BE: 'FULL'. NOTE ---- As a side remark, there was another option considered: - reversible means PKE and changed columns in before and after image Nevertheless, it was decided that this will not be implemented for now, as there is no practical use for it (think Undo operations). So, for not risking to get the user confused, and because the 'noblob' option supersedes this one, we will put this on hold. Decision was made over Skype meeting: - attendees: Alfranio, Mats and Luís - date: 09/09/09 (11:00 am CEST) REQUIREMENTS ============ For the requirements statement, we assume that there is a mysqld instance that originally executes some statement and logs it as row events. Lets call this instance M (master). We also assume that there is a second instance that replays the events. Lets call this instance S (slave). S, may not necessarily be a slave, eg, if one is replaying some binlog file in an independent instance or even on a new master, etc... We also assume that there is a table on M Tm and a table on S Ts. Events are logged for changes in Tm and replayed on Ts. Unless stated otherwise, Tm and Ts are assumed to have same number of columns. Logging ------- R1. Logging behavior must be done according to the one depicted in the following tables: - INSERTS |-----------+--------------+-----------------------------------| | OPTION | BEFORE IMAGE | AFTER IMAGE | |-----------+--------------+-----------------------------------| | 'minimal' | --- | All columns where a value | | | | was specified. Autoinc columns | | | | are also set if not specified. | |-----------+--------------+-----------------------------------| | 'noblob' | --- | All columns where a value | | | | was specified, and all | | | | non-blob columns. Autoinc columns | | | | are also set if not specified. | |-----------+--------------+-----------------------------------| | 'full | --- | All columns | |-----------+--------------+-----------------------------------| - UPDATES |-----------+----------------------------+---------------------------| | OPTION | BEFORE IMAGE | AFTER IMAGE | |-----------+----------------------------+---------------------------| | 'minimal' | PKE | All columns where a value | | | | was specified | |-----------+----------------------------+---------------------------| | 'noblob' | PKE + all non-blob columns | All columns where a value | | | | was specified, and all | | | | non-blob columns | |-----------+----------------------------+---------------------------| | 'full | All columns | All columns | |-----------+----------------------------+---------------------------| - DELETES |-----------+----------------------------+-------------| | OPTION | BEFORE IMAGE | AFTER IMAGE | |-----------+----------------------------+-------------| | 'minimal' | PKE | --- | |-----------+----------------------------+-------------| | 'noblob' | PKE + all non-blob columns | --- | | | | | | | | | |-----------+----------------------------+-------------| | 'full' | All columns | --- | |-----------+----------------------------+-------------| Index structures ---------------- - Searching for row: R1. If S and M differ in Ts and Tm index structure and there is no usable field in the BI for Ts' indexes, then S shall fall back to table scan when searching for the record. In this case, row matching will be done in a best-effort approach, the first row fetched from the SE that matches the existing data in the BI will be selected. R2. Let Tm be defined with a set of columns that do not exist in Ts - Cm. The remaining columns - Cc - exist on both Tm and Ts. If M is logging in minimal mode and Tm's PK is declared over Cm, then S will: - not search for the row and abort, if and only if, Tm's PK is declared *entirely* on Cm. - use available data on BI, regarding Cc, to conduct either index search/scan or table scan, depending on the data available for Cc in BI, and Ts' indexes. R3. For all binlog-row-image modes and when Ts has more columns than Tm, then S will always succeed when finding the row, regardless of any index structure differences between Tm and Ts. R4. If Ts index structure differs from Tm, then it shall use its own indexes while searching the row, provided that there are values for the index fields in the BI. - Applying the row: R5. A row event should always be applied successfully if Tm and Ts share the same index structure. R6. If Tm and Ts differ in index structures and Ts declares incompatible/stricter PK with Tm index, then replication may stop - eg, PK on slave does not allow duplicate entries for a given field that hadn't such restriction on master. This is the current MySQL 5.1 behavior. R7. If S is applying the events in IDEMPOTENT mode, then it is subject to the same restrictions as of MySQL 5.1. BLOBS ----- - Related to BI logging: R1. If Tm declares a blob column (c1) and c1 is part of the PKE, then c1 shall always be replicated in the BI regardless of M's binlog-row-image mode. R2. If Tm declares a blob column (c1), which is not part of PKE, it shall not be logged when M is using minimal or noblob binlog-row-image modes. - Related to AI logging: R3. If Tm declares a blob column (c1) and c1 is updated/written then it shall be logged in the AI regardless of M's binlog-row-image mode. R4. If Tm declares a blob column (c1) and this column is not updated nor written, then c1 shall not be logged as part of AI when M is using noblob or minimal binlog-row-image modes. DEFAULT VALUES -------------- - After Image R1. Columns missing a value in the AI are given default values according to the table definition from the database in which the row event is applied. This means that: 1. Extra fields on slave get default values, as stated in the table definition: MASTER> CREATE TABLE t1 (a int, b int); SLAVE> CREATE TABLE t1 (a int, b int, c int DEFAULT 100); MASTER> INSERT INTO t1 VALUES (1,1); sync_slave_with_master SLAVE> SELECT * FROM t1; a b c 1 1 100 2. Fields missing a value in the AI (for instance when using 'MINIMAL' option) get the default value as stated in the table definition: MASTER> CREATE TABLE t1 (a int DEFAULT 100, b int); SLAVE> CREATE TABLE t1 (a int DEFAULT 900, b int); MASTER> INSERT INTO t1(b) VALUES (1); MASTER> SELECT * FROM t1 a b 100 1 sync_slave_with_master SLAVE> SELECT * FROM t1; a b 900 1 R2. UPDATEs are not affected because they unpack changes (AI) on top of a row fetched from the storage engine (SE). This means: 1. DEFAULT values must have been handled at INSERT time for the given row; 2. Fields in the row fetched from the SE, that have no correspondent value in the AI, will keep their previous value. LIST OF BUGS/WLs ================ REFERENCES ---------- - Implementation and tests: WL#5092 : RBR: Options for writing partial or full row images ... BUG#14068: RBR: Write only primary key instead of entire before image WL#5096 : Write test cases for WL#5092 - Backports: BUG#33055: Replication fails for UPDATE when using falcon Storage engine - Regressions found and fixed: BUG#47200: RBR: Absence of PK on slave leads to slave stop. BUG#47303: RBR: Replicating from master with PK into slave with KEY fails. BUG#49100: RBR: Unexpected behavior when AI contains no usable data ... BUG#53643: assert in Field_new_decimal::store_value on slave server BUG#53889: slaves stops with 1032; handler error HA_ERR_KEY_NOT_FOUND BUG#46554: RBR: pending rows event can be flushed too soon sometimes. NOTES ----- N1. WL#5092/BUG#14068 builds on BUG#33055 work, meaning that patch for BUG#33055 had to be backported. N2. The following regressions were found during WL#5092 and it's likely that they have been introduced by BUG#33055, either directly or indirectly, (they are now fixed as part of WL#5092 or in WL#5092 tree): - BUG#47200 - BUG#47303 - BUG#49100 - BUG#46554 N3. WL#5092 introduced the following regressions (found during QA analysis and are now fixed): - BUG#53643 - BUG#53889 N4. The following RBR bugs were found while working on WL#5092 and are already fixed in 5.1+: - BUG#53893 - BUG#47312 REFERENCES ========== [1] http://dev.mysql.com/doc/refman/5.1/en/create-table.html [2] BUG#33055 [3] BUG#14068 [4] BUG#47200 [5] BUG#47303 [6] WL#3281 PATCHES ======= The full patch history is in mysql-5.1-rpl-wl5092 and mysql-next-mr-wl5092 trees. PB2 === http://pb2.norway.sun.com/web.py?template=show_pushes&branch=mysql-next-mr-wl5092
REFACTORINGS ============ Rows_log_event::find_row method must be refactored so that it takes into account several indexes and checks BI before actually using them. In what follows the search_key_for_bi function tries to find a useful index on the slave table. The order of the index searches PK, UK and K. Search considers the BI available columns, the flags specifying which kind of index, whether the index is active or not and if it is defined in slave's table extra columns or not. The proposed algorithm is the following: def Rows_log_event::find_row: IF invalid_bi (table, &m_cols) THEN GOTO err; ENDIF LET key= CALL search_key_for_bi(table, m_cols, PK_FLAG); IF key == NULL THEN GOTO index_scan; ELSE LET error= CALL rnd_pos_by_record(...) RETURN error ENDIF index_scan: LET key= CALL search_key_for_bi(table, m_cols, PK_FLAG | UNIQUE_KEY_FLAG | MULTIPLE_KEY_FLAG); IF key == NULL THEN GOTO table_scan; ELSE CALL index_read_map(...) IF is_UK(key) OR is_PK(key) THEN GOTO ok; ELSE WHILE (CALL record_compare(...)) LET error= CALL index_next(...) IF error THEN GOTO err; ENDIF ENDWHILE ENDIF GOTO ok; ENDIF table_scan: CALL rnd_next(...) WHILE error != eof AND (CALL record_compare(...)) LET error= CALL rnd_next(...) ENDWHILE ASSERT (error == eof || error == 0) GOTO err; ok: RETURN 0; err: RETURN error; ENDMETHOD
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.