WL#5092: RBR: Options for writing partial or full row images in RBR events
Affects: Server-5.6
—
Status: Complete
Summary
=======
Implement mysqld option so that the user is able to select
whether to log minimal or full images on RBR events depending
on the keys available in the master. When sending minimal rows,
one can save binlog disk space, network resources and mysqld
memory footprint.
BACKGROUND
==========
Before and After Images
-----------------------
In row based replication row events may contain two copies for the
row that they are changing. These are generally known as images. The
first one, called *before image* (BI), contains data that existed on
the row before it was actually changed. The second one, called
*after image* (AI), contain the actual changes. Each, BI and AI,
usage is confined to two different moments in the execution
flow. The BI is used while the slave is searching for the row to be
updated, while AI is used when replaying the changes in the row.
Both have some restrictions:
- BI: needs to hold a set of values that can be used by the
slave to fetch the correct row. In other words, it should
provide a set a values that *uniquely* identify the row to
be changed;
- AI: needs to hold values that are needed to replay all the
changes, that were actually done during the original
execution, in an identical set (meaning, same index
structures, same engine, ...).
Given that BI and AI have different usages, their usefulness can be
mapped into the data modification row events:
- Write_rows_log_event: *requires only AI*.
There is no need for a BI because, we are adding a record,
and not changing an existing one. The current
implementation logs only the AI.
- Delete_rows_log_event: *requires only BI*.
There is no need for an AI because, the row ceases to
exist, as it is removed. However, before removing it, one
needs to find it, thence BI is required.
- Update_rows_log_event: *requires both: AI and BI*.
Both BI and AI, are required. The row needs to be found (BI
comes to play) before being changed (AI comes to play).
Summing up, BI must contain values that uniquely identifies rows,
acting like a primary key equivalent (PKE), while AI must contain
values that make possible changing the row according to the original
execution.
Primary Key Equivalent
----------------------
Tables contain an index structure - mysql calls indexes keys. This
structure holds information on which keys are declared, and these
range from plain keys (K), to unique keys (UK), or even a primary
key (PK).
PK - When it comes to logging row based events, PK plays an
important role, as it covers the BI requirement of uniquely
identifying a given row just by searching using the PK
value. Thence, if replicating the master contents, one should
be fine by just logging the PK column. The slave would then use
this value to search the correct row. Technically, "A PRIMARY
KEY is a unique index where all key columns must be defined as
NOT NULL. If they are not explicitly declared as NOT NULL,
MySQL declares them so implicitly (and silently). A table can
have only one PRIMARY KEY." [1]
UK - Unique keys share the same usefulness of the PK, except that
they need to be declared without nullable parts. From the
manual [1]: "[...] For all engines, a UNIQUE index allows
multiple NULL values for columns that can contain NULL." In
fact, if there is no PK declared in a table and an application
requests one, MySQL "[...] returns the first UNIQUE index that
has no NULL columns as the PRIMARY KEY." [1]
K - Regular keys or nullable UK are of no particular interest when
logging row events. Given that these are stripped from
uniqueness property provided by UK and PK, these cannot be used
to uniquely identify a row.
There can be the case that a table does not declare any index at
all, or all indexes are regular keys. In this case the master must
ensure that the data provided in BI shall be enough to uniquely
identify the row. As such, the alternative is to log the full
row. This ensures that when searching (using the BI) and applying
(using the AI) the next record to be fetched will be uniquely
identified by all the fields in the BI. Should there be
indistinguishable rows, searching and updating either one in any
given order leads to a correct state.
Consequently, a *Primary Key Equivalent* (PKE) is defined as:
1. If a PK exists, the PKE is equal to the PK.
2. Otherwise, if there exists a UK where all columns have the NOT
NULL attribute, then that is the PKE (if there are more than
one such UKs, then one is chosen arbitrarily).
3. Otherwise, the PKE is equal to the set of all columns.
Hereafter, we will be considering to PK to be a subset of PKE and it
shall map into items 1. and 2. of PKE definition. Furthermore, if no
explicit primary key nor UK NOT NULL exists in the table, it is said
that the table has no PK.
PROBLEM STATEMENT
=================
Given the definition of PKE, one can have a smaller set of columns
to be logged (instead of the full row), and still be able to find
(BI) and update a record (AI), while replaying the row event.
This can be useful to reduce bandwidth (less network traffic) and
storage (samller binlogs) usage as well as mysqld memory
footprint. It becomes even more important if tables contain large
blobs that do not need to be logged as part of the BI. Currently, in
MySQL 5.1 GA, PKE is always assumed to be the full row, ie, the
index structure is ignored. Consequently, AI and BI are always
logged with all their columns.
It should be possible for the user to configure this behavior
such that he could request that BI and AI always log full rows,
or a PK when available - for the BI - and changed columns only
- for AI.
SOLUTION
========
MySQL shall provide an option and the different configurations
should be:
- minimal:
Means PKE in the before image and changed columns
in after image
- full:
Means all columns in both before and after image
- noblob:
Works as full but avoids sending blobs when these are not
needed. Blobs are still replicated if:
1. In AI, if they have been changed.
2. In BI, if they are part of PK.
It shall be named:
--binlog-row-image={minimal,noblob,full}
DEFAULT VALUE SHALL BE: 'FULL'.
NOTE
----
As a side remark, there was another option considered:
- reversible
means PKE and changed columns in before and after image
Nevertheless, it was decided that this will not be implemented
for now, as there is no practical use for it (think Undo
operations). So, for not risking to get the user confused, and
because the 'noblob' option supersedes this one, we will put
this on hold.
Decision was made over Skype meeting:
- attendees: Alfranio, Mats and Luís
- date: 09/09/09 (11:00 am CEST)
REQUIREMENTS
============
For the requirements statement, we assume that there is a mysqld
instance that originally executes some statement and logs it as row
events. Lets call this instance M (master). We also assume that
there is a second instance that replays the events. Lets call this
instance S (slave). S, may not necessarily be a slave, eg, if one is
replaying some binlog file in an independent instance or even on a
new master, etc...
We also assume that there is a table on M Tm and a table on S
Ts. Events are logged for changes in Tm and replayed on Ts. Unless
stated otherwise, Tm and Ts are assumed to have same number of
columns.
Logging
-------
R1. Logging behavior must be done according to the one depicted in
the following tables:
- INSERTS
|-----------+--------------+-----------------------------------|
| OPTION | BEFORE IMAGE | AFTER IMAGE |
|-----------+--------------+-----------------------------------|
| 'minimal' | --- | All columns where a value |
| | | was specified. Autoinc columns |
| | | are also set if not specified. |
|-----------+--------------+-----------------------------------|
| 'noblob' | --- | All columns where a value |
| | | was specified, and all |
| | | non-blob columns. Autoinc columns |
| | | are also set if not specified. |
|-----------+--------------+-----------------------------------|
| 'full | --- | All columns |
|-----------+--------------+-----------------------------------|
- UPDATES
|-----------+----------------------------+---------------------------|
| OPTION | BEFORE IMAGE | AFTER IMAGE |
|-----------+----------------------------+---------------------------|
| 'minimal' | PKE | All columns where a value |
| | | was specified |
|-----------+----------------------------+---------------------------|
| 'noblob' | PKE + all non-blob columns | All columns where a value |
| | | was specified, and all |
| | | non-blob columns |
|-----------+----------------------------+---------------------------|
| 'full | All columns | All columns |
|-----------+----------------------------+---------------------------|
- DELETES
|-----------+----------------------------+-------------|
| OPTION | BEFORE IMAGE | AFTER IMAGE |
|-----------+----------------------------+-------------|
| 'minimal' | PKE | --- |
|-----------+----------------------------+-------------|
| 'noblob' | PKE + all non-blob columns | --- |
| | | |
| | | |
|-----------+----------------------------+-------------|
| 'full' | All columns | --- |
|-----------+----------------------------+-------------|
Index structures
----------------
- Searching for row:
R1. If S and M differ in Ts and Tm index structure and there is no
usable field in the BI for Ts' indexes, then S shall fall back
to table scan when searching for the record.
In this case, row matching will be done in a best-effort
approach, the first row fetched from the SE that matches the
existing data in the BI will be selected.
R2. Let Tm be defined with a set of columns that do not exist in
Ts - Cm. The remaining columns - Cc - exist on both Tm and
Ts. If M is logging in minimal mode and Tm's PK is declared over
Cm, then S will:
- not search for the row and abort, if and only if, Tm's PK is
declared *entirely* on Cm.
- use available data on BI, regarding Cc, to conduct either
index search/scan or table scan, depending on the data
available for Cc in BI, and Ts' indexes.
R3. For all binlog-row-image modes and when Ts has more columns than
Tm, then S will always succeed when finding the row, regardless
of any index structure differences between Tm and Ts.
R4. If Ts index structure differs from Tm, then it shall use its own
indexes while searching the row, provided that there are values
for the index fields in the BI.
- Applying the row:
R5. A row event should always be applied successfully if Tm and Ts
share the same index structure.
R6. If Tm and Ts differ in index structures and Ts declares
incompatible/stricter PK with Tm index, then replication
may stop - eg, PK on slave does not allow duplicate entries
for a given field that hadn't such restriction on
master. This is the current MySQL 5.1 behavior.
R7. If S is applying the events in IDEMPOTENT mode, then it is
subject to the same restrictions as of MySQL 5.1.
BLOBS
-----
- Related to BI logging:
R1. If Tm declares a blob column (c1) and c1 is part of the PKE,
then c1 shall always be replicated in the BI regardless of M's
binlog-row-image mode.
R2. If Tm declares a blob column (c1), which is not part of PKE, it
shall not be logged when M is using minimal or noblob
binlog-row-image modes.
- Related to AI logging:
R3. If Tm declares a blob column (c1) and c1 is updated/written then
it shall be logged in the AI regardless of M's binlog-row-image
mode.
R4. If Tm declares a blob column (c1) and this column is not updated
nor written, then c1 shall not be logged as part of AI when M is
using noblob or minimal binlog-row-image modes.
DEFAULT VALUES
--------------
- After Image
R1. Columns missing a value in the AI are given default values
according to the table definition from the database in
which the row event is applied. This means that:
1. Extra fields on slave get default values, as stated in
the table definition:
MASTER> CREATE TABLE t1 (a int, b int);
SLAVE> CREATE TABLE t1 (a int, b int, c int DEFAULT 100);
MASTER> INSERT INTO t1 VALUES (1,1);
sync_slave_with_master
SLAVE> SELECT * FROM t1;
a b c
1 1 100
2. Fields missing a value in the AI (for instance when
using 'MINIMAL' option) get the default value as stated
in the table definition:
MASTER> CREATE TABLE t1 (a int DEFAULT 100, b int);
SLAVE> CREATE TABLE t1 (a int DEFAULT 900, b int);
MASTER> INSERT INTO t1(b) VALUES (1);
MASTER> SELECT * FROM t1
a b
100 1
sync_slave_with_master
SLAVE> SELECT * FROM t1;
a b
900 1
R2. UPDATEs are not affected because they unpack changes (AI)
on top of a row fetched from the storage engine (SE).
This means:
1. DEFAULT values must have been handled at INSERT time
for the given row;
2. Fields in the row fetched from the SE, that have no
correspondent value in the AI, will keep their previous
value.
LIST OF BUGS/WLs
================
REFERENCES
----------
- Implementation and tests:
WL#5092 : RBR: Options for writing partial or full row images ...
BUG#14068: RBR: Write only primary key instead of entire before image
WL#5096 : Write test cases for WL#5092
- Backports:
BUG#33055: Replication fails for UPDATE when using falcon Storage engine
- Regressions found and fixed:
BUG#47200: RBR: Absence of PK on slave leads to slave stop.
BUG#47303: RBR: Replicating from master with PK into slave with KEY fails.
BUG#49100: RBR: Unexpected behavior when AI contains no usable data ...
BUG#53643: assert in Field_new_decimal::store_value on slave server
BUG#53889: slaves stops with 1032; handler error HA_ERR_KEY_NOT_FOUND
BUG#46554: RBR: pending rows event can be flushed too soon sometimes.
NOTES
-----
N1. WL#5092/BUG#14068 builds on BUG#33055 work, meaning that
patch for BUG#33055 had to be backported.
N2. The following regressions were found during WL#5092 and it's
likely that they have been introduced by BUG#33055, either
directly or indirectly, (they are now fixed as part of WL#5092
or in WL#5092 tree):
- BUG#47200
- BUG#47303
- BUG#49100
- BUG#46554
N3. WL#5092 introduced the following regressions (found
during QA analysis and are now fixed):
- BUG#53643
- BUG#53889
N4. The following RBR bugs were found while working on WL#5092
and are already fixed in 5.1+:
- BUG#53893
- BUG#47312
REFERENCES
==========
[1] http://dev.mysql.com/doc/refman/5.1/en/create-table.html
[2] BUG#33055
[3] BUG#14068
[4] BUG#47200
[5] BUG#47303
[6] WL#3281
PATCHES
=======
The full patch history is in mysql-5.1-rpl-wl5092 and
mysql-next-mr-wl5092 trees.
PB2
===
http://pb2.norway.sun.com/web.py?template=show_pushes&branch=mysql-next-mr-wl5092
REFACTORINGS
============
Rows_log_event::find_row method must be refactored so that it takes
into account several indexes and checks BI before actually using them.
In what follows the search_key_for_bi function tries to find a useful
index on the slave table. The order of the index searches PK, UK and
K. Search considers the BI available columns, the flags specifying
which kind of index, whether the index is active or not and if it is
defined in slave's table extra columns or not.
The proposed algorithm is the following:
def Rows_log_event::find_row:
IF invalid_bi (table, &m_cols) THEN
GOTO err;
ENDIF
LET key= CALL search_key_for_bi(table, m_cols, PK_FLAG);
IF key == NULL THEN
GOTO index_scan;
ELSE
LET error= CALL rnd_pos_by_record(...)
RETURN error
ENDIF
index_scan:
LET key= CALL search_key_for_bi(table, m_cols,
PK_FLAG | UNIQUE_KEY_FLAG | MULTIPLE_KEY_FLAG);
IF key == NULL THEN
GOTO table_scan;
ELSE
CALL index_read_map(...)
IF is_UK(key) OR is_PK(key) THEN
GOTO ok;
ELSE
WHILE (CALL record_compare(...))
LET error= CALL index_next(...)
IF error THEN
GOTO err;
ENDIF
ENDWHILE
ENDIF
GOTO ok;
ENDIF
table_scan:
CALL rnd_next(...)
WHILE error != eof AND (CALL record_compare(...))
LET error= CALL rnd_next(...)
ENDWHILE
ASSERT (error == eof || error == 0)
GOTO err;
ok:
RETURN 0;
err:
RETURN error;
ENDMETHOD
Copyright (c) 2000, 2025, Oracle Corporation and/or its affiliates. All rights reserved.