WL#2387: Replication Master Filtering

Affects: WorkLog-3.4 — Status: Un-Assigned

Description
High Level Architecture

SUMMARY
-------
Make the replication filters work on the master instead of
on the slave.  (Currently, data is sent to the slave even if
the filters on the slave then discard it.)


MOTIVATION
----------
Replication uses much less network bandwidth, because tables
and databases that should be filtered away are already
filtered at the master.


REQUIREMENTS
------------
1. Filtering on originating server (or originating cluster if we implement that)
   could also be done on the master.

USER INTERFACE
--------------
The following options take effect on the master instead of
on the slave:
`--replicate-do-db=DB_NAME'
`--replicate-do-table=DB_NAME.TBL_NAME'
`--replicate-ignore-db=DB_NAME'
`--replicate-ignore-table=DB_NAME.TBL_NAME'
`--replicate-wild-do-table=DB_NAME.TBL_NAME'
`--replicate-wild-ignore-table=DB_NAME.TBL_NAME'

The following options still take effect at the slave:
`--replicate-rewrite-db=FROM_NAME->TO_NAME'


OPEN ISSUE
----------
There are two alternatives.  Either the filtering is controlled
by the master (so that slaves only get what the master has
defined), or each slave connects to the master with its own
filter definition.  The latter alternative needs changes to the
way the slave asks the master for the binlog.


OPTIONAL EXTENSION
------------------
All of these options could be added to CHANGE MASTER in the
following way:
CHANGE MASTER 'foo' TO MASTER_HOST=127.0.0.1, REPLICATE-DO-DB='mydb';


IMPLEMENTATION
--------------
All filtering code is refactored into a separate file,
rpl_filter.cc.

Part 1: When the slave registers on the master it forwards 
        information about all filters that should be applied.
        This requires an extension to the function
        slave.cc:register_slave_on_master().

Part 2: The master adds functionality in the dump thread 
        to filter things.  Much of the code in rpl_filter.cc
        can be used for this (functions like slave.cc:db_ok())
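
The two parts above can be illustrated with a minimal sketch.  It
assumes that the slave forwards only its database-level rules and
that the dump thread checks each event's default database against
them.  The types SlaveDbFilter and Event and the function
events_to_send() are hypothetical names used for illustration only;
the check follows the precedence of the existing slave-side options
(do-rules first, ignore-rules only when there are no do-rules).

```cpp
#include <string>
#include <vector>

// Hypothetical container for the database-level rules that a slave
// forwards when it registers on the master (Part 1).
struct SlaveDbFilter {
  std::vector<std::string> do_db;      // from --replicate-do-db
  std::vector<std::string> ignore_db;  // from --replicate-ignore-db
};

// Database-level check in the spirit of db_ok(): if any do-rules
// exist, only listed databases pass; otherwise every database that
// is not explicitly ignored passes.
bool db_ok(const SlaveDbFilter &f, const std::string &db) {
  if (!f.do_db.empty()) {
    for (const std::string &name : f.do_db)
      if (name == db) return true;
    return false;
  }
  for (const std::string &name : f.ignore_db)
    if (name == db) return false;
  return true;
}

// Hypothetical, simplified binlog event: only the default database
// matters for this sketch.
struct Event {
  std::string db;
  std::string payload;
};

// Part 2: the dump thread sends only the events that pass the
// filter registered by the slave.
std::vector<Event> events_to_send(const SlaveDbFilter &f,
                                  const std::vector<Event> &binlog) {
  std::vector<Event> out;
  for (const Event &ev : binlog)
    if (db_ok(f, ev.db)) out.push_back(ev);
  return out;
}
```

Table-level and wildcard rules would need the same treatment; they
are left out here to keep the sketch short.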


BINLOG EXTENSIONS
-----------------
It is possible to divide the filtered binlog into
separate binlogs, i.e. one binlog for one database and another
for another database.  (Brian seems fond of this idea.)

If we choose this path, we need to rename binlog files 
accordingly, for instance like this:
- -bin.index
- -bin.NNNNNN

Note, however, that this is not really needed for filtering
on master.  One could just use one binlog and then apply
the filtering in the dump thread instead.  There are, however,
benefits in dividing it into multiple binlogs: e.g. backups
of different binlogs could be done at different times, and
purging could be done differently on different binlogs.

It is not yet decided if this extension should be implemented.

Lars suggests that the naming of the binlogs is separate from 
the naming of the schemas, i.e. no automatic naming.  When 
you specify that you want this schema in that binlog, you 
can provide the binlog name then.  This removes problems with 
renamed schemas etc.  It also makes it more flexible (e.g.
perhaps we want binlogs based on filters other than schemas).
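
A minimal sketch of such explicit naming follows, assuming a simple
in-memory mapping from schema to a user-provided binlog name.  The
class BinlogRouter, its methods, and the fallback to a default
binlog for unassigned schemas are all assumptions made for
illustration; none of them is part of the existing server code.

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical mapping from schema name to an explicitly chosen
// binlog name.  Nothing is derived from the schema name itself, so
// renaming a schema does not force a binlog rename.
class BinlogRouter {
 public:
  explicit BinlogRouter(std::string default_binlog)
      : default_(std::move(default_binlog)) {}

  // "I want this schema in that binlog": the user supplies the name.
  void assign(const std::string &schema, const std::string &binlog) {
    map_[schema] = binlog;
  }

  // Schemas without an explicit assignment go to the default binlog
  // (an assumption of this sketch).
  const std::string &binlog_for(const std::string &schema) const {
    auto it = map_.find(schema);
    return it == map_.end() ? default_ : it->second;
  }

 private:
  std::string default_;
  std::map<std::string, std::string> map_;
};
```

Because the key is just a string, the same structure could later be
keyed on something other than a schema.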

See also Guilhem's notes in WL#1401.

NOTES
-----
There are corresponding ideas for filtering the query log, 
see WL#3017.

RELATED BUGS/WLs
----------------
  BUG#2917
  BUG#21146
  BUG#41267
  BUG#55733
  WL#1049

Use rpl_filter for the actual logic behind the filtering
mechanisms (master binlog filtering, master replication filtering and
slave replication filtering), but cache the result in a variable on
the table object.  Add "uint32 table->s->flags" and the
following enum in table->s:
  enum enum_flag
  {
    FILTER_BINLOG_SEND_F = (1U << 0),
    FILTER_BINLOG_WRITE_F = (1U << 1),
    FILTER_SLAVE_EXECUTE_F = (1U << 2)
  };
Whenever the table object is created, the corresponding rpl_filter
object should be asked how to set each flag.
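
The caching step can be sketched as follows.  This is only an
illustration under stated assumptions: FilterDecision stands in for
the answers obtained from the rpl_filter object, TableShare stands
in for table->s, and a set bit is assumed to mean "this table is
filtered out by that mechanism" (the text above does not fix the
polarity).  The helper names are hypothetical.

```cpp
#include <cstdint>

enum enum_flag {
  FILTER_BINLOG_SEND_F = (1U << 0),
  FILTER_BINLOG_WRITE_F = (1U << 1),
  FILTER_SLAVE_EXECUTE_F = (1U << 2)
};

// Hypothetical stand-in for what the rpl_filter object answers for
// one table: one yes/no decision per filtering mechanism.
struct FilterDecision {
  bool filtered_on_send;     // master replication filtering
  bool filtered_on_write;    // master binlog filtering
  bool filtered_on_execute;  // slave replication filtering
};

// Hypothetical stand-in for table->s.
struct TableShare {
  std::uint32_t flags = 0;
};

// Called once when the table object is created; afterwards the hot
// paths only test a bit instead of re-evaluating the filter rules.
void cache_filter_flags(TableShare &s, const FilterDecision &d) {
  s.flags = 0;
  if (d.filtered_on_send) s.flags |= FILTER_BINLOG_SEND_F;
  if (d.filtered_on_write) s.flags |= FILTER_BINLOG_WRITE_F;
  if (d.filtered_on_execute) s.flags |= FILTER_SLAVE_EXECUTE_F;
}

inline bool has_flag(const TableShare &s, enum_flag f) {
  return (s.flags & f) != 0;
}
```

With this layout the dump thread would test
has_flag(share, FILTER_BINLOG_SEND_F) per table.  A cached flag has
to be recomputed if the filter rules change while the table object
is alive.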