WL#10956: Binary log storage access API

Affects: Server-8.0   —   Status: Complete   —   Priority: Low

EXECUTIVE SUMMARY
=================
This worklog implements a storage layer abstraction interface, making
it possible to decouple capturing changes from persisting
them. It is therefore also a stepping stone towards an
encrypted persistence layer, which would effectively provide encryption
at rest for the binary log.

DETAILS
=======
Currently, IO_CACHE is used everywhere in the code to access binlog/relaylog
files. That code assumes binlog/relaylog files are uncompressed and
unencrypted, which makes it difficult to add encryption or compression
features to the binlog/relaylog.

This worklog encapsulates binlog/relaylog storage in a separate layer. A
group of interfaces is defined for accessing binlog/relaylog files, and the
interfaces can have different implementations. IO_CACHE is encapsulated and
provided as one implementation of the binlog storage access API, used for
plain binlog/relaylog files. In the future, an encrypting storage layer
implementation can be provided.

DEV STORIES
===========
- Developers want to define a common API for writing binlog/relay log
  events to somewhere other than IO_CACHE. Binlog events shall be able to
  write themselves into any object which supports the common API.
  * It will make it easy to implement new access features (e.g. encryption).
  * New implementations will not compromise existing implementations.
- Developers want to define a common API for reading binlog/relay log
  events from somewhere other than IO_CACHE. Binlog events shall be readable
  from any object which supports the common API.
  * It will make it easy to implement new access features (e.g. encryption).
  * New implementations will not compromise existing implementations.

- Developers want to define a layered/chained/pipeline model for
  binlog/relay log files.
  * It is easy to add a feature to, or remove a feature from, the pipeline.

Functional Requirements
=======================
NONE

Non-Functional Requirements
===========================
NF1. It shall define interfaces for reading and writing binlog files.
     The code just needs to call the interfaces for reading and writing binlog
     events without knowing any detail of their implementation.
NF2. It shall have an implementation for accessing events in binlog files.
NF3. It shall have an implementation for accessing events in binlog caches.
NF4. Adding a new implementation of the interfaces (e.g. encryption) shall not
     change the existing interfaces. It should be enough just to implement the
     existing interfaces exactly.
NF5. The patch shall not impact performance.

UPGRADE/DOWNGRADE
=================
No impact.

SECURITY
========
No impact.

OBSERVABILITY
=============
No impact.

PROTOCOLS
=========
No impact on protocols.
No impact on the on-the-wire format of the binlog.

DEPLOYMENT
==========
No new persistent or temporary files after this worklog.
No files removed.

Interface Specification
=======================
No interface change for users.

High-Level Specification
========================
* Binlog accesses can be summarized as
- Serialize binlog events to a storage.
  The storage could be a binlog cache, a binlog file or a buffer.
  This is done by the Log_event::write[_xxx]() functions.
- Deserialize a binlog event from a storage, which
  could be a file or a memory buffer.
- Read byte events from a binlog file. This is the pattern in binlog_sender.
- Control the binlog process: prepare, flush, sync and rotate related things.


This design abstracts all storages, such as files, the binlog cache and even
memory buffers, as streams. A few access interfaces are defined for the
streams. Any stream that follows the interfaces can be accessed as one of the
storages mentioned above. In the future we can have more streams, and events
can easily be written into or read from them.

Most of the time, reading from and writing into binary logs are separate tasks.
Code that executes a transaction only writes. Code that reads events never
writes. So this worklog separates reading and writing into input streams and
output streams.

With the stream concept in place, binlog accesses can be summarized as
- Serialize binlog events to an output stream.
- Read a byte event from an input stream.
  The byte event will be stored into a memory buffer (uchar *).
- Deserialize binlog events from a memory buffer.
  We cannot deserialize an event directly from an input stream, because
  the deserialization of every event is based on a memory buffer.

* Stream Design
- Basic Stream for Serialization and Deserialization
  Serialization and deserialization just need pretty simple interfaces. They are
  - Basic_ostream, which only supports write()
  - Basic_istream, which only supports read()

- Stream Chain
  Serialization and deserialization could write to or read from a
  stream chain. Stream chains are not part of this design, but the design does
  support them. For example:
  Serialize -> Buffer Output Stream -> File Output Stream.
  Read byte event <- Buffer Input Stream <- File Input Stream
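
  The output chain in the example above can be sketched as follows. This is
  only an illustration under stated assumptions, not the worklog's code:
  Memory_ostream and Buffered_ostream are hypothetical names, my_off_t is a
  stand-in typedef, and Basic_ostream is reduced to its single write() method.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

using my_off_t = unsigned long long;  // stand-in for the server typedef

// Reduced form of the interface: false on success, true on error.
class Basic_ostream {
 public:
  virtual bool write(const unsigned char *buffer, my_off_t length) = 0;
  virtual ~Basic_ostream() = default;
};

// Hypothetical sink standing in for a file output stream.
class Memory_ostream : public Basic_ostream {
 public:
  bool write(const unsigned char *buffer, my_off_t length) override {
    m_data.append(reinterpret_cast<const char *>(buffer),
                  static_cast<std::size_t>(length));
    ++m_writes;
    return false;
  }
  std::string m_data;
  int m_writes = 0;
};

// Hypothetical buffering stage: collects small writes and forwards them to
// the next stream in the chain once `capacity` bytes are pending.
class Buffered_ostream : public Basic_ostream {
 public:
  Buffered_ostream(Basic_ostream *next, std::size_t capacity)
      : m_next(next), m_capacity(capacity) {
    assert(next != nullptr);
  }

  bool write(const unsigned char *buffer, my_off_t length) override {
    m_pending.append(reinterpret_cast<const char *>(buffer),
                     static_cast<std::size_t>(length));
    return m_pending.size() >= m_capacity ? flush() : false;
  }

  // Push everything that is pending to the next stream.
  bool flush() {
    if (m_pending.empty()) return false;
    const bool error = m_next->write(
        reinterpret_cast<const unsigned char *>(m_pending.data()),
        m_pending.size());
    if (!error) m_pending.clear();
    return error;
  }

 private:
  Basic_ostream *m_next;
  std::size_t m_capacity;
  std::string m_pending;
};

// Serialize -> Buffered_ostream -> Memory_ostream
inline bool chain_demo(Memory_ostream *sink) {
  Buffered_ostream buffered(sink, 8);
  const unsigned char bytes[] = {'a', 'b', 'c'};
  if (buffered.write(bytes, 3)) return true;  // stays in the buffer
  if (buffered.write(bytes, 3)) return true;  // still below capacity
  return buffered.flush();                    // one downstream write
}
```

  The writer only sees a Basic_ostream, so a stage can be added to or removed
  from the chain without touching the serialization code.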

- Binlog File Streams
  Since there are many binlog file accesses, this design provides a few binlog
  file streams to hide file access details from binlog code.
  - Binlog_ifile for reading a binlog file
  - Binlog_ofile for controlling and writing a binlog file

  Implementation details are hidden from binlog code. When binlog code needs to
  write, it just calls Binlog_ofile::write. When binlog code needs to
  flush/truncate/open/close, it just calls the functions of Binlog_ofile.

  Internally, Binlog_ofile could have an ostream chain like:
  Buffer Output Stream -> Encryption Output Stream -> File Output Stream
  Binlog code doesn't need to know the details.

  Binlog_ifile is similar.

Output Streams
==============
* Class Basic_ostream
  It is the pure abstract class which declares the write interface for writing
  bytes into a place.

  /**
    Write some data into the output stream.

    @param[in] buffer  the data to be written into the stream
    @param[in] length  the length of the data

    @return returns false if succeeds, otherwise returns true
   */
  - virtual bool write(const unsigned char *buffer, my_off_t length) = 0;

  Binary log events have the following functions for writing themselves to an
  IO_CACHE.
  - Log_event::write_header(IO_CACHE* file, ...)
  - Log_event::write_footer(IO_CACHE* file, ...)
  - Xxx_log_event::write_body(IO_CACHE* file, ...)
  - Xxx_log_event::write(IO_CACHE* file, ...)

  In this design, the IO_CACHE argument is replaced by Basic_ostream. The
  functions become:
  - Log_event::write_header(Basic_ostream* ostream, ...)
  - Log_event::write_footer(Basic_ostream* ostream, ...)
  - Xxx_log_event::write_body(Basic_ostream* ostream, ...)
  - Xxx_log_event::write(Basic_ostream* ostream, ...)

  With this design, events can easily be written to different places.
  An event can be written into a binlog file, a binlog cache or any
  other class which derives from Basic_ostream.
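
  A minimal sketch of this idea, assuming a much simplified event: Toy_event,
  Vector_ostream and Counting_ostream are hypothetical names, and the one-byte
  length header is not the real Log_event layout. The point is only that the
  write functions depend on Basic_ostream and not on the destination.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

using my_off_t = unsigned long long;  // stand-in for the server typedef

class Basic_ostream {
 public:
  // Returns false on success, true on error.
  virtual bool write(const unsigned char *buffer, my_off_t length) = 0;
  virtual ~Basic_ostream() = default;
};

// Hypothetical destination 1: keeps every byte in memory.
class Vector_ostream : public Basic_ostream {
 public:
  bool write(const unsigned char *buffer, my_off_t length) override {
    m_bytes.insert(m_bytes.end(), buffer, buffer + length);
    return false;
  }
  std::vector<unsigned char> m_bytes;
};

// Hypothetical destination 2: only counts the bytes it is given.
class Counting_ostream : public Basic_ostream {
 public:
  bool write(const unsigned char *, my_off_t length) override {
    m_total += length;
    return false;
  }
  my_off_t m_total = 0;
};

// Toy event: a one-byte length header followed by the payload.
class Toy_event {
 public:
  explicit Toy_event(std::string payload) : m_payload(std::move(payload)) {
    assert(m_payload.size() <= 255);  // must fit the one-byte header
  }

  bool write_header(Basic_ostream *ostream) const {
    const unsigned char length = static_cast<unsigned char>(m_payload.size());
    return ostream->write(&length, 1);
  }

  bool write_body(Basic_ostream *ostream) const {
    return ostream->write(
        reinterpret_cast<const unsigned char *>(m_payload.data()),
        m_payload.size());
  }

  // Stops at the first error, following the "true means error" convention.
  bool write(Basic_ostream *ostream) const {
    return write_header(ostream) || write_body(ostream);
  }

 private:
  std::string m_payload;
};
```

  The same Toy_event::write() call works unchanged for both destinations.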

  This worklog includes four different basic output streams:
  - class StringBuffer_ostream
  - class Transaction_message
  - class Binlog_ofile
  - class Binlog_cache

* Class StringBuffer_ostream
  A wrapper of StringBuffer which implements a simple output stream for writing
  data into stack memory. If the data is too big for the stack buffer, enough
  heap memory is allocated for it.

  It is used to simplify group replication code. Group replication used an
  IO_CACHE to serialize an event into a memory buffer. With this
  class, events can be serialized into a memory buffer directly.
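
  The stack-first, heap-on-overflow behaviour can be sketched as below. This
  does not use the server's StringBuffer. Small_buffer_ostream and its members
  are hypothetical names chosen for the illustration, and my_off_t is a
  stand-in typedef.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

using my_off_t = unsigned long long;  // stand-in for the server typedef

class Basic_ostream {
 public:
  // Returns false on success, true on error.
  virtual bool write(const unsigned char *buffer, my_off_t length) = 0;
  virtual ~Basic_ostream() = default;
};

// Hypothetical stream: data lives in an inline array (on the stack when the
// object itself is on the stack) until it no longer fits, then on the heap.
template <std::size_t INLINE_SIZE>
class Small_buffer_ostream : public Basic_ostream {
 public:
  bool write(const unsigned char *buffer, my_off_t length) override {
    assert(buffer != nullptr || length == 0);
    if (length == 0) return false;
    if (!m_on_heap && m_length + length <= INLINE_SIZE) {
      std::memcpy(m_inline + m_length, buffer, length);
    } else {
      if (!m_on_heap) {
        // First overflow: move what was written so far to the heap.
        m_heap.assign(m_inline, m_inline + m_length);
        m_on_heap = true;
      }
      m_heap.insert(m_heap.end(), buffer, buffer + length);
    }
    m_length += length;
    return false;
  }

  const unsigned char *data() const {
    return m_on_heap ? m_heap.data() : m_inline;
  }
  std::size_t length() const { return m_length; }
  bool on_heap() const { return m_on_heap; }

 private:
  unsigned char m_inline[INLINE_SIZE];
  std::vector<unsigned char> m_heap;
  std::size_t m_length = 0;
  bool m_on_heap = false;
};
```

  Small events never touch the heap, and larger ones still succeed.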

* Class Transaction_message
  It is a class in group replication which serializes binary log events into
  a message. This worklog makes it derive from Basic_ostream, so that
  binary log events can be serialized into its buffer directly.

* Class Binlog_ofile
  It defines a logical binlog file which wraps and hides the real storage
  layer operations. It provides the operations for controlling binlog files,
  like open, close, write, flush etc. When opening a Binlog_ofile, it
  initializes the real storage. When writing an event to a Binlog_ofile, it
  writes the event into the real storage. MYSQL_BIN_LOG code operates on a
  plain binlog. It doesn't need to know or care about the details of the low
  level storage operations (e.g. whether it is encrypted or not).

  It derives from Basic_ostream, so events can be written into it directly.

  /**
     Open the binlog file. It opens the file output stream.

     @param[in] log_file_key  The PSI_file_key for this stream
     @param[in] binlog_name  The file to be opened
     @param[in] flags  The flags used by IO_CACHE.
     @return returns false if succeeds, otherwise true is returned.
  */
  bool open(PSI_file_key log_file_key, const char* binlog_name, myf flags);

  /**
    Close the binlog file and the file output stream.
   */
  void close();

  /**
    The binlog has a concept of position, which is used in many places. The
    position is the logical offset in the binlog, not the physical position at
    which the data is stored in an output stream. So this class maintains the
    binlog position.
   */
  - my_off_t m_position

  - bool write(const unsigned char *buffer, my_off_t length)
    Same as Basic_ostream::write; it overrides Basic_ostream::write.

  /**
    Updates some data in the binlog file.

    @param[in] buffer  the data to be written into the binlog file.
    @param[in] length  the length of the data
    @param[in] offset  the start position where the data will be updated

    @return returns false if succeeds, otherwise returns true
   */
  - bool update(const unsigned char *buffer, my_off_t length, my_off_t offset)

  /**
    Truncates the binlog file, dropping the data after the given offset.

    @param[in] offset  the offset to which the file is truncated
    @return returns false if succeeds, otherwise returns true
   */
  - bool truncate(my_off_t offset)

  /**
    Flush buffered data into the file system.

    @return returns false if succeeds, otherwise returns true
   */
  - bool flush()

  /**
    Flush the data from the file system to disk.

    @return returns false if succeeds, otherwise returns true
   */
  - bool sync()

  /**
    Returns the current position at which it is writing.
   */
  - my_off_t position()

  /**
    Returns true if the binlog file is empty.
   */
  - bool is_empty()

  /**
    Returns true if the real storage is opened.
   */
  - bool is_open()
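
  The difference between the logical position and the physical offset can be
  sketched as below. The fixed-size storage header is purely an assumption made
  for this illustration (for example, a header a future encrypting layer might
  prepend). Logical_ofile is a hypothetical name and a std::vector stands in
  for the real storage.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using my_off_t = unsigned long long;  // stand-in for the server typedef

class Basic_ostream {
 public:
  // Returns false on success, true on error.
  virtual bool write(const unsigned char *buffer, my_off_t length) = 0;
  virtual ~Basic_ostream() = default;
};

// Hypothetical logical file: callers see offsets that start at 0 even though
// the storage begins with `header_size` bytes they know nothing about.
class Logical_ofile : public Basic_ostream {
 public:
  Logical_ofile(std::vector<unsigned char> *storage, my_off_t header_size)
      : m_storage(storage), m_header_size(header_size) {
    assert(storage != nullptr);
    // Placeholder header bytes written by the storage layer.
    m_storage->assign(static_cast<std::size_t>(header_size),
                      static_cast<unsigned char>(0));
  }

  bool write(const unsigned char *buffer, my_off_t length) override {
    m_storage->insert(m_storage->end(), buffer, buffer + length);
    m_position += length;
    return false;
  }

  // `offset` is a logical binlog offset, not an offset into the storage.
  bool truncate(my_off_t offset) {
    if (offset > m_position) return true;
    m_storage->resize(static_cast<std::size_t>(m_header_size + offset));
    m_position = offset;
    return false;
  }

  my_off_t position() const { return m_position; }
  bool is_empty() const { return m_position == 0; }

 private:
  std::vector<unsigned char> *m_storage;
  my_off_t m_header_size;
  my_off_t m_position = 0;
};
```

  Binlog code only ever uses position() and logical offsets, so the storage
  layout can change without affecting it.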

* Class Binlog_cache
  It defines a binlog cache container for storing binlog events.
  It provides a few simple interfaces for writing binlog events into, or
  reading them from, the container. It hides the low level storage details
  which binlog code doesn't need to know.

  It is derived from Basic_ostream, so we can pass it as an argument to
  Xxx_log_event::write() to write events into the binlog cache.

  Binlog_cache has both read and write interfaces.

  ----------------
  Write Interfaces
  ----------------
  - bool write(const unsigned char *buffer, my_off_t length)
    Same as Basic_ostream::write; it overrides Basic_ostream::write.

  - virtual bool truncate(my_off_t offset) = 0;
    Same as Binlog_ofile::truncate.

  /**
     Drop the data in the cache.

    @return returns false if succeeds, otherwise returns true
   */
  - virtual bool reset() = 0;

  /**
    Returns the number of disk writes in the transaction.
   */
  - virtual size_t disk_writes() = 0;

  /**
    Returns the temporary file name if it has one.
   */
  - virtual const char* tmp_file_name() = 0;

  ---------------
  Read Interfaces
  ---------------
  /**
    Copy the entire data to somewhere.

    @param[in] ostream  the stream into which the data will be written
  */
  bool copy_to(Basic_ostream *ostream);

  /**
    Returns the length of the data in the cache.
   */
  - virtual my_off_t length() = 0;

  /**
    Returns if the cache is empty.
   */
  - bool is_empty() { return length() == 0; }

  It is used in the binlog_cache_data class to replace IO_CACHE:
  Binlog_cache *m_cache;
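
  How the write and read interfaces fit together can be sketched with an
  in-memory cache. Memory_binlog_cache, Vector_ostream and commit_demo are
  hypothetical names for this illustration. The real cache is backed by
  IO_CACHE and may spill to a temporary file, which is not modelled here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using my_off_t = unsigned long long;  // stand-in for the server typedef

class Basic_ostream {
 public:
  // Returns false on success, true on error.
  virtual bool write(const unsigned char *buffer, my_off_t length) = 0;
  virtual ~Basic_ostream() = default;
};

// Hypothetical destination standing in for the binlog file.
class Vector_ostream : public Basic_ostream {
 public:
  bool write(const unsigned char *buffer, my_off_t length) override {
    m_bytes.insert(m_bytes.end(), buffer, buffer + length);
    return false;
  }
  std::vector<unsigned char> m_bytes;
};

// Hypothetical in-memory cache with the write and read interfaces above.
class Memory_binlog_cache : public Basic_ostream {
 public:
  bool write(const unsigned char *buffer, my_off_t length) override {
    m_data.insert(m_data.end(), buffer, buffer + length);
    return false;
  }
  bool truncate(my_off_t offset) {
    if (offset > m_data.size()) return true;
    m_data.resize(static_cast<std::size_t>(offset));
    return false;
  }
  bool reset() {
    m_data.clear();
    return false;
  }
  my_off_t length() const { return m_data.size(); }
  bool is_empty() const { return length() == 0; }
  bool copy_to(Basic_ostream *ostream) const {
    assert(ostream != nullptr);
    return is_empty() ? false : ostream->write(m_data.data(), m_data.size());
  }

 private:
  std::vector<unsigned char> m_data;
};

// Cache two writes, undo the second one, then copy the rest to the binlog.
inline bool commit_demo(Vector_ostream *binlog) {
  Memory_binlog_cache cache;
  const unsigned char first[] = {'A', 'B'};
  const unsigned char second[] = {'C', 'D', 'E'};
  if (cache.write(first, 2)) return true;
  const my_off_t savepoint = cache.length();
  if (cache.write(second, 3)) return true;
  if (cache.truncate(savepoint)) return true;  // drop the second write
  if (cache.copy_to(binlog)) return true;      // flush the cache content
  return cache.reset();
}
```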

* IO_CACHE_ostream
  It is a wrapper of IO_CACHE to implement necessary file operations.
  It is a low level output stream used in Binlog_ofile.

* IO_CACHE_binlog_cache
  It is a wrapper of IO_CACHE to implement the required features for a
  binlog cache. It is a low level stream used in Binlog_cache.

Input Streams
=============
* Class Basic_istream
  It is the pure abstract class which declares the basic interface for reading
  some data from a place.

  /**
    Read some data from the stream.

    @param[out] buffer  where the data will be stored
    @param[in,out] length  On input, the size of the buffer, i.e. the number
                   of bytes wanted. On output, the number of bytes actually
                   read.
    @return returns false if succeeds, otherwise returns true
   */
  - virtual bool read(unsigned char *buffer, my_off_t *length) = 0;

  There are two classes derived from it:
  - Basic_seekable_istream
    It is introduced below.
  - Stdin_istream
    It encapsulates stdin behind the Basic_istream interface.
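
  The in/out use of 'length' can be sketched with a memory-backed stream.
  Memory_istream is a hypothetical name and my_off_t is a stand-in typedef.
  Whether the real implementations report a short read as success is not
  stated in this worklog, so treating it as success is an assumption here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <utility>

using my_off_t = unsigned long long;  // stand-in for the server typedef

class Basic_istream {
 public:
  // Returns false on success, true on error. *length is the number of bytes
  // wanted on input and the number of bytes actually read on output.
  virtual bool read(unsigned char *buffer, my_off_t *length) = 0;
  virtual ~Basic_istream() = default;
};

// Hypothetical stream that serves bytes from a string.
class Memory_istream : public Basic_istream {
 public:
  explicit Memory_istream(std::string data) : m_data(std::move(data)) {}

  bool read(unsigned char *buffer, my_off_t *length) override {
    assert(buffer != nullptr && length != nullptr);
    const my_off_t available = m_data.size() - m_offset;
    const my_off_t count = std::min(*length, available);
    if (count > 0) std::memcpy(buffer, m_data.data() + m_offset, count);
    m_offset += count;
    *length = count;  // may be smaller than requested near the end
    return false;
  }

 private:
  std::string m_data;
  std::size_t m_offset = 0;
};
```

  A caller that needs an exact number of bytes has to compare the returned
  'length' with the value it asked for.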

* Class Basic_seekable_istream
  It is the pure abstract class which declares the basic seek interface for
  seeking the read position in a stream.
  - virtual bool seek(my_off_t offset) = 0;

  There are three classes derived from it:
  - class IO_CACHE_istream
    IO_CACHE_istream is a file input stream based on IO_CACHE.
  - class Stdin_seekable_istream
    A wrapper of Stdin_istream which derives from Basic_seekable_istream and
    implements its seek() interface. In fact, it only supports seeking
    forward. Seeking backward is invalid.
    It is used by mysqlbinlog.
  - class Basic_binlog_ifile
    It is introduced below.
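
  Forward-only seeking over a non-seekable source can be sketched as reading
  and discarding bytes. Forward_seek_istream and Memory_istream are
  hypothetical names, and the code does not read from the real stdin.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <utility>

using my_off_t = unsigned long long;  // stand-in for the server typedef

class Basic_istream {
 public:
  virtual bool read(unsigned char *buffer, my_off_t *length) = 0;
  virtual ~Basic_istream() = default;
};

class Basic_seekable_istream : public Basic_istream {
 public:
  virtual bool seek(my_off_t offset) = 0;
};

// Hypothetical non-seekable source (stands in for stdin).
class Memory_istream : public Basic_istream {
 public:
  explicit Memory_istream(std::string data) : m_data(std::move(data)) {}
  bool read(unsigned char *buffer, my_off_t *length) override {
    const my_off_t count =
        std::min<my_off_t>(*length, m_data.size() - m_offset);
    if (count > 0) std::memcpy(buffer, m_data.data() + m_offset, count);
    m_offset += count;
    *length = count;
    return false;
  }

 private:
  std::string m_data;
  std::size_t m_offset = 0;
};

// Adds seek() on top of a source that can only be read sequentially.
class Forward_seek_istream : public Basic_seekable_istream {
 public:
  explicit Forward_seek_istream(Basic_istream *source) : m_source(source) {
    assert(source != nullptr);
  }

  bool read(unsigned char *buffer, my_off_t *length) override {
    const bool error = m_source->read(buffer, length);
    if (!error) m_position += *length;
    return error;
  }

  bool seek(my_off_t offset) override {
    if (offset < m_position) return true;  // cannot rewind the source
    unsigned char discard[256];
    while (m_position < offset) {
      my_off_t chunk =
          std::min<my_off_t>(sizeof(discard), offset - m_position);
      // Fail on a read error or if the source ends before `offset`.
      if (read(discard, &chunk) || chunk == 0) return true;
    }
    return false;
  }

 private:
  Basic_istream *m_source;
  my_off_t m_position = 0;
};
```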

* Class Basic_binlog_ifile
  It defines a logical binlog file which wraps and hides the real storage
  layer operations. It provides the operations for controlling binlog files,
  like open, close, read, seek etc. When reading an event from a
  Binlog_ifile, it reads the event from the real storage. Binlog access
  code operates on a plain binlog. It doesn't need to know or care about the
  details of the low level storage operations (e.g. whether it is encrypted
  or not).

  It derives from Basic_seekable_istream, so it can be used by readers as
  a byte stream.

  /**
    It points to the underlying input stream created by the reader. When
    read() or seek() is called, it calls m_istream to operate on the
    underlying storage.
   */
  - Basic_seekable_istream *m_istream

  - my_off_t m_position
    Same as in Binlog_ofile.

  - bool read(unsigned char* buffer, my_off_t *length)
    Same as Basic_istream::read; it overrides Basic_istream::read.

  /**
    Sets the start position for the next read().

    @param[in] position  where it should seek to.
   */
  - bool seek(my_off_t position)
    Same as Basic_seekable_istream::seek.

  - my_off_t position()
    Same as in Binlog_ofile.

  /**
    Returns true if the underlying input stream is opened.
   */
  - bool opened()

  /**
     Open the system layer file. It is the entry of the stream pipeline.
     The implementation is delegated to subclasses, which open system layer
     files in different ways.

     @param[in] file_name  name of the binlog file which will be opened.
  */
  virtual Basic_seekable_istream *open_file(const char *file_name) = 0;

  /**
     Close the system layer file.
  */
  virtual void close_file() = 0;

  Three classes derive from it.
  - Binlog_ifile
    It is for the binlog files generated on the master side.
  - Relaylog_ifile
    It is for the relay log files generated on the slave side.
  - Mysqlbinlog_ifile
    It is for mysqlbinlog, which can also read files from stdin.

Binary Event Readers
====================
* Class Binlog_read_error
  It defines the error types which could happen when reading binlog files or
  deserializing binlog events. A string error message is also defined for each
  error type. It has a member variable to store an error type and provides a
  few functions to check the error type stored in that member variable.

* Class Binlog_event_data_istream
  Event_data is a serialized event object. It is a chunk of data in a buffer.
  Binlog_event_data_istream fetches byte data from a Basic_istream and
  divides it into event_data chunks according to the format.

  /**
     The stream from which events are read
   */
  - Basic_istream *m_istream

  /**
     Stores the error encountered when reading the header or body. It is a
     pointer. It must be initialized in the constructor by the caller.
  */
  - Binlog_read_error *m_error

  /**
     Read an event data from the stream and verify its checksum if
     verify_checksum is true.

     @param[out] data The pointer of the event data
     @param[out] length The length of the event data
     @param[in] allocator It is used to allocate memory for the event data.
     @param[in] verify_checksum Verify the event data's checksum if it is true.
     @param[in] checksum_alg Checksum algorithm for verifying the event data. It
                             is used only when verify_checksum is true.

     @retval false Succeed
     @retval true Error
  */
  template <class ALLOCATOR>
  bool read_event_data(unsigned char **data, unsigned int *length,
                       ALLOCATOR *allocator, bool verify_checksum,
                       enum_binlog_checksum_alg checksum_alg);
  /**
     Read the event header from the Basic_istream.

     @retval false Succeed
     @retval true Error
  */
  virtual bool read_header();

  It is virtual so that a subclass can override it. There is one class derived
  from it.
  - Mysqlbinlog_event_data_istream
    It overrides read_header() to skip multiple binlog magic headers in the
    case that mysqlbinlog reads binlog files through stdin.
    It also reimplements read_event_data() to rewrite database names in
    the event data.
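
  Dividing a byte stream into event_data chunks can be sketched as: read the
  fixed-size common header, take the total event length from it, then read the
  rest. The sketch relies on the v4 binlog event layout (19-byte common header
  with a 4-byte little-endian event length at offset 9). The free function,
  Memory_istream and make_event are hypothetical, a std::vector replaces the
  allocator, and checksum verification and error classification are left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>

using my_off_t = unsigned long long;  // stand-in for the server typedef

constexpr std::size_t HEADER_LEN = 19;       // common event header size
constexpr std::size_t EVENT_LEN_OFFSET = 9;  // 4-byte little-endian length

class Basic_istream {
 public:
  virtual bool read(unsigned char *buffer, my_off_t *length) = 0;
  virtual ~Basic_istream() = default;
};

// Hypothetical byte source.
class Memory_istream : public Basic_istream {
 public:
  explicit Memory_istream(std::vector<unsigned char> bytes)
      : m_bytes(std::move(bytes)) {}
  bool read(unsigned char *buffer, my_off_t *length) override {
    const my_off_t count =
        std::min<my_off_t>(*length, m_bytes.size() - m_offset);
    if (count > 0) std::memcpy(buffer, m_bytes.data() + m_offset, count);
    m_offset += count;
    *length = count;
    return false;
  }

 private:
  std::vector<unsigned char> m_bytes;
  std::size_t m_offset = 0;
};

// Test helper: a zeroed header with only the length field filled in.
inline std::vector<unsigned char> make_event(
    const std::vector<unsigned char> &body) {
  std::vector<unsigned char> event(HEADER_LEN, 0);
  const std::uint32_t total =
      static_cast<std::uint32_t>(HEADER_LEN + body.size());
  for (int i = 0; i < 4; ++i)
    event[EVENT_LEN_OFFSET + i] = static_cast<unsigned char>(total >> (8 * i));
  event.insert(event.end(), body.begin(), body.end());
  return event;
}

// Reads one event_data chunk into *event. Returns true on error or when the
// stream does not hold a complete event.
inline bool read_event_data(Basic_istream *istream,
                            std::vector<unsigned char> *event) {
  assert(istream != nullptr && event != nullptr);
  event->assign(HEADER_LEN, 0);
  my_off_t length = HEADER_LEN;
  if (istream->read(event->data(), &length) || length != HEADER_LEN)
    return true;

  const unsigned char *p = event->data() + EVENT_LEN_OFFSET;
  const std::uint32_t event_len =
      static_cast<std::uint32_t>(p[0]) |
      (static_cast<std::uint32_t>(p[1]) << 8) |
      (static_cast<std::uint32_t>(p[2]) << 16) |
      (static_cast<std::uint32_t>(p[3]) << 24);
  if (event_len < HEADER_LEN) return true;  // bogus length field

  event->resize(event_len);
  const my_off_t body_len = event_len - HEADER_LEN;
  length = body_len;
  if (body_len > 0 &&
      (istream->read(event->data() + HEADER_LEN, &length) ||
       length != body_len))
    return true;
  return false;
}
```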

* Class Binlog_event_object_istream
  It reads event_data chunks from an event_data stream and deserializes them
  into event objects.

  /**
     Stores the error encountered when reading the header or body. It is a
     pointer. It must be initialized in the constructor by the caller.
  */
  - Binlog_read_error *m_error

  /**
     Read an event object from the stream

     @param[in] fde The Format_description_event for deserialization.
     @param[in] verify_checksum Verify the checksum of the event_data before
                                deserializing it.
     @param[in] allocator It is used to allocate memory for the event data.
     @return A valid event object if it succeeds.
     @retval nullptr Error
  */
  */
  template <class ALLOCATOR>
  Log_event *read_event_object(const Format_description_event &fde,
                               bool verify_checksum, ALLOCATOR *allocator);

* Class Basic_binlog_file_reader
  It owns a byte stream, an event_data stream and an event object stream.
  The stream pipeline is set up in the constructor. All the objects required
  for reading a binlog file are initialized in the reader class. It also
  includes a few convenience functions to encapsulate access to BINLOG_IFILE,
  BINLOG_EVENT_DATA_ISTREAM and BINLOG_EVENT_OBJECT_ISTREAM. It makes the code
  for reading a binlog file simpler.

  /**
     Open a binlog file and set the read position to offset. If fde is nullptr
     and offset is bigger than the current position, it reads and stores the
     Format_description_event from the file automatically. If fde is not
     nullptr, fde is used instead of looking for one in the file.

     @param[in] file_name  name of the binlog file which will be opened.
     @param[in] offset  The position where it starts to read.
     @param[in] fde  The format_description_event for reading events.
  */
  bool open(const char *file_name, my_off_t offset = 0,
            const Format_description_event *fde = nullptr);

  /**
    Close the binlog file.
   */
  - void close()

  /**
     Wrapper of BINLOG_EVENT_DATA_ISTREAM::read_event_data.
  */
  bool read_event_data(unsigned char **data, unsigned int *length);

  /**
     Wrapper of BINLOG_EVENT_OBJECT_ISTREAM::read_event_object.
  */
  Log_event *read_event_object();
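
  How a reader template can wire the three layers together in its constructor
  is sketched below with toy types. Everything prefixed with Toy_ is
  hypothetical, and so are the one-byte length framing and the in-memory
  "file". Basic_file_reader is a reduced stand-in for the class described
  here, and the allocator parameter is omitted for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>

// Toy byte stream (stands in for a binlog ifile).
struct Toy_ifile {
  bool open(const std::string &bytes) {
    m_bytes = bytes;
    m_offset = 0;
    return false;
  }
  // Reads exactly `count` bytes. Returns true if fewer are left.
  bool read(std::string *out, std::size_t count) {
    if (m_bytes.size() - m_offset < count) return true;
    out->assign(m_bytes, m_offset, count);
    m_offset += count;
    return false;
  }
  std::string m_bytes;
  std::size_t m_offset = 0;
};

// Toy event_data stream: each chunk is a length byte followed by the body.
struct Toy_event_data_istream {
  explicit Toy_event_data_istream(Toy_ifile *ifile) : m_ifile(ifile) {
    assert(ifile != nullptr);
  }
  bool read_event_data(std::string *data) {
    std::string header;
    if (m_ifile->read(&header, 1)) return true;
    return m_ifile->read(data, static_cast<unsigned char>(header[0]));
  }
  Toy_ifile *m_ifile;
};

struct Toy_event {
  std::string body;
};

// Toy event object stream: turns each chunk into an event object.
struct Toy_event_object_istream {
  explicit Toy_event_object_istream(Toy_event_data_istream *in) : m_in(in) {}
  std::unique_ptr<Toy_event> read_event_object() {
    std::string data;
    if (m_in->read_event_data(&data)) return nullptr;
    return std::unique_ptr<Toy_event>(new Toy_event{data});
  }
  Toy_event_data_istream *m_in;
};

// The reader owns all three layers and chains them in its constructor.
template <class IFILE, class DATA_ISTREAM, class OBJECT_ISTREAM>
class Basic_file_reader {
 public:
  Basic_file_reader()
      : m_data_istream(&m_ifile), m_object_istream(&m_data_istream) {}
  // The layers point at each other, so copying is disabled.
  Basic_file_reader(const Basic_file_reader &) = delete;
  Basic_file_reader &operator=(const Basic_file_reader &) = delete;

  bool open(const std::string &bytes) { return m_ifile.open(bytes); }
  bool read_event_data(std::string *data) {
    return m_data_istream.read_event_data(data);
  }
  std::unique_ptr<Toy_event> read_event_object() {
    return m_object_istream.read_event_object();
  }

 private:
  IFILE m_ifile;  // declared first so it is constructed first
  DATA_ISTREAM m_data_istream;
  OBJECT_ISTREAM m_object_istream;
};

using Toy_file_reader = Basic_file_reader<Toy_ifile, Toy_event_data_istream,
                                          Toy_event_object_istream>;
```

  Swapping one template argument changes one layer of the pipeline while the
  calling code keeps using the same reader interface.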

* Allocators
  There are two allocator classes in this worklog.
  - Class Default_binlog_event_allocator
    It is an allocator which uses my_malloc to allocate memory.
  - Class Binlog_sender::Event_allocator
    It uses a String as shared memory for reading event data.
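
  The difference between the two allocation strategies can be sketched as
  below. The allocate()/deallocate() member names and both class names are
  assumptions made for this illustration. std::malloc stands in for my_malloc
  and a std::vector stands in for the shared String.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <vector>

// Hypothetical allocator that returns a fresh heap block for every event.
class Malloc_allocator {
 public:
  unsigned char *allocate(std::size_t size) {
    assert(size > 0);
    return static_cast<unsigned char *>(std::malloc(size));
  }
  void deallocate(unsigned char *ptr) { std::free(ptr); }
};

// Hypothetical allocator that hands out one shared, growing buffer. Each
// event overwrites the previous one, which suits a sender that ships an
// event before reading the next.
class Shared_buffer_allocator {
 public:
  unsigned char *allocate(std::size_t size) {
    assert(size > 0);
    if (m_buffer.size() < size) m_buffer.resize(size);
    return m_buffer.data();
  }
  void deallocate(unsigned char *) {}  // the buffer is kept for reuse

 private:
  std::vector<unsigned char> m_buffer;
};

// Reader-side code is written once against the ALLOCATOR template parameter.
template <class ALLOCATOR>
unsigned char *copy_event(const unsigned char *source, std::size_t length,
                          ALLOCATOR *allocator) {
  unsigned char *data = allocator->allocate(length);
  if (data != nullptr) std::memcpy(data, source, length);
  return data;
}
```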

* Class Binlog_file_reader
  The class for reading binlog files generated on the master side.

  typedef Basic_binlog_file_reader<Binlog_ifile, Binlog_event_data_istream,
                                   Binlog_event_object_istream,
                                   Default_binlog_event_allocator>
  Binlog_file_reader;

* Class Relaylog_file_reader
  The class for reading relay log files generated on the slave side.

  typedef Basic_binlog_file_reader<Relaylog_ifile, Binlog_event_data_istream,
                                   Binlog_event_object_istream,
                                   Default_binlog_event_allocator>
  Relaylog_file_reader;

* Class Mysqlbinlog_file_reader

  typedef Basic_binlog_file_reader<Mysqlbinlog_ifile,
                                   Mysqlbinlog_event_data_istream,
                                   Binlog_event_object_istream,
                                   Default_binlog_event_allocator>
  Mysqlbinlog_file_reader;

* Class Binlog_sender::File_reader
  typedef Basic_binlog_file_reader<Binlog_ifile, Binlog_event_data_istream,
                                   Binlog_event_object_istream, Event_allocator>
  File_reader;