WL#7696: InnoDB: Transparent page compression

Affects: Server-Prototype Only   —   Status: Complete   —   Priority: Medium

Allow transparent page-level compression in the IO layer. The existing InnoDB
row-level compression (a.k.a. InnoDB compressed tables) is not fast enough, and
its implementation is more complicated than it should be. Transparent page-level
compression complements the old scheme; it does not aim to replace it.

This new scheme relies on sparse file and "hole punching" support. That support
has two parts: OS kernel support and file system support.

On Linux, kernel support has been available since 2.6.38. XFS has supported it
since 2.6.38, EXT4 support was added in 3.0 (and backported to RHEL 6's
2.6.32-based kernel in RHEL 6.4), and Btrfs has supported it since Linux 3.7.
{Add other FS info here}.

Solaris 10 and 11 kernels both have sparse file and "hole punching"
support with ZFS. FreeBSD 9+ also seems to have support for it with ZFS (?). 

OSX (HFS+) seems to lack support for sparse files generally.

On UNIX/POSIX-based systems that define it, fpathconf(_PC_MIN_HOLE_SIZE) or
pathconf(_PC_MIN_HOLE_SIZE) reports whether a filesystem supports SEEK_HOLE,
which seems to be a good proxy for "hole punching" support specifically.
Since "hole punching" support *also* requires sparse file support, this is the
best single call for checking that we have everything needed to implement
this scheme.
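
The probe described above could be sketched as follows. This is only an
illustration, using Python's os.pathconf wrapper as a stand-in for the C call;
the helper name min_hole_size is hypothetical. Not every libc exposes
_PC_MIN_HOLE_SIZE, so the sketch treats a missing name as "unknown" rather than
"unsupported".

```python
import os


def min_hole_size(path):
    """Return the minimum hole size in bytes for `path`, or None if unknown.

    A positive value suggests the filesystem supports SEEK_HOLE (and hence
    sparse files). None means this platform or filesystem did not answer,
    which is not proof that holes are unsupported.
    """
    name = "PC_MIN_HOLE_SIZE"
    # pathconf_names lists the names this Python build knows about; it is
    # absent entirely on platforms without pathconf().
    if name not in getattr(os, "pathconf_names", {}):
        return None
    try:
        value = os.pathconf(path, name)
    except OSError:
        # The name is known to libc but not supported for this path.
        return None
    return value if value > 0 else None
```

A caller would typically run this once against the data directory at startup
and fall back to a direct "try to punch a hole" test when the result is None.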

Windows (all currently supported versions) supports sparse files and "hole
punching" with NTFS. We would only need to check that NTFS is being used, but
support can also be checked using the FILE_SUPPORTS_SPARSE_FILES bit flag in the
lpFileSystemFlags parameter returned from the GetVolumeInformation function.
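
A rough sketch of the flag check, assuming the Win32 call is reached through
Python's ctypes. The helper name volume_supports_sparse_files is hypothetical,
and the function simply returns None on non-Windows platforms.

```python
import ctypes
import sys

# Bit in lpFileSystemFlags, as defined in the Windows SDK headers.
FILE_SUPPORTS_SPARSE_FILES = 0x00000040


def volume_supports_sparse_files(root_path):
    """Return True/False for a volume root such as "C:\\", or None off Windows."""
    if sys.platform != "win32":
        return None
    flags = ctypes.c_uint32(0)  # receives lpFileSystemFlags (a DWORD)
    ok = ctypes.windll.kernel32.GetVolumeInformationW(
        ctypes.c_wchar_p(root_path),  # lpRootPathName
        None, 0,                      # volume name buffer not needed
        None,                         # lpVolumeSerialNumber
        None,                         # lpMaximumComponentLength
        ctypes.byref(flags),          # lpFileSystemFlags
        None, 0,                      # file system name buffer not needed
    )
    if not ok:
        raise ctypes.WinError()
    return bool(flags.value & FILE_SUPPORTS_SPARSE_FILES)
```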


The scheme works like this:

 Write page -> compress -> write compressed data to disk -> release empty
   block(s) from end of the page

If the compression fails (for whatever reason) then write the page out as is.
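
The "release empty block(s)" step could look roughly like this on Linux. It is
a sketch under several assumptions: fallocate(2) is called through ctypes,
off_t is 64 bits wide, and the file system block size is 4096 bytes. The helper
names punch_hole and release_page_tail are hypothetical.

```python
import ctypes

# Mode flags from <linux/falloc.h>.
FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02


def punch_hole(fd, offset, length):
    """Try to deallocate [offset, offset + length) of fd; True on success."""
    try:
        libc = ctypes.CDLL(None, use_errno=True)
        fallocate = libc.fallocate
    except (OSError, AttributeError, TypeError):
        return False  # no usable fallocate() on this platform
    # Assumption: 64-bit off_t, as on 64-bit Linux.
    fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                          ctypes.c_int64, ctypes.c_int64]
    fallocate.restype = ctypes.c_int
    mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE
    return fallocate(fd, mode, offset, length) == 0


def release_page_tail(fd, page_offset, used, page_size=16384, block_size=4096):
    """Free the whole blocks of a page that lie beyond its first `used` bytes."""
    keep = -(-used // block_size) * block_size  # round up to a block boundary
    if keep >= page_size:
        return False  # the data fills the page; nothing to release
    return punch_hole(fd, page_offset + keep, page_size - keep)
```

KEEP_SIZE leaves the apparent file size unchanged, so page offsets stay valid
and the released range simply reads back as zeros.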

 Read page -> decompress


If the page read in is not a compressed page then return page to the upper layer
as is.
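
The write and read paths above can be sketched as a pair of functions. This is
an illustration only: zlib stands in for whichever algorithm is chosen, the
16384-byte page and 4096-byte block sizes are assumptions, and the 8-byte
header with its MAGIC marker is hypothetical (a real page format would carry
this in the page header).

```python
import struct
import zlib

PAGE_SIZE = 16384   # assumed page size
BLOCK_SIZE = 4096   # assumed file system block size
MAGIC = b"TPC1"     # hypothetical "this page is compressed" marker
HEADER = struct.Struct("<4sI")  # marker, compressed payload length


def encode_page(page):
    """Return the bytes to write for one page: compressed, or the page as is."""
    assert len(page) == PAGE_SIZE
    try:
        body = zlib.compress(page)
    except zlib.error:
        return page  # compression failed: write the page out as is
    used = -(-(HEADER.size + len(body)) // BLOCK_SIZE) * BLOCK_SIZE
    if used >= PAGE_SIZE:
        return page  # no whole block would be released: write as is
    # Pad to a block boundary; the remaining blocks of the page become a hole.
    return (HEADER.pack(MAGIC, len(body)) + body).ljust(used, b"\0")


def decode_page(buf):
    """Return the original page from a PAGE_SIZE buffer read from disk."""
    magic, length = HEADER.unpack_from(buf)
    if magic != MAGIC:
        return bytes(buf)  # not a compressed page: hand it up as is
    return zlib.decompress(buf[HEADER.size:HEADER.size + length])
```

Because the marker travels with each page, the reader never needs to know
which pages were compressed, and a page that does not compress well costs
nothing beyond the failed attempt.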

The advantage of this scheme is that it is very simple and elegant, and one can
use the best compression algorithm available for the data set at hand. It can
coexist in the same server with the current row compression of InnoDB, as is.
It can be used with any table type, and with UNDO and the system tablespace if
required. The compression of a table can be changed on the fly, with no rebuild
required: old pages can be read in using the old compression format (or none)
and written out using the new algorithm. In theory, a table's pages can
therefore be compressed with an arbitrary mix of algorithms; in other words,
algorithm selection can be controlled at the page level.

Observability
=============
Use Information Schema to display the apparent total file size and the actual
allocated size. Compression is currently per tablespace, so the information is
displayed in the INNODB_SYS_TABLESPACES view.

mysql> select * from information_schema.INNODB_SYS_TABLESPACES;
+-------+----------------------------+------+-------------+----------------------+-----------+---------------+------------+----------------+
| SPACE | NAME                       | FLAG | FILE_FORMAT | ROW_FORMAT           | PAGE_SIZE | ZIP_PAGE_SIZE | FILE_SIZE  | ALLOCATED_SIZE |
+-------+----------------------------+------+-------------+----------------------+-----------+---------------+------------+----------------+
|   128 | mysql/innodb_table_stats   |    0 | Antelope    | Compact or Redundant |     16384 |             0 |      98304 |          16384 |
|   129 | mysql/innodb_index_stats   |    0 | Antelope    | Compact or Redundant |     16384 |             0 |      98304 |          16384 |
|   130 | mysql/slave_relay_log_info |    0 | Antelope    | Compact or Redundant |     16384 |             0 |      98304 |          16384 |
|   131 | mysql/slave_master_info    |    0 | Antelope    | Compact or Redundant |     16384 |             0 |      98304 |          16384 |
|   132 | mysql/slave_worker_info    |    0 | Antelope    | Compact or Redundant |     16384 |             0 |      98304 |          16384 |
|   135 | sbtest/sbtest1             |    0 | Antelope    | Compact or Redundant |     16384 |             0 | 2734686208 |     1423495168 |
+-------+----------------------------+------+-------------+----------------------+-----------+---------------+------------+----------------+
6 rows in set (0.01 sec)
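
For reference, the two sizes can be read from the file system roughly as
follows. This is a sketch, not necessarily how the server fills the view: it
assumes a POSIX stat() that reports st_blocks in 512-byte units (true on Linux,
not guaranteed everywhere), and the helper name file_and_allocated_size is
hypothetical.

```python
import os


def file_and_allocated_size(path):
    """Return (apparent size, allocated size) of a file, both in bytes."""
    st = os.stat(path)
    # st_size counts holes as if they were data; st_blocks counts only the
    # blocks actually backed by storage (assumed to be 512-byte units).
    return st.st_size, st.st_blocks * 512
```

In the sample output above, sbtest/sbtest1 has an ALLOCATED_SIZE of
1423495168 bytes against a FILE_SIZE of 2734686208 bytes, i.e. roughly 52% of
the apparent size is actually allocated on disk.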