This chapter describes the layout for the data file of compressed
MyISAM tables.
MyISAM compression is based on Huffman
compression. In his article from 1952 Huffman proved that his
algorithm uses the least possible number of bits to encode a
sequence of messages. The number of bits assigned to each
message depends on its probability to appear in the sequence.
Huffman did not specify exactly, what those "messages" are. One could take all possible values - say of a table column - as "messages". But if there are too many of them, the code tables could become bigger than the uncompressed table. One would need to specify every possible value once and the code tree with its indexes and offsets. Not to forget the effort to step through big binary trees for every value and - on the encoding side - the comparison of each value against the already collected distinct values.
The usual way to define "Huffman messages" is to take the possible 256 values, which a byte can express, as the "messages". That way the code trees are of limited size. On the other hand, the theoretical maximum compression is 1:8 (12.5%) in this case.
myisampack tries both ways to compress the column values. When starting to analyze the existing uncompressed data, it collects distinct column values up to a limit of 8KB. If there are more, it falls back to byte value compression for this column.
This means also that myisampack may use different algorithms for different columns. Besides a couple of other tricks, myisampack determines for every column if distinct column value compression or byte value compression is better. After that it tries to combine byte value compression trees of different columns into one or more code trees. This means that finally we may have less code trees than columns. Therefore the column information in the file header contains the number of the code tree used for each column. Some columns might not need a code tree at all. This happens for columns which have the same value in all records.
Since the compressed data file should be usable for read-only purposes by the MySQL database management system, every record starts on a byte boundary. For easier handling by the system, every record begins with a length information for the compressed record and a length information for the total size of all uncompressed blobs of this record. Both lengths are encoded in 1 to 5 bytes, depending on its value.
A length from 1 to 253 bytes is represented in one byte. A length of 254 to 65535 bytes (64KB-1) is represented by three bytes. The first contains the value 254 and the next two bytes contain the plain length. The low order byte goes first. A length of 65536 to 4294967295 bytes (4GB-1) is represented by five bytes. The first contains the value 255 and the next four bytes contain the plain length. The low order byte goes first.
The encoded compressed record length does not include these length bytes. It tells the number of bytes which follow behind the length bytes for this record.
The code trees are binary trees. Every node has exactly two children. The children can be leaves or branches. A leaf contains one original, uncompressed value. A branch contains a pointer to another node. The Huffman codes represent the navigation through the tree. Every left branch gets a 0 bit, every right branch gets a 1 bit.
The in-memory representation of the trees are two unsigned integers per node. Each describes either a leaf value or an offset (in unsigned integers relative from this node) to another node. To distinguish values from offsets, the 15th bit (decimal value 32768) is set together with offsets. This is safe as the size of the trees is limited by either having a maximum of 256 elements for byte value compression or 4096 elements for distinct column value compression.
The representation of the trees in the compressed data file is almost the same. But instead of writing all bits of the unsigned integers, only as many bits are written as are required to represent the highest value or offset respectively. One more bit per value/offset is written in advance, to distinguish both. The number of bits required per value and per offset is computed in advance and part of the code tree description.
While the header of the compressed data file contains a lot of information, there are still some things which need to be taken from the index file. These are the number of columns of the table and the length of each column. The latter is required for columns with suppressed leading spaces or suppressed trailing spaces or zeros.
As already mentioned, myisampack uses some
tricks to decrease the amount of data to be encoded. These cope
with leading and trailing spaces or zeros or with all blank or
NULL fields.
I do not describe these in detail here. They do not materialize in the compressed data files other than the later mentioned field and pack types. They are however important to know for decoding the records.
Below follows the detailed specification of the encoding:
Datafile fixed header (32 bytes):
4 byte magic number 4 byte total header length (fixed + column info + code trees) 4 byte minimum packed record length 4 byte maximum packed record length 4 byte total number of elements in all code trees 4 byte total number of bytes collected for distinct column values 2 byte number of code trees 1 byte maximum number of bytes required to represent record+blob lengths 1 byte record pointer length, number of bytes for compressed data file length 4 byte zeros
Column Information. For every column in the table:
5 bits field type
FIELD_NORMAL 0
FIELD_SKIP_ENDSPACE 1
FIELD_SKIP_PRESPACE 2
FIELD_SKIP_ZERO 3
FIELD_BLOB 4
FIELD_CONSTANT 5
FIELD_INTERVALL 6
FIELD_ZERO 7
FIELD_VARCHAR 8
FIELD_CHECK 9
6 bits pack type as a set of flags
PACK_TYPE_SELECTED 1
PACK_TYPE_SPACE_FIELDS 2
PACK_TYPE_ZERO_FILL 4
5 bits if pack type contains PACK_TYPE_ZERO_FILL
minimum number of trailing zero bytes in this column
else
number of bits to encode the number of
packed bytes in this column (length_bits)
x bits number of the code tree used to encode this column
x is the minimum number of bits required to represent the highest
tree number.
Alignment:
x bits alignment to the next byte border
Code Trees. For every tree:
1 bit compression type
0 = byte value compression
8 bits minimum byte value coded by this tree
9 bits number of byte values encoded by this tree
5 bits number of bits used to encode the byte values
5 bits number of bits used to encode offsets to next tree elements
1 = distinct column value compression
15 bits number of distinct column values encoded by this tree
16 bits length of the buffer with all column values
5 bits number of bits used to encode the index of the column value
5 bits number of bits used to encode offsets to next tree elements
For each code tree element:
1 bit IS_OFFSET
x bits the announced number of bits for either a value or an offset
x bits alignment to the next byte border
If compression by distinct column values:
The number of 8-bit values that make up the column value buffer
Compressed Records. For every record:
1-5 bytes length of the compressed record in bytes
1. byte 0..253 length
254 length encoded in the next two bytes little endian
255 length encoded in the next x bytes little endian
x = 3 for pack file version 1
x = 4 for pack file version > 1
1-5 bytes total length of all expanded blobs of this record
1. byte 0..253 length
254 length encoded in the next two bytes little endian
255 length encoded in the next x bytes little endian
x = 3 for pack file version 1
x = 4 for pack file version > 1
For every column:
If pack type includes PACK_TYPE_SPACE_FIELDS,
1 bit 1 = spaces only, 0 = not only spaces
In case the field type is of:
FIELD_SKIP_ZERO
1 bit 1 = zeros only, 0 = not only zeros
In the latter case
x bits the Huffman code for every byte
FIELD_NORMAL
x bits the Huffman code for every byte
FIELD_SKIP_ENDSPACE
If pack type includes PACK_TYPE_SELECTED,
1 bit 1 = more than min endspace, 0 = not more
In the former case
x bits nr of extra spaces, x = length_bits
else
x bits nr of extra spaces, x = length_bits
x bits the Huffman code for every byte
FIELD_SKIP_PRESPACE
If pack type includes PACK_TYPE_SELECTED,
1 bit 1 = more than min prespace, 0 = not more
In the former case
x bits nr of extra spaces, x = length_bits
else
x bits nr of extra spaces, x = length_bits
x bits the Huffman code for every byte
FIELD_CONSTANT or FIELD_ZERO or FIELD_CHECK
nothing for these
FIELD_INTERVALL
x bits the Huffman code for the buffer index of the column value
FIELD_BLOB
1 bit 1 = blob is empty, 0 = not empty
In the latter case
x bits blob length, x = length_bits
x bits the Huffman code for every byte
FIELD_VARCHAR
1 bit 1 = varchar is empty, 0 = not empty
In the latter case
x bits blob length, x = length_bits
x bits the Huffman code for every byte
x bits alignment to the next byte border
