Most often, the internal optimizations described in InnoDB Data Storage and Compression ensure that the system runs well with compressed data. However, because the efficiency of compression depends on the nature of your data, you can make decisions that affect the performance of compressed tables:
Which tables to compress.
What compressed page size to use.
Whether to adjust the size of the buffer pool based on run-time performance characteristics, such as the amount of time the system spends compressing and uncompressing data. Whether the workload is more like a data warehouse (primarily queries) or an OLTP system (mix of queries and DML).
If the system performs DML operations on compressed tables, and the way the data is distributed leads to expensive compression failures at runtime, you might adjust additional advanced configuration options.
Use the guidelines in this section to help make those architectural and configuration choices. When you are ready to conduct long-term testing and put compressed tables into production, see Section 17.9.1.4, “Monitoring InnoDB Table Compression at Runtime” for ways to verify the effectiveness of those choices under real-world conditions.
When to Use Compression
In general, compression works best on tables that include a reasonable number of character string columns and where the data is read far more often than it is written. Because there are no guaranteed ways to predict whether or not compression benefits a particular situation, always test with a specific workload and data set running on a representative configuration. Consider the following factors when deciding which tables to compress.
Data Characteristics and Compression
A key determinant of the efficiency of compression in reducing
the size of data files is the nature of the data itself. Recall
that compression works by identifying repeated strings of bytes
in a block of data. Completely randomized data is the worst
case. Typical data often has repeated values, and so compresses
effectively. Character strings often compress well, whether
defined in CHAR
, VARCHAR
,
TEXT
or BLOB
columns. On
the other hand, tables containing mostly binary data (integers
or floating point numbers) or data that is previously compressed
(for example JPEG or PNG
images) may not generally compress well, significantly or at
all.
You choose whether to turn on compression for each InnoDB table. A table and all of its indexes use the same (compressed) page size. It might be that the primary key (clustered) index, which contains the data for all columns of a table, compresses more effectively than the secondary indexes. For those cases where there are long rows, the use of compression might result in long column values being stored “off-page”, as discussed in DYNAMIC Row Format. Those overflow pages may compress well. Given these considerations, for many applications, some tables compress more effectively than others, and you might find that your workload performs best only with a subset of tables compressed.
To determine whether or not to compress a particular table,
conduct experiments. You can get a rough estimate of how
efficiently your data can be compressed by using a utility that
implements LZ77 compression (such as gzip
or
WinZip) on a copy of the .ibd
file for an uncompressed table. You can expect less
compression from a MySQL compressed table than from file-based
compression tools, because MySQL compresses data in chunks based
on the page size, 16KB by
default. In addition to user data, the page format includes some
internal system data that is not compressed. File-based
compression utilities can examine much larger chunks of data,
and so might find more repeated strings in a huge file than
MySQL can find in an individual page.
Another way to test compression on a specific table is to copy
some data from your uncompressed table to a similar, compressed
table (having all the same indexes) in a
file-per-table
tablespace and look at the size of the resulting
.ibd
file. For example:
USE test;
SET GLOBAL innodb_file_per_table=1;
SET GLOBAL autocommit=0;
-- Create an uncompressed table with a million or two rows.
CREATE TABLE big_table AS SELECT * FROM information_schema.columns;
INSERT INTO big_table SELECT * FROM big_table;
INSERT INTO big_table SELECT * FROM big_table;
INSERT INTO big_table SELECT * FROM big_table;
INSERT INTO big_table SELECT * FROM big_table;
INSERT INTO big_table SELECT * FROM big_table;
INSERT INTO big_table SELECT * FROM big_table;
INSERT INTO big_table SELECT * FROM big_table;
INSERT INTO big_table SELECT * FROM big_table;
INSERT INTO big_table SELECT * FROM big_table;
INSERT INTO big_table SELECT * FROM big_table;
COMMIT;
ALTER TABLE big_table ADD id int unsigned NOT NULL PRIMARY KEY auto_increment;
SHOW CREATE TABLE big_table\G
select count(id) from big_table;
-- Check how much space is needed for the uncompressed table.
\! ls -l data/test/big_table.ibd
CREATE TABLE key_block_size_4 LIKE big_table;
ALTER TABLE key_block_size_4 key_block_size=4 row_format=compressed;
INSERT INTO key_block_size_4 SELECT * FROM big_table;
commit;
-- Check how much space is needed for a compressed table
-- with particular compression settings.
\! ls -l data/test/key_block_size_4.ibd
This experiment produced the following numbers, which of course could vary considerably depending on your table structure and data:
-rw-rw---- 1 cirrus staff 310378496 Jan 9 13:44 data/test/big_table.ibd
-rw-rw---- 1 cirrus staff 83886080 Jan 9 15:10 data/test/key_block_size_4.ibd
To see whether compression is efficient for your particular workload:
For simple tests, use a MySQL instance with no other compressed tables and run queries against the Information Schema
INNODB_CMP
table.For more elaborate tests involving workloads with multiple compressed tables, run queries against the Information Schema
INNODB_CMP_PER_INDEX
table. Because the statistics in theINNODB_CMP_PER_INDEX
table are expensive to collect, you must enable the configuration optioninnodb_cmp_per_index_enabled
before querying that table, and you might restrict such testing to a development server or a non-critical replica server.Run some typical SQL statements against the compressed table you are testing.
Examine the ratio of successful compression operations to overall compression operations by querying
INFORMATION_SCHEMA.INNODB_CMP
orINFORMATION_SCHEMA.INNODB_CMP_PER_INDEX
, and comparingCOMPRESS_OPS
toCOMPRESS_OPS_OK
.If a high percentage of compression operations complete successfully, the table might be a good candidate for compression.
If you get a high proportion of compression failures, you can adjust
innodb_compression_level
,innodb_compression_failure_threshold_pct
, andinnodb_compression_pad_pct_max
options as described in Section 17.9.1.6, “Compression for OLTP Workloads”, and try further tests.
Database Compression versus Application Compression
Decide whether to compress data in your application or in the table; do not use both types of compression for the same data. When you compress the data in the application and store the results in a compressed table, extra space savings are extremely unlikely, and the double compression just wastes CPU cycles.
Compressing in the Database
When enabled, MySQL table compression is automatic and applies
to all columns and index values. The columns can still be tested
with operators such as LIKE
, and sort
operations can still use indexes even when the index values are
compressed. Because indexes are often a significant fraction of
the total size of a database, compression could result in
significant savings in storage, I/O or processor time. The
compression and decompression operations happen on the database
server, which likely is a powerful system that is sized to
handle the expected load.
Compressing in the Application
If you compress data such as text in your application, before it is inserted into the database, You might save overhead for data that does not compress well by compressing some columns and not others. This approach uses CPU cycles for compression and uncompression on the client machine rather than the database server, which might be appropriate for a distributed application with many clients, or where the client machine has spare CPU cycles.
Hybrid Approach
Of course, it is possible to combine these approaches. For some applications, it may be appropriate to use some compressed tables and some uncompressed tables. It may be best to externally compress some data (and store it in uncompressed tables) and allow MySQL to compress (some of) the other tables in the application. As always, up-front design and real-life testing are valuable in reaching the right decision.
Workload Characteristics and Compression
In addition to choosing which tables to compress (and the page
size), the workload is another key determinant of performance.
If the application is dominated by reads, rather than updates,
fewer pages need to be reorganized and recompressed after the
index page runs out of room for the per-page “modification
log” that MySQL maintains for compressed data. If the
updates predominantly change non-indexed columns or those
containing BLOB
s or large strings that happen
to be stored “off-page”, the overhead of
compression may be acceptable. If the only changes to a table
are INSERT
s that use a monotonically
increasing primary key, and there are few secondary indexes,
there is little need to reorganize and recompress index pages.
Since MySQL can “delete-mark” and delete rows on
compressed pages “in place” by modifying
uncompressed data, DELETE
operations on a
table are relatively efficient.
For some environments, the time it takes to load data can be as important as run-time retrieval. Especially in data warehouse environments, many tables may be read-only or read-mostly. In those cases, it might or might not be acceptable to pay the price of compression in terms of increased load time, unless the resulting savings in fewer disk reads or in storage cost is significant.
Fundamentally, compression works best when the CPU time is available for compressing and uncompressing data. Thus, if your workload is I/O bound, rather than CPU-bound, you might find that compression can improve overall performance. When you test your application performance with different compression configurations, test on a platform similar to the planned configuration of the production system.
Configuration Characteristics and Compression
Reading and writing database pages from and to disk is the slowest aspect of system performance. Compression attempts to reduce I/O by using CPU time to compress and uncompress data, and is most effective when I/O is a relatively scarce resource compared to processor cycles.
This is often especially the case when running in a multi-user environment with fast, multi-core CPUs. When a page of a compressed table is in memory, MySQL often uses additional memory, typically 16KB, in the buffer pool for an uncompressed copy of the page. The adaptive LRU algorithm attempts to balance the use of memory between compressed and uncompressed pages to take into account whether the workload is running in an I/O-bound or CPU-bound manner. Still, a configuration with more memory dedicated to the buffer pool tends to run better when using compressed tables than a configuration where memory is highly constrained.
Choosing the Compressed Page Size
The optimal setting of the compressed page size depends on the type and distribution of data that the table and its indexes contain. The compressed page size should always be bigger than the maximum record size, or operations may fail as noted in Compression of B-Tree Pages.
Setting the compressed page size too large wastes some space, but the pages do not have to be compressed as often. If the compressed page size is set too small, inserts or updates may require time-consuming recompression, and the B-tree nodes may have to be split more frequently, leading to bigger data files and less efficient indexing.
Typically, you set the compressed page size to 8K or 4K bytes.
Given that the maximum row size for an InnoDB table is around
8K, KEY_BLOCK_SIZE=8
is usually a safe
choice.