WL#7794: PERFORMANCE SCHEMA, SCALABLE MEMORY ALLOCATION

Affects: Server-5.7 — Status: Complete

Currently the performance schema:
- allocates all the memory it needs up front
- requires the user to configure how much memory to use
- holds on to all the allocated memory, even when the server load is low and
part of that memory goes unused.

This task is to relax the memory constraints:
- to increase ease of use, with less configuration
- to decrease the memory footprint, scaling the memory consumption with the
server's actual load.

F-1 Memory instruments

New instruments, named with the prefix "memory/performance_schema/",
are defined and visible in table performance_schema.setup_instruments.

These instruments expose how much memory is allocated for internal buffers in
the performance schema.

F-2 Global memory aggregates

Table performance_schema.memory_summary_global_by_event_name
displays global statistics for the instruments named
"memory/performance_schema/%"

Note that other memory aggregates, namely:
- table memory_summary_by_account_by_event_name
- table memory_summary_by_host_by_event_name
- table memory_summary_by_thread_by_event_name
- table memory_summary_by_user_by_event_name
are not affected.

NF-3 Memory life cycle

Before this change, design constraints on the performance schema were:
- allocate all the memory needed at startup
- never allocate memory during the server operation
- never free memory during the server operation
- free all the memory used at shutdown

These constraints are now relaxed, so that the performance schema:
- may allocate memory at server startup
- may allocate additional memory during server operation
- never frees memory during server operation
- frees all the memory used at shutdown

F-4 "Autoscale" configuration parameters

The following configuration parameters:
- performance_schema_accounts_size
- performance_schema_hosts_size
- performance_schema_max_cond_instances
- performance_schema_max_file_instances
- performance_schema_max_index_stat
- performance_schema_max_metadata_locks
- performance_schema_max_mutex_instances
- performance_schema_max_prepared_statements_instances
- performance_schema_max_program_instances
- performance_schema_max_rwlock_instances
- performance_schema_max_socket_instances
- performance_schema_max_table_handles
- performance_schema_max_table_instances
- performance_schema_max_table_lock_stat
- performance_schema_max_thread_instances
- performance_schema_users_size
are now parameters that support automatic scaling.

Requirements for autoscale parameters are as follows.

F-4-A Setting autoscale parameters to zero

When an autoscale parameter is set to 0,
the corresponding internal buffer is empty,
and no memory is allocated.

Note: No change from the previous behavior.

F-4-B Setting autoscale parameters to a positive value

When an autoscale parameter is set to a positive value N,
the corresponding internal buffer is initially empty,
and no memory is initially allocated.

As the performance schema collects data, memory is allocated in the
corresponding buffer, until the buffer size reaches N.

Once the buffer size is N, no more memory is allocated.
Data collected by the performance schema for this buffer is lost,
and the corresponding lost counters for the buffer are incremented.

Note: compared to the previous implementation,
- allocation is not N up-front, but up to N during the server operation
(improvement)
- the buffer total size is capped and lost counters indicate overflow (no change)

F-4-C Setting autoscale parameters to minus one

When an autoscale parameter is set to -1,
the corresponding internal buffer is initially empty,
and no memory is initially allocated.

As the performance schema collects data, memory is allocated in the
corresponding buffer. The buffer size is unbounded, and may grow with the load.

Note: compared to the previous implementation,
- allocation is not up-front, but increasing during the server operation
(improvement)
- there is no "autotuned" computed value for the max size of a buffer;
the buffer grows with the load actually seen, not with an estimated load
(improvement)
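
The three settings described in F-4-A, F-4-B and F-4-C can be summarized with
a small model. This is only an illustrative sketch, not the server code: the
class name, the method names and the one-record-at-a-time growth are
assumptions made for the example. In particular, treating a buffer set to 0
like a full buffer (so that the lost counter moves) is an assumption; the
requirements above only state that such a buffer stays empty.

```python
class AutoscaleBuffer:
    """Toy model of one internal buffer (hypothetical, for illustration)."""

    def __init__(self, limit):
        # limit: 0 = no buffer, N > 0 = capped at N, -1 = unbounded
        if limit < -1:
            raise ValueError("limit must be -1, 0, or a positive value")
        self.limit = limit
        self.size = 0   # records held; nothing is allocated up front
        self.lost = 0   # models the lost counter of this buffer

    def record(self):
        """Try to store one record; return True if stored, False if lost."""
        if self.limit != -1 and self.size >= self.limit:
            # Assumption: limit 0 is handled like a buffer that is full.
            self.lost += 1
            return False
        self.size += 1
        return True


def simulate(limit, events):
    """Feed `events` records into a fresh buffer; return (size, lost)."""
    buf = AutoscaleBuffer(limit)
    for _ in range(events):
        buf.record()
    return buf.size, buf.lost


print(simulate(3, 5))    # capped at N = 3
print(simulate(-1, 5))   # unbounded
```

With a cap of 3 and 5 records, the model stores 3 and reports 2 as lost; with
-1 it stores all 5 and loses none.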

Interface changes : additional memory instrumentation
=====================================================

The performance schema allocates memory internally, for various buffers.

Each buffer is instrumented with a dedicated instrument name,
so that memory consumption can be traced to individual buffers.

Given that the server is already instrumented for memory statistics,
statistics about the performance schema internal memory consumption
are reported in the existing tables, as for any other instrument.

Internal buffers are global to the server, so only a global statistic
is exposed: there are no per thread / account / user / host memory statistics
reported, as this is not applicable.

Behavior changes : scalable memory allocation
=============================================

The performance schema allocates memory incrementally,
based on the server load, instead of allocating all the memory needed during
server startup.

Memory is never freed during server operation, but can be recycled.
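
The difference between freeing and recycling can be pictured with a toy pool.
This is a sketch under stated assumptions, not the actual implementation: the
class and the free list are hypothetical, chosen only to show that the amount
of allocated memory never decreases while released slots are reused.

```python
class RecyclingPool:
    """Hypothetical pool: slots are reused, never returned to the system."""

    def __init__(self):
        self.allocated = 0     # slots ever allocated; never decreases
        self.free_slots = []   # released slots waiting to be reused

    def acquire(self):
        # Prefer a recycled slot; allocate a new one only when none is free.
        if self.free_slots:
            return self.free_slots.pop()
        slot = self.allocated
        self.allocated += 1
        return slot

    def release(self, slot):
        # The slot becomes reusable, but the memory stays allocated.
        self.free_slots.append(slot)


pool = RecyclingPool()
first = pool.acquire()
second = pool.acquire()
pool.release(first)
third = pool.acquire()   # reuses the slot released just before
```

In this model, `third` is the same slot as `first`, and `pool.allocated`
stays at 2 for the rest of the run.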

Allocation behavior
===================

This section is implementation dependent, and subject to change.
It is documented to help understand the behavior, and for the user manual.

The server allocates memory in increments, for each internal buffer.

The size of each increment is fixed, and depends on the buffer used.

The maximum number of increments is capped, which allows a simpler and more
efficient implementation. The cap is a very high limit that should never be
reached in practice in production.

Currently, the increments are as follows.

Internal buffer name    Increment size    Max number of increments    Max size

Mutex instances         1024              1024                        1048576
Rwlock instances        1024              1024                        1048576
Cond instances          256               256                         65536
File instances          1024              1024                        1048576
Socket instances        256               256                         65536
Metadata locks          1024              1024                        1048576
Setup actors            1024              1024                        1048576
Table handles           1024              1024                        1048576
Table instances         1024              1024                        1048576
Table indexes           1024              1024                        1048576
Table locks             1024              1024                        1048576
Stored programs         1024              1024                        1048576
Prepared statements     1024              1024                        1048576
Accounts                256               256                         65536
Hosts                   256               256                         65536
Threads                 256               256                         65536
Users                   256               256                         65536
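
The max size in the figures above is the increment size multiplied by the
maximum number of increments. A short check for two of the buffers, with
values copied from the list (the dictionary and function names are only
illustrative):

```python
# (increment size, max number of increments), copied from the list above
SIZING = {
    "Mutex instances": (1024, 1024),
    "Threads": (256, 256),
}


def max_size(increment_size, max_increments):
    """Largest number of instances a buffer can hold."""
    return increment_size * max_increments


for name, (increment, count) in SIZING.items():
    print(name, max_size(increment, count))
# Mutex instances 1048576
# Threads 65536
```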

In other words, mutex instances are allocated in chunks of 1,024 at a time, up
to 1 million (1,048,576 precisely) instances, or fewer if a specific limit is
given by configuration parameters.

Likewise, memory to instrument threads is allocated in chunks of 256, up to a
maximum limit of 65,536, or fewer if a specific limit is given by
configuration parameters.
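
As a rough illustration of the two paragraphs above, the sketch below
estimates how many chunks are in use once a given number of instances has
been requested. It assumes that chunks are allocated one at a time, on
demand, and that a configured limit of -1 means "no specific limit"; the
function name and its parameters are hypothetical.

```python
def chunks_needed(requested, increment, max_increments, configured_limit=-1):
    """Return (chunks in use, instances stored) for `requested` instances."""
    cap = increment * max_increments        # built-in maximum size
    if configured_limit >= 0:
        cap = min(cap, configured_limit)    # a specific limit lowers the cap
    stored = min(requested, cap)            # anything beyond the cap is lost
    chunks = -(-stored // increment)        # ceiling division
    return chunks, stored


print(chunks_needed(3000, 1024, 1024))         # mutexes, no specific limit
print(chunks_needed(100000, 256, 256))         # threads, built-in cap reached
print(chunks_needed(5000, 1024, 1024, 2000))   # mutexes, limit set to 2000
```

For 3,000 mutex instances this gives 3 chunks of 1,024. For 100,000 threads
the built-in cap applies: 256 chunks and 65,536 instances. With a configured
limit of 2,000, only 2 chunks are needed and 2,000 instances are stored.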

Benefits
========

Configuration of most sizing parameters is no longer required.

A server running under a very low load no longer consumes an arbitrarily high
amount of memory, as it did before.