WL#8117: Compact In-Memory Temporary Tables

Affects: Server-8.0   —   Status: Complete

The purpose of this task is to have a MEMORY-like engine that supports variable
length columns in an efficient way - that is, if a VARCHAR(100) column contains
'abcd' it must be stored in about 4 bytes, not about 100 bytes.
Functional Requirements

* Variable length storage of VARCHAR columns must be supported.
* It must be possible to read the rows in insertion order.
* It must be possible to create a unique index over a table.
* O(1) lookup index (hash table) must be implemented.
* An index that stores the data ordered must be implemented (tree), O(log(N))
lookup time.
* Both hash and tree indexes must support two modes: unique (no duplicates
allowed) and non-unique (duplicates allowed)
* Updates must be supported, including ones that change the size of the row.
* A table must support one inserter and multiple insertion-order-readers
(rnd_next()), where the readers are required to use a different handler object
each. All readers and the inserter are required to execute from the same thread
(interleaved). A reader must not be "disturbed" by other readers or the
inserter, and new insertions should not invalidate the cursor of any reader.
* A reader must be able to start a scan both from the first row, and from a
certain row indicated with rnd_pos(), then continue with rnd_next(). An
implication is that positions reported by position() must be stable: even if a
write happens later, the position of the row mustn't be changed.
* Comparisons must adhere to the defined collation of the column. UTF8
support must be implemented in an efficient way with overhead compared to Latin1
that is comparable to existing MEMORY engine.
* If any configuration variables are provided to the user, there must be some
way for the user to diagnose the need for tuning these parameters.
* There is no need for any rollback of operations. Upon error the table will be
* Should not impose any further restrictions on the number of columns an index
can cover, or the length of the data that can be stored in a table, row, cell or
an index.

Hints wrt performance

* The most frequent usage pattern is sequential insert and sequential read
(rnd_next()). This must be very efficient.
* There should be no index overhead unless an index is explicitly created.
* Storage space should be allocated in large chunks within which rows are stored
in a compact way.
* Performance must be at least as good as for current in-memory tables.
* Quick truncate
This is to avoid confusion because in the surrounding code base different terms
are used to designate the same thing and in some cases a given term can
designate different things.

A table consists of rows and columns:

 id | color_name | hex_code
 --- ------------ ---------
  1 | Red        | FF0000
  2 | Orange     | FF8800
  3 | Yellow     | FFFF00
  4 | Green      | 00FF00
  5 | Cyan       | 00FFFF
  6 | Blue       | 0000FF
  7 | Pink       | FF00FF


 id | color_name | hex_code
 --- ------------ ---------
  1 | Red        | FF0000
  2 | Orange     | FF8800
| 3 | Yellow     | FFFF00  |
  4 | Green      | 00FF00
  5 | Cyan       | 00FFFF
  6 | Blue       | 0000FF
  7 | Pink       | FF00FF

A row is a horizonal slice from the table.
Also called "record" elsewhere.


 id | color_name | hex_code |
 --- ------------|----------|
  1 | Red        | FF0000   |
  2 | Orange     | FF8800   |
  3 | Yellow     | FFFF00   |
  4 | Green      | 00FF00   |
  5 | Cyan       | 00FFFF   |
  6 | Blue       | 0000FF   |
  7 | Pink       | FF00FF   |

A column is a vertical slice from the table. It has a name - "hex_code" in the
above example.
Also called "field" elsewhere.


 id | color_name | hex_code |
  1 | Red        | FF0000   |
  2 | Orange     | FF8800   |
| 3 | Yellow     # FFFF00   #
  4 | Green      | 00FF00   |
  5 | Cyan       | 00FFFF   |
  6 | Blue       | 0000FF   |
  7 | Pink       | FF00FF   |

A cell is where a row intersects with a column.
Also called "field" elsewhere.

An index is a complex structure covering one or more columns.
Also called "key" elsewhere.

Indexed cell
An indexed cell is a cell that is covered by an index.
Also called "key", "field", "subkey", "key part", "key segment" elsewhere.

A new session config variable is introduced internal_tmp_mem_storage_engine that
can be set to 'memory' (the current/old heap engine) or 'temptable' (the new 
engine introduced in this WL). The default is 'temptable'.

A new global dynamic variable is inroduced temptable_max_ram to cap the maximum
amount of RAM that can be occupied by the TempTable storage engine before it 
starts  storing data on disk. Its min/max/default values are 2MiB/2^64-1 
Related to temptable_max_ram are two new performance schema memory entries that 
be  used to see if this threshold has been reached:

mysql> SELECT * FROM memory_summary_global_by_event_name WHERE event_name like
*************************** 1. row ***************************
                  EVENT_NAME: memory/temptable/physical_disk
                 COUNT_ALLOC: 1
                  COUNT_FREE: 1
              LOW_COUNT_USED: 0
             HIGH_COUNT_USED: 1
*************************** 2. row ***************************
                  EVENT_NAME: memory/temptable/physical_ram
                 COUNT_ALLOC: 2
                  COUNT_FREE: 1
              LOW_COUNT_USED: 0
             HIGH_COUNT_USED: 2
2 rows in set (0.00 sec)

memory/temptable/physical_ram indicates the amount of memory allocated in the RAM
(ie using malloc(3)).
memory/temptable/physical_disk indicates the amount of memory occupied on disk. If
that shows something different than zero, then the temptable_max_ram threshold has
been reached at some point.

Some further clarifications about how the overflow to disk works:
* The temporary files are created in the directory denoted by --tmpdir
The below are implementation details that the user or DBA does not necessary
need to be aware of (may be too detailed for the user documentation):
* One or more file can be created per table
* No data is ever moved between the RAM and temp disk files, or inside the RAM
or between temp disk files.
* New data is stored in RAM if there is space (dictated by temptable_max_ram) and
on disk otherwise.
* If some of the rows of a table are written to disk and later space in RAM
becomes available again and new data is being added to the table, then it is
possible that this new data is stored in the RAM.
* The temp files are deleted immediately after being created and opened. So they
are not visible in the --tmpdir (eg using ls(1)). The space occupied by them is
kept by the OS as long as there are open handles to them and it is reclaimed as
soon as the files are closed by the TempTable storage engine during normal
operation or if the mysqld process shuts down abnormally.

To be clarified in the documentation: the currently existing config variables in 
mysql-trunk --tmp-table-size and --max-heap-table-size have no effect on tables 
created in the new engine.
Custom memory allocator

All dynamic memory used by the TempTable engine is allocated through this 

The purposes of this allocator are:
- minimize the number of calls to the OS for allocating new memory (e.g. malloc())
- improve the spatial locality of reference.
- transparently start allocating memory from disk (using mmap()) instead of
malloc() then some max-RAM-to-use threshold is reached

The most common use case, for which it is optimized, is to have the following
performed by a single thread:
- allocate many times (creation of a temp table and inserting data into it).
- use the allocated memory (selects on the temp table).
- free all the pieces (drop of the temp table).

The allocator allocates memory from the OS in large blocks (e.g. a few MiB)
and uses these blocks to feed allocation requests. A block consists of a
header and a sequence of chunks:
- bytes [0, 3]: 4 bytes for the block size (set at block creation and never
  changed later).
- bytes [4, 7]: 4 bytes for the number of used/allocated chunks from this
  block (set to 0 at block creation).
- bytes [8, 11]: 4 bytes for the offset of the first byte from the block
  start that is free and can be used by the next allocation request (set
  to 12 at block creation (3 * 4 bytes)). We call this first pristine offset.
- bytes [12, block size) a sequence of chunks appended to each other.
A chunk structure is:
- bytes [0, 3]: 4 bytes that designate the offset of the chunk from
  the start of the block. This is used in order to be able to deduce
  the block start from a given chunk. The offset of the first chunk is
  12 (appended after the block size (4), number of allocated chunks (4)
  and the first pristine offset (4)).
- bytes [4, chunk size): user data, pointer to this is returned to the
  user after a successfull allocation request.

With the above structure we can deduce the block into which a user pointer
belongs in a constant time (returned by an allocation request). Each
allocation has an overhead of 4 bytes.

We do not store a list of all allocated blocks, only a pointer to the current
block which has not yet been entirely filled up and the overall number of
blocks. We do not store a list of the chunks inside a block either.

- if the current block does not have enough space:
    create a new block and make it the current (lose the pointer to the
    previous current block).
- increment the number of allocated chunks by 1.
- in the first pristine location - write its offset from the block
  start (4 bytes).
- increment the first pristine offset with 4 + requested bytes by the user.
- return a pointer to the previous first pristine + 4 to the user.

- read 4 bytes before the provided pointer and derive the block start.
- decrement the number of used chunks by 1.
- if this was the last chunk in the block and this is not the last block:
    destroy the block, returning the memory to the OS.
- keep the last block for reuse even if all chunks from it are removed, it
  will be destroyed when the thread terminates. When the last chunk from
  the last block is removed, instead of destroying the block reset its first
  pristine byte offset to 12.

Custom container

We use a custom container named temptable::Storage which is very similar to
std::deque with the main difference that it can store elements whose size is
determined at runtime. In the case of std::deque, sizeof(Foo) must be known
at compile time. We need to store elements whose size is determined at runtime
because in the case of fixed size rows (no VARCHAR columns) we store directly
the rows in mysql write_row() format in the container. Of course the size of
that is determined at runtime as it depends on the number and types of the
columns specified in CREATE TABLE.