WL#8117: Compact In-Memory Temporary Tables
Affects: Server-8.0 — Status: Complete
The purpose of this task is to have a MEMORY-like engine that supports variable length columns in an efficient way - that is, if a VARCHAR(100) column contains 'abcd' it must be stored in about 4 bytes, not about 100 bytes.
Functional Requirements ======================= * Variable length storage of VARCHAR columns must be supported. * It must be possible to read the rows in insertion order. * It must be possible to create a unique index over a table. * O(1) lookup index (hash table) must be implemented. * An index that stores the data ordered must be implemented (tree), O(log(N)) lookup time. * Both hash and tree indexes must support two modes: unique (no duplicates allowed) and non-unique (duplicates allowed) * Updates must be supported, including ones that change the size of the row. * A table must support one inserter and multiple insertion-order-readers (rnd_next()), where the readers are required to use a different handler object each. All readers and the inserter are required to execute from the same thread (interleaved). A reader must not be "disturbed" by other readers or the inserter, and new insertions should not invalidate the cursor of any reader. * A reader must be able to start a scan both from the first row, and from a certain row indicated with rnd_pos(), then continue with rnd_next(). An implication is that positions reported by position() must be stable: even if a write happens later, the position of the row mustn't be changed. * Comparisons must adhere to the defined collation of the column. UTF8 support must be implemented in an efficient way with overhead compared to Latin1 that is comparable to existing MEMORY engine. * If any configuration variables are provided to the user, there must be some way for the user to diagnose the need for tuning these parameters. * There is no need for any rollback of operations. Upon error the table will be discarded. * Should not impose any further restrictions on the number of columns an index can cover, or the length of the data that can be stored in a table, row, cell or an index. Hints wrt performance ===================== * The most frequent usage pattern is sequential insert and sequential read (rnd_next()). This must be very efficient. * There should be no index overhead unless an index is explicitly created. * Storage space should be allocated in large chunks within which rows are stored in a compact way. * Performance must be at least as good as for current in-memory tables. * Quick truncate
This is to avoid confusion because in the surrounding code base different terms are used to designate the same thing and in some cases a given term can designate different things. A table consists of rows and columns: id | color_name | hex_code --- ------------ --------- 1 | Red | FF0000 2 | Orange | FF8800 3 | Yellow | FFFF00 4 | Green | 00FF00 5 | Cyan | 00FFFF 6 | Blue | 0000FF 7 | Pink | FF00FF Row === id | color_name | hex_code --- ------------ --------- 1 | Red | FF0000 2 | Orange | FF8800 -------------------------- | 3 | Yellow | FFFF00 | -------------------------- 4 | Green | 00FF00 5 | Cyan | 00FFFF 6 | Blue | 0000FF 7 | Pink | FF00FF A row is a horizonal slice from the table. Also called "record" elsewhere. Column ====== ---------- id | color_name | hex_code | --- ------------|----------| 1 | Red | FF0000 | 2 | Orange | FF8800 | 3 | Yellow | FFFF00 | 4 | Green | 00FF00 | 5 | Cyan | 00FFFF | 6 | Blue | 0000FF | 7 | Pink | FF00FF | ---------- A column is a vertical slice from the table. It has a name - "hex_code" in the above example. Also called "field" elsewhere. Cell ==== ---------- id | color_name | hex_code | ---*------------|----------| 1 | Red | FF0000 | 2 | Orange | FF8800 | ----------------*==========* | 3 | Yellow # FFFF00 # ----------------*==========* 4 | Green | 00FF00 | 5 | Cyan | 00FFFF | 6 | Blue | 0000FF | 7 | Pink | FF00FF | ---------- A cell is where a row intersects with a column. Also called "field" elsewhere. Index ===== An index is a complex structure covering one or more columns. Also called "key" elsewhere. Indexed cell ============ An indexed cell is a cell that is covered by an index. Also called "key", "field", "subkey", "key part", "key segment" elsewhere. A new session config variable is introduced internal_tmp_mem_storage_engine that can be set to 'memory' (the current/old heap engine) or 'temptable' (the new engine introduced in this WL). The default is 'temptable'. A new global dynamic variable is inroduced temptable_max_ram to cap the maximum amount of RAM that can be occupied by the TempTable storage engine before it starts storing data on disk. Its min/max/default values are 2MiB/2^64-1 bytes/1GiB. Related to temptable_max_ram are two new performance schema memory entries that can be used to see if this threshold has been reached: mysql> SELECT * FROM memory_summary_global_by_event_name WHERE event_name like '%temptable%'\G *************************** 1. row *************************** EVENT_NAME: memory/temptable/physical_disk COUNT_ALLOC: 1 COUNT_FREE: 1 SUM_NUMBER_OF_BYTES_ALLOC: 2097152 SUM_NUMBER_OF_BYTES_FREE: 2097152 LOW_COUNT_USED: 0 CURRENT_COUNT_USED: 0 HIGH_COUNT_USED: 1 LOW_NUMBER_OF_BYTES_USED: 0 CURRENT_NUMBER_OF_BYTES_USED: 0 HIGH_NUMBER_OF_BYTES_USED: 2097152 *************************** 2. row *************************** EVENT_NAME: memory/temptable/physical_ram COUNT_ALLOC: 2 COUNT_FREE: 1 SUM_NUMBER_OF_BYTES_ALLOC: 2097152 SUM_NUMBER_OF_BYTES_FREE: 1048576 LOW_COUNT_USED: 0 CURRENT_COUNT_USED: 1 HIGH_COUNT_USED: 2 LOW_NUMBER_OF_BYTES_USED: 0 CURRENT_NUMBER_OF_BYTES_USED: 1048576 HIGH_NUMBER_OF_BYTES_USED: 2097152 2 rows in set (0.00 sec) memory/temptable/physical_ram indicates the amount of memory allocated in the RAM (ie using malloc(3)). memory/temptable/physical_disk indicates the amount of memory occupied on disk. If that shows something different than zero, then the temptable_max_ram threshold has been reached at some point. Some further clarifications about how the overflow to disk works: * The temporary files are created in the directory denoted by --tmpdir The below are implementation details that the user or DBA does not necessary need to be aware of (may be too detailed for the user documentation): * One or more file can be created per table * No data is ever moved between the RAM and temp disk files, or inside the RAM or between temp disk files. * New data is stored in RAM if there is space (dictated by temptable_max_ram) and on disk otherwise. * If some of the rows of a table are written to disk and later space in RAM becomes available again and new data is being added to the table, then it is possible that this new data is stored in the RAM. * The temp files are deleted immediately after being created and opened. So they are not visible in the --tmpdir (eg using ls(1)). The space occupied by them is kept by the OS as long as there are open handles to them and it is reclaimed as soon as the files are closed by the TempTable storage engine during normal operation or if the mysqld process shuts down abnormally. To be clarified in the documentation: the currently existing config variables in mysql-trunk --tmp-table-size and --max-heap-table-size have no effect on tables created in the new engine.
Custom memory allocator ======================= All dynamic memory used by the TempTable engine is allocated through this allocator. The purposes of this allocator are: - minimize the number of calls to the OS for allocating new memory (e.g. malloc()) - improve the spatial locality of reference. - transparently start allocating memory from disk (using mmap()) instead of malloc() then some max-RAM-to-use threshold is reached The most common use case, for which it is optimized, is to have the following performed by a single thread: - allocate many times (creation of a temp table and inserting data into it). - use the allocated memory (selects on the temp table). - free all the pieces (drop of the temp table). The allocator allocates memory from the OS in large blocks (e.g. a few MiB) and uses these blocks to feed allocation requests. A block consists of a header and a sequence of chunks: - bytes [0, 3]: 4 bytes for the block size (set at block creation and never changed later). - bytes [4, 7]: 4 bytes for the number of used/allocated chunks from this block (set to 0 at block creation). - bytes [8, 11]: 4 bytes for the offset of the first byte from the block start that is free and can be used by the next allocation request (set to 12 at block creation (3 * 4 bytes)). We call this first pristine offset. - bytes [12, block size) a sequence of chunks appended to each other. A chunk structure is: - bytes [0, 3]: 4 bytes that designate the offset of the chunk from the start of the block. This is used in order to be able to deduce the block start from a given chunk. The offset of the first chunk is 12 (appended after the block size (4), number of allocated chunks (4) and the first pristine offset (4)). - bytes [4, chunk size): user data, pointer to this is returned to the user after a successfull allocation request. With the above structure we can deduce the block into which a user pointer belongs in a constant time (returned by an allocation request). Each allocation has an overhead of 4 bytes. We do not store a list of all allocated blocks, only a pointer to the current block which has not yet been entirely filled up and the overall number of blocks. We do not store a list of the chunks inside a block either. Allocation: - if the current block does not have enough space: create a new block and make it the current (lose the pointer to the previous current block). - increment the number of allocated chunks by 1. - in the first pristine location - write its offset from the block start (4 bytes). - increment the first pristine offset with 4 + requested bytes by the user. - return a pointer to the previous first pristine + 4 to the user. Deallocation: - read 4 bytes before the provided pointer and derive the block start. - decrement the number of used chunks by 1. - if this was the last chunk in the block and this is not the last block: destroy the block, returning the memory to the OS. - keep the last block for reuse even if all chunks from it are removed, it will be destroyed when the thread terminates. When the last chunk from the last block is removed, instead of destroying the block reset its first pristine byte offset to 12. Custom container ================ We use a custom container named temptable::Storage which is very similar to std::deque with the main difference that it can store elements whose size is determined at runtime. In the case of std::deque
, sizeof(Foo) must be known at compile time. We need to store elements whose size is determined at runtime because in the case of fixed size rows (no VARCHAR columns) we store directly the rows in mysql write_row() format in the container. Of course the size of that is determined at runtime as it depends on the number and types of the columns specified in CREATE TABLE.
Copyright (c) 2000, 2021, Oracle Corporation and/or its affiliates. All rights reserved.