WL#13459: Optimize hash table in hash join
Affects: Server-8.0
Status: Complete
We intend to switch the hash table used for hash join from
mem_root_unordered_multimap to a different implementation
(robin_hood::unordered_flat_map), with our own multimap adapter on top. This
is to get:

 1. A faster hash table.
 2. Less memory usage: smaller hash table overhead, less space wasted to
    alignment and key/value lengths, better memory usage with many equal keys.
 3. Better memory control: getting closer to the allowed join buffer size, and
    also the ability to free the old memory when the hash table grows.

Note that #2 means we need to spill to disk less often.
No new functional requirements.

NF1: Hash join should be faster than before in most situations.
Scope
=====

This worklog is limited in scope to improvements related to switching from
std::unordered_multimap to a more efficient (in terms of time and memory) hash
table; it does not intend to fix every possible inefficiency in our hash
joins. In other words, we keep the basic architecture of having a hash table
of string -> {string} (where {string} is a set of variable-length strings),
but attempt to reduce the CPU usage and memory usage. Reducing memory usage is
of particular concern, as using too much memory causes us to have to spill to
disk -- and even worse, to re-read the same chunks from disk multiple times if
we cannot fit them into RAM.

For future reference, here are a few issues that would be good to investigate,
but that are out of scope for this worklog:

- Increasing the default join buffer size from 256 kB (it was set so low at a
  point where RAM was much less plentiful, and we did not have incremental
  allocation).
- Removing overhead related to translation between many different formats (we
  spend a lot of time unpacking from InnoDB to MyISAM row format and into
  fields before storing the row in hash join format, Item_field overhead,
  Field::val_int() overhead, etc.).
- Special-casing fixed-length keys and values, which potentially can save
  memory for edge cases involving very few/short fields (saving pointer
  overhead).
- Lifting the 128-file limitation, which hurts us severely in some cases.
- Replacing the current pseudo-BNL (reading the same chunk file from disk
  multiple times when it grows too big to keep all in memory) with
  subsharding, compounding the previous point.
- Issues when we misestimate the number of chunk files needed.
- That we often store columns in the hash table that we don't need, since we
  don't track precise read sets throughout the iterator tree (BUG#99934,
  “Hash join adds columns to the hash table that is not needed.”).
- That our key storage is inefficient, e.g. we store 8 bytes even for a 4-byte
  int.

Choice of hash table
====================

Hash tables are an active area of research. In particular, there are the STL
implementations (libstdc++, libc++ and MSVC being the major ones), Google's
Swiss tables (part of Abseil), Facebook's F14 (part of Folly), and Malte
Skarupke's flat_hash_map and bytell_hash_map. (There are many more.) After
some preliminary benchmarking, we decided on robin_hood::unordered_flat_map;
it has few dependencies, is reasonably time-efficient for insertions and
lookups (we never erase), is appropriately licensed, and has fairly low memory
overhead. (Skarupke's flat_hash_map benchmarked better in microbenchmarks, but
no better in the actual hash join -- and uses more memory.)

Like most templated C++ data structures, it does not support variable-length
keys and values well, and thus, we must generally use pointers to values
instead of storing the values inline in the map; opening up for inline keys
and values would be interesting in some situations, to save on the pointer
overhead. (This is perhaps the biggest downside of not writing our own,
although I would argue that type safety and saving on implementation costs
easily outweigh the extra performance we'd get with very short keys and
values, which most often show up in unrealistically narrow benchmark
situations such as DBT-3.)

For a pointer -> pointer map, robin_hood::unordered_flat_map needs 17 bytes
per bucket: 64 bits for the key pointer, 64 bits for the value pointer, and
one control byte for marking empty slots and a few other things. (For
comparison, unordered_multimap, which we currently use, needs 48 bytes per
node, and then some more in the actual map.) Of course, on top of this, we
must find a way of storing strings efficiently.
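For illustration only (this is not the server code), here is a minimal sketch,
compiled as C++17, of using robin_hood::unordered_flat_map with the key held
as a std::string_view, i.e., the pointer-plus-64-bit-length representation
that the next section replaces with a single pointer to a length-prefixed
buffer:

  #include <cstdio>
  #include <string_view>

  #include "robin_hood.h"  // third-party single-header hash map

  int main() {
    // Keys hash and compare by content (robin_hood::hash<std::string_view>
    // hashes the pointed-to bytes); values are just pointers to externally
    // owned row data. Each bucket here is 16 bytes of key (pointer + length),
    // 8 bytes of value pointer and 1 info byte; shrinking the key side to a
    // single pointer into a length-prefixed buffer gives the 17 bytes per
    // bucket quoted above.
    robin_hood::unordered_flat_map<std::string_view, const char *> ht;

    const char *row = "<packed row>";
    ht.emplace("join key", row);

    auto it = ht.find(std::string_view("join key"));
    if (it != ht.end()) std::printf("found: %s\n", it->second);
    return 0;
  }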
Storing multistrings
====================

Only the STL implementations support multimaps, i.e., multiple values for the
same key. This is essential for implementing SQL joins, except for semi- and
antijoins, which by nature only need one candidate value (except when there
are join keys that cannot be represented in the hash and need to be checked
separately). We also need to store the lengths of the keys. (The values have
implicit or explicit lengths from the packing, which are opaque to us --
currently, we store the total length, but don't actually use it for anything.)
We will approach these two problems separately but similarly.

We assume that the key will be a pointer into a MEM_ROOT. There have been
brief experiments with short-string optimizations and tagged pointers (e.g.,
if the pointer is invalid in a certain way, the first byte denotes the length
and the remaining 7 bytes the key), and while that saves some memory, it does
not appear to be a big win in general, so it will be out of scope for this WL.

In the current implementation, we store the length of the key together with
the pointer, i.e., in a string_view-like structure (64 bits for the pointer,
64 bits for the length). This is wasteful, given that most keys are typically
well below 16 kB. Instead, in this implementation, we will store the length as
a Varint128 in the beginning of the buffer, using the protobuf library that we
already link to. This means that any length <= 127 bytes (2^7 - 1) will be
stored as one byte, any length <= 16383 bytes (2^14 - 1) will be stored as
two, and so on. The key immediately follows:

  | len (usually 1-2 bytes) | key         |
  | 0x03                    | 'f' 'o' 'o' |

We don't have any alignment requirements, so we are free to pack these keys as
tightly as we want in the MEM_ROOT. (The MEM_ROOT will be extended with
methods for unaligned allocation for this purpose, so that we don't waste 0–7
bytes per key to alignment.) We'll encapsulate encoding and decoding of this
kind of string into a class, which will be called ImmutableString. (It is, as
the name implies, not intended for manipulation.)

As previously mentioned, the values do not need to store lengths. However, we
need a way to store multiple values in the same key slot. To this end, we
choose to link them together in a linked list; whenever we want to insert an
extra element for an already-existing key, we replace the element currently in
the bucket with a new one that contains both the value and a pointer to the
next value. Normally, linked lists are bad for performance due to the pointer
chasing, but we do so much work per row (decode it into buffers, check extra
conditions, run the upstream iterators, send data to the client, etc.) that
having ~20 cycles of latency means absolutely nothing in the big picture.

Linked lists are also typically space-inefficient in that we need to store an
extra pointer, but we can invoke a trick here: Since we know that all the
values are allocated on the same MEM_ROOT, their pointers are likely to be
very close to each other. Thus, we can store only the difference between this
value and the next value, again Varint128-encoded. (We use “zig-zag” encoding,
so that negative values are also encoded efficiently.)
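To make these byte layouts concrete, here is a small, self-contained sketch.
The varint and zig-zag helpers are hand-rolled stand-ins for the protobuf
routines the server actually links to, and the names are illustrative only,
not the real ImmutableString/LinkedImmutableString API:

  #include <cstdint>
  #include <cstdio>
  #include <cstring>

  // Encode an unsigned value as a base-128 varint; returns bytes written.
  static size_t EncodeVarint(uint64_t value, uint8_t *dst) {
    size_t len = 0;
    while (value >= 0x80) {
      dst[len++] = static_cast<uint8_t>(value) | 0x80;
      value >>= 7;
    }
    dst[len++] = static_cast<uint8_t>(value);
    return len;
  }

  static size_t DecodeVarint(const uint8_t *src, uint64_t *value) {
    *value = 0;
    size_t len = 0;
    int shift = 0;
    do {
      *value |= static_cast<uint64_t>(src[len] & 0x7f) << shift;
      shift += 7;
    } while (src[len++] & 0x80);
    return len;
  }

  // Zig-zag maps small negative numbers to small unsigned numbers
  // (0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...), so a "next" value that lies
  // 48 bytes *before* this one still encodes in a single byte.
  static uint64_t ZigZagEncode(int64_t n) {
    return (static_cast<uint64_t>(n) << 1) ^ static_cast<uint64_t>(n >> 63);
  }

  int main() {
    // ImmutableString-style key: | varint length | key bytes |
    uint8_t key_buf[16];
    const char *key = "foo";
    size_t pos = EncodeVarint(strlen(key), key_buf);  // one byte: 0x03
    memcpy(key_buf + pos, key, strlen(key));
    std::printf("key uses %zu bytes total\n", pos + strlen(key));  // 4

    uint64_t decoded_length;
    DecodeVarint(key_buf, &decoded_length);
    std::printf("decoded key length: %llu\n",
                static_cast<unsigned long long>(decoded_length));  // 3

    // LinkedImmutableString-style value: | zig-zag varint ptr diff | value |
    // A next pointer 48 bytes lower encodes as 0x5f, as in the diagram below.
    uint8_t diff_buf[10];
    size_t diff_len = EncodeVarint(ZigZagEncode(-48), diff_buf);
    std::printf("ptr diff -48 -> 0x%02x (%zu byte)\n", diff_buf[0], diff_len);
    return 0;
  }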
In the vast majority of cases, this means we can get the “next” pointer
encoded in only one or two bytes instead of 8, again a large win. (For the
degenerate case of a cross product join, we will get only a single entry in
the hash table, and then a long linked list with only one byte of overhead for
nearly all values -- extremely efficient.) The special pointer difference of 0
means that this is the last entry (the first one inserted); it would be nice
to use less than one byte for this common case somehow, but it is not a huge
issue. So the values, encapsulated in a separate class called
LinkedImmutableString, may look something like this:

  0x100000                                  0x100030
  | ptr diff | value       | ...others... | ptr diff | value       |
  | 0x00     | 'b' 'a' 'r' | ............ | 0x5f     | '1' '2' '3' |
      ↑_________________________________________↓

  (0x5f here is -48, zig-zag-encoded.)

Note that if we stored the difference at the end of the string instead of at
the beginning, we could get somewhat smaller pointer differences on average
and occasionally save a byte, at the cost of extra complexity. For instance,
we could allocate from the back of the MEM_ROOT instead of from the front, and
use the fact that unpacking gives us the length of the value:

  0x100000                                  0x100030
  | value       | ptr diff | ...others... | value       | ptr diff |
  | '1' '2' '3' | 0x5a     | ............ | 'b' 'a' 'r' | 0x00     |
                    ↓________________________↑

This was deemed not worth the complexity.

Note that while this structure is excellent for storage, it is not ideal for
key lookup; encoding the length into a temporary buffer just to decode it
immediately afterwards is wasteful. Thus, we need support for heterogeneous
lookup in our hash map: store the keys as ImmutableString in the hash map, but
look them up using simple pointer/length pairs. As long as both can be hashed,
and they can be compared to each other, this should work fine.
(robin_hood::unordered_flat_map supports heterogeneous find using transparent
functors.)

MEM_ROOT extensions
===================

We need a few extensions to MEM_ROOT to use our allotted amount of RAM as
efficiently as possible. (We could have implemented something outside the
MEM_ROOT, but a lot of its infrastructure is still relevant for us.) In
particular, we need a way of getting at unaligned memory; Alloc() always
returns 8-aligned memory so that it can fit general types. Furthermore, it is
useful for us to get a memory chunk and then use less of it than we planned;
this lets us pack straight into the MEM_ROOT with no intermediate string, or
encode a key, try to insert it into the map, and then immediately deallocate
it if it already existed.

To this end, we extend the MEM_ROOT with three new low-level operations. Using
them will be incompatible with regular Alloc() use, so this kind of MEM_ROOT
is an all-or-nothing deal.[1] These operations are:

- Peek: Look at the current chunk (start and end pointers), so that we can
  write into it as we wish.
- ForceNewBlock: Allocate a new, larger block.
- Commit: Claim the first N bytes of the current chunk as used.

So whenever we want to insert a key or value, we peek to get the pointer and
see that there is enough room to write our key under conservative estimates
(if not, we ask the MEM_ROOT to allocate a new and larger chunk); we then
write it, see how long it became, commit that memory and try to insert. If the
insert failed because the key already existed, we uncommit, so that the memory
can be used for something else. A similar process is used for the values,
except that we never need to uncommit those.
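Below is a toy, self-contained sketch of that insert flow. The arena models
only the three operations above and is not the real MEM_ROOT; a std::set of
string_views stands in for the hash table, and all other names are
hypothetical:

  #include <algorithm>
  #include <cstddef>
  #include <cstdio>
  #include <cstring>
  #include <set>
  #include <string_view>
  #include <utility>
  #include <vector>

  // Toy arena modeling only Peek/ForceNewBlock/Commit.
  class ToyArena {
   public:
    // Peek: expose the unused part of the current block (may be empty).
    std::pair<char *, char *> Peek() const { return {m_cur, m_end}; }

    // ForceNewBlock: start a new, larger block; old blocks stay valid.
    void ForceNewBlock(size_t minimum_size) {
      const size_t grown = m_blocks.empty() ? 64 : m_blocks.back().size() * 2;
      m_blocks.emplace_back(std::max(minimum_size, grown));
      m_cur = m_blocks.back().data();
      m_end = m_cur + m_blocks.back().size();
    }

    // Commit: claim the first N bytes of the current block as used.
    void Commit(size_t bytes) { m_cur += bytes; }

   private:
    std::vector<std::vector<char>> m_blocks;
    char *m_cur = nullptr;
    char *m_end = nullptr;
  };

  // Write the key into the arena, then try to insert a reference to it into
  // the "hash table". Only Commit on success; on a duplicate key, the bytes
  // stay uncommitted and are simply overwritten by the next write.
  bool StoreKeyIfNew(ToyArena *arena, std::set<std::string_view> *table,
                     std::string_view key) {
    const size_t worst_case = key.size();  // conservative size estimate
    std::pair<char *, char *> block = arena->Peek();
    if (block.first == nullptr ||
        static_cast<size_t>(block.second - block.first) < worst_case) {
      arena->ForceNewBlock(worst_case);
      block = arena->Peek();
    }
    memcpy(block.first, key.data(), key.size());  // "pack" the key
    const size_t actual_size = key.size();        // here: equal to the estimate
    if (!table->emplace(block.first, actual_size).second) {
      return false;  // already existed; do not Commit, memory gets reused
    }
    arena->Commit(actual_size);  // the table now points into the arena
    return true;
  }

  int main() {
    ToyArena arena;
    std::set<std::string_view> table;
    const bool first = StoreKeyIfNew(&arena, &table, "foo");
    const bool duplicate = StoreKeyIfNew(&arena, &table, "foo");
    const bool other = StoreKeyIfNew(&arena, &table, "bar");
    std::printf("%d %d %d\n", first, duplicate, other);  // prints 1 0 1
    return 0;
  }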
We also need some adjustments so that the MEM_ROOT is able to allocate right
up to its set maximum capacity. Combined with the fact that we now actually
use the maximum-capacity toggle instead of checking the number of used bytes
after the fact, this effectively fixes BUG#99933, “In-memory hash join will
only use two-thirds of join buffer” (although the bug title is somewhat too
conservative in its estimates).

Note that the current code fairly strongly assumes that even if insertion
reports “buffer full”, the row still got inserted (and thus, one can always
insert at least one row, no matter how small the join buffer). Instead of
trying to rework this, we simply insert that last row into a tiny “overflow”
MEM_ROOT, which is then immediately cleared after the chunk is written to
disk. This seems to work well in practice, even if it occasionally means an
extra malloc call.

[1] The hash map itself will _not_ be allocated on the MEM_ROOT anymore, but
on the regular heap, as it is wasteful to store old versions of it when it
grows. When it grows, we will adjust memory allowances for the MEM_ROOT
separately, so that the total cannot grow much above our limit. (It can grow
so that the total usage becomes too high, but we will then immediately spill
the chunk to disk afterwards.)

Performance
===========

We've run some sparse benchmarks, including on the DBT-3-derived case Alibaba
presented at Percona Live (which is pretty much the worst case for us compared
to other approaches: small, fixed-length integer keys that are only counted up
afterwards). These numbers are from an earlier prototype, but as a rough
approximation, it appears that for the spill-to-disk case, we win 12%–70%, and
for the fully in-memory case, we win 10%–25%. (This is full query time, so it
includes a large amount of overhead that is not related to the hash table.)
The bigger wins are of course when we can avoid spilling to disk, or even more
importantly, avoid reading each disk chunk multiple times. See the QA Notes
section for more up-to-date benchmarks.

After this change, the hash table operations themselves take roughly 10–15% of
CPU query time.