WL#13459: Optimize hash table in hash join

Affects: Server-8.0   —   Status: Complete

We intend to switch the hash table used for hash join from 
mem_root_unordered_multimap to a different implementation 
with our own multimap adapter on top. This is to get

 1. A faster hash table.
 2. Less memory usage: Smaller hash table overhead, less space wasted on
    alignment and key/value lengths, better memory usage with many equal keys.
 3. Better memory control: Getting closer to the allowed join buffer size,
    and also the ability to free the old memory when the hash table grows.

Note that #2 means we need to spill to disk less often.
No new functional requirements.

NF1. Hash join should be faster than before in most situations.

This worklog is limited in scope to improvements related to switching from
std::unordered_multimap to a more efficient (in terms of time and memory)
hash table; it does not intend to fix every possible inefficiency in our
hash joins.

In other words, we keep the basic architecture of having a hash table of
string -> {string} (where {string} is a set of variable-length strings),
but attempt to reduce the CPU and memory usage. Reducing memory usage
is of particular concern, as using too much memory causes us
to have to spill to disk -- and even worse, re-reading the same chunks from
disk multiple times if we cannot fit them into RAM.

For future reference, here are a few issues that would be good
to investigate, but that are out-of-scope for this worklog:

 - Increasing the default join buffer size from 256 kB (it was set so low
   at a point where RAM was much less plentiful, and we did not have
   incremental allocation).
 - Removing overhead related to translation between many different formats
   (we use a lot of time unpacking from InnoDB to MyISAM row format and into
   fields before storing the row in hash join format, Item_field overhead,
   Field::val_int() overhead, etc.).
 - Special-casing for fixed-length keys and values, which potentially can
   save memory for edge cases involving very few/short fields (saving pointer
   overhead).
 - Lifting the 128-file limitation, which hurts us severely in some cases.
 - Replacing the current pseudo-BNL (reading the same chunk file from disk
   multiple times when it grows too big to keep all in memory) with
   subsharding, compounding the previous point.
 - Issues when we misestimate the number of chunk files needed.
 - That we often store columns in the hash table that we don't need, since
   we don't track precise read sets throughout the iterator tree (BUG#99934,
   “Hash join adds columns to the hash table that is not needed.”).
 - That our key storage is inefficient, e.g. we store 8 bytes even for a
   4-byte int.

Choice of hash table

Hash tables are an active area of research. In particular, there's the STL
implementations (libstdc++, libc++, MSVC being the major ones), Google's
Swiss tables (part of Abseil), Facebook's F14 (part of Folly), and Malte
Skarupke's flat_hash_map and bytell_hash_map. (There are many more.)
After some preliminary benchmarking, we decided on robin_hood::flat_hash_map;
it has few dependencies, is reasonably time-efficient for insertions and
lookups (we never erase), is appropriately licensed, and has fairly low
memory overhead. (Skarupke's flat_hash_map benchmarked better in
microbenchmarks, but no better in the actual hash join -- and uses more memory.)

As with most C++ templated
data structures, it does not support variable-length keys and values well,
and thus, we must generally use pointers to values instead of storing the
values inline in the map; supporting inline keys and values would be
interesting in some situations, to save on the pointer overhead. (This is
perhaps the biggest downside of not writing our own, although I would argue
that type-safety and saving on implementation costs easily outweigh the extra
performance we'd get with very short keys and values, which most often shows
up in unrealistically narrow benchmark situations such as DBT-3.)

For a pointer -> pointer map, flat_hash_map needs 17 bytes per bucket:
64 bits for the key pointer, 64 bits for the value pointer, and one control byte
for marking empty slots and a few other things. (For comparison,
unordered_multimap, which we currently use,
needs 48 bytes per node, and then some more in the actual map.)
Of course, on top of this, we must find a way of storing strings efficiently.

Storing multistrings

Only the STL implementations support multimaps, i.e., multiple values for
the same key. This is essential for implementing SQL joins, except for
semi- and antijoins, which by nature only need one candidate value
(except when there are join keys that cannot be represented in the hash
and need to be checked separately).

We also need to store the lengths of the keys. (The values have implicit
or explicit lengths from the packing, which are opaque to us -- currently, we
store the total length, but don't actually use it for anything.) We will
approach these two problems separately but similarly.

We assume that the key will be a pointer into a MEM_ROOT. There have been
brief experiments with short-string optimizations and tagged pointers
(e.g., if the pointer is invalid in a certain way, the first byte denotes
the length and the remaining 7 the key), and while it saves some memory,
it does not appear to be a big win in general, so it will be out-of-scope
for this WL.

In the current implementation, we store the length of the key together
with the pointer, i.e., in a string_view-like structure (64 bits for the
pointer, 64 bits for the length). This is wasteful, given that most keys
are typically well below 16 kB. Instead, in this implementation, we will
store the length as a Varint128 in the beginning of the buffer, using the
protobuf library that we already link to. This means that any length
<= 127 bytes (2^7 - 1) will be stored as one byte, any <= 16383 bytes
(2^14 - 1) will be stored as two, and so on. The key immediately follows.

  | len (usually 1-2 bytes) | key                  |
    0x03                      'f' 'o' 'o'

We don't have any alignment requirements, so we are free to pack these
keys as tightly as we want in the MEM_ROOT. (The MEM_ROOT will be extended
with methods for unaligned allocation for this purpose, so that we don't
waste 0–7 bytes for each key to alignment.) We'll encapsulate encoding and
decoding of this kind of string into a class, which will be called
ImmutableString. (It is, as the name implies, not intended for manipulation.)

As previously mentioned, the values do not need to store lengths. However,
we need a way to store multiple values in the same key slot. To this
end, we choose to link them together in a linked list; whenever we want
to insert an extra element for an already-existing key, we replace the
element currently in the bucket with a new one that contains both the value
and a pointer to the next value. Normally linked lists are bad for
performance due to the pointer chasing, but we do so much work
(decode it into buffers, check extra conditions, run the upstream iterators,
send data to the client, etc.) per row that having ~20 cycles of latency
means absolutely nothing in the big picture.

Linked lists are also typically space-inefficient in that we need to store
an extra pointer, but we can invoke a trick here: Since we know that all
the values are allocated on the same MEM_ROOT, their pointers are likely
to be very close to each other. Thus, we can store only the difference
between this value and the next value, again Varint128-encoded. (We use
“zig-zag” encoding, so that negative values are also encoded efficiently.)
In the vast majority of cases, this means we can get the “next” pointer
encoded in only one or two bytes instead of 8, again a large win.
(For the degenerate case of a cross product join, we will get only a single
entry in the hash table, and then a long linked list with only one byte
overhead for nearly all values -- extremely efficient.) The special
pointer difference of 0 means that this is the last entry (first inserted);
it would be nice to use less than one byte for this common case somehow,
but it is not a huge issue. So the values, encapsulated in a separate
class called LinkedImmutableString, may look something like this:

     0x100000                                0x100030
   | ptr diff | value       | ...others... | ptr diff | value       |
   | 0x00     | 'b' 'a' 'r' | ............ |   0x5f   | '1' '2' '3' |

0x5f here is -48, zigzag-encoded. Note that if we stored the difference
at the end of the string instead of at the beginning we could get somewhat
smaller pointer differences on average and occasionally save a byte,
at the cost of extra complexity. For instance, we could allocate from
the back of the MEM_ROOT instead of from the front, and use the fact that
unpack gives us the length of the value:

     0x100000                                0x100030
   | value       | ptr diff | ...others... | value       | ptr diff |
   | '1' '2' '3' | 0x5a     | ............ | 'b' 'a' 'r' |   0x00   |

This was deemed not worth the complexity.

Note that while this structure is excellent for storage, it is not ideal
for key lookup; encoding the length into a temporary buffer just to decode
it immediately afterwards is wasteful. Thus, we need support for heterogeneous
lookup in our hash map: store the keys as ImmutableString in the hash map,
but look them up using simple pointer/length pairs. As long as both can be
hashed, and they can be compared to each other, this should work fine.
(flat_hash_map supports heterogeneous find using transparent functors.)

MEM_ROOT extensions

We need a few extensions to MEM_ROOT to use our allotted amount of RAM
as efficiently as possible. (We could have implemented something outside
the MEM_ROOT, but a lot of its infrastructure is still relevant for us.)
In particular, we need a way of getting at unaligned memory; Alloc()
always returns 8-aligned memory so that it can fit general types.

Furthermore, it is useful for us to get a memory chunk and then use less
of it than we planned. This is so that we can pack straight into
the MEM_ROOT with no intermediate string, or so that we can encode a key,
try to insert it into the map, and then immediately deallocate it if it
already existed.

To this end, we extend the MEM_ROOT with three new low-level operations.
Using them will be incompatible with regular Alloc() use, so this kind of
MEM_ROOT is an all-or-nothing deal.[1] These operations are:

 - Peek: Look at the current chunk (start and end pointers), so that we
   can write into it as we wish.
 - ForceNewBlock: Allocate a new, larger block.
 - Commit: Claim the first N bytes of the current chunk as used.

So whenever we want to insert a key or value, we peek to get the pointer
and check that there is enough room to write our key under conservative
estimates (if not, we ask the MEM_ROOT to allocate a new and larger chunk);
we then write it, see how long it became, commit that memory and try to
insert. If the insert failed due to already existing, we uncommit so that
the memory can be used for something else. A similar process is used for
the value, except that we never need to uncommit those.

We also need some adjustments so that the MEM_ROOT is able to allocate
up to exactly its allowed capacity. Combining this with the fact that
we now actually use the maximum capacity toggle instead of checking the
number of used bytes after-the-fact effectively fixes BUG#99933,
“In-memory hash join will only use two-thirds of join buffer” (although
the bug title is somewhat too conservative in its estimates).

Note that the current code fairly strongly assumes that even if insertion
reports “buffer full”, the row still got inserted (and thus, one can always
insert at least one row, no matter how small the join buffer). Instead of
trying to rework this, we simply insert that last row into a tiny “overflow”
MEM_ROOT, which is then immediately cleared after the chunk is written to
disk. This seems to work well in practice, even if it occasionally means
an extra malloc call.

[1] The hash map itself will _not_ be allocated on the MEM_ROOT anymore,
    but on the regular heap, as it is wasteful to store old versions of it
    when it grows. When it grows, we will adjust memory allowances for the
    MEM_ROOT separately, so that the total cannot grow much above our limit.
    (It can grow so that the total usage becomes too high, but we will then
    immediately spill the chunk to disk afterwards.)


We've run some sparse benchmarks, including on the DBT-3-derived case Alibaba
presented at Percona Live (which is pretty much the worst case for us
compared to other approaches; small, fixed-length integer keys that are only
counted up afterwards); these numbers are from an earlier prototype, but as a
rough approximation, it appears that for the
spill-to-disk case, we win 12%–70%, and for the fully in-memory case, we win
10%–25%. (This is full query time, so it includes a large amount of overhead
that is not related to the hash table.) The bigger wins are of course when we
can avoid spilling to disk, or even more importantly, avoid reading each disk
chunk multiple times. See the QA Notes section for more up-to-date benchmarks.

After this, the hash table operations themselves take roughly 10–15% of CPU
query time.