WL#8539 allowed to sort JSON values. However, in order to limit scope and reduce
various risks it was done in suboptimal way. This WL aims to provide better
performance for sorting/grouping JSON values by using variable length sort keys
for them. Preliminary benchmarks shows from 20% to 18 times improvement in
sorting, depending on use case.
.) Sorting by JSON values should be from noticeable to significantly faster
.) No loss in performance for sorting non-json data
.) Order of data should be the same as produced by WL#8539, with exception
to same numbers, but in different format, e.g 1 and 1.000. Such numbers could
have different order, compared to WL#8539. Reason is that internally all
number are unified and become indistnguishable. Due to slightly
different approach to sorting they might change places in the sort result.
Same order also applies to SQL NULLs, JSON NULLs, JSON arrays an objects.
After WL#8539 each JSON value in the sort key is represented by a fixed length
key part, each such keypart takes 1K of memory. In most cases this is
excessive, e.g an integer scalar takes just few bytes, the rest is occupied
with padding. To address that, this WL introduces variable length key parts.
Each key part takes only space that is needed to store the value. This have
two consequences that affects performance in good way:
.) Sort buffer space is used more effectively, so filesort doesn't have to
flush it to disk early. This leads that much more data could be sorted
completely in memory.
.) Shorter keys are faster to compare. This allows to noticeably improve
performance even for in-memory sorts.
This optimization could be also applied to regular strings or text blobs
sorting, but to minimize risk of regressions, it'll be applied only to JSON
Major changes introduces in this WL:
1) New format for sort key of a JSON field:
<4 bytes:key len>
The sort key is generated by the Field_json::make_sort_key().
2) Field_json::make_sort_key() doesn't pad the sort key till max sort key
length. Instead it creates the key and saves its length to the first 4 bytes
of the sort key.
3) New comparator based on memcmp() to compare varlen sort keys. One
implementation is a part of Merge_chunk_less's () operator and is used for
on-disk sorting. Another implementation is a new class Mem_compare_varlen and
it's used for in-memory sorting.
This is the result of comparing performance of the WL implementation vs
mysql-5.7 as of 8 Jul 2015.
Sort buffer is set to 1M, crossbench was used with following settings
Size of set to be sorted
Run only sorting queries
Use JSON data without indexes
The query being run looks like following:
SELECT JSON_EXTRACT(data, '$.c')
FROM %s where JSON_EXTRACT(data,'$.id') between ? and ?
ORDER BY JSON_EXTRACT(data, '$.c')
i.e it uses one-part sort key. Would the sort key consist of many JSON-typed
key parts performance improvement would be bigger, due to better sort buffer
Results (in tps)
table size: 1K rows
~2016 ~2065 WL is 2.5% faster
table 10K rows
~172 ~238 WL is 38% faster
table 50K rows
~27 ~45 WL is 66% faster
There are several record formats for sorting:
/ m_fixed_sort_length / ref_len /
or with "addon fields"
/ m_fixed_sort_length / addon_length /
The packed format for "addon fields"
/ m_fixed_sort_length / addon_length /
All the formats above have fixed-size keys, with appropriate padding.
Fixed-size keys can be compared/sorted using memcmp().
The packed (variable length) format for keys:
|@|@@...@|@ or @ |
/ 4 bytes / keylen bytes / ref_len or addon_length /
This format is currently only used if we are sorting JSON data.
Variable-size keys must be compared piece-by-piece, using type information
about each individual key part, @see cmp_packed_keys.
All the record formats consist of a (possibly composite) key,
followed by a (possibly composite) payload.
The key is used for sorting data. Once sorting is done, the payload is
stored in some buffer, and read by some rr_from or rr_unpack routine.
For fixed-size keys, with @ payload, the @ is also
considered to be part of the key.
@ Fields are fixed-size, specially encoded with
Field::make_sort_key() so we can do byte-by-byte compare.
@ Contains the *actual* packed length (after packing) of
everything after the sort keys.
The size of the length field is 2 bytes,
which should cover most use cases: addon data <= 65535 bytes.
This is the same as max record size in MySQL.
@ One bit for each nullable field, indicating whether the field
is null or not. May have size zero if no fields are nullable.
@ Are stored with field->pack(), and retrieved with
Addon fields within a record are stored consecutively, with no
"holes" or padding. They will have zero size for NULL values.
@ Contains the *actual* packed length of all the keys.
We may have an arbitrary mix of fixed and variable-sized keys.
@ Optional 8 byte hash, used for GROUPing of JSON values.
@ Used for JSON values, the format is:
/ 1 byte / 4 bytes / key length bytes /
Copyright (c) 2000, 2023, Oracle Corporation and/or its affiliates. All rights reserved.