WL#8741: Varlen keys for sorting JSON values
Affects: Server-8.0 — Status: Complete — Priority: Medium
WL#8539 allowed to sort JSON values. However, in order to limit scope and reduce various risks it was done in suboptimal way. This WL aims to provide better performance for sorting/grouping JSON values by using variable length sort keys for them. Preliminary benchmarks shows from 20% to 18 times improvement in sorting, depending on use case.
Functional requirements ----------------------- .) Sorting by JSON values should be from noticeable to significantly faster Non-functional requirements --------------------------- .) No loss in performance for sorting non-json data .) Order of data should be the same as produced by WL#8539, with exception to same numbers, but in different format, e.g 1 and 1.000. Such numbers could have different order, compared to WL#8539. Reason is that internally all number are unified and become indistnguishable. Due to slightly different approach to sorting they might change places in the sort result. Same order also applies to SQL NULLs, JSON NULLs, JSON arrays an objects.
After WL#8539 each JSON value in the sort key is represented by a fixed length key part, each such keypart takes 1K of memory. In most cases this is excessive, e.g an integer scalar takes just few bytes, the rest is occupied with padding. To address that, this WL introduces variable length key parts. Each key part takes only space that is needed to store the value. This have two consequences that affects performance in good way: .) Sort buffer space is used more effectively, so filesort doesn't have to flush it to disk early. This leads that much more data could be sorted completely in memory. .) Shorter keys are faster to compare. This allows to noticeably improve performance even for in-memory sorts. This optimization could be also applied to regular strings or text blobs sorting, but to minimize risk of regressions, it'll be applied only to JSON values. Major changes introduces in this WL: 1) New format for sort key of a JSON field: <4 bytes:key len><key len bytes: sort key> The sort key is generated by the Field_json::make_sort_key(). 2) Field_json::make_sort_key() doesn't pad the sort key till max sort key length. Instead it creates the key and saves its length to the first 4 bytes of the sort key. 3) New comparator based on memcmp() to compare varlen sort keys. One implementation is a part of Merge_chunk_less's () operator and is used for on-disk sorting. Another implementation is a new class Mem_compare_varlen and it's used for in-memory sorting. Benchmarking results ==================== This is the result of comparing performance of the WL implementation vs mysql-5.7 as of 8 Jul 2015. Settings -------- Sort buffer is set to 1M, crossbench was used with following settings Test selection --db-driver=mysql --test=oltp Size of set to be sorted --oltp-table-size=<table size> --oltp-range-size=<table size> Run only sorting queries --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-distinct-ranges=0 --oltp-order-ranges=1 --oltp-skip-trx=on Use JSON data without indexes --json=on --mysql-skip-key=on Misc --max-requests=1000 --max-time=0 --num-threads=16 --oltp-test-mode=complex --oltp-read-only=on --db-ps-mode=auto The query being run looks like following: SELECT JSON_EXTRACT(data, '$.c') FROM %s where JSON_EXTRACT(data,'$.id') between ? and ? ORDER BY JSON_EXTRACT(data, '$.c') i.e it uses one-part sort key. Would the sort key consist of many JSON-typed key parts performance improvement would be bigger, due to better sort buffer utilization. Results (in tps) ---------------- table size: 1K rows 5.7 wl8741 2036.50 2121.49 1975.73 2037.58 2048.49 2038.60 ---------------------------------------- ~2016 ~2065 WL is 2.5% faster table 10K rows 5.7 wl8741 172.16 237.91 171.22 238.42 174.63 238.50 ---------------------------------------- ~172 ~238 WL is 38% faster table 50K rows 5.7 wl8741 27.06 45.54 26.84 45.35 27.79 45.77 ---------------------------------------- ~27 ~45 WL is 66% faster
/** There are several record formats for sorting: |@<key a@>@<key b@>...|@<rowid@>| / m_fixed_sort_length / ref_len / or with "addon fields" |@<key a@>@<key b@>...|@<null bits@>|@<field a@>@<field b@>...| / m_fixed_sort_length / addon_length / The packed format for "addon fields" |@<key a@>@<key b@>...|@<length@>|@<null bits@>|@<field a@>@<field b@>...| / m_fixed_sort_length / addon_length / All the formats above have fixed-size keys, with appropriate padding. Fixed-size keys can be compared/sorted using memcmp(). The packed (variable length) format for keys: |@<keylen@>|@<varkey a@>@<key b@>...@<hash@>|@<rowid@> or @<addons@> | / 4 bytes / keylen bytes / ref_len or addon_length / This format is currently only used if we are sorting JSON data. Variable-size keys must be compared piece-by-piece, using type information about each individual key part, @see cmp_packed_keys. All the record formats consist of a (possibly composite) key, followed by a (possibly composite) payload. The key is used for sorting data. Once sorting is done, the payload is stored in some buffer, and read by some rr_from or rr_unpack routine. For fixed-size keys, with @<rowid@> payload, the @<rowid@> is also considered to be part of the key. @<key@> Fields are fixed-size, specially encoded with Field::make_sort_key() so we can do byte-by-byte compare. @<length@> Contains the *actual* packed length (after packing) of everything after the sort keys. The size of the length field is 2 bytes, which should cover most use cases: addon data <= 65535 bytes. This is the same as max record size in MySQL. @<null bits@> One bit for each nullable field, indicating whether the field is null or not. May have size zero if no fields are nullable. @<field xx@> Are stored with field->pack(), and retrieved with field->unpack(). Addon fields within a record are stored consecutively, with no "holes" or padding. They will have zero size for NULL values. @<keylen@> Contains the *actual* packed length of all the keys. We may have an arbitrary mix of fixed and variable-sized keys. @<hash@> Optional 8 byte hash, used for GROUPing of JSON values. @<varkey@> Used for JSON values, the format is: |@<null value@>|@<key length@>|@<JSON sort key@> | / 1 byte / 4 bytes / key length bytes / */
Copyright (c) 2000, 2017, Oracle Corporation and/or its affiliates. All rights reserved.