WL#8539: Ordering of scalar JSON values
When ordering scalar JSON values with ORDER BY, they should be returned in the order defined by the JSON comparator in WL#8249. This worklog will implement a function that produces the sort keys filesort needs in order to sort JSON values.
Functional requirements ----------------------- F1: If the ORDER BY list contains an expression with type JSON, the result should order the rows according to the rules in WL#8249 for all rows in which the expression evaluates to a scalar. F2: If the order is ascending (ORDER BY col ASC), SQL NULL (unknown) is ordered before all JSON values, including the JSON null literal. If the order is descending (ORDER BY col DESC), SQL NULL (unknown) is ordered after all JSON values. F3: On attempt to sort non-scalar JSON the ER_NOT_SUPPORTED_YET warning is issued Non-functional requirement -------------------------- NF1: For scalar values, the result of grouping when only filesort is used for grouping should be consistent with the result of grouping when tmp table is used for grouping.
This WL will add to server ability to sort scalars stored in JSON docs. In general, it's advised to cast JSON to some other native mysql type and use it for sorting, e.g. use ORDER BY CAST(JSN_EXTRACT(...) as SIGNED) instead of ORDER BY JSN_EXTRACT(...) as SIGNED. We will add a new function (make_sort_key()) to the Json_wrapper class. This function produces a sort key that can be used by filesort to order JSON values correctly. Filesort creates sort keys in Sort_param::make_sortkey(). We will update this function so that it uses Json_wrapper::make_sort_key() to produce sort keys for fields and expressions with type JSON. There will be some limitations: - Heavy memory consumption. Because of schemaless nature it's impossible to predict actual content and length of the key in general case. Due to that sort key for each JSON typed element in ORDER BY clause will take max_sort_length (default: 1024) bytes. Values shorter than this limit will be padded, longer than limit - trimmed. - Internal ordering of JSON objects and JSON arrays will not be handled by this worklog. When user will try to sort objects/arrays, a warning will be issued telling that it's not supported yet (ER_NOT_SUPPORTED_YET). This limitation will be lifter in 5.8 in scope of another WL. For now, JSON arrays/objects will be sorted based on their length (number of members of objects/number of elements in arrays). Shorter JSON arrays will sort before longer JSON arrays, and shorter JSON objects sort before longer JSON objects. JSON arrays with the same number of elements sort equal. JSON objects with the same number of members sort equal. Apart from the above limitations, ordering of JSON scalar values will follow the order defined in WL#8249. JSON strings use character set utf8mb4 and will be ordered by binary comparison. If one wants to order using a different collation, one will have to cast the JSON string values to SQL strings with the desired character set and collation. GROUP BY also uses filesort and will be affected by the changes in this worklog in the following way: - Since GROUP BY sorts the results before returning them, as if there had been an ORDER BY clause that mentioned the grouping columns, the results of a GROUP BY operation on a JSON column or JSON expression will be ordered as described above. - GROUP BY WITH ROLLUP uses filesort to do the actual grouping of the results. It will be able to group scalar JSON values correctly using the definition of equality from WL#8249. It will not produce any sensible grouping for arrays and objects because of the limited filesort support for those types. - Similarly, GROUP BY in combination with the SQL_BIG_RESULT keyword will use filesort for grouping, and will therefore not produce any sensible grouping for arrays and objects.
Filesort requires that the sort keys are created in a way that makes it possible to compare and sort them using memcmp(). These sort keys are produced by Sort_param::make_sortkey(). When making a sort key for a field, Sort_param::make_sortkey() delegates the creation of the sort key to the virtual function Field::make_sort_key(). Otherwise, if it is making a sort key for the result of an expression, it checks the result type of the expression and generates the sort key based on the result type. We will make the Field_json class override the make_sort_key() function. Sort_param::make_sortkey() will automatically pick this up when ordering on JSON fields. In order to make Sort_param::make_sortkey() also order JSON expressions (that are not fields) correctly, it needs to be updated to check if an expression returns JSON. Expressions that return JSON, have result type STRING_RESULT. The case for STRING_RESULT will therefore need to check field_type() and have special logic if the field type is MYSQL_TYPE_JSON. Both in the case of ordering on fields and ordering on expressions, the actual creation of the sort key will be delegated to a new function in the Json_wrapper class, with the following signature: /** Make a sort key that can be used by filesort to order JSON values. @param[out] to a buffer to which the sort key is written @param[in] length the length of the sort key */ void make_sort_key(uchar *to, size_t length) const; Sort keys are fixed length, and filesort will look at the entire sort key. If the JSON value doesn't have enough data to fill the sort key, make_sort_key() will pad the sort key with '\0' characters. Otherwise, the random garbage at the end of the sort key could have made equal values sort as unequal. The sort key produced by Json_wrapper::make_sort_key() starts with one byte that identifies the type of the JSON value. It is constructed so that types that should order first have a smaller type identifier than those that should order last. The following type bytes are used: #define JSON_KEY_NULL '\x00' #define JSON_KEY_NUMBER_NEG '\x01' #define JSON_KEY_NUMBER_ZERO '\x02' #define JSON_KEY_NUMBER_POS '\x03' #define JSON_KEY_STRING '\x04' #define JSON_KEY_OBJECT '\x05' #define JSON_KEY_ARRAY '\x06' #define JSON_KEY_FALSE '\x07' #define JSON_KEY_TRUE '\x08' #define JSON_KEY_DATE '\x09' #define JSON_KEY_TIME '\x0A' #define JSON_KEY_DATETIME '\x0B' #define JSON_KEY_OPAQUE '\x0C' After the type byte comes a representation of the value. The format depends on the type: - The literals true/false/null have only one possible value each, so all the information needed is contained in the type byte. - Similarly, arrays and objects have no more data than the type byte, since they have no internal order currently. - DATE, TIME, DATETIME/TIMESTAMP values are converted to the packed format and copied into the sort key with the copy_integer() function, which transform integers to a memcmp() compatible format. - Strings are copied into the sort key in their original utf8mb4 encoding. They are truncated to fit in the sort key. Also, they are padded with trailing '\0' characters if they are shorter than the sort key. The last four bytes of the sort key represent the length of the string, so that longer strings sort higher than shorter strings if they cannot be distinguished by the other parts of the sort key. - Opaque values are stored in the same way as strings, except that they have one byte that represents their field type (a enum_field_types value) before the binary string. This makes the opaque values sorted according to their field type before their contents. - Numbers are split into three distinct types: negative numbers, zero and positive numbers. Obviously, negative numbers come before zero, which comes before positive numbers. If the number is zero, only the type identifier is put into the sort key, since all zeros are equal, so no more data is needed to distinguish between them. Otherwise, the sort key will contain the following extra pieces of data: First, the decimal exponent of the number, represented as a two-byte integer copied into the sort buffer using copy_integer() to make it memcmp()-comparable. Since the decimal exponent comes first, the numbers will first be sorted according to their order of magnitude. Two bytes is sufficient to store the decimal exponents of all numbers supported by the JSON numbers (double precision numbers can have exponents from -323 to +308). Next, all the digits of the numbers are copied into the sort key, most significant digit first. This way, numbers with the same decimal exponent will be ordered on their digits with decreasing significance. Finally, the sort key is padded with character '0' so that the number of trailing zeros doesn't affect ordering (it makes sure that 1.1 and 1.10 are sorted equal). If the number is negative, the exponent, the digits and the padding are all inverted, so that larger negative numbers (those closer to negative infinity) are sorted before smaller negative numbers (those closer to zero). Some example sort keys: The JSON null literal will have a sort key that looks like this: '\x00' '\0' .... '\0' type padding And the JSON false literal will have this sort key: '\x07' '\0' .... '\0' type padding The positive numbers 123, 1.23e2 and 123.000 will all have the following sort key: '\x03' '\x80' '\x02' '1' '2' '3' '0' '0' .... '0' type decimal exponent mantissa zero-padding Note that the exponent is binary coded, whereas the mantissa is text. Note also that the decimal exponent is represented as 0x8002, not as 0x0002. This is because copy_integer(), which converts integers to memcmp() compatible format, flips the sign bit in order to make sure negative numbers sort before positive numbers. The negative numbers -123, -1.23e2 and -123.000 all have the following sort key: '\x01' '\x7F' '\xFE' '8' '7' '6' '9' '9' .... '9' type decimal exponent mantissa padding Note that the exponent, mantissa and padding have been inverted in order to make sure larger negative numbers sort before smaller negative numbers. The string "abc" gets this sort key: '\x04' 'a' 'b' 'c' '\0' '\0' ... '\0' '\0' '\0' '\0' '\x03' type string padding length The string "abc\u0000" will have a similar sort key, only differing in the length at the end of the key: '\x04' 'a' 'b' 'c' '\0' '\0' ... '\0' '\0' '\0' '\0' '\x04' type string padding length --- One note on GROUP BY: Filesort-based GROUP BY uses filesort to collect the rows into groups, and it uses Cached_item::cmp() to calculate where one group ends and the next one begins. To make it calculate the boundaries of groups of JSON values correctly, we need to add a new subclass of Cached_item which implements a cmp() function that compares values according to the rules of JSON values. We will therefore add a new class, Cached_item_json, which extends Cached_item, and whose cmp() function compares JSON values using Json_wrapper::compare(). This enables filesort-based GROUP BY to find the boundaries of groups of scalar JSON values.
Copyright (c) 2000, 2022, Oracle Corporation and/or its affiliates. All rights reserved.