WL#8539: Ordering of scalar JSON values

Status: Complete

Description
Requirements
High Level Architecture
Low Level Design

When ordering scalar JSON values with ORDER BY, they should be returned in the
order defined by the JSON comparator in WL#8249.

This worklog will implement a function that produces the sort keys filesort
needs in order to sort JSON values.

Functional requirements
-----------------------
F1: If the ORDER BY list contains an expression with type JSON, the result
should order the rows according to the rules in WL#8249 for all rows in which
the expression evaluates to a scalar.

F2: If the order is ascending (ORDER BY col ASC), SQL NULL (unknown) is ordered
before all JSON values, including the JSON null literal. If the order is
descending (ORDER BY col DESC), SQL NULL (unknown) is ordered after all JSON values.

F3: On attempt to sort non-scalar JSON the ER_NOT_SUPPORTED_YET warning is
issued 

Non-functional requirement
--------------------------
NF1: For scalar values, the result of grouping when only filesort is used for
grouping should be consistent with the result of grouping when tmp table is used
for grouping.

This WL will add to server ability to sort scalars stored in JSON docs.
In general, it's advised to cast JSON to some other native mysql
type and use it for sorting, e.g. use ORDER BY CAST(JSN_EXTRACT(...) as SIGNED)
instead of ORDER BY JSN_EXTRACT(...) as SIGNED.

We will add a new function (make_sort_key()) to the Json_wrapper class. This
function produces a sort key that can be used by filesort to order JSON values
correctly.

Filesort creates sort keys in Sort_param::make_sortkey(). We will update this
function so that it uses Json_wrapper::make_sort_key() to produce sort keys for
fields and expressions with type JSON.

There will be some limitations:

- Heavy memory consumption. Because of schemaless nature it's impossible to
  predict actual content and length of the key in general case. Due to that
  sort key for each JSON typed element in ORDER BY clause will take 
  max_sort_length (default: 1024) bytes. Values shorter than this limit will 
  be padded, longer than limit - trimmed.

- Internal ordering of JSON objects and JSON arrays will not be handled by this
  worklog. When user will try to sort objects/arrays, a warning will be issued
  telling that it's not supported yet (ER_NOT_SUPPORTED_YET). This limitation
  will be lifter in 5.8 in scope of another WL.

  For now, JSON arrays/objects will be sorted based on their length (number of
  members of objects/number of elements in arrays). Shorter JSON arrays will sort
  before longer JSON arrays, and shorter JSON objects sort before longer JSON
  objects. JSON arrays with the same number of elements sort equal. JSON objects
  with the same number of members sort equal.

Apart from the above limitations, ordering of JSON scalar values will follow 
the order defined in WL#8249.

JSON strings use character set utf8mb4 and will be ordered by binary comparison. 
If one wants to order using a different collation, one will have to cast the JSON 
string values to SQL strings with the desired character set and collation.

GROUP BY also uses filesort and will be affected by the changes in this worklog
in the following way:

- Since GROUP BY sorts the results before returning them, as if there had been
an ORDER BY clause that mentioned the grouping columns, the results of a GROUP
BY operation on a JSON column or JSON expression will be ordered as described above.

- GROUP BY WITH ROLLUP uses filesort to do the actual grouping of the results.
It will be able to group scalar JSON values correctly using the definition of
equality from WL#8249. It will not produce any sensible grouping for arrays and
objects because of the limited filesort support for those types.

- Similarly, GROUP BY in combination with the SQL_BIG_RESULT keyword will use
filesort for grouping, and will therefore not produce any sensible grouping for
arrays and objects.

Filesort requires that the sort keys are created in a way that makes it possible
to compare and sort them using memcmp(). These sort keys are produced by
Sort_param::make_sortkey(). When making a sort key for a field,
Sort_param::make_sortkey() delegates the creation of the sort key to the virtual
function Field::make_sort_key(). Otherwise, if it is making a sort key for the
result of an expression, it checks the result type of the expression and
generates the sort key based on the result type.

We will make the Field_json class override the make_sort_key() function.
Sort_param::make_sortkey() will automatically pick this up when ordering on JSON
fields.

In order to make Sort_param::make_sortkey() also order JSON expressions (that 
are not fields) correctly, it needs to be updated to check if an expression 
returns JSON. Expressions that return JSON, have result type STRING_RESULT. The 
case for STRING_RESULT will therefore need to check field_type() and have 
special logic if the field type is MYSQL_TYPE_JSON.

Both in the case of ordering on fields and ordering on expressions, the actual
creation of the sort key will be delegated to a new function in the Json_wrapper
class, with the following signature:

  /**
    Make a sort key that can be used by filesort to order JSON values.

    @param[out] to      a buffer to which the sort key is written
    @param[in]  length  the length of the sort key
  */
  void make_sort_key(uchar *to, size_t length) const;

Sort keys are fixed length, and filesort will look at the entire sort key. If 
the JSON value doesn't have enough data to fill the sort key, make_sort_key() 
will pad the sort key with '\0' characters. Otherwise, the random garbage at the 
end of the sort key could have made equal values sort as unequal.

The sort key produced by Json_wrapper::make_sort_key() starts with one byte that 
identifies the type of the JSON value. It is constructed so that types that 
should order first have a smaller type identifier than those that should order 
last. The following type bytes are used:

#define JSON_KEY_NULL        '\x00'
#define JSON_KEY_NUMBER_NEG  '\x01'
#define JSON_KEY_NUMBER_ZERO '\x02'
#define JSON_KEY_NUMBER_POS  '\x03'
#define JSON_KEY_STRING      '\x04'
#define JSON_KEY_OBJECT      '\x05'
#define JSON_KEY_ARRAY       '\x06'
#define JSON_KEY_FALSE       '\x07'
#define JSON_KEY_TRUE        '\x08'
#define JSON_KEY_DATE        '\x09'
#define JSON_KEY_TIME        '\x0A'
#define JSON_KEY_DATETIME    '\x0B'
#define JSON_KEY_OPAQUE      '\x0C'

After the type byte comes a representation of the value. The format depends on 
the type:

- The literals true/false/null have only one possible value each, so all the 
information needed is contained in the type byte.

- Similarly, arrays and objects have no more data than the type byte, since they 
have no internal order currently.

- DATE, TIME, DATETIME/TIMESTAMP values are converted to the packed format and 
copied into the sort key with the copy_integer() function, which transform 
integers to a memcmp() compatible format.

- Strings are copied into the sort key in their original utf8mb4 encoding. They 
are truncated to fit in the sort key. Also, they are padded with trailing '\0' 
characters if they are shorter than the sort key. The last four bytes of the 
sort key represent the length of the string, so that longer strings sort higher 
than shorter strings if they cannot be distinguished by the other parts of the 
sort key.

- Opaque values are stored in the same way as strings, except that they have one 
byte that represents their field type (a enum_field_types value) before the 
binary string. This makes the opaque values sorted according to their field type 
before their contents.

- Numbers are split into three distinct types: negative numbers, zero and 
positive numbers. Obviously, negative numbers come before zero, which comes 
before positive numbers.

If the number is zero, only the type identifier is put into the sort key, since 
all zeros are equal, so no more data is needed to distinguish between them.

Otherwise, the sort key will contain the following extra pieces of data:

First, the decimal exponent of the number, represented as a two-byte integer 
copied into the sort buffer using copy_integer() to make it memcmp()-comparable. 
Since the decimal exponent comes first, the numbers will first be sorted 
according to their order of magnitude. Two bytes is sufficient to store the 
decimal exponents of all numbers supported by the JSON numbers (double precision 
numbers can have exponents from -323 to +308).

Next, all the digits of the numbers are copied into the sort key, most 
significant digit first. This way, numbers with the same decimal exponent will 
be ordered on their digits with decreasing significance.

Finally, the sort key is padded with character '0' so that the number of 
trailing zeros doesn't affect ordering (it makes sure that 1.1 and 1.10 are 
sorted equal).

If the number is negative, the exponent, the digits and the padding are all 
inverted, so that larger negative numbers (those closer to negative infinity) 
are sorted before smaller negative numbers (those closer to zero).


Some example sort keys:

The JSON null literal will have a sort key that looks like this:

'\x00' '\0' .... '\0'
type      padding

And the JSON false literal will have this sort key:

'\x07' '\0' .... '\0'
type      padding

The positive numbers 123, 1.23e2 and 123.000 will all have the following sort 
key:

'\x03'  '\x80' '\x02'     '1' '2' '3'    '0' '0' .... '0'
type   decimal exponent    mantissa        zero-padding

Note that the exponent is binary coded, whereas the mantissa is text. Note also 
that the decimal exponent is represented as 0x8002, not as 0x0002. This is 
because copy_integer(), which converts integers to memcmp() compatible format, 
flips the sign bit in order to make sure negative numbers sort before positive 
numbers.

The negative numbers -123, -1.23e2 and -123.000 all have the following sort key:

'\x01'  '\x7F' '\xFE'     '8' '7' '6'    '9' '9' .... '9'
type   decimal exponent    mantissa          padding

Note that the exponent, mantissa and padding have been inverted in order to make 
sure larger negative numbers sort before smaller negative numbers.

The string "abc" gets this sort key:

'\x04'  'a' 'b' 'c'   '\0' '\0' ... '\0'   '\0' '\0' '\0' '\x03'
type      string          padding               length

The string "abc\u0000" will have a similar sort key, only differing in the 
length at the end of the key:

'\x04'  'a' 'b' 'c'   '\0' '\0' ... '\0'   '\0' '\0' '\0' '\x04'
type      string          padding               length


---

One note on GROUP BY:

Filesort-based GROUP BY uses filesort to collect the rows into groups, and it
uses Cached_item::cmp() to calculate where one group ends and the next one
begins. To make it calculate the boundaries of groups of JSON values correctly,
we need to add a new subclass of Cached_item which implements a cmp() function
that compares values according to the rules of JSON values.

We will therefore add a new class, Cached_item_json, which extends Cached_item,
and whose cmp() function compares JSON values using Json_wrapper::compare().
This enables filesort-based GROUP BY to find the boundaries of groups of scalar
JSON values.