WL#9831: Ranges in JSON path expressions

Affects: Server-8.0   —   Status: Complete   —   Priority: Medium

The implementation of the JSON path language in MySQL currently allows only a
single subscript or a wildcard in array accessors. The SQL standard allows
specification of ranges, using the following syntax:

`$[5 to 7]` matches the array elements at 0-based index positions 5, 6 and 7.
That is, the 6th, 7th and 8th element of the array.

Oracle supports this syntax, see
[here](https://docs.oracle.com/database/121/ADXDB/json.htm#ADXDB6255)

Additionally, the SQL standard allows array indexes to be specified using more
complex expressions than just an integer literal. We will implement a subset of
this syntax to allow specifying array indexes relative to the end of the array.
The subset we will implement includes the SQL/JSON path keyword `last`, which
represents the index of the last element of the array, and expressions that
subtract a number from `last`:

`$[last]` matches the last element of a JSON array.

`$[last-1]` matches the second to last element of a JSON array.

`$[last-2 to last]` matches the last three elements of a JSON array.

This worklog was created in response to BUG#79052.

Examples:

    # Returns [ 2, 3, 4 ]
    SELECT JSON_EXTRACT('[1, 2, 3, 4, 5]', '$[1 to 3]');

    # Returns [ 4, 5 ]
    SELECT JSON_EXTRACT('[1, 2, 3, 4, 5]', '$[3 to 10000]');

    # Returns [ 3 ]
    SELECT JSON_EXTRACT('[1, 2, 3, 4, 5]', '$[2 to 2]');

    # Returns 5
    SELECT JSON_EXTRACT('[1, 2, 3, 4, 5]', '$[last]');

    # Returns [ 3, 4 ]
    SELECT JSON_EXTRACT('[1, 2, 3, 4, 5]', '$[last-2 to last-1]');

    # Returns [ 2, 3 ]
    SELECT JSON_EXTRACT('[1, 2, 3, 4, 5]', '$[last-3 to 2]');

    # Returns [ 2, 3 ]
    SELECT JSON_EXTRACT('[{"x":1},{"y":2},{"y":3},{"z":4},]', '$[1 to 10].y');

    # Returns NULL
    SELECT JSON_EXTRACT('[1, 2, 3, 4, 5]', '$[10 to 20]');

    # Raises an error, invalid path
    SELECT JSON_EXTRACT('[1, 2, 3, 4, 5]', '$[3 to 2]');

    # Raises an error, cannot use path with wildcard, ellipsis or range in this
context
    SELECT JSON_SET('[1, 2]', '$[0 to 1]', 3);

    # Raises an error, cannot use path with wildcard, ellipsis or range in this
context
    SELECT JSON_LENGTH('[1, 2]', '$[0 to 1]');





F-1: If a JSON path expression that contains a range is used in a context where wildcards or ellipses are not allowed, an error should be raised.

F-2: If a path expression contains an array index which specified as `last - N`, where N is a non-negative integer, it should be interpreted as the element at 0-based index position `arrayLength - N - 1`.

F-3: If a path expression contains an array index which is specified as `last`, it should be interpreted as if `last-0` had been specified.

F-4: If both the start of the range and the end of the range are relative to the beginning of the array, or if both are relative to the end of the array, the JSON path parser should raise an error if the start of the range is after the end of the range.

F-5: An array range that is evaluated against an array, matches all the array elements whose index is greater than or equal to `max(0, startIndex)` and less than or equal to `min(endIndex, arrayLength - 1)`.

F-6: If an array cell path leg or an array range path leg is evaluated against a non-array value, the result of the evaluation is the same as if the non-array value had been wrapped in a single-element array.
# Changes in the JSON path syntax

The new syntax introduced is a subset of the SQL standard syntax, described in SQL:2016, 9.39 SQL/JSON path language: syntax and semantics.

The currently supported JSON path syntax is described in https://dev.mysql.com/doc/refman/5.7/en/json-path-syntax.html. This worklog will change the `arrayLocation` production rule from

    arrayLocation:
        leftBracket ( nonNegativeInteger | asterisk ) rightBracket

to

    arrayLocation:
        leftBracket ( arrayIndex | arrayRange | asterisk ) rightBracket

    arrayIndex:
        nonNegativeInteger |
        last [ minus  nonNegativeInteger ]

    arrayRange:
        arrayIndex whitespace+ to whitespace+ arrayIndex

    minus:
        '-'

    to:
        "to"

    last:
        "last"

The first `arrayIndex` in an `arrayRange` is the start of the range, and the second `arrayIndex` is the end of the range. Both indexes are inclusive.

If both the start index and the end index of a range are relative to the beginning of the array (that is, neither contains the `last` keyword), or if both are relative to the end of the array (that is, both contain the `last` keyword), then it is a syntax error if the start index is greater than the end index. For example, `$[3 to 2]` and `$[last to last-1]` are rejected by the JSON path parser. Such paths can never match any value in any JSON document.

# Changes in array auto-wrapping

WL#7909 states that objects and scalars may be wrapped in single-element arrays while evaluating a path. Specifically, it says "If a pathExpression identifies a non-array value, then pathExpression[ 0 ] evaluates to the same value as pathExpression." This is in order to comply with the rules for "lax" evaluation of an array accessor on something that is not an array, as described in SQL:2016, 9.39 SQL/JSON path language: syntax and semantics, General Rule 11, g), case 3), A).

That rule says that (in "lax" mode) non-arrays should be wrapped in a single-element array before the array accessor is evaluated. In the original implementation of the JSON path language, the only array accessor that could match an element in a single-element array, was `[0]`. With the new syntax added in this worklog, the following array accessors can evaluate to the element contained in a single-element array: `[last]`, `[0 to last]`, `[last-N to M]`, `[0 to N]`, `[last-N to last]`, where N and M are non-negative integers.

Because of this, the new rule for array wrapping when evaluating a JSON path expression, is as follows: **If an array cell path leg or an array range path leg is evaluated against a non-array value, the result of the evaluation is the same as if the non-array value had been wrapped in a single-element array.**

Note that the current implementation is not consistently following the SQL standard on array auto-wrapping. It does so for `<JSON array accessor>` as specified in the General Rule mentioned above. It does however not auto-wrap for `<JSON wildcard array accessor>`, even though General Rule 11, g), case 4), A) says it should in "lax" mode. For example, `JSON_EXTRACT('{"a":123}', '$.a[*]')` does not return any matches, but to be standard-compliant it should evaluate to 123 in "lax" mode or raise the condition *data exception — SQL/JSON array not found* in "strict" mode.

Conversely, General Rule 11, g), cases 1) and 2) specify that `<JSON member accessor>` and `<JSON member wildcard accessor>` should **unwrap** arrays in "lax" mode, or raise an error when used on arrays in "strict" mode. The current implementation does neither. For example, `JSON_EXTRACT('[{"a":1},{"a":2}]', '$.a')` should match 1 and 2 if "lax" or raise an error if "strict", but it currently returns no matches.

This worklog does not add auto-wrapping for array wildcard accessors, or auto-unwrapping of arrays for member accessors or member wildcard accessors, so these inconsistencies remain.

# Changes in path handling in JSON functions

## Path lookups

The code that performs path lookup in a JSON value (`Json_wrapper::seek()`), used by most of the JSON functions, is extended as follows. (Any mentioning of `last-N` below also applies to `last`, since `last` is equivalent to `last-0`.)

If a JSON array cell path leg is encountered and the array index is an expression containing the `last` keyword:

- If the path leg `[last-N]` is evaluated on an array, and N is less than the length of the array, the array element at position arrayLength - N - 1 is added to the list of matches.
- If the path leg is evaluated on a non-array value, and the path leg is `[last]` or `[last-0]`, the non-array value is added to the list of matches.
- Otherwise, nothing is added to the list of matches.

If a JSON array range path leg is encountered:

- If the path leg is evaluated on an array:
    - First, all index expressions on the form last-N are converted to indexes relative to the beginning of the array: index = arrayLength - N - 1
    - If the start index is less than 0, a start index equal to 0 is used.
    - If the end index is greater than or equal to arrayLength, then an end index equal to arrayLength - 1 is used.
    - If the start index is less than or equal to the end index, all array elements from the start index up to and including the end index are added to the list of matches.
    - Otherwise, nothing is added to the list of matches.
- If the path leg is evaluated on a non-array value, and the range is [0 to N], [last-N to last], [0 to last] or [last-N to M], where N and M are non-negative integers, the non-array value is added to the list of matches.
- Otherwise, nothing is added to the list of matches.

## Rejection of paths that contain ranges

The following functions currently raise an error when a path contains a member wildcard, an array wildcard or an ellipsis:

- JSON_CONTAINS
- JSON_LENGTH
- JSON_KEYS
- JSON_ARRAY_APPEND
- JSON_INSERT
- JSON_ARRAY_INSERT
- JSON_SET
- JSON_REPLACE
- JSON_REMOVE

These functions will now also raise an error if they are called with a path that contains an array range. The error message is changed from "In this situation, path expressions may not contain the \* and \*\* tokens." to "In this situation, path expressions may not contain the \* and \*\* tokens or an array range."

Path arguments with ranges are accepted by the following functions:

- JSON_CONTAINS_PATH
- JSON_EXTRACT
- JSON_SEARCH

## Indexes before the first array element

JSON_INSERT, JSON_ARRAY_INSERT and JSON_SET currently accept paths that point to an element after the end of an array, in which case a new value is appended to the array. For example:

    mysql> SELECT JSON_INSERT('[1,2]', '$[5]', 5) AS c1,
        ->        JSON_ARRAY_INSERT('[1,2]', '$[5]', 5) AS c2,
        ->        JSON_SET('[1,2]', '$[5]', 5) AS c3;
    +-----------+-----------+-----------+
    | c1        | c2        | c3        |
    +-----------+-----------+-----------+
    | [1, 2, 5] | [1, 2, 5] | [1, 2, 5] |
    +-----------+-----------+-----------+

For symmetry, these functions will now also accept paths that point to an element before the first element in the array (an expression on the form `last-N` where `arrayLength - N - 1 < 0`). Then they will add a new value at the beginning of the array. For example:

    mysql> SELECT JSON_INSERT('[1,2]', '$[last-5]', -5) AS c1,
        ->        JSON_ARRAY_INSERT('[1,2]', '$[last-5]', -5) AS c2,
        ->        JSON_SET('[1,2]', '$[last-5]', -5) AS c3;
    +------------+------------+------------+
    | c1         | c2         | c3         |
    +------------+------------+------------+
    | [-5, 1, 2] | [-5, 1, 2] | [-5, 1, 2] |
    +------------+------------+------------+

This also happens in the case where JSON_INSERT and JSON_SET auto-wrap a non-array value (JSON_ARRAY_INSERT does not auto-wrap):

    mysql> SELECT JSON_INSERT('1', '$[last-5]', -5) AS c1,
        ->        JSON_ARRAY_INSERT('1', '$[last-5]', -5) AS c2,
        ->        JSON_SET('1', '$[last-5]', -5) AS c3;
    +---------+------+---------+
    | c1      | c2   | c3      |
    +---------+------+---------+
    | [-5, 1] | 1    | [-5, 1] |
    +---------+------+---------+

## Changes in JSON_EXTRACT

JSON_EXTRACT currently returns one of the following:

- NULL if one of its arguments is NULL, or if none of the specified paths is found in the JSON value.
- The JSON value that matches the specified path, not wrapped in an array, if there is only a single path argument and that path does not contain any wildcard or ellipsis token.
- An array of all the JSON values that match one or more of the given paths otherwise.

When this worklog is implemented, JSON_EXTRACT will return a single value, not wrapped in an array, only if there is a single path argument and that path does not contain any wildcard, ellipsis **or array range**. If the path contains an array range, JSON_EXTRACT will return an array of all matches. Except if there is no match, in which case NULL will be returned.

These changes also apply to the -> and ->> tokens that are translated to JSON_EXTRACT by the SQL parser.
# Changes in the Json_path_leg class

A new Json_path_leg type, `jpl_array_range`, is added to the enum_json_path_leg_type enum. This type is used for path legs that represent array ranges.

The `m_array_cell_index` member of the Json_path_leg class is replaced with two members: `m_first_array_index` and `m_first_array_index_from_end`, where the latter is a bool that tells if the index is relative to the end of the array. Also, two new members `m_last_array_index` and `m_last_array_index_from_end` are added to make it possible to represent ranges, and not only single-index array accessors.

The accessor functions that return the indexes of an array cell or array range path leg, return an instance of Json_path_leg instead of size_t. Json_array_index is a new class that helps translate the specified indexes (whether they are simple integers or expressions using the `last` keyword) into positions in a given JSON array.

`Json_path_leg::to_string()` is made aware of the new path leg syntax.

A new member function `is_autowrap()` is added, to help determining if an array accessor can match non-array values because of auto-wrapping. Before, this could easily be determined by checking if the path leg was a `jpl_array_cell` leg whose index was 0. With the introduction of ranges and negative array indexes, this now applies to more cases, so the logic is moved to this helper function to avoid duplicating the more complicated check.

New member functions to get the array indexes are added:

- `first_array_index()` and `last_array_index()` get the start and end of an array range path leg. `first_array_index()` can additionally be used to get the index of an array cell path leg. The return value is a Json_array_index object, which can tell whether the index is before the first element in the array, or after the end of the array, and which position in the array it points to.

- `get_array_range()` gets the range of indexes that are matched by an array range path leg.

# Changes in the Json_path class

The member function `contains_wildcard_or_ellipsis()` used to return true if the path contained a wildcard or an ellipsis. This was used in some contexts to reject paths that could match multiple values. In those contexts, paths containing array ranges should also be rejected, so this function was renamed to `can_match_many()` and was changed so that it also returned true for paths that contained an array range.

`Json_path::parse_array_leg()` is made aware of the new syntax (ranges and expressions using the `last` keyword).

One bug is fixed in `parse_array_leg()`: When parsing an array index, the original code parsed the array index as a longlong (64-bit) and unconditionally cast the result to size_t. On 32-bit platforms, this could mean that the resulting array index would be incorrect, if the specified array index was greater than or equal to 2^32. This was fixed by making the JSON path parser raise an error if the specified array index is greater than UINT_MAX32. (The JSON binary storage format doesn't allow more than UINT_MAX32 elements in an array, so allowing larger array indexes in the JSON path language isn't very useful even on 64-bit platforms.)

# Changes in the path lookup

The path lookup has two different implementations. One in `Json_dom::find_child_doms()` which works on DOM representations, and one in `Json_wrapper::seek_no_ellipsis()` which allows lookup without converting the document to a DOM if the path has no ellipsis token. Both of these are changed as described in the [Path lookups](#pathlookups) section of the high-level specification.

# Changes in JSON functions

`Item_func_json_array_append::val_json()`, `Item_func_json_insert::val_json()` and `Item_func_json_set_replace::val_json()` are changed so that they can insert new values in front of the first element in an array, if an array cell path leg points to an element before the first element, as described in the [Indexes before the first array element](#indexesbeforethefirstarrayelement) section of the high-level specification.

The Item_json_func sub-classes that previously used `Json_path::contains_wildcard_or_ellipsis()` to reject paths that contained a wildcard or an ellipsis, now use `Json_path::can_match_many()` instead to reject ranges as well as wildcard and ellipsis.