WL#12228: Add JSON_VALUE function

Affects: Server-8.0   —   Status: Complete

This WL aims to ease index creation of JSON values by introducing JSON_VALUE function. Its purpose is to extract scalar value at the specified path from the given JSON document and return it as the specified type. For example:

 SELECT JSON_VALUE('{"name": "Evgen"}', '$.name')

will return unquoted string 'Evgen' as VARCHAR(512) with JSON's default collation,

 SELECT JSON_VALUE('{"price": 123.45}', '$.price' RETURNING DECIMAL(5,2))

will return 123.45 as decimal(5,2).

The user experience with indexing JSON data is eased in the following ways:

  • ) The expression one has to specify in order to create a usable index is simpler and automatically contains all necessary functionality.
  • ) By default JSON_VALUE's return type will be VARCHAR(512), not BLOB like for JSON_UNQUOTE. This allows to create index using immediate function's output, without need of CAST().
  • ) JSON_VALUE's string result will have collation of given JSON value, not the table's default. This removes need to match collation of GC/func index and the expression used in WHERE clause for index to be used.

This function is described in SQL 2016, chapter 6.27 <JSON value function>, and is implemented in:

Note that RETURNING/ON ERROR/ON EMPTY clauses aren't implemented by SQL Server.

Functional requirements

F1 - Queries with predicates that use JSON_VALUE should be able to use functional indexes created with JSON_VALUE (given that path expression, data type and the ON EMPTY/ON ERROR clauses match).

F2 - The RETURNING clause of JSON_VALUE should accept specification of any non-array data type accepted by the CAST operator.

F3 - If no RETURNING clause is specified, the return value of the JSON_VALUE expression is VARCHAR(512) COLLATE utf8mb4_0900_bin.

F4 - If no ON EMPTY clause is specified, NULL ON EMPTY is implicit.

F5 - If no ON ERROR clause is specified, NULL ON ERROR is implicit.

F6 - If a DEFAULT ... ON EMPTY/ERROR clause specifies a default value that cannot be converted to the RETURNING type without truncation or loss of precision, an error should be raised.

F7 - If a DEFAULT ... ON EMPTY/ERROR clause specifies an invalid date, or a zero date or a date with a zero component, an error is raised.

F8 - If the JSON path argument is not a valid JSON path, an error is raised.

F9 - If the type of the JSON document argument is not JSON and not a character string type, an error is raised.

F10 - If the JSON document argument is NULL, the JSON_VALUE expression returns NULL.

F11 - If the JSON document argument is not NULL and its type is a character string type (not JSON), and the string is not a valid JSON document, the ON ERROR clause should be invoked.

F12 - If the JSON path argument matches multiple values in the JSON document argument (because the path contains wildcards or ranges), the ON ERROR clause should be invoked.

F13 - If the JSON path argument matches no values in the JSON document argument, the ON EMPTY clause should be invoked.

F14 - If the value found on the specified path in the JSON document cannot be converted to the RETURNING type without truncation or loss of precision, the ON ERROR clause should be invoked.

Contents


Overview

This WL introduces a new function called JSON_VALUE. Its purpose is to combine functionality of 3 already existing functions to ease handling of JSON value and ease index creation. For example:

 JSON_VALUE('{"price": 123.45}', '$.price' RETURNING DECIMAL(5,2))

is mostly the same as

 CAST(
   JSON_UNQUOTE(JSON_EXTRACT('{"price": 123.45}', '$.price'))
   AS DECIMAL(5,2)

However, there are few differences:

  • ) When JSON_VALUE returns string, it will have the collation of source JSON, while for CAST() it'll be connection's collation.
  • ) JSON_VALUE allows to specify two optional clauses for more fine grained handling of the conversion process - ON ERROR and ON EMPTY. They behave in a similar way as in JSON_TABLE.

[JS] Using DECIMAL(6,4) in the above JSON_VALUE() call returns NULL. Using DECIMAL(6,4) in the CAST call() returns 99.9999 with this warning:

Out of range value for column 'CAST(JSON_UNQUOTE(JSON_EXTRACT('{"price": 123.45}', '$.price')) AS DECIMAL(6,4))' at row 1

Regarding DECIMAL, the Manual says, that the first argument is the precision (total number of digits) and the second is the scale (digits to the right of the decimal point). I have therefore amended the examples to use DECIMAL(5,2)--5 digits total with 2 of these to the right of the decimal point.


Syntax

 JSON_VALUE '(' <json expr>,
   <json_path> <opt_returning_type> <opt_on_empty> <opt_on_error> ')'
 <json expr> is a general expression which should return valid JSON document
 <json path> string literal specifying path in the given doc to extract value
             from
 opt_returning_type:
     <empty>
   | RETURNING <cast_type>
   ;

The above two clauses specify the return type of JSON_VALUE. The type is specified (and uses the same syntax rules) as for the CAST function.

 opt_on_empty:
     <empty>
   | NULL ON EMPTY
   | ERROR ON EMPTY
   | DEFAULT <literal> ON EMPTY
   ;
 opt_on_error:
     <empty>
   | NULL ON ERROR
   | ERROR ON ERROR
   | DEFAULT <literal> ON ERROR
   ;

NULL ON (ERROR|EMPTY) is the default when these clauses aren't specified. These clauses have similar meaning as for JSON_TABLE. The difference from JSON_TABLE's ON EMPTY/ON ERROR clauses is that JSON_TABLE only allows the DEFAULT clause to take a string literal, and requires that it can be parsed as JSON text. This is not how it is defined in the standard, so JSON_VALUE will allow any literal in the DEFAULT clause, and it will not require it to be parsable JSON text, except for the case where RETURNING JSON has been specified. (JSON_TABLE should also be changed to match the SQL standard, but that is outside the scope of this worklog.)

Result type and its collation

When RETURNING clause isn't specified, JSON_VALUE will return VARCHAR(512) having utf8mb4 charset and utf8mb4_0900_bin collation. 512 chars (2048 bytes in case of utf8mb4 charset) is the longest string that can be represented in server without getting converted to a blob. The utf8mb4_0900_bin collation is chosen because it is the collation used by MySQL when comparing two JSON string literals.

In general, JSON_VALUE behaves in the same way as CAST function. It supports the same types that CAST supports:

   FLOAT DOUBLE DECIMAL SIGNED UNSIGNED
   DATE TIME DATETIME
   CHAR
   JSON

One exception is that JSON_VALUE only supports returning non-array types, whereas CAST allows returning array types when used in functional index expressions.

The other exception is how it handles result's collation for CHAR type. For CAST( .. AS CHAR) expression in functional index the CREATE TABLE will create a hidden field with table's default collation. Usually it's utf8mb4_0900_ai_ci, a case insensitive collation, unlike JSON's one which is binary and thus case sensitive. This is the cause of bug#27337092. To address that for JSON_TABLE's result CREATE TABLE sees JSON_TABLE's collation, instead of table's default. The reason for this is that hidden field represents and holds the result of JSON_VALUE function and arbitrarily changing its collation will result in discrepancy. Retaining JSON_VALUE's collation allows to fix this bug.

However sometimes users need to use case insensitive collation for index, but syntax for type specification for CAST functions (JSON_VALUE uses it too) doesn't allow to specify the desired collation. This WL implements a simple workaround to address that case. CAST's type specification rules allow to specify charset for the result. When there's no charset specified, JSON_VALUE will use utf8mb4 with binary collation (my_charset_utf8mb4_0900_bin), but if user specifies utf8mb4 charset for the result, server will use default collation for this charset and it is my_charset_utf8mb4_0900_ai_ci, which is case insensitive. This approach allows to use both options keeping general CAST's syntax intact.

Error handling

Errors are handled in following ways:

  • ) All input JSON data, in string literals or non-constant input, is checked to be valid. If it's not, an SQL error is thrown. Same is done for string literal containing JSON path. ON ERROR clause isn't triggered in this case.
  • ) Attempt to extract an object or an array, conversion error (e.g. 'asdf' to unsigned), or a truncation triggers ON ERROR clause.
  • ) Even in NULL and DEFAULT cases of ON ERROR a warning on bad conversion is issued.
  • ) If source JSON document doesn't have the specified path, then ON EMPTY clause is triggered.

In the sense of error handling, JSON_VALUE behaves in same way as JSON_TABLE.

Changes to the optimizer

When the user creates an index using JSON_VALUE, e.g like following:

 CREATE TABLE t1(j JSON, key i1((JSON_VALUE(j, '$.id RETURNING UNSIGNED))));
 INSERT INTO t1 VALUES(...),(...)...;

the optimizer should employ it for following query:

 SELECT * FROM t1 WHERE JSON_VALUE(j, '$.id RETURNING UNSIGNED)=123;

if all the requirements for employment of indexes over generated columns are met.

Conformance to the SQL standard

The implementation of JSON_VALUE attempts to follow the standard as much as possible.

It does not implement feature T826, "General value expression in ON ERROR or ON EMPTY clauses". It only accepts a literal for DEFAULT ... ON EMPTY and DEFAULT ... ON ERROR.

The implementation of the RETURNING clause reuses grammar rules from the CAST operator, which uses non-standard type names for some data types and cannot be used for specifying all of MySQL's supported types.

Overview

The goal of this WL is to create a single function drop-in replacement for expressions like CAST(JSON_UNQUOTE(JSON_EXTRACT(jsn,'$.path'))) that are used to create indexes over JSON data. So the goal is to ease users' experience with creating/using indexes, but keep it consistent with SQL.

The JSON_VALUE function implementation

The function is implemented in the Item_func_json_value class. It reuses some code for resolving from Item_func_array_cast. That code is factored out from Item_func_array_cast and into standalone helper functions used by both classes.

It follows the common Item_func architecture:

  • fix_fields() does pre-check of all arguments to be valid: path, default values for ON ERROR and ON EMPTY clauses. The latter are checked to be convertible without errors to the function's result type.
  • print() and func_name() are overloaded to return appropriate info for JSON_VALUE function.
  • data is returned via val_x() and get_date()/get_time() in two steps: extract_json_value() extracts data from source JSON and handles ON EMPTY clause, and the ON ERROR clause if the JSON document isn't valid JSON, and the val_xxx() functions convert the extracted value to the specified type and handle the ON ERROR clause if the value cannot be converted to the target type.
  • eq() is extended to check that all arguments, including the RETURNING, ON EMPTY and ON ERROR clauses are equivalent.

Other changes

The standard syntax for JSON_VALUE doesn't fit a regular function call with comma-separated arguments, so it's implemented in the SQL parser. Two new tokens were added - RETURNING_SYM and JSON_VALUE_SYM. Both are non-reserved words in MySQL. JSON_VALUE is a reserved word in the SQL standard, RETURNING is non-reserved.

The implementation in the parser borrows grammar rules from the CAST operator and from JSON_TABLE. The implementation of JSON_TABLE breaks with the standard in some ways. One is that it allows the ON EMPTY clause and the ON ERROR clause to come in either order, whereas the standard says the ON EMPTY clause should come before the ON ERROR clause. The other is that it only allows string literals for the default value in ON EMPTY/ON ERROR, and the string must be a valid JSON string. The standard does not require the default value to be a string.

The JSON_VALUE implementation does not inherit these deviations from the standard. The rules for the ON EMPTY and ON ERROR clauses are therefore refactored so that ON ERROR before ON EMPTY is only allowed for JSON_TABLE. Also, the grammar allows any literal type for the DEFAULT. JSON_TABLE has not been updated to do anything with the non-string default values, though, so non-string defaults are rejected by PT_json_table_column_with_path::contextualize().

The rules for literals in the parser are a little backwards compared to the SQL standard syntax:

  • MySQL's literal rule accepts NULL, whereas <literal> in the standard is described as "Specify a non-null value."
  • MySQL's literal rule does not accept signed numeric values, instead there is a signed_literal rule which accepts both literals and signed numeric values. In the standard, it is the other way around. <literal> is <signed numeric literal> or <general literal>.

To address the former issue, literal is renamed to literal_or_null, and signed_literal is renamed to signed_literal_or_null, and new versions of literal and signed_literal, which don't accept NULL, are added. The latter issue is not fixed, so json_on_response is changed to use signed_literal, which is what best matches the standard's <literal>.

Json_wrapper::coerce_int() now returns unsigned_flag of the coerced value. This is used for proper coercion of signed and unsigned integers and correct error handling.

handle_coercion_error() now has one more option regarding error handling - CE_IGNORE. When it's given, neither an error nor a warning will be raised. But even then coerce_x() will return true to allow the caller to react consistently and throw the proper error/warning.