WL#7987: JSON aggregation functions

Affects: Server-8.0 — Status: Complete

Description
Requirements
High Level Architecture
Low Level Design

Add aggregation functions to generate JSON arrays and objects. This
makes it possible to combine JSON documents in multiple rows into a
JSON array or a JSON object.

Examples:

CREATE TABLE t1(id INT, grp INT, jsoncol JSON);
INSERT INTO t1 VALUES(1, 1, '{"key1":"value1","key2":"value2"}');
INSERT INTO t1 VALUES(2, 1, '{"keyA":"valueA","keyB":"valueB"}');
INSERT INTO t1 VALUES(3, 2, '{"keyX":"valueX","keyY":"valueY"}');

SELECT JSON_ARRAYAGG(jsoncol) AS json FROM t1;

Result:
[{"key1":"value1","key2":"value2"},
 {"keyA":"valueA","keyB":"valueB"},
 {"keyX":"valueX","keyY":"valueY"}]

SELECT JSON_ARRAYAGG(jsoncol) AS json FROM t1 GROUP BY grp;

Result:
1 | [{"key1":"value1","key2":"value2"},
     {"keyA":"valueA","keyB":"valueB"}]
2 | [{"keyX":"valueX","keyY":"valueY"}]



SELECT JSON_OBJECTAGG(id, jsoncol) AS json FROM t1;

Result:
{"1":{"key1":"value1","key2":"value2"},
 "2":{"keyA":"valueA","keyB":"valueB"},
 "3":{"keyX":"valueX","keyY":"valueY"}}

SELECT JSON_OBJECTAGG(id, jsoncol) AS json FROM t1 GROUP BY grp;

Result:
1 | {"1":{"key1":"value1","key2":"value2"},
     "2":{"keyA":"valueA","keyB":"valueB"}}
2 | {"3":{"keyX":"valueX","keyY":"valueY"}}

F1 - Provide a function called JSON_ARRAYAGG that aggregates data into JSON array.

F2 - The JSON_ARRAYAGG function should return either a valid JSON array, NULL in 
case there are no rows or error.

F3 - Provide a function JSON_OBJECTAGG that aggregates data into JSON object.

F4 - The JSON_OBJECTAGG function should return either a valid JSON array, NULL in 
case there are no rows or error.

The functions are defined based on the current specifications found in the 2015 
SQL draft standard.

The standard syntax for JSON_ARRAYAGG:

 ::=
JSON_ARRAYAGG 

[  ]
[  ]
[  ]


- value expression: values to be used to create the array.

- order by clause: Not supported. This means that the order 
  of the elements in the array is undefined. No known workaround.

- null clause: Not supported. The implementation behaves as NULL ON
  NULL. The standard says that ABSENT ON NULL is implicit.
  Workaround: can be emulated by adding "WHERE x IS NOT NULL" to the query

- output clause: Not supported. We return JSON, so it's implicitly
  RETURNING JSON FORMAT. Standard says it's an implementation-defined
  string type, so there's no problem.

The standard syntax for JSON_OBJECTAGG:

 ::=
JSON_OBJECTAGG 
  
[  ]
[  ]


- name: Key to be used for creating new property inside the object.

- value expression: Value to be used as the property's value.

- null clause: Not supported. Standard says implicit NULL ON NULL
  the same as it is implemented.

- output clause: Not supported. We have implicit JSON return type, as
  for JSON_ARRAYAGG.

The standard expects for exception in case the  is null and the 
implementation returns error.

The standard also expects exception for duplicate  (aka key) but we 
do not comply with this and instead skip the insert for the duplicate
(JSON DOM implementation).


New Item_sum* classes will be introduced:
- Item_sum_json
- Item_sum_json_array
- Item_sum_json_object

And the hierarchy will look like this:

          Item
            ^
            |
     Item_result_field
            ^
            |
         Item_sum
            ^
            |
      Item_sum_json
            ^
            |
 ----------------------
 ^                    ^
 |                    |   
Item_sum_json_array  Item_sum_json_object

The implementation will take advantage of the existing Item_sum aggregation 
framework.

Item_sum_json will:
- take care of the conversions(all the val_ ... functions)
- hold the container(either a json_array or a json_object) inside m_wrapper
- grouping mechanisms(reset_field, update_field)
- take care of the details regarding the temporary result_field creation and 
output type( using field_type(), sum_func()- type, result_type())

Item_sum_json_array and Item_sum_json_object will take care of the specifics:
- implement the clear() function which will create the container(Json_array or 
Json_object)
- implement the add() function which will know how to append to the container.

The implementation will use the Item_sum aggregation framework and it will work 
like this:

When a query will call JSON_ARRAYAGG or JSON_OBJECTAGG without GROUP BY:
- init_sum_functions will be called which will call reset_and_add which will 
call 
Item_sum's reset_and_add
- inside reset_and_add, Item_sum_json_array::clear and Item_sum_json_array::add 
will be called which will create the container and then append the first element
- update_sum_func will call Item_sum_json_array::add which will for each of the 
rows and it will append the elements to the container
- when all the rows in the table will have been processed and the 
send_result_set_row is called, the Item_sum_json::val_str will be called and the 
container will be converted to string and returned.

When a query will call JSON_ARRAYAGG or JSON_OBJECTAGG with GROUP BY:
- a Field_json will be created to hold each group's value
- depending if the row is from an existing group or from a new group, either 
Item_sum_json::reset_field or Item_sum_json::update_field will be called:
  (1) Item_sum_json::reset_field() will:
    (*) call Item_sum_json_array::clear to create a container(Json_array) and 
reference it to m_wrapper
    (*) call Item_sum_json_array::add and append the first element of the group 
to 
the container(m_wrapper)
    (*) serialize and store the current value of the container(m_wrapper) inside 
the result_field
  (2) Item_sum_json::update_field() will:
    (*) de-serialize result_field's value to the container(m_wrapper) calling 
the 
val_json function over result_field
    (*) call Item_sum_json_array::add and append the current element to the 
container
    (*) serialize and store the current value inside result_field(calling the 
store_json function with the m_wrapper as argument) 
- when the results will be send to the client the result will already be saved 
in 
result_field so val_str function will be called to get the string value.

The presence of JSON aggregates in the query blocks makes the optimizer change 
the plan from using tmp table + index and update to always do grouping with 
filesort. This was done by introducing with_json_agg boolean member which will 
be set to true when the group by contains a JSON aggregate function and it will 
force the use of filesort.