WL#11574: Add JSON_ARRAYAGG and JSON_OBJECTAGG windowing functions

Affects: Server-8.0   —   Status: Complete   —   Priority: Medium

Add windowing support for JSON_ARRAYAGG and JSON_OBJECTAGG aggregate functions.

This is a followup of the work done in WL#9236. Add SQL window functions.
Please refer to that WL for a general background and specifications.

See also WL#9727, WL#9603 and WL#11573.

JSON_ARRAYAGG and JSON_OBJECTAGG have the same semantics as that of their
grouped aggregate counterparts except that all the rules of window specification
are added to it. A JSON_ARRAY or JSON_OBJECT is returned for every row within a
frame based on the window frame specification.
For example:
create table t1 ( id integer, attribute char(50), value char(50));
insert into t1 values (1, 'color', 'red'),
                      (1, 'shape', 'square'),
                      (1, 'fabric', 'silk'),
                      (1, 'fabric', 'cotton'),
                      (1, 'shape', 'rectangle'),
                      (1, 'color', 'green'),
                      (2, 'color', 'green'),
                      (2, 'shape', 'triangle'),
                      (2, 'shape', 'circle'),
                      (2, 'color', 'yellow'),
                      (2, 'fabric', 'cotton');
JSON_ARRAYAGG

Single Partition:

mysql> SELECT  id, attribute, JSON_ARRAYAGG(attribute) OVER w attribute_arr FROM
t1 WINDOW w AS (ORDER BY id);
+------+-----------+--------------------------------------------------------------------------------------------------------+
| id   | attribute | attribute_arr                                             
                                            |
+------+-----------+--------------------------------------------------------------------------------------------------------+
|    1 | color     | ["color", "shape", "fabric", "fabric", "shape", "color"]  
                                            |
|    1 | shape     | ["color", "shape", "fabric", "fabric", "shape", "color"]  
                                            |
|    1 | fabric    | ["color", "shape", "fabric", "fabric", "shape", "color"]  
                                            |
|    1 | fabric    | ["color", "shape", "fabric", "fabric", "shape", "color"]  
                                            |
|    1 | shape     | ["color", "shape", "fabric", "fabric", "shape", "color"]  
                                            |
|    1 | color     | ["color", "shape", "fabric", "fabric", "shape", "color"]  
                                            |
|    2 | color     | ["color", "shape", "fabric", "fabric", "shape", "color",
"color", "shape", "shape", "color", "fabric"] |
|    2 | shape     | ["color", "shape", "fabric", "fabric", "shape", "color",
"color", "shape", "shape", "color", "fabric"] |
|    2 | shape     | ["color", "shape", "fabric", "fabric", "shape", "color",
"color", "shape", "shape", "color", "fabric"] |
|    2 | color     | ["color", "shape", "fabric", "fabric", "shape", "color",
"color", "shape", "shape", "color", "fabric"] |
|    2 | fabric    | ["color", "shape", "fabric", "fabric", "shape", "color",
"color", "shape", "shape", "color", "fabric"] |
+------+-----------+--------------------------------------------------------------------------------------------------------+
11 rows in set (0.00 sec)


With window frames:

SELECT  id, attribute, JSON_ARRAYAGG(attribute) OVER w attribute_arr
  FROM t1 WINDOW w AS (PARTITION BY id ORDER BY attribute
                       ROWS BETWEEN 2 PRECEDING  AND 1 FOLLOWING);
+------+-----------+----------------------------------------+
| id   | attribute | attribute_arr                          |
+------+-----------+----------------------------------------+
|    1 | color     | ["color", "color"]                     |
|    1 | color     | ["color", "color", "fabric"]           |
|    1 | fabric    | ["color", "color", "fabric", "fabric"] |
|    1 | fabric    | ["color", "fabric", "fabric", "shape"] |
|    1 | shape     | ["fabric", "fabric", "shape", "shape"] |
|    1 | shape     | ["fabric", "shape", "shape"]           |
|    2 | color     | ["color", "color"]                     |
|    2 | color     | ["color", "color", "fabric"]           |
|    2 | fabric    | ["color", "color", "fabric", "shape"]  |
|    2 | shape     | ["color", "fabric", "shape", "shape"]  |
|    2 | shape     | ["fabric", "shape", "shape"]           |
+------+-----------+----------------------------------------+
11 rows in set (0.00 sec)

JSON_OBJECT AGG -

Single partition:

SELECT  id, JSON_OBJECTAGG(attribute,value) OVER w attribute_object
  FROM t1 WINDOW w AS (PARTITION BY id);
+------+--------------------------------------------------------------+
| id   | attribute_object                                             |
+------+--------------------------------------------------------------+
|    1 | {"color": "green", "shape": "rectangle", "fabric": "cotton"} |
|    1 | {"color": "green", "shape": "rectangle", "fabric": "cotton"} |
|    1 | {"color": "green", "shape": "rectangle", "fabric": "cotton"} |
|    1 | {"color": "green", "shape": "rectangle", "fabric": "cotton"} |
|    1 | {"color": "green", "shape": "rectangle", "fabric": "cotton"} |
|    1 | {"color": "green", "shape": "rectangle", "fabric": "cotton"} |
|    2 | {"color": "yellow", "shape": "circle", "fabric": "cotton"}   |
|    2 | {"color": "yellow", "shape": "circle", "fabric": "cotton"}   |
|    2 | {"color": "yellow", "shape": "circle", "fabric": "cotton"}   |
|    2 | {"color": "yellow", "shape": "circle", "fabric": "cotton"}   |
|    2 | {"color": "yellow", "shape": "circle", "fabric": "cotton"}   |
+------+--------------------------------------------------------------+
11 rows in set (0.00 sec)

With frames:

Example 1:

SELECT  id, attribute, JSON_OBJECTAGG(attribute,id) OVER w attribute_object
  FROM t1 WINDOW w AS (PARTITION BY id ORDER BY attribute
                       ROWS BETWEEN 2 PRECEDING AND 1 FOLLOWING);
+------+-----------+---------------------------------------+
| id   | attribute | attribute_object                      |
+------+-----------+---------------------------------------+
|    1 | color     | {"color": 1}                          |
|    1 | color     | {"color": 1, "fabric": 1}             |
|    1 | fabric    | {"color": 1, "fabric": 1}             |
|    1 | fabric    | {"color": 1, "shape": 1, "fabric": 1} |
|    1 | shape     | {"shape": 1, "fabric": 1}             |
|    1 | shape     | {"shape": 1, "fabric": 1}             |
|    2 | color     | {"color": 2}                          |
|    2 | color     | {"color": 2, "fabric": 2}             |
|    2 | fabric    | {"color": 2, "shape": 2, "fabric": 2} |
|    2 | shape     | {"color": 2, "shape": 2, "fabric": 2} |
|    2 | shape     | {"shape": 2, "fabric": 2}             |
+------+-----------+---------------------------------------+
11 rows in set (0.01 sec)

Example 2:

SELECT  id, attribute, JSON_OBJECTAGG(id,attribute) OVER w attribute_object
  FROM t1 WINDOW w AS (PARTITION BY id ORDER BY attribute
                       ROWS BETWEEN 2 PRECEDING and 1 FOLLOWING);
+------+-----------+------------------+
| id   | attribute | attribute_object |
+------+-----------+------------------+
|    1 | color     | {"1": "color"}   |
|    1 | color     | {"1": "fabric"}  |
|    1 | fabric    | {"1": "fabric"}  |
|    1 | fabric    | {"1": "shape"}   |
|    1 | shape     | {"1": "shape"}   |
|    1 | shape     | {"1": "shape"}   |
|    2 | color     | {"2": "color"}   |
|    2 | color     | {"2": "fabric"}  |
|    2 | fabric    | {"2": "shape"}   |
|    2 | shape     | {"2": "shape"}   |
|    2 | shape     | {"2": "shape"}   |
+------+-----------+------------------+
11 rows in set (0.00 sec)

Note that in the example 2 above, since the key value is same as the
partitioning key, only the last value within the frame is retained. This is in
line with the grouped aggregate specification where only the last value for any
key (if there are duplicate keys) is retained in the result.
For details about semantics of JSON aggregate functions, please refer to WL#7987. 
Also check -
https://dev.mysql.com/doc/refman/8.0/en/group-by-functions.html#function_json-arrayagg
https://dev.mysql.com/doc/refman/8.0/en/group-by-functions.html#function_json-objectagg
F1: The window functions should have the similar semantics as the grouped
aggregates JSON_ARRAYAGG and JSON_OBJECTAGG, that is to combine JSON documents
in multiple rows into a JSON array or a JSON object respectively. A JSON_ARRAY or
JSON_OBJECT is returned for every row within a frame based on the window frame
specification.

F2: For JSON_OBJECTAGG, if there are duplicate keys within a frame, only the
last value for the key is present in the result. This is in keeping with the
MySQL JSON data type specification that does not allow duplicate keys. When
parsing a text values as JSON, the value for the last occurring key will be
retained. Any earlier values will be discarded.
The value for the key from last row in the frame will be deterministic if the
ORDER BY specification guarantees that the values will have specific order. If
not, the resulting value of the key will be non-deterministic.

F3: JSON_ARRAYAGG and JSON_OBJECTAGG window functions should be optimizable,
that is, not show (multiplicative) performance complexity increase as the window
frame size increases.
Both JSON_ARRAYAGG and JSON_OBJECTAGG window functions use the existing
infrastructure provided by the corresponding grouped aggregates and the window
function implementation. For moving frames,optimization is needed to avoid
reading the whole set of rows to calculate result when a row leaves a frame.

For JSON_ARRAYAGG, the first element in the Json_array is removed when a row
leaves the frame. 
For JSON_OBJECTAGG, as Json_object always stores only the last value for a key,
 optimization (via inverting the addition of the row leaving the frame) is not
possible unless the row identity of the stored key/value pair is known. So two
methods are chosen to address this.

- In case where the first column of the ORDER BY clause is the key column, if
its known that a row is the last peer in the window frame for that key, then
that key/value pair is removed from the Json_object.

- If the rows are not ordered by the key, then a map is kept to keep track of
the number of times a key is present within a frame. When a row leaves a window
frame, the count is reduced by 1. If the count reaches 0, then the key/value
for that row is removed from Json_object.

Class Item_sum_json_object has two new members:
/**
  Map of keys in Json_object and the count for each key
  within a window frame. It is used in handling rows
  leaving a window frame when rows are not sorted
  according to the key in Json_object.
*/
  std::map m_key_map;
/**
  If window provides ordering on the key in Json_object,
  a key_map is not needed to handle rows leaving a window
  frame. In this case, process_buffered_windowing_record()
  will set flags when a key/value pair can be removed from
  the Json_object.
*/
  bool m_optimize;

Item_sum_json_object/array::add() has additional logic of handling
inverse/optimization logic for window functions. All the Item_sum_json::val_*()
functions now handle window functions too.