WL#17056: Enable use of auto-increment column in DML for Json Duality

Affects: Server-9.x   —   Status: Complete

In V1, DMLs on a JSON Duality View expected values for auto-increment columns to be supplied explicitly. This requirement was removed for columns which are not part of a join condition.

Ideally the auto-generated value can be used directly, e.g. to determine key values in sub-objects. This WL designs and implements the behavior of auto-increment columns used in join conditions of JSON Duality Views.

Functional Requirements

F1: It must be possible to use the generated value for base table columns having the AUTO_INCREMENT attribute when inserting into JSON Duality Views also when the base table column is part of a join condition, subject to the limitations mentioned below.

F2: For a singleton child: if the child join condition column is AUTO_INCREMENT, the JSON document being inserted can omit the join condition fields, or specify them as 0 or NULL (for 0, provided the sql_mode NO_AUTO_VALUE_ON_ZERO is off).

F3: For a nested child: if the parent join condition column is AUTO_INCREMENT, the JSON document being inserted can omit the join condition fields, or specify them as 0 or NULL (for 0, provided the sql_mode NO_AUTO_VALUE_ON_ZERO is off).

F4: Mixing explicit and generated values for a field of a JSON document which map to a PRIMARY KEY AUTO_INCREMENT column must be rejected. See example in the Duplicate Primary Key Check section of the HLS.

F5: If the value for one side of a join condition is provided explicitly, this value must be propagated to the other side of the join condition, even if the other side is AUTO_INCREMENT. In other words, inference from an explicitly provided value through the join condition takes precedence over automatic generation. This applies even if automatic generation is explicitly requested using 0 or null.

The reason why JSON fields which map to join condition columns have had to be provided explicitly, even when the column is AUTO_INCREMENT, is that there are situations where the value generated for the column needs to be available to the JDV code.

Duplicate Primary Key Check

An insert into a JDV can specify multiple objects which map to the same base table row, provided that the objects contain identical values for all the columns of the base table. This check relies on being able to group objects based on primary key value and check that the other column values are the same in each group.

Allowing a field which maps to a PRIMARY KEY AUTO_INCREMENT column to be omitted (and thereby the column value generated) leads to a potential problem: when combining explicit PK values with generated ones, it is possible that an insert using a generated PK and one using an explicit PK actually end up using the same value. Since the identical PKs are not observable in the JSON document, the insert will be allowed even if the other column values are not identical.

Allowing a mix of generated and explicit PKs would require us to track all PKs generated for a particular base table so far, and check the explicit PKs against this set. The added benefit of this does not seem to be worth the effort.

Example (mixing generated and explicit primary key values, which must be rejected)

INSERT INTO customers_orders_dv VALUES (
'{  "customer_name": "Alice",
    "_id": 1,
    "orders": [
      {"order_id": 0, "product": "Laptop", "amount": 1299.99},
      {"product": "Mouse", "amount": 19.99},
      {"order_id": 42, "product": "Laptop", "amount": 1299.99}
    ]
}');

Instead we propose to allow inserts which either always rely on the generated value for primary keys, or never do so. Such a check is relatively simple to implement. When the primary key values are provided explicitly, the check can be performed in the same way as in V1. If all the primary keys are generated, no check needs to be performed as there cannot be any duplicates.

Using only explicit values

INSERT INTO customers_orders_dv VALUES (
'{  "customer_name": "Alice",
    "_id": 1,
    "orders": [
      {"order_id": 100, "product": "Laptop", "amount": 1299.99},
      {"order_id": 101, "product": "Mouse", "amount": 19.99},
      {"order_id": 100, "product": "Laptop", "amount": 1299.99}
    ]
}');

Using only AUTO_INCREMENT-generated values

INSERT INTO customers_orders_dv VALUES (
'{  "customer_name": "Alice",
    "_id": 1,
    "orders": [
      {"order_id": 0, "product": "Laptop", "amount": 1299.99},
      {"product": "Mouse", "amount": 19.99},
      {"order_id": 0, "product": "Laptop", "amount": 1299.99}
    ]
}');

Generated Values in Join-Conditions

Relying on AUTO_INCREMENT for columns which are part of a join condition connecting a child table to its parent presents a problem: since the user does not know which AUTO_INCREMENT value will be generated, they cannot provide a correct value for the other column of the join condition. So to use AUTO_INCREMENT in this way, code must be added which fills in the value of the dependent join condition column from the value actually generated for the AUTO_INCREMENT column.

To support automatically generated keys in join conditions, we can use the LAST_INSERT_ID functionality to retrieve the automatically generated key after the insert has taken place. But for this to work, the insert into the table with the AUTO_INCREMENT join-condition column must happen before the insert into the table holding the dependent join-condition column.

Currently, inserts into a singleton child happen before inserts into the parent, while inserts into a nested child happen after inserts into the parent. This insertion order is required to ensure that FK constraints are not violated.

Nested Child Example

For nested children this insertion order works well, as only the parent join condition column can use the AUTO_INCREMENT-generated value. Since each of the nested child objects needs to refer to the same parent object, it makes little sense to rely on an AUTO_INCREMENT-generated value for the child side of the join condition.

CREATE TABLE nested(nc1 INT PRIMARY KEY, nc2 INT);
CREATE TABLE parent(pc1 INT PRIMARY KEY AUTO_INCREMENT, pc2 INT);

CREATE JSON DUALITY VIEW jdv_nested AS
SELECT JSON_DUALITY_OBJECT(WITH (INSERT, UPDATE, DELETE)
"_id" : pc1,
"_pc2" : pc2,
"_nested" : (SELECT JSON_ARRAYAGG(JSON_DUALITY_OBJECT(WITH
(INSERT, UPDATE, DELETE)
"_nc1" : nc1,
"_nc2" : nc2
))
FROM nested
WHERE nested.nc2 = parent.pc1
)
)
FROM parent;

INSERT INTO jdv_nested VALUES (
'{ "_nested": [{"_nc1": 0}, {"_nc1":1}]}');

Singleton Child Example

For singleton children the insertion order allows the use of an AUTO_INCREMENT-generated value for the child join condition column, but not for the parent join condition column. If the parent join condition column has a foreign key, as was required in the original JDV specification, it makes little sense to declare it as AUTO_INCREMENT, since there is no way to know which, if any, row would be referenced. But even without a foreign key AUTO_INCREMENT is not useful here, because the value would be generated after insertion in the child table has already happened. In order to actually use the generated value the insertion order would need to be changed in this case.

CREATE TABLE singleton(sc1 INT PRIMARY KEY AUTO_INCREMENT, sc2 INT);
CREATE TABLE parent(pc1 INT PRIMARY KEY, pc2 INT REFERENCES singleton(sc1));

CREATE JSON DUALITY VIEW jdv_singleton AS
SELECT JSON_DUALITY_OBJECT(WITH (INSERT, UPDATE, DELETE)
"_id" : pc1,
"_pc2" : pc2,
"_singleton" : (SELECT JSON_DUALITY_OBJECT(WITH (INSERT, UPDATE)
"_sc1" : sc1,
"_sc2" : sc2
)
FROM singleton
WHERE singleton.sc1 = parent.pc2
)
)
FROM parent;
INSERT INTO jdv_singleton VALUES ('{ "_id":1, "_singleton": {"_sc2": 42}}');

For this worklog the plan is to reject the use of AUTO_INCREMENT-generated values where this is not supported by the current insertion order. It would still be possible to use such a column in the join condition, but a value must then be provided for it from the JSON document being inserted, as is the case today.

Replication

What are the Replication/High-Availability considerations with respect to deployment and installation?

  • Are there extra steps needed to install the feature?

    No extra steps are required for the installation of the feature.

  • Is there a specific order for those steps? For instance, need to be installed first on replica servers and only then on source servers.

    Yes, a specific upgrade order must be followed. First, upgrade the replica servers to a version that supports AUTO_INCREMENT columns in JSON Duality Views. Once all replicas are successfully upgraded, the source server can then be upgraded. This order is essential to avoid the source generating replication statements that the older replica versions cannot understand or execute.

How does the feature operate on distributed environment?

  • Does the feature rely on server local configuration? That is, do source and replica server need to have the same configuration, plugins and or components?

    The feature does not rely on local configurations, whether session or global, nor does it depend on specific plugins or components. Therefore, the source and replica servers do not need to have identical configurations or components for the feature to function properly.

  • What is the expected behavior when the configuration does not match?

    NA

  • Does the feature rely on session configuration? Does that session configuration need to be replicated together with data?

    The feature does not rely on session configuration. As such, session configuration does not need to be replicated alongside the data, allowing for flexibility in session settings between the source and replica servers.

Upgrade/Downgrade and Cross-Version Replication

  • Any impact on cross-version replication? Which?

There is no impact on cross-version replication in general. However, if replicating new -> old, and the old server performs a takeover, it will not be able to accept DMLs relying on AUTO_INCREMENT column values being generated.

  • Any impact on upgrade/downgrade? Which?

Upgrade must be possible from a version which does not support AUTO_INCREMENT columns in JDVs to the current version that does. When downgrading from a version that supports AUTO_INCREMENT, DMLs which rely on AUTO_INCREMENT columns being generated will fail on the target version.

  • What about rolling upgrades? How is the user able to do rolling upgrades? Please consider LTS and non-LTS for all bullets in this section https://dev.mysql.com/doc/refman/8.4/en/mysql-releases.html

There is no impact on rolling upgrades.

Behavior Change

  • Is there anything the user could do before the worklog, which has a different effect after the worklog?

    There is no such behavior. Actions taken by the user before the worklog will have the same effect after the worklog.

  • Is there anything the user could do before the worklog, which cannot be done after the worklog?

There is no such behavior. All actions that can be performed by the user before the worklog can also be performed after the worklog.

Limitations

  • A JSON document being inserted must either provide all AUTO_INCREMENT PRIMARY KEY column values, or none of them (relying on the values being generated). See functional requirement F4.
  • A generated value for an AUTO_INCREMENT column can only be inferred through the join condition if insertion into this column happens before insertion into the other side of the join condition (See functional requirements F2 and F3).

Relaxing Check of JSON Document

In V1, all primary key columns had to be explicitly provided, and the error ER_JDV_PRIMARY_KEY_MUST_BE_PROVIDED was returned if they were not. This restriction has since been limited to join condition columns.

This check must be changed so that AUTO_INCREMENT-generated values are permitted in the cases allowed by this wl. For the remaining cases it is probably necessary to create a new error message, or at least adjust the message text (which states that a value must always be explicitly provided even for AUTO_INCREMENT columns).

Previously, it was also required that all join condition columns had a valid (non-nullptr) Json_dom* after resolving was complete, and the error ER_JDV_JOIN_CONDITION_NOT_SATISFIED was reported if this was not the case. For this WL the check must be modified so that join condition columns which will receive an AUTO_INCREMENT-generated value when inserting, or which can be resolved from such a column after the insert has happened, are allowed. Otherwise an error still needs to be reported. It is unclear whether the existing error message can be reused, or whether new error messages specifying why it is not possible to rely on the AUTO_INCREMENT-generated value must be created.

Modifications to Duplicate PK Check

Currently, all the bindings are grouped by table and primary key, and all Resolve_rows within each group are checked to verify that all the column values are identical.

This check will break if the resolved value of a PRIMARY KEY AUTO_INCREMENT column can be nullptr or have the value 0 (meaning that it will be generated by the insert). To make it work, the comparator for the sorting must be made to accept nullptr and sort nullptr before any non-nullptr value. If the primary key is AUTO_INCREMENT and the sql_mode NO_AUTO_VALUE_ON_ZERO is off (and only then) must the value 0 be sorted together with nullptr, as they both indicate that the value will be auto-generated.
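As an illustration, a minimal sketch of such a comparator, using stand-in types rather than the actual server classes: the resolved value is modelled as std::optional<uint64_t>, with std::nullopt playing the role of a nullptr Json_dom*.

```cpp
#include <cstdint>
#include <optional>

// Stand-in for a resolved primary key value: nullopt models a nullptr
// Json_dom*, i.e. a value that will be auto-generated by the insert.
using PkValue = std::optional<uint64_t>;

// Returns true when the value counts as "to be generated": nullptr always
// does; 0 only does for AUTO_INCREMENT keys with NO_AUTO_VALUE_ON_ZERO off.
bool is_generated(const PkValue &v, bool auto_inc, bool no_auto_value_on_zero) {
  if (!v.has_value()) return true;
  return auto_inc && !no_auto_value_on_zero && *v == 0;
}

// Comparator sorting generated values before any explicit value; all
// generated values compare equal so they end up in the same group.
bool pk_less(const PkValue &a, const PkValue &b, bool auto_inc,
             bool no_auto_value_on_zero) {
  const bool ga = is_generated(a, auto_inc, no_auto_value_on_zero);
  const bool gb = is_generated(b, auto_inc, no_auto_value_on_zero);
  if (ga != gb) return ga;  // generated sorts first
  if (ga) return false;     // generated values are equivalent
  return *a < *b;           // explicit values in numeric order
}
```

Sorting generated values first guarantees that any group mixing generated and explicit keys starts with a generated one, which is what the subsequent group check relies on.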

If all the primary key values are provided, as is required in V1, this would work as before. E.g. given the following sequence of inserts

t1(1,4), t3(1,3), t2(2,1), t1(1,5), t2(2,1), t3(5,1)

would be sorted as

t1(1,4), t1(1,5), t2(2,1), t2(2,1), t3(1,3), t3(5,1)

and grouped into the following 4 groups by table,pk:

<t1(1,4), t1(1,5)>, <t2(2,1), t2(2,1)>, <t3(1,3)>, <t3(5,1)>

Here the first group will result in an error because the two members of the group have different values for the second column (4 vs 5).

With the changes proposed by this WL, it becomes possible to have something like this (all PKs are AUTO_INCREMENT):

t1(,4), t3(,3), t2(2,1), t1(0,5), t2(2,1), t3(5,1)

which needs to be sorted as

t1(,4), t1(0,5), t2(2,1), t2(2,1), t3(,3), t3(5,1)

and grouped into the following 3 groups (nullptr and 0 are grouped together) by table,pk:

<t1(,4), t1(0,5)>, <t2(2,1), t2(2,1)>, <t3(,3), t3(5,1)>

Now t1 is ok because all primary keys are auto-generated (and thus distinct). t2 is also ok as it explicitly specifies the primary key as 2 and the other column is identical. But t3 must trigger an error as it combines an auto-generated with an explicit primary key value.

The loop over each of the groups will now have two distinct branches

  • If the first primary key is auto-generated, it is necessary to verify that all are auto-generated.
  • Otherwise, it is necessary to perform the same check of the other columns as before (due to the sorting order all the other primary keys in the group must be explicit in this case).
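The group check could be sketched along these lines, again with hypothetical stand-in types: Row here models a Resolve_row, and the assumption is that a 0-value counting as "to be generated" has already been normalised to the generated representation (nullopt) by the sorting step.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Stand-in for a Resolve_row: an optional PK (nullopt = to be generated)
// plus the remaining column values.
struct Row {
  std::optional<uint64_t> pk;
  std::vector<std::string> other_columns;
};

// Check one group of rows sharing the same (table, pk) after sorting.
// Returns true if the group is acceptable per the rules of this worklog.
bool check_group(const std::vector<Row> &group) {
  if (group.empty()) return true;
  if (!group.front().pk.has_value()) {
    // First PK is auto-generated: all of them must be auto-generated,
    // otherwise generated and explicit values are being mixed.
    for (const Row &r : group)
      if (r.pk.has_value()) return false;
    return true;  // all generated, hence distinct; no duplicate check needed
  }
  // Explicit PKs: all other column values must be identical, as in V1.
  for (const Row &r : group)
    if (r.other_columns != group.front().other_columns) return false;
  return true;
}
```

Applied to the example above: the t1 group (all generated) passes, the t2 group (identical explicit rows) passes, and the t3 group (generated mixed with explicit) is rejected.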

Detecting that a Value will be Generated

It is necessary to detect inserts which actually generate a value for an AUTO_INCREMENT column, because if a value is provided explicitly LAST_INSERT_ID() does not return this value.

Storing and Propagating the Generated Value After Insert

The value generated for an AUTO_INCREMENT column can be obtained after the insert through

ulonglong THD::first_successful_insert_id_in_prev_stmt

This value needs to be saved and used to resolve join condition columns which depend on it. Even though the generated value is always an ulonglong it is desirable to represent it as a Json_uint and put a pointer to it into the Resolve_row, because this will make it possible to use the existing resolve_columns() function to propagate the generated value to other Resolve_rows.

To avoid a separate heap allocation for this Json_uint object, it is added as a member in Resolve_row. Since Json_uint objects are intended to be heap-allocated, they don't have a working assignment operator or any other way to update their value. One way to work around this is to use placement new to create a new Json_uint on top of the existing one. An alternative approach is to use a std::optional which can be initialised to std::nullopt, and then re-assigned to the actual Json_uint when this becomes available.
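The std::optional approach might look roughly like this; Json_uint here is a simplified stand-in that merely mimics a class whose value cannot be changed after construction, not the real server class.

```cpp
#include <cstdint>
#include <optional>

// Simplified stand-in for Json_uint: constructible, but offering no way to
// change the value after construction (assignment is deleted).
class Json_uint {
 public:
  explicit Json_uint(uint64_t v) : m_value(v) {}
  Json_uint &operator=(const Json_uint &) = delete;
  uint64_t value() const { return m_value; }

 private:
  const uint64_t m_value;
};

struct Resolve_row {
  // Starts out as nullopt; once the insert has generated a value, a fresh
  // Json_uint is emplaced. optional::emplace destroys the old object and
  // constructs a new one in place, side-stepping the deleted assignment.
  std::optional<Json_uint> generated_pk;

  void set_generated(uint64_t v) { generated_pk.emplace(v); }
};
```

std::optional::emplace performs the destroy-and-construct-in-place dance internally, so it achieves the same effect as manual placement new without the hand-written lifetime management.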

Modifications to resolve_columns()

Currently this function combines both the initial population of Resolve_row::columns from the JSON document, and the iterative propagation of values through join conditions. In order to do a re-resolve after a generated values has been added to a Resolve_row, it is necessary to split the initialization and the propagation steps into separate functions since the initialization must only happen once.