WL#11999: Add support for JSON Schema
Contents |
What is JSON Schema?
JSON Schema is a way to describe the structure of a JSON document. The JSON Schema is a JSON document in itself, and allows you to define required names/attributes, data types etc. As an example, here is an example of a JSON Schema describing a geographic coordinate (taken from json-schema.org):
{ "id": "http://json-schema.org/geo", "$schema": "http://json-schema.org/draft-06/schema#", "description": "A geographical coordinate", "type": "object", "properties": { "latitude": { "type": "number" }, "longitude": { "type": "number" } } }
The first interesting part is the "type" attribute. It says that the JSON document MUST be an object (start with '{'). Next, it says that the object may have two properties named "latitude" and "longitude". Both of these are required to be a number. So, the following JSON document is valid according to the above JSON Schema:
{ "latitude": 63.444697, "longitude": 10.445118 }
However, the following is also valid because neither "latitude" or "longitude" are required:
{}
In order to make those attributes required, we can add a required attribute to the JSON Schema:
{ "id": "http://json-schema.org/geo", "$schema": "http://json-schema.org/draft-06/schema#", "description": "A geographical coordinate", "type": "object", "properties": { "latitude": { "type": "number" }, "longitude": { "type": "number" } } "required": ["latitude", "longitude"] }
A nice feature would be to have a function that validates a JSON value against a JSON Schema:
SELECT JSON_SCHEMA_VALID('{ "id": "http://json-schema.org/geo", "$schema": "http://json-schema.org/draft-06/schema#", "description": "A geographical coordinate", "type": "object", "properties": { "latitude": { "type": "number", "minimum": -90, "maximum": 90 }, "longitude": { "type": "number", "minimum": -180, "maximum": 180 } }, "required": ["latitude", "longitude"] }','{ "latitude": 63.444697, "longitude": 10.445118 }'); => Returns "true"
SELECT JSON_SCHEMA_VALID('{ "id": "http://json-schema.org/geo", "$schema": "http://json-schema.org/draft-06/schema#", "description": "A geographical coordinate", "type": "object", "properties": { "latitude": { "type": "number", "minimum": -90, "maximum": 90 }, "longitude": { "type": "number", "minimum": -180, "maximum": 180 } }, "required": ["latitude", "longitude"] }','{}'); => Returns "false" (missing required properties)
rapidjson (which we already use) supports validation of the JSON Schema Draft v4: https://github.com/Tencent/rapidjson/blob/master/doc/schema.md
Other DBMS
As far as we know, MongoDB is the only DBMS that has native support for JSON Schema validation.
MongoDB
Starting from version 3.6, MongoDB also supports JSON Schema (version 4) validation. It's is defined by passing a validator to the createCollection command:
db.createCollection("coordinates", { validator: { "$jsonSchema": { "bsonType": "object", "properties": { "latitude": { "type": "number", "minimum": -90, "maximum": 90 }, "longitude": { "type": "number", "minimum": -180, "maximum": 180 } }, "required": [ "latitude", "longitude" ] } } })
MongoDB does not support the $ref keyword, which is a pointer to an external JSON Schema (local or remote).
PostgreSQL
There is no native support for JSON schema, but there are extensions that can be used as CHECK constraints (https://github.com/gavinwahl/postgres-json-schema):
CREATE TABLE example (id serial PRIMARY KEY, data jsonb); ALTER TABLE example ADD CONSTRAINT data_is_valid CHECK ( validate_json_schema('{"type": "object"}', data));
The above CHECK constraint enforces that every JSON value stored in the column data MUST be a JSON object. Remote references are not supported.
Possible problems
-
The JSON Schema has a keyword $ref that is a pointer to an external JSON Schema (local or remote). To fully support this, we would need to be able to fetch data from the internet. This may cause security issues, and it might be wise to not support this keyword.
-
For strings it's possible to specify a regex pattern. rapidjson gives us two regex engines to choose between: their internal home-made and std::regex. MySQL however is now using ICU, and it would perhaps be good if we used the same regex engine in both JSON Schema and rest of the server.
F-1: The server MUST accept the new function <bool> = JSON_SCHEMA_VALID(<json>, <json>)
F-2: The first argument to JSON_SCHEMA_VALID MUST be the JSON Schema definition, and the second argument MUST be the JSON document the user wants to validate.
F-3: The function JSON_SCHEMA_VALID MUST return SQL NULL if any of it's arguments are SQL NULL.
F-4: If the first argument given to JSON_SCHEMA_VALID isn't a valid JSON, the server MUST return the error message ER_INVALID_TYPE_FOR_JSON
F-5: If the second argument given to JSON_SCHEMA_VALID isn't a valid JSON, the server MUST return the error message ER_INVALID_TYPE_FOR_JSON
F-6: If the second argument given to JSON_SCHEMA_VALID is a valid JSON according to the JSON Schema given in the first argument, the function MUST return TRUE/1
F-7: If the second argument given to JSON_SCHEMA_VALID isn't a valid JSON according to the JSON Schema given in the first argument, the function MUST return FALSE/0
F-8: If the provided JSON Schema contains a pointer to an external JSON Schema (local or remote), the server MUST return the error ER_NOT_SUPPORTED_YET.
F-9: If the first argument is a valid JSON but not a JSON object, the server MUST return the error ER_INVALID_JSON_TYPE
Contents |
Syntax
This worklog will implement one new SQL function:
<bool> = JSON_SCHEMA_VALID(<json>, <json>);
JSON_SCHEMA_VALID takes in a JSON Schema as the first argument and a JSON value as the second argument. If the second argument is considered valid according to the JSON Schema provided in the first argument, the function returns true. Otherwise it returns false.
The function returns SQL NULL if any of its arguments are SQL NULL.
rapidjson
JSON Schema version
Since we already are using rapidjson in the server, we will use the JSON Schema validator that they provides. That means that we are forced to support whatever reason of the JSON Schema draft they support. Currently, it's JSON Schema draft 4 (https://json-schema.org/specification-links.html#draft-4). Note that the current draft is version 7, so this will not be the bleeding edge.
Regex
JSON Schema has support for specifying regex patterns for strings, and rapidjson provides two regex engines for us to use: std::regex and their home-made regex engine. We have also considered adding a layer between rapidjson and ICU so that it would be possible to use the ICU engine that the MySQL server is using, but it has been decided not to do that.
- We need to support -DWITH_RAPIDJSON=system , so modifying the rapidjson source code would break this.
- The regex engine should be as consistent as possible across different platforms. Using the rapidjson home-made engine would ensure that we are using one implementation as opposed to using std::regex.
- std::regex seems to generally support more patterns, while rapidjson supports the most common ones.
We have seen regex patterns that rapidjson doesn't handle well (stack overflow), so for those reasons we will use std::regex.
Note that invalid regex patterns are silently ignored by rapidjson, so any invalid regex patters will be removed from the validation. For instance, the below query will return "true/1" even if the provided regex is invalid:
SELECT JSON_SCHEMA_VALID('{"type":"string","pattern":"("}', '"abc"');
External JSON Schema
The JSON Schema has a keyword $ref that is a pointer to an external JSON Schema (local or remote). To fully support this, we would need to be able to fetch data from the internet. As this may cause security issues and a number of other potential problems, we have decided to not support the $ref keyword.
Since rapidjson provides support for JSON Schema validation, we will only implement a thin layer in MySQL to call the necessary rapidjson functions.
New classes and functions
item_json_func.h/cc
Item_func_json_schema_valid represents the SQL function <bool> = JSON_SCHEMA_VALID(<json>, <json>)
class Item_func_json_schema_valid final : public Item_bool_func
json_schema.h/cc
MyRemoteSchemaDocumentProvider acts as a handler/callback for the JSON schema validator and it is called whenever a schema reference is encountered in the JSON document. Since MySQL doesn't support schema references, this class is only used to detect whether or not we actually found one in the JSON document.
class My_remote_schema_document_provider : public rapidjson::IRemoteSchemaDocumentProvider
IsValidJsonSchema takes in a JSON Schema string and a JSON value to be validated. It returns true on error, and false if the validation went OK. The result of the validation can be found in the ouput parameter "is_valid". The function will only set up the necessary rapidjson objects, instantiated with the MyRemoteSchemaDocumentProvider. The actual validation of the JSON value will be done by rapidjson.
bool is_valid_json_schema(const char *json_schema_str, size_t json_schema_length, const char *document_str, size_t document_length, bool *is_valid);