Run the ML_TRAIN
routine to
create an anomaly detection model.
If running an unsupervised model, the
target_column_name
parameter must be set to
NULL
.
If running a semi-supervised model, the
target_column_name
parameter must specify a
column whose only allowed values are 0, 1, and NULL. All rows
will be used to train the unsupervised component, while the
rows with a value different than NULL will be used to train
the supervised component.
Use the following JSON
options
:
task
: This must be set toanomaly_detection
.-
contamination
: A new option which represents an estimate of the percentage of outliers in the training table.The contamination factor is calculated as: estimated number of rows with anomalies/total number of rows in the training table.
The contamination value must be greater than 0 and less than 0.5. The default value is 0.01.
As of MySQL 8.4.0,
model_list
is supported to allow the selection of the Principal Component Analysis (PCA) model and Generalized Local Outlier Factor (GLOF) model. If no option is specified, the default model is Generalized kth Nearest Neighbors (GkNN). Selecting more than one model or an unsupported model produces an error.
MySQL 9.0.1-u1 introduces the following options to train a semi-supervised anomaly detection model:
experimental
: Must be set tosemisupervised
to enable semi-supervised training.-
supervised_submodel_options
: Allows you to set optional override parameters for the supervised model component. The only model supported isDistanceWeightedKNNClassifier
. The following parameters are supported:n_neighbors
: Sets the desired k value that checks the k closest neighbors for each unclassified point. The default value is 5 and the value must be an integer greater than 0.min_labels
: Sets the minimum number of labeled data points required to train the supervised component. If fewer labeled data points are provided during training of the model,ML_TRAIN
fails. The default value is 20 and the value must be an integer greater than 0.
ensemble_score
: This option specifies the metric to use to score the ensemble of unsupervised and supervised components. It identifies the optimal weight between the two components based on the metric. The supported metrics areaccuracy
,precision
,recall
, andf1
. The default metric isf1
.
The following options are not supported for anomaly detection:
exclude_model_list
optimization_metric
Before MySQL 8.4.0,
model_list
is not supported because the only supported algorithm model is Generalized kth Nearest Neighbors (GkNN). As of MySQL 8.4.0,model_list
is supported to allow the selection of the Principal Component Analysis (PCA) model and Generalized Local Outlier Factor (GLOF) model.
See Section 3.5, “Training a Model”, and for full details
of all the options
, see
ML_TRAIN
.
-
An
ML_TRAIN
example that specifies theanomaly_detection
task type:mysql> CALL sys.ML_TRAIN('mlcorpus_anomaly_detection.volcanoes-b3_anomaly_train', NULL, JSON_OBJECT('task', 'anomaly_detection', 'exclude_column_list', JSON_ARRAY('target')), @anomaly); Query OK, 0 rows affected (46.59 sec)
-
An
ML_TRAIN
example that specifies theanomaly_detection
task with acontamination
option. Access the model catalog metadata to check the value of thecontamination
option.mysql> CALL sys.ML_TRAIN('mlcorpus_anomaly_detection.volcanoes-b3_anomaly_train', NULL, JSON_OBJECT('task', 'anomaly_detection', 'contamination', 0.013, 'exclude_column_list', JSON_ARRAY('target')), @anomaly_with_contamination); Query OK, 0 rows affected (50.22 sec) mysql> SELECT JSON_EXTRACT(model_metadata, '$.contamination') FROM ML_SCHEMA_root.MODEL_CATALOG WHERE model_handle = @anomaly_with_contamination; +-------------------------------------------------+ | JSON_EXTRACT(model_metadata, '$.contamination') | +-------------------------------------------------+ | 0.013000000268220901 | +-------------------------------------------------+ 1 row in set (0.00 sec)
-
An
ML_TRAIN
example that enables semi-supervised learning using all defaults. Thetarget_column_name
is set totarget
. Theexperimental
option is set tosemisupervised
.mysql> CALL sys.ML_TRAIN('mlcorpus.anomaly_train_with_partial_target', "target", CAST('{"task": "anomaly_detection", "experimental": {"semisupervised": {}}}' as JSON), @semisupervised_model);
-
An
ML_TRAIN
example that enables semi-supervised learning with additional options.mysql> CALL sys.ML_TRAIN('mlcorpus.`anomaly_train_with_partial_target`', "target", CAST('{"task": "anomaly_detection", "experimental": {"semisupervised": {"supervised_submodel_options": {""min_labels": 10, "n_neighbors": 3}, "ensemble_score": "recall"}}}' as JSON), @semisupervised_model_options);
Where:
The
supervised_submodel_options
parametermin_labels
is set to 10.The
supervised_submodel_options
parametern_neighbors
is set to 3.The
ensemble_score
option is set to therecall
metric.
-
An
ML_TRAIN
example that selects the PCA algorithm model.mysql> CALL sys.ML_TRAIN('mlcorpus_anomaly_detection_v1.` volcanoes-b3_anomaly_train`', NULL, JSON_OBJECT('task', 'anomaly_detection', 'exclude_column_list', JSON_ARRAY('target'), 'model_list', JSON_ARRAY('PCA')), @anomaly_pca);
-
An
ML_TRAIN
example that selects the GLOF algorithm model.mysql> CALL sys.ML_TRAIN('mlcorpus_anomaly_detection_v1. `volcanoes-b3_anomaly_train`', NULL, JSON_OBJECT('task', 'anomaly_detection', 'exclude_column_list', JSON_ARRAY('target'), 'model_list', JSON_ARRAY('GLOF')), @anomaly_glof);
-
An
ML_TRAIN
example that does not specify an algorithm model for themodel_list
option. If no model is specified, the default model GkNN is used.mysql> CALL sys.ML_TRAIN('mlcorpus_anomaly_detection_v1. `volcanoes-b3_anomaly_train`', NULL, JSON_OBJECT('task', 'anomaly_detection', 'exclude_column_list', JSON_ARRAY('target'), 'model_list', JSON_ARRAY()), @anomaly_empty_list);