Run the ML_TRAIN
routine to
create an anomaly detection model.
Consider the following based on the type of anomaly detection you are running:
Set the
task
parameter toanomaly_detection
for running anomaly detection on table data, orlog_anomaly_detection
for running anomaly detection on log data (MySQL 9.2.2 and later).If running an unsupervised model, the
target_column_name
parameter must be set toNULL
.-
If running a semi-supervised model:
The
target_column_name
parameter must specify a column whose only allowed values are 0 (normal), 1 (anomalous), and NULL (unlabeled). All rows will be used to train the unsupervised component, while the rows with a value different than NULL will be used to train the supervised component.The
experimental
option must be set tosemisupervised
.
-
If running anomaly detection on log data (MySQL 9.2.2 and later), the input table can only have the following columns:
The column containing the logs.
If including logs from different sources, a column containing the source of each log. Identify this column with the
log_source_column
option.If including labeled data, a column identifying the labeled log lines. See Semi-supervised Anomaly Detection to learn more.
If the input table has additional columns to the ones permitted, you must use the
exclude_column_list
option to exclude irrelevant columns.
Use the following JSON
options
:
-
contamination
: Represents an estimate of the percentage of outliers in the training table.The contamination factor is calculated as: estimated number of rows with anomalies/total number of rows in the training table.
The contamination value must be greater than 0 and less than 0.5. The default value is 0.01.
As of MySQL 8.4.0,
model_list
is supported to allow the selection of the Principal Component Analysis (PCA) model and Generalized Local Outlier Factor (GLOF) model. If no option is specified, the default model is Generalized kth Nearest Neighbors (GkNN). Selecting more than one model or an unsupported model produces an error.
MySQL 9.0.1-u1 introduces the following options to train a semi-supervised anomaly detection model:
-
supervised_submodel_options
: Allows you to set optional override parameters for the supervised model component. The only model supported isDistanceWeightedKNNClassifier
. The following parameters are supported:n_neighbors
: Sets the desired k value that checks the k closest neighbors for each unclassified point. The default value is 5 and the value must be an integer greater than 0.min_labels
: Sets the minimum number of labeled data points required to train the supervised component. If fewer labeled data points are provided during training of the model,ML_TRAIN
fails. The default value is 20 and the value must be an integer greater than 0.
ensemble_score
: This option specifies the metric to use to score the ensemble of unsupervised and supervised components. It identifies the optimal weight between the two components based on the metric. The supported metrics areaccuracy
,precision
,recall
, andf1
. The default metric isf1
.
MySQL 9.2.2 introduces anomaly detection for log data. The
following options are available as a separate
JSON_OBJECT
named
logad_options
:
-
additional_masking_regex
: Allows you to mask log data by using regular expression in aJSON_ARRAY
. By default, the following parameters are automatically masked during training and when generating anomaly scores.IP
DATETIME
TIME
HEX
IPPORT
OCID
window_size
: Specifies the maximum number of log lines to be grouped for anomaly detection. The default value is 10.window_stride
: Specifies the stride value to use for segmenting log lines. For example, there is log A, B, C, D, and E. Thewindow_size
is 3, and thewindow_stride
is 2. The first row has log A, B, and C. The second row has log C, D, and E. If this value is equal towindow_size
, there is no overlapping of log segments. The default value is 3.log_source_column
: Specifies the column name that contains the source identifier of the respective log lines. Log lines are grouped according to their respective source (for example, logs from multiple MySQL databases that are in the same table). By default, all log lines are assumed to be from the same source.
The following options are not supported for anomaly detection:
exclude_model_list
optimization_metric
Before MySQL 8.4.0,
model_list
is not supported because the only supported algorithm model is Generalized kth Nearest Neighbors (GkNN). As of MySQL 8.4.0,model_list
is supported to allow the selection of the Principal Component Analysis (PCA) model and Generalized Local Outlier Factor (GLOF) model.
See Section 3.5, “Training a Model”, and for full
details of all the options
, see
ML_TRAIN
.
-
An
ML_TRAIN
example that specifies theanomaly_detection
task type:Press CTRL+C to copymysql> CALL sys.ML_TRAIN('mlcorpus_anomaly_detection.volcanoes-b3_anomaly_train', NULL, JSON_OBJECT('task', 'anomaly_detection', 'exclude_column_list', JSON_ARRAY('target')), @anomaly); Query OK, 0 rows affected (46.59 sec)
-
An
ML_TRAIN
example that specifies theanomaly_detection
task with acontamination
option. Access the model catalog metadata to check the value of thecontamination
option.Press CTRL+C to copymysql> CALL sys.ML_TRAIN('mlcorpus_anomaly_detection.volcanoes-b3_anomaly_train', NULL, JSON_OBJECT('task', 'anomaly_detection', 'contamination', 0.013, 'exclude_column_list', JSON_ARRAY('target')), @anomaly_with_contamination); Query OK, 0 rows affected (50.22 sec) mysql> SELECT JSON_EXTRACT(model_metadata, '$.contamination') FROM ML_SCHEMA_root.MODEL_CATALOG WHERE model_handle = @anomaly_with_contamination; +-------------------------------------------------+ | JSON_EXTRACT(model_metadata, '$.contamination') | +-------------------------------------------------+ | 0.013000000268220901 | +-------------------------------------------------+ 1 row in set (0.00 sec)
-
An
ML_TRAIN
example that enables semi-supervised learning using all defaults. Thetarget_column_name
is set totarget
. Theexperimental
option is set tosemisupervised
.Press CTRL+C to copymysql> CALL sys.ML_TRAIN('mlcorpus.anomaly_train_with_partial_target', "target", CAST('{"task": "anomaly_detection", "experimental": {"semisupervised": {}}}' as JSON), @semisupervised_model);
-
An
ML_TRAIN
example that enables semi-supervised learning with additional options.Press CTRL+C to copymysql> CALL sys.ML_TRAIN('mlcorpus.`anomaly_train_with_partial_target`', "target", CAST('{"task": "anomaly_detection", "experimental": {"semisupervised": {"supervised_submodel_options": {""min_labels": 10, "n_neighbors": 3}, "ensemble_score": "recall"}}}' as JSON), @semisupervised_model_options);
Where:
The
supervised_submodel_options
parametermin_labels
is set to 10.The
supervised_submodel_options
parametern_neighbors
is set to 3.The
ensemble_score
option is set to therecall
metric.
-
An
ML_TRAIN
example that selects the PCA algorithm model.Press CTRL+C to copymysql> CALL sys.ML_TRAIN('mlcorpus_anomaly_detection_v1.`volcanoes-b3_anomaly_train`', NULL, JSON_OBJECT('task', 'anomaly_detection', 'exclude_column_list', JSON_ARRAY('target'), 'model_list', JSON_ARRAY('PCA')), @anomaly_pca);
-
An
ML_TRAIN
example that selects the GLOF algorithm model.Press CTRL+C to copymysql> CALL sys.ML_TRAIN('mlcorpus_anomaly_detection_v1.`volcanoes-b3_anomaly_train`', NULL, JSON_OBJECT('task', 'anomaly_detection', 'exclude_column_list', JSON_ARRAY('target'), 'model_list', JSON_ARRAY('GLOF')), @anomaly_glof);
-
An
ML_TRAIN
example that does not specify an algorithm model for themodel_list
option. If no model is specified, the default model GkNN is used.Press CTRL+C to copymysql> CALL sys.ML_TRAIN('mlcorpus_anomaly_detection_v1.`volcanoes-b3_anomaly_train`', NULL, JSON_OBJECT('task', 'anomaly_detection', 'exclude_column_list', JSON_ARRAY('target'), 'model_list', JSON_ARRAY()), @anomaly_empty_list);
-
An
ML_TRAIN
example that uses available default values.Press CTRL+C to copymysql> CALL sys.ML_TRAIN('mlcorpus.`log_anomaly_just_patterns`', NULL, JSON_OBJECT('task', 'log_anomaly_detection'), @logad_model);
-
An
ML_TRAIN
example that selects the PCA algorithm model.Press CTRL+C to copymysql> CALL sys.ML_TRAIN('mlcorpus.`log_anomaly_just_patterns`', NULL, JSON_OBJECT('task', 'log_anomaly_detection', 'model_list', JSON_ARRAY('PCA')), @logad_model);
-
An
ML_TRAIN
example that masks log data with theadditional_masking_regex
option. In this example, in addition to the default parameters that are automatically masked, email addresses from Yahoo, Hotmail, and Gmail are also masked. Thelog_source_column
option is also included, which specifies the column that identifies the respective source of the log line.Press CTRL+C to copymysql> CALL sys.ML_TRAIN('mlcorpus.`log_anomaly_sourced`', NULL, JSON_OBJECT('task', 'log_anomaly_detection', 'logad_options', JSON_OBJECT('additional_masking_regex', JSON_ARRAY('(\W|^)[\w.\-]{0,25}@(yahoo|hotmail|gmail)\.com(\W|$)'), 'log_source_column', 'source')), @log_anomaly_us);
-
An
ML_TRAIN
example that sets semi-supervised learning for training the log data. The window size is also set to a value of 4, and the window stride is set to 1.Press CTRL+C to copymysql> CALL sys.ML_TRAIN('mlcorpus.`log_anomaly_semi`', "label", JSON_OBJECT('task', 'log_anomaly_detection', 'logad_options', JSON_OBJECT('window_size', 4, 'window_stride', 1), "experimental", JSON_OBJECT("semisupervised", JSON_OBJECT("supervised_submodel_options", JSON_OBJECT("min_labels", 10)))), @log_anomaly_us);