MySQL :: HeatWave User Guide :: 3.10.4 Training an Anomaly Detection Model

Requirements for Anomaly Detection Training

Consider the following based on the type of anomaly detection you are running:

Set the task parameter to anomaly_detection for running anomaly detection on table data, or log_anomaly_detection for running anomaly detection on log data (MySQL 9.2.2 and later).
If running an unsupervised model, the target_column_name parameter must be set to NULL.
If running a semi-supervised model:
- The target_column_name parameter must specify a column whose only allowed values are 0 (normal), 1 (anomalous), and NULL (unlabeled). All rows will be used to train the unsupervised component, while the rows with a value different than NULL will be used to train the supervised component.
- The experimental option must be set to semisupervised.
If running anomaly detection on log data (MySQL 9.2.2 and later), the input table can only have the following columns:
- The column containing the logs.
- If including logs from different sources, a column containing the source of each log. Identify this column with the log_source_column option.
- If including labeled data, a column identifying the labeled log lines. See Semi-supervised Anomaly Detection to learn more.
- If the input table has additional columns to the ones permitted, you must use the exclude_column_list option to exclude irrelevant columns.

Anomaly Detection Options

Use the following JSON options:

contamination: Represents an estimate of the percentage of outliers in the training table.
- The contamination factor is calculated as: estimated number of rows with anomalies/total number of rows in the training table.
- The contamination value must be greater than 0 and less than 0.5. The default value is 0.01.
As of MySQL 8.4.0, model_list is supported to allow the selection of the Principal Component Analysis (PCA) model and Generalized Local Outlier Factor (GLOF) model. If no option is specified, the default model is Generalized kth Nearest Neighbors (GkNN). Selecting more than one model or an unsupported model produces an error.

Semi-supervised Learning Options

MySQL 9.0.1-u1 introduces the following options to train a semi-supervised anomaly detection model:

supervised_submodel_options: Allows you to set optional override parameters for the supervised model component. The only model supported is DistanceWeightedKNNClassifier. The following parameters are supported:
- n_neighbors: Sets the desired k value that checks the k closest neighbors for each unclassified point. The default value is 5 and the value must be an integer greater than 0.
- min_labels: Sets the minimum number of labeled data points required to train the supervised component. If fewer labeled data points are provided during training of the model, ML_TRAIN fails. The default value is 20 and the value must be an integer greater than 0.
ensemble_score: This option specifies the metric to use to score the ensemble of unsupervised and supervised components. It identifies the optimal weight between the two components based on the metric. The supported metrics are accuracy, precision, recall, and f1. The default metric is f1.

Log Anomaly Detection Options

MySQL 9.2.2 introduces anomaly detection for log data. The following options are available as a separate JSON_OBJECT named logad_options:

additional_masking_regex: Allows you to mask log data by using regular expression in a JSON_ARRAY. By default, the following parameters are automatically masked during training and when generating anomaly scores.
- IP
- DATETIME
- TIME
- HEX
- IPPORT
- OCID
window_size: Specifies the maximum number of log lines to be grouped for anomaly detection. The default value is 10.
window_stride: Specifies the stride value to use for segmenting log lines. For example, there is log A, B, C, D, and E. The window_size is 3, and the window_stride is 2. The first row has log A, B, and C. The second row has log C, D, and E. If this value is equal to window_size, there is no overlapping of log segments. The default value is 3.
log_source_column: Specifies the column name that contains the source identifier of the respective log lines. Log lines are grouped according to their respective source (for example, logs from multiple MySQL databases that are in the same table). By default, all log lines are assumed to be from the same source.

Unsupported Anomaly Detection Options

The following options are not supported for anomaly detection:

exclude_model_list
optimization_metric
Before MySQL 8.4.0, model_list is not supported because the only supported algorithm model is Generalized kth Nearest Neighbors (GkNN). As of MySQL 8.4.0, model_list is supported to allow the selection of the Principal Component Analysis (PCA) model and Generalized Local Outlier Factor (GLOF) model.

See Section 3.5, “Training a Model”, and for full details of all the options, see ML_TRAIN.

Syntax Examples for Unsupervised Learning

An ML_TRAIN example that specifies the anomaly_detection task type:

Press CTRL+C to copy
 
mysql> CALL sys.ML_TRAIN('mlcorpus_anomaly_detection.volcanoes-b3_anomaly_train', NULL, 
JSON_OBJECT('task', 'anomaly_detection', 'exclude_column_list', 
JSON_ARRAY('target')), @anomaly);
Query OK, 0 rows affected (46.59 sec)

An ML_TRAIN example that specifies the anomaly_detection task with a contamination option. Access the model catalog metadata to check the value of the contamination option.

Press CTRL+C to copy
 
mysql> CALL sys.ML_TRAIN('mlcorpus_anomaly_detection.volcanoes-b3_anomaly_train', NULL, 
JSON_OBJECT('task', 'anomaly_detection', 'contamination', 0.013, 'exclude_column_list', 
JSON_ARRAY('target')), @anomaly_with_contamination);
Query OK, 0 rows affected (50.22 sec)

mysql> SELECT JSON_EXTRACT(model_metadata, '$.contamination') 
          FROM ML_SCHEMA_root.MODEL_CATALOG 
          WHERE model_handle = @anomaly_with_contamination;
+-------------------------------------------------+
| JSON_EXTRACT(model_metadata, '$.contamination') |
+-------------------------------------------------+
| 0.013000000268220901                            |
+-------------------------------------------------+
1 row in set (0.00 sec)

Syntax Examples for Semi-Supervised Learning

An ML_TRAIN example that enables semi-supervised learning using all defaults. The target_column_name is set to target. The experimental option is set to semisupervised.

Press CTRL+C to copy
 
mysql> CALL sys.ML_TRAIN('mlcorpus.anomaly_train_with_partial_target', "target", 
CAST('{"task": "anomaly_detection", "experimental": {"semisupervised": {}}}' 
as JSON), @semisupervised_model);

An ML_TRAIN example that enables semi-supervised learning with additional options.
```
Press CTRL+C to copy
 
mysql> CALL sys.ML_TRAIN('mlcorpus.`anomaly_train_with_partial_target`',
 "target", CAST('{"task": "anomaly_detection", "experimental": {"semisupervised":
 {"supervised_submodel_options": {""min_labels": 10, "n_neighbors": 3}, 
 "ensemble_score": "recall"}}}' as JSON), @semisupervised_model_options);
```
Where:
- The supervised_submodel_options parameter min_labels is set to 10.
- The supervised_submodel_options parameter n_neighbors is set to 3.
- The ensemble_score option is set to the recall metric.

Syntax Examples for Model Selection

An ML_TRAIN example that selects the PCA algorithm model.

Press CTRL+C to copy
 
mysql> CALL sys.ML_TRAIN('mlcorpus_anomaly_detection_v1.`volcanoes-b3_anomaly_train`', NULL, 
JSON_OBJECT('task', 'anomaly_detection', 'exclude_column_list', 
JSON_ARRAY('target'), 'model_list', JSON_ARRAY('PCA')), @anomaly_pca);

An ML_TRAIN example that selects the GLOF algorithm model.

Press CTRL+C to copy
 
mysql> CALL sys.ML_TRAIN('mlcorpus_anomaly_detection_v1.`volcanoes-b3_anomaly_train`', NULL, 
JSON_OBJECT('task', 'anomaly_detection', 'exclude_column_list', 
JSON_ARRAY('target'), 'model_list', JSON_ARRAY('GLOF')), @anomaly_glof);

An ML_TRAIN example that does not specify an algorithm model for the model_list option. If no model is specified, the default model GkNN is used.

Press CTRL+C to copy
 
mysql> CALL sys.ML_TRAIN('mlcorpus_anomaly_detection_v1.`volcanoes-b3_anomaly_train`', NULL, 
JSON_OBJECT('task', 'anomaly_detection', 'exclude_column_list', 
JSON_ARRAY('target'), 'model_list', JSON_ARRAY()), @anomaly_empty_list);

Syntax Examples for Anomaly Detection on Log Data

An ML_TRAIN example that uses available default values.

Press CTRL+C to copy
 
mysql> CALL sys.ML_TRAIN('mlcorpus.`log_anomaly_just_patterns`', NULL, 
JSON_OBJECT('task', 'log_anomaly_detection'), @logad_model);

An ML_TRAIN example that selects the PCA algorithm model.

Press CTRL+C to copy
 
mysql> CALL sys.ML_TRAIN('mlcorpus.`log_anomaly_just_patterns`', NULL, 
JSON_OBJECT('task', 'log_anomaly_detection', 'model_list', 
JSON_ARRAY('PCA')), @logad_model);

An ML_TRAIN example that masks log data with the additional_masking_regex option. In this example, in addition to the default parameters that are automatically masked, email addresses from Yahoo, Hotmail, and Gmail are also masked. The log_source_column option is also included, which specifies the column that identifies the respective source of the log line.
```
Press CTRL+C to copy
 
mysql> CALL sys.ML_TRAIN('mlcorpus.`log_anomaly_sourced`', NULL, 
JSON_OBJECT('task', 'log_anomaly_detection', 'logad_options', 
JSON_OBJECT('additional_masking_regex', 
JSON_ARRAY('(\W|^)[\w.\-]{0,25}@(yahoo|hotmail|gmail)\.com(\W|$)'), 'log_source_column', 'source')), @log_anomaly_us);
```

An ML_TRAIN example that sets semi-supervised learning for training the log data. The window size is also set to a value of 4, and the window stride is set to 1.

Press CTRL+C to copy
 
mysql> CALL sys.ML_TRAIN('mlcorpus.`log_anomaly_semi`', "label", 
JSON_OBJECT('task', 'log_anomaly_detection', 'logad_options', 
JSON_OBJECT('window_size', 4, 'window_stride', 1), "experimental", 
JSON_OBJECT("semisupervised", JSON_OBJECT("supervised_submodel_options", 
JSON_OBJECT("min_labels", 10)))), @log_anomaly_us);