Documentation Home
HeatWave User Guide
Related Documentation Download this Manual
PDF (US Ltr) - 2.1Mb
PDF (A4) - 2.1Mb


HeatWave User Guide  /  ...  /  Training an Anomaly Detection Model

3.10.2 Training an Anomaly Detection Model

Run the ML_TRAIN routine to create an anomaly detection model.

If running an unsupervised model, the target_column_name parameter must be set to NULL.

If running a semi-supervised model, the target_column_name parameter must specify a column whose only allowed values are 0, 1, and NULL. All rows will be used to train the unsupervised component, while the rows with a value different than NULL will be used to train the supervised component.

Use the following JSON options:

  • task: This must be set to anomaly_detection.

  • contamination: A new option which represents an estimate of the percentage of outliers in the training table.

    • The contamination factor is calculated as: estimated number of rows with anomalies/total number of rows in the training table.

    • The contamination value must be greater than 0 and less than 0.5. The default value is 0.01.

  • As of MySQL 8.4.0, model_list is supported to allow the selection of the Principal Component Analysis (PCA) model and Generalized Local Outlier Factor (GLOF) model. If no option is specified, the default model is Generalized kth Nearest Neighbors (GkNN). Selecting more than one model or an unsupported model produces an error.

MySQL 9.0.1-u1 introduces the following options to train a semi-supervised anomaly detection model:

  • experimental: Must be set to semisupervised to enable semi-supervised training.

  • supervised_submodel_options: Allows you to set optional override parameters for the supervised model component. The only model supported is DistanceWeightedKNNClassifier. The following parameters are supported:

    • n_neighbors: Sets the desired k value that checks the k closest neighbors for each unclassified point. The default value is 5 and the value must be an integer greater than 0.

    • min_labels: Sets the minimum number of labeled data points required to train the supervised component. If fewer labeled data points are provided during training of the model, ML_TRAIN fails. The default value is 20 and the value must be an integer greater than 0.

  • ensemble_score: This option specifies the metric to use to score the ensemble of unsupervised and supervised components. It identifies the optimal weight between the two components based on the metric. The supported metrics are accuracy, precision, recall, and f1. The default metric is f1.

The following options are not supported for anomaly detection:

  • exclude_model_list

  • optimization_metric

  • Before MySQL 8.4.0, model_list is not supported because the only supported algorithm model is Generalized kth Nearest Neighbors (GkNN). As of MySQL 8.4.0, model_list is supported to allow the selection of the Principal Component Analysis (PCA) model and Generalized Local Outlier Factor (GLOF) model.

See Section 3.5, “Training a Model”, and for full details of all the options, see ML_TRAIN.

Syntax Examples for Unsupervised Learning

  • An ML_TRAIN example that specifies the anomaly_detection task type:

    mysql> CALL sys.ML_TRAIN('mlcorpus_anomaly_detection.volcanoes-b3_anomaly_train', 
              NULL, JSON_OBJECT('task', 'anomaly_detection', 
              'exclude_column_list', JSON_ARRAY('target')), 
              @anomaly);
    Query OK, 0 rows affected (46.59 sec)
  • An ML_TRAIN example that specifies the anomaly_detection task with a contamination option. Access the model catalog metadata to check the value of the contamination option.

    mysql> CALL sys.ML_TRAIN('mlcorpus_anomaly_detection.volcanoes-b3_anomaly_train', 
              NULL, JSON_OBJECT('task', 'anomaly_detection', 'contamination', 0.013, 
              'exclude_column_list', JSON_ARRAY('target')), 
              @anomaly_with_contamination);
    Query OK, 0 rows affected (50.22 sec)
    
    mysql> SELECT JSON_EXTRACT(model_metadata, '$.contamination') 
              FROM ML_SCHEMA_root.MODEL_CATALOG 
              WHERE model_handle = @anomaly_with_contamination;
    +-------------------------------------------------+
    | JSON_EXTRACT(model_metadata, '$.contamination') |
    +-------------------------------------------------+
    | 0.013000000268220901                            |
    +-------------------------------------------------+
    1 row in set (0.00 sec)

Syntax Examples for Semi-Supervised Learning

  • An ML_TRAIN example that enables semi-supervised learning using all defaults. The target_column_name is set to target. The experimental option is set to semisupervised.

    mysql> CALL sys.ML_TRAIN('mlcorpus.anomaly_train_with_partial_target', "target", 
    CAST('{"task": "anomaly_detection", "experimental": {"semisupervised": {}}}' 
    as JSON), @semisupervised_model);
  • An ML_TRAIN example that enables semi-supervised learning with additional options.

    mysql> CALL sys.ML_TRAIN('mlcorpus.`anomaly_train_with_partial_target`',
     "target", CAST('{"task": "anomaly_detection", "experimental": {"semisupervised":
     {"supervised_submodel_options": {""min_labels": 10, "n_neighbors": 3}, 
     "ensemble_score": "recall"}}}' as JSON), @semisupervised_model_options);

    Where:

    • The supervised_submodel_options parameter min_labels is set to 10.

    • The supervised_submodel_options parameter n_neighbors is set to 3.

    • The ensemble_score option is set to the recall metric.

Syntax Examples for Model Selection

  • An ML_TRAIN example that selects the PCA algorithm model.

    mysql> CALL sys.ML_TRAIN('mlcorpus_anomaly_detection_v1.`
    volcanoes-b3_anomaly_train`', NULL, 
    JSON_OBJECT('task', 'anomaly_detection', 'exclude_column_list', 
    JSON_ARRAY('target'), 'model_list', JSON_ARRAY('PCA')), @anomaly_pca);
  • An ML_TRAIN example that selects the GLOF algorithm model.

    mysql> CALL sys.ML_TRAIN('mlcorpus_anomaly_detection_v1.
    `volcanoes-b3_anomaly_train`', NULL, 
    JSON_OBJECT('task', 'anomaly_detection', 'exclude_column_list', 
    JSON_ARRAY('target'), 'model_list', JSON_ARRAY('GLOF')), @anomaly_glof);
  • An ML_TRAIN example that does not specify an algorithm model for the model_list option. If no model is specified, the default model GkNN is used.

    mysql> CALL sys.ML_TRAIN('mlcorpus_anomaly_detection_v1.
    `volcanoes-b3_anomaly_train`', NULL, 
    JSON_OBJECT('task', 'anomaly_detection', 'exclude_column_list', 
    JSON_ARRAY('target'), 'model_list', JSON_ARRAY()), @anomaly_empty_list);