MySQL :: MySQL AI 9.6 :: 4.7.4.5 Training an Anomaly Detection Model

Before You Begin

Review and complete all the tasks to Prepare Data for an Anomaly Detection Model.

Requirements for Anomaly Detection Training

Consider the following based on the type of anomaly detection you are running:

Set the task parameter to anomaly_detection for running anomaly detection on table data, or log_anomaly_detection for running anomaly detection on log data.
If running an unsupervised model, the target_column_name parameter must be set to NULL.
If running a semi-supervised model:
- The target_column_name parameter must specify a column whose only allowed values are 0 (normal), 1 (anomalous), and NULL (unlabeled). All rows are used to train the unsupervised component, while the rows with a value different than NULL are used to train the supervised component.
- The experimental option must be set to semisupervised.
If running anomaly detection on log data, the input table can only have the following columns:
- The column containing the logs.
- If including logs from different sources, a column containing the source of each log. Identify this column with the log_source_column option.
- If including labeled data, a column identifying the labeled log lines. See Semi-supervised Anomaly Detection to learn more.
- At least one column must act as the primary key to establish the temporal order of logs. If the primary key column (or columns) is not one of the previous required columns (log data, source of log, or label), then you must use the exclude_column_list option when running ML_TRAIN to exclude all primary key columns that don't include required data. See Syntax Examples for Anomaly Detection Training to review relevant examples.
- If the input table has additional columns to the ones permitted, you must use the exclude_column_list option to exclude irrelevant columns.

Anomaly Detection Options

Use the following JSON options:

contamination: Represents an estimate of the percentage of outliers in the training table.
- The contamination factor is calculated as: estimated number of rows with anomalies/total number of rows in the training table.
- The contamination value must be greater than 0 and less than 0.5. The default value is 0.01.
model_list: Allows you to select the model for training. If no option is specified, the default model is Generalized kth Nearest Neighbors (GkNN). Selecting more than one model or an unsupported model produces an error. Review supported Anomaly Detection Models.

Semi-supervised Learning Options

You have the following options to train a semi-supervised anomaly detection model:

supervised_submodel_options: Allows you to set optional override parameters for the supervised model component. The only model supported is DistanceWeightedKNNClassifier. The following parameters are supported:
- n_neighbors: Sets the desired k value that checks the k closest neighbors for each unclassified point. The default value is 5 and the value must be an integer greater than 0.
- min_labels: Sets the minimum number of labeled data points required to train the supervised component. If fewer labeled data points are provided during training of the model, ML_TRAIN fails. The default value is 20 and the value must be an integer greater than 0.
ensemble_score: This option specifies the metric to use to score the ensemble of unsupervised and supervised components. It identifies the optimal weight between the two components based on the metric. The supported metrics are accuracy, precision, recall, and f1. The default metric is f1.

Log Anomaly Detection Options

You have the following options for anomaly detection on log data. The options are available as a separate JSON_OBJECT named logad_options:

additional_masking_regex: Allows you to mask log data by using regular expression in a JSON_ARRAY. By default, the following parameters are automatically masked during training and when generating anomaly scores.
- IP
- DATETIME
- TIME
- HEX
- IPPORT
- OCID
window_size: Specifies the maximum number of log lines to be grouped for anomaly detection. The default value is 10.
window_stride: Specifies the stride value to use for segmenting log lines. For example, there is log A, B, C, D, and E. The window_size is 3, and the window_stride is 2. The first row has log A, B, and C. The second row has log C, D, and E. If this value is equal to window_size, there is no overlapping of log segments. The default value is 3.
log_source_column: Specifies the column name that contains the source identifier of the respective log lines. Log lines are grouped according to their respective source (for example, logs from multiple MySQL databases that are in the same table). By default, all log lines are assumed to be from the same source.
embedding_model: The embedding model used to extract semantic features from log data. To review supported embedding models in MySQL AI, run the following query: SELECT sys.ML_LIST_LLMS(); and see models that have capabilities with TEXT_EMBEDDINGS. The default value is multilingual-e5-small. Using an embedding model causes higher memory usage. If you set this to NULL, then you cannot also set keyword_model to NULL.
keyword_model: The keyword feature extractor used to extract keyword features from log data. The available options are tf-idf and NULL. The default value is tf-idf. If you set this to NULL, then you cannot also set embedding_model to NULL.

Unsupported Anomaly Detection Options

The following options are not supported for anomaly detection:

exclude_model_list
optimization_metric

Unsupported Routines

You cannot run the following routines for a trained anomaly detection model:

ML_EXPLAIN
ML_EXPLAIN_ROW
ML_EXPLAIN_TABLE
ML_PREDICT_ROW (only for anomaly detection for logs)

Training a Semi-Supervised Anomaly Detection Model

Train the model with the ML_TRAIN routine and use the credit_card_train table previously created. Before training the model, it is good practice to define the model handle instead of automatically creating one. See Defining Model Handle.

Optionally, set the value of the session variable, which sets the model handle to this same value.
```
mysql> SET @variable = 'model_handle';
```
Replace @variable and model_handle with your own definitions. For example:
```
mysql> SET @semi_supervised_model='anomaly_detection_semi_supervised_use_case';
```
The model handle is set to anomaly_detection_semi_supervised_use_case.
Run the ML_TRAIN routine.
```
mysql> CALL sys.ML_TRAIN('table_name', 'target_column_name', JSON_OBJECT('task', 'task_name'), model_handle);
```
Replace table_name, target_column_name, task_name, and model_handle with your own values.

The following example runs ML_TRAIN on the training dataset previously created.
```
mysql> CALL sys.ML_TRAIN('anomaly_data.credit_card_train', "target", 
                          CAST('{"task": "anomaly_detection", "experimental": {"semisupervised": {}}}' as JSON), 
                          @semi_supervised_model);
```
Where:
- anomaly_data.credit_card_train is the fully qualified name of the table that contains the training dataset (database_name.table_name).
- target is the name of the target column, which contains ground truth values to use for semi-supervised learning.
- CAST('{"task": "anomaly_detection", "experimental": {"semisupervised": {}}}' as JSON) specifies the machine learning task type. The experimental parameter is required to use a semi-supervised learning model. All default values are used for semi-supervised learning.
- @semi_supervised_model is the session variable previously set that defines the model handle to the name defined by the user: anomaly_detection_semi_supervised_use_case. If you do not define the model handle before training the model, the model handle is automatically generated, and the session variable only stores the model handle for the duration of the connection. User variables are written as @var_name. Any valid name for a user-defined variable is permitted. See Work with Model Handles to learn more.

When the training operation finishes, the model handle is assigned to the @semi_supervised_model session variable, and the model is stored in the model catalog. View the entry in the model catalog with the following query. Replace user1 with your MySQL account name.

mysql> SELECT model_id, model_handle, train_table_name FROM ML_SCHEMA_user1.MODEL_CATALOG  WHERE model_handle = 'anomaly_detection_semi_supervised_use_case';
+----------+----------------------------------------------+-------------------------------------+
| model_id | model_handle                                 | train_table_name                    |
+----------+----------------------------------------------+-------------------------------------+
|        3 | anomaly_detection_semi_supervised_use_case   | anomaly_data.credit_card_train      |
+----------+----------------------------------------------+-------------------------------------+

Training an Unsupervised Anomaly Detection Model for Logs

Train the model with the ML_TRAIN routine and use the training_data table previously created. Before training the model, it is good practice to define the model handle instead of automatically creating one. See Defining Model Handle.

You have the option to select the keyword feature extractor and embedding model for training the model. See Anomaly Detection for Logs to learn more. Run the following query to confirm available models that have capabilities with TEXT_EMBEDDINGS.

mysql> SELECT sys.ML_LIST_LLMS();
+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| sys.ML_LIST_LLMS()                                                                                                                                            |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| [{"model_id": "llama3.2-3b-instruct-v1", "provider": "HeatWave", "capabilities": ["GENERATION"], "default_model": true, "availability_date": "2025-05-20"},   | 
| {"model_id": "all_minilm_l12_v2", "provider": "HeatWave", "capabilities": ["TEXT_EMBEDDINGS"], "default_model": false, "availability_date": "2024-07-01"},    |
| {"model_id": "multilingual-e5-small", "provider": "HeatWave", "capabilities": ["TEXT_EMBEDDINGS"], "default_model": true, "availability_date": "2024-07-24"}] |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------+

The output displays two compatible models: all_minilm_l12_v2 and multilingual-e5-small.

Optionally, set the value of the session variable, which sets the model handle to this same value.
```
mysql> SET @variable = 'model_handle';
```
Replace @variable and model_handle with your own definitions. For example:
```
mysql> SET @unsupervised_log_model='anomaly_detection_log_use_case';
```
The model handle is set to anomaly_detection_log_use_case.
Run the ML_TRAIN routine.
```
mysql> CALL sys.ML_TRAIN('table_name', 'target_column_name', JSON_OBJECT('task', 'task_name'), model_handle);
```
Replace table_name, target_column_name, task_name, and model_handle with your own values.

The following example runs ML_TRAIN on the training dataset previously created.
```
mysql> CALL sys.ML_TRAIN('anomaly_log_data.training_data', NULL, 
                          JSON_OBJECT('task', 'log_anomaly_detection', 
                                      'exclude_column_list', JSON_ARRAY('log_id', 'timestamp', 'target'),
                                      'logad_options', 
                                      JSON_OBJECT('embedding_model', 'all_minilm_l12_v2', 'keyword_model', 'tf-idf')), @unsupervised_log_model);
```
Where:
- anomaly_log_data.training_data is the fully qualified name of the table that contains the training dataset (database_name.table_name).
- NULL is set for the target column because it is an unsupervised learning model, so no labeled data is used to train the model.
- JSON_OBJECT('task', 'log_anomaly_detection' specifies the machine learning task type.
- 'exclude_column_list', JSON_ARRAY('log_id', 'timestamp', 'target') sets the required options to run the model for anomaly detection on logs. The columns log_id and timestamp are excluded because they are not any of the required columns for training. See Requirements for Anomaly Detection Training to learn more. The target column is excluded because it is an unsupervised learning model.
- 'embedding_model', 'all_minilm_l12_v2 sets the embedding model to one of the models previously confirmed as available for training the model.
- 'keyword_model', 'tf-idf' sets the keyword feature extractor for training the model.
- @unsupervised_log_model is the session variable previously set that defines the model handle to the name defined by the user: anomaly_detection_log_use_case. If you do not define the model handle before training the model, the model handle is automatically generated, and the session variable only stores the model handle for the duration of the connection. User variables are written as @var_name. Any valid name for a user-defined variable is permitted. See Work with Model Handles to learn more.

When the training operation finishes, the model handle is assigned to the @unsupervised_log_model session variable, and the model is stored in the model catalog. View the entry in the model catalog with the following query. Replace user1 with your MySQL account name.

mysql> SELECT model_id, model_handle, train_table_name FROM ML_SCHEMA_user1.MODEL_CATALOG  WHERE model_handle = 'anomaly_detection_log_use_case';
+----------+----------------------------------------------+-------------------------------------+
| model_id | model_handle                                 | train_table_name                    |
+----------+----------------------------------------------+-------------------------------------+
|        4 | anomaly_detection_log_use_case               | anomaly_log_data.training_data      |
+----------+----------------------------------------------+-------------------------------------+

What's Next

Learn how to Generate Predictions for an Anomaly Detection Model
Review additional Syntax Examples for Anomaly Detection Training