After preparing the data for an anomaly detection model, you can train the model.
This topic has the following sections.
Review and complete all the tasks to Prepare Data for an Anomaly Detection Model.
Consider the following based on the type of anomaly detection you are running:
Set the
task
parameter toanomaly_detection
for running anomaly detection on table data, orlog_anomaly_detection
for running anomaly detection on log data.If running an unsupervised model, the
target_column_name
parameter must be set toNULL
.-
If running a semi-supervised model:
The
target_column_name
parameter must specify a column whose only allowed values are 0 (normal), 1 (anomalous), and NULL (unlabeled). All rows are used to train the unsupervised component, while the rows with a value different than NULL are used to train the supervised component.The
experimental
option must be set tosemisupervised
.
-
If running anomaly detection on log data, the input table can only have the following columns:
The column containing the logs.
If including logs from different sources, a column containing the source of each log. Identify this column with the
log_source_column
option.If including labeled data, a column identifying the labeled log lines. See Semi-supervised Anomaly Detection to learn more.
At least one column must act as the primary key to establish the temporal order of logs. If the primary key column (or columns) is not one of the previous required columns (log data, source of log, or label), then you must use the
exclude_column_list
option when runningML_TRAIN
to exclude all primary key columns that don't include required data. See Syntax Examples for Anomaly Detection Training to review relevant examples.If the input table has additional columns to the ones permitted, you must use the
exclude_column_list
option to exclude irrelevant columns.
Use the following JSON
options
:
-
contamination
: Represents an estimate of the percentage of outliers in the training table.The contamination factor is calculated as: estimated number of rows with anomalies/total number of rows in the training table.
The contamination value must be greater than 0 and less than 0.5. The default value is 0.01.
model_list
: Allows you to select the model for training. If no option is specified, the default model is Generalized kth Nearest Neighbors (GkNN). Selecting more than one model or an unsupported model produces an error. Review supported Anomaly Detection Models.
You have the following options to train a semi-supervised anomaly detection model:
-
supervised_submodel_options
: Allows you to set optional override parameters for the supervised model component. The only model supported isDistanceWeightedKNNClassifier
. The following parameters are supported:n_neighbors
: Sets the desired k value that checks the k closest neighbors for each unclassified point. The default value is 5 and the value must be an integer greater than 0.min_labels
: Sets the minimum number of labeled data points required to train the supervised component. If fewer labeled data points are provided during training of the model,ML_TRAIN
fails. The default value is 20 and the value must be an integer greater than 0.
ensemble_score
: This option specifies the metric to use to score the ensemble of unsupervised and supervised components. It identifies the optimal weight between the two components based on the metric. The supported metrics areaccuracy
,precision
,recall
, andf1
. The default metric isf1
.
You have the following options for anomaly detection on log
data. The options are available as a separate
JSON_OBJECT
named
logad_options
:
-
additional_masking_regex
: Allows you to mask log data by using regular expression in aJSON_ARRAY
. By default, the following parameters are automatically masked during training and when generating anomaly scores.IP
DATETIME
TIME
HEX
IPPORT
OCID
window_size
: Specifies the maximum number of log lines to be grouped for anomaly detection. The default value is 10.window_stride
: Specifies the stride value to use for segmenting log lines. For example, there is log A, B, C, D, and E. Thewindow_size
is 3, and thewindow_stride
is 2. The first row has log A, B, and C. The second row has log C, D, and E. If this value is equal towindow_size
, there is no overlapping of log segments. The default value is 3.log_source_column
: Specifies the column name that contains the source identifier of the respective log lines. Log lines are grouped according to their respective source (for example, logs from multiple MySQL databases that are in the same table). By default, all log lines are assumed to be from the same source.
The following options are not supported for anomaly detection:
exclude_model_list
optimization_metric
You cannot run the following routines for a trained anomaly detection model:
ML_PREDICT_ROW
(only for anomaly detection for logs)
Train the model with the
ML_TRAIN
routine and use the credit_card_train
table previously created. Before training the model, it is
good practice to define the model handle instead of
automatically creating one. See
Defining Model Handle.
-
Optionally, set the value of the session variable, which sets the model handle to this same value.
mysql> SET @variable = 'model_handle';
Replace
@variable
andmodel_handle
with your own definitions. For example:mysql> SET @semi_supervised_model='anomaly_detection_semi_supervised_use_case';
The model handle is set to
anomaly_detection_semi_supervised_use_case
. -
Run the
ML_TRAIN
routine.mysql> CALL sys.ML_TRAIN('table_name', 'target_column_name', JSON_OBJECT('task', 'task_name'), model_handle);
Replace
table_name
,target_column_name
,task_name
, andmodel_handle
with your own values.The following example runs
ML_TRAIN
on the training dataset previously created.mysql> CALL sys.ML_TRAIN('anomaly_data.credit_card_train', "target", CAST('{"task": "anomaly_detection", "experimental": {"semisupervised": {}}}' as JSON), @semi_supervised_model);
Where:
anomaly_data.credit_card_train
is the fully qualified name of the table that contains the training dataset (database_name.table_name
).target
is the name of the target column, which contains ground truth values to use for semi-supervised learning.CAST('{"task": "anomaly_detection", "experimental": {"semisupervised": {}}}' as JSON)
specifies the machine learning task type. Theexperimental
parameter is required to use a semi-supervised learning model. All default values are used for semi-supervised learning.@semi_supervised_model
is the session variable previously set that defines the model handle to the name defined by the user:anomaly_detection_semi_supervised_use_case
. If you do not define the model handle before training the model, the model handle is automatically generated, and the session variable only stores the model handle for the duration of the connection. User variables are written as@
. Any valid name for a user-defined variable is permitted. See Work with Model Handles to learn more.var_name
-
When the training operation finishes, the model handle is assigned to the
@semi_supervised_model
session variable, and the model is stored in the model catalog. View the entry in the model catalog with the following query. Replaceuser1
with your MySQL account name.mysql> SELECT model_id, model_handle, train_table_name FROM ML_SCHEMA_user1.MODEL_CATALOG WHERE model_handle = 'anomaly_detection_semi_supervised_use_case'; +----------+----------------------------------------------+-------------------------------------+ | model_id | model_handle | train_table_name | +----------+----------------------------------------------+-------------------------------------+ | 3 | anomaly_detection_semi_supervised_use_case | anomaly_data.credit_card_train | +----------+----------------------------------------------+-------------------------------------+
Train the model with the
ML_TRAIN
routine and use the training_data
table
previously created. Before training the model, it is good
practice to define the model handle instead of automatically
creating one. See
Defining Model Handle.
-
Optionally, set the value of the session variable, which sets the model handle to this same value.
mysql> SET @variable = 'model_handle';
Replace
@variable
andmodel_handle
with your own definitions. For example:mysql> SET @unsupervised_log_model='anomaly_detection_log_use_case';
The model handle is set to
anomaly_detection_log_use_case
. -
Run the
ML_TRAIN
routine.mysql> CALL sys.ML_TRAIN('table_name', 'target_column_name', JSON_OBJECT('task', 'task_name'), model_handle);
Replace
table_name
,target_column_name
,task_name
, andmodel_handle
with your own values.The following example runs
ML_TRAIN
on the training dataset previously created.mysql> CALL sys.ML_TRAIN('anomaly_log_data.training_data', NULL, JSON_OBJECT('task', 'log_anomaly_detection', 'exclude_column_list', JSON_ARRAY('log_id', 'timestamp', 'target')), @unsupervised_log_model);
Where:
anomaly_log_data.training_data
is the fully qualified name of the table that contains the training dataset (database_name.table_name
).NULL
is set for the target column because it is an unsupervised learning model, so no labeled data is used to train the model.JSON_OBJECT('task', 'log_anomaly_detection'
specifies the machine learning task type.'exclude_column_list', JSON_ARRAY('log_id', 'timestamp', 'target')
sets the required options to run the model for anomaly detection on logs. The columnslog_id
andtimestamp
are excluded because they are not any of the required columns for training. See Requirements for Anomaly Detection Training to learn more. Thetarget
column is excluded because it is an unsupervised learning model.@unsupervised_log_model
is the session variable previously set that defines the model handle to the name defined by the user:anomaly_detection_log_use_case
. If you do not define the model handle before training the model, the model handle is automatically generated, and the session variable only stores the model handle for the duration of the connection. User variables are written as@
. Any valid name for a user-defined variable is permitted. See Work with Model Handles to learn more.var_name
-
When the training operation finishes, the model handle is assigned to the
@unsupervised_log_model
session variable, and the model is stored in the model catalog. View the entry in the model catalog with the following query. Replaceuser1
with your MySQL account name.mysql> SELECT model_id, model_handle, train_table_name FROM ML_SCHEMA_user1.MODEL_CATALOG WHERE model_handle = 'anomaly_detection_log_use_case'; +----------+----------------------------------------------+-------------------------------------+ | model_id | model_handle | train_table_name | +----------+----------------------------------------------+-------------------------------------+ | 4 | anomaly_detection_log_use_case | anomaly_log_data.training_data | +----------+----------------------------------------------+-------------------------------------+
Learn how to Generate Predictions for an Anomaly Detection Model
Review additional Syntax Examples for Anomaly Detection Training