Documentation Home
HeatWave User Guide
Related Documentation Download this Manual
PDF (US Ltr) - 2.2Mb
PDF (A4) - 2.2Mb


HeatWave User Guide  /  ...  /  Using an Anomaly Detection Model

3.10.5 Using an Anomaly Detection Model

Anomaly detection models produce anomaly scores, which indicate the probability that a row of data or segment of log lines has an anomaly. In the results, is_anomaly generates a value of 1 for an anomaly, or 0 for normal. The normal value represents the probability the row of data or log segment is normal. The anomaly value represents the probability the row of data or log segment is anomalous.

To detect anomalies, run the ML_PREDICT routines on data with the same columns as the training model.

For ML_SCORE the target_column_name column must only contain the anomaly scores as an integer: 1: an anomaly or 0 normal.

Anomaly Detection Model Options

The threshold you set on anomaly detection models determines which rows in the output table are labeled as anomalies. The value for the threshold sets a probability value for anomaly detection. Any sample with a probability value above the threshold value is classified an anomaly.

There are two ways to set threshold values for anomaly detection models.

Set the Contamination Value

You can set the contamination option for the ML_TRAIN routine. This option uses the following calculation to set the threshold: (1 - contamination)-th percentile of all the anomaly scores.

The default contamination value is 0.01. The default threshold value based on the default contamination value is the 0.99-th percentile of all the anomaly scores.

Set the Threshold Value

You can set the threshold option for the ML_PREDICT_TABLE, ML_PREDICT_ROW, and ML_SCORE routines. The value must be greater than 0 and less than 1.

If no value is set for the threshold option, the value set for the contamination option in the ML_TRAIN routine determines the threshold.

The following additional options are available:

  • An alternative to threshold is topk. The results include the top K rows with the highest anomaly scores. The ML_PREDICT_TABLE and ML_SCORE routines include the topk option, which is an integer between 1 and the table length.

  • ML_SCORE includes an options parameter in JSON format. The options are threshold and topk.

  • When running a semi-supervised model, the ML_PREDICT_ROW, ML_PREDICT_TABLE, and ML_SCORE routines have the supervised_submodel_weight option. It allows you to override the ensemble_score weighting estimated during ML_TRAIN with a new value. The value must be greater than 0 and less than 1.

Options for Anomaly Detection on Log Data

For running anomaly detection on log data (MySQL 9.2.2 and later), you have the option to leverage HeatWave GenAI for textual summaries of the results. To create summaries, use the following options:

  • summarize_logs: Enable summaries by setting this to TRUE. If enabled, summaries are generated for log segments that are labeled as an anomaly or exceed the value set for the summary_threshold.

  • summary_threshold: Determines the rows in the output table that are summarized. This does not affect how the contamination and threshold options determine anomalies. You can set a value greater than 0 and less than 1. The default value is NULL.

    Summaries are generated for the following:

    • All rows labeled as anomalies.

    • If a value is set for summary_threshold, any non-anomaly rows that exceed the probability of the summary_threshold.

    If the default NULL value is used for summary_threshold, then only rows labeled as anomalies are summarized.

Note

Enabling the summary_threshold option and setting a very low threshold value can potentially lead to a high number of summaries being generated, which may substantially increase the time required to generate output tables.

Syntax Examples for Unsupervised Learning

  • An anomaly detection example that uses the roc_auc metric for ML_SCORE.

    Press CTRL+C to copy
    mysql> CALL sys.ML_TRAIN('mlcorpus_anomaly_detection.volcanoes-b3_anomaly_train', NULL, JSON_OBJECT('task', 'anomaly_detection', 'exclude_column_list', JSON_ARRAY('target')), @anomaly); Query OK, 0 rows affected (46.59 sec) mysql> CALL sys.ML_MODEL_LOAD(@anomaly, NULL); Query OK, 0 rows affected (3.23 sec) mysql> CALL sys.ML_PREDICT_TABLE('mlcorpus_anomaly_detection.volcanoes-b3_anomaly_train', @anomaly, 'mlcorpus_anomaly_detection.volcanoes-predictions',NULL); Query OK, 0 rows affected (10.28 sec) mysql> SELECT * FROM mlcorpus_anomaly_detection.volcanoes-predictions LIMIT 5; +-------------------+------+------+----------+--------+----------------------------------------------------------------------------------------+ | _4aad19ca6e_pk_id | V1 | V2 | V3 | target | ml_results | +-------------------+------+------+----------+--------+----------------------------------------------------------------------------------------+ | 1 | 128 | 802 | 0.47255 | 0 | {'predictions': {'is_anomaly': 0}, 'probabilities': {'normal': 0.95, 'anomaly': 0.05}} | | 2 | 631 | 642 | 0.387302 | 0 | {'predictions': {'is_anomaly': 0}, 'probabilities': {'normal': 0.96, 'anomaly': 0.04}} | | 3 | 438 | 959 | 0.556034 | 0 | {'predictions': {'is_anomaly': 0}, 'probabilities': {'normal': 0.74, 'anomaly': 0.26}} | | 4 | 473 | 779 | 0.407626 | 0 | {'predictions': {'is_anomaly': 0}, 'probabilities': {'normal': 0.87, 'anomaly': 0.13}} | | 5 | 67 | 933 | 0.383843 | 0 | {'predictions': {'is_anomaly': 0}, 'probabilities': {'normal': 0.95, 'anomaly': 0.05}} | +-----+------+------+----------+--------+----------------------------------------------------------------------------------------+ 5 rows in set (0.00 sec) mysql> CALL sys.ML_SCORE('mlcorpus_anomaly_detection.volcanoes-b3_anomaly_train', 'target', @anomaly, 'roc_auc', @score, NULL); Query OK, 0 rows affected (5.84 sec) mysql> SELECT @score; +--------------------+ | @score | +--------------------+ | 0.7465642094612122 | +--------------------+ 1 row in set (0.00 sec)
  • An ML_PREDICT_ROW example that uses default options.

    Press CTRL+C to copy
    mysql> SELECT sys.ML_PREDICT_ROW('{"V1": 438.0, "V2": 959.0, "V3": 0.556034}', @anomaly, NULL); +----------------------------------------------------------------------------------------------------------------------------------------------------+ | sys.ML_PREDICT_ROW('{"V1": 438.0, "V2": 959.0, "V3": 0.556034}', @anomaly, NULL) | +----------------------------------------------------------------------------------------------------------------------------------------------------+ | {"V1": 438.0, "V2": 959.0, "V3": 0.556034, "ml_results": "{'predictions': {'is_anomaly': 0}, 'probabilities': {'normal': 0.74, 'anomaly': 0.26}}"} | +----------------------------------------------------------------------------------------------------------------------------------------------------+ 1 row in set (5.35 sec)
  • An ML_PREDICT_TABLE example that uses the threshold option set to 1%. All rows shown have probabilities of being an anomaly above 1%, and are predicted to be anomalies.

    Press CTRL+C to copy
    mysql> CALL sys.ML_PREDICT_TABLE('mlcorpus_anomaly_detection.volcanoes-b3_anomaly_train', @anomaly, 'mlcorpus_anomaly_detection.volcanoes-predictions_threshold', JSON_OBJECT('threshold', 0.01)); Query OK, 0 rows affected (12.77 sec) mysql> SELECT * FROM mlcorpus_anomaly_detection.volcanoes-predictions_threshold LIMIT 5; +-------------------+------+------+----------+--------+----------------------------------------------------------------------------------------+ | _4aad19ca6e_pk_id | V1 | V2 | V3 | target | ml_results | +-------------------+------+------+----------+--------+----------------------------------------------------------------------------------------+ | 1 | 128 | 802 | 0.47255 | 0 | {'predictions': {'is_anomaly': 1}, 'probabilities': {'normal': 0.95, 'anomaly': 0.05}} | | 2 | 631 | 642 | 0.387302 | 0 | {'predictions': {'is_anomaly': 1}, 'probabilities': {'normal': 0.96, 'anomaly': 0.04}} | | 3 | 438 | 959 | 0.556034 | 0 | {'predictions': {'is_anomaly': 1}, 'probabilities': {'normal': 0.74, 'anomaly': 0.26}} | | 4 | 473 | 779 | 0.407626 | 0 | {'predictions': {'is_anomaly': 1}, 'probabilities': {'normal': 0.87, 'anomaly': 0.13}} | | 5 | 67 | 933 | 0.383843 | 0 | {'predictions': {'is_anomaly': 1}, 'probabilities': {'normal': 0.95, 'anomaly': 0.05}} | +-------------------+------+------+----------+--------+----------------------------------------------------------------------------------------+ 5 rows in set (0.00 sec)
  • An ML_SCORE example that uses the accuracy metric with a threshold set to 90%.

    Press CTRL+C to copy
    mysql> CALL sys.ML_SCORE('mlcorpus_anomaly_detection.volcanoes-b3_anomaly_train', 'target', @anomaly, 'accuracy', @score, JSON_OBJECT('threshold', 0.9)); Query OK, 0 rows affected (1.86 sec) mysql> SELECT @score; +--------------------+ | @score | +--------------------+ | 0.9791129231452942 | +--------------------+ 1 row in set (0.00 sec)
  • An ML_SCORE example that uses the precision_at_k metric with a topk value of 10.

    Press CTRL+C to copy
    mysql> CALL sys.ML_SCORE('mlcorpus_anomaly_detection.volcanoes-b3_anomaly_train', 'target', @anomaly, 'precision_at_k', @score, JSON_OBJECT('topk', 10)); Query OK, 0 rows affected (5.84 sec) mysql> SELECT @score; +---------------------+ | @score | +---------------------+ | 0 | +---------------------+ 1 row in set (0.00 sec)

Syntax Examples for Semi-Supervised Learning

  • An ML_PREDICT_ROW example that overrides the ensemble_score value from the ML_TRAIN routine to a new value of 0.8

    Press CTRL+C to copy
    mysql> SET @row_input = JSON_OBJECT('V1', 250, 'V2', 525, 'V3', 0.438976); mysql> SELECT sys.ML_PREDICT_ROW(@row_input, @semsup_gknn, CAST('{"experimental": {"semisupervised": {"supervised_submodel_weight": 0.8}}}' as JSON));
  • An ML_PREDICT_TABLE example that overrides the ensemble_score value from the ML_TRAIN routine to a new value of 0.5.

    Press CTRL+C to copy
    mysql> CALL sys.ML_PREDICT_TABLE('mlcorpus.anomaly_train', @semsup_gknn, 'mlcorpus.preds_gknn_weighted', CAST('{"experimental": {"semisupervised": {"supervised_submodel_weight": 0.5}}}' as JSON));
  • An ML_SCORE example that overrides the ensemble_score value from the ML_TRAIN routine to a new value of 0.5.

    Press CTRL+C to copy
    mysql> CALL sys.ML_SCORE('mlcorpus.anomaly_train_with_target', "target", @semsup_gknn, 'precision_at_k', @semsup_score_gknn_weighted, CAST('{"topk": 10, "experimental": {"semisupervised": {"supervised_submodel_weight": 0.5}}}' as JSON));
  • An ML_SCORE example that overrides the ensemble_score value from the ML_TRAIN routine to a new value of 0.5.

    Press CTRL+C to copy
    mysql> CALL sys.ML_SCORE('mlcorpus.anomaly_train_with_target', "target", @semsup_gknn, 'precision_at_k', @semsup_score_gknn_weighted, CAST('{"topk": 10, "experimental": {"semisupervised": {"supervised_submodel_weight": 0.5}}}' as JSON));

Syntax Examples for Anomaly Detection on Log Data

  • An ML_PREDICT_TABLE example that disables log summaries in the results.

    Press CTRL+C to copy
    mysql> CALL sys.ML_PREDICT_TABLE('mlcorpus.`log_anomaly_just_patterns`', @logad_model, 'mlcorpus.log_anomaly_test_out', JSON_OBJECT('logad_options', JSON_OBJECT('summarize_logs', FALSE))); mysql> SELECT * FROM mlcorpus.log_anomaly_test_out LIMIT 1; +----+--------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------+ | id | parsed_log_segment | ml_results | +----+--------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------+ | 1 | 2024-04-11T14:39:45.443597Z 1 [Note] [MY-013546] [InnoDB] Atomic write enabled | {"index_map": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], | | | 2024-04-11T14:39:45.443618Z 1 [Note] [MY-012932] [InnoDB] PUNCH HOLE support available | "predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.55, "anomaly": 0.45}} | | | 2024-04-11T14:39:45.443631Z 1 [Note] [MY-012944] [InnoDB] Uses event mutexes | | | | 2024-04-11T14:39:45.443635Z 1 [Note] [MY-012945] [InnoDB] GCC builtin __atomic_thread_fence() is used for memory barrier | | | | 2024-04-11T14:39:45.443646Z 1 [Note] [MY-012948] [InnoDB] Compressed tables use zlib 1.2.13 | | | | 2024-04-11T14:40:25.128143Z 0 [Note] [MY-010264] [Server] - '127.0.0.1' resolves to '127.0.0.1'; | | | | 2024-04-11T14:40:25.128182Z 0 [Note] [MY-010251] [Server] Server socket created on IP: '127.0.0.1'. | | | | 2024-04-11T14:40:25.128245Z 0 [Note] [MY-010252] [Server] Server hostname (bind-address): '10.0.1.125'; port: 3306 | | | | 2024-04-11T14:40:25.128272Z 0 [Note] [MY-010264] [Server] - '10.0.1.125' resolves to '10.0.1.125'; | | | | 2024-04-26T13:01:30.287325Z 0 [Warning] [MY-015116] [Server] Background histogram update on nexus.fetches: | | | | Lock wait timeout exceeded; try restarting transaction | | +----+------+------+----------+--------+----------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------+

    Where:

    • index_map displays index values that represent the row location of the log lines in the input table. Since the the window_size value can cause the shape of the output table to differ from the input table, this allows you to refer to the relevant log lines in the input table.

    • Ten log lines were grouped and analyzed based on the default window_size.

  • An ML_PREDICT_TABLE example that enables textual summaries for the results.

    Press CTRL+C to copy
    mysql> CALL sys.ML_PREDICT_TABLE('mlcorpus.`log_anomaly_just_patterns`', @log_anomaly_us, 'mlcorpus.log_anomaly_summary_out', JSON_OBJECT('logad_options', JSON_OBJECT('summarize_logs', TRUE, 'summary_threshold', 0.30))); mysql> SHOW COLUMNS FROM mlcorpus.log_anomaly_summary_out;FieldTypeNullKeyDefaultExtraid intNOPRINULLauto_incrementparsed_log_segmenttextNO NULL ml_resultstextNO NULL mysql> SELECT ml_results FROM mlcorpus.log_anomaly_summary_out; +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | ml_results | +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | {"summary": "\n\nThe text describes the configuration of a MySQL database server. It includes information about the version of MySQL being used, | | the hostname and IP address of the server, and the port number it is listening on. The text also mentions several features enabled in the server, | | such as atomic writes, punch hole support, event mutexes, and memory barriers. Additionally, the text notes that the server uses zlib for compressed tables | | and GCC builtin __atomic_thread_fence() for memory barrier operations. Finally, the text includes a warning about a background histogram update on nexus.fetches | | that exceeded a lock wait timeout, indicating a potential issue with the database server's performance.", | | "index_map": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], "predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.55, "anomaly": 0.45}} | | {"summary": "\n\nThe text describes the execution of a query with QEP size 36800 and target nodes 4 in HeatWave. The query was successfully executed, | | and the ModIndexMap was initialized with 128 superpartitions. ORMA was started with one worker, and FIPC was also started with one worker. | | The text mentions that a script named ruby_env.sh was used to set the backtrace level to ERROR(4) and the logger level to INFO(2).", | | "index_map": [16, 17, 18, 19, 20, 21, 22, 23, 24, 25], "predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.65, "anomaly": 0.35}} | | {"index_map": [40, 41, 42, 43, 44, 45, 46, 47], "predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.98, "anomaly": 0.02}} | | {"index_map": [4, 5, 6, 7, 8, 9, 10, 11, 12, 13], "predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.77, "anomaly": 0.23}} | | {"summary": "\n\nThe text describes the initialization of a system with 128 superpartitions and the starting of | | ORMA (Operational Research Modeling Assistant) and FIPC (Fault-Tolerant In-Process Communication). The text also mentions the setting of logger levels and writing logs to a file. | | Additionally, it describes the loading of configuration from a JSON file and the starting of a new chef-run with a run_id_token.", | | "index_map": [19, 20, 21, 22, 23, 24, 25, 26, 27, 28], "predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.65, "anomaly": 0.35}} | | {"index_map": [43, 44, 45, 46, 47], "predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.96, "anomaly": 0.04}} | | {"index_map": [22, 23, 24, 25, 26, 27, 28, 29, 30, 31], "predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.87, "anomaly": 0.13}} | | {"summary": " The text is a log entry indicating that the system MY-011071 is using the RAPID method to offload data to HeatWave with qid=190075 and SCN=2889192.", | | "index_map": [46, 47], "predictions": {"is_anomaly": 1}, "probabilities": {"normal": 0.0, "anomaly": 1.0}} | | {"index_map": [7, 8, 9, 10, 11, 12, 13, 14, 15, 16], "predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.78, "anomaly": 0.22}} | | {"index_map": [25, 26, 27, 28, 29, 30, 31, 32, 33, 34], "predictions": {"is_anomaly": 0}, "probabilities": {"normal": 1.0, "anomaly": 0.0}} | | {"index_map": [10, 11, 12, 13, 14, 15, 16, 17, 18, 19], "predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.78, "anomaly": 0.22}} | | {"index_map": [28, 29, 30, 31, 32, 33, 34, 35, 36, 37], "predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.81, "anomaly": 0.19}} | | {"index_map": [13, 14, 15, 16, 17, 18, 19, 20, 21, 22], "predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.71, "anomaly": 0.29}} | | {"index_map": [31, 32, 33, 34, 35, 36, 37, 38, 39, 40], "predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.8, "anomaly": 0.2}} | | {"index_map": [34, 35, 36, 37, 38, 39, 40, 41, 42, 43], "predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.99, "anomaly": 0.01}} | | {"index_map": [37, 38, 39, 40, 41, 42, 43, 44, 45, 46], "predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.98, "anomaly": 0.02}} | +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

    Where:

    • summarize_logs is enabled, so HeatWave GenAI is leveraged to generate summaries of the results.

    • Summaries are generated for results that are labeled as an anomaly or exceed the summary_threshold value of 30%.