Anomaly detection models produce anomaly scores, which
indicate the probability that a row of data or segment of log
lines has an anomaly. In the results,
is_anomaly
generates a value of 1 for an
anomaly, or 0 for normal. The normal
value
represents the probability the row of data or log segment is
normal. The anomaly
value represents the
probability the row of data or log segment is anomalous.
To detect anomalies, run the ML_PREDICT
routines on data with the same columns as the training model.
To detect anomalies in row data, you can run the
ML_PREDICT_ROW
orML_PREDICT_TABLE
routines.To detect anomalies in log data (MySQL 9.2.2 and later), you can only run the
ML_PREDICT_TABLE
routine.
For ML_SCORE
the
target_column_name
column must only contain
the anomaly scores as an integer: 1
: an
anomaly or 0
normal.
The threshold you set on anomaly detection models determines which rows in the output table are labeled as anomalies. The value for the threshold sets a probability value for anomaly detection. Any sample with a probability value above the threshold value is classified an anomaly.
There are two ways to set threshold values for anomaly detection models.
Set the Contamination Value
You can set the contamination
option for
the ML_TRAIN
routine. This
option uses the following calculation to set the threshold:
(1 - contamination
)-th percentile of all
the anomaly scores.
The default contamination
value is 0.01.
The default threshold
value based on the
default contamination
value is the
0.99-th percentile of all the anomaly scores.
Set the Threshold Value
You can set the threshold
option for the
ML_PREDICT_TABLE
,
ML_PREDICT_ROW
, and
ML_SCORE
routines. The value
must be greater than 0 and less than 1.
If no value is set for the threshold
option, the value set for the
contamination
option in the
ML_TRAIN
routine determines
the threshold.
The following additional options are available:
An alternative to
threshold
istopk
. The results include the top K rows with the highest anomaly scores. TheML_PREDICT_TABLE
andML_SCORE
routines include thetopk
option, which is an integer between 1 and the table length.ML_SCORE
includes an options parameter inJSON
format. The options arethreshold
andtopk
.When running a semi-supervised model, the
ML_PREDICT_ROW
,ML_PREDICT_TABLE
, andML_SCORE
routines have thesupervised_submodel_weight
option. It allows you to override theensemble_score
weighting estimated duringML_TRAIN
with a new value. The value must be greater than 0 and less than 1.
For running anomaly detection on log data (MySQL 9.2.2 and later), you have the option to leverage HeatWave GenAI for textual summaries of the results. To create summaries, use the following options:
summarize_logs
: Enable summaries by setting this toTRUE
. If enabled, summaries are generated for log segments that are labeled as an anomaly or exceed the value set for thesummary_threshold
.-
summary_threshold
: Determines the rows in the output table that are summarized. This does not affect how thecontamination
andthreshold
options determine anomalies. You can set a value greater than 0 and less than 1. The default value isNULL
.Summaries are generated for the following:
All rows labeled as anomalies.
If a value is set for
summary_threshold
, any non-anomaly rows that exceed the probability of thesummary_threshold
.
If the default
NULL
value is used forsummary_threshold
, then only rows labeled as anomalies are summarized.
Enabling the summary_threshold
option
and setting a very low threshold value can potentially
lead to a high number of summaries being generated, which
may substantially increase the time required to generate
output tables.
-
An anomaly detection example that uses the
roc_auc
metric forML_SCORE
.Press CTRL+C to copymysql> CALL sys.ML_TRAIN('mlcorpus_anomaly_detection.volcanoes-b3_anomaly_train', NULL, JSON_OBJECT('task', 'anomaly_detection', 'exclude_column_list', JSON_ARRAY('target')), @anomaly); Query OK, 0 rows affected (46.59 sec) mysql> CALL sys.ML_MODEL_LOAD(@anomaly, NULL); Query OK, 0 rows affected (3.23 sec) mysql> CALL sys.ML_PREDICT_TABLE('mlcorpus_anomaly_detection.volcanoes-b3_anomaly_train', @anomaly, 'mlcorpus_anomaly_detection.volcanoes-predictions',NULL); Query OK, 0 rows affected (10.28 sec) mysql> SELECT * FROM mlcorpus_anomaly_detection.volcanoes-predictions LIMIT 5; +-------------------+------+------+----------+--------+----------------------------------------------------------------------------------------+ | _4aad19ca6e_pk_id | V1 | V2 | V3 | target | ml_results | +-------------------+------+------+----------+--------+----------------------------------------------------------------------------------------+ | 1 | 128 | 802 | 0.47255 | 0 | {'predictions': {'is_anomaly': 0}, 'probabilities': {'normal': 0.95, 'anomaly': 0.05}} | | 2 | 631 | 642 | 0.387302 | 0 | {'predictions': {'is_anomaly': 0}, 'probabilities': {'normal': 0.96, 'anomaly': 0.04}} | | 3 | 438 | 959 | 0.556034 | 0 | {'predictions': {'is_anomaly': 0}, 'probabilities': {'normal': 0.74, 'anomaly': 0.26}} | | 4 | 473 | 779 | 0.407626 | 0 | {'predictions': {'is_anomaly': 0}, 'probabilities': {'normal': 0.87, 'anomaly': 0.13}} | | 5 | 67 | 933 | 0.383843 | 0 | {'predictions': {'is_anomaly': 0}, 'probabilities': {'normal': 0.95, 'anomaly': 0.05}} | +-----+------+------+----------+--------+----------------------------------------------------------------------------------------+ 5 rows in set (0.00 sec) mysql> CALL sys.ML_SCORE('mlcorpus_anomaly_detection.volcanoes-b3_anomaly_train', 'target', @anomaly, 'roc_auc', @score, NULL); Query OK, 0 rows affected (5.84 sec) mysql> SELECT @score; +--------------------+ | @score | +--------------------+ | 0.7465642094612122 | +--------------------+ 1 row in set (0.00 sec)
-
An
ML_PREDICT_ROW
example that uses default options.Press CTRL+C to copymysql> SELECT sys.ML_PREDICT_ROW('{"V1": 438.0, "V2": 959.0, "V3": 0.556034}', @anomaly, NULL); +----------------------------------------------------------------------------------------------------------------------------------------------------+ | sys.ML_PREDICT_ROW('{"V1": 438.0, "V2": 959.0, "V3": 0.556034}', @anomaly, NULL) | +----------------------------------------------------------------------------------------------------------------------------------------------------+ | {"V1": 438.0, "V2": 959.0, "V3": 0.556034, "ml_results": "{'predictions': {'is_anomaly': 0}, 'probabilities': {'normal': 0.74, 'anomaly': 0.26}}"} | +----------------------------------------------------------------------------------------------------------------------------------------------------+ 1 row in set (5.35 sec)
-
An
ML_PREDICT_TABLE
example that uses thethreshold
option set to 1%. All rows shown have probabilities of being an anomaly above 1%, and are predicted to be anomalies.Press CTRL+C to copymysql> CALL sys.ML_PREDICT_TABLE('mlcorpus_anomaly_detection.volcanoes-b3_anomaly_train', @anomaly, 'mlcorpus_anomaly_detection.volcanoes-predictions_threshold', JSON_OBJECT('threshold', 0.01)); Query OK, 0 rows affected (12.77 sec) mysql> SELECT * FROM mlcorpus_anomaly_detection.volcanoes-predictions_threshold LIMIT 5; +-------------------+------+------+----------+--------+----------------------------------------------------------------------------------------+ | _4aad19ca6e_pk_id | V1 | V2 | V3 | target | ml_results | +-------------------+------+------+----------+--------+----------------------------------------------------------------------------------------+ | 1 | 128 | 802 | 0.47255 | 0 | {'predictions': {'is_anomaly': 1}, 'probabilities': {'normal': 0.95, 'anomaly': 0.05}} | | 2 | 631 | 642 | 0.387302 | 0 | {'predictions': {'is_anomaly': 1}, 'probabilities': {'normal': 0.96, 'anomaly': 0.04}} | | 3 | 438 | 959 | 0.556034 | 0 | {'predictions': {'is_anomaly': 1}, 'probabilities': {'normal': 0.74, 'anomaly': 0.26}} | | 4 | 473 | 779 | 0.407626 | 0 | {'predictions': {'is_anomaly': 1}, 'probabilities': {'normal': 0.87, 'anomaly': 0.13}} | | 5 | 67 | 933 | 0.383843 | 0 | {'predictions': {'is_anomaly': 1}, 'probabilities': {'normal': 0.95, 'anomaly': 0.05}} | +-------------------+------+------+----------+--------+----------------------------------------------------------------------------------------+ 5 rows in set (0.00 sec)
-
An
ML_SCORE
example that uses theaccuracy
metric with athreshold
set to 90%.Press CTRL+C to copymysql> CALL sys.ML_SCORE('mlcorpus_anomaly_detection.volcanoes-b3_anomaly_train', 'target', @anomaly, 'accuracy', @score, JSON_OBJECT('threshold', 0.9)); Query OK, 0 rows affected (1.86 sec) mysql> SELECT @score; +--------------------+ | @score | +--------------------+ | 0.9791129231452942 | +--------------------+ 1 row in set (0.00 sec)
-
An
ML_SCORE
example that uses theprecision_at_k
metric with atopk
value of 10.Press CTRL+C to copymysql> CALL sys.ML_SCORE('mlcorpus_anomaly_detection.volcanoes-b3_anomaly_train', 'target', @anomaly, 'precision_at_k', @score, JSON_OBJECT('topk', 10)); Query OK, 0 rows affected (5.84 sec) mysql> SELECT @score; +---------------------+ | @score | +---------------------+ | 0 | +---------------------+ 1 row in set (0.00 sec)
-
An
ML_PREDICT_ROW
example that overrides theensemble_score
value from theML_TRAIN
routine to a new value of 0.8Press CTRL+C to copymysql> SET @row_input = JSON_OBJECT('V1', 250, 'V2', 525, 'V3', 0.438976); mysql> SELECT sys.ML_PREDICT_ROW(@row_input, @semsup_gknn, CAST('{"experimental": {"semisupervised": {"supervised_submodel_weight": 0.8}}}' as JSON));
-
An
ML_PREDICT_TABLE
example that overrides theensemble_score
value from theML_TRAIN
routine to a new value of 0.5.Press CTRL+C to copymysql> CALL sys.ML_PREDICT_TABLE('mlcorpus.anomaly_train', @semsup_gknn, 'mlcorpus.preds_gknn_weighted', CAST('{"experimental": {"semisupervised": {"supervised_submodel_weight": 0.5}}}' as JSON));
-
An
ML_SCORE
example that overrides theensemble_score
value from theML_TRAIN
routine to a new value of 0.5.Press CTRL+C to copymysql> CALL sys.ML_SCORE('mlcorpus.anomaly_train_with_target', "target", @semsup_gknn, 'precision_at_k', @semsup_score_gknn_weighted, CAST('{"topk": 10, "experimental": {"semisupervised": {"supervised_submodel_weight": 0.5}}}' as JSON));
-
An
ML_SCORE
example that overrides theensemble_score
value from theML_TRAIN
routine to a new value of 0.5.Press CTRL+C to copymysql> CALL sys.ML_SCORE('mlcorpus.anomaly_train_with_target', "target", @semsup_gknn, 'precision_at_k', @semsup_score_gknn_weighted, CAST('{"topk": 10, "experimental": {"semisupervised": {"supervised_submodel_weight": 0.5}}}' as JSON));
-
An
ML_PREDICT_TABLE
example that disables log summaries in the results.Press CTRL+C to copymysql> CALL sys.ML_PREDICT_TABLE('mlcorpus.`log_anomaly_just_patterns`', @logad_model, 'mlcorpus.log_anomaly_test_out', JSON_OBJECT('logad_options', JSON_OBJECT('summarize_logs', FALSE))); mysql> SELECT * FROM mlcorpus.log_anomaly_test_out LIMIT 1; +----+--------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------+ | id | parsed_log_segment | ml_results | +----+--------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------+ | 1 | 2024-04-11T14:39:45.443597Z 1 [Note] [MY-013546] [InnoDB] Atomic write enabled | {"index_map": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], | | | 2024-04-11T14:39:45.443618Z 1 [Note] [MY-012932] [InnoDB] PUNCH HOLE support available | "predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.55, "anomaly": 0.45}} | | | 2024-04-11T14:39:45.443631Z 1 [Note] [MY-012944] [InnoDB] Uses event mutexes | | | | 2024-04-11T14:39:45.443635Z 1 [Note] [MY-012945] [InnoDB] GCC builtin __atomic_thread_fence() is used for memory barrier | | | | 2024-04-11T14:39:45.443646Z 1 [Note] [MY-012948] [InnoDB] Compressed tables use zlib 1.2.13 | | | | 2024-04-11T14:40:25.128143Z 0 [Note] [MY-010264] [Server] - '127.0.0.1' resolves to '127.0.0.1'; | | | | 2024-04-11T14:40:25.128182Z 0 [Note] [MY-010251] [Server] Server socket created on IP: '127.0.0.1'. | | | | 2024-04-11T14:40:25.128245Z 0 [Note] [MY-010252] [Server] Server hostname (bind-address): '10.0.1.125'; port: 3306 | | | | 2024-04-11T14:40:25.128272Z 0 [Note] [MY-010264] [Server] - '10.0.1.125' resolves to '10.0.1.125'; | | | | 2024-04-26T13:01:30.287325Z 0 [Warning] [MY-015116] [Server] Background histogram update on nexus.fetches: | | | | Lock wait timeout exceeded; try restarting transaction | | +----+------+------+----------+--------+----------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------+
Where:
index_map
displays index values that represent the row location of the log lines in the input table. Since the thewindow_size
value can cause the shape of the output table to differ from the input table, this allows you to refer to the relevant log lines in the input table.Ten log lines were grouped and analyzed based on the default
window_size
.
-
An
ML_PREDICT_TABLE
example that enables textual summaries for the results.Press CTRL+C to copymysql> CALL sys.ML_PREDICT_TABLE('mlcorpus.`log_anomaly_just_patterns`', @log_anomaly_us, 'mlcorpus.log_anomaly_summary_out', JSON_OBJECT('logad_options', JSON_OBJECT('summarize_logs', TRUE, 'summary_threshold', 0.30))); mysql> SHOW COLUMNS FROM mlcorpus.log_anomaly_summary_out;FieldTypeNullKeyDefaultExtraid intNOPRINULLauto_incrementparsed_log_segmenttextNO NULL ml_resultstextNO NULL mysql> SELECT ml_results FROM mlcorpus.log_anomaly_summary_out; +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | ml_results | +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | {"summary": "\n\nThe text describes the configuration of a MySQL database server. It includes information about the version of MySQL being used, | | the hostname and IP address of the server, and the port number it is listening on. The text also mentions several features enabled in the server, | | such as atomic writes, punch hole support, event mutexes, and memory barriers. Additionally, the text notes that the server uses zlib for compressed tables | | and GCC builtin __atomic_thread_fence() for memory barrier operations. Finally, the text includes a warning about a background histogram update on nexus.fetches | | that exceeded a lock wait timeout, indicating a potential issue with the database server's performance.", | | "index_map": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], "predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.55, "anomaly": 0.45}} | | {"summary": "\n\nThe text describes the execution of a query with QEP size 36800 and target nodes 4 in HeatWave. The query was successfully executed, | | and the ModIndexMap was initialized with 128 superpartitions. ORMA was started with one worker, and FIPC was also started with one worker. | | The text mentions that a script named ruby_env.sh was used to set the backtrace level to ERROR(4) and the logger level to INFO(2).", | | "index_map": [16, 17, 18, 19, 20, 21, 22, 23, 24, 25], "predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.65, "anomaly": 0.35}} | | {"index_map": [40, 41, 42, 43, 44, 45, 46, 47], "predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.98, "anomaly": 0.02}} | | {"index_map": [4, 5, 6, 7, 8, 9, 10, 11, 12, 13], "predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.77, "anomaly": 0.23}} | | {"summary": "\n\nThe text describes the initialization of a system with 128 superpartitions and the starting of | | ORMA (Operational Research Modeling Assistant) and FIPC (Fault-Tolerant In-Process Communication). The text also mentions the setting of logger levels and writing logs to a file. | | Additionally, it describes the loading of configuration from a JSON file and the starting of a new chef-run with a run_id_token.", | | "index_map": [19, 20, 21, 22, 23, 24, 25, 26, 27, 28], "predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.65, "anomaly": 0.35}} | | {"index_map": [43, 44, 45, 46, 47], "predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.96, "anomaly": 0.04}} | | {"index_map": [22, 23, 24, 25, 26, 27, 28, 29, 30, 31], "predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.87, "anomaly": 0.13}} | | {"summary": " The text is a log entry indicating that the system MY-011071 is using the RAPID method to offload data to HeatWave with qid=190075 and SCN=2889192.", | | "index_map": [46, 47], "predictions": {"is_anomaly": 1}, "probabilities": {"normal": 0.0, "anomaly": 1.0}} | | {"index_map": [7, 8, 9, 10, 11, 12, 13, 14, 15, 16], "predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.78, "anomaly": 0.22}} | | {"index_map": [25, 26, 27, 28, 29, 30, 31, 32, 33, 34], "predictions": {"is_anomaly": 0}, "probabilities": {"normal": 1.0, "anomaly": 0.0}} | | {"index_map": [10, 11, 12, 13, 14, 15, 16, 17, 18, 19], "predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.78, "anomaly": 0.22}} | | {"index_map": [28, 29, 30, 31, 32, 33, 34, 35, 36, 37], "predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.81, "anomaly": 0.19}} | | {"index_map": [13, 14, 15, 16, 17, 18, 19, 20, 21, 22], "predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.71, "anomaly": 0.29}} | | {"index_map": [31, 32, 33, 34, 35, 36, 37, 38, 39, 40], "predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.8, "anomaly": 0.2}} | | {"index_map": [34, 35, 36, 37, 38, 39, 40, 41, 42, 43], "predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.99, "anomaly": 0.01}} | | {"index_map": [37, 38, 39, 40, 41, 42, 43, 44, 45, 46], "predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.98, "anomaly": 0.02}} | +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Where:
summarize_logs
is enabled, so HeatWave GenAI is leveraged to generate summaries of the results.Summaries are generated for results that are labeled as an anomaly or exceed the
summary_threshold
value of 30%.