ML_PREDICT_TABLE
generates
predictions for an entire table of unlabeled data and saves the
results to an output table. MySQL HeatWave AutoML performs the predictions
in parallel.
This topic has the following sections.
ML_PREDICT_TABLE
is a compute
intensive process. Depending on your MySQL version, we
recommend the following:
Before MySQL 9.4.1, limit operations to batches of rows by splitting large tables into smaller tables by using the
batch_size
option.For MySQL 9.4.1 and later, if
ML_PREDICT_TABLE
takes a long time to complete, manually limit input tables to a maximum of 1,000 rows.
A call to ML_PREDICT_TABLE
can
include columns that were not present during
ML_TRAIN
. A table can include
extra columns, and still use the MySQL HeatWave AutoML model. This allows
side by side comparisons of target column labels, ground
truth, and predictions in the same table.
ML_PREDICT_TABLE
ignores any
extra columns, and appends them to the results.
The output table includes a primary key:
If the input table has a primary key, the output table has the same primary key.
-
If the input table does not have a primary key, the output table has a new primary key column that auto increments.
As of MySQL 8.4.1, the name of the new primary key column is
_4aad19ca6e_pk_id
. The input table must not have a column with the name_4aad19ca6e_pk_id
that is not a primary key.Before MySQL 8.4.1, the name of the new primary key column is
_id
. The input table must not have a column with the name_id
that is not a primary key.
The output of predictions includes the
ml_results
column, which contains the
prediction results and the data. The combination of results
and data must be less than 65,532 characters.
As of MySQL 9.4.1, you have the option to specify the input table and output table as the same table if specific conditions are met. See Input Tables and Output Tables to learn more.
ML_PREDICT_TABLE
supports data
drift detection for classification, regression, and anomaly
detection (as of MySQL 9.3.2) models with the following:
The
options
parameter includes theadditional_details
boolean value.The
ml_results
column includes thedrift
JSON object literal.
See Analyze Data Drift.
mysql> CALL sys.ML_PREDICT_TABLE(table_name, model_handle, output_table_name), [options]);
options: {
JSON_OBJECT("key","value"[,"key","value"] ...)
"key","value": {
['threshold', 'N']
['topk', 'N']
['recommend', {'ratings'|'items'|'users'|'users_to_items'|'items_to_users'|'items_to_items'|'users_to_users'}|NULL]
['remove_seen', {'true'|'false'}]
['batch_size', 'N']
['additional_details', {'true'|'false'}]
['prediction_interval', 'N']
['logad_options', JSON_OBJECT(("key","value"[,"key","value"] ...)
"key","value": {
['summarize_logs', {'true'|'false'}]
['summary_threshold', 'N']
}
}
}
Set the following required parameters:
table_name
: Specifies the fully qualified name of the input table (database_name.table_name
). The input table should contain the same feature columns as the training dataset. If the target column is included in the input table, it is not considered when generating predictions.model_handle
: Specifies the model handle or a session variable containing the model handle. See Work with Model Handles.output_table_name
: Specifies the table where predictions are stored. A fully qualified table name must be specified (database_name.table_name
). As of MySQL 9.4.1, you have the option to specify the input table and output table as the same table if specific conditions are met. See Input Tables and Output Tables to learn more.
Set the following options in JSON
format as
needed.
To view data drift detection values for classification and regression models, set the
additional_details
option totrue
. Theml_results
includes thedrift
JSON object literal.batch_size
: Deprecated as of MySQL 9.4.1. The size of each batch. You can set a value between 1 and 1,000. The default is 1,000, and this provides the best results. If on MySQL 9.4.1 and later andML_PREDICT_TABLE
takes a long time to complete, manually limit input tables to a maximum of 1,000 rows.
Additional options are available for recommendation, anomaly detection, and forecasting models.
Set the following options as needed for recommendation models.
threshold
: The optional threshold that defines positive feedback, and a relevant sample. Only use with ranking metrics. It can be used for either explicit or implicit feedback.-
topk
: The optional top number of recommendations to provide. The default is3
. Set a positive integer between 1 and the number of rows in the table.A
recommendation
task with implicit feedback can use boththreshold
andtopk
. -
recommend
: Specify what to recommend.-
ratings
: Use this option to predict ratings. This is the default value.The target column is
prediction
, and the values arefloat
.The input table must contain at least two columns with the same names as the user column and item column from the training model.
-
items
: Use this option to recommend items for users.The target column is
item_recommendation
, and the values are:JSON_OBJECT("column_item_id_name", JSON_ARRAY("item_1", ... , "item_k"), "column_rating_name" , JSON_ARRAY(rating_1, ..., rating_k))
The input table must contain at least one column with the same name as the user column from the training model.
-
users
: Use this option to recommend users for items.The target column is
user_recommendation
, and the values are:JSON_OBJECT("column_user_id_name", JSON_ARRAY("user_1", ... , "user_k"), "column_rating_name" , JSON_ARRAY(rating_1, ..., rating_k))
The input table must contain at least one column with the same name as the item column from the training model.
users_to_items
: This is the same asitems
.items_to_users
: This is the same asusers
.-
items_to_items
: Use this option to recommend similar items for items.The target column is
item_recommendation
, and the values are:JSON_OBJECT("column_item_id_name", JSON_ARRAY("item_1", ... , "item_k"))
The input table must contain at least one column with the same name as the item column from the training model.
-
users_to_users
: Use this option to recommend similar users for users.The target column is
user_recommendation
, and the values are:JSON_OBJECT("column_user_id_name", JSON_ARRAY("user_1", ... , "user_k"))
The input table must at least contain a column with the same name as the user column from the training model.
-
remove_seen
: If the input table overlaps with the training table, andremove_seen
istrue
, then the model will not repeat existing interactions. The default istrue
. Setremove_seen
tofalse
to repeat existing interactions from the training table.
As of MySQL 9.4.1, if you run
ML_PREDICT_TABLE
with the
log_anomaly_detection
task, at least one
column must act as the primary key to establish the temporal
order of logs.
Set the following options as needed for anomaly detection models.
threshold
: The threshold you set on anomaly detection models determines which rows in the output table are labeled as anomalies with an anomaly score of1
, or normal with an anomaly score of0
. The value for the threshold is the degree to which a row of data or log segment is considered for anomaly detection. Any sample with an anomaly score above the threshold is classified an anomaly. The default value is (1 -contamination
)-th percentile of all the anomaly scores.-
topk
: The optional top K rows to display with the highest anomaly scores. Set a positive integer between 1 and the number of rows in the table. Iftopk
is not set,ML_PREDICT_TABLE
usesthreshold
.Do not set both
threshold
andtopk
. Usethreshold
ortopk
, or setoptions
toNULL
. -
logad_options
: AJSON_OBJECT
that allows you to configure the following options for running an anomaly detection model on log data (MySQL 9.2.2 and later).summarize_logs
: Allows you to leverage MySQL HeatWave GenAI to generate textual summaries of results. Enable this option by setting it toTRUE
. If enabled, summaries are generated for log segments that are labeled as an anomaly or have anomaly scores higher than the value set for thesummary_threshold
.summary_threshold
: Determines the rows in the output table that are summarized. This does not affect how thecontamination
andthreshold
options determine anomalies. You can set a value greater than 0 and less than 1. The default value isNULL
. IfNULL
is selected, only the log segments tagged withis_anomaly
are used to generate summaries.
Set the following options as needed for forecasting models.
-
prediction_interval
: Use this to generate forecasted values with lower and upper bounds based on a specific prediction interval (level of confidence). For theprediction_interval
value:The default value is 0.95.
The data type for this value must be FLOAT.
The value must be greater than 0 and less than 1.
-
A typical usage example that specifies the fully qualified name of the table to generate predictions for, the session variable containing the model handle, and the fully qualified output table name.
mysql> CALL sys.ML_PREDICT_TABLE('census_data.census_train', @census_model, 'census_data.census_train_predictions', NULL);
To view
ML_PREDICT_TABLE
results, query the output table. The table shows the predictions and the feature column values used to make each prediction. The table includes the primary key,_4aad19ca6e_pk_id
, and theml_results
column, displays the predictions and probabilities for each prediction.mysql> SELECT * FROM census_train_predictions LIMIT 5; +-------------------+-----+------------------+--------+--------------+---------------+--------------------+-------------------+--------------+-------+--------+--------------+--------------+----------------+----------------+---------+------------+---------------------------------------------------------------------------------------+ | _4aad19ca6e_pk_id | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | revenue | Prediction | ml_results | +-------------------+-----+------------------+--------+--------------+---------------+--------------------+-------------------+--------------+-------+--------+--------------+--------------+----------------+----------------+---------+------------+---------------------------------------------------------------------------------------+ | 1 | 37 | Private | 99146 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 1977 | 50 | United-States | >50K | <=50K | {"predictions": {"revenue": "<=50K"}, "probabilities": {"<=50K": 0.58, ">50K": 0.42}} | | 2 | 34 | Private | 27409 | 9th | 5 | Married-civ-spouse | Craft-repair | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K | <=50K | {"predictions": {"revenue": "<=50K"}, "probabilities": {"<=50K": 0.76, ">50K": 0.24}} | | 3 | 30 | Private | 299507 | Assoc-acdm | 12 | Separated | Other-service | Unmarried | White | Female | 0 | 0 | 40 | United-States | <=50K | <=50K | {"predictions": {"revenue": "<=50K"}, "probabilities": {"<=50K": 0.99, ">50K": 0.01}} | | 4 | 62 | Self-emp-not-inc | 102631 | Some-college | 10 | Widowed | Farming-fishing | Unmarried | White | Female | 0 | 0 | 50 | United-States | <=50K | <=50K | {"predictions": {"revenue": "<=50K"}, "probabilities": {"<=50K": 0.9, ">50K": 0.1}} | | 5 | 51 | Private | 153486 | Some-college | 10 | Married-civ-spouse | Handlers-cleaners | Husband | White | Male | 0 | 0 | 40 | United-States | <=50K | <=50K | {"predictions": {"revenue": "<=50K"}, "probabilities": {"<=50K": 0.7, ">50K": 0.3}} | +-------------------+-----+------------------+--------+--------------+---------------+--------------------+-------------------+--------------+-------+--------+--------------+--------------+----------------+----------------+---------+------------+---------------------------------------------------------------------------------------+ 5 rows in set (0.0014 sec)
-
The following example generates a table of recommendations. The output recommends the top three items that particular users will like.
mysql> CALL sys.ML_PREDICT_TABLE('mlcorpus.test_sample', @model, 'mlcorpus.table_predictions_users', JSON_OBJECT("recommend", "items", "topk", 3)); Query OK, 0 rows affected (5.0672 sec) mysql> SELECT * FROM mlcorpus.table_predictions_users LIMIT 3; +-------------------+---------+---------+--------+--------------------------------------------------------------------------------+ | _4aad19ca6e_pk_id | user_id | item_id | rating | ml_results | +-------------------+---------+---------+--------+--------------------------------------------------------------------------------+ | 1 | 1026 | 13763 | 1 | {"predictions": {"item_id": ["10", "14", "11"], "rating": [3.43, 3.37, 3.18]}} | | 2 | 992 | 16114 | 1 | {"predictions": {"item_id": ["10", "14", "11"], "rating": [3.42, 3.38, 3.17]}} | | 3 | 1863 | 4527 | 1 | {"predictions": {"item_id": ["10", "14", "11"], "rating": [3.42, 3.37, 3.18]}} | +-------------------+---------+---------+--------+--------------------------------------------------------------------------------+
-
The following example generates a table of anomaly detection predictions. A threshold value of 1% is specified. The
ml_results
column displays the prediction if each row is an anomaly, and the probability for that prediction.mysql> CALL sys.ML_PREDICT_TABLE('volcano_data.volcano_data_train', 'anomaly_model', 'volcano_data.volcano_predictions', JSON_OBJECT('threshold', 0.01)); Query OK, 0 rows affected (4.8868 sec) mysql> SELECT * FROM mlcorpus_anomaly_detection.volcanoes-predictions_threshold LIMIT 5; +-------------------+-------+-------+----------+--------+--------------------------------------------------------------------------------------------+ | _4aad19ca6e_pk_id | V1 | V2 | V3 | target | ml_results | +-------------------+-------+-------+----------+--------+--------------------------------------------------------------------------------------------+ | 1 | 806.0 | 962.0 | 0.353207 | n | {"predictions": {"is_anomaly": 1}, "probabilities": {"normal": 0.978, "anomaly": 0.022}} | | 2 | 326.0 | 254.0 | 0.368590 | n | {"predictions": {"is_anomaly": 1}, "probabilities": {"normal": 0.9873, "anomaly": 0.0127}} | | 3 | 586.0 | 654.0 | 0.409559 | n | {"predictions": {"is_anomaly": 1}, "probabilities": {"normal": 0.9786, "anomaly": 0.0214}} | | 4 | 370.0 | 816.0 | 0.404861 | n | {"predictions": {"is_anomaly": 1}, "probabilities": {"normal": 0.9885, "anomaly": 0.0115}} | | 5 | 918.0 | 952.0 | 0.419940 | n | {"predictions": {"is_anomaly": 1}, "probabilities": {"normal": 0.9767, "anomaly": 0.0233}} | +-------------------+-------+-------+----------+--------+--------------------------------------------------------------------------------------------+ 5 rows in set (0.0409 sec)
-
The following example generates a table of anomaly detection predictions by using semi-supervised learning. It overrides the
ensemble_score
value from theML_TRAIN
routine to a new value of 0.5. Theml_results
column displays the prediction if each row is an anomaly, and the probability for that prediction.mysql> CALL sys.ML_PREDICT_TABLE('anomaly_data.anomaly_data_train', 'semisupervised_model', 'anomaly_data.anomaly_predictions_semisupervised', CAST('{"experimental": {"semisupervised": {"supervised_submodel_weight": 0.5}}}' as JSON)); Query OK, 0 rows affected (1.5698 sec) mysql> SELECT * FROM anomaly_predictions_semisupervised LIMIT 5; +-------------------+----------------------+---------------------+---------------------+----------------------+----------------------+---------------------+----------------------+----------+--------------------+---------+--------------------------------------------------------------------------------------------+ | _4aad19ca6e_pk_id | att1 | att2 | att3 | att4 | att5 | att6 | att7 | att8 | att9 | 1utlier | ml_results | +-------------------+----------------------+---------------------+---------------------+----------------------+----------------------+---------------------+----------------------+----------+--------------------+---------+--------------------------------------------------------------------------------------------+ | 1 | 0.114285714285714280 | 0.50000000000000000 | 0.46808510638297873 | 0.020145495243424735 | 0.252808988764044950 | 0.08848812744214006 | 0.666666666666666600 | 0.753125 | 0.6804123711340206 | 0 | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.992, "anomaly": 0.008}} | | 2 | 0.171428571428571430 | 0.47159090909090910 | 0.25531914893617020 | 0.017907106883044210 | 0.258426966292134850 | 0.08806732792305380 | 0.472222222222222200 | 0.712500 | 0.6855670103092784 | 0 | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.9986, "anomaly": 0.0014}} | | 3 | 0.142857142857142850 | 0.50568181818181820 | 0.12765957446808510 | 0.020145495243424735 | 0.258426966292134850 | 0.08854824165915239 | 0.416666666666666700 | 0.693750 | 0.6804123711340206 | 0 | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.9945, "anomaly": 0.0055}} | | 4 | 0.114285714285714280 | 0.47727272727272730 | 0.25531914893617020 | 0.020145495243424735 | 0.252808988764044950 | 0.08860835587616471 | 0.527777777777777800 | 0.718750 | 0.6804123711340206 | 0 | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.9984, "anomaly": 0.0016}} | | 5 | 0.114285714285714280 | 0.47727272727272730 | 0.34042553191489360 | 0.016787912702853944 | 0.247191011235955050 | 0.08794709948902915 | 0.583333333333333400 | 0.740625 | 0.6855670103092784 | 0 | {"predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.9989, "anomaly": 0.0011}} | +-------------------+----------------------+---------------------+---------------------+----------------------+----------------------+---------------------+----------------------+----------+--------------------+---------+--------------------------------------------------------------------------------------------+ 5 rows in set (0.0491 sec)
-
The following example generates a table of anomaly detection predictions for log data (MySQL 9.2.2 and later). It disables log summaries in the results.
mysql> CALL sys.ML_PREDICT_TABLE('mlcorpus.`log_anomaly_just_patterns`', @logad_model, 'mlcorpus.log_anomaly_test_out', JSON_OBJECT('logad_options', JSON_OBJECT('summarize_logs', FALSE))); mysql> SELECT * FROM mlcorpus.log_anomaly_test_out LIMIT 1; +----+--------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------+ | id | parsed_log_segment | ml_results | +----+--------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------+ | 1 | 2024-04-11T14:39:45.443597Z 1 [Note] [MY-013546] [InnoDB] Atomic write enabled | {"index_map": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], | | | 2024-04-11T14:39:45.443618Z 1 [Note] [MY-012932] [InnoDB] PUNCH HOLE support available | "predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.55, "anomaly": 0.45}} | | | 2024-04-11T14:39:45.443631Z 1 [Note] [MY-012944] [InnoDB] Uses event mutexes | | | | 2024-04-11T14:39:45.443635Z 1 [Note] [MY-012945] [InnoDB] GCC builtin __atomic_thread_fence() is used for memory barrier | | | | 2024-04-11T14:39:45.443646Z 1 [Note] [MY-012948] [InnoDB] Compressed tables use zlib 1.2.13 | | | | 2024-04-11T14:40:25.128143Z 0 [Note] [MY-010264] [Server] - '127.0.0.1' resolves to '127.0.0.1'; | | | | 2024-04-11T14:40:25.128182Z 0 [Note] [MY-010251] [Server] Server socket created on IP: '127.0.0.1'. | | | | 2024-04-11T14:40:25.128245Z 0 [Note] [MY-010252] [Server] Server hostname (bind-address): '10.0.1.125'; port: 3306 | | | | 2024-04-11T14:40:25.128272Z 0 [Note] [MY-010264] [Server] - '10.0.1.125' resolves to '10.0.1.125'; | | | | 2024-04-26T13:01:30.287325Z 0 [Warning] [MY-015116] [Server] Background histogram update on nexus.fetches: | | | | Lock wait timeout exceeded; try restarting transaction | | +----+------+------+----------+--------+----------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------+