ML_PREDICT_TABLE
generates predictions for an entire table of unlabeled data.
AutoML performs the predictions in parallel.
ML_PREDICT_TABLE
is a compute intensive process. If
ML_PREDICT_TABLE
takes a long time to complete, manually limit input tables to
a maximum of 1,000 rows.
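One way to apply the 1,000-row limit is to copy a slice of the input into a smaller table and run predictions on that slice. The following is a minimal sketch rather than a prescribed procedure. The table names census_data.census_batch and census_data.census_batch_predictions are hypothetical, and the session variable @census_model is assumed to already hold a valid model handle.

```sql
-- Hypothetical batch table holding at most 1,000 rows of the input data.
-- LIMIT without ORDER BY returns an arbitrary slice; add ORDER BY for a repeatable one.
CREATE TABLE census_data.census_batch AS
  SELECT * FROM census_data.census_train LIMIT 1000;

-- Generate predictions for the batch only.
CALL sys.ML_PREDICT_TABLE('census_data.census_batch', @census_model,
                          'census_data.census_batch_predictions', NULL);
```

CREATE TABLE ... SELECT does not copy the primary key of the source table, so with this approach the output table would receive the auto-increment primary key column instead.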
A call to
ML_PREDICT_TABLE
can include columns that were not present during
ML_TRAIN.
A table can include extra columns, and still use the AutoML
model. This allows side by side comparisons of target column
labels, ground truth, and predictions in the same table.
ML_PREDICT_TABLE
ignores any extra columns, and appends them to the results.
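As a sketch of such a side-by-side comparison, the query below assumes an output table shaped like this topic's census example, where revenue holds the ground truth and Prediction holds the predicted label. Both column names are taken from that example and may differ for other models.

```sql
-- Share of rows where the ground-truth label matches the predicted label.
-- The comparison evaluates to 1 or 0 per row, so AVG() returns the match rate.
SELECT AVG(revenue = Prediction) AS agreement
FROM census_data.census_train_predictions;
```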
The output table includes a primary key:
If the input table has a primary key, the output table has the same primary key.
If the input table does not have a primary key, the output table has a new primary key column that auto increments. The name of the new primary key column is _4aad19ca6e_pk_id. The input table must not have a column with the name _4aad19ca6e_pk_id that is not a primary key.
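Before running predictions, it can help to confirm that the input table does not already contain a conflicting column. The following sketch uses the standard information_schema.COLUMNS table. The schema and table names are taken from the census example in this topic and should be replaced with your own.

```sql
-- An empty result means the input table has no column with the reserved name.
-- If a row is returned, COLUMN_KEY should be 'PRI' (the column is part of the primary key).
SELECT COLUMN_NAME, COLUMN_KEY
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'census_data'
  AND TABLE_NAME = 'census_train'
  AND COLUMN_NAME = '_4aad19ca6e_pk_id';
```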
The output of predictions includes the
ml_results column, which contains the
prediction results and the data. The combination of results
and data must be less than 65,532 characters.
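The following sketch shows one way to read individual values out of ml_results and to see how large each entry is. It assumes the classification output from the census example, where the predicted label is stored under the predictions.revenue key. Treating CHAR_LENGTH of the serialized value as the measure for the character limit is also an assumption.

```sql
-- Extract the predicted label as plain text and report the size of each ml_results value.
SELECT ml_results->>'$.predictions.revenue' AS predicted_revenue,
       CHAR_LENGTH(ml_results) AS ml_results_length
FROM census_data.census_train_predictions
LIMIT 5;
```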
You have the option to specify the input table and output table as the same table if specific conditions are met. See Input Tables and Output Tables to learn more.
ML_PREDICT_TABLE
supports data drift detection for classification and
regression models with the following:
The options parameter includes the additional_details boolean value.
The ml_results column includes the drift JSON object literal.
See Analyze Data Drift.
mysql> CALL sys.ML_PREDICT_TABLE(table_name, model_handle, output_table_name, [options]);
options: {
JSON_OBJECT("key","value"[,"key","value"] ...)
"key","value": {
['threshold', 'N']
['topk', 'N']
['recommend', {'ratings'|'items'|'users'|'users_to_items'|'items_to_users'|'items_to_items'|'users_to_users'}|NULL]
['remove_seen', {'true'|'false'}]
['additional_details', {'true'|'false'}]
['prediction_interval', 'N']
['item_metadata', JSON_OBJECT('table_name'[,'database_name.table_name'] ...)]
['user_metadata', JSON_OBJECT('table_name'[,'database_name.table_name'] ...)]
['logad_options', JSON_OBJECT("key","value"[,"key","value"] ...)]
"key","value": {
['summarize_logs', {'true'|'false'}]
['summary_threshold', 'N']
}
}
}
Set the following required parameters:
table_name: Specifies the fully qualified name of the input table (database_name.table_name). The input table should contain the same feature columns as the training dataset. If the target column is included in the input table, it is not considered when generating predictions.
model_handle: Specifies the model handle or a session variable containing the model handle. See Work with Model Handles.
output_table_name: Specifies the table where predictions are stored. A fully qualified table name must be specified (database_name.table_name). You have the option to specify the input table and output table as the same table if specific conditions are met. See Input Tables and Output Tables to learn more.
Set the following options in JSON format as
needed.
To view data drift detection values for classification and regression models, set the additional_details option to true. The ml_results column then includes the drift JSON object literal.
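A call that requests drift details might look like the following sketch. The output table name census_data.census_drift_predictions is hypothetical, and reading the drift object from a top-level drift key in ml_results is an assumption. Inspect the full ml_results value if the path returns NULL.

```sql
-- Request additional details so that ml_results carries the drift object.
CALL sys.ML_PREDICT_TABLE('census_data.census_train', @census_model,
                          'census_data.census_drift_predictions',
                          JSON_OBJECT('additional_details', true));

-- Assumed location of the drift object inside ml_results.
SELECT ml_results->'$.drift' AS drift
FROM census_data.census_drift_predictions
LIMIT 5;
```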
Additional options are available for recommendation, anomaly detection, and forecasting models.
Set the following options as needed for recommendation models.
threshold: The optional threshold that defines positive feedback and a relevant sample. Use it only with ranking metrics. It can be used for either explicit or implicit feedback.
topk: The optional top number of recommendations to provide. The default is 3. Set a positive integer between 1 and the number of rows in the table.
A recommendation task with implicit feedback can use both threshold and topk.
recommend: Specify what to recommend.
  ratings: Use this option to predict ratings. This is the default value.
  The target column is prediction, and the values are float.
  The input table must contain at least two columns with the same names as the user column and item column from the training model.
  items: Use this option to recommend items for users.
  The target column is item_recommendation, and the values are:
  JSON_OBJECT("column_item_id_name", JSON_ARRAY("item_1", ... , "item_k"), "column_rating_name" , JSON_ARRAY(rating_1, ..., rating_k))
  The input table must contain at least one column with the same name as the user column from the training model.
  users: Use this option to recommend users for items.
  The target column is user_recommendation, and the values are:
  JSON_OBJECT("column_user_id_name", JSON_ARRAY("user_1", ... , "user_k"), "column_rating_name" , JSON_ARRAY(rating_1, ..., rating_k))
  The input table must contain at least one column with the same name as the item column from the training model.
  users_to_items: This is the same as items.
  items_to_users: This is the same as users.
  items_to_items: Use this option to recommend similar items for items.
  The target column is item_recommendation, and the values are:
  JSON_OBJECT("column_item_id_name", JSON_ARRAY("item_1", ... , "item_k"))
  The input table must contain at least one column with the same name as the item column from the training model.
  users_to_users: Use this option to recommend similar users for users.
  The target column is user_recommendation, and the values are:
  JSON_OBJECT("column_user_id_name", JSON_ARRAY("user_1", ... , "user_k"))
  The input table must contain at least one column with the same name as the user column from the training model.
remove_seen: If the input table overlaps with the training table, and remove_seen is true, then the model will not repeat existing interactions. The default is true. Set remove_seen to false to repeat existing interactions from the training table.
item_metadata: Defines the table that has item descriptions. It is a JSON object that has the table_name option as a key, which specifies the table that has item descriptions. One column must be the same as the item_id in the input table.
user_metadata: Defines the table that has user descriptions. It is a JSON object that has the table_name option as a key, which specifies the table that has user descriptions. One column must be the same as the user_id in the input table.
  table_name: To be used with the item_metadata and user_metadata options. It specifies the table name that has item or user descriptions. It must be a string in a fully qualified format (schema_name.table_name) that specifies the table name.
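The recommendation options above can be combined in a single JSON_OBJECT. The calls below are a sketch only. The output tables and the metadata table mlcorpus.item_descriptions are hypothetical, @model is assumed to hold the handle of a trained recommendation model, and whether a given model accepts item_metadata depends on how it was trained.

```sql
-- Recommend the five most similar users for each user in the input table.
CALL sys.ML_PREDICT_TABLE('mlcorpus.test_sample', @model,
                          'mlcorpus.similar_users',
                          JSON_OBJECT('recommend', 'users_to_users', 'topk', 5));

-- Recommend items, allow interactions already seen in training to be repeated,
-- and point to a hypothetical item description table. That table must contain
-- a column that matches the item_id column of the input table.
CALL sys.ML_PREDICT_TABLE('mlcorpus.test_sample', @model,
                          'mlcorpus.item_recommendations',
                          JSON_OBJECT('recommend', 'items',
                                      'remove_seen', false,
                                      'item_metadata',
                                      JSON_OBJECT('table_name', 'mlcorpus.item_descriptions')));
```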
If you run
ML_PREDICT_TABLE
with the log_anomaly_detection task, at
least one column must act as the primary key to establish the
temporal order of logs.
Set the following options as needed for anomaly detection models.
threshold: The threshold you set on anomaly detection models determines which rows in the output table are labeled as anomalies with an anomaly score of 1, or normal with an anomaly score of 0. The value for the threshold is the degree to which a row of data or log segment is considered for anomaly detection. Any sample with an anomaly score above the threshold is classified as an anomaly. The default value is the (1 - contamination)-th percentile of all the anomaly scores.
topk: The optional top K rows to display with the highest anomaly scores. Set a positive integer between 1 and the number of rows in the table. If topk is not set, ML_PREDICT_TABLE uses threshold.
Do not set both threshold and topk. Use threshold or topk, or set options to NULL.
logad_options: A JSON_OBJECT that allows you to configure the following options for running an anomaly detection model on log data.
  summarize_logs: Allows you to leverage GenAI to generate textual summaries of results. Enable this option by setting it to TRUE. If enabled, summaries are generated for log segments that are labeled as an anomaly or have anomaly scores higher than the value set for the summary_threshold.
  summary_threshold: Determines the rows in the output table that are summarized. This does not affect how the contamination and threshold options determine anomalies. You can set a value greater than 0 and less than 1. The default value is NULL. If NULL is selected, only the log segments tagged with is_anomaly are used to generate summaries.
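The following sketch illustrates two of these options. The first call uses topk instead of threshold, and the second enables log summaries with a summary_threshold. The output table names are hypothetical, and @anomaly and @logad_model are assumed to hold the handles of suitable anomaly detection models.

```sql
-- Use topk (and no threshold) to surface the 10 rows with the highest anomaly scores.
CALL sys.ML_PREDICT_TABLE('mlcorpus_anomaly_detection.volcanoes-b3_anomaly_train', @anomaly,
                          'mlcorpus_anomaly_detection.volcanoes_predictions_topk',
                          JSON_OBJECT('topk', 10));

-- Summarize log segments that are labeled as anomalies or score above 0.7.
CALL sys.ML_PREDICT_TABLE('mlcorpus.`log_anomaly_just_patterns`', @logad_model,
                          'mlcorpus.log_anomaly_summaries',
                          JSON_OBJECT('logad_options',
                                      JSON_OBJECT('summarize_logs', TRUE,
                                                  'summary_threshold', 0.7)));
```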
Set the following options as needed for forecasting models.
prediction_interval: Use this to generate forecasted values with lower and upper bounds based on a specific prediction interval (level of confidence). For the prediction_interval value:
  The default value is 0.95.
  The data type for this value must be FLOAT.
  The value must be greater than 0 and less than 1.
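A forecasting call with a narrower interval than the default might look like the following sketch. The table names and the @forecast_model session variable are hypothetical, and the numeric literal 0.9 is assumed to satisfy the FLOAT requirement when passed through JSON_OBJECT.

```sql
-- Request forecasts with lower and upper bounds for a 90% prediction interval.
CALL sys.ML_PREDICT_TABLE('mlcorpus.forecast_test', @forecast_model,
                          'mlcorpus.forecast_predictions',
                          JSON_OBJECT('prediction_interval', 0.9));
```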
The following typical usage example specifies the fully qualified name of the table to generate predictions for, the session variable containing the model handle, and the fully qualified output table name.
mysql> CALL sys.ML_PREDICT_TABLE('census_data.census_train', @census_model, 'census_data.census_train_predictions', NULL);
To view ML_PREDICT_TABLE results, query the output table. The table shows the predictions and the feature column values used to make each prediction. The table includes the primary key, _4aad19ca6e_pk_id, and the ml_results column, which uses JSON format:
mysql> SELECT * FROM census_train_predictions LIMIT 5;
+-------------------+-----+------------------+--------+--------------+---------------+--------------------+-------------------+--------------+-------+--------+--------------+--------------+----------------+----------------+---------+------------+---------------------------------------------------------------------------------------+
| _4aad19ca6e_pk_id | age | workclass        | fnlwgt | education    | education-num | marital-status     | occupation        | relationship | race  | sex    | capital-gain | capital-loss | hours-per-week | native-country | revenue | Prediction | ml_results                                                                            |
+-------------------+-----+------------------+--------+--------------+---------------+--------------------+-------------------+--------------+-------+--------+--------------+--------------+----------------+----------------+---------+------------+---------------------------------------------------------------------------------------+
|                 1 |  37 | Private          |  99146 | Bachelors    |            13 | Married-civ-spouse | Exec-managerial   | Husband      | White | Male   |            0 |         1977 |             50 | United-States  | >50K    | <=50K      | {"predictions": {"revenue": "<=50K"}, "probabilities": {"<=50K": 0.58, ">50K": 0.42}} |
|                 2 |  34 | Private          |  27409 | 9th          |             5 | Married-civ-spouse | Craft-repair      | Husband      | White | Male   |            0 |            0 |             50 | United-States  | <=50K   | <=50K      | {"predictions": {"revenue": "<=50K"}, "probabilities": {"<=50K": 0.76, ">50K": 0.24}} |
|                 3 |  30 | Private          | 299507 | Assoc-acdm   |            12 | Separated          | Other-service     | Unmarried    | White | Female |            0 |            0 |             40 | United-States  | <=50K   | <=50K      | {"predictions": {"revenue": "<=50K"}, "probabilities": {"<=50K": 0.99, ">50K": 0.01}} |
|                 4 |  62 | Self-emp-not-inc | 102631 | Some-college |            10 | Widowed            | Farming-fishing   | Unmarried    | White | Female |            0 |            0 |             50 | United-States  | <=50K   | <=50K      | {"predictions": {"revenue": "<=50K"}, "probabilities": {"<=50K": 0.9, ">50K": 0.1}}   |
|                 5 |  51 | Private          | 153486 | Some-college |            10 | Married-civ-spouse | Handlers-cleaners | Husband      | White | Male   |            0 |            0 |             40 | United-States  | <=50K   | <=50K      | {"predictions": {"revenue": "<=50K"}, "probabilities": {"<=50K": 0.7, ">50K": 0.3}}   |
+-------------------+-----+------------------+--------+--------------+---------------+--------------------+-------------------+--------------+-------+--------+--------------+--------------+----------------+----------------+---------+------------+---------------------------------------------------------------------------------------+
5 rows in set (0.0014 sec)
The following example generates a table of recommendations. The output recommends the top three items that particular users will like.
mysql> CALL sys.ML_PREDICT_TABLE('mlcorpus.test_sample', @model, 'mlcorpus.table_predictions_users', JSON_OBJECT("recommend", "items", "topk", 3));
Query OK, 0 rows affected (5.0672 sec)
mysql> SELECT * FROM mlcorpus.table_predictions_users LIMIT 3;
+-------------------+---------+---------+--------+--------------------------------------------------------------------------------+
| _4aad19ca6e_pk_id | user_id | item_id | rating | ml_results                                                                     |
+-------------------+---------+---------+--------+--------------------------------------------------------------------------------+
|                 1 |    1026 |   13763 |      1 | {"predictions": {"item_id": ["10", "14", "11"], "rating": [3.43, 3.37, 3.18]}} |
|                 2 |     992 |   16114 |      1 | {"predictions": {"item_id": ["10", "14", "11"], "rating": [3.42, 3.38, 3.17]}} |
|                 3 |    1863 |    4527 |      1 | {"predictions": {"item_id": ["10", "14", "11"], "rating": [3.42, 3.37, 3.18]}} |
+-------------------+---------+---------+--------+--------------------------------------------------------------------------------+
The following example generates a table of anomaly detection predictions. A threshold value of 1% is specified.
mysql> CALL sys.ML_PREDICT_TABLE('mlcorpus_anomaly_detection.volcanoes-b3_anomaly_train', @anomaly, 'mlcorpus_anomaly_detection.volcanoes-predictions_threshold', JSON_OBJECT('threshold', 0.01));
Query OK, 0 rows affected (12.77 sec)
mysql> SELECT * FROM mlcorpus_anomaly_detection.`volcanoes-predictions_threshold` LIMIT 5;
+-------------------+------+------+----------+--------+----------------------------------------------------------------------------------------+
| _4aad19ca6e_pk_id | V1   | V2   | V3       | target | ml_results                                                                             |
+-------------------+------+------+----------+--------+----------------------------------------------------------------------------------------+
|                 1 |  128 |  802 |  0.47255 |      0 | {'predictions': {'is_anomaly': 1}, 'probabilities': {'normal': 0.95, 'anomaly': 0.05}} |
|                 2 |  631 |  642 | 0.387302 |      0 | {'predictions': {'is_anomaly': 1}, 'probabilities': {'normal': 0.96, 'anomaly': 0.04}} |
|                 3 |  438 |  959 | 0.556034 |      0 | {'predictions': {'is_anomaly': 1}, 'probabilities': {'normal': 0.74, 'anomaly': 0.26}} |
|                 4 |  473 |  779 | 0.407626 |      0 | {'predictions': {'is_anomaly': 1}, 'probabilities': {'normal': 0.87, 'anomaly': 0.13}} |
|                 5 |   67 |  933 | 0.383843 |      0 | {'predictions': {'is_anomaly': 1}, 'probabilities': {'normal': 0.95, 'anomaly': 0.05}} |
+-------------------+------+------+----------+--------+----------------------------------------------------------------------------------------+
5 rows in set (0.00 sec)
The following example generates a table of anomaly detection predictions by using semi-supervised learning. It overrides the ensemble_score value from the ML_TRAIN routine to a new value of 0.5.
mysql> CALL sys.ML_PREDICT_TABLE('mlcorpus.anomaly_train', @semsup_gknn, 'mlcorpus.preds_gknn_weighted', CAST('{"experimental": {"semisupervised": {"supervised_submodel_weight": 0.5}}}' as JSON));
The following example generates a table of anomaly detection predictions for log data. It disables log summaries in the results.
mysql> CALL sys.ML_PREDICT_TABLE('mlcorpus.`log_anomaly_just_patterns`', @logad_model, 'mlcorpus.log_anomaly_test_out', JSON_OBJECT('logad_options', JSON_OBJECT('summarize_logs', FALSE)));
mysql> SELECT * FROM mlcorpus.log_anomaly_test_out LIMIT 1;
+----+--------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------+
| id | parsed_log_segment                                                                                                       | ml_results                                                                            |
+----+--------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------+
|  1 | 2024-04-11T14:39:45.443597Z 1 [Note] [MY-013546] [InnoDB] Atomic write enabled                                           | {"index_map": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],                                        |
|    | 2024-04-11T14:39:45.443618Z 1 [Note] [MY-012932] [InnoDB] PUNCH HOLE support available                                   | "predictions": {"is_anomaly": 0}, "probabilities": {"normal": 0.55, "anomaly": 0.45}} |
|    | 2024-04-11T14:39:45.443631Z 1 [Note] [MY-012944] [InnoDB] Uses event mutexes                                             |                                                                                       |
|    | 2024-04-11T14:39:45.443635Z 1 [Note] [MY-012945] [InnoDB] GCC builtin __atomic_thread_fence() is used for memory barrier |                                                                                       |
|    | 2024-04-11T14:39:45.443646Z 1 [Note] [MY-012948] [InnoDB] Compressed tables use zlib 1.2.13                              |                                                                                       |
|    | 2024-04-11T14:40:25.128143Z 0 [Note] [MY-010264] [Server] - '127.0.0.1' resolves to '127.0.0.1';                         |                                                                                       |
|    | 2024-04-11T14:40:25.128182Z 0 [Note] [MY-010251] [Server] Server socket created on IP: '127.0.0.1'.                      |                                                                                       |
|    | 2024-04-11T14:40:25.128245Z 0 [Note] [MY-010252] [Server] Server hostname (bind-address): '10.0.1.125'; port: 3306       |                                                                                       |
|    | 2024-04-11T14:40:25.128272Z 0 [Note] [MY-010264] [Server] - '10.0.1.125' resolves to '10.0.1.125';                       |                                                                                       |
|    | 2024-04-26T13:01:30.287325Z 0 [Warning] [MY-015116] [Server] Background histogram update on nexus.fetches:               |                                                                                       |
|    | Lock wait timeout exceeded; try restarting transaction                                                                   |                                                                                       |
+----+--------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------+