Run the
ML_TRAIN
routine on a training dataset to produce a trained machine
learning model.
Before training models, make sure to review the following:
This topic has the following sections. Refer to the appropriate sections depending on the type of machine learning model you would like to train.
mysql> CALL sys.ML_TRAIN ('table_name', 'target_column_name', [options | NULL], model_handle);
options: {
JSON_OBJECT("key","value"[,"key","value"] ...)
"key","value": {
['task', {'classification'|'regression'|'forecasting'|'anomaly_detection'|'log_anomaly_detection'|'recommendation'|'topic_modeling'}|NULL]
['datetime_index', 'column']
['endogenous_variables', JSON_ARRAY('column'[,'column'] ...)]
['exogenous_variables', JSON_ARRAY('column'[,'column'] ...)]
['model_list', JSON_ARRAY('model'[,'model'] ...)]
['exclude_model_list', JSON_ARRAY('model'[,'model'] ...)]
['optimization_metric', 'metric']
['include_column_list', JSON_ARRAY('column'[,'column'] ...)]
['exclude_column_list', JSON_ARRAY('column'[,'column'] ...)]
['contamination', 'contamination factor']
['supervised_submodel_options', {'n_neighbors', 'N', 'min_labels', N}']
['ensemble_score', 'ensemble metric']
['users', 'users_column']
['items', 'items_column']
['notes', 'notes_text']
['feedback', {'explicit' ['implicit'}]
['feedback_threshold', 'threshold']
['item_metadata', JSON_OBJECT('table_name'[,'database_name.table_name'] ...)]
['user_metadata', JSON_OBJECT('table_name'[,'database_name.table_name'] ...)]
['document_column', 'column_name']
['logad_options', JSON_OBJECT(("key","value"[,"key","value"] ...)
"key","value": {
['additional_masking_regex', JSON_ARRAY('regular_expression'[,'regular_expression', ...])]]
['window_size', 'N']
['window_stride', 'N']
['log_source_column', 'column']
['embedding_model', 'model']
['keyword_model', 'model']
}
}
}
Set the following parameters to train all machine learning models.
table_name: The name of the table that contains the labeled training dataset. The table name must be valid and fully qualified, so it must include the database name,database_name.table_name. The table cannot exceed 10 GB, 100 million rows, or 1017 columns.-
target_column_name: The name of the target column containing ground truth values.AutoML does not support a text target column.
If training an unsupervised Anomaly detection model (unlabeled data), set
target_column_nametoNULL.Forecasting does not require
target_column_name, and it can be set toNULL. -
model_handle: A user-defined session variable that stores the machine learning model handle for the duration of the connection. User variables are written as@. Any valid name for a user-defined variable is permitted. For example,var_name@my_model.If you set a value to the
model_handlevariable before callingML_TRAIN, that model handle is used for the model. A model handle must be unique in the model catalog. We recommend this method.If you don't set a value to the
model_handlevariable, AutoML generates one. WhenML_TRAINfinishes executing, retrieve the generated model handle by querying the session variable. See Model Handles to learn more.
The following optional parameters apply to more than one type
of machine learning task. They are specified as key-value
pairs in JSON format. If an option is not
specified, the default setting is used. If no options are
specified, you can specify NULL in place of
the JSON argument.
-
task: Specifies the machine learning task.classification: The default value if a task is not set. Use this task type to assign items to defined categories.regression: Use this task type if the target column is a continuous numerical value. This task generates predictions based on the relationship between a dependent variable and one or more independent variables.forecasting: Use this task type if you have a date-time column that requires a timeseries forecast. To use this task, you must set a target column, the date-time column (datetime_index), and endogenous variables (endogenous_variables).anomaly_detection: Use this task type to detect unusual patterns in data.log_anomaly_detection: Use this task to detect unusual patterns in log data.recommendation: Use this task type for generate recommendations for users and items.topic_modeling: Use this task to cluster word groups and similar expressions that best characterize the documents.
-
model_list: The type of model to be trained. If more than one model is specified, the best model type is selected from the list. See Model Types.This option cannot be used together with the
exclude_model_listoption. -
exclude_model_list: Model types that should not be trained. Specified model types are excluded from consideration during model selection. See Model Types.This option cannot be specified together with the
model_listoption. -
optimization_metric: The scoring metric to optimize for when training a machine learning model. The metric must be compatible with thetasktype and the target data. See Section 8.1.15, “Optimization and Scoring Metrics”.This is not supported for
anomaly_detectiontasks. Instead, metrics for anomaly detection can only be used with theML_SCOREroutine. -
include_column_list:ML_TRAINmust include this list of columns.For
classification,regression,anomaly_detectionandrecommendationtasks,include_column_listensures thatML_TRAINwill not drop these columns.For
forecastingtasks,include_column_listcan only includeexogenous_variables. Ifinclude_column_listis included in theML_TRAINoptions for aforecastingtask with at least oneexogenous_variables, this forcesML_TRAINto only consider those models that supportexogenous_variables.All columns in
include_column_listmust be included in the training table. -
exclude_column_list: Feature columns of the training dataset to exclude from consideration when training a model. Columns that are excluded usingexclude_column_listdo not also need to be excluded from the dataset used for predictions.The
exclude_column_listcannot contain any columns provided inendogenous_variables,exogenous_variables, andinclude_column_list. notes: Add notes to themodel_metadatafor your own reference.
Refer to the following model-specific parameters to train different types of machine learning models.
To train a classification model, set the
task to classification.
If the task is set to
NULL, or if all training options is set to
NULL, a classification model is trained by
default.
-
The following example sets the model handle before training, which is good practice. See Defining Model Handle. The
taskis set toclassification.mysql> SET @census_model = 'census_manual'; mysql> CALL sys.ML_TRAIN('census_data.census_train', 'revenue', JSON_OBJECT('task', 'classification'), @census_model); -
The following example sets all options to
NULL, soML_TRAINruns theclassificationtask option by default.mysql> CALL sys.ML_TRAIN('census_data.census_train', 'revenue', NULL, @census_model);
-
The following example specifies the
regressiontask type.mysql> CALL sys.ML_TRAIN('nyc_taxi.nyc_taxi_train', 'tip_amount', JSON_OBJECT('task', 'regression'), @nyc_taxi);
See the following to learn more about forecasting models:
To train a forecasting model, set the task
to forecasting and set the following
required parameters.
-
datetime_index: The column name for a datetime column that acts as an index for the forecast variable. The column can be one of the supported datetime column types,DATETIME,TIMESTAMP,DATE,TIME, andYEAR, or an auto-incrementing index.The forecast models
SARIMAXForecaster,VARMAXForecaster, andDynFactorForecastercannot back test, that is forecast into training data, when usingexogenous_variables. Therefore, the predict table must not overlap thedatetime_indexwith the training table. The start date in the predict table must be a date immediately following the last date in the training table whenexogenous_variablesare used. For example, the predict table has to start with year 2024 if the training table withYEARdata typedatetime_indexends with year 2023.The
datetime_indexfor the predict table must not have missing dates after the last date in the training table. For example, the predict table has to start with year 2024 if the training table withYEARdata typedatetime_indexends with year 2023. The predict table cannot start with year, for example, 2025 or 2030, because that would miss out 1 and 6 years, respectively.When
optionsdo not includeexogenous_variables, the predict table can overlap thedatetime_indexwith the training table. This supports back testing.The valid range of years for
datetime_indexdates must be between 1678 and 2261. It will cause an error if any part of the training table or predict table has dates outside this range. The last date in the training table plus the predict table length must still be inside the valid year range. For example, if thedatetime_indexin the training table hasYEARdata type, and the last date is year 2023, the predict table length must be less than 238 rows: 2261 minus 2023 equals 238 rows. -
endogenous_variables: The column or columns to be forecast.Univariate forecasting models support a single numeric column, specified as a
JSON_ARRAY. This column must also be specified as thetarget_column_name, because that field is required, but it is not used in that location.Multivariate forecasting models support multiple numeric columns, specified as a
JSON_ARRAY. One of these columns must also be specified as thetarget_column_name.endogenous_variablescannot be text.
Set the following forecasting options as required to train forecasting models.
-
exogenous_variables: For forecasting tasks, the column or columns of independent, non-forecast, predictive variables, specified as aJSON_ARRAY. These optional variables are not forecast, but help to predict the future values of the forecast variables. These variables affect a model without being affected by it. For example, for sales forecasting these variables might be advertising expenditure, occurrence of promotional events, weather, or holidays.ML_TRAINwill consider all supported models during the algorithm selection stage ifoptionsincludesexogenous_variables, including models that do not supportexogenous_variables.For example, if
optionsincludes univariateendogenous_variableswithexogenous_variables, thenML_TRAINwill considerNaiveForecaster,ThetaForecaster,ExpSmoothForecaster,ETSForecaster,STLwESForecaster,STLwARIMAForecaster, andSARIMAXForecaster.ML_TRAINwill ignoreexogenous_variablesif the model does not support them.Similarly, if
optionsincludes multivariateendogenous_variableswithexogenous_variables, thenML_TRAINwill considerVARMAXForecasterandDynFactorForecaster.If
optionsalso includesinclude_column_list, this will forceML_TRAINto only consider those models that supportexogenous_variables. include_column_list: Can only includeexogenous_variables. Ifinclude_column_listcontains at least oneexogenous_variables, this will forceML_TRAINto only consider those models that supportexogenous_variables.
-
The following example specifies the
forecastingtask type, and the additional required parameters,datetime_indexandendogenous_variables.mysql> CALL sys.ML_TRAIN('ml_data.opsd_germany_daily_train', 'consumption', JSON_OBJECT('task', 'forecasting', 'datetime_index', 'ddate', 'endogenous_variables', JSON_ARRAY('consumption')), @forecast_model); -
The following example specifies the
OrbitForecasterforecasting model with exogenous variables.mysql> CALL sys.ML_TRAIN('mlcorpus.opsd_germany_daily_train', NULL, JSON_OBJECT('task', 'forecasting', 'datetime_index', 'ddate', 'endogenous_variables', JSON_ARRAY('consumption'), 'exogenous_variables', JSON_ARRAY('wind', 'solar', 'wind_solar'), 'model_list', JSON_ARRAY('OrbitForecaster')), @model); -
The following example specifies the
OrbitForecasterforecasting model without exogenous variables.mysql> CALL sys.ML_TRAIN('mlcorpus.`datetime_train`', 'C1', JSON_OBJECT('task', 'forecasting', 'datetime_index', 'C0', 'endogenous_variables', JSON_ARRAY('C1'), 'model_list', JSON_ARRAY('OrbitForecaster')), @datetime_model);
See the following to learn more about anomaly detection models:
To train an anomaly detection model, set the appropriate required parameters depending on the type of anomaly detection model to train.
Set the
taskparameter toanomaly_detectionfor running anomaly detection on table data, orlog_anomaly_detectionfor running anomaly detection on log data.If running an unsupervised model, the
target_column_nameparameter must be set toNULL.-
If running a semi-supervised model:
The
target_column_nameparameter must specify a column whose only allowed values are 0 (normal), 1 (anomalous), and NULL (unlabeled). All rows will be used to train the unsupervised component, while the rows with a value different than NULL will be used to train the supervised component.The
experimentaloption must be set tosemisupervised.
-
If running anomaly detection on log data, the input table can only have the following columns:
The column containing the logs.
If including logs from different sources, a column containing the source of each log. Identify this column with the
log_source_columnoption.If including labeled data, a column identifying the labeled log lines. See Semi-supervised Anomaly Detection to learn more.
At least one column must act as the primary key to establish the temporal order of logs. If the primary key column (or columns) is not one of the previous required columns (log data, source of log, or label), then you must use the
exclude_column_listoption when runningML_TRAINto exclude all primary key columns that don't include required data. See Syntax Examples for Anomaly Detection Training to review relevant examples.If the input table has additional columns to the ones permitted, you must use the
exclude_column_listoption to exclude irrelevant columns.
Set the following options as needed for anomaly detection models:
-
contamination: Represents an estimate of the percentage of outliers in the training table.The contamination factor is calculated as: estimated number of rows with anomalies/total number of rows in the training table.
The contamination value must be greater than 0 and less than 0.5. The default value is 0.01.
model_list: You can select the Principal Component Analysis (PCA), Generalized Local Outlier Factor (GLOF), or Generalized kth Nearest Neighbors (GkNN) model. If no option is specified, the default model is GkNN. Selecting more than one model or an unsupported model produces an error.
To train a semi-supervised anomaly detection model, set the following options:
-
supervised_submodel_options: Allows you to set optional override parameters for the supervised model component. The only model supported isDistanceWeightedKNNClassifier. The following parameters are supported:n_neighbors: Sets the desired k value that checks the k closest neighbors for each unclassified point. The default value is 5 and the value must be an integer greater than 0.min_labels: Sets the minimum number of labeled data points required to train the supervised component. If fewer labeled data points are provided during training of the model,ML_TRAINfails. The default value is 20 and the value must be an integer greater than 0.
ensemble_score: This option specifies the metric to use to score the ensemble of unsupervised and supervised components. It identifies the optimal weight between the two components based on the metric. The supported metrics areaccuracy,precision,recall, andf1. The default metric isf1.
To train a model for anomaly detection on log data, set the following options:
-
logad_options: AJSON_OBJECTthat allows you to configure the following options.-
additional_masking_regex: Allows you to mask log data in aJSON_ARRAY. By default, the following parameters are automatically masked during training.IP
DATETIME
TIME
HEX
IPPORT
OCID
window_size: Specifies the maximum number of log lines to be grouped for anomaly detection. The default value is 10.window_stride: Specifies the stride value to use for segmenting log lines. For example, there is log A, B, C, D, and E. Thewindow_sizeis 3, and thewindow_strideis 2. The first row has log A, B, and C. The second row has log C, D, and E. If this value is equal towindow_size, there is no overlapping of log segments. The default value is 3.log_source_column: Specifies the column name that contains the source identifier of the respective log lines. Log lines are grouped according to their respective source (for example, logs from multiple MySQL databases that are in the same table). By default, all log lines are assumed to be from the same source.embedding_model: The embedding model used to extract semantic features from log data. To review supported embedding models in MySQL AI, run the following query:SELECT sys.ML_LIST_LLMS();and see models that havecapabilitieswithTEXT_EMBEDDINGS. The default value ismultilingual-e5-small. Using an embedding model causes higher memory usage. If you set this toNULL, then you cannot also setkeyword_modeltoNULL.keyword_model: The keyword feature extractor used to extract keyword features from log data. The available options aretf-idfandNULL. The default value istf-idf. If you set this toNULL, then you cannot also setembedding_modeltoNULL.
-
Anomaly detection models don't support the following options during training:
exclude_model_listoptimization_metric
-
The following example specifies the
anomaly_detectiontask type.mysql> CALL sys.ML_TRAIN('mlcorpus_anomaly_detection.volcanoes-b3_anomaly_train', NULL, JSON_OBJECT('task', 'anomaly_detection', 'exclude_column_list', JSON_ARRAY('target')), @anomaly); Query OK, 0 rows affected (46.59 sec) -
The following example specifies the
anomaly_detectiontask with acontaminationoption. Query the model catalog metadata to check the value of thecontaminationoption.mysql> CALL sys.ML_TRAIN('mlcorpus_anomaly_detection.volcanoes-b3_anomaly_train', NULL, JSON_OBJECT('task', 'anomaly_detection', 'contamination', 0.013, 'exclude_column_list', JSON_ARRAY('target')), @anomaly_with_contamination); Query OK, 0 rows affected (50.22 sec) mysql> SELECT JSON_EXTRACT(model_metadata, '$.contamination') FROM ML_SCHEMA_root.MODEL_CATALOG WHERE model_handle = @anomaly_with_contamination; +-------------------------------------------------+ | JSON_EXTRACT(model_metadata, '$.contamination') | +-------------------------------------------------+ | 0.013000000268220901 | +-------------------------------------------------+ 1 row in set (0.00 sec) -
The following example enables semi-supervised learning using all defaults. The
target_column_nameis set totarget. Theexperimentaloption is set tosemisupervised.mysql> CALL sys.ML_TRAIN('mlcorpus.anomaly_train_with_partial_target', "target", CAST('{"task": "anomaly_detection", "experimental": {"semisupervised": {}}}' as JSON), @semisupervised_model); -
The following example enables semi-supervised learning with additional options.
mysql> CALL sys.ML_TRAIN('mlcorpus.`anomaly_train_with_partial_target`', "target", CAST('{"task": "anomaly_detection", "experimental": {"semisupervised": {"supervised_submodel_options": {""min_labels": 10, "n_neighbors": 3}, "ensemble_score": "recall"}}}' as JSON), @semisupervised_model_options);Where:
The
supervised_submodel_optionsparametermin_labelsis set to 10.The
supervised_submodel_optionsparametern_neighborsis set to 3.The
ensemble_scoreoption is set to therecallmetric.
-
The following example selects the PCA (Principal Component Analysis) anomaly detection model.
mysql> CALL sys.ML_TRAIN('mlcorpus_anomaly_detection_v1.`volcanoes-b3_anomaly_train`', NULL, JSON_OBJECT('task', 'anomaly_detection', 'exclude_column_list', JSON_ARRAY('target'), 'model_list', JSON_ARRAY('PCA')), @anomaly_pca); -
The following example selects the GLOF (Generalized Local Outlier Factor) anomaly detection model.
mysql> CALL sys.ML_TRAIN('mlcorpus_anomaly_detection_v1.`volcanoes-b3_anomaly_train`', NULL, JSON_OBJECT('task', 'anomaly_detection', 'exclude_column_list', JSON_ARRAY('target'), 'model_list', JSON_ARRAY('GLOF')), @anomaly_glof); -
The following example does not specify an algorithm model for the
model_listoption. If no model is specified, the default model GkNN is used.mysql> CALL sys.ML_TRAIN('mlcorpus_anomaly_detection_v1.`volcanoes-b3_anomaly_train`', NULL, JSON_OBJECT('task', 'anomaly_detection', 'exclude_column_list', JSON_ARRAY('target'), 'model_list', JSON_ARRAY()), @anomaly_empty_list); -
The following example runs the
log_anomaly_detectiontask with available default values.mysql> CALL sys.ML_TRAIN('mlcorpus.`log_anomaly_just_patterns`', NULL, JSON_OBJECT('task', 'log_anomaly_detection'), @logad_model); -
The following example runs the
log_anomaly_detectiontask with the PCA anomaly detection model.mysql> CALL sys.ML_TRAIN('mlcorpus.`log_anomaly_just_patterns`', NULL, JSON_OBJECT('task', 'log_anomaly_detection', 'model_list', JSON_ARRAY('PCA')), @logad_model); -
An
ML_TRAINexample that excludes two primary key columns:primary_key_column1andprimary_key_column2. These columns must be excluded because they do not have one of the required items of data for training: the log data, the source of the log, or the label.mysql>CALL sys.ML_TRAIN('mlcorpus.log_anomaly_two_primary', NULL, JSON_OBJECT('task', 'log_anomaly_detection', 'logad_options', JSON_OBJECT('window_size', 2, 'window_stride', 1), 'exclude_column_list', JSON_ARRAY('primary_key_column1', 'primary_key_column2')), @log_anomaly_us ); -
The following example runs the
log_anomaly_detectiontask and masks log data with theadditional_masking_regexoption. In addition to the default parameters that are automatically masked, email addresses from Yahoo, Hotmail, and Gmail are also masked. Thelog_source_columnoption is also included, which specifies the column that identifies the respective source of the log line.mysql> CALL sys.ML_TRAIN('mlcorpus.`log_anomaly_sourced`', NULL, JSON_OBJECT('task', 'log_anomaly_detection', 'logad_options', JSON_OBJECT('additional_masking_regex', JSON_ARRAY('(\W|^)[\w.\-]{0,25}@(yahoo|hotmail|gmail)\.com(\W|$)'), 'log_source_column', 'source')), @log_anomaly_us); -
The following example sets semi-supervised learning for training log data for anomaly detection. The window size is also set to a value of 4, and the window stride is set to 1.
mysql> CALL sys.ML_TRAIN('mlcorpus.`log_anomaly_semi`', "label", JSON_OBJECT('task', 'log_anomaly_detection', 'logad_options', JSON_OBJECT('window_size', 4, 'window_stride', 1), "experimental", JSON_OBJECT("semisupervised", JSON_OBJECT("supervised_submodel_options", JSON_OBJECT("min_labels", 10)))), @log_anomaly_us); -
The following example sets unsupervised learning for training log data for anomaly detection. A query reviews supported embedding models. The
all_minilm_l12_v2embedding model andtf-idfkeyword model are selected for training.mysql> SELECT sys.ML_LIST_LLMS(); +---------------------------------------------------------------------------------------------------------------------------------------------------------------+ | sys.ML_LIST_LLMS() | +---------------------------------------------------------------------------------------------------------------------------------------------------------------+ | [{"model_id": "llama3.1-8b-instruct-v1", "provider": "HeatWave", "capabilities": ["GENERATION"], "default_model": false, "availability_date": "2025-05-20"}, | | {"model_id": "llama3.2-1b-instruct-v1", "provider": "HeatWave", "capabilities": ["GENERATION"], "default_model": false, "availability_date": "2025-05-20"}, | | {"model_id": "llama3.2-3b-instruct-v1", "provider": "HeatWave", "capabilities": ["GENERATION"], "default_model": true, "availability_date": "2025-05-20"}, | | {"model_id": "mistral-7b-instruct-v3", "provider": "HeatWave", "capabilities": ["GENERATION"], "default_model": false, "availability_date": "2025-05-20"}, | | {"model_id": "all_minilm_l12_v2", "provider": "HeatWave", "capabilities": ["TEXT_EMBEDDINGS"], "default_model": false, "availability_date": "2024-07-01"}, | | {"model_id": "multilingual-e5-small", "provider": "HeatWave", "capabilities": ["TEXT_EMBEDDINGS"], "default_model": true, "availability_date": "2024-07-24"}] | +---------------------------------------------------------------------------------------------------------------------------------------------------------------+ mysql> SET @model='log_embedding_model'; Query OK, 0 rows affected (0.0490 sec) mysql> CALL sys.ML_TRAIN('anomaly_log_embedding.training_data', NULL, JSON_OBJECT('task', 'log_anomaly_detection', 'exclude_column_list', JSON_ARRAY('log_id', 'timestamp', 'target'), 'logad_options', JSON_OBJECT('embedding_model', 'all_minilm_l12_v2', 'keyword_model', 'tf-idf')), @model); Query OK, 0 rows affected (27.0830 sec)
See Recommendation Task Types to learn more about recommendation models.
To train a recommendation model, set the
task to recommendation
and set the following required parameters.
-
users: Specifies the column name corresponding to the user ids. Values in this column must be in aSTRINGdata type, otherwise an error will be generated during training.This must be a valid column name, and it must be different from the
itemscolumn name. -
items: Specifies the column name corresponding to the item ids. Values in this column must be in aSTRINGdata type, otherwise an error will be generated during training.This must be a valid column name, and it must be different from the
userscolumn name.
To train a recommendation model with explicit feedback, set
feedback to explicit. If
feedback is not set, the default value is
explicit.
To train a recommendation model with implicit feedback, set
feedback to implicit and
set the following option as needed:
feedback_threshold: The feedback threshold for a recommendation model that uses implicit feedback. It represents the threshold required to be considered positive feedback. For example, if numerical data records the number of times users interact with an item, you might set a threshold with a value of 3. This means users would need to interact with an item more than three times to be considered positive feedback.
To train a content-based recommendation model, set
feedback to implicit and
set the following required parameters:
item_metadata: Defines the table that has item descriptions. It is a JSON object that has thetable_nameoption as a key, which specifies the table that has item descriptions. One column must be the same as theitem_idin the input table.-
user_metadata: Defines the table that has user descriptions. It is a JSON object that has thetable_nameoption as a key, which specifies the table that has user descriptions. One column must be the same as theuser_idin the input table.table_name: To be used with theitem_metadataanduser_metadataoptions. It specifies the table name that has item or user descriptions. It must be a string in a fully qualified format (database_name.table_name) that specifies the table name.
-
The following example specifies the
SVDrecommendation model type. The default model isTwoTower.mysql> SET @rec_model = 'rec_model'; mysql> CALL sys.ML_TRAIN('movielens_data.movielens_train', 'rating', JSON_OBJECT('task', 'recommendation', 'users', 'user_id', 'items', 'item_id'), @rec_model); Query OK, 0 rows affected (14.4091 sec) mysql> SELECT model_handle, model_type FROM ML_SCHEMA_admin.MODEL_CATALOG WHERE model_handle='rec_model'; +--------------+------------+ | model_handle | model_type | +--------------+------------+ | rec_model | TwoTower | +--------------+------------+ 1 row in set (0.0395 sec) -
The following example specifies the
SVDpprecommendation model type.mysql> CALL sys.ML_TRAIN('mlcorpus.foursquare_NYC_train', 'rating', JSON_OBJECT('task', 'recommendation', 'users', 'user_id', 'items', 'item_id', 'model_list', JSON_ARRAY('SVDpp')), @model); Query OK, 0 rows affected (13.97 sec) mysql> SELECT model_type FROM ML_SCHEMA_root.MODEL_CATALOG WHERE model_handle=@model; +------------+ | model_type | +------------+ | SVDpp | +------------+ 1 row in set (0.00 sec) -
The following example specifies the
NMFrecommendation model type.mysql> CALL sys.ML_TRAIN('mlcorpus.foursquare_NYC_train', 'rating', JSON_OBJECT('task', 'recommendation', 'users', 'user_id', 'items', 'item_id', 'model_list', JSON_ARRAY('NMF')), @model); Query OK, 0 rows affected (12.28 sec) mysql> SELECT model_type FROM ML_SCHEMA_root.MODEL_CATALOG WHERE model_handle=@model; +------------+ | model_type | +------------+ | NMF | +------------+ 1 row in set (0.00 sec) -
The following example specifies three models for the
model_listoption. From those three recommendation models, theSVDmodel is automatically selected for training.mysql> SET @allowed_models = JSON_ARRAY('SVD', 'SVDpp', 'NMF'); mysql> CALL sys.ML_TRAIN('mlcorpus.foursquare_NYC_train', 'rating', JSON_OBJECT('task', 'recommendation', 'users', 'user_id', 'items', 'item_id', 'model_list', CAST(@allowed_models AS JSON)), @model); Query OK, 0 rows affected (14.88 sec) mysql> SELECT model_type FROM ML_SCHEMA_root.MODEL_CATALOG WHERE model_handle=@model; +------------+ | model_type | +------------+ | SVD | +------------+ 1 row in set (0.00 sec) -
The following example specifies five models for the
exclude_model_listoption. TheSVDpprecommendation model is automatically selected from the remaining available models.mysql> SET @exclude_models= JSON_ARRAY('NormalPredictor', 'Baseline', 'SlopeOne', 'CoClustering', 'SVD'); mysql> CALL sys.ML_TRAIN('mlcorpus.foursquare_NYC_train', 'rating', JSON_OBJECT('task', 'recommendation', 'users', 'user_id', 'items', 'item_id', 'exclude_model_list', CAST(@exclude_models AS JSON)), @model); Query OK, 0 rows affected (14.71 sec) mysql> SELECT model_type FROM ML_SCHEMA_root.MODEL_CATALOG WHERE model_handle=@model; +------------+ | model_type | +------------+ | SVDpp | +------------+ 1 row in set (0.00 sec) -
The following example specifies the
recommendationtask with implicit feedback.mysql> CALL sys.ML_TRAIN('mlcorpus.training_table', 'rating', JSON_OBJECT('task', 'recommendation', 'users', 'user_id', 'items', 'item_id', 'feedback', 'implicit'), @model); Query OK, 0 rows affected (2 min 13.6415 sec) -
The following example trains a content-based recommendation model by specifying a table with item descriptions (
mlcorpus_recsys.`citeulike_items_sample). The optimization metrichit_ratio_at_kis used. The model must use implicit feedback.mysql> CALL sys.ML_TRAIN('mlcorpus_recsys.`citeulike_train_sample`', 'rating', JSON_OBJECT('task', 'recommendation', 'model_list', JSON_ARRAY('CTR'), 'users', 'user_id', 'items', 'item_id', 'feedback', 'implicit', 'optimization_metric', 'hit_ratio_at_k', 'item_metadata', JSON_OBJECT('table_name', 'mlcorpus_recsys.`citeulike_items_sample`')), @model);
To train a machine learning model with topic modeling, set the
task to topic_modeling
and set the following required parameter:
document_column: Specify the column name that contains the text to train.
The following parameters are not supported for training machine learning models with topic modeling:
model_listoptimization_metricexclude_model_listexclude_column_listinclude_column_list
The following example runs the
topic_modeling task with the required
defined parameters.
mysql> CALL sys.ML_TRAIN('topic_modeling_data.text_types_train', NULL,
JSON_OBJECT('task', 'topic_modeling', 'document_column', 'D0'), @topic_model);
The
ML_TRAIN
routine also runs the
ML_EXPLAIN
routine with the default Permutation Importance model for
prediction explainers and model explainers. See
Generate Model
Explanations. To train other prediction explainers and
model explainers use the
ML_EXPLAIN
routine with the preferred explainer after
ML_TRAIN.
ML_EXPLAIN
does not support the anomaly_detection and
recommendation tasks, and
ML_TRAIN
does not run
ML_EXPLAIN.
-
The
model_listoption permits specifying the type of model to be trained. If more than one type of model specified, the best model type is selected from the list. For a list of supported model types, see Model Types. This option cannot be used together with theexclude_model_listoption.The following example trains either an
XGBClassifierorLGBMClassifiermodel.mysql> CALL sys.ML_TRAIN('ml_data.iris_train', 'class', JSON_OBJECT('task','classification', 'model_list', JSON_ARRAY('XGBClassifier', 'LGBMClassifier')), @iris_model); -
The
exclude_model_listoption specifies types of models that should not be trained. Specified model types are excluded from consideration. For a list of model types you can specify, see Model Types. This option cannot be used together with themodel_listoption.The following example excludes the
LogisticRegressionandGaussianNBmodels.mysql> CALL sys.ML_TRAIN('ml_data.iris_train', 'class', JSON_OBJECT('task','classification', 'exclude_model_list', JSON_ARRAY('LogisticRegression', 'GaussianNB')), @iris_model); -
The
optimization_metricoption specifies a scoring metric to optimize for. See: Optimization and Scoring Metrics.The following example optimizes for the
neg_log_lossmetric.mysql> CALL sys.ML_TRAIN('automl_bench.census_train', 'revenue', JSON_OBJECT('task','classification', 'optimization_metric', 'neg_log_loss'), @census_model); -
The
exclude_column_listoption specifies feature columns to exclude from consideration when training a model.The following example excludes the
'age'column from consideration when training a model for thecensusdataset.mysql> CALL sys.ML_TRAIN('automl_bench.census_train', 'revenue', JSON_OBJECT('task','classification', 'exclude_column_list', JSON_ARRAY('age')), @census_model); -
The
include_column_listoption specifies feature columns that must be considered for training and should not be dropped.The following example specifies to consider the
'job'column when training a model for thecensusdataset.mysql> CALL sys.ML_TRAIN('automl_bench.census_train', 'revenue', JSON_OBJECT('task','classification', 'include_column_list', JSON_ARRAY('job')), @census_model); -
The following example adds
notesto themodel_metadata.mysql> CALL sys.ML_TRAIN('ml_data.iris_train', 'class', JSON_OBJECT('task', 'classification', 'notes', 'classification model'), @model); Query OK, 0 rows affected (1 min 42.53 sec) mysql> SELECT model_metadata FROM ML_SCHEMA_user1.MODEL_CATALOG WHERE model_handle=@model; +-----------------------------------------------------------+ | JSON_PRETTY(model_metadata) | +-----------------------------------------------------------+ | { "task": "classification", "notes": "classification model", "chunks": 1, "format": "HWMLv2.0", "n_rows": 120, "status": "Ready", "options": { "task": "classification", "notes": "classification model", "model_explainer": "permutation_importance", "prediction_explainer": "permutation_importance" }, "n_columns": 4, "column_names": [ "sepal length", "sepal width", "petal length", "petal width" ], "contamination": null, "model_quality": "high", "training_time": 15.591492652893066, "algorithm_name": "SVC", "training_score": -0.03133905306458473, "build_timestamp": 1751897493, "hyperparameters": { "C": 47.004275502593885, "gamma": 0.000030517578125, "cache_size": 800, "class_weight": "balanced" }, "n_selected_rows": 96, "training_params": { "recommend": "ratings", "force_use_X": false, "recommend_k": 3, "remove_seen": true, "ranking_topk": 10, "lsa_components": 100, "ranking_threshold": 1, "feedback_threshold": 1 }, "train_table_name": "ml_data.iris_train", "model_explanation": { "permutation_importance": { "petal width": 0.4194, "sepal width": 0.0, "petal length": 0.4192, "sepal length": 0.0415 } }, "n_selected_columns": 3, "target_column_name": "class", "optimization_metric": "neg_log_loss", "selected_column_names": [ "petal length", "petal width", "sepal length" ], "training_drift_metric": { "mean": 0.0749, "variance": 0.0083 } } | +-----------------------------------------------------------+ 1 row in set (0.0416 sec)