3.10.1 ML_TRAIN

Running the ML_TRAIN routine on a labeled training dataset produces a trained machine learning model.

ML_TRAIN Syntax

CALL sys.ML_TRAIN ('table_name', 'target_column_name', [options], model_handle);
 
options: {
    JSON_OBJECT('key','value'[,'key','value'] ...)
        'key','value':
        |'task', {'classification'|'regression'|'forecasting'}|NULL
        |'datetime_index', 'column'
        |'endogenous_variables', JSON_ARRAY('column')
        |'model_list', JSON_ARRAY('model'[,'model'] ...)
        |'exclude_model_list', JSON_ARRAY('model'[,'model'] ...)
        |'optimization_metric', 'metric'
        |'exclude_column_list', JSON_ARRAY('column'[,'column'] ...)
}
Note

The MySQL account that runs ML_TRAIN cannot have a period character (".") in its name; for example, a user named 'joesmith'@'%' is permitted to train a model, but a user named 'joe.smith'@'%' is not. For more information about this limitation, see Section 3.13, “Limitations”.

From MySQL HeatWave 8.0.31, after you use the ML_TRAIN routine, use the ML_EXPLAIN routine to train prediction explainers and model explainers for HeatWave ML. You must train prediction explainers in order to use ML_EXPLAIN_ROW and ML_EXPLAIN_TABLE. In earlier releases, the ML_TRAIN routine trains the default Permutation Importance model and prediction explainers. See Section 3.5, “Training Explainers”.

ML_TRAIN parameters:

  • table_name: The name of the table that contains the labeled training dataset. The table name must be valid and fully qualified; that is, it must include the schema name (schema_name.table_name). The table cannot exceed 10 GB, 100 million rows, or 900 columns.

  • target_column_name: The name of the target column containing ground truth values.

  • model_handle: The name of a user-defined session variable that stores the machine learning model handle for the duration of the connection. User variables are written as @var_name. Some of the examples in this guide use @census_model as the variable name. Any valid name for a user-defined variable is permitted (e.g., @my_model).

    From MySQL 8.0.31, if you set the model_handle variable to a value before calling ML_TRAIN, that model handle is used for the model. A model handle that you set must be unique in the model catalog.

    Otherwise, HeatWave ML generates a model handle. When ML_TRAIN finishes executing, you can retrieve the generated model handle by querying the session variable. See Section 3.9.8, “Model Handles”.

  • options: Optional parameters specified as key-value pairs in JSON_OBJECT() format. If an option is not specified, the default setting is used. If no options are specified, you can specify NULL in place of the JSON_OBJECT() argument.

    • task: Specifies the machine learning task. Permitted values are:

      • classification: The default. Use this task type if the target is a discrete value.

      • regression: Use this task type if the target column is a continuous numerical value.

      • forecasting: Use this task type if the target column is a date-time column that requires a timeseries forecast. The datetime_index and endogenous_variables parameters are required when you specify this task type.

    • datetime_index: For forecasting tasks, the column name for a datetime column that acts as an index for the forecast variable. The column can be one of the supported datetime column types (DATETIME, TIMESTAMP, DATE, TIME, and YEAR), or an auto-incrementing index.

    • endogenous_variables: For forecasting tasks, the column to be forecast. The forecasting function supports a single numeric column, specified as a JSON array. This column must also be specified in the target_column_name option of the ML_TRAIN routine, because that field is required, but it is not used in that location.

    • model_list: The type of model to be trained. If more than one model is specified, the best model type is selected from the list. For a list of supported model types, see Model Types.

      This option cannot be used together with the exclude_model_list option.

    • exclude_model_list: Model types that should not be trained. Specified model types are excluded from consideration during model selection. For a list of models types that you can specify, see Model Types.

      This option cannot be specified together with the model_list option.

    • optimization_metric: The scoring metric to optimize for when training a machine learning model. The metric must be compatible with the task type (classification, regression, or forecasting) and the target data. For a list of supported metrics, see Scoring Metrics.

    • exclude_column_list: Feature columns of the training dataset to exclude from consideration when training a model. Columns that are excluded using exclude_column_list do not need to be excluded from the dataset used for predictions.

Model Types

Classification, regression, and forecasting training models supported by HeatWave Machine Learning are listed below. When training a model, you can use the ML_TRAIN model_list and exclude_model_list options to specify training models that should be considered or excluded.

Scoring Metrics

Supported scoring metrics are listed below. When training a model, you can optimize for a specific scoring metric using the optimization_metric option.

The scoring metric must be compatible with the task type (classification, regression, or forecasting) and the target data. For example, if you are working with a classification dataset with a binary target, you can only specify classification scoring metric for a binary target, such as f1.

Syntax Examples

  • An ML_TRAIN example that uses the classification task option implicitly (classification is the default if not specified explicitly):

    CALL sys.ML_TRAIN('ml_data.iris_train', 'class', NULL, @iris_model);
  • An ML_TRAIN example that specifies the classification task type explicitly, and sets a model handle instead of letting HeatWave ML generate one:

    SET @iris_model = 'iris_manual';
    CALL sys.ML_TRAIN('ml_data.iris_train', 'class', JSON_OBJECT('task', 'classification'), 
    @iris_model);
  • An ML_TRAIN example that specifies the regression task type:

    CALL sys.ML_TRAIN('employee.salary_train', 'salary', JSON_OBJECT('task', 'regression'), 
    @salary_model);
  • An ML_TRAIN example that specifies the forecasting task type and the additional required parameters datetime_index and endogenous_variables:

    CALL sys.ML_TRAIN('ml_data.opsd_germany_daily_train', 'consumption', 
    JSON_OBJECT('task', 'forecasting', 'datetime_index', 'ddate', 
    'endogenous_variables', JSON_ARRAY('consumption')), @forecast_model);

  • An ML_TRAIN example that specifies the model_list option. This example trains either an XGBClassifier or LGBMClassifier model.

    CALL sys.ml_train('ml_data.iris_train', 'class', JSON_OBJECT('task','classification', 
    'model_list', JSON_ARRAY('XGBClassifier', 'LGBMClassifier')) , @iris_model);
  • An ML_TRAIN example that specifies the exclude_model_list option. In this example, LogisticRegression and GaussianNB models are excluded from model selection.

    CALL sys.ml_train('ml_data.iris_train', 'class', JSON_OBJECT('task','classification', 
    'exclude_model_list', JSON_ARRAY('LogisticRegression', 'GaussianNB')) , @iris_model);
  • An ML_TRAIN example that specifies the optimization_metric option.

    CALL sys.ml_train('ml_data.iris_train', 'class', JSON_OBJECT('task','classification', 
    'optimization_metric', 'neg_log_loss') , @iris_model);
  • An ML_TRAIN example that specifies the exclude_column_list option.

    CALL sys.ml_train('ml_data.iris_train', 'class', JSON_OBJECT('task','classification', 
    'exclude_column_list', JSON_ARRAY('sepal length', 'petal length')) , @iris_model);