Documentation Home
MySQL HeatWave User Guide
Related Documentation Download this Manual
PDF (US Ltr) - 1.0Mb
PDF (A4) - 1.0Mb


3.8.1 ML_TRAIN

Running the ML_TRAIN routine on a labeled training dataset produces a trained machine learning model.

ML_TRAIN Syntax

CALL sys.ML_TRAIN ('table_name', 'target_column_name', [options], model_handle_variable);
 
options: {
    JSON_OBJECT('key','value'[,'key','value'] ...)
        'key','value':
        |'task', {'classification'|'regression'}|NULL
        |'model_list', JSON_ARRAY('model'[,'model'] ...)
        |'exclude_model_list', JSON_ARRAY('model'[,'model'] ...)
        |'optimization_metric', 'metric'
        |'exclude_column_list', JSON_ARRAY('column'[,'column'] ...)
}
Note

The MySQL account that runs ML_TRAIN cannot have a period character (".") in its name; for example, a user named 'joesmith'@'%' is permitted to train a model, but a user named 'joe.smith'@'%' is not. For more information about this limitation, see Section 3.12, “Limitations”.

ML_TRAIN parameters:

  • table_name: The name of the table that contains the labeled training dataset. The table name must be valid and fully qualified; that is, it must include the schema name (schema_name.table_name). The table cannot exceed 10 GB, 100 million rows, or 900 columns.

  • target_column_name: The name of the target column containing ground truth values.

  • model_handle_variable: The name of a user-defined session variable that stores the machine learning model handle for the duration of the connection. User variables are written as @var_name. Some of the examples in this guide use @census_model as the variable name. Any valid name for a user-defined variable is permitted (e.g., @my_model).

    When ML_TRAIN finishes executing, you can retrieve the generated model handle by querying the session variable. See Section 3.7.7, “Model Handles”.

  • options: Optional parameters specified as key-value pairs in JSON_OBJECT() format. If an option is not specified, the default setting is used. If no options are specified, you can specify NULL in place of the JSON_OBJECT() argument.

    • task: Specifies the machine learning task. Permitted values are:

      • classification: The default. Use this task type if the target is a discrete value.

      • regression: Use this task type if the target column is a continuous numerical value.

      • model_list: The type of model to be trained. If more than one model is specified, the best model type is selected from the list. For a list of supported model types, see Model Types.

        This option cannot be used together with the exclude_model_list option.

      • exclude_model_list: Model types that should not be trained. Specified model types are excluded from consideration during model selection. For a list of models types that you can specify, see Model Types.

        This option cannot be specified together with the model_list option.

      • optimization_metric: The scoring metric to optimize for when training a machine learning model. The metric must be compatible with the task type (classification or regression) and the target data. For a list of supported metrics, see Scoring Metrics.

      • exclude_column_list: Feature columns of the training dataset to exclude from consideration when training a model. Columns that are excluded using exclude_column_list do not need to be excluded from the dataset used for predictions.

Model Types

Classification and regression training models supported by HeatWave Machine Learning are listed below. When training a model, you can use the ML_TRAIN model_list and exclude_model_list options to specify training models that should be considered or excluded.

Scoring Metrics

Supported scoring metrics are listed below. When training a model, you can optimize for a specific scoring metric using the optimization_metric option.

The scoring metric must be compatible with the task type (classification or regression) and the target data. For example, if you are working with a classification dataset with a binary target, you can only specify classification scoring metric for a binary target, such as f1.

Syntax Examples

  • An ML_TRAIN example that uses the classification task option implicitly (classification is the default if not specified explicitly):

    CALL sys.ML_TRAIN('ml_data.iris_train', 'class', NULL, @iris_model);
  • An ML_TRAIN example that specifies the classification task type explicitly:

    CALL sys.ML_TRAIN('ml_data.iris_train', 'class', JSON_OBJECT('task', 'classification'), 
    @iris_model);
  • An ML_TRAIN example that specifies the regression task type:

    CALL sys.ML_TRAIN('employee.salary_train', 'salary', JSON_OBJECT('task', 'regression'), 
    @salary_model);
  • An ML_TRAIN example that specifies the model_list option. This example trains either an XGBClassifier or LGBMClassifier model.

    CALL sys.ml_train('ml_data.iris_train', 'class', JSON_OBJECT('task','classification', 
    'model_list', JSON_ARRAY('XGBClassifier', 'LGBMClassifier')) , @iris_model);
  • An ML_TRAIN example that specifies the exclude_model_list option. In this example, LogisticRegression and GaussianNB models are excluded from model selection.

    CALL sys.ml_train('ml_data.iris_train', 'class', JSON_OBJECT('task','classification', 
    'exclude_model_list', JSON_ARRAY('LogisticRegression', 'GaussianNB')) , @iris_model);
  • An ML_TRAIN example that specifies the optimization_metric option.

    CALL sys.ml_train('ml_data.iris_train', 'class', JSON_OBJECT('task','classification', 
    'optimization_metric', 'neg_log_loss') , @iris_model);
  • An ML_TRAIN example that specifies the exclude_column_list option.

    CALL sys.ml_train('ml_data.iris_train', 'class', JSON_OBJECT('task','classification', 
    'exclude_column_list', JSON_ARRAY('sepal length', 'petal length')) , @iris_model);