Documentation Home
HeatWave User Guide
Related Documentation Download this Manual
PDF (US Ltr) - 3.8Mb
PDF (A4) - 3.8Mb


HeatWave User Guide  /  ...  /  Prepare Training and Testing Datasets

6.5.1.2 Prepare Training and Testing Datasets

As of MySQL 9.4.1, you can automatically create training and testing datasets with the TRAIN_TEST_SPLIT routine.

Before You Begin
Overview

The TRAIN_TEST_SPLIT routine takes your datasets and prepares new tables for training and testing machine learning models. Two new tables in the same database are created with the following names:

  • [original_table_name]_train

  • [original_table_name]_test

The split of the data between training and testing datasets depends on the machine learning task.

  • Classification: A stratified split of data. For each class in the dataset, 80% of the samples go into the training dataset, and the remaining go into the testing dataset. If the number of samples in the 80% subset is fewer than five, then five samples are inserted into the training dataset.

  • Regression: A random split of data.

  • Forecasting: A time-based split of data. The data is inserted in order according to datetime_index values. The first 80% of the samples go into the training dataset. The remaining samples go into the testing dataset.

  • Unsupervised anomaly detection: A random split of data. 80% of the samples go into the training dataset and the remaining samples go into the testing dataset.

  • Semi-supervised anomaly detection: A stratified split of data.

  • Anomaly detection for log data: A split of data based on primary key values. The first 80% of the samples go into the training dataset. The remaining samples go into the testing dataset. Review requirements when running Anomaly Detection for Logs.

  • Recommendations: A random split of data.

  • Topic modeling: A random split of data.

Parameters to Prepare Training and Testing Datasets

To run the TRAIN_TEST_SPLIT routine, you use the following parameters:

  • table_name: You must provide the fully qualified name of the table that contains the dataset to split (schema_name.table_name).

  • target_column_name: Classification and semi-supervised anomaly detection tasks require a target column. All other tasks do not require a target column. If a target column is not required, you can set this parameter to NULL.

  • options: Set the following options as needed as key-value pairs in JSON object format. If no options are needed, set this to NULL.

    • task: Set the appropriate machine learning task: classification, regression, forecasting, anomaly_detection, log_anomaly_detection, recommendation, or topic_modeling. If the machine learning task is not set, the default task is classification.

    • datetime_index: Required for forecasting tasks. The column that has datetime values.

      The following data types for this column are supported:

    • semisupervised: If running an anomaly detection task, set this to true for semi-supervised learning, or false for unsupervised learning. If this is set to NULL, then the default value of false is selected.

TRAIN_TEST_SPLIT Example

To automatically generate a training and testing dataset:

  1. Run the TRAIN_TEST_SPLIT routine.

    mysql> CALL sys.TRAIN_TEST_SPLIT('table_name', 'target_column_name', options);

    Replace table_name, target_column_name, and options with your own values. For example:

    mysql> CALL sys.TRAIN_TEST_SPLIT('data_files_db.data_files_1', 'class', JSON_OBJECT('task', 'classification'));
  2. Confirm the two datasets are created ([original_table_name]_train and [original_table_name]_test) by querying the tables in the database.

    mysql> SHOW TABLES;
    +-------------------------+
    | Tables_in_data_files_db |
    +-------------------------+
    | data_files_1            |
    | data_files_1_test       |
    | data_files_1_train      |
    +-------------------------+
What's Next