MySQL :: HeatWave User Guide :: 6.5.1.2 Prepare Training and Testing Datasets

HeatWave User Guide / ... / Prepare Training and Testing Datasets

6.5.1.2 Prepare Training and Testing Datasets

As of MySQL 9.4.1, you can automatically create training and testing datasets with the TRAIN_TEST_SPLIT routine.

Before You Begin

Review the Requirements.
Get the Required Privileges to use MySQL HeatWave AutoML.
Review the Data Types Supported For Machine Learning Tasks.
Load structured data from the DB system or External Sources.

Overview

The TRAIN_TEST_SPLIT routine takes your datasets and prepares new tables for training and testing machine learning models. Two new tables in the same database are created with the following names:

[original_table_name]_train
[original_table_name]_test

The split of the data between training and testing datasets depends on the machine learning task.

Classification: A stratified split of data. For each class in the dataset, 80% of the samples go into the training dataset, and the remaining go into the testing dataset. If the number of samples in the 80% subset is fewer than five, then five samples are inserted into the training dataset.
Regression: A random split of data.
Forecasting: A time-based split of data. The data is inserted in order according to datetime_index values. The first 80% of the samples go into the training dataset. The remaining samples go into the testing dataset.
Unsupervised anomaly detection: A random split of data. 80% of the samples go into the training dataset, and the remaining samples go into the testing dataset.
Semi-supervised anomaly detection: A stratified split of data.
Anomaly detection for log data: A split of data based on primary key values. The first 80% of the samples go into the training dataset. The remaining samples go into the testing dataset. Review requirements when running Anomaly Detection for Logs.
Recommendations: A random split of data.
Topic modeling: A random split of data.

Parameters to Prepare Training and Testing Datasets

To run the TRAIN_TEST_SPLIT routine, you use the following parameters:

table_name: You must provide the fully qualified name of the table that contains the dataset to split (schema_name.table_name).
target_column_name: Classification and semi-supervised anomaly detection tasks require a target column. All other tasks do not require a target column. If a target column is not required, you can set this parameter to NULL.
options: Set the following options as needed as key-value pairs in JSON object format. If no options are needed, set this to NULL.
- task: Set the appropriate machine learning task: classification, regression, forecasting, anomaly_detection, log_anomaly_detection, recommendation, or topic_modeling. If the machine learning task is not set, the default task is classification.
- datetime_index: Required for forecasting tasks. The column that has datetime values.
  
  The following data types for this column are supported:
  - DATETIME
  - TIMESTAMP
  - DATE
  - TIME
  - YEAR
- semisupervised: If running an anomaly detection task, set this to true for semi-supervised learning, or false for unsupervised learning. If this is set to NULL, then the default value of false is selected.

TRAIN_TEST_SPLIT Example

To automatically generate a training and testing dataset:

Run the TRAIN_TEST_SPLIT routine.

mysql> CALL sys.TRAIN_TEST_SPLIT('table_name', 'target_column_name', options);

Replace table_name, target_column_name, and options with your own values. For example:

mysql> CALL sys.TRAIN_TEST_SPLIT('data_files_db.data_files_1', 'class', JSON_OBJECT('task', 'classification'));

Confirm the two datasets are created ([original_table_name]_train and [original_table_name]_test) by querying the tables in the database.

mysql> SHOW TABLES;
+-------------------------+
| Tables_in_data_files_db |
+-------------------------+
| data_files_1            |
| data_files_1_test       |
| data_files_1_train      |
+-------------------------+

What's Next

Learn how to Train a Model.

PREV HOME UP NEXT