As of MySQL 9.4.1, you can automatically create
training and testing datasets with the
TRAIN_TEST_SPLIT
routine.
Review the Requirements.
Get the Required Privileges to use MySQL HeatWave AutoML.
Review the Data Types Supported For Machine Learning Tasks.
Load structured data from the DB system or External Sources.
The TRAIN_TEST_SPLIT
routine
takes your datasets and prepares new tables for training and
testing machine learning models. Two new tables in the same
database are created with the following names:
[original_table_name]_train
[original_table_name]_test
The split of the data between training and testing datasets depends on the machine learning task.
Classification: A stratified split of data. For each class in the dataset, 80% of the samples go into the training dataset, and the remaining go into the testing dataset. If the number of samples in the 80% subset is fewer than five, then five samples are inserted into the training dataset.
Regression: A random split of data.
Forecasting: A time-based split of data. The data is inserted in order according to
datetime_index
values. The first 80% of the samples go into the training dataset. The remaining samples go into the testing dataset.Unsupervised anomaly detection: A random split of data. 80% of the samples go into the training dataset and the remaining samples go into the testing dataset.
Semi-supervised anomaly detection: A stratified split of data.
Anomaly detection for log data: A split of data based on primary key values. The first 80% of the samples go into the training dataset. The remaining samples go into the testing dataset. Review requirements when running Anomaly Detection for Logs.
Recommendations: A random split of data.
Topic modeling: A random split of data.
To run the TRAIN_TEST_SPLIT
routine, you use the following parameters:
table_name
: You must provide the fully qualified name of the table that contains the dataset to split (schema_name.table_name
).target_column_name
: Classification and semi-supervised anomaly detection tasks require a target column. All other tasks do not require a target column. If a target column is not required, you can set this parameter toNULL
.-
options
: Set the following options as needed as key-value pairs in JSON object format. If no options are needed, set this toNULL
.task
: Set the appropriate machine learning task:classification
,regression
,forecasting
,anomaly_detection
,log_anomaly_detection
,recommendation
, ortopic_modeling
. If the machine learning task is not set, the default task isclassification
.-
datetime_index
: Required for forecasting tasks. The column that has datetime values.The following data types for this column are supported:
semisupervised
: If running an anomaly detection task, set this totrue
for semi-supervised learning, orfalse
for unsupervised learning. If this is set toNULL
, then the default value offalse
is selected.
To automatically generate a training and testing dataset:
-
Run the
TRAIN_TEST_SPLIT
routine.mysql> CALL sys.TRAIN_TEST_SPLIT('table_name', 'target_column_name', options);
Replace
table_name
,target_column_name
, andoptions
with your own values. For example:mysql> CALL sys.TRAIN_TEST_SPLIT('data_files_db.data_files_1', 'class', JSON_OBJECT('task', 'classification'));
-
Confirm the two datasets are created ([original_table_name]_train and [original_table_name]_test) by querying the tables in the database.
mysql> SHOW TABLES; +-------------------------+ | Tables_in_data_files_db | +-------------------------+ | data_files_1 | | data_files_1_test | | data_files_1_train | +-------------------------+
Learn how to Train a Model.