HeatWave User Guide  /  ...  /  TRAIN_TEST_SPLIT

10.2.14 TRAIN_TEST_SPLIT

MySQL 9.4.1 introduces the TRAIN_TEST_SPLIT routine, which automatically splits your data into training and testing datasets.

Two new tables in the same database are created with the following names:

  • [original_table_name]_train

  • [original_table_name]_test

The split of the data between training and testing datasets depends on the machine learning task.

  • Classification: A stratified split of data. For each class in the dataset, 80% of the samples goes into the training dataset, and the remaining goes into the testing dataset. If the number of samples in the 80% subset is fewer than 5, then instead select 5 of the samples for the training dataset.

  • Regression: A random split of data.

  • Forecasting: A time-based split of data. Order the table by the datetime_index values and select the first 80% of the samples for the training dataset. Insert the subsequent samples into the testing dataset.

  • Unsupervised anomaly detection: A random split of data. Select 80% of the samples for the training dataset and select the remaining samples for the testing dataset.

  • Semi-supervised anomaly detection: A stratified split of data.

  • Anomaly detection for log data: A split of data based on primary key values. The first 80% of the samples go into the training dataset. The remaining samples go into the testing dataset. Review requirements when running Anomaly Detection for Logs.

  • Recommendations: A random split of data.

  • Topic modeling: A random split of data.

TRAIN_TEST_SPLIT Syntax

mysql> CALL sys.TRAIN_TEST_SPLIT ('table_name', 'target_column_name', [options | NULL]);
 
options: {
     JSON_OBJECT("key","value"[,"key","value"] ...)
          "key","value": {
          ['task', {'classification'|'regression'|'forecasting'|'anomaly_detection'|'log_anomaly_detection'|'recommendation'|'topic_modeling'}]
          ['datetime_index', 'column']
          ['semisupervised', {'true'|'false'}]

TRAIN_TEST_SPLIT parameters:

  • table_name: You must provide the fully qualified name of the table that contains the dataset to split (schema_name.table_name).

  • target_column_name: Classification and semi-supervised anomaly detection tasks require a target column. All other tasks do not require a target column. If a target column is not required you can set this parameter to NULL.

  • options: Set the following options as needed as key-value pairs in JSON object format. If no options are needed, set this to NULL.

    • task: If the machine learning task is not set, the default task is classification.

    • datetime_index: The column that has datetime values. This parameter is required for forecasting tasks.

      The following data types for this column are supported:

    • semisupervised: If running an anomaly detection task, set this to true for semi-supervised learning, or false for unsupervised learning. If this is set to NULL, then the default value of false is selected.

Syntax Examples

  • A classification task:

    mysql> CALL sys.TRAIN_TEST_SPLIT('data_files_db.data_files_1', 'class', JSON_OBJECT('task', 'classification'));
    mysql> SHOW TABLES;
    +-------------------------+
    | Tables_in_data_files_db |
    +-------------------------+
    | data_files_1            |
    | data_files_1_test       |
    | data_files_1_train      |
    +-------------------------+
  • A regression task:

    mysql> CALL sys.TRAIN_TEST_SPLIT('data_files_db.food_delivery_data', NULL, JSON_OBJECT('task', 'regression'));
    mysql> SHOW TABLES;
    +--------------------------+
    | Tables_in_data_files_db  |
    +--------------------------+
    | food_delivery_data       |
    | food_delivery_data_test  |
    | food_delivery_data_train |
    +--------------------------+
  • A forecasting task:

    mysql> CALL sys.TRAIN_TEST_SPLIT('data_files_db.forecasting_data', NULL, JSON_OBJECT('task', 'forecasting', 'datetime_index', 'timestamp'));
    mysql> SHOW TABLES;
    +-------------------------+
    | Tables_in_data_files_db |
    +-------------------------+
    | forecasting_data        |
    | forecasting_data_test   |
    | forecasting_data_train  |
    +-------------------------+
  • An unsupervised anomaly detection task:

    mysql> CALL sys.TRAIN_TEST_SPLIT('data_files_db.anomaly_detection_data', NULL, JSON_OBJECT('task', 'anomaly_detection'));
    mysql> SHOW TABLES;
    +------------------------------+
    | Tables_in_data_files_db      |
    +------------------------------+
    | anomaly_detection_data       |
    | anomaly_detection_data_test  |
    | anomaly_detection_data_train |
    +------------------------------+
  • A semi-supervised anomaly detection task:

    mysql> CALL sys.TRAIN_TEST_SPLIT('data_files_db.anomaly_detection_semi', 'anomaly', JSON_OBJECT('task', 'anomaly_detection', 'semisupervised', 'true'));
    mysql> SHOW TABLES;
    +------------------------------+
    | Tables_in_data_files_db      |
    +------------------------------+
    | anomaly_detection_semi       |
    | anomaly_detection_semi_test  |
    | anomaly_detection_semi_train |
    +------------------------------+
  • A task for anomaly detection on log data:

    mysql> CALL sys.TRAIN_TEST_SPLIT('data_files_db.anomaly_detection_logs', NULL, JSON_OBJECT('task', 'log_anomaly_detection'));
    mysql> SHOW TABLES;
    +------------------------------+
    | Tables_in_data_files_db      |
    +------------------------------+
    | anomaly_detection_logs       |
    | anomaly_detection_logs_test  |
    | anomaly_detection_logs_train |
    +------------------------------+
  • A recommendation task:

    mysql> CALL sys.TRAIN_TEST_SPLIT('data_files_db.rec_data', NULL, JSON_OBJECT('task', 'recommendation'));
    mysql> SHOW TABLES;
    +-------------------------+
    | Tables_in_data_files_db |
    +-------------------------+
    | rec_data                |
    | rec_data_test           |
    | rec_data_train          |
    +-------------------------+
  • A topic modeling task:

    mysql> CALL sys.TRAIN_TEST_SPLIT('data_files_db.text_data', NULL, JSON_OBJECT('task', 'topic_modeling'));
    mysql> SHOW TABLES;
    +-------------------------+
    | Tables_in_data_files_db |
    +-------------------------+
    | text_data               |
    | text_data_test          |
    | text_data_train         |
    +-------------------------+