General requirements for HeatWave AutoML data include the following:
- Each dataset must reside in a single table on the MySQL DB System. HeatWave AutoML routines such as ML_TRAIN, ML_PREDICT_TABLE, and ML_EXPLAIN_TABLE operate on a single table. For information about loading data into a MySQL DB System, see Importing and Exporting Databases in the HeatWave on OCI Service Guide.
- Tables used with HeatWave AutoML must not exceed 10 GB, 100 million rows, or 1017 columns.
- Table columns must use supported data types. For supported data types and recommendations for how to handle unsupported types, see Section 3.17, “Supported Data Types”.
- NaN (Not a Number) values are not recognized by MySQL and should be replaced by NULL.
- The target column in a training dataset for a classification model must have at least two distinct values, and each distinct value should appear in at least five rows. For a regression model, only a numeric target column is permitted.
- The ML_TRAIN routine ignores columns that are missing more than 20% of their values and columns that have the same value in every row. Missing values in numerical columns are replaced with the average value of the column, and the values are standardized to a mean of 0 and a standard deviation of 1. Missing values in categorical columns are replaced with the most frequent value, and either one-hot or ordinal encoding is used to convert categorical values to numeric values. The input data as it exists in the MySQL database is not modified by ML_TRAIN.
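
The rules above can be checked on a dataset before it is loaded. The following is a minimal Python sketch, not HeatWave code. It assumes the rows are held in memory as a list of dictionaries with None standing in for SQL NULL. All function names are hypothetical. ML_TRAIN performs its own preprocessing internally, and details such as using the population standard deviation, or treating a column with one distinct non-missing value as constant, are assumptions made only for illustration.

```python
import math
from collections import Counter
from statistics import mean, pstdev

MAX_MISSING_RATIO = 0.20  # from the text: columns missing more than 20% are ignored
MIN_ROWS_PER_CLASS = 5    # from the text: each class should appear in at least five rows


def nan_to_none(rows):
    """Replace float NaN with None so the value can be stored as SQL NULL."""
    return [
        {k: (None if isinstance(v, float) and math.isnan(v) else v)
         for k, v in row.items()}
        for row in rows
    ]


def check_classification_target(rows, target):
    """Return a list of problems with the target column (empty if none found)."""
    counts = Counter(row[target] for row in rows if row[target] is not None)
    problems = []
    if len(counts) < 2:
        problems.append("target needs at least two distinct values")
    for value, n in counts.items():
        if n < MIN_ROWS_PER_CLASS:
            problems.append(f"class {value!r} appears in only {n} rows")
    return problems


def ignored_columns(rows, target):
    """List feature columns that would be ignored under the stated rules."""
    ignored = []
    for col in rows[0]:
        if col == target:
            continue
        values = [row[col] for row in rows]
        present = [v for v in values if v is not None]
        missing_ratio = 1 - len(present) / len(values)
        # Assumption: "same value in each row" is judged on non-missing values.
        if missing_ratio > MAX_MISSING_RATIO or len(set(present)) <= 1:
            ignored.append(col)
    return ignored


def impute_and_standardize(values):
    """Mean-impute a numeric column, then scale to mean 0 and std deviation 1.

    Assumption: population standard deviation, computed after imputation.
    Constant columns are expected to have been dropped already (sigma > 0).
    """
    present = [v for v in values if v is not None]
    mu = mean(present)
    filled = [mu if v is None else v for v in values]
    sigma = pstdev(filled)
    return [(v - mu) / sigma for v in filled]


def impute_most_frequent(values):
    """Fill missing categorical values with the most frequent observed value."""
    present = [v for v in values if v is not None]
    mode = Counter(present).most_common(1)[0][0]
    return [mode if v is None else v for v in values]
```

A typical use would be to call nan_to_none on the rows first, then check_classification_target and ignored_columns, and fix any reported problems before loading the table. The two impute functions only show what the described replacement rules do to a single column. They are not needed for training, because ML_TRAIN applies its own handling and leaves the stored data unchanged.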