Examples in this guide use the Census Income Data Set.
Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information.
The Census Income Data Set examples demonstrate classification training and inference. HeatWave AutoML also supports regression training and inference for datasets suited to that purpose. The ML_TRAIN task parameter defines whether the machine learning model is trained for classification or regression.
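As a minimal sketch of how the task parameter might be supplied, the Python helper below assembles a `CALL sys.ML_TRAIN` statement as a string. The helper name `build_ml_train_call` and the session variable `@census_model` are illustrative assumptions, and the statement shape (table, target column, a `JSON_OBJECT` of options, model handle) should be checked against the ML_TRAIN reference for your HeatWave release.

```python
# Tasks named in this section; a given HeatWave AutoML release may accept others.
SUPPORTED_TASKS = ("classification", "regression")


def build_ml_train_call(table, target, task, handle="@census_model"):
    """Return a CALL sys.ML_TRAIN statement as text (hypothetical helper).

    The handle default '@census_model' is an assumed session variable name.
    """
    if task not in SUPPORTED_TASKS:
        raise ValueError(f"unsupported task: {task!r}")
    return (
        f"CALL sys.ML_TRAIN('{table}', '{target}', "
        f"JSON_OBJECT('task', '{task}'), {handle});"
    )


# Census examples predict the 'revenue' column, so the task is classification.
statement = build_ml_train_call(
    "heatwaveml_bench.census_train", "revenue", "classification"
)
print(statement)
```

The helper only formats text; it does not connect to a DB System. Paste the printed statement into a MySQL client session to run it.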
To replicate the examples in this guide, perform the following steps to create the required schema and tables. Python 3 and MySQL Shell are required.
- Create the schema and tables on the MySQL DB System by executing the following statements:
  mysql> CREATE SCHEMA heatwaveml_bench;
  mysql> USE heatwaveml_bench;
  mysql> CREATE TABLE census_train (
           age INT,
           workclass VARCHAR(255),
           fnlwgt INT,
           education VARCHAR(255),
           `education-num` INT,
           `marital-status` VARCHAR(255),
           occupation VARCHAR(255),
           relationship VARCHAR(255),
           race VARCHAR(255),
           sex VARCHAR(255),
           `capital-gain` INT,
           `capital-loss` INT,
           `hours-per-week` INT,
           `native-country` VARCHAR(255),
           revenue VARCHAR(255));
  mysql> CREATE TABLE `census_test` LIKE `census_train`;
- Navigate to the HeatWave AutoML Code for Performance Benchmarks GitHub repository at https://github.com/oracle-samples/heatwave-ml.
- Follow the README.md instructions to create the census_train.csv and census_test.csv data files. In summary, the instructions are:
  - Install the required Python packages:
    $> pip install pandas==1.2.3 numpy==1.22.2 unlzw3==0.2.1 scikit-learn==1.0.2
  - Download or clone the repository, which includes the census source data and preprocessing script.
  - Run the preprocess.py script to create the census_train.csv and census_test.csv data files:
    $> python3 heatwave-ml/preprocess.py --benchmark census
    Note: Do not run the benchmark as instructed in the README.md file. The benchmark script removes the schema and data at the end of processing.
- Start MySQL Shell with the --mysql option to open a ClassicSession, which is required when using the Parallel Table Import Utility:
  $> mysqlsh --mysql Username@IPAddressOfMySQLDBSystemEndpoint
- Load the data from the .csv files into the MySQL DB System using the following commands:
  MySQL>JS> util.importTable("census_train.csv",{table: "census_train", dialect: "csv-unix", skipRows:1})
  MySQL>JS> util.importTable("census_test.csv",{table: "census_test", dialect: "csv-unix", skipRows:1})
- Create a validation table:
  mysql> CREATE TABLE `census_validate` LIKE `census_test`;
  mysql> INSERT INTO `census_validate` SELECT * FROM `census_test`;
- Modify the census_test table to remove the target `revenue` column:
  mysql> ALTER TABLE `census_test` DROP COLUMN `revenue`;
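Because the import commands in the steps above skip the first row of each file and load the remaining values by position, it can help to compare the generated CSV headers with the census_train column order before importing. The sketch below is one way to do that with the Python standard library. It assumes the header names written by preprocess.py match the table column names, which this guide does not state, and the function name `header_mismatches` is illustrative.

```python
import csv
import os

# Column order copied from the CREATE TABLE census_train statement in this guide.
CENSUS_COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "revenue",
]


def header_mismatches(csv_path, expected=CENSUS_COLUMNS):
    """Return (position, expected_name, found_name) for each differing header cell.

    A missing cell on either side is reported as None.
    """
    with open(csv_path, newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle), [])
    mismatches = []
    for i in range(max(len(header), len(expected))):
        want = expected[i] if i < len(expected) else None
        got = header[i] if i < len(header) else None
        if want != got:
            mismatches.append((i, want, got))
    return mismatches


if __name__ == "__main__":
    # Assumes the CSV files are in the current working directory.
    for name in ("census_train.csv", "census_test.csv"):
        if os.path.exists(name):
            print(name, header_mismatches(name) or "header matches")
        else:
            print(name, "not found; run preprocess.py first")
```

An empty result means the header agrees with the table definition. Any reported position is worth inspecting before running the import, since a shifted column would load values into the wrong field.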
For other example data sets to use with HeatWave AutoML, refer to the HeatWave AutoML Code for Performance Benchmarks GitHub repository.