3.4.4 Example Data

Examples in this guide use the Census Income Data Set.

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information.

Note

Census Income Data Set examples demonstrate classification training and inference. HeatWave AutoML also supports regression training and inference for datasets suited for that purpose. The ML_TRAIN task parameter defines whether the machine learning model is trained for classification or regression.
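
As a rough illustration of how the task parameter is passed, the following Python sketch formats an ML_TRAIN call for the census_train table that the steps below create. The helper name ml_train_statement and the @census_model session variable are hypothetical. The helper accepts only the two task values named in this note, and it only produces statement text without connecting to a server.

```python
# Hypothetical helper: formats a CALL sys.ML_TRAIN statement as text only.
SUPPORTED_TASKS = ("classification", "regression")  # the two tasks named in the note


def ml_train_statement(table, target, task="classification", handle="@census_model"):
    """Return the SQL text for an ML_TRAIN call with the given task option."""
    if task not in SUPPORTED_TASKS:
        raise ValueError(f"unsupported task: {task!r}")
    return (
        f"CALL sys.ML_TRAIN('{table}', '{target}', "
        f"JSON_OBJECT('task', '{task}'), {handle});"
    )


if __name__ == "__main__":
    # 'revenue' is the target column of the census tables.
    print(ml_train_statement("heatwaveml_bench.census_train", "revenue"))
```

Passing task="regression" instead would request a regression model, which suits a data set with a numeric target rather than the census data used here.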

To replicate the examples in this guide, perform the following steps to create the required schema and tables. Python 3 and MySQL Shell are required.

  1. Create the following schema and tables on the MySQL DB System by executing the following statements:

    mysql> CREATE SCHEMA heatwaveml_bench;
    
    mysql> USE heatwaveml_bench;
    
    mysql> CREATE TABLE census_train ( 
              age INT, workclass VARCHAR(255), 
              fnlwgt INT, education VARCHAR(255), 
              `education-num` INT, 
              `marital-status` VARCHAR(255), 
              occupation VARCHAR(255), 
              relationship VARCHAR(255), 
              race VARCHAR(255), 
              sex VARCHAR(255), 
              `capital-gain` INT, 
              `capital-loss` INT, 
              `hours-per-week` INT, 
              `native-country` VARCHAR(255), 
              revenue VARCHAR(255));
    
    mysql> CREATE TABLE `census_test` LIKE `census_train`;
  2. Navigate to the HeatWave AutoML Code for Performance Benchmarks GitHub repository at https://github.com/oracle-samples/heatwave-ml.

  3. Follow the README.md instructions to create census_train.csv and census_test.csv data files. In summary, the instructions are:

    1. Install the required Python packages:

      $> pip install pandas==1.2.3 numpy==1.22.2 unlzw3==0.2.1 scikit-learn==1.0.2
    2. Download or clone the repository, which includes the census source data and preprocessing script.

    3. Run the preprocess.py script to create the census_train.csv and census_test.csv data files.

      $> python3 heatwave-ml/preprocess.py --benchmark census
    Note

    Do not run the benchmark as instructed in the README.md file. The benchmark script removes the schema and data at the end of processing.

  4. Start MySQL Shell with the --mysql option to open a ClassicSession, which is required when using the Parallel Table Import Utility.

    $> mysqlsh --mysql Username@IPAddressOfMySQLDBSystemEndpoint
  5. Load the data from the .csv files into the MySQL DB System using the following commands:

    MySQL>JS> util.importTable("census_train.csv",{schema: "heatwaveml_bench",
                    table: "census_train", dialect: "csv-unix", skipRows:1})
    
    MySQL>JS> util.importTable("census_test.csv",{schema: "heatwaveml_bench",
                    table: "census_test", dialect: "csv-unix", skipRows:1})
  6. Create a validation table:

    mysql> CREATE TABLE `census_validate` LIKE `census_test`;
    
    mysql> INSERT INTO `census_validate` SELECT * FROM `census_test`;
  7. Modify the census_test table to remove the target `revenue` column:

    mysql> ALTER TABLE `census_test` DROP COLUMN `revenue`;
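
Before running the import in step 5, it can help to confirm that the generated files have the shape the tables expect. The sketch below is one way to do that in Python. It assumes only what the steps above state: the tables have 15 columns, and each file begins with one header row (the row that skipRows:1 skips). The function name count_data_rows is hypothetical, and the check looks at field counts only, not at the values.

```python
import csv
import io

CENSUS_COLUMN_COUNT = 15  # columns in the census_train definition from step 1


def count_data_rows(stream, expected_fields=CENSUS_COLUMN_COUNT):
    """Count rows after the header; raise ValueError if any row has the wrong width."""
    reader = csv.reader(stream)
    header = next(reader, None)
    if header is None:
        raise ValueError("file is empty")
    if len(header) != expected_fields:
        raise ValueError(f"header has {len(header)} fields, expected {expected_fields}")
    rows = 0
    for row in reader:
        if len(row) != expected_fields:
            raise ValueError(
                f"line {reader.line_num}: {len(row)} fields, expected {expected_fields}"
            )
        rows += 1
    return rows


if __name__ == "__main__":
    # Tiny in-memory stand-in for a CSV file, with made-up values.
    sample = io.StringIO("a,b,c\n1,2,3\n4,5,6\n")
    print(count_data_rows(sample, expected_fields=3))
```

For the real files, open each one with open("census_train.csv", newline="") and pass the file object to count_data_rows. The returned count can then be compared with the result of SELECT COUNT(*) on the corresponding table after the import.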

Other Example Data Sets

For other example data sets to use with HeatWave AutoML, refer to the HeatWave AutoML Code for Performance Benchmarks GitHub repository at https://github.com/oracle-samples/heatwave-ml.