If you have external data that you want to use with MySQL HeatWave AutoML, you can load it directly from Object Storage with Lakehouse.
This topic has the following sections.
Lakehouse must be enabled in the MySQL HeatWave cluster. See Additional MySQL HeatWave Lakehouse Requirements to learn more.
For MySQL HeatWave on OCI, you have three options to access external data from Object Storage: Pre-Authenticated Requests (PAR), Resource Principals, or Uniform Resource Identifiers (URI) (As of MySQL 9.3.1).
See the following to learn more:
Resource Principals in the MySQL HeatWave on OCI Service Guide.
Uploading an Object Storage Object to a Bucket in the MySQL HeatWave on OCI Service Guide.
Object Storage Pre-Authenticated Requests in the MySQL HeatWave on OCI Service Guide.
After loading data from Object Storage into Lakehouse, you can start running AutoML routines. You can run the following routines as needed with no changes:
If you run ML_PREDICT_ROW or
ML_EXPLAIN_ROW on data loaded
from Object Storage, you cannot use a FROM
clause.
MySQL HeatWave AutoML commands operate on data loaded into MySQL HeatWave. If the original Lakehouse data in Object Storage is deleted or modified, this does not affect a MySQL HeatWave AutoML command until the data is unloaded from MySQL HeatWave.
The workflow to use MySQL HeatWave AutoML with Lakehouse includes the following:
Upload the file that has the data to a bucket in your Oracle Cloud account. See Upload File to Object Storage.
Load the external file into MySQL HeatWave and create an external Lakehouse table. You have two options to do this. The preferred method is to automate the process with the
heatwave_loadcommand. See Load Data into an External Lakehouse Table. Alternatively, you can manually load the data. See Section 4.3.11, “Load Structured Data Manually” to learn more.Run MySQL HeatWave AutoML routines as needed. For
ML_PREDICT_ROWandML_EXPLAIN_ROW, you cannot use theFROMclause in Lakehouse tables. See Use ML_PREDICT_ROW and ML_EXPLAIN_ROW with Lakehouse Data to review how to do this.
The following tasks use data from:
Bank
Marketing. This data is used for a classification
machine learning model, and predicts if a client will subscribe
to a term deposit. The target column is y.
To access the file from Object Storage, you create a pre-authenticated request. This is a temporary request that allows you to access files in Object Storage. See Object Storage Pre-Authenticated Requests to learn more.
To upload the file that has the data to Object Storage:
Visit Bank Marketing and download the
bank+marketing.zipfile.Unzip the file, and then unzip the
bank.zipfile. Refer to thebank.csvfile for the remaining steps.Access your Oracle Cloud account and upload the
bank.csvfile to Object Storage. See Uploading an Object Storage Object to a Bucket.Create a pre-authenticated request for the
bank.csvfile. Select theObjectPre-Authenticated Request Target and enter thebank.csvObject Name. See Creating a Pre-Authenticated Request in Object Storage in the MySQL HeatWave on OCI Service Guide.
After uploading the file to Object Storage, set up the name of
the table that will store the data, the format of the external
file, and the pre-authorized request link to access the external
file. You then use the heatwave_load command
to allow Auto Parallel Load to automatically infer the characteristics of the
table columns and load the data into an external Lakehouse
table. To learn more, see
Section 4.3.10, “Load Structured Data Using Lakehouse Auto Parallel Load”.
-
Create a session variable with the characteristics of the data to load and the external file:
mysql> SET @input_list = '[{ "db_name": "bank_marketing", "tables": [{ "table_name": "bank_train", "engine_attribute": { "dialect": {"format": "csv", "has_header": true}, "file": [{"par": "pre_authenticated_request"}] } }] }]';Where:
@input_listis the name of the session variable.db_nameidentifies the database name to store the table. If the database does not exist, MySQL HeatWave automatically creates it.table_namesets the table name to store the data:bank_train. If the table does not exist, MySQL HeatWave automatically creates it.formatdefines the format of the external file:csv.has_headeridentifies a header in the external file. Auto Parallel Load then infers the column names from the first row in the file.parsets the pre-authenticated request link to access the file. Replacepre_authenticated_requestwith your own link.
If you are on MySQL 9.1.2 and earlier, you need to update
dialectwith thefield delimiterandrecord delimiterparameters. Setfield delimiterwith a value of",", andrecord delimiterwith a value of"\\n". As of MySQL 9.2.0, Lakehouse Autopilot can automatically detect these values. See Lakehouse External Table Syntax to learn more. -
Use the
HEATWAVE_LOADcommand to automatically load the data into an external Lakehouse table with the default settings. You have the option to test the load before implementing it. See Section 4.3.10, “Load Structured Data Using Lakehouse Auto Parallel Load” to learn more about different modes.mysql> CALL sys.HEATWAVE_LOAD(CAST(@input_list AS JSON), NULL); +------------------------------------------+ | INITIALIZING HEATWAVE AUTO PARALLEL LOAD | +------------------------------------------+ | Version: 4.24 | | | | Load Mode: normal | | Load Policy: disable_unsupported_columns | | Output Mode: normal | | | +------------------------------------------+ 6 rows in set (0.0676 sec) +--------------------------------------------------------------------------------------------------------------------+ | LAKEHOUSE AUTO SCHEMA INFERENCE | +--------------------------------------------------------------------------------------------------------------------+ | Verifying external lakehouse tables: 1 | | | | SCHEMA TABLE TABLE IS RAW NUM. OF ESTIMATED SUMMARY OF | | NAME NAME CREATED FILE SIZE COLUMNS ROW COUNT ISSUES | | ------ ----- -------- --------- ------- --------- ---------- | | `bank_marketing` `bank_train` NO 450.66 KiB 17 4.52 K | | | | New schemas to be created: 0 | | External lakehouse tables to be created: 1 | | | +--------------------------------------------------------------------------------------------------------------------+ 10 rows in set (0.0676 sec) +------------------------------------------------------------------------+ | OFFLOAD ANALYSIS | +------------------------------------------------------------------------+ | Verifying input schemas: 1 | | User excluded items: 0 | | | | SCHEMA OFFLOADABLE OFFLOADABLE SUMMARY OF | | NAME TABLES COLUMNS ISSUES | | ------ ----------- ----------- ---------- | | `bank_marketing` 1 17 | | | | Total offloadable schemas: 1 | | | +------------------------------------------------------------------------+ 10 rows in set (0.0676 sec) +-----------------------------------------------------------------------------------------------------------------------------+ | CAPACITY ESTIMATION | +-----------------------------------------------------------------------------------------------------------------------------+ | Default encoding for string columns: VARLEN (unless specified in the schema) | | Estimating memory footprint for 1 schema(s) | | | | TOTAL ESTIMATED ESTIMATED TOTAL DICTIONARY VARLEN ESTIMATED | | SCHEMA OFFLOADABLE HEATWAVE NODE MYSQL NODE STRING ENCODED ENCODED LOAD | | NAME TABLES FOOTPRINT FOOTPRINT COLUMNS COLUMNS COLUMNS TIME | | ------ ----------- --------- --------- ------- ---------- ------- --------- | | `bank_marketing` 1 3.10 MiB 1.12 MiB 10 0 10 7.00 s | | | | Sufficient MySQL host memory available to load all tables. | | Sufficient HeatWave cluster memory available to load all tables. | | | +-----------------------------------------------------------------------------------------------------------------------------+ 12 rows in set (0.0676 sec) +---------------------------------------------------------------------------------------------------------------------------------------+ | EXECUTING LOAD SCRIPT | +---------------------------------------------------------------------------------------------------------------------------------------+ | HeatWave Load script generated | | Retrieve load script containing 2 generated DDL command(s) using the query below: | | Deprecation Notice: "heatwave_load_report" will be deprecated, please switch to "heatwave_autopilot_report" | | SELECT log->>"$.sql" AS "Load Script" FROM sys.heatwave_autopilot_report WHERE type = "sql" ORDER BY id; | | | | Adjusting load parallelism dynamically per internal/external table. | | Using current parallelism of 32 thread(s) as maximum for internal tables. | | | | Warning: Executing the generated script may alter column definitions and secondary engine flags in the schema | | | | Using SQL_MODE: ONLY_FULL_GROUP_BY,STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_ENGINE_SUBSTITUTION | | | | Proceeding to load 1 table(s) into HeatWave. | | | | Applying changes will take approximately 7.01 s | | | +---------------------------------------------------------------------------------------------------------------------------------------+ 16 rows in set (0.0676 sec) +-----------------------------------------------+ | TABLE LOAD | +-----------------------------------------------+ | TABLE (1 of 1): `bank_marketing`.`bank_train` | | Commands executed successfully: 2 of 2 | | Warnings encountered: 0 | | Table load succeeded! | | Total columns loaded: 17 | | Elapsed time: 3.50 s | | | +-----------------------------------------------+ 7 rows in set (0.0676 sec) +-------------------------------------------------------------------------------+ | LOAD SUMMARY | +-------------------------------------------------------------------------------+ | | | SCHEMA TABLES TABLES COLUMNS LOAD | | NAME LOADED FAILED LOADED DURATION | | ------ ------ ------ ------- -------- | | `bank_marketing` 1 0 17 3.50 s | | | | Total errors encountered: 0 | | Total warnings encountered: 0 | | | +-------------------------------------------------------------------------------+ 9 rows in set (0.0676 sec) Query OK, 0 rows affected (0.0676 sec)A summary is generated that allows you to review the details of the load and any warnings or errors.
Once the data is loaded into Lakehouse, you can use the following routines as needed with no changes:
For ML_PREDICT_ROW and
ML_EXPLAIN_ROW, you must avoid
the FROM clause. The following task shows how
to use these commands specifically for data loaded into MySQL HeatWave
from Lakehouse.
To avoid using the FROM clause, you can
insert the data to predict and explain directly into a
JSON object.
-
Train the table. Setting
NULLto all JSON options means that the defaulttaskofclassificationis used for training.mysql> CALL sys.ML_TRAIN('bank_marketing.bank_train', 'y', JSON_OBJECT('task', 'classification'), @bank_model);Where:
bank_marketing.bank_trainis the fully qualified name of the table that contains the training dataset (schema_name.table_name).yis the name of the target column, which contains ground truth values.JSON_OBJECT('task', 'classification')specifies the machine learning task type.@bank_modelis the name of the user-defined session variable that stores the model handle for the duration of the connection. User variables are written as@. Any valid name for a user-defined variable is permitted. For example,var_name@my_model. Learn more about Model Handles.
-
After training, load the trained model.
mysql> CALL sys.ML_MODEL_LOAD(@bank_model, NULL); -
Insert the data to predict and explain directly into a JSON object named
@row_input.mysql> SET @row_input = JSON_OBJECT( 'age', 37, 'job', 'admin.', 'marital', 'married', 'education', 'unknown', 'default', 'no', 'balance', 734, 'housing', 'yes', 'loan', 'no', 'contact', 'unknown', 'day', 21, 'month', 'may', 'duration', 1106, 'campaign', 1, 'pdays', -1, 'previous', 0, 'poutcome', 'unknown', 'y', 'no'); -
Run
ML_PREDICT_ROWto generate predictions on the data in the JSON object. Optionally, use\Gto view the output in an easily readable format.mysql> SELECT sys.ML_PREDICT_ROW(@row_input, @bank_model, NULL)\G *************************** 1. row *************************** sys.ML_PREDICT_ROW(@row_input, @bank_model, NULL): {"y": "no", "age": 37, "day": 21, "job": "admin.", "loan": "no", "month": "may", "pdays": -1, "balance": 734, "contact": "unknown", "default": "no", "housing": "yes", "marital": "married", "campaign": 1, "duration": 1106, "poutcome": "unknown", "previous": 0, "education": "unknown", "Prediction": "no", "ml_results": {"predictions": {"y": "no"}, "probabilities": {"no": 0.7052, "yes": 0.2948}}} 1 row in set (1.0027 sec)The prediction gives a value of
nowith a probability of 65%. This is the same as the labeled value for the row in theycolumn. This demonstrates that the trained model gave a reliable prediction. -
Run
ML_EXPLAIN_ROWwith thepermutation_importanceprediction explainer to generate an explanation on the prediction previously generated. Optionally, use\Gto view the output in an easily readable format.mysql> SELECT sys.ML_EXPLAIN_ROW(@row_input, @bank_model, JSON_OBJECT('prediction_explainer', 'permutation_importance'))\G *************************** 1. row *************************** sys.ML_EXPLAIN_ROW(@row_input, @bank_model, JSON_OBJECT('prediction_explainer', 'permutation_importance')): {"y": "no", "age": 37, "day": 21, "job": "admin.", "loan": "no", "Notes": "month (may) had the largest impact towards predicting no, whereas duration (1106) contributed the most against predicting no", "month": "may", "pdays": -1, "balance": 734, "contact": "unknown", "default": "no", "housing": "yes", "marital": "married", "campaign": 1, "duration": 1106, "poutcome": "unknown", "previous": 0, "education": "unknown", "Prediction": "no", "ml_results": {"notes": "month (may) had the largest impact towards predicting no, whereas duration (1106) contributed the most against predicting no", "predictions": {"y": "no"}, "attributions": {"month": 0.24, "pdays": 0.02, "contact": 0.15, "default": 0.0, "housing": 0.05, "duration": -0.29, "poutcome": -0.0, "previous": -0.02}}, "month_attribution": 0.24, "pdays_attribution": 0.02, "contact_attribution": 0.15, "default_attribution": 0, "housing_attribution": 0.05, "duration_attribution": -0.29, "poutcome_attribution": 0, "previous_attribution": -0.02} 1 row in set (5.0770 sec)The explanation determines that the
mayvalue in themonthcolumn had the greatest impact on generating the previous prediction ofno. It also determines thatdurationcontributed the most against the prediction ofno.
Learn how to Load Data from Object Storage into MySQL HeatWave Cluster using MySQL HeatWave Lakehouse.
Review how to Create a Machine Learning Model.
Review Machine Learning Use Cases to create machine learning models with sample datasets.