If you have external data that you want to use with MySQL HeatWave AutoML, you can load it directly from Object Storage with Lakehouse.
Lakehouse must be enabled in the MySQL HeatWave cluster. See Additional MySQL HeatWave Lakehouse Requirements to learn more.
For MySQL HeatWave on OCI, you have three options to access external data from Object Storage: Pre-Authenticated Requests (PAR), Resource Principals, or, as of MySQL 9.3.1, Uniform Resource Identifiers (URIs). For an illustration of how these options appear in a load specification, see the sketch after the following references.
See the following to learn more:
Resource Principals in the MySQL HeatWave on OCI Service Guide.
Uploading an Object Storage Object to a Bucket in the MySQL HeatWave on OCI Service Guide.
Object Storage Pre-Authenticated Requests in the MySQL HeatWave on OCI Service Guide.
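The access method you choose determines how each object is identified in the file list that is passed to the Lakehouse load commands shown later in this topic. The following is a minimal sketch contrasting a PAR-based specification with a resource-principal-based one. The region, namespace, bucket, and object names are placeholders, and the exact keys supported by your MySQL version (including the URI form added in MySQL 9.3.1) should be verified against the guides referenced above and the Lakehouse external table syntax documentation.
-- Object identified through a pre-authenticated request (PAR):
mysql> SET @input_list = '[{"db_name": "bank_marketing", "tables": [{"table_name": "bank_train", "engine_attribute": {"dialect": {"format": "csv", "has_header": true}, "file": [{"par": "pre_authenticated_request"}]}}]}]';
-- Object identified through a resource principal (placeholder region, namespace, bucket, and object name):
mysql> SET @input_list = '[{"db_name": "bank_marketing", "tables": [{"table_name": "bank_train", "engine_attribute": {"dialect": {"format": "csv", "has_header": true}, "file": [{"region": "your_region", "namespace": "your_namespace", "bucket": "your_bucket", "name": "bank.csv"}]}}]}]';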
After loading data from Object Storage into Lakehouse, you can run MySQL HeatWave AutoML routines as needed with no changes. If you run ML_PREDICT_ROW or ML_EXPLAIN_ROW on data loaded from Object Storage, you cannot use a FROM clause.
MySQL HeatWave AutoML commands operate on data loaded into MySQL HeatWave. If the original Lakehouse data in Object Storage is deleted or modified, this does not affect a MySQL HeatWave AutoML command until the data is unloaded from MySQL HeatWave.
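For example, if you want subsequent AutoML commands to stop seeing the data, you can unload the Lakehouse table from MySQL HeatWave; reloading it reads the current contents of Object Storage again. A minimal sketch, using the bank_marketing.bank_train table created later in this topic:
mysql> ALTER TABLE bank_marketing.bank_train SECONDARY_UNLOAD;
mysql> ALTER TABLE bank_marketing.bank_train SECONDARY_LOAD;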
The workflow to use MySQL HeatWave AutoML with Lakehouse includes the following:
Upload the file that has the data to a bucket in your Oracle Cloud account. See Upload File to Object Storage.
Load the external file into MySQL HeatWave and create an external Lakehouse table. You have two options to do this. The preferred method is to automate the process with the heatwave_load command. See Load Data into an External Lakehouse Table. Alternatively, you can manually load the data. See Section 4.3.9, “Load Structured Data Manually” to learn more.
Run MySQL HeatWave AutoML routines as needed. For ML_PREDICT_ROW and ML_EXPLAIN_ROW, you cannot use the FROM clause in Lakehouse tables. See Use ML_PREDICT_ROW and ML_EXPLAIN_ROW with Lakehouse Data to review how to do this.
The following tasks use data from Bank Marketing. This data is used for a classification machine learning model, and predicts if a client will subscribe to a term deposit. The target column is y.
To access the file from Object Storage, you create a pre-authenticated request. This is a temporary request that allows you to access files in Object Storage. See Object Storage Pre-Authenticated Requests to learn more.
To upload the file that has the data to Object Storage:
Visit Bank Marketing and download the bank+marketing.zip file.
Unzip the file, and then unzip the bank.zip file. Refer to the bank.csv file for the remaining steps.
Access your Oracle Cloud account and upload the bank.csv file to Object Storage. See Uploading an Object Storage Object to a Bucket.
Create a pre-authenticated request for the bank.csv file. Select the Object Pre-Authenticated Request Target and enter the bank.csv Object Name. See Creating a Pre-Authenticated Request in Object Storage in the MySQL HeatWave on OCI Service Guide.
After uploading the file to Object Storage, set up the name of the table that will store the data, the format of the external file, and the pre-authenticated request link to access the external file. You then use the heatwave_load command to allow Auto Parallel Load to automatically infer the characteristics of the table columns and load the data into an external Lakehouse table. To learn more, see Section 4.3.8.1, “Load Data from Object Storage”.
-
Create a session variable with the characteristics of the data to load and the external file:
mysql> SET @input_list = '[{ "db_name": "bank_marketing", "tables": [{ "table_name": "bank_train", "engine_attribute": { "dialect": {"format": "csv", "has_header": true}, "file": [{"par": "pre_authenticated_request"}] } }] }]';
Where:
@input_list is the name of the session variable.
db_name identifies the database name to store the table. If the database does not exist, MySQL HeatWave automatically creates it.
table_name sets the table name to store the data: bank_train. If the table does not exist, MySQL HeatWave automatically creates it.
format defines the format of the external file: csv.
has_header identifies a header in the external file. Auto Parallel Load then infers the column names from the first row in the file.
par sets the pre-authenticated request link to access the file. Replace pre_authenticated_request with your own link.
If you are on MySQL 9.1.2 or earlier, you need to update dialect with the field_delimiter and record_delimiter parameters. Set field_delimiter with a value of "," and record_delimiter with a value of "\\n" (an example appears after the load summary below). As of MySQL 9.2.0, Lakehouse Autopilot can automatically detect these values. See Lakehouse External Table Syntax to learn more.
-
Use the HEATWAVE_LOAD
command to automatically load the data into an external Lakehouse table with the default settings. You have the option to test the load before implementing it. See Section 4.3.8.1, “Load Data from Object Storage” to learn more about different modes.mysql> CALL sys.HEATWAVE_LOAD(CAST(@input_list AS JSON), NULL); +------------------------------------------+ | INITIALIZING HEATWAVE AUTO PARALLEL LOAD | +------------------------------------------+ | Version: 4.24 | | | | Load Mode: normal | | Load Policy: disable_unsupported_columns | | Output Mode: normal | | | +------------------------------------------+ 6 rows in set (0.0676 sec) +--------------------------------------------------------------------------------------------------------------------+ | LAKEHOUSE AUTO SCHEMA INFERENCE | +--------------------------------------------------------------------------------------------------------------------+ | Verifying external lakehouse tables: 1 | | | | SCHEMA TABLE TABLE IS RAW NUM. OF ESTIMATED SUMMARY OF | | NAME NAME CREATED FILE SIZE COLUMNS ROW COUNT ISSUES | | ------ ----- -------- --------- ------- --------- ---------- | | `bank_marketing` `bank_train` NO 450.66 KiB 17 4.52 K | | | | New schemas to be created: 0 | | External lakehouse tables to be created: 1 | | | +--------------------------------------------------------------------------------------------------------------------+ 10 rows in set (0.0676 sec) +------------------------------------------------------------------------+ | OFFLOAD ANALYSIS | +------------------------------------------------------------------------+ | Verifying input schemas: 1 | | User excluded items: 0 | | | | SCHEMA OFFLOADABLE OFFLOADABLE SUMMARY OF | | NAME TABLES COLUMNS ISSUES | | ------ ----------- ----------- ---------- | | `bank_marketing` 1 17 | | | | Total offloadable schemas: 1 | | | +------------------------------------------------------------------------+ 10 rows in set (0.0676 sec) +-----------------------------------------------------------------------------------------------------------------------------+ | CAPACITY ESTIMATION | +-----------------------------------------------------------------------------------------------------------------------------+ | Default encoding for string columns: VARLEN (unless specified in the schema) | | Estimating memory footprint for 1 schema(s) | | | | TOTAL ESTIMATED ESTIMATED TOTAL DICTIONARY VARLEN ESTIMATED | | SCHEMA OFFLOADABLE HEATWAVE NODE MYSQL NODE STRING ENCODED ENCODED LOAD | | NAME TABLES FOOTPRINT FOOTPRINT COLUMNS COLUMNS COLUMNS TIME | | ------ ----------- --------- --------- ------- ---------- ------- --------- | | `bank_marketing` 1 3.10 MiB 1.12 MiB 10 0 10 7.00 s | | | | Sufficient MySQL host memory available to load all tables. | | Sufficient HeatWave cluster memory available to load all tables. 
| | | +-----------------------------------------------------------------------------------------------------------------------------+ 12 rows in set (0.0676 sec) +---------------------------------------------------------------------------------------------------------------------------------------+ | EXECUTING LOAD SCRIPT | +---------------------------------------------------------------------------------------------------------------------------------------+ | HeatWave Load script generated | | Retrieve load script containing 2 generated DDL command(s) using the query below: | | Deprecation Notice: "heatwave_load_report" will be deprecated, please switch to "heatwave_autopilot_report" | | SELECT log->>"$.sql" AS "Load Script" FROM sys.heatwave_autopilot_report WHERE type = "sql" ORDER BY id; | | | | Adjusting load parallelism dynamically per internal/external table. | | Using current parallelism of 32 thread(s) as maximum for internal tables. | | | | Warning: Executing the generated script may alter column definitions and secondary engine flags in the schema | | | | Using SQL_MODE: ONLY_FULL_GROUP_BY,STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_ENGINE_SUBSTITUTION | | | | Proceeding to load 1 table(s) into HeatWave. | | | | Applying changes will take approximately 7.01 s | | | +---------------------------------------------------------------------------------------------------------------------------------------+ 16 rows in set (0.0676 sec) +-----------------------------------------------+ | TABLE LOAD | +-----------------------------------------------+ | TABLE (1 of 1): `bank_marketing`.`bank_train` | | Commands executed successfully: 2 of 2 | | Warnings encountered: 0 | | Table load succeeded! | | Total columns loaded: 17 | | Elapsed time: 3.50 s | | | +-----------------------------------------------+ 7 rows in set (0.0676 sec) +-------------------------------------------------------------------------------+ | LOAD SUMMARY | +-------------------------------------------------------------------------------+ | | | SCHEMA TABLES TABLES COLUMNS LOAD | | NAME LOADED FAILED LOADED DURATION | | ------ ------ ------ ------- -------- | | `bank_marketing` 1 0 17 3.50 s | | | | Total errors encountered: 0 | | Total warnings encountered: 0 | | | +-------------------------------------------------------------------------------+ 9 rows in set (0.0676 sec) Query OK, 0 rows affected (0.0676 sec)
A summary is generated that allows you to review the details of the load and any warnings or errors.
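On MySQL 9.1.2 and earlier, the delimiter note in the first step above applies. The following is a minimal sketch of the same session variable with field_delimiter and record_delimiter added, assuming comma-separated fields and newline-terminated records; verify the key names against Lakehouse External Table Syntax for your version.
mysql> SET @input_list = '[{ "db_name": "bank_marketing", "tables": [{ "table_name": "bank_train", "engine_attribute": { "dialect": {"format": "csv", "has_header": true, "field_delimiter": ",", "record_delimiter": "\\n"}, "file": [{"par": "pre_authenticated_request"}] } }] }]';
On MySQL 9.2.0 and later, these two keys can be omitted because Lakehouse Autopilot detects the values automatically.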
Once the data is loaded into Lakehouse, you can use the AutoML routines as needed with no changes. For ML_PREDICT_ROW and ML_EXPLAIN_ROW, you must avoid the FROM clause. The following task shows how to use these commands specifically for data loaded into MySQL HeatWave from Lakehouse.
To avoid using the FROM clause, you can insert the data to predict and explain directly into a JSON object.
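If you need predictions for every row of the Lakehouse table rather than a single row, the table-level routines do not have this restriction because they take table names instead of a FROM clause. A minimal sketch, assuming a model handle such as the @bank_model created in the procedure below, and using a hypothetical output table name:
mysql> CALL sys.ML_PREDICT_TABLE('bank_marketing.bank_train', @bank_model, 'bank_marketing.bank_train_predictions', NULL);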
-
Train the model on the table. The task option is set to classification; passing NULL for the options argument would have the same effect, because classification is the default task.
mysql> CALL sys.ML_TRAIN('bank_marketing.bank_train', 'y', JSON_OBJECT('task', 'classification'), @bank_model);
Where:
bank_marketing.bank_train is the fully qualified name of the table that contains the training dataset (schema_name.table_name).
y is the name of the target column, which contains ground truth values.
JSON_OBJECT('task', 'classification') specifies the machine learning task type.
@bank_model is the name of the user-defined session variable that stores the model handle for the duration of the connection. User variables are written as @var_name. Any valid name for a user-defined variable is permitted. For example, @my_model. Learn more about Model Handles.
-
After training, load the trained model.
mysql> CALL sys.ML_MODEL_LOAD(@bank_model, NULL);
-
Insert the data to predict and explain directly into a JSON object named @row_input.
mysql> SET @row_input = JSON_OBJECT( 'age', 37, 'job', 'admin.', 'marital', 'married', 'education', 'unknown', 'default', 'no', 'balance', 734, 'housing', 'yes', 'loan', 'no', 'contact', 'unknown', 'day', 21, 'month', 'may', 'duration', 1106, 'campaign', 1, 'pdays', -1, 'previous', 0, 'poutcome', 'unknown', 'y', 'no');
-
Run ML_PREDICT_ROW to generate predictions on the data in the JSON object. Optionally, use \G to view the output in an easily readable format.
mysql> SELECT sys.ML_PREDICT_ROW(@row_input, @bank_model, NULL)\G
*************************** 1. row ***************************
sys.ML_PREDICT_ROW(@row_input, @bank_model, NULL): {"y": "no", "age": 37, "day": 21, "job": "admin.", "loan": "no", "month": "may", "pdays": -1, "balance": 734, "contact": "unknown", "default": "no", "housing": "yes", "marital": "married", "campaign": 1, "duration": 1106, "poutcome": "unknown", "previous": 0, "education": "unknown", "Prediction": "no", "ml_results": {"predictions": {"y": "no"}, "probabilities": {"no": 0.7052, "yes": 0.2948}}}
1 row in set (1.0027 sec)
The prediction gives a value of no with a probability of about 70%. This is the same as the labeled value for the row in the y column. This demonstrates that the trained model gave a reliable prediction. For a broader check of model quality across many rows, see the scoring example after this procedure.
-
Run ML_EXPLAIN_ROW with the permutation_importance prediction explainer to generate an explanation on the prediction previously generated. Optionally, use \G to view the output in an easily readable format.
mysql> SELECT sys.ML_EXPLAIN_ROW(@row_input, @bank_model, JSON_OBJECT('prediction_explainer', 'permutation_importance'))\G
*************************** 1. row ***************************
sys.ML_EXPLAIN_ROW(@row_input, @bank_model, JSON_OBJECT('prediction_explainer', 'permutation_importance')): {"y": "no", "age": 37, "day": 21, "job": "admin.", "loan": "no", "Notes": "month (may) had the largest impact towards predicting no, whereas duration (1106) contributed the most against predicting no", "month": "may", "pdays": -1, "balance": 734, "contact": "unknown", "default": "no", "housing": "yes", "marital": "married", "campaign": 1, "duration": 1106, "poutcome": "unknown", "previous": 0, "education": "unknown", "Prediction": "no", "ml_results": {"notes": "month (may) had the largest impact towards predicting no, whereas duration (1106) contributed the most against predicting no", "predictions": {"y": "no"}, "attributions": {"month": 0.24, "pdays": 0.02, "contact": 0.15, "default": 0.0, "housing": 0.05, "duration": -0.29, "poutcome": -0.0, "previous": -0.02}}, "month_attribution": 0.24, "pdays_attribution": 0.02, "contact_attribution": 0.15, "default_attribution": 0, "housing_attribution": 0.05, "duration_attribution": -0.29, "poutcome_attribution": 0, "previous_attribution": -0.02}
1 row in set (5.0770 sec)
The explanation determines that the may value in the month column had the greatest impact on generating the previous prediction of no. It also determines that duration contributed the most against the prediction of no.
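To check model quality across many rows rather than a single prediction, you can score the model with ML_SCORE. The following is a minimal sketch that scores against the training table itself, which is only a rough sanity check; scoring a held-out table gives a more honest estimate.
mysql> CALL sys.ML_SCORE('bank_marketing.bank_train', 'y', @bank_model, 'balanced_accuracy', @score, NULL);
mysql> SELECT @score;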
Learn how to Load Data from Object Storage into MySQL HeatWave Cluster using MySQL HeatWave Lakehouse.
Review how to Create a Machine Learning Model.
Review Machine Learning Use Cases to create machine learning models with sample datasets.