Documentation Home
MySQL HeatWave User Guide
Related Documentation Download this Manual
PDF (US Ltr) - 1.0Mb
PDF (A4) - 1.0Mb


MySQL HeatWave User Guide  /  ...  /  Loading Data Using Auto Parallel Load

2.2.3 Loading Data Using Auto Parallel Load

Auto Parallel Load facilitates the process of loading data into HeatWave by automating many of the steps involved, including:

  • Excluding schemas, tables, and columns that cannot be loaded.

  • Verifying that there is sufficient memory available for the data.

  • Optimizing load parallelism based on machine-learning models.

  • Loading data into HeatWave.

Auto Parallel Load, which can be run from any MySQL client or connector, is implemented as a stored procedure named heatwave_load, which resides in the MySQL sys schema. Running Auto Parallel Load involves issuing a CALL statement for the stored procedure, which takes schemas and options as arguments; for example, this statement loads the tpch schema:

CALL sys.heatwave_load(JSON_ARRAY("tpch"),NULL);

Auto Parallel Load is described under the following topics in this section:

Auto Parallel Load Requirements

  • The user must have the following MySQL privileges:

  • To run Auto Parallel Load in "normal" mode, the HeatWave Cluster must be active.

Auto Parallel Load Syntax

CALL sys.heatwave_load (db_list,[options]);

db_list: {
     JSON_ARRAY({""|"schema_name"[,"schema_name"] ...)
}

options: {
     JSON_OBJECT("key","value"[,"key","value"] ...)
          "key","value": {
          ["mode",{"normal"|"dryrun"}]
          ["output",{"normal"|"compact"|"silent"|"help"}]
          ["sql_mode","sql_mode"]
          ["policy",{"disable_unsupported_columns"|"not_disable_unsupported_columns"}]
          ["exclude_list",JSON_ARRAY("db_object"[,"db_object"] ...)]
          ["set_load_parallelism",{TRUE|FALSE}]
          ["auto_enc",JSON_OBJECT("mode",{"off"|"check"}]
}

db_list specifies the schemas to load. The list is specified as a JSON_ARRAY. Specifying an empty array is permitted for viewing the Auto Parallel Load command-line help (see Auto Parallel Load Command-Line Help). Otherwise, one or more valid schema names are required.

options are specified as key-value pairs in JSON object format. If an option is not specified, the default setting is used. If no options are specified, NULL can be specified in place of the option's argument.

For syntax examples, see Auto Parallel Load Examples.

Auto Parallel Load options include:

  • mode: Defines the Auto Parallel Load operational mode. Permitted values are:

    • normal: The default. Generates and executes the load script.

    • dryrun: Generates a load script only. Auto Parallel Load executes in dryrun mode automatically if the HeatWave Cluster is not active.

  • output: Defines how Auto Parallel Load produces output. Permitted values are:

  • sql_mode: Defines the SQL mode used while loading tables. Auto Parallel Load does not support the MySQL global or session sql_mode variable. To run Auto Parallel Load with a non-oci-default SQL mode configuration, specify the configuration using the Auto Parallel Load sql_mode option as a string value. If no SQL modes are specified, the default OCI SQL mode configuration is used.

    For information about SQL modes, see Server SQL Modes.

  • policy: Defines the policy for handling of tables containing columns with unsupported data types. Permitted values are:

    • disable_unsupported_columns: The default. Disable columns with unsupported data types and include the table in the load script. Columns that are explicitly pre-defined as NOT SECONDARY are ignored (they are neither disabled or enabled).

      Auto Parallel Load does not generate statements to disable columns that are explicitly defined as NOT SECONDARY.

    • not_disable_unsupported_columns: Exclude the table from the load script if the table contains a column with an unsupported data type.

      A column with an unsupported data type that is explicitly defined as a NOT SECONDARY column does not cause the table to be excluded. For information about defining columns as NOT SECONDARY, see Excluding Table Columns.

  • exclude_list: Defines a list of database objects (schemas, tables, and columns) to exclude from the load script. Names must be fully qualified without backticks, as in the following example:

    CALL sys.heatwave_load(JSON_ARRAY("tpch","airportdb"), 
    JSON_OBJECT("exclude_list",JSON_ARRAY("tpch.orders","airportdb.employee.salary")));

    Auto Parallel Load automatically excludes database objects that cannot be offloaded (according to the default policy setting). These objects need not be specified explicitly in the exclude list. System schemas, non-InnoDB tables, tables that are already loaded in HeatWave, and columns explicitly defined as NOT SECONDARY are excluded automatically.

  • set_load_parallelism: Enabled by default. Optimizes load parallelism based on machine-learning models by optimizing the innodb_parallel_read_threads variable setting before loading each table.

  • auto_enc: Checks if there is enough memory for string column encoding. Settings include:

    • mode: Defines the auto_enc operational mode. Permitted values are:

      • off: Disables the auto_enc option.

      • check: The default. Checks if there is enough memory on the MySQL node for dictionary-encoded columns and if there is enough root heap memory for variable-length column encoding overhead. Dictionary-encoded columns require memory on the MySQL node for dictionaries. For each loaded table, 4MB of memory (the default heap segment size) must be allocated from the root heap for variable-length column encoding overhead. From MySQL 8.0.30, the default heap segment size is reduced from 4GB to a default of 64KB per table. If there is not enough memory, Auto Parallel Load executes in "dryrun" mode and prints a warning about insufficient memory. The auto_enc option runs check mode if it is not specified explicitly and set to off. For more information, see Memory Estimation for String Column Encoding.

Running Auto Parallel Load

An Auto Parallel Load call that loads a single schema ("tpch") is written as follows:

CALL sys.heatwave_load(JSON_ARRAY("tpch"),NULL);

Specifying NULL in place of options means that Auto Parallel Load is run with the default option settings.

You can run Auto Parallel Load in "dryrun" mode first to check for errors and warnings and to inspect the generated load script. An Auto Parallel Load call in "dryrun" mode that specifies a single schema ("tpch") is written as follows:

CALL sys.heatwave_load(JSON_ARRAY("tpch"), JSON_OBJECT("mode","dryrun"));

In "dryrun" mode, Auto Parallel Load sends the load script to the heatwave_load_report table only. It does not load data into HeatWave.

If Auto Parallel Load fails with an error, inspect the errors by querying the heatwave_load_report table:

SELECT log FROM sys.heatwave_load_report WHERE type="error";

When Auto Parallel Load finishes running, use the following query to check for warnings:

SELECT log FROM sys.heatwave_load_report WHERE type="warn";

Issue the following query to inspect the load script that was generated:

SELECT log->>"$.sql" AS "Load Script" FROM sys.heatwave_load_report 
WHERE type = "sql" ORDER BY id;

Once you are satisfied with the Auto Parallel Load CALL statement and the generated load script, reissue the CALL statement in "normal" mode to load the data into HeatWave. For example:

CALL sys.heatwave_load(JSON_ARRAY("tpch"), JSON_OBJECT("mode","normal"));
Note

You can retrieve DDL statements in a table or use the following statements to produce a list of DDL statements that you can easily copy and paste:

SET SESSION group_concat_max_len = 1000000;
SELECT GROUP_CONCAT(log->>"$.sql" SEPARATOR ' ') FROM sys.heatwave_load_report
  WHERE type = "sql" ORDER BY id;

The time required to load data depends on the data size. Auto Parallel Load provides an estimate of the time required to complete the load operation.

Tables are loaded in sequence, ordered by schema and table name. Load-time errors are reported as they are encountered. If an error is encountered while loading a table, the operation is not terminated. Auto Parallel Load continues running, moving on to the next table.

When Auto Parallel Load finishes running, it checks if tables are loaded and shows a summary with the number of tables that were loaded and the number of tables that failed to load.

Memory Estimation for String Column Encoding

The auto_enc option is run in check mode by default to ensure that there is enough memory for string column encoding.

The following example uses the auto_enc option in check mode, if you want to ensure that there is sufficient memory for string column encoding before attempting a load operation. Insufficient memory can cause a load failure.

CALL sys.heatwave_load(JSON_ARRAY("tpch"), 
  JSON_OBJECT("mode","dryrun","auto_enc",JSON_OBJECT("mode","check")));
Note

The auto_enc option runs in check mode regardless of whether it is specified explicitly in the Auto Parallel Load call statement.

Look for capacity estimation data in the Auto Parallel Load output. The results indicate whether there is sufficient memory to load all tables.

The Auto Parallel Load Report Table

When Auto Parallel Load is run, output including Auto Parallel Load execution logs and the generated load script is sent to the heatwave_load_report table in the sys schema.

The heatwave_load_report table is a temporary table. It contains data from the last execution of Auto Parallel Load. Data is only available for the current session and is lost when the session terminates or when the server is shut down.

Auto Parallel Load Report Table Query Examples

The heatwave_load_report table can be queried after running Auto Parallel Load, as in the following examples:

  • View error information in case Auto Parallel Load stops unexpectedly:

    SELECT log FROM sys.heatwave_load_report WHERE type="error";
  • View warnings to find out why tables cannot be loaded:

    SELECT log FROM sys.heatwave_load_report WHERE type="warn";
  • View the generated load script to see commands that would be executed by Auto Parallel Load in "normal" mode:

    SELECT log->>"$.sql" AS "Load Script" FROM sys.heatwave_load_report 
    WHERE type = "sql" ORDER BY id;
  • View the number of load commands generated:

    SELECT Count(*) AS "Total Load Commands Generated" FROM sys.heatwave_load_report 
    WHERE type = "sql" ORDER BY id;
  • View load script data for a particular table:

    SELECT log->>"$.sql" FROM sys.heatwave_load_report  
    WHERE type="sql" AND log->>"$.schema_name" = "db0" AND log->>"$.table_name" = "tbl" 
    ORDER BY id;
  • Concatenate Auto Parallel Load generated DDL statements into a single string that can be copied and pasted for execution. The group_concat_max_len variable sets the result length in bytes for the GROUP_CONCAT() function to accommodate a potentially long string. (The default group_concat_max_len setting is 1024 bytes.)

    SET SESSION group_concat_max_len = 1000000;
    SELECT GROUP_CONCAT(log->>"$.sql" SEPARATOR ' ') FROM sys.heatwave_load_report 
      WHERE type = "sql" ORDER BY id;

Auto Parallel Load Command-Line Help

To view Auto Parallel Load command-line help, issue the following statement:

CALL sys.heatwave_load(JSON_ARRAY(""),JSON_OBJECT("output","help"));

The command-line help provides usage documentation for the Auto Parallel Load utility.

Auto Parallel Load Examples

  • Load multiple schemas. No options are specified, which means that the default options are used.

    CALL sys.heatwave_load(JSON_ARRAY("tpch","airportdb","employees","sakila"),NULL);
  • Load multiple schemas with the not_disable_unsupported_columns policy, which causes tables with unsupported columns to be excluded from the load operation. Unsupported columns are those with unsupported data types.

    CALL sys.heatwave_load(JSON_ARRAY("tpch","airportdb","employees","sakila"), 
    JSON_OBJECT("policy","not_disable_unsupported_columns"));
  • Load multiple schemas, excluding specified tables and a particular column:

    CALL sys.heatwave_load(JSON_ARRAY("tpch","airportdb"), 
    JSON_OBJECT("exclude_list",JSON_ARRAY("tpch.orders","airportdb.employee.salary")));
  • Load tables that begin with an hw prefix from a schema named schema_customer_1.

    SET @exc_list = (SELECT JSON_OBJECT('exclude_list',
                     JSON_ARRAYAGG(CONCAT(table_schema,'.',table_name))) 
                     FROM information_schema.tables 
                     WHERE table_schema = 'schema_customer_1' 
                     AND table_name NOT LIKE 'hw%');
    CALL sys.heatwave_load(JSON_ARRAY('schema_customer_1'), @exc_list);
  • Load all schemas with tables that start with an hw prefix.

    SET @db_list = (SELECT json_arrayagg(schema_name) FROM information_schema.schemata);
    SET @exc_list = (SELECT JSON_OBJECT('exclude_list',
                     JSON_ARRAYAGG(CONCAT(table_schema,'.',table_name))) 
                     FROM information_schema.tables 
                     WHERE table_schema NOT IN
                     ('mysql','information_schema', 'performance_schema','sys') 
                     AND table_name NOT LIKE 'hw%');
    CALL sys.heatwave_load(@db_list, @exc_list);

    You can check db_list and exc_list using SELECT JSON_PRETTY(@db_list); and SELECT JSON_PRETTY(@exc_list);

  • Call Auto Parallel Load from a stored procedure:

    DROP PROCEDURE IF EXISTS auto_load_wrapper;
    DELIMITER //
    CREATE PROCEDURE auto_load_wrapper()
    BEGIN
      -- AUTOMATED INPUT
      SET @db_list = (SELECT JSON_ARRAYAGG(schema_name) FROM information_schema.schemata);
      SET @exc_list = (SELECT JSON_ARRAYAGG(CONCAT(table_schema,'.',table_name)) 
      FROM information_schema.tables WHERE table_schema = "db0");
    
      CALL sys.heatwave_load(@db_list, JSON_OBJECT("output","silent","exclude_list", 
      CAST(@exc_list AS JSON)));
    
      -- CUSTOM OUTPUT
      SELECT log as 'Unsupported objects' FROM sys.heatwave_load_report WHERE type="warn" 
      AND stage="VERIFICATION" and log like "%Unsupported%";
      SELECT Count(*) AS "Total Load commands Generated" 
      FROM sys.heatwave_load_report WHERE type = "sql" ORDER BY id;
    
    END //
    DELIMITER ;
    
    CALL auto_load_wrapper();