HeatWave User Guide  /  ...  /  Lakehouse Auto Parallel Load with the external_tables Option

5.2.4.2 Lakehouse Auto Parallel Load with the external_tables Option

For the full Auto Parallel Load syntax, see: Section 2.2.3, “Loading Data Using Auto Parallel Load”.

HeatWave Lakehouse extends Auto Parallel Load with the external_tables option. This is a JSON array that includes one or more db_object.

Do not use as of MySQL 8.4.0. Use db_object with table or exclude_tables instead. external_tables will be deprecated in a future release.

db_object: {
     JSON_OBJECT("key","value"[,"key","value"] ...)
          "key","value": {
          "db_name": "name",
          "tables": JSON_ARRAY(table [, table] ...)
          }
}

table: {
     JSON_OBJECT("key","value"[,"key","value"] ...)
          "key","value": {
          "table_name": "name",
          "sampling": true|false,
          "dialect": {dialect_section},
          "file": JSON_ARRAY(file_section [, file_section]...), 
          }
}
  • db_object: the details of one or more tables. Each db_object contains the following:

    • db_name: name of the database. If the database does not exist, Lakehouse Auto Parallel Load creates it during the load process.

    • tables: a JSON array of table. Each table contains the following:

      • table_name: the name of the table to load.

      • sampling: if set to true, the default setting, Lakehouse Auto Parallel Load infers the schema by sampling the data and collect statistics.

        If set to false, Lakehouse Auto Parallel Load performs a full scan to infer the schema and collect statistics. Depending on the size of the data, this can take a long time.

        Auto Parallel Load uses the inferred schema to generate CREATE TABLE statements. The statistics are used to estimate storage requirements and load times.

      • dialect: details about the file format. See the dialect parameter in Section 5.2.2, “Lakehouse External Table Syntax”.

      • file: the location of the data in Object Storage. This can use a pre-authenticated request or a resource principal, and can be a path to a file, a file prefix, or a file pattern. See the file parameter in Section 5.2.2, “Lakehouse External Table Syntax”, and see: Section 5.4, “Access Object Storage”.

Syntax Examples

While it is possible to define the entire load command on a single line, for readability the configuration is divided into option definitions using SET.

Define the following option sets:

  • Define the name of the database which will store the data.

    mysql> SET @db_list = '["tpch"]';

    This assumes that Lakehouse Auto Parallel Load will analyze the data, infer the table structure, and create the database and all tables. See: Section 5.2.4.2, “Lakehouse Auto Parallel Load with the external_tables Option”.

  • Define the db_object parameters that will load data from three external sources with Avro, CSV and Parquet format files:

    mysql> SET @ext_tables = '[
    {
      "db_name": "tpch",
      "tables": [{
        "table_name": "supplier_pq",
        "dialect": {
          "format": "parquet"
        },
        "file": [{
          "prefix": "src_data/parquet/tpch/supplier/",
          "bucket": "myBucket",
          "namespace": "myNamespace",
          "region": "myRegion"
        }]
      },
      {
        "table_name": "nation_csv",
        "dialect": {
          "format": "csv",
          "field_delimiter": "|",
          "record_delimiter": "|\\n",
          "has_header": true
        },
        "file": [{
          "par": "https://objectstorage.../nation.csv"
        }]
      },
      {
        "table_name": "region_avro",
        "dialect": {
          "format": "avro"
        },
        "file": [{
          "par": "https://objectstorage.../region.avro"
        }]
      }]
    }
    ]';
  • Define the @options variable with SET:

    mysql> SET @options = JSON_OBJECT('external_tables', CAST(@ext_tables AS JSON));

    Setting mode to dryrun generates the load script but does not create or load the external tables. For example:

    mysql> SET @options = JSON_OBJECT('mode', 'dryrun', 'external_tables', CAST(@ext_tables AS JSON));

    To implement the changes as part of the load command, set mode to normal. This is the default, and it is not necessary to add it to the command.

Exclude columns from the loading process with the exclude_list option. See Section 2.2.3.2, “Auto Parallel Load Syntax”.

Lakehouse Auto Parallel Load infers the column names for Avro and Parquet files, and also for CSV files if has_header is true. For these situations, use the column names with the exclude_list option.

If the table already exists, but no data has been loaded, use the existing column names with the exclude_list option.

For CSV files if has_header is false, use the generated schema names with the exclude_list option. These are: col_1, col_2, col_3 ...