For the full Auto Parallel Load syntax, see: Section 2.2.3, “Loading Data Using Auto Parallel Load”.
As of MySQL 8.0.33-u3, HeatWave Lakehouse extends Auto Parallel Load with the
external_tables
option. This is a JSON
array that includes one or more
db_object
.
Do not use as of MySQL 8.4.0. Use
db_object
with table
or exclude_tables
instead.
external_tables
will be deprecated in a
future release.
db_object: {
JSON_OBJECT("key","value"[,"key","value"] ...)
"key","value": {
"db_name": "name",
"tables": JSON_ARRAY(table [, table] ...)
}
}
table: {
JSON_OBJECT("key","value"[,"key","value"] ...)
"key","value": {
"table_name": "name",
"sampling": true|false,
"dialect": {dialect_section},
"file": JSON_ARRAY(file_section [, file_section]...),
}
}
-
db_object
: the details of one or more tables. Eachdb_object
contains the following:db_name
: name of the database. If the database does not exist, Lakehouse Auto Parallel Load creates it during the load process.-
tables
: a JSON array oftable
. Eachtable
contains the following:table_name
: the name of the table to load.-
sampling
: if set totrue
, the default setting, Lakehouse Auto Parallel Load infers the schema by sampling the data and collect statistics.If set to
false
, Lakehouse Auto Parallel Load performs a full scan to infer the schema and collect statistics. Depending on the size of the data, this can take a long time.Auto Parallel Load uses the inferred schema to generate
CREATE TABLE
statements. The statistics are used to estimate storage requirements and load times. dialect
: details about the file format. See thedialect
parameter in Section 5.2.2, “Lakehouse External Table Syntax”.file
: the location of the data in Object Storage. This can use a pre-authenticated request or a resource principal, and can be a path to a file, a file prefix, or a file pattern. See thefile
parameter in Section 5.2.2, “Lakehouse External Table Syntax”, and see: Section 5.4, “Access Object Storage”.
While it is possible to define the entire load command on
a single line, for readability the configuration is
divided into option definitions using
SET
.
Define the following option sets:
-
Define the name of the database which will store the data.
mysql> SET @db_list = '["tpch"]';
This assumes that Lakehouse Auto Parallel Load will analyze the data, infer the table structure, and create the database and all tables. See: Section 5.2.4.2, “Lakehouse Auto Parallel Load with the external_tables Option”.
-
Define the
db_object
parameters that will load data from three external sources with Avro, CSV and Parquet format files:mysql> SET @ext_tables = '[ { "db_name": "tpch", "tables": [{ "table_name": "supplier_pq", "dialect": { "format": "parquet" }, "file": [{ "prefix": "src_data/parquet/tpch/supplier/", "bucket": "myBucket", "namespace": "myNamespace", "region": "myRegion" }] }, { "table_name": "nation_csv", "dialect": { "format": "csv", "field_delimiter": "|", "record_delimiter": "|\\n", "has_header": true }, "file": [{ "par": "https://objectstorage.../nation.csv" }] }, { "table_name": "region_avro", "dialect": { "format": "avro" }, "file": [{ "par": "https://objectstorage.../region.avro" }] }] } ]';
-
Define the
@options
variable withSET
:mysql> SET @options = JSON_OBJECT('external_tables', CAST(@ext_tables AS JSON));
Setting
mode
todryrun
generates the load script but does not create or load the external tables. For example:mysql> SET @options = JSON_OBJECT('mode', 'dryrun', 'external_tables', CAST(@ext_tables AS JSON));
To implement the changes as part of the load command, set
mode
tonormal
. This is the default, and it is not necessary to add it to the command.
Exclude columns from the loading process with the
exclude_list
option. See
Section 2.2.3.2, “Auto Parallel Load Syntax”.
Lakehouse Auto Parallel Load infers the column names for Avro and Parquet files,
and also for CSV files if has_header
is
true
. For these situations, use the
column names with the exclude_list
option.
If the table already exists, but no data has been loaded,
use the existing column names with the
exclude_list
option.
For CSV files if has_header
is
false
, use the generated schema names
with the exclude_list
option. These
are: col_1
, col_2
,
col_3
...