While it is possible to define the entire load command on a
single line, for readability the configuration is divided
into option definitions using SET.
Define the following option sets:
- Define the @input_list variable, which loads data from four
  external sources with Avro, CSV, JSON, and Parquet format
  files:

  mysql> SET @input_list = '[
    {"db_name": "tpch",
     "tables": [
       {"table_name": "supplier_pq",
        "engine_attribute": {
          "dialect": {"format": "parquet"},
          "file": [{"prefix": "src_data/parquet/tpch/supplier/",
                    "bucket": "myBucket",
                    "namespace": "myNamespace",
                    "region": "myRegion"}]}},
       {"table_name": "customer_csv",
        "engine_attribute": {
          "dialect": {"format": "csv",
                      "field_delimiter": "|",
                      "record_delimiter": "|\\n",
                      "has_header": true},
          "file": [{"par": "https://objectstorage.../customer.csv"}]}},
       {"table_name": "region_avro",
        "engine_attribute": {
          "dialect": {"format": "avro"},
          "file": [{"par": "https://objectstorage.../region.avro"}]}},
       {"table_name": "nation_json",
        "engine_attribute": {
          "dialect": {"format": "json"},
          "file": [{"par": "https://objectstorage.../nation.json"}]}}
     ]}
  ]';
- Define the @options variable with SET. Setting mode to
  dryrun generates the load script but does not create or
  load the external tables. For example:

  mysql> SET @options = JSON_OBJECT('mode', 'dryrun');
  To implement the changes as part of the load command, set
  mode to normal. This is the default, and it is not
  necessary to add it to the command.

  Set mode to validation to validate the data files against
  the created tables for any potential data errors. For
  example:

  mysql> SET @options = JSON_OBJECT('mode', 'validation');
  Note: validation requires the tables to be created first,
  and it does not load the data into the tables. To load the
  tables, mode must be set to normal.
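  For example, after a successful validation pass, the same
  variables can be reused to perform the actual load. The
  following is a sketch that assumes the @input_list variable
  defined above and the sys.heatwave_load procedure (see
  Section 2.2.3.2, “Auto Parallel Load Syntax”):

  mysql> SET @options = JSON_OBJECT('mode', 'normal');
  mysql> CALL sys.heatwave_load(CAST(@input_list AS JSON), @options);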
Exclude columns from the loading process with the
exclude_columns option. See
Section 2.2.3.2, “Auto Parallel Load Syntax”.
Lakehouse Auto Parallel Load infers the column names for
Avro, JSON, and Parquet files, and also for CSV files if
has_header is true. In these cases, use the inferred column
names with the exclude_columns option. If the table already
exists but no data has been loaded, use the existing column
names with the exclude_columns option.
For CSV files, if has_header is false, use the generated
column names with the exclude_columns option. These are:
col_1, col_2, col_3, and so on.
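As an illustrative sketch only, assuming exclude_columns is
accepted as a table-level key within @input_list (confirm the
exact placement in Section 2.2.3.2, “Auto Parallel Load
Syntax”), a header-less CSV table might exclude the
hypothetical generated columns col_2 and col_5 like this:

  mysql> SET @input_list = '[
    {"db_name": "tpch",
     "tables": [
       {"table_name": "customer_csv",
        "engine_attribute": {
          "dialect": {"format": "csv", "has_header": false},
          "file": [{"par": "https://objectstorage.../customer.csv"}]},
        "exclude_columns": ["col_2", "col_5"]}
     ]}
  ]';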