MySQL :: HeatWave User Guide :: 4.3.6.2 Lakehouse External Table JSON Syntax

CREATE TABLE Statement

As of MySQL 9.4.0, you can use the CREATE EXTERNAL TABLE statement to specify ENGINE_ATTRIBUTE options.

CREATE EXTERNAL TABLE table_name 
(col_name data_type [ENGINE_ATTRIBUTE='Col_Engine_Attribute_JSON'], ...)
ENGINE_ATTRIBUTE='Tbl_Engine_Attribute_JSON';

If you are on a version earlier than MySQL 9.4.0, you must use the CREATE TABLE statement to specify ENGINE_ATTRIBUTE options. You must also manually set ENGINE to lakehouse, and SECONDARY_ENGINE to rapid.

CREATE TABLE table_name 
(col_name data_type [ENGINE_ATTRIBUTE='Col_Engine_Attribute_JSON'], ...)
ENGINE=lakehouse 
SECONDARY_ENGINE=rapid 
ENGINE_ATTRIBUTE='Tbl_Engine_Attribute_JSON';

After creating a table, you can load data into the table in the following ways:

Load the data automatically with Lakehouse Auto Parallel Load by using the HEATWAVE_LOAD command. See how to Load External Data Using Lakehouse Auto Parallel Load with an Existing Table.
Load the data manually with the ALTER TABLE statement and the option to use Guided Load.

You can also use ENGINE_ATTRIBUTE options with the HEATWAVE_LOAD command to load data using Lakehouse Auto Parallel Load. See HEATWAVE_LOAD and Load Structured Data Using Lakehouse Auto Parallel Load.

External Table ENGINE_ATTRIBUTE Options

The external table ENGINE_ATTRIBUTE is a JSON object that consists of the dialect parameters and the file parameters. They are represented in the following ways:

file_JSON_array is a JSON array containing the JSON objects representing the file location details and file parameters. You can use the following methods to specify these details:
- Uniform Resource Identifiers (URI)
  
  Available as of MySQL 9.3.1 for MySQL HeatWave on OCI. You can provide an OCIFS URI, PAR URI, or Native URI.
- Resource Principals (RP)
  
  See Resource Principals in Oracle Cloud Infrastructure Documentation.
- Pre-Authenticated Requests (PAR)
  
  See Object Storage Pre-Authenticated Requests in Oracle Cloud Infrastructure Documentation.
Refer to File Options.
dialect_JSON_object is a JSON object containing the dialect parameters to specify the data format in the files.

Tbl_Engine_Attribute_JSON: {
  begin-object
    "file": file_JSON_array
    [, "dialect": dialect_JSON_object ]
    [, "auto_refresh_event_source": {"stream_OCID"}
    [, "match_columns_by": {"order" | "name_case_sensitive" | "name_case_insensitive"}]
    [, "allow_missing_columns": {true|false}]
  end-object
}

file_JSON_array: {
  begin-array file_def [, file_def] … end-array
}    

file_def: file_RP_JSON | file_PAR_JSON | file_URI_JSON

file_def_RP_JSON: {
  "bucket": "bucket_name",
  "namespace": "namespace_name",
  "region": "region_name",
  {"prefix": "prefix_value" | "name": "filename" | "pattern" : "pattern_value"},
  "is_strict_mode": {true | false},
  "allow_missing_files": {true | false}
}

file_PAR_JSON: {
  "par": "PAR URL",
  ["prefix": "prefix_value", | "name": "filename", | "pattern" : "pattern_value",]
  "is_strict_mode": {true | false},
  "allow_missing_files": {true | false}
}

file_URI_JSON: {
  ("uri": "OCIFS_URI") | ("uri": "PAR_URI") | ("uri": "NATIVE_URI"),
  "is_strict_mode": {true | false},
  "allow_missing_files": {true | false}
}

dialect_JSON_object: begin-object dialect_option_key_value [, dialect_option_key_value]... end-object
 
dialect_option_key_value: {
    "format": {"avro" | "csv" | "json" | "parquet" | "delta"}  
  | "check_constraints": {true | false}
  | "field_delimiter": {"|" | "," | "\\t" | ";" | "auto" | "custom_field_delimiter"} 
  | "record_delimiter": {"\\r" | "\\n" | "\\r\\n" | "auto" | "custom_record_delimiter"} 
  | "escape_character": {"\\\\" | "" | "custom_escape_character"} 
  | "quotation_marks": {"\\"" | "" | "custom_quotation_marks"}  
  | "encoding": "utf8mb4"
  | "date_format": "custom_date_format" 
  | "time_format": "custom_time_format"
  | "timestamp_format": "custom_timestamp_or_datetime_format"
  | "datetime_format": "custom_timestamp_or_datetime_format"
  | "trim_spaces": {true | false}
  | "skip_rows": Integer_20
  | "has_header": {true | false}
  | "is_strict_mode": {true | false}
  | "compression": {"zip" | "gzip" | "bzip2" | "auto"}
  | "embed_model_id": {"MySQL HeatWave_in_database_embedding_model"}
} 
Integer_20: Represents integer value between 0 and 20.

Note

When you specify ENGINE_ATTRIBUTE parameters in JSON format, you must escape certain characters using \ (backslash). For example, to specify linefeed as a record delimiter, you need to specify it as "\\n" in the ENGINE_ATTRIBUTE parameter’s JSON value.
In the syntax description, square brackets ("[" and "]") indicate optional words or clauses.
The syntax description uses a vertical bar ("|") to separate alternatives. The list of alternatives are enclosed in square brackets "[" and "]") when you must choose one member. The list of alternatives are enclosed in curly braces "{" and "}") when you must choose one member.
The terms begin-array and end-array in the syntax represent a JSON array. While writing the sample, replace them with [].
The terms begin-object and end-object in the syntax represent a JSON object. While writing the sample, replace them with {}.
The following examples create tables with the CREATE EXTERNAL TABLE statement. If you are on a version earlier than MySQL 9.4.0, you must use CREATE TABLE and set ENGINE to lakehouse, and SECONDARY_ENGINE to rapid. See CREATE TABLE Statement.

Column Definitions

For versions earlier than MySQL 9.5.0, you can override some options at the column level by specifying column-level parameters. For MySQL 9.5.0 and later, you can do this by loading data manually with Guided Load disabled. You can set the date_format, time_format, timestamp_format, datetime_format (as of MySQL 9.5.1), and model as optional column-level ENGINE_ATTRIBUTE parameters. If you set both timestamp_format and datetime_format, only the datetime_format is used for inference and loading of timestamp/datetime values. We recommend using datetime_format instead of timestamp_format. For more information, refer to dialect parameters.

As of MySQL 9.5.0, Lakehouse Auto Parallel Load and Guided Load can infer supported formats for temporal data types by setting the appropriate dialect-level temporal formats to auto. The auto setting is the default value if no temporal format is specified. If you set a specific format at the column level for a temporal data type when creating a table, and a different format is inferred when loading the table to Lakehouse, the specific format used for the column when creating the table is prioritized and used for loading the column. See About Lakehouse Auto Parallel Load Schema Inference to learn more.

Note

The temporal column-level parameters of date_format, time_format, timestamp_format, and datetime_format are only available for CSV files. See Dialect Parameters for CSV Files.

Col_Engine_Attribute_JSON: begin-object column_option_key_value [, column_option_key_value]... end-object
 
column_option_key_value: {
    "date_format": "custom_date_format"  
  | "time_format": "custom_time_format"
  | "timestamp_format": "custom_timestamp_or_datetime_format"
  | "datetime_format": "custom_timestamp_or_datetime_format"
  | "model": {MySQL HeatWave_in_database_embedding_model}
}

model: Supported as of MySQL 9.3.2. Lets you specify the column-level embedding model for VECTOR columns. You must specify a model supported by MySQL HeatWave GenAI. If the vector uses another embedding model, you cannot use it with GenAI features. You can include an unsupported embedding model in the column or table comments for your own reference. Learn more about the Vector Data Type.

The following example sets the embedding model for the column to multilingual-e5-small.
```
mysql> CREATE EXTERNAL TABLE table_1(col_1 VECTOR(5) ENGINE_ATTRIBUTE '{"model": "multilingual-e5-small"}')
        ENGINE_ATTRIBUTE='{"dialect": {"format": "csv"},
                           "file": [{"uri": "oci://mybucket@mynamespace/data_files/data_file_1.csv"}]}';
```

The following example sets the date_format, time_format, and datetime_format at the column level.

mysql> CREATE EXTERNAL TABLE table_1(
        col_1 date ENGINE_ATTRIBUTE '{"date_format": "%d/%m/%Y"}',
        col_2 time ENGINE_ATTRIBUTE '{"time_format": "%I:%i %p"}',
        col_3 timestamp ENGINE_ATTRIBUTE '{"datetime_format": "%Y-%m-%d %H:%i:%s"}',
        col_4 int)
        ENGINE_ATTRIBUTE='{"dialect": {"format": "csv"},
                           "file": [{"uri": "oci://mybucket@mynamespace/data_files/data_file_1.csv"}]}';

The following example sets the date_format, time_format, and datetime_format options to auto, which allows Lakehouse Auto Parallel Load and Guided Load to infer supported formats for temporal data types.

mysql> CREATE EXTERNAL TABLE table_1(
        col_1 date ENGINE_ATTRIBUTE '{"date_format": "auto"}',
        col_2 time ENGINE_ATTRIBUTE '{"time_format": "auto"}',
        col_3 datetime ENGINE_ATTRIBUTE '{"datetime_format": "auto"}',
        col_4 int)
        ENGINE_ATTRIBUTE='{"dialect": {"format": "csv"},
                           "file": [{"uri": "oci://mybucket@mynamespace/data_files/data_file_1.csv"}]}';

For more information about other column definitions, see Dialect Parameters for CSV Files.

File Options

The following parameters are required to create an external table:

ENGINE: Set to lakehouse if you are on a version earlier than MySQL 9.4.0. For MySQL 9.4.0 and later, you can use the CREATE EXTERNAL TABLE statement, which automatically sets ENGINE to lakehouse.
SECONDARY_ENGINE: Set to rapid if you are on a version earlier than MySQL 9.4.0. For MySQL 9.4.0 and later, you can use the CREATE EXTERNAL TABLE statement, which automatically sets SECONDARY_ENGINE to rapid.
ENGINE_ATTRIBUTE: JSON object literal. Specifies the location of files, the file format, and how the file format is handled.

Use key-value pairs in JSON format to specify options. Lakehouse uses the default setting if there is no defined option. Use NULL to specify no arguments.
file: Defines the Object Storage files. You can define the file locations using resource principals, uniform resource identifier (URI) (as of MySQL 9.3.1 for MySQL HeatWave on OCI), or PAR URL.

As of MySQL 9.4.1, Lakehouse supports a maximum of 164 file locations. As of MySQL 9.3.1, Lakehouse supports a maximum of 170 file locations. Before MySQL 9.3.1, Lakehouse supports a maximum of 256 file locations. To define more than the maximum number of files, store the files under the same bucket or use prefix or pattern.
- file parameters for uniform resource identifiers (As of MySQL 9.3.1 for MySQL HeatWave on OCI):
  - uri: The URL depending on the type of URI you define. Do not provide a pattern, prefix, or name parameter with the uri. These details are inferred by the object_path you define.
    
    The following examples use the URI syntax for OCIFS:
    
    oci://bucket_name@namespace_name/object_path
    
    Name: If the object_path is neither a glob pattern nor prefix.
    
    The following example specifies a single file by using an OCIFS URI.
    
    mysql> CREATE EXTERNAL TABLE table_1( col_1 int, col_2 int, col_3 int) ENGINE_ATTRIBUTE='{"dialect": {"format": "csv"}, "file": [{"uri": "oci://mybucket@mynamespace/data_files/data_file_1.csv"}]}';
    
    Prefix: If the object_path is not a glob pattern and ends with an unencoded / character, such as a folder path.
    
    The following example uses a prefix with an OCIFS URI. The prefix specifies files in an Object Storage folder.
    
    mysql> CREATE EXTERNAL TABLE table_1( col_1 int, col_2 int, col_3 int) ENGINE_ATTRIBUTE='{"dialect": {"format": "csv"}, "file": [{"uri": "oci://mybucket@mynamespace/data_files/"}]}';
    
    Glob pattern: If the object_path contains at least one of the following unencoded characters: ?, *, or [. To use these characters as literals, you need to escape them or encode them as needed depending on the URI syntax. Regex patterns are not supported. See Glob Patterns from the Oracle Cloud Infrastructure Documentation to learn more.
    
    The following example uses an unencoded * character to create a glob pattern.
    
    mysql> CREATE EXTERNAL TABLE table_1( col_1 int, col_2 int, col_3 int) ENGINE_ATTRIBUTE='{"dialect": {"format": "csv"}, "file": [{"uri": "oci://mybucket@mynamespace/data_files/data_files/data_file_*.csv"}]}';
  Review additional examples that use URIs to create external tables using Auto Parallel Load or manually.
- file parameters for resource principals:
  - bucket_name: Buckets allow you to storage objects in a compartment. To learn more about buckets, see Object Storage Buckets in Oracle Cloud Infrastructure Documentation.
  - namespace_name: This is the top-level container for all buckets and objects. To learn how to view your namespace, see Object Storage Namespaces in Oracle Cloud Infrastructure Documentation.
  - region_name: You need to define the region identifier for your OCI tenancy. To view the list of available regions and region identifiers, see Regions and Availability Domains in Oracle Cloud Infrastructure Documentation.
    
    As of MySQL 9.2.1, if the region is not specified, the instance's associated region is used as the default region. Otherwise, the specified region is used.
  The following example specifies a single file by using resource principals.
```
mysql> CREATE EXTERNAL TABLE table_1(
        col_1 int,
        col_2 int,
        col_3 int)
        ENGINE_ATTRIBUTE='{"dialect": {"format": "csv"},
                           "file": [{"region": "us-ashburn-1", 
                                     "namespace": "tenant_1", 
                                     "bucket": "bucket_1", 
                                     "name": "data_file_1.csv"}]}';
```
  Review additional examples that use resource principals to create external tables using Lakehouse Auto Parallel Load or manually.
- file parameters for pre-authenticated requests:
  - par: The PAR URL.
    
    Do not specify a region, namespace or bucket with par. These values are included in the PAR URL and generate an error if defined as separate parameters. For more information, refer to PAR Recommendations.
  The following example specifies a single file with a PAR.
```
mysql> CREATE EXTERNAL TABLE table_1(
        col_1 int,
        col_2 int,
        col_3 int)
        ENGINE_ATTRIBUTE='{"dialect": {"format": "csv"},
                           "file": [{"par": "https://objectstorage.us-ashburn-1.oraclecloud.com/p/.../n/tenancy_1/b/bucket_1/o/data_file_1.csv"}]}';
```
  Review additional examples that use PARs to create external tables using Lakehouse Auto Parallel Load or manually.
The following file parameters apply only to PAR and resource principals. Use one of the parameters, unless the target defines a specific file.
- name: A specific Object Storage file name. For pre-authenticated requests, the file name provided in the PAR URL is considered.
  
  The following example uses a PAR and the name parameter to specify a file in an Object Storage bucket. The PAR points to the Object Storage bucket.
```
mysql> CREATE EXTERNAL TABLE table_1(
        col_1 int,
        col_2 int,
        col_3 int)
        ENGINE_ATTRIBUTE='{"dialect": {"format": "csv"},
                           "file": [{"par": "https://objectstorage.us-ashburn-1.oraclecloud.com/p/.../n/tenancy_1/b/bucket_1/o/",
                                     "name": "data_files/data_file_1.csv"}]}';
```
- pattern: A regular expression that defines a set of Object Storage files.
  
  The following example uses resource principals and a pattern to load files that have a numerical suffix.
```
mysql> CREATE EXTERNAL TABLE table_1(
        col_1 int,
        col_2 int,
        col_3 int)
        ENGINE_ATTRIBUTE='{"dialect": {"format": "csv"},
                           "file": [{"region": "us-ashburn-1", 
                                     "namespace": "tenant_1", 
                                     "bucket": "bucket_1", 
                                     "pattern": "data_files/data_file_\\\\d+\\\\.csv"}]}';
```
  See the following to learn more:
  - Regular Expression Syntax
  - Modified ECMAScript regular expression grammar
- prefix: The prefix for a set of Object Storage files. For pre-authenticated requests, prefix or bucket name present in the PAR URL is considered.
  
  The following example uses resource principals and a prefix to specify files to load in an Object Storage folder.
```
mysql> CREATE EXTERNAL TABLE table_1(
        col_1 int,
        col_2 int,
        col_3 int)
        ENGINE_ATTRIBUTE='{"dialect": {"format": "csv"},
                           "file": [{"region": "us-ashburn-1", 
                                     "namespace": "tenant_1", 
                                     "bucket": "bucket_1", 
                                     "prefix": "data_files/"}]}';
```
- Review examples that use PARs to create external tables using Lakehouse Auto Parallel Load or manually.
- Review examples that use resource principals to create external tables using Lakehouse Auto Parallel Load or manually.

The following file parameters are optional:

compression: Supported as of MySQL 9.3.2. Use this to load compressed files.

Compression is supported for all file formats:
- CSV: The default value is no compression. You can set compression to zip, gzip, or bzip2.
  
  The following example specifies a CSV file compressed in zip format.
```
mysql> CREATE EXTERNAL TABLE table_1(
        col_1 int,
        col_2 int,
        col_3 int)
        ENGINE_ATTRIBUTE='{"dialect": {"format": "csv", "compression": "zip"},
                           "file": [{"uri": "oci://mybucket@mynamespace/data_files/data_file_1.csv.zip"}]}';
```
- JSON: The default value is no compression. You can set compression to zip, gzip, or bzip2.
  
  The following example specifies a JSON file compressed in gzip format.
```
mysql> CREATE EXTERNAL TABLE table_1(
        col_1 int,
        col_2 int,
        col_3 int)
        ENGINE_ATTRIBUTE='{"dialect": {"format": "json", "compression": "gzip"},
                           "file": [{"uri": "oci://mybucket@mynamespace/data_files/data_file_1.json.gz"}]}';
```
- Avro: The default value is auto. You cannot set any other options for compression, as Avro files declare their compression format in the file metadata.
  
  The following example specifies an Avro file with the auto compression option.
```
mysql> CREATE EXTERNAL TABLE table_1(
        col_1 int,
        col_2 int,
        col_3 int)
        ENGINE_ATTRIBUTE='{"dialect": {"format": "avro", "compression": "auto"},
                           "file": [{"uri": "oci://mybucket@mynamespace/data_files/data_file_1.avro"}]}';
```
- Parquet: The default value is auto. You cannot set any other options for compression, as Parquet files declare their compression format in the file metadata.
  
  The following example specifies a Parquet file with the auto compression option.
```
mysql> CREATE EXTERNAL TABLE table_1(
        col_1 int,
        col_2 int,
        col_3 int)
        ENGINE_ATTRIBUTE='{"dialect": {"format": "parquet", "compression": "auto"},
                           "file": [{"uri": "oci://mybucket@mynamespace/data_files/data_file_1.parquet"}]}';
```
Note the following when loading compressed files:
- Each compressed file must only contain a single original file. You cannot provide a compressed file with archives or folders with multiple files.
- Files in zip format must be compressed with the Deflate algorithm.
- If loading multiple files, all the files must be in the same compression format.
- When loading multiple compressed files, it is best to prepare files of similar size, and in the range of 80-120 MB each. Otherwise, you may experience longer loading times.
is_strict_mode: Loads data in strict mode or non-strict mode, based on the selected option. This overrides the is_strict_mode dialect parameter. To learn about the default value, see the is_strict_mode dialect parameter in Common Dialect Parameters. The supported options are:
- true: Loads the data in strict mode. The data loading stops if there is an error due to missing files, empty columns, formatting errors or parsing errors.
- false: Loads the data in non-strict mode. Missing files, empty columns, formatting errors, or parsing errors display a warning, and data is loaded. If there are missing columns, implicit default values are used in the absence of user-provided default values. See Data Type Default Values to learn more.
The following example sets is_strict_mode to false in the file parameters.
```
mysql> CREATE EXTERNAL TABLE table_1(
        col_1 int,
        col_2 int,
        col_3 int)
        ENGINE_ATTRIBUTE='{"dialect": "format": "csv"},
                           "file": [{"oci://mybucket@mynamespace/data_files/data_file_1.csv", 
                                     "is_strict_mode": false}]}';
```
allow_missing_files: Supported as of MySQL 8.4.0. Handles missing files based on the selected option. This overrides the dialect parameter is_strict_mode for missing files. To learn about the default value, see the allow_missing_files dialect parameter in Common Dialect Parameters. The supported options are:
- true: If any file is missing, no error occurs and data loading continues with the existing files. An error occurs if all the files are not available.
- false: If any file is missing, an error occurs and data is not loaded.
The following example uses a glob pattern specified with a uniform resource identifier (URI) and sets allow_missing_files to true in the file parameters. If any of the specified files in the pattern are not available, an error does not occur and data continues loading for the existing files. If all files are missing, an error occurs.
```
mysql> CREATE EXTERNAL TABLE table_1(
        col_1 int,
        col_2 int,
        col_3 int)
        ENGINE_ATTRIBUTE='{"dialect": {"format": "csv"},
                           "file": [{"uri": "oci://mybucket@mynamespace/data_files/data_file_[1-10].csv", 
                                     "allow_missing_files": true}]}';
```

Event-Based Incremental Load Options

auto_refresh_event_source: Supported as of MySQL 9.4.1. Specifies a Stream OCID for event-based incremental load. This string value is verified as a valid Stream resource identifier. Learn more about Resource Identifiers from Oracle Cloud Infrastructure Documentation. Learn how to Refresh Data Using Event-Based Incremental Load.

The following example specifies a Stream OCID to set up event-based incremental load.
```
mysql> CREATE EXTERNAL TABLE table_1(
        col_1 int,
        col_2 int,
        col_3 int)
        ENGINE_ATTRIBUTE='{"dialect": {"format": "csv"},
                           "auto_refresh_event_source": "ocid1.stream.oc1.iad...",
                           "file": [{"uri": "oci://mybucket@mynamespace/data_files/data_file_1.csv"}]}';
```

Matching Columns for Parquet Files Options

match_columns_by: Supported as of MySQL 9.5.0. Allows you to map columns from Parquet files to external tables based on column order, column name using case sensitivity, or column name using case insensitivity.

match_columns_by has the following requirements and specifications:
- The external file must be in parquet format.
- If there is a difference between the number of columns in the external table and the Parquet file, you must set is_strict_mode to false (non-strict mode).
- If the external file has more columns than the Parquet file, implicit default values are used for the missing columns in the absence of user-provided default values. See Data Type Default Values to learn more.
- If the Parquet file has columns with the same name but different upppercase and lowercase charcters (for example, col_1 and Col_1), Auto Parallel Load and Guided Load generate errors for duplicate column names even if you use the name_case_sensitive option.
- If you create a new table with Auto Parallel Load and use the name_case_insensitive option, column names in the external table are created in all lowercase characters.
match_columns_by has the following options:
- order: The external table schema maps to the Parquet file schema by column order. For example, the first column in the table schema maps to the first column in the file schema, and so on. This is the default value.
- name_case_sensitive: The external table schema maps to the Parquet file by column name using case sensitivity.
- name_case_insensitive: The external table schema maps to the Parquet file by column name using case insensitivity.
The following example loads columns from a Parquet file based on name using case sensitivity. Lakehouse maps columns in the Parquet file to the columns in the external table (col_1, col_2, col_3) with the same name and case sensitivity.
```
mysql> CREATE EXTERNAL TABLE table_1(
        col_1 int,
        col_2 int,
        col_3 int)
        ENGINE_ATTRIBUTE='{"dialect": {"format": "parquet"},
                           "match_columns_by": "name_case_sensitive",
                           "file": [{"uri": "oci://mybucket@mynamespace/data_files/data_file_1.parquet"]}';
```

Allow Missing Columns Option

allow_missing_columns: Supported as of MySQL 9.5.2. Provides more flexibility when loading external tables by allowing some columns to be missing (in some files, or in the table as compared to files) even when is_strict_mode is true, but still having other strict mode properties such as data checks. During the initial load of a table using Auto Parallel Load, this option infers the table structure from a union of all column names from all files.
- This is currently only supported for Parquet files.
- To enable this, you must also set match_columns_by to name_case_sensitive or name_case_insensitive. If you set match_columns_by to name_case_sensitive or name_case_insensitive, allow_missing_columns is set to true by default.
- Keep in mind that when the ENGINE_ATTRIBUTE sampling option for HEATWAVE_LOAD is enabled and set to true, Lakehouse tries to correctly create a unified schema from the data that it samples. To guarantee a correctly configured schema, you need to set sampling to false. To learn more about HEATWAVE_LOAD options, see HEATWAVE_LOAD.
- If you do not manually create a table before loading data, Auto Parallel Load creates a table structure from a union of all column names found across files. If a column is present in one file but not another, NULL or implicit default values are used in the absence of user-provided default values. See Data Type Default Values to learn more.
- If you manually create a table before loading data, Auto Parallel Load and Guided Load use the provided table definition, and do not add or remove extra columns. You can check warnings for any differences in columns found during the load.
- When Lakehouse loads data, if there are any columns that are in the table but not in a file, NULL or implicit default values are used in the absence of user-provided default values with applicable warning messages. See Data Type Default Values to learn more.

The following example manually creates a table, specifies files in Parquet format, enables allow_missing_columns, matches columns by name using case sensitivity, disables sampling, specifies all files in to load from the data_files directory, and then loads the table with the HEATWAVE_LOAD command. When you load the table with the files specified, NULL and implicit default values are used for any columns in the table but not in the files. Any columns that are present other than col_1, col_2, and col_3 in the files are skipped.

mysql> CREATE EXTERNAL TABLE table_1(
        col_1 int,
        col_2 int,
        col_3 int)
        ENGINE_ATTRIBUTE='{"dialect": {"format": "parquet"},
                           "allow_missing_columns": true,
                           "match_columns_by": "name_case_sensitive",
                           "sampling": false,
                           "file": [{"uri": "oci://mybucket@mynamespace/data_files/"}]}';

Dialect Parameters

The dialect parameters are optional and are used to define the data configuration.

Common Dialect Parameters

The following dialect parameters are applicable to all file formats (CSV, JSON, Parquet and Avro).

format: The file format defined in a table. You can define only one file format per table. The supported file formats are:

csv: The default file format.

The following example specifies a CSV file.

mysql> CREATE EXTERNAL TABLE table_1(
        col_1 int,
        col_2 int,
        col_3 int)
        ENGINE_ATTRIBUTE='{"dialect": {"format": "csv"},
                           "file": [{"uri": "oci://mybucket@mynamespace/data_files/data_file_1.csv"}]}';

json: Supported as of MySQL 8.4.0. Only Newline Delimited JSON files are supported. Tables created with json format must only have a single column that conforms to the JSON data type.

The following example specifies a JSON file.
```
mysql> CREATE EXTERNAL TABLE table_1(
        col_1 int,
        col_2 int,
        col_3 int)
        ENGINE_ATTRIBUTE='{"dialect": {"format": "json"},
                           "file": [{"uri": "oci://mybucket@mynamespace/data_files/data_file_1.json"}]}';
```

parquet: The Parquet data type.

The following example specifies a Parquet file.

mysql> CREATE EXTERNAL TABLE table_1(
        col_1 int,
        col_2 int,
        col_3 int)
        ENGINE_ATTRIBUTE='{"dialect": {"format": "parquet"},
                           "file": [{"uri": "oci://mybucket@mynamespace/data_files/data_file_1.parquet"}]}';

avro: The Avro data type.

The following example specifies an Avro file.

mysql> CREATE EXTERNAL TABLE table_1(
        col_1 int,
        col_2 int,
        col_3 int)
        ENGINE_ATTRIBUTE='{"dialect": {"format": "avro"},
                           "file": [{"uri": "oci://mybucket@mynamespace/data_files/data_file_1.avro"}]}';

delta: Supported as of MySQL 9.5.0. Loads delta lake tables. See Delta Lake Tables and Limitations for Delta Lake Tables to learn more.

The following example specifies a delta lake table.

mysql> CREATE EXTERNAL TABLE table_1(
        col_1 int,
        col_2 int,
        col_3 int)
        ENGINE_ATTRIBUTE='{"dialect": {"format": "delta"},
                           "file": [{"uri": "oci://mybucket@mynamespace/data_files/"}]}';

check_constraints: Supported as of MySQL 8.4.0. Lakehouse validates primary key and unique key constraints during the initial load based on the selected option. The supported options are:
- true: The default value. Lakehouse validates primary key and unique key constraints only during the initial load of the table. If there are subsequent loads or refreshes of the table, validation does not occur.
- false: Lakehouse does not validate primary key and unique key constraints.
The following example sets check_constraints to false.
```
mysql> CREATE EXTERNAL TABLE table_1(
        col_1 int,
        col_2 int,
        col_3 int)
        ENGINE_ATTRIBUTE='{"dialect": {"format": "csv", "check_constraints": false},
                           "file": [{"uri": "oci://mybucket@mynamespace/data_files/data_file_1.csv"}]}';
```
is_strict_mode: Loads data in strict mode or non-strict mode, based on the selected option. By default, is_strict_mode is set to the sql_mode value. See Strict SQL Mode. The file common parameter is_strict_mode can override this setting. The supported options are:
- true: Loads the data in strict mode. The data loading stops if there is an error due to missing files, empty columns, formatting errors or parsing errors.
  
  As of MySQL 8.4.0, the dialect parameter is_strict_mode applies to all file formats. Before MySQL 8.4.0, it only applies to the CSV file format. For Avro and Parquet file formats, use the file parameter is_strict_mode to define strict mode before MySQL 8.4.0.
- false: Loads the data in non-strict mode. Missing files, empty columns, formatting errors, or parsing errors display a warning, and data is loaded. If there are missing columns, implicit default values are used in the absence of user-provided default values. See Data Type Default Values to learn more.
- If no value is set, the setting is defined by the sql_mode.
- Review Table 4.7, “Combinations of Settings for allow_missing_files and is_strict_mode”.
This setting overrides the global sql_mode setting for handling missing files. The default is the value of sql_mode. See Strict SQL Mode. The file common parameter is_strict_mode can override this setting.

The following example sets is_strict_mode to false in the dialect parameters.
```
mysql> CREATE EXTERNAL TABLE table_1(
        col_1 int,
        col_2 int,
        col_3 int)
        ENGINE_ATTRIBUTE='{"dialect": {"format": "csv", "is_strict_mode": false},
                           "file": [{"uri": "oci://mybucket@mynamespace/data_files/data_file_1.csv"}]}';
```
allow_missing_files: Supported as of MySQL 8.4.0. Handles missing files based on the selected option. This overrides the dialect parameter is_strict_mode for missing files. The supported options are:
- true: An error does not occur for any missing file, and data loading continues with the existing files. An error occurs if all files are not available.
- false: If any file is missing, an error occurs and data is not loaded.
- If no value is set, the setting is defined by is_strict_mode. If no value is set for is_strict_mode, the setting is defined by the sql_mode.
- Review Table 4.7, “Combinations of Settings for allow_missing_files and is_strict_mode”.
A missing file is defined as:
- With the name parameter: There is no file with that name.
- With the pattern parameter: There are no files that match the pattern.
- With the prefix parameter: There are no files with that prefix.
The following example uses a glob pattern specified with a uniform resource identifier (URI). It sets allow_missing_files to true in the dialect parameters. If any of the specified files in the pattern are not available, an error does not occur and data continues loading for the existing files. If all files are missing, an error occurs. If a file-specific setting is set, this may override the allow_missing_files setting.
```
mysql> CREATE EXTERNAL TABLE table_1(
        col_1 int,
        col_2 int,
        col_3 int)
        ENGINE_ATTRIBUTE='{"dialect": {"format": "csv", "allow_missing_files": true},
                           "file": [{"uri": "oci://mybucket@mynamespace/data_files/data_file_[1-10].csv"}]}';
```

Refer to the following table to review how different combinations of settings for allow_missing_files and is_strict_mode affect errors for missing files.

Table 4.7 Combinations of Settings for allow_missing_files and is_strict_mode

allow_missing_files	is_strict_mode	Possible Errors
T	T	No error generated unless all files are missing.
T	F
T	Not set
Not set	F
F	T	Error generated for any missing files.
F	F
F	Not set
Not set	T
Not set	Not set	Error generation depends on setting for `sql_mode`.

Dialect Parameters for CSV and JSON Files

The following dialect parameters applicable to CSV and JSON file formats.

Note

If these parameters are used with Avro or Parquet file formats, an error occurs.

encoding: Defines the character encoding. The default is "utf8mb4".
record_delimiter: Specifies one or more characters used to delimit records. The maximum record delimiter length is 64 characters. You can add any string as a custom_record_delimiter.

The default record delimiter for json is "\n". The only alternative for json is "\r\n".

For CSV files, if you set the record_delimiter to auto, Auto Parallel Load automatically detects the record delimiters and composite record delimiters with field delimiters as prefixes. Supported as of MySQL 9.2.0 and set as the default value.

The following record delimiters detected automatically by Auto Parallel Load:
- \r: Carriage return.
- \n: Line feed. This is the default for all MySQL versions prior to 9.2.0.
- \r\n : Carriage return and line feed

Dialect Parameters for CSV and Parquet Files

The following dialect parameters are applicable to CSV and Parquet file formats.

embed_model_id: Supported as of MySQL 9.3.2. Lets you specify the embedding model for VECTOR columns in the table. You must specify a model supported by MySQL HeatWave GenAI. To specify an embedding model for a table, you must use Auto Parallel Load to automatically create the table and load the data. See the example below. If you want to use a CREATE EXTERNAL TABLE statement to create the table manually and specify an embedding model, provide the embedding model as a column definition. Learn more about the Vector Data Type.

The following example uses Auto Parallel Load to set the embedding model for all VECTOR columns detected in the table to minilm. Keep in mind that if the table already exists and there are VECTOR columns with an embedding model already set, the existing model is not replaced with the one specified in this example.
```
mysql> SET @input_list = '[{
  "db_name": "data_db",
  "tables": [{
    "table_name": "table_1",
    "engine_attribute": {
      "dialect": {"format": "csv", "embed_model_id": "minilm"},
      "file": [{"uri": "oci://mybucket@mynamespace/data_files/data_file_1.csv"}]
    }
  }]
}]';
```

Dialect Parameters for CSV Files

The following dialect parameters are applicable only to CSV file formats.

field_delimiter: Specifies one or more characters used to enclose fields. The maximum field delimiter length is 64 characters. You can add any string as a custom_field_delimiter.

When the field_delimiter is set to auto, Auto Parallel Load automatically detects the field delimiters. Supported as of MySQL 9.2.0 and set as the default value.

The following field delimiters are detected automatically by Auto Parallel Load:
- |: Pipe. Default for all MySQL versions prior to 9.2.0.
- ,: Comma
- \t: Tab
- ;: Semicolon
The following example sets the record_delimiter, and field_delimiter. The backslash character, \, needs to be escaped twice since it is inside an JSON object.
```
mysql> CREATE EXTERNAL TABLE table_1(
        col_1 int,
        col_2 int,
        col_3 int)
        ENGINE_ATTRIBUTE='{"dialect": {"format": "csv", 
                                       "record_delimiter": ";",
                                       "field_delimiter": "\\r\\n"},
                                       "file": [{"uri": "oci://mybucket@mynamespace/data_files/data_file_1.csv"}]}';
```
escape_character: Specifies one or more characters used to escape special characters. The default is "\\". You can add any string as a custom_escape_character. As of MySQL 9.5.0, you have the option to ignore the escape character by setting its value to an empty string.
quotation_marks: Specifies one or more characters used to enclose fields. The default is "\'". You can add any string as custom_quotation_marks. As of MySQL 9.5.0, you have the option to ignore the quote character by setting its value to an empty string.

Consider the following example for setting empty strings for escape and quotation characters. You have the following data in a CSV file: "{""doc1"":{\""doc2\"":22}}".
- You create an external table and set an empty string for the escape character: "dialect": {"escape_character": ""}.
```
mysql> CREATE EXTERNAL TABLE table_1(
        col_1 int)
        ENGINE_ATTRIBUTE='{"dialect": {"format": "csv", 
                                       "escape_character": ""},
                                       "file": [{"uri": "oci://mybucket@mynamespace/data_files/data_file_1.csv"}]}';
```
  The value loads as: {"doc1":{\"doc2\":22}}.
- You create an external table and set an empty string for the quotation character: "dialect": {"quotation_marks": ""}.
```
mysql> CREATE EXTERNAL TABLE table_1(
        col_1 int)
        ENGINE_ATTRIBUTE='{"dialect": {"format": "csv", 
                                       "quotation_marks": ""},
                                       "file": [{"uri": "oci://mybucket@mynamespace/data_files/data_file_1.csv"}]}';
```
  The value loads as: "{""doc1"":{""doc2"":22}}".
- You create an external table and set empty strings for both the escape character and quotation character: "dialect": {"quotation_marks": "", "escape_character": ""}.
```
mysql> CREATE EXTERNAL TABLE table_1(
        col_1 int)
        ENGINE_ATTRIBUTE='{"dialect": {"format": "csv", 
                                       "escape_character": "",
                                       "quotation_marks": ""},
                                       "file": [{"uri": "oci://mybucket@mynamespace/data_files/data_file_1.csv"}]}';
```
  The value loads as: "{""doc1"":{\""doc2\"":22}}".
The following example sets the escape_character, and quotation_marks. The backslash character, \, needs to be escaped twice since it is inside an JSON object.
```
mysql> CREATE EXTERNAL TABLE table_1(
        col_1 int,
        col_2 int,
        col_3 int)
        ENGINE_ATTRIBUTE='{"dialect": {"format": "csv", 
                                       "escape_character": "\\\\",
                                       "quotation_marks": "\\""},
                                       "file": [{"uri": "oci://mybucket@mynamespace/data_files/data_file_1.csv"}]}';
```
date_format: The date format. See date_format. You can also set date formats for each column by using the column ENGINE_ATTRIBUTE option, which overrides the format in the dialect parameter. For versions earlier than MySQL 9.5.0, this format is ignored if you load data using Lakehouse Auto Parallel Load or manually with Guided Load. As of MySQL 9.5.0, the format specified at the dialect level is used to infer the temporal data if you load data using Lakehouse Auto Parallel Load or manually with Guided Load.
time_format: The time format. See String and Numeric Literals in Date and Time Context. You can also set time formats for each column by using the column ENGINE_ATTRIBUTE option, which overrides the format in the dialect parameter. For versions earlier than MySQL 9.5.0, this format is ignored if you load data using Lakehouse Auto Parallel Load or manually with Guided Load. As of MySQL 9.5.0, the format specified at the dialect level is used to infer the temporal data if you load data using Lakehouse Auto Parallel Load or manually with Guided Load.
timestamp_format: Supported as of MySQL 9.0.1. The timestamp format. See date_format. You can also set timestamp formats for each column by using the column ENGINE_ATTRIBUTE option, which overrides the format in the dialect parameter. For versions earlier than MySQL 9.5.0, this format is ignored if you load data using Lakehouse Auto Parallel Load or manually with Guided Load. As of MySQL 9.5.0, the format specified at the dialect level is used to infer the temporal data if you load data using Lakehouse Auto Parallel Load or manually with Guided Load.
datetime_format: Supported as of MySQL 9.5.1. The datetime format. You can also set datetime formats for each column by using the column ENGINE_ATTRIBUTE option, which overrides the format in the dialect parameter. The format specified at the dialect level is used to infer the temporal data if you load data using Lakehouse Auto Parallel Load or manually with Guided Load.
If you set both timestamp_format and datetime_format, only the datetime_format is used for inference and loading of timestamp/datetime values. We recommend using datetime_format instead of timestamp_format.
As of MySQL 9.5.0, Lakehouse Auto Parallel Load and Guided Load can infer supported formats for temporal data types by setting the appropriate dialect option to auto. The auto setting is the default value if no temporal format is specified. See About Lakehouse Auto Parallel Load Schema Inference to learn more.

The following example sets the date_format, time_format, and datetime_format.

mysql> CREATE EXTERNAL TABLE table_1(
        col_1 date,
        col_2 time,
        col_3 datetime,
        col_4 int)
        ENGINE_ATTRIBUTE='{"dialect": {"format": "csv", 
                                       "date_format": "%d/%m/%Y",
                                       "time_format": "%I:%i %p",
                                       "datetime_format": "%Y-%m-%d %H:%i:%s"},
                                       "file": [{"uri": "oci://mybucket@mynamespace/data_files/data_file_1.csv"}]}';

The following example sets the date_format, time_format, and datetime_format options to auto, which allows Auto Parallel Load and Guided Load to infer supported formats for temporal data types.

mysql> CREATE EXTERNAL TABLE table_1(
        col_1 date,
        col_2 time,
        col_3 datetime,
        col_4 int)
        ENGINE_ATTRIBUTE='{"dialect": {"format": "csv", 
                                       "date_format": "auto",
                                       "time_format": "auto",
                                       "datetime_format": "auto"},
                                       "file": [{"uri": "oci://mybucket@mynamespace/data_files/data_file_1.csv"}]}';

trim_spaces: Removes/retains the leading and trailing spaces, based on the set option. The default value is false.

The following example sets trim_spaces to true.

mysql> CREATE EXTERNAL TABLE table_1(
        col_1 int,
        col_2 int,
        col_3 int)
        ENGINE_ATTRIBUTE='{"dialect": {"format": "csv", "trim_spaces": true},
                                       "file": [{"uri": "oci://mybucket@mynamespace/data_files/data_file_1.csv"}]}';

skip_rows: The number of rows to skip at the start of the file. The default value is 0 and the maximum value is 20.
has_header: Adds a header row to the CSV file based on the selected option. The default is false.

If has_header and skip_rows are both specified, Lakehouse first skips the number of rows, and then uses the next row as the header row.

The following example creates a table with Auto Parallel Load and sets skip_rows and has_header. Since both parameters are set, the first row is skipped and the second row is set as the header row.
```
mysql> CREATE EXTERNAL TABLE table_1(
        col_1 int,
        col_2 int,
        col_3 int)
        ENGINE_ATTRIBUTE='{"dialect": {"format": "csv", "skip_rows": 1, "has_header": true},
                                       "file": [{"uri": "oci://mybucket@mynamespace/data_files/data_file_1.csv"}]}';
```

What's Next

Learn about Lakehouse Auto Parallel Load
Learn how to create external tables using Auto Parallel Load or manually.
Learn how to Refresh Data Using Event-Based Incremental Load.