MySQL HeatWave User Guide  /  HeatWave Lakehouse  /  Lakehouse External Table Syntax

4.5 Lakehouse External Table Syntax

External tables are used by the load procedure to load the data from Object Storage. This section describes the external table syntax.

Note

While it is possible to create an external table manually, it is recommended to use the HeatWave Lakehouse Auto Parallel Load procedure, which analyzes the data and generates a report and a load script for you.

See Lakehouse Auto Parallel Load.

mysql> CREATE TABLE table_name create_definition ENGINE=lakehouse SECONDARY_ENGINE=rapid
ENGINE_ATTRIBUTE='{                    
  "dialect": 
  {
    "format": "csv",
    "field_delimiter": "|", 
    "record_delimiter": "|\n", 
    "escape_character": "\\", 
    "quotation_marks":"\"", 
    "skip_rows": 0,  
    "encoding": "utf8mb4", 
    "date_format": "auto", 
    "time_format": "auto", 
    "trim_spaces": false,
    "is_strict_mode": true
   },
  "file": [file1, file2, …]
 }';

  • ENGINE: Set to lakehouse.

  • SECONDARY_ENGINE: Set to rapid.

  • ENGINE_ATTRIBUTE: JSON object. Defines the location of files, the file format, and how the file format is handled.

    • dialect: (optional) defines how the data is configured.

      dialect contains the following:

      • format: csv (default) or parquet

        Note

        If you use the Parquet format, specify only "format": "parquet"; do not define any other dialect parameters. HeatWave returns an error if they are defined.

        The remaining dialect parameters pertain to CSV, only.

        It is not possible to define multiple formats per table.

      • field_delimiter: Default "|". Defines one or more characters used to delimit fields. The maximum field delimiter length is 64 characters.

      • record_delimiter: Default "|\n". Defines one or more characters used to delimit records. The maximum record delimiter length is 64 characters.

      • escape_character: Default "\\". Defines one or more characters used to escape special characters.

      • quotation_marks: Default "\"". Defines one or more characters used to enclose fields.

      • skip_rows: Default 0. Defines the number of rows to skip at the start of the file. The maximum value is 20.

      • encoding: Default "utf8mb4". Defines the character encoding.

      • date_format: Default "auto". See date_format for information on the supported formats.

      • time_format: Default "auto". For more information, see String and Numeric Literals in Date and Time Context.

      • trim_spaces: true | false (default). If true, leading and trailing spaces are removed.

      • is_strict_mode: whether loading takes place in strict mode (true) or non-strict mode (false).

        If set to true, loading stops and an error is thrown when parsing or formatting errors, or empty columns, are found in the file. If set to false, a warning is displayed and loading continues.

        If is_strict_mode is not defined, the value of sql_mode is used. See Strict SQL Mode for more information.

        Note

        If you are using the Parquet file format, you must use sql_mode to define strict mode. is_strict_mode is used for CSV, only.

    • file: (required) includes a list of Object Storage files, prefixes, or both.

      Note

      The file parameter supports a maximum of 256 file locations. If you need more than 256, it is recommended to store the files under a common bucket or prefix and reference that location in the file parameter instead.

      If using a PAR to access the data:

      • par: the PAR URL.

        Note

        Do not specify a region, namespace, or bucket when using PARs. That information is contained in the PAR URL and will generate an error if defined in separate parameters.

      • prefix: if you are using a prefix, the prefix name.

      • name: if you are using a specific file, the filename.

      • is_strict_mode: whether loading takes place in strict mode (true) or non-strict mode (false), overriding the global sql_mode. This parameter also overrides is_strict_mode defined in the dialect object.

      If using a resource principal to access the data directly:

      • bucket: the bucket name.

      • namespace: the tenancy namespace.

      • region: the region your tenancy is in.

      • prefix: if you are using a prefix, the prefix name.

      • name: if you are using a specific file, the filename.

      • is_strict_mode: whether loading takes place in strict mode (true) or non-strict mode (false), overriding the global sql_mode. This parameter also overrides is_strict_mode defined in the dialect object.

If you are using a PAR link, the file parameter has the following syntax:

"file": [{"par": "PARURL", "prefix": "prefix" | "name": "filename", "is_strict_mode": true|false}]

If the PAR is defined on a bucket, you must ensure Enable Object Listing is enabled.
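For example, a CSV table over files accessed through a PAR might be created as follows. This is an illustrative sketch: the column list, PAR URL, and prefix are placeholders, and the dialect values should match your data.

```sql
-- Illustrative example only: the PAR URL, prefix, and column list are placeholders.
CREATE TABLE orders (
  order_id INT,
  order_date DATE,
  amount DECIMAL(10,2)
) ENGINE=lakehouse SECONDARY_ENGINE=rapid
ENGINE_ATTRIBUTE='{
  "dialect": {
    "format": "csv",
    "field_delimiter": ",",
    "record_delimiter": "\n",
    "skip_rows": 1
  },
  "file": [{"par": "https://objectstorage.region.oraclecloud.com/p/token/n/namespace/b/bucket/o/",
            "prefix": "orders/"}]
}';
```

Because the PAR here is defined on a bucket and combined with a prefix, Enable Object Listing must be enabled on the PAR.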

If you are using a resource principal for authentication, the file parameter has the following syntax:

"file": [{"region": "regionName", "namespace": tenancyNamespace", "bucket": "bucketName", 
          "name": "filename|foldername"}]

You can also define a prefix instead of a name:

"file": [{"region": "regionName", "namespace": tenancyNamespace", "bucket": "bucketName", 
          "prefix": "prefix"}]

Lakehouse Auto Parallel Load

HeatWave Lakehouse extends the Section 2.2.3, “Loading Data Using Auto Parallel Load” functionality to enable loading from external sources, such as Oracle Cloud Infrastructure Object Storage.

Auto Parallel Load analyzes the data to infer the table structure if not provided.

Note

By default, the load operation analyzes and executes the generated SQL script. To review the report and generated loading script before executing it, run the load with the dryrun option.

The @load_params option of Auto Parallel Load is extended with the external_tables parameter. This parameter contains the following:

  • db_object: the details of one or more tables. Each db_object contains the following:

    • "db_name": name of the database. If the database does not exist, it is created by the load command.

    • "tables": details of the tables to load.

The db_object takes the following syntax:

db_object: {
 "db_name": "name",
 "tables": [table, table, ...]
 }
  • db_name: the schema name as specified in the external table definition.

  • tables: array of one or more tables.

Each table is defined in a JSON object named table:

table: {
 "table_name": "name",
 "sampling": true | false,
 "dialect": {dialect_options},
 "file": [file1, file2, ...],
 }
  • table_name: the name of the table to load.

  • sampling: if set to true (default), the schema is inferred by sampling the data and a variety of statistics are collected. The inferred schema is used to generate a CREATE TABLE statement. The statistics are used to estimate storage requirements and load times.

    Note

    If set to false, a full scan is performed to infer the schema and collect the relevant statistics. Depending on the size of the data, this can take a long time.

  • dialect: formatting options for the input format. See the dialect parameter in Section 4.5, “Lakehouse External Table Syntax” for more information.

  • file: the location of the data in Object Storage. This can be a path to a file, a prefix, or a PAR.

    For information on file syntax, see Section 4.5, “Lakehouse External Table Syntax”.
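As a sketch, a dry run of Auto Parallel Load over an external CSV table might be invoked as follows. The schema, table, and Object Storage details are placeholders, and the exact option layout should be checked against the Auto Parallel Load documentation for your HeatWave version.

```sql
-- Illustrative sketch only: names and Object Storage details are placeholders.
SET @load_params = '{
  "mode": "dryrun",
  "external_tables": [{
    "db_name": "lakehouse_db",
    "tables": [{
      "table_name": "orders",
      "sampling": true,
      "dialect": {"format": "csv", "field_delimiter": ",", "record_delimiter": "\n"},
      "file": [{"region": "us-ashburn-1", "namespace": "mytenancy",
                "bucket": "mybucket", "prefix": "orders/"}]
    }]
  }]
}';
CALL sys.heatwave_load(JSON_ARRAY('lakehouse_db'), CAST(@load_params AS JSON));
```

With "mode": "dryrun", the procedure produces the analysis report and generated load script without loading any data, so the inferred schema can be reviewed before execution.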