Related Documentation Download this Manual
PDF (US Ltr) - 2.1Mb
PDF (A4) - 2.1Mb


HeatWave User Guide  /  ...  /  Lakehouse External Table Syntax

5.2.2 Lakehouse External Table Syntax

Lakehouse uses external tables to load the data from Object Storage.

MySQL 9.2.0 adds support for the following:

  • Loading CSV files without knowing the exact values of the following dialect parameters in advance:

    • record_delimiter: If set to auto, Lakehouse Autopilot can automatically detect the record delimiters and composite record delimiters with field delimiters as prefixes.

      The following record delimiters can be detected automatically: \r (carriage return), \n (line feed), \r\n (carriage return and line feed), and ; (semicolon).

      Default value of record_delimiter for CSV files is auto.

    • field_delimiter: If set to auto, Lakehouse Autopilot can automatically detect the field delimiters.

      The following field delimiters can be detected automatically: | (pipe), , (comma), \t (tab), and ; (semicolon).

      Default value of field_delimiter for CSV files is auto.

    The detected delimiters are then used in the generated CREATE TABLE statement.

    mysql> CREATE TABLE table_name create_definition 
    (column_name ENGINE_ATTRIBUTE '
      {"date_format": "default", 
       "time_format": "default",
       "timestamp_format": "default"
      }')
    ENGINE=lakehouse SECONDARY_ENGINE=rapid ENGINE_ATTRIBUTE='{
      "dialect": 
      {
        "format": "avro" | "csv" | "json" | "parquet",
        "check_constraints": true | false,
        "field_delimiter": "|" | "," | "\t" | ";" | "auto", 
        "record_delimiter": "\r" | "\n" | "\r\n" | "auto", 
        "escape_character": "\\", 
        "quotation_marks":"\"", 
        "skip_rows": 0,  
        "encoding": "utf8mb4", 
        "date_format": "default", 
        "time_format": "default",
        "timestamp_format": "default",
        "trim_spaces": true | false,
        "has_header": true | false,
        "is_strict_mode": true | false,
        "allow_missing_files": true | false
      },
      "file":
      [
        file_section [, file_section, file_section, ...]
      ]
    }';
    
    file_section: {
      "bucket": "bucket_name",
      "namespace": "namespace",
      "region": "region",
      ("prefix": "prefix") | ("name": "filename")| ("pattern" : "pattern"),
      "is_strict_mode": true | false,
      "allow_missing_files": true | false
    }
    
    or
    
    file_section: {
      "par": "PAR URL",
      ["prefix": "prefix") | ("name": "filename")| ("pattern" : "pattern"),]
      "is_strict_mode": true | false,
      "allow_missing_files": true | false
    }

MySQL 9.0.1-u1 adds support for the following:

  • timestamp_format as a dialect parameter allows you to customize the format for columns of the DATETIME and TIMESTAMP data types.

  • ENGINE_ATTRIBUTE option for specific columns. Use this option to customize the format for columns of the DATETIME and TIMESTAMP data types. Formats set with this parameter override formats set with the dialect parameter.

mysql> CREATE TABLE table_name create_definition 
(column_name ENGINE_ATTRIBUTE '
  {"date_format": "default", 
   "time_format": "default",
   "timestamp_format": "default"
  }')
ENGINE=lakehouse SECONDARY_ENGINE=rapid ENGINE_ATTRIBUTE='{
  "dialect": 
  {
    "format": "avro" | "csv" | "json" | "parquet",
    "check_constraints": true | false,
    "field_delimiter": "|", 
    "record_delimiter": "|\n", 
    "escape_character": "\\", 
    "quotation_marks":"\"", 
    "skip_rows": 0,  
    "encoding": "utf8mb4", 
    "date_format": "default", 
    "time_format": "default",
    "timestamp_format": "default",
    "trim_spaces": true | false,
    "has_header": true | false,
    "is_strict_mode": true | false,
    "allow_missing_files": true | false
  },
  "file":
  [
    file_section [, file_section, file_section, ...]
  ]
}';

file_section: {
  "bucket": "bucket_name",
  "namespace": "namespace",
  "region": "region",
  ("prefix": "prefix") | ("name": "filename")| ("pattern" : "pattern"),
  "is_strict_mode": true | false,
  "allow_missing_files": true | false
}

or

file_section: {
  "par": "PAR URL",
  ["prefix": "prefix") | ("name": "filename")| ("pattern" : "pattern"),]
  "is_strict_mode": true | false,
  "allow_missing_files": true | false
}

MySQL 8.4.0 adds support for the following:

  • The JSON file format.

  • Support for primary key and unique key constraint validation.

  • is_strict_mode as a dialect parameter now supports all file formats. It can override the global sql_mode.

  • allow_missing_files is a dialect parameter and a file parameter that can allow missing files or not.

mysql> CREATE TABLE table_name create_definition ENGINE=lakehouse SECONDARY_ENGINE=rapid 
ENGINE_ATTRIBUTE='{                    
  "dialect": 
  {
    "format": "avro" | "csv" | "json" | "parquet",
    "check_constraints": true | false,
    "field_delimiter": "|", 
    "record_delimiter": "|\n", 
    "escape_character": "\\", 
    "quotation_marks":"\"", 
    "skip_rows": 0,  
    "encoding": "utf8mb4", 
    "date_format": "auto", 
    "time_format": "auto", 
    "trim_spaces": true | false,
    "has_header": true | false,
    "is_strict_mode": true | false,
    "allow_missing_files": true | false
  },
  "file":
  [
    file_section [, file_section, file_section, ...]
  ]
}';

file_section: {
  "bucket": "bucket_name",
  "namespace": "namespace",
  "region": "region",
  ("prefix": "prefix") | ("name": "filename")| ("pattern" : "pattern"),
  "is_strict_mode": true | false,
  "allow_missing_files": true | false
}

or

file_section: {
  "par": "PAR URL",
  ["prefix": "prefix") | ("name": "filename")| ("pattern" : "pattern"),]
  "is_strict_mode": true | false,
  "allow_missing_files": true | false
}

Before MySQL 8.4.0:

mysql> CREATE TABLE table_name create_definition ENGINE=lakehouse SECONDARY_ENGINE=rapid 
ENGINE_ATTRIBUTE='{                    
  "dialect": 
  {
    "format": "avro" | "csv" | "parquet",
    "field_delimiter": "|", 
    "record_delimiter": "|\n", 
    "escape_character": "\\", 
    "quotation_marks":"\"", 
    "skip_rows": 0,  
    "encoding": "utf8mb4", 
    "date_format": "auto", 
    "time_format": "auto", 
    "trim_spaces": true | false,
    "is_strict_mode": true | false,
    "has_header": true | false
  },
  "file":
  [
    file_section [, file_section, file_section, ...]
  ]
}';

file_section: {
  "bucket": "bucket_name",
  "namespace": "namespace",
  "region": "region",
  ("prefix": "prefix") | ("name": "filename") | ("pattern" : "pattern"),
  "is_strict_mode": true | false
}

or

file_section: {
  "par": "PAR URL",
  ["prefix": "prefix") | ("name": "filename") | ("pattern" : "pattern"),]
  "is_strict_mode": true | false
}
  • ENGINE: Set to lakehouse.

  • SECONDARY_ENGINE: Set to rapid.

  • ENGINE_ATTRIBUTE: JSON object literal. Defines the location of files, the file format, and how the file format is handled.

    Use key-value pairs in JSON format to specify options. Lakehouse uses the default setting if there is no defined option. Use NULL to specify no arguments.

    • dialect: Defines the data configuration. Optional.

      • dialect parameters that apply to all file formats:

        • format: avro, csv, json or parquet. The default is csv.

          It is not possible to define multiple formats per table.

          Tables created with json format must only have a single column that conforms to the JSON data type, see: The JSON Data Type.

          As of MySQL 8.4.0, json only supports Newline Delimited JSON files.

          For parquet see: Section 5.7.1, “Parquet Data Type Conversions”.

        • check_constraints: Whether to validate primary key and unique key constraints or not. The default is true. Supported as of MySQL 8.4.0.

          If set to true, then Lakehouse validates primary key and unique key constraints.

          If set to false, then Lakehouse does not validate primary key and unique key constraints.

        • is_strict_mode: Whether the loading takes place in strict mode, true, or non-strict mode, false. This setting overrides the global sql_mode. The default is the value of sql_mode. See Strict SQL Mode. The file common parameter is_strict_mode can override this setting.

          If set to true, then missing files, empty columns, formatting errors or parsing errors throw an error, and loading stops.

          If set to false, then missing files, empty columns, formatting errors or parsing errors display a warning, and loading continues.

          As of MySQL 8.4.0, the dialect parameter is_strict_mode applies to all file formats. Before MySQL 8.4.0, it only applies to the CSV file format. For Avro and Parquet file formats, use the file parameter is_strict_mode to define strict mode before MySQL 8.4.0.

        • allow_missing_files: Whether to allow missing files or not. This overrides the dialect parameter is_strict_mode for missing files. Supported as of MySQL 8.4.0.

          If set to true, then any missing files do not throw an error, and loading does not stop, unless all files are missing.

          If set to false, then any missing files throw an error, and loading stops.

          A missing file is defined as:

          • With the name parameter: there is no file with that name.

          • With the pattern parameter: there are no files that match the pattern.

          • With the prefix parameter: there are no files with that prefix.

      • dialect parameters that only apply to csv and json:

        The use of any of these parameters with avro or parquet will produce an error.

        • encoding: Defines the character encoding. The default is "utf8mb4".

        • record_delimiter: Defines one or more characters used to delimit records. The maximum field delimiter length is 64 characters.

          The default record delimiter for csv is "auto", and the default for json is "\n". The only alternative for json is "\r\n".

          As of MySQL 9.2.0, Lakehouse Autopilot can automatically detect the record delimiters and composite record delimiters with field delimiters as prefixes. You can load the files, without knowing the exact dialect parameters.

          The record delimiters detected automatically are: \r (carriage return), \n (line feed), \r\n (carriage return and line feed), and ; (semicolon).

      • dialect parameters that only apply to csv:

        The use of any of these parameters with avro, json or parquet will produce an error.

        • field_delimiter: Defines one or more characters used to enclose fields. The maximum field delimiter length is 64 characters. The default is "auto".

          As of MySQL 9.2.0, Lakehouse Autopilot can automatically detect the field delimiters. You can load the files, without knowing the exact dialect parameters.

          The field delimiters detected automatically are: | (pipe), , (comma), \t (tab), and ; (semicolon).

        • escape_character: Defines one or more characters used to escape special characters. The default is "\\".

        • escape_character: Defines one or more characters used to escape special characters. The default is "\\".

        • quotation_marks: Defines one or more characters used to enclose fields. The default is "\'".

        • skip_rows: The number of rows to skip at the start of the file. The maximum value is 20. The default is 0.

          See comments for has_header.

        • date_format: The date format, see: date_format. This format is ignored during auto parallel load. See Section 5.2.4, “Loading Data Using Auto Parallel Load”.

        • time_format: The time format, see: String and Numeric Literals in Date and Time Context. This format is ignored during auto parallel load. See Section 5.2.4, “Loading Data Using Auto Parallel Load”.

        • timestamp_format: The timestamp format, see: date_format. Optionally, you can set timestamp formats for each column by using the column ENGINE_ATTRIBUTE option, which overrides the format in the dialect parameter. This format is ignored during auto parallel load. See Section 5.2.4, “Loading Data Using Auto Parallel Load”.

        • trim_spaces: Whether to remove leading and trailing spaces, or not. The default is false.

        • has_header: Whether the CSV file has a header row, or not. The default is false.

          If has_header and skip_rows are both defined, Lakehouse first skips the number of rows, and then uses the next row as the header row.

    • file: Defines the Object Storage files. Required.

      Lakehouse supports a maximum of 256 file locations. To define more than 256, store the files under the same bucket or use prefix or pattern.