Documentation Home
HeatWave User Guide
Related Documentation Download this Manual
PDF (US Ltr) - 3.8Mb
PDF (A4) - 3.8Mb


HeatWave User Guide  /  ...  /  Supported File Formats and Data Types

4.3.4 Supported File Formats and Data Types

Review the various file formats supported for loading structured, semi-structured, and unstructured data in MySQL HeatWave from different sources.

This topic contains the following sections:

Supported File Formats for Structured and Semi-Structured Data

MySQL HeatWave Lakehouse supports loading structured and semi-structured relational data in the following file formats:

  • Avro

  • CSV

  • JSON

    As of MySQL 8.4.0, Lakehouse supports Newline Delimited JSON (ND-JSON).

  • Parquet

Supported File Formats for Unstructured Data

A Vector store can ingest files from the following unstructured formats: PPT, TXT, HTML, DOC, and PDF. Each file can be upto 100 MB in size.

To load unstructured data, see Ingest Files into a Vector Store.

Parquet File Format

Review the following about Parquet files:

Review the following table of Parquet data type conversions.

Table 4.3 Parquet Data Type Conversions

MySQL Data Type Parquet Logical Data Type Comments
BOOL STRING Valid values: TRUE, FALSE, T, F, true, false, 0, 1
BOOL INT All signed and unsigned integers with any bit width, but the value must be 0 or 1
BOOL None Physical Type FLOAT or DOUBLE
CHAR, VARCHAR, TEXT DECIMAL The maximum precision is 38 and Lakehouse supports the Physical Type FIXED_LEN_BYTE_ARRAY
CHAR, VARCHAR, TEXT STRING
CHAR, VARCHAR, TEXT INT All signed and unsigned integers
CHAR, VARCHAR, TEXT LIST Must be a supported simple (non-nested) list: a list of numeric type (all signed and unsigned integers or Physical Type FLOAT or DOUBLE), STRING or BOOL.
CHAR, VARCHAR, TEXT None Physical Type FLOAT or DOUBLE
DATE STRING
DATE DATE
DATE TIMESTAMP isAdjustedToUTC = false, timeUnit = x: Lakehouse only loads the date portion
DATE None Physical Type INT96: Lakehouse only loads the date portion
DATETIME STRING DATETIME accepts the range from 0000-00-00 to 9999-12-12
DATETIME DATE DATETIME accepts the range from 0000-00-00 to 9999-12-12
DATETIME TIME isAdjustedToUTC = false, timeUnit = x: DATETIME accepts the range from 0000-00-00 to 9999-12-12
DATETIME TIMESTAMP isAdjustedToUTC = false, timeUnit = x: DATETIME accepts the range from 0000-00-00 to 9999-12-12
DATETIME None Physical Type INT96: DATETIME accepts the range from 0000-00-00 to 9999-12-12
DECIMAL DECIMAL The maximum precision is 38 and Lakehouse supports the Physical Type FIXED_LEN_BYTE_ARRAY
DECIMAL STRING
DECIMAL INT All signed and unsigned integers
DECIMAL None Physical Type FLOAT or DOUBLE
DOUBLE, FLOAT DECIMAL The maximum precision is 38 and Lakehouse supports the Physical Type FIXED_LEN_BYTE_ARRAY
DOUBLE, FLOAT STRING
DOUBLE, FLOAT INT All signed and unsigned integers
DOUBLE, FLOAT None Physical Type FLOAT or DOUBLE
ENUM STRING
ENUM ENUM Physical Type BYTE_ARRAY or FIXED_LEN_BYTE_ARRAY
ENUM INT All signed and unsigned integers
ENUM None Physical Type FLOAT or DOUBLE
JSON STRING
JSON JSON
JSON LIST Must be a supported simple (non-nested) list: a list of numeric type (all signed and unsigned integers or Physical Type FLOAT or DOUBLE), STRING or BOOL.
JSON None Physical Type BYTE_ARRAY, Binary encoded. The non-UTF-8 encoding should be JSON binary, see: The JSON Data Type.
TIME STRING
TIME TIME isAdjustedToUTC = false, timeUnit = x
TIMESTAMP STRING TIMESTAMP accepts the range from 1970-01-01 to 2038-01-19. Lakehouse might not load out of range values correctly. Use DATETIME instead.
TIMESTAMP DATE TIMESTAMP accepts the range from 1970-01-01 to 2038-01-19. Lakehouse might not load out of range values correctly. Use DATETIME instead.
TIMESTAMP TIME isAdjustedToUTC = false, timeUnit = x: TIMESTAMP accepts the range from 1970-01-01 to 2038-01-19. Lakehouse might not load out of range values correctly. Use DATETIME instead.
TIMESTAMP TIMESTAMP isAdjustedToUTC = false, timeUnit = x: TIMESTAMP accepts the range from 1970-01-01 to 2038-01-19. Lakehouse might not load out of range values correctly. Use DATETIME instead.
TIMESTAMP None Physical Type INT96: TIMESTAMP accepts the range from 1970-01-01 to 2038-01-19. Lakehouse might not load out of range values correctly. Use DATETIME instead.
All signed and unsigned TINYINT, SMALLINT, MEDIUMINT, INTEGER, INT, BIGINT DECIMAL The maximum precision is 38 and Lakehouse supports the Physical Type FIXED_LEN_BYTE_ARRAY
All signed and unsigned TINYINT, SMALLINT, MEDIUMINT, INTEGER, INT, BIGINT STRING
All signed and unsigned TINYINT, SMALLINT, MEDIUMINT, INTEGER, INT, BIGINT INT All signed and unsigned integers
All signed and unsigned TINYINT, SMALLINT, MEDIUMINT, INTEGER, INT, BIGINT None Physical Type FLOAT or DOUBLE
YEAR STRING The value should be in the range 0 to 99 or the range 1901 to 2155
YEAR INT The value should be in the range 0 to 99 or the range 1901 to 2155
YEAR None Physical Type FLOAT or DOUBLE: The value should be in the range 0 to 99 or the range 1901 to 2155
YEAR None Physical Type INT96: The value should be in the range 1901 to 2155, Lakehouse only loads the year portion
YEAR DATE The value should be in the range 1901 to 2155, Lakehouse only loads the year portion
YEAR TIMESTAMP isAdjustedToUTC = false: The value should be in the range 1901 to 2155, Lakehouse only loads the year portion
VECTOR STRING Should be a proper string-formatted vector: comma-separated numeric values contained within square brackets (for example, [1.1, 2.2, -3.3, 4e-4]). The number of values should be within the limits of the vector's dimension, and the individual values should be in the range of float (32-bits).
VECTOR LIST Must be a simple (non-nested) list of numeric type (All signed and unsigned integers or Physical Type FLOAT or DOUBLE). The list lengths should be within the limits of the vector's dimension, and the values should be in the range of float (32-bits).

Avro File Format

Review the following about Avro files:

Review the following table of Avro data type conversions.

Table 4.4 Avro Data Type Conversions

MySQL Data Type Avro Data Type Avro Logical Data Type Comments
BOOL Boolean None Stored in MySQL as TINYINT
BOOL INT, LONG None
BOOL Float, Double None
BOOL String None Can use TRUE/FALSE, true/false, or 0/1.
BOOL Enum None
BOOL Fixed, Bytes Decimal
TINYINT, SMALLINT, MEDIUMINT, INTEGER, INT, BIGINT Boolean None
TINYINT, SMALLINT, MEDIUMINT, INTEGER, INT, BIGINT Int, Long None Attempts narrowing/widening conversion. An error occurs if there is an overflow.
TINYINT, SMALLINT, MEDIUMINT, INTEGER, INT, BIGINT Float, Double None
TINYINT, SMALLINT, MEDIUMINT, INTEGER, INT, BIGINT String None Attempts parsing of numbers. An error occurs if this is unsuccessful.
TINYINT, SMALLINT, MEDIUMINT, INTEGER, INT, BIGINT Enum None
TINYINT, SMALLINT, MEDIUMINT, INTEGER, INT, BIGINT Fixed, Bytes Decimal
CHAR, VARCHAR, TEXT String None
CHAR, VARCHAR, TEXT String UUID Treated as a regular string.
CHAR, VARCHAR, TEXT Enum
CHAR, VARCHAR, TEXT Fixed, Bytes Decimal The maximum number of digits when converting to string is 1000 digits. If the string representation of the decimal exceeds the maximum, the string is truncated with an error/warning.
DOUBLE, FLOAT, REAL Boolean None
DOUBLE, FLOAT, REAL Int, Long None
DOUBLE, FLOAT, REAL Float, Double None A warning displays about loss of precision when converting Double to Float.
DOUBLE, FLOAT, REAL String None Attempts parsing of numbers. An error occurs if this is unsuccessful.
DOUBLE, FLOAT, REAL Enum None
DOUBLE, FLOAT, REAL Fixed, Bytes Decimal
ENUM Int, Long None
ENUM String None
ENUM Enum None
DECIMAL Int, Long None
DECIMAL Float, Double None
DECIMAL String None Attempts parsing of numbers. An error occurs if this is unsuccessful.
DECIMAL Fixed, Bytes Decimal

Errors/Warnings occur for the following:

  • If the MySQL maximum scale of 30 is exceeded, the decimal is truncated.

  • If the MySQL maximum precision of 65 is exceeded by the number of digits before the decimal place, the value is set to maximum representable, with a report for the overflow.

  • If the MySQL maximum precision of 65 is exceeded, but the number of digits before the decimal place is fewer than 65, then digits after the decimal place are truncated as needed.

TIME Int Time-Millis Value is interpreted as milliseconds since midnight, and corresponds exactly to TIME (3).
TIME Long Time-Micros Value is interpreted as milliseconds since midnight, and corresponds exactly to TIME (6).
TIME String None
DATETIME, TIMESTAMP Long Timestamp-Millis Value is interpreted as milliseconds since January, 1, 1970.
DATETIME, TIMESTAMP Long Timestamp-Micros Value is interpreted as microseconds since January, 1, 1970.
DATETIME, TIMESTAMP Int Date The days since January 1, 1970.
DATETIME, TIMESTAMP String None
DATE Long Timestamp-Millis Value is interpreted as milliseconds since January, 1, 1970.
DATE Long Timestamp-Micros Value is interpreted as microseconds since January, 1, 1970.
DATE Int Date The days since January 1, 1970.
DATE String None
YEAR Long Timestamp-Millis Value is interpreted as milliseconds since January, 1, 1970.
YEAR Long Timestamp-Micros Value is interpreted as microseconds since January, 1, 1970.
YEAR Int Date The days since January 1, 1970.
YEAR String None

Vector Data Type

As of MySQL 9.3.2, Lakehouse supports the VECTOR data type for CSV and Parquet files.

  • For CSV files, Lakehouse processes the data as string-formatted vectors.

  • For Parquet files, Lakehouse processes the data as either string-formatted vectors or in the Parquet List type. To load

Vectors are created using an embedding model. When creating a table with the the VECTOR data type, you can provide the model to use in the ENGINE_ATTRIBUTE for either the column or table.

  • To set the embedding model for all VECTOR columns in the table, load the data using Lakehouse Auto Parallel Load and set the dialect parameter embed_model_id accordingly.

  • To set the embedding model for the column, load the data manually and set the ENGINE_ATTRIBUTE parameter model accordingly. If you set the embedding model for the table, that model is set for all columns in the table.

Review the current list of MySQL HeatWave In-Database Embedding Models.

The following example loads a table using Lakehouse Auto Parallel Load and sets the embedding model for all VECTOR columns in the table to minilm:

mysql> SET @input_list = '[{
  "db_name": "data_db",
  "tables": [{
    "table_name": "table_1",
    "engine_attribute": {
      "dialect": {"format": "csv", "embed_model_id": "minilm"},
      "file": [{"uri": "oci://mybucket@mynamespace/data_files/data_file_1.csv"}]
    }
  }]
}]';
mysql> CALL sys.HEATWAVE_LOAD(CAST(@input_list AS JSON), NULL);

The following example manually loads a table and sets the embedding model for the column to multilingual-e5-small. If you are on a version earlier than MySQL 9.4.0, you must use CREATE TABLE and set ENGINE to lakehouse, and SECONDARY_ENGINE to rapid. See CREATE TABLE Statement with JSON syntax. You cannot set the embedding model with SQL syntax for the CREATE TABLE statement.

mysql> CREATE EXTERNAL TABLE table_1(col_1 VECTOR(5) ENGINE_ATTRIBUTE '{"model": "minilm"}')
        ENGINE_ATTRIBUTE='{"dialect": {"format": "csv"},
                           "file": [{"uri": "oci://mybucket@mynamespace/data_files/data_file_1.csv"}]}';
mysql> ALTER TABLE table_1 SECONDARY_LOAD;

What's Next