Review the various file formats supported for loading structured, semi-structured, and unstructured data in MySQL HeatWave from different sources.
This topic contains the following sections:
-
Review the following:
MySQL HeatWave Lakehouse supports loading structured and semi-structured relational data in the following file formats:
Avro
CSV
-
JSON
As of MySQL 8.4.0, Lakehouse supports Newline Delimited JSON (ND-JSON).
Parquet
A Vector store can ingest files from the following unstructured formats: PPT, TXT, HTML, DOC, and PDF. Each file can be upto 100 MB in size.
To load unstructured data, see Ingest Files into a Vector Store.
Review the following about Parquet files:
Review the following table of Parquet data type conversions.
Table 4.3 Parquet Data Type Conversions
MySQL Data Type | Parquet Logical Data Type | Comments |
---|---|---|
BOOL |
STRING |
Valid values: TRUE , FALSE ,
T , F ,
true , false ,
0 , 1
|
BOOL |
INT |
All signed and unsigned integers with any bit width, but the value must
be 0 or 1
|
BOOL |
None |
Physical Type FLOAT or DOUBLE
|
CHAR ,
VARCHAR ,
TEXT
|
DECIMAL |
The maximum precision is 38 and Lakehouse supports the Physical Type
FIXED_LEN_BYTE_ARRAY
|
CHAR ,
VARCHAR ,
TEXT
|
STRING |
|
CHAR ,
VARCHAR ,
TEXT
|
INT |
All signed and unsigned integers |
CHAR ,
VARCHAR ,
TEXT
|
LIST |
Must be a supported simple (non-nested) list: a list of numeric type
(all signed and unsigned integers or Physical Type
FLOAT or DOUBLE ),
STRING or BOOL . |
CHAR ,
VARCHAR ,
TEXT
|
None |
Physical Type FLOAT or DOUBLE
|
DATE |
STRING |
|
DATE |
DATE |
|
DATE |
TIMESTAMP |
isAdjustedToUTC = false ,
timeUnit = x :
Lakehouse only loads the date portion |
DATE |
None |
Physical Type INT96 : Lakehouse only loads the date
portion |
DATETIME |
STRING |
DATETIME accepts the range from
0000-00-00 to 9999-12-12 |
DATETIME |
DATE |
DATETIME accepts the range from
0000-00-00 to 9999-12-12 |
DATETIME |
TIME |
isAdjustedToUTC = false ,
timeUnit = x :
DATETIME accepts the
range from 0000-00-00 to 9999-12-12 |
DATETIME |
TIMESTAMP |
isAdjustedToUTC = false ,
timeUnit = x :
DATETIME accepts the
range from 0000-00-00 to 9999-12-12 |
DATETIME |
None |
Physical Type INT96 :
DATETIME accepts the
range from 0000-00-00 to 9999-12-12 |
DECIMAL |
DECIMAL |
The maximum precision is 38 and Lakehouse supports the Physical Type
FIXED_LEN_BYTE_ARRAY
|
DECIMAL |
STRING |
|
DECIMAL |
INT |
All signed and unsigned integers |
DECIMAL |
None |
Physical Type FLOAT or DOUBLE
|
DOUBLE ,
FLOAT
|
DECIMAL |
The maximum precision is 38 and Lakehouse supports the Physical Type
FIXED_LEN_BYTE_ARRAY
|
DOUBLE ,
FLOAT
|
STRING |
|
DOUBLE ,
FLOAT
|
INT |
All signed and unsigned integers |
DOUBLE ,
FLOAT
|
None |
Physical Type FLOAT or DOUBLE
|
ENUM |
STRING |
|
ENUM |
ENUM |
Physical Type BYTE_ARRAY or
FIXED_LEN_BYTE_ARRAY
|
ENUM |
INT |
All signed and unsigned integers |
ENUM |
None |
Physical Type FLOAT or DOUBLE
|
JSON |
STRING |
|
JSON |
JSON |
|
JSON |
LIST |
Must be a supported simple (non-nested) list: a list of numeric type
(all signed and unsigned integers or Physical Type
FLOAT or DOUBLE ),
STRING or BOOL . |
JSON |
None |
Physical Type BYTE_ARRAY , Binary encoded. The
non-UTF-8 encoding should be JSON binary, see:
The JSON Data Type. |
TIME |
STRING |
|
TIME |
TIME |
isAdjustedToUTC = false ,
timeUnit = x
|
TIMESTAMP |
STRING |
TIMESTAMP accepts the range from
1970-01-01 to 2038-01-19. Lakehouse might not load out
of range values correctly. Use
DATETIME instead. |
TIMESTAMP |
DATE |
TIMESTAMP accepts the range from
1970-01-01 to 2038-01-19. Lakehouse might not load out
of range values correctly. Use
DATETIME instead. |
TIMESTAMP |
TIME |
isAdjustedToUTC = false ,
timeUnit = x :
TIMESTAMP accepts the
range from 1970-01-01 to 2038-01-19. Lakehouse might
not load out of range values correctly. Use
DATETIME instead. |
TIMESTAMP |
TIMESTAMP |
isAdjustedToUTC = false ,
timeUnit = x :
TIMESTAMP accepts the
range from 1970-01-01 to 2038-01-19. Lakehouse might
not load out of range values correctly. Use
DATETIME instead. |
TIMESTAMP |
None |
Physical Type INT96 :
TIMESTAMP accepts the
range from 1970-01-01 to 2038-01-19. Lakehouse might
not load out of range values correctly. Use
DATETIME instead. |
All signed and unsigned TINYINT ,
SMALLINT ,
MEDIUMINT ,
INTEGER ,
INT ,
BIGINT
|
DECIMAL |
The maximum precision is 38 and Lakehouse supports the Physical Type
FIXED_LEN_BYTE_ARRAY
|
All signed and unsigned TINYINT ,
SMALLINT ,
MEDIUMINT ,
INTEGER ,
INT ,
BIGINT
|
STRING |
|
All signed and unsigned TINYINT ,
SMALLINT ,
MEDIUMINT ,
INTEGER ,
INT ,
BIGINT
|
INT |
All signed and unsigned integers |
All signed and unsigned TINYINT ,
SMALLINT ,
MEDIUMINT ,
INTEGER ,
INT ,
BIGINT
|
None |
Physical Type FLOAT or DOUBLE
|
YEAR |
STRING |
The value should be in the range 0 to
99 or the range
1901 to 2155
|
YEAR |
INT |
The value should be in the range 0 to
99 or the range
1901 to 2155
|
YEAR |
None |
Physical Type FLOAT or DOUBLE : The
value should be in the range 0 to
99 or the range
1901 to 2155
|
YEAR |
None |
Physical Type INT96 : The value should be in the range
1901 to 2155 ,
Lakehouse only loads the year portion |
YEAR |
DATE |
The value should be in the range 1901 to
2155 , Lakehouse only loads the year
portion |
YEAR |
TIMESTAMP |
isAdjustedToUTC = false : The value
should be in the range 1901 to
2155 , Lakehouse only loads the year
portion |
VECTOR |
STRING |
Should be a proper string-formatted vector: comma-separated numeric values contained within square brackets (for example, [1.1, 2.2, -3.3, 4e-4]). The number of values should be within the limits of the vector's dimension, and the individual values should be in the range of float (32-bits). |
VECTOR |
LIST |
Must be a simple (non-nested) list of numeric type (All signed and
unsigned integers or Physical Type
FLOAT or DOUBLE ).
The list lengths should be within the limits of the
vector's dimension, and the values should be in the
range of float (32-bits). |
Review the following about Avro files:
Review the following table of Avro data type conversions.
Table 4.4 Avro Data Type Conversions
MySQL Data Type | Avro Data Type | Avro Logical Data Type | Comments |
---|---|---|---|
BOOL |
Boolean |
None |
Stored in MySQL as TINYINT
|
BOOL |
INT , LONG
|
None |
|
BOOL |
Float , Double
|
None |
|
BOOL |
String |
None |
Can use TRUE /FALSE ,
true /false , or
0 /1 . |
BOOL |
Enum |
None |
|
BOOL |
Fixed , Bytes
|
Decimal |
|
TINYINT ,
SMALLINT ,
MEDIUMINT ,
INTEGER ,
INT ,
BIGINT
|
Boolean |
None |
|
TINYINT ,
SMALLINT ,
MEDIUMINT ,
INTEGER ,
INT ,
BIGINT
|
Int , Long
|
None |
Attempts narrowing/widening conversion. An error occurs if there is an overflow. |
TINYINT ,
SMALLINT ,
MEDIUMINT ,
INTEGER ,
INT ,
BIGINT
|
Float , Double
|
None |
|
TINYINT ,
SMALLINT ,
MEDIUMINT ,
INTEGER ,
INT ,
BIGINT
|
String |
None |
Attempts parsing of numbers. An error occurs if this is unsuccessful. |
TINYINT ,
SMALLINT ,
MEDIUMINT ,
INTEGER ,
INT ,
BIGINT
|
Enum |
None |
|
TINYINT ,
SMALLINT ,
MEDIUMINT ,
INTEGER ,
INT ,
BIGINT
|
Fixed , Bytes
|
Decimal |
|
CHAR ,
VARCHAR ,
TEXT
|
String |
None |
|
CHAR ,
VARCHAR ,
TEXT
|
String |
UUID |
Treated as a regular string. |
CHAR ,
VARCHAR ,
TEXT
|
Enum |
||
CHAR ,
VARCHAR ,
TEXT
|
Fixed , Bytes
|
Decimal |
The maximum number of digits when converting to string is 1000 digits. If the string representation of the decimal exceeds the maximum, the string is truncated with an error/warning. |
DOUBLE ,
FLOAT ,
REAL
|
Boolean |
None |
|
DOUBLE ,
FLOAT ,
REAL
|
Int , Long
|
None |
|
DOUBLE ,
FLOAT ,
REAL
|
Float , Double
|
None |
A warning displays about loss of precision when converting
Double to Float . |
DOUBLE ,
FLOAT ,
REAL
|
String |
None |
Attempts parsing of numbers. An error occurs if this is unsuccessful. |
DOUBLE ,
FLOAT ,
REAL
|
Enum |
None |
|
DOUBLE ,
FLOAT ,
REAL
|
Fixed , Bytes
|
Decimal |
|
ENUM |
Int , Long
|
None |
|
ENUM |
String |
None |
|
ENUM |
Enum |
None |
|
DECIMAL |
Int , Long
|
None |
|
DECIMAL |
Float , Double
|
None |
|
DECIMAL |
String |
None | Attempts parsing of numbers. An error occurs if this is unsuccessful. |
DECIMAL |
Fixed , Bytes
|
Decimal |
Errors/Warnings occur for the following:
|
TIME |
Int |
Time-Millis |
Value is interpreted as milliseconds since midnight, and corresponds
exactly to TIME (3). |
TIME |
Long |
Time-Micros |
Value is interpreted as milliseconds since midnight, and corresponds
exactly to TIME (6). |
TIME |
String |
None | |
DATETIME ,
TIMESTAMP
|
Long |
Timestamp-Millis |
Value is interpreted as milliseconds since January, 1, 1970. |
DATETIME ,
TIMESTAMP
|
Long |
Timestamp-Micros |
Value is interpreted as microseconds since January, 1, 1970. |
DATETIME ,
TIMESTAMP
|
Int |
Date | The days since January 1, 1970. |
DATETIME ,
TIMESTAMP
|
String |
None | |
DATE |
Long |
Timestamp-Millis |
Value is interpreted as milliseconds since January, 1, 1970. |
DATE |
Long |
Timestamp-Micros |
Value is interpreted as microseconds since January, 1, 1970. |
DATE |
Int |
Date | The days since January 1, 1970. |
DATE |
String |
None | |
YEAR |
Long |
Timestamp-Millis |
Value is interpreted as milliseconds since January, 1, 1970. |
YEAR |
Long |
Timestamp-Micros |
Value is interpreted as microseconds since January, 1, 1970. |
YEAR |
Int |
Date | The days since January 1, 1970. |
YEAR |
String |
None |
As of MySQL 9.3.2, Lakehouse supports the
VECTOR
data type for CSV and
Parquet files.
For CSV files, Lakehouse processes the data as string-formatted vectors.
For Parquet files, Lakehouse processes the data as either string-formatted vectors or in the Parquet
List
type. To load
Vectors are created using an embedding model. When creating a
table with the the VECTOR
data
type, you can provide the model to use in the
ENGINE_ATTRIBUTE
for either the column or
table.
To set the embedding model for all
VECTOR
columns in the table, load the data using Lakehouse Auto Parallel Load and set thedialect
parameterembed_model_id
accordingly.To set the embedding model for the column, load the data manually and set the
ENGINE_ATTRIBUTE
parametermodel
accordingly. If you set the embedding model for the table, that model is set for all columns in the table.
Review the current list of MySQL HeatWave In-Database Embedding Models.
The following example loads a table using Lakehouse Auto Parallel Load and sets the
embedding model for all VECTOR
columns in the table to minilm
:
mysql> SET @input_list = '[{
"db_name": "data_db",
"tables": [{
"table_name": "table_1",
"engine_attribute": {
"dialect": {"format": "csv", "embed_model_id": "minilm"},
"file": [{"uri": "oci://mybucket@mynamespace/data_files/data_file_1.csv"}]
}
}]
}]';
mysql> CALL sys.HEATWAVE_LOAD(CAST(@input_list AS JSON), NULL);
The following example manually loads a table and sets the
embedding model for the column to
multilingual-e5-small
. If you are on a
version earlier than MySQL 9.4.0, you must use CREATE
TABLE
and set ENGINE
to
lakehouse
, and
SECONDARY_ENGINE
to
rapid
. See
CREATE TABLE Statement
with JSON syntax. You cannot set the embedding model with SQL
syntax for the CREATE TABLE
statement.
mysql> CREATE EXTERNAL TABLE table_1(col_1 VECTOR(5) ENGINE_ATTRIBUTE '{"model": "minilm"}')
ENGINE_ATTRIBUTE='{"dialect": {"format": "csv"},
"file": [{"uri": "oci://mybucket@mynamespace/data_files/data_file_1.csv"}]}';
mysql> ALTER TABLE table_1 SECONDARY_LOAD;
Review MySQL HeatWave Lakehouse Limitations for supported file formats.
Review how to Access Object Storage.
Learn how to load structured data using Lakehouse Auto Parallel Load or manually.