MySQL :: HeatWave User Guide :: 4.1.2 Bulk Ingest Data

HeatWave User Guide / ... / Bulk Ingest Data

4.1.2 Bulk Ingest Data

MySQL includes a bulk load extension to the LOAD DATA statement that is only available in HeatWave on AWS. It can do the following:

Optimize the loading of data sorted by primary key.
Optimize the loading of unsorted data.
Optimize the loading of data from an object store.
Optimize the loading of data from a series of files.
Load a MySQL Shell dump file (As of MySQL 8.4.0).
Load ZSTD compressed CSV files (As of MySQL 8.4.0).
Monitor bulk load progress with the Performance Schema (As of MySQL 8.4.0).
Large data support (As of MySQL 9.0.0)

Use a second session to monitor bulk load progress:

If the data is sorted, there is a single stage: loading.
If the data is unsorted, there are two stages: sorting and loading.

See: Section 11.2.1.2, “Bulk Ingest Data to MySQL Server Limitations”.

Bulk Ingest Data Type Support

LOAD DATA with ALGORITHM=BULK supports the following data types.

Table 4.1 Supported Data Types for Bulk Ingest

Data Type	Available MySQL version	Comments
`INT`	All
`SMALLINT`	All
`TINYINT`	All
`BIGINT`	All
`CHAR`	All
`BINARY`	All
`VARCHAR`	All	Before MySQL 9.0.0, the record must fit in the page as there is no large data support.
`VARBINARY`	All	Before MySQL 9.0.0, the record must fit in the page as there is no large data support.
`NUMERIC`	All
`DECIMAL`	All
`UNSIGNED` `NUMERIC`	All
`UNSIGNED` `DECIMAL`	All
`DOUBLE`	All
`FLOAT`	All
`DATE`	All
`DATETIME`	All
`BIT`	As of MySQL 9.0.0
`ENUM`	As of MySQL 9.0.0
`JSON`	As of MySQL 9.0.0
`SET`	As of MySQL 9.0.0
`TIMESTAMP`	As of MySQL 9.0.0
`YEAR`	As of MySQL 9.0.0
`TINYBLOB`	As of MySQL 9.0.0
`BLOB`	As of MySQL 9.0.0
`MEDIUMBLOB`	As of MySQL 9.0.0
`LONGBLOB`	As of MySQL 9.0.0
`TINYTEXT`	As of MySQL 9.0.0
`TEXT`	As of MySQL 9.0.0
`MEDIUMTEXT`	As of MySQL 9.0.0
`LONGTEXT`	As of MySQL 9.0.0
`GEOMETRY`	As of MySQL 9.0.0
`GEOMETRYCOLLECTION`	As of MySQL 9.0.0
`POINT`	As of MySQL 9.0.0
`MULTIPOINT`	As of MySQL 9.0.0
`LINESTRING`	As of MySQL 9.0.0
`MULTILINESTRING`	As of MySQL 9.0.0
`POLYGON`	As of MySQL 9.0.0
`MULTIPOLYGON`	As of MySQL 9.0.0
`VECTOR`	As of MySQL 9.3.0

For the data types that HeatWave supports, see: Section 4.2.1, “Supported Data Types for HeatWave MySQL”

See: Section 11.2.1.2, “Bulk Ingest Data to MySQL Server Limitations”.

Bulk Ingest Syntax

Press CTRL+C to copy
 
mysql> LOAD DATA
  [LOW_PRIORITY | CONCURRENT]
  [FROM]
  {INFILE | URL | URI | S3} {'file_prefix' | 'options'} [COUNT N]
  [IN PRIMARY KEY ORDER]
  INTO TABLE tbl_name
  [CHARACTER SET charset_name] [COMPRESSION = {'ZSTD'}]
  [{FIELDS | COLUMNS}
      [TERMINATED BY 'string']
      [[OPTIONALLY] ENCLOSED BY 'char']
      [ESCAPED BY 'char']
  ]
  [LINES
      [TERMINATED BY 'string']
  ]
  [IGNORE number {LINES | ROWS}]
  [PARALLEL = number]
  [MEMORY = M]
  [ALGORITHM = BULK]

options: {
   JSON_OBJECT("key","value"[,"key","value"] ...)
        "key","value": {
        "url-prefix","prefix"
        ["url-sequence-start",0]
        ["url-suffix","suffix"]
        ["url-prefix-last-append","@"]
        ["is-dryrun",{true|false}]
        }
}

The additional LOAD DATA clauses are:

FROM: Makes the statement more readable.
URL: A URL accessible with a HTTP GET request.
URI: Available as of MySQL 9.4.0. The file location of the URI.
S3: The AWS S3 file location.

This requires the user privilege LOAD_FROM_S3.

See: Section 11.2.1.2, “Bulk Ingest Data to MySQL Server Limitations”.
COUNT: The number of files in a series of files.

For COUNT 5 and file_prefix set to data.csv., the five files would be: data.csv.1, data.csv.2, data.csv.3, data.csv.4, and data.csv.5.
IN PRIMARY KEY ORDER: Use when the data is already sorted. The values should be in ascending order within the file.

For a file series, the primary keys in each file must be disjoint and in ascending order from one file to the next.
PARALLEL: The number of concurrent threads to use. A typical value might be 16, 32 or 48. The default value is 16.

PARALLEL does not require CONCURRENT.
MEMORY: The amount of memory to use. A typical value might be 512M or 4G. The default value is 1G.
ALGORITHM: Set to BULK for bulk load. The file format is CSV.
COMPRESSION: The file compression algorithm. Bulk load supports the ZSTD algorithm.
options is a JSON object literal that includes:
- url-prefix: The common URL prefix for the files to load.
- url-sequence-start: The sequence number for the first file.
  
  The default value is 1, and the minimum value is 0. The value cannot be a negative number. The value can be a string or a number, for example, "134", or "default".
- url-suffix: The file suffix.
- url-prefix-last-append: The string to append to the prefix of the last file.
  
  This supports MySQL Shell dump files.
- is-dryrun: Set to true to run basic checks and report if bulk load is possible on the given table. The default value is false.
  
  To enable is-dryrun, use any of the following values: true, "true", "1", "on" or 1.
  
  To disable is-dryrun, use any of the following values: false, "false", "0", "off" or 0.

Unsupported Bulk Ingest Clauses

LOAD DATA with ALGORITHM=BULK does not support these clauses:

Press CTRL+C to copy
 
LOAD DATA
  [LOCAL]
  [REPLACE | IGNORE]
  [PARTITION (partition_name [, partition_name] ...)]
  ]
  [LINES
      [STARTING BY 'string']
  ]
  [(col_name_or_user_var
      [, col_name_or_user_var] ...)]
  [SET col_name={expr | DEFAULT}
      [, col_name={expr | DEFAULT}] ...]

Syntax Examples

An example that loads unsorted data from AWS S3 with 48 concurrent threads and 4G of memory:

Press CTRL+C to copy
 
mysql> GRANT LOAD_FROM_S3 ON *.* TO load_user@localhost;

mysql> LOAD DATA FROM S3 's3-us-east-1://innodb-bulkload-dev-1/lineitem.tbl'
        INTO TABLE lineitem
        FIELDS TERMINATED BY "|"
        OPTIONALLY ENCLOSED BY '"'
        LINES TERMINATED BY '\n'
        PARALLEL = 48
        MEMORY = 4G
        ALGORITHM=BULK;

An example that loads eight files of sorted data from AWS S3. The file_prefix ends with a period. The files are lineitem.tbl.1, lineitem.tbl.2, ... lineitem.tbl.8:

Press CTRL+C to copy
 
mysql> GRANT LOAD_FROM_S3 ON *.* TO load_user@localhost;

mysql> LOAD DATA FROM S3 's3-us-east-1://innodb-bulkload-dev-1/lineitem.tbl.' COUNT 8
        IN PRIMARY KEY ORDER
        INTO TABLE lineitem
        FIELDS TERMINATED BY "|"
        OPTIONALLY ENCLOSED BY '"'
        LINES TERMINATED BY '\n'
        ALGORITHM=BULK;

An example that performs a dry run on a sequence of MySQL Shell dump files compressed with the ZSTD algorithm:

Press CTRL+C to copy
 
mysql> GRANT LOAD_FROM_URL ON *.* TO load_user@localhost;

mysql> LOAD DATA FROM URL 
        '{"url-prefix","https://example.com/bucket/test@lineitem@","url-sequence-start",0,"url-suffix",".tsv.zst","url-prefix-last-append","@","is-dryrun",true}'
        COUNT 20
        INTO TABLE lineitem
        CHARACTER SET ???? COMPRESSION = {'ZSTD'}
        FIELDS TERMINATED BY "|"
        OPTIONALLY ENCLOSED BY '"'
        LINES TERMINATED BY '\n'
        IGNORE 20000 LINES
        ALGORITHM=BULK;

An example that loads data with the URI keyword (supported as of MySQL 9.4.0):

Press CTRL+C to copy
 
mysql> GRANT LOAD_FROM_URL ON *.* TO load_user@localhost;
  
mysql> LOAD DATA FROM URI 'https://data_files.com/data_files_1.tbl'
        INTO TABLE lineitem
        FIELDS TERMINATED BY "|"
        OPTIONALLY ENCLOSED BY '"'
        LINES TERMINATED BY '\n'
        ALGORITHM=BULK;

An example that monitors bulk load progress in a second session.

Review the list of stages with the following query:

Press CTRL+C to copy
 
mysql> SELECT NAME, ENABLED, TIMED FROM performance_schema.setup_instruments
        WHERE ENABLED='YES' AND NAME LIKE "stage/bulk_load%";

Enable the events_stages_current with the following query:

Press CTRL+C to copy
 
mysql> UPDATE performance_schema.setup_consumers
        SET ENABLED = 'YES' WHERE NAME LIKE 'events_stages_current';

Use one session to run bulk load, and monitor progress in a second session:

Press CTRL+C to copy
 
mysql> SELECT thread_id, event_id, event_name, WORK_ESTIMATED,  WORK_COMPLETED 
        FROM performance_schema.events_stages_current;
--------------
SELECT thread_id, event_id, event_name, WORK_ESTIMATED,  WORK_COMPLETED FROM performance_schema.events_stages_current
--------------

+-----------+----------+----------------------------------+----------------+----------------+
| thread_id | event_id | event_name                       | WORK_ESTIMATED | WORK_COMPLETED |
+-----------+----------+----------------------------------+----------------+----------------+
|        49 |        5 | stage/bulk_load_unsorted/sorting |     1207551343 |      583008145 |
+-----------+----------+----------------------------------+----------------+----------------+

PREV HOME UP NEXT