MySQL includes a bulk load extension to the
LOAD DATA statement. It can do the
following:
Optimize the loading of data sorted by primary key.
Optimize the loading of unsorted data.
Optimize the loading of data from an object store.
Optimize the loading of data from a series of files.
Load a MySQL Shell dump file.
Load ZSTD compressed CSV files.
Monitor bulk load progress with the Performance Schema.
Large data support.
Support for tables with generated columns. This includes tables with stored and virtual generated columns. Stored generated columns can be part of secondary indexes and the primary key. Virtual generated columns can be part of secondary indexes. Tables can have a combination of multiple stored and virtual generated columns. The input CSV files must contain data for the stored and virtual generated columns. See the example following this list.
Support for bulk load into tables with existing data (non-empty tables). The bulk load operation compares the records from the input CSV files with the existing table data, and then inserts the data into the existing table so that it remains correctly sorted.
Use a second session to monitor bulk load progress:
If the data is sorted, there is a single stage: loading. If the data is unsorted, there are two stages: sorting and loading.
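For illustration only, a minimal sketch of a table with one stored and one virtual generated column, each in a secondary index, followed by a bulk load of that table. The table name, columns, and file path are hypothetical, not taken from the reference examples:

mysql> CREATE TABLE orders (
         id BIGINT NOT NULL PRIMARY KEY,
         price DECIMAL(10,2) NOT NULL,
         quantity INT NOT NULL,
         -- stored generated column; may be part of a secondary index
         total DECIMAL(12,2) GENERATED ALWAYS AS (price * quantity) STORED,
         -- virtual generated column; may be part of a secondary index
         discounted DECIMAL(12,2) GENERATED ALWAYS AS (price * 0.9) VIRTUAL,
         KEY idx_total (total),
         KEY idx_discounted (discounted));

mysql> LOAD DATA FROM INFILE '/var/lib/mysql-files/orders.csv'
       INTO TABLE orders
       FIELDS TERMINATED BY ','
       LINES TERMINATED BY '\n'
       ALGORITHM = BULK;

In this sketch the CSV file supplies values for every column, including the generated columns, as described above.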
LOAD DATA with
ALGORITHM=BULK supports tables with at least
one column with the VECTOR data
type. If you attempt to load a table without at least one column
with the VECTOR data type, an
error occurs.
In addition to the requirement to have at least one
VECTOR column,
LOAD DATA with
ALGORITHM=BULK supports the following data
types:
The LOAD DATA statement with the bulk load extension has the following syntax:

LOAD DATA
    [LOW_PRIORITY | CONCURRENT]
    [FROM]
    INFILE | URL | S3 'file_prefix' | 'options' [COUNT N]
    [IN PRIMARY KEY ORDER]
    INTO TABLE tbl_name
    [CHARACTER SET charset_name] [COMPRESSION = {'ZSTD'}]
    [{FIELDS | COLUMNS}
        [TERMINATED BY 'string']
        [[OPTIONALLY] ENCLOSED BY 'char']
        [ESCAPED BY 'char']
    ]
    [LINES
        [TERMINATED BY 'string']
    ]
    [IGNORE number {LINES | ROWS}]
    [PARALLEL = number]
    [MEMORY = M]
    [ALGORITHM = BULK]
options: {
    JSON_OBJECT("key","value"[,"key","value"] ...)
    "key","value": {
        "url-prefix","prefix"
        ["url-sequence-start",0]
        ["url-suffix","suffix"]
        ["url-prefix-last-append","@"]
        ["is-dryrun",{true|false}]
    }
}
The additional LOAD DATA clauses are:
FROM: Makes the statement more readable.
URL: A URL accessible with an HTTP GET request. This requires the LOAD_FROM_URL privilege.
S3: The AWS S3 file location. This requires the LOAD_FROM_S3 privilege.
COUNT: The number of files in a series of files (see the sketch following this list). For COUNT 5 and file_prefix set to data.csv., the five files would be: data.csv.1, data.csv.2, data.csv.3, data.csv.4, and data.csv.5.
IN PRIMARY KEY ORDER: Use when the data is already sorted. The values should be in ascending order within the file. For a file series, the primary keys in each file must be disjoint and in ascending order from one file to the next.
PARALLEL: The number of concurrent threads to use. A typical value might be 16, 32, or 48. The default value is 16. PARALLEL does not require CONCURRENT.
MEMORY: The amount of memory to use. A typical value might be 512M or 4G. The default value is 1G.
ALGORITHM: Set to BULK for bulk load. The file format is CSV.
COMPRESSION: The file compression algorithm. Bulk load supports the ZSTD algorithm.
options: A JSON object literal that includes:
  url-prefix: The common URL prefix for the files to load.
  url-sequence-start: The sequence number for the first file. The default value is 1, and the minimum value is 0. The value cannot be negative. The value can be a string or a number, for example, "134" or "default".
  url-suffix: The file suffix.
  url-prefix-last-append: The string to append to the prefix of the last file. This supports MySQL Shell dump files.
  is-dryrun: Set to true to run basic checks and report whether bulk load is possible on the given table. The default value is false. To enable is-dryrun, use any of the following values: true, "true", "1", "on", or 1. To disable is-dryrun, use any of the following values: false, "false", "0", "off", or 0.
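For illustration only, a minimal sketch that combines several of these clauses to load a sorted series of four server-side files (data.csv.1 through data.csv.4). The table name, file prefix, and resource values are hypothetical, and the sketch assumes that COUNT applies to an INFILE prefix in the same way as the grammar above indicates for S3 and URL:

mysql> LOAD DATA FROM INFILE '/var/lib/mysql-files/data.csv.' COUNT 4
       IN PRIMARY KEY ORDER
       INTO TABLE t1
       FIELDS TERMINATED BY ','
       LINES TERMINATED BY '\n'
       PARALLEL = 32
       MEMORY = 2G
       ALGORITHM = BULK;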
LOAD DATA with
ALGORITHM=BULK does not support these
clauses:
LOAD DATA
    [LOCAL]
    [REPLACE | IGNORE]
    [PARTITION (partition_name [, partition_name] ...)]
    [LINES
        [STARTING BY 'string']
    ]
    [(col_name_or_user_var
        [, col_name_or_user_var] ...)]
    [SET col_name={expr | DEFAULT}
        [, col_name={expr | DEFAULT}] ...]
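For example, a statement like the following, which combines the unsupported REPLACE clause with ALGORITHM=BULK, is expected to be rejected. The table name and file path are hypothetical and the sketch is shown only to illustrate the restriction:

mysql> LOAD DATA FROM INFILE '/var/lib/mysql-files/data.csv'
       REPLACE INTO TABLE t1
       FIELDS TERMINATED BY ','
       LINES TERMINATED BY '\n'
       ALGORITHM = BULK;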
An example that loads unsorted data from AWS S3 with 48 concurrent threads and 4G of memory:
mysql> GRANT LOAD_FROM_S3 ON *.* TO load_user@localhost;

mysql> LOAD DATA FROM S3 's3-us-east-1://innodb-bulkload-dev-1/lineitem.tbl'
       INTO TABLE lineitem
       FIELDS TERMINATED BY "|"
       OPTIONALLY ENCLOSED BY '"'
       LINES TERMINATED BY '\n'
       PARALLEL = 48
       MEMORY = 4G
       ALGORITHM=BULK;
An example that loads eight files of sorted data from AWS S3. The file_prefix ends with a period. The files are lineitem.tbl.1, lineitem.tbl.2, ... lineitem.tbl.8:

mysql> GRANT LOAD_FROM_S3 ON *.* TO load_user@localhost;

mysql> LOAD DATA FROM S3 's3-us-east-1://innodb-bulkload-dev-1/lineitem.tbl.' COUNT 8
       IN PRIMARY KEY ORDER
       INTO TABLE lineitem
       FIELDS TERMINATED BY "|"
       OPTIONALLY ENCLOSED BY '"'
       LINES TERMINATED BY '\n'
       ALGORITHM=BULK;
An example that performs a dry run on a sequence of MySQL Shell dump files compressed with the ZSTD algorithm:
mysql> GRANT LOAD_FROM_URL ON *.* TO load_user@localhost;

mysql> LOAD DATA FROM URL
       '{"url-prefix","https://example.com/bucket/test@lineitem@","url-sequence-start",0,"url-suffix",".tsv.zst","url-prefix-last-append","@","is-dryrun",true}'
       COUNT 20
       INTO TABLE lineitem
       CHARACTER SET ????
       COMPRESSION = 'ZSTD'
       FIELDS TERMINATED BY "|"
       OPTIONALLY ENCLOSED BY '"'
       LINES TERMINATED BY '\n'
       IGNORE 20000 LINES
       ALGORITHM=BULK;
An example that loads data with the URI keyword (supported as of MySQL 9.4.0):
mysql> GRANT LOAD_FROM_URL ON *.* TO load_user@localhost;

mysql> LOAD DATA FROM URI 'https://data_files.com/data_files_1.tbl'
       INTO TABLE lineitem
       FIELDS TERMINATED BY "|"
       OPTIONALLY ENCLOSED BY '"'
       LINES TERMINATED BY '\n'
       ALGORITHM=BULK;
An example that monitors bulk load progress in a second session:
Review the list of stages with the following query:
mysql> SELECT NAME, ENABLED, TIMED
       FROM performance_schema.setup_instruments
       WHERE ENABLED='YES' AND NAME LIKE "stage/bulk_load%";
Enable the events_stages_current consumer with the following query:

mysql> UPDATE performance_schema.setup_consumers
       SET ENABLED = 'YES'
       WHERE NAME LIKE 'events_stages_current';
Use one session to run bulk load, and monitor progress in a second session:

mysql> SELECT thread_id, event_id, event_name, WORK_ESTIMATED, WORK_COMPLETED
       FROM performance_schema.events_stages_current;
+-----------+----------+----------------------------------+----------------+----------------+
| thread_id | event_id | event_name                       | WORK_ESTIMATED | WORK_COMPLETED |
+-----------+----------+----------------------------------+----------------+----------------+
|        49 |        5 | stage/bulk_load_unsorted/sorting |     1207551343 |      583008145 |
+-----------+----------+----------------------------------+----------------+----------------+
1 row in set (0.00 sec)
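To express progress as a percentage, a variation of the same query can divide WORK_COMPLETED by WORK_ESTIMATED. This is a minimal sketch; the filter assumes the stage/bulk_load% instruments shown above:

mysql> SELECT event_name,
              WORK_COMPLETED,
              WORK_ESTIMATED,
              ROUND(100 * WORK_COMPLETED / WORK_ESTIMATED, 2) AS pct_complete
       FROM performance_schema.events_stages_current
       WHERE event_name LIKE 'stage/bulk_load%';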