
7.3 Parallel Table Import Utility

MySQL Shell's parallel table import utility, introduced in MySQL Shell 8.0.17, provides rapid data import to a MySQL relational table for large data files. The utility analyzes an input data file, divides it into chunks, and uploads the chunks to the target MySQL server using parallel connections. The utility is capable of completing a large data import many times faster than a standard single-threaded upload using a LOAD DATA statement.

When you invoke the parallel table import utility, you specify the mapping between the fields in the data file and the columns in the MySQL table. You can set field- and line-handling options as for the LOAD DATA command to handle data files in arbitrary formats. The default dialect for the utility maps to a file created using a SELECT...INTO OUTFILE statement with the default settings for that statement. The utility also has preset dialects that map to the standard data formats for CSV files (created on DOS or UNIX systems), TSV files, and JSON, and you can customize these using the field- and line-handling options as necessary. Note that JSON data must be in document-per-line format.
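
For example, JSON input in document-per-line format contains one complete document on each line, with no enclosing array or trailing commas (the field names here are illustrative):

{"product_id": 1, "name": "Alpha"}
{"product_id": 2, "name": "Beta"}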

The parallel table import utility requires an existing classic MySQL protocol connection to the target MySQL server. Each thread opens its own session to send chunks of the data to the MySQL server. You can adjust the number of threads, number of bytes sent in each chunk, and maximum rate of data transfer per thread, to balance the load on the network and the speed of data transfer. The utility cannot operate over X Protocol connections, which do not support LOAD DATA statements.

The parallel table import utility uses LOAD DATA LOCAL INFILE statements to upload data chunks from the input file, so the data file to be imported must be in a location that is accessible to the client host as a local disk. The local_infile system variable must be set to ON on the target server. You can do this by issuing the following statement in SQL mode before running the parallel table import utility:

SET GLOBAL local_infile = 1;
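
To verify the current setting, you can issue the following statement in SQL mode:

SHOW GLOBAL VARIABLES LIKE 'local_infile';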

To avoid a known potential security issue with LOAD DATA LOCAL, when the MySQL server replies to the parallel table import utility's LOAD DATA requests with file transfer requests, the utility only sends the predetermined data chunks, and ignores any specific requests attempted by the server. For more information, see Security Issues with LOAD DATA LOCAL.

In the MySQL Shell API, the parallel table import utility is a function of the util global object, and has the following signature:

importTable(filename, options)

filename is a string specifying the name and path for the file containing the data to be imported. On Windows, backslashes must be escaped in the file path, or you can use forward slashes instead. The data file to be imported must be in a location that is accessible to the client host as a local disk. The data is imported to the MySQL server to which the active MySQL session is connected.
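
For example, on Windows, either of the following invocations in Python mode refers to the same file (the path and table name here are illustrative):

mysql-py> util.import_table("C:\\data\\products.csv", {"table": "products"})
mysql-py> util.import_table("C:/data/products.csv", {"table": "products"})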

options is a dictionary of import options that can be omitted if it is empty. The following options are available to specify where and how the data is imported:

schema: "db_name"

The name of the target database on the connected MySQL server. If you omit this option, the utility attempts to identify and use the schema name in use for the current MySQL Shell session, as specified in a connection URI string, \use command, or MySQL Shell option. If the schema name is not specified and cannot be identified from the session, an error is returned.

table: "table_name"

The name of the target relational table. If you omit this option, the utility assumes the table name is the name of the data file without the extension. The target table must exist in the target database.

columns: array of column names

An array of strings containing column names from the data file, given in the order that they map to columns in the target relational table. Use this option if the import file does not contain all the columns of the target table, or if the order of the fields in the import file differs from the order of the columns in the table. If you omit this option, input lines are expected to contain a matching field for each column in the target table.
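
For example, if each input line contains only an ID and a price, in that order, an invocation such as the following (with illustrative file, table, and column names) maps those two fields to the corresponding columns of the target table:

mysql-py> util.import_table("/tmp/prices.csv", {"table": "products", "columns": ["product_id", "price"]})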

skipRows: number

Skip this number of rows of data at the beginning of the file. You can use this option to omit an initial header line containing column names from the upload to the table. The default is that no rows are skipped.

replaceDuplicates: [true|false]

Whether input rows that have the same value for a primary key or unique index as an existing row should be replaced (true) or skipped (false). The default is false.

dialect: [default|csv|csv-unix|tsv|json]

Use a set of field- and line-handling options appropriate for the specified file format. You can use the selected dialect as a base for further customization, by also specifying one or more of the linesTerminatedBy, fieldsTerminatedBy, fieldsEnclosedBy, fieldsOptionallyEnclosed, and fieldsEscapedBy options to change the settings. The default dialect maps to a file created using a SELECT...INTO OUTFILE statement with the default settings for that statement. Other dialects are available to suit CSV files (created on either DOS or UNIX systems), TSV files, and JSON data. The settings applied for each dialect are as follows:

Table 7.1 Dialect settings for parallel table import utility

dialect    linesTerminatedBy  fieldsTerminatedBy  fieldsEnclosedBy  fieldsOptionallyEnclosed  fieldsEscapedBy
default    [LF]               [TAB]               [empty]           false                     \
csv        [CR][LF]           ,                   "                 true                      \
csv-unix   [LF]               ,                   "                 false                     \
tsv        [CR][LF]           [TAB]               "                 true                      \
json       [LF]               [LF]                [empty]           false                     [empty]

Note
  1. The carriage return and line feed values for the dialects are operating system independent.

  2. If you use the linesTerminatedBy, fieldsTerminatedBy, fieldsEnclosedBy, fieldsOptionallyEnclosed, and fieldsEscapedBy options, depending on the escaping conventions of your command interpreter, the backslash character (\) might need to be doubled if you use it in the option values.

  3. Like the MySQL server with the LOAD DATA statement, MySQL Shell does not validate the field- and line-handling options that you specify. Inaccurate settings can cause data to be imported into the wrong fields, only partially, or incorrectly. Always verify your settings before starting the import, and verify the results afterwards.
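
As an illustration of customizing a dialect, the following invocation (with hypothetical file and table names) uses the csv dialect as a base but overrides its field terminator for a file that separates fields with semicolons:

mysql-py> util.import_table("/tmp/products.csv", {"table": "products", "dialect": "csv", "fieldsTerminatedBy": ";"})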

linesTerminatedBy: "characters"

One or more characters (or an empty string) that terminates each of the lines in the input data file. The default is as for the specified dialect, or a linefeed character (\n) if the dialect option is omitted. This option is equivalent to the LINES TERMINATED BY option for the LOAD DATA statement. Note that the utility does not provide an equivalent for the LINES STARTING BY option for the LOAD DATA statement, which is set to the empty string.

fieldsTerminatedBy: "characters"

One or more characters (or an empty string) that terminates each of the fields in the input data file. The default is as for the specified dialect, or a tab character (\t) if the dialect option is omitted. This option is equivalent to the FIELDS TERMINATED BY option for the LOAD DATA statement.

fieldsEnclosedBy: "character"

A single character (or an empty string) that encloses each of the fields in the input data file. The default is as for the specified dialect, or the empty string if the dialect option is omitted. This option is equivalent to the FIELDS ENCLOSED BY option for the LOAD DATA statement.

fieldsOptionallyEnclosed: [ true | false ]

Whether the character given for fieldsEnclosedBy encloses all of the fields in the input data file (false), or encloses the fields only in some cases (true). The default is as for the specified dialect, or false if the dialect option is omitted. This option makes the fieldsEnclosedBy option equivalent to the FIELDS OPTIONALLY ENCLOSED BY option for the LOAD DATA statement.

fieldsEscapedBy: "character"

The character that begins escape sequences in the input data file. If this is not provided, escape sequence interpretation does not occur. The default is as for the specified dialect, or a backslash (\) if the dialect option is omitted. This option is equivalent to the FIELDS ESCAPED BY option for the LOAD DATA statement.

bytesPerChunk: "size"

The number of bytes (plus any additional bytes required to reach the end of the row) that threads send for each LOAD DATA call to the target server. The utility divides the data into chunks of this size for threads to pick up and send to the target server. The chunk size can be specified as a number of bytes, or using the suffixes k (kilobytes), M (megabytes), G (gigabytes). For example, bytesPerChunk="2k" makes threads send chunks of approximately 2 kilobytes. The minimum chunk size is 131072 bytes, and the default chunk size is 50M.

threads: number

The maximum number of parallel threads to use to send the data in the input file to the target server. If you do not specify a number of threads, the default maximum is 8. The utility calculates an appropriate number of threads to create up to this maximum, using the following formula:

min{max{1, threads}, chunks}

where threads is the maximum number of threads, and chunks is the number of chunks that the data will be split into, which is calculated by dividing the file size by the bytesPerChunk size then adding 1. The calculation ensures that if the maximum number of threads exceeds the number of chunks that will actually be sent, the utility does not create more threads than necessary.
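
For example, assuming a 400 MB input file with the default bytesPerChunk setting of 50M and the default maximum of 8 threads, the data is split into 400/50 + 1 = 9 chunks, so the utility creates min(max(1, 8), 9) = 8 threads. For a 100 MB file with the same settings, the data is split into 100/50 + 1 = 3 chunks, and only min(max(1, 8), 3) = 3 threads are created.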

maxRate: "rate"

The maximum limit on data throughput in bytes per second per thread. Use this option if you need to avoid saturating the network or the I/O or CPU for the client host or target server. The maximum rate can be specified as a number of bytes, or using the suffixes k (kilobytes), M (megabytes), G (gigabytes). For example, maxRate="5M" limits each thread to 5MB of data per second, which for eight threads gives a transfer rate of 40MB/second. The default is 0, meaning that there is no limit.
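
For example, the following invocation (with illustrative values) caps the import at 4 threads, 10-megabyte chunks, and 2 MB per second for each thread:

mysql-py> util.import_table("/tmp/products.csv", {"table": "products", "threads": 4, "bytesPerChunk": "10M", "maxRate": "2M"})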

showProgress: [ true | false ]

Display (true) or hide (false) progress information for the import. The default is true if stdout is a terminal (tty), and false otherwise.

The following examples import the data in the CSV file /tmp/productrange.csv to the products table in the mydb database, skipping a header row in the file:

mysql-js> util.importTable("/tmp/productrange.csv", {schema: "mydb", table: "products", dialect: "csv-unix", skipRows: 1, showProgress: true})
mysql-py> util.import_table("/tmp/productrange.csv", {"schema": "mydb", "table": "products", "dialect": "csv-unix", "skipRows": 1, "showProgress": True})

The following example only specifies the dialect for the CSV file. mydb is the active schema for the MySQL Shell session. The utility therefore imports the data in the file /tmp/productrange.csv to the productrange table in the mydb database:

mysql-py> \use mydb
mysql-py> util.import_table("/tmp/productrange.csv", {"dialect": "csv-unix"})

The function returns void; if an error occurs, an exception is thrown. If the import is stopped partway, either by the user with Ctrl+C or by an error, the utility stops sending data. When the server finishes processing the data it already received, messages are returned showing the chunk that was being imported by each thread at the time, the percentage complete, and the number of records that were updated in the target table.
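
Because errors are reported as exceptions, an import can be handled programmatically in a script. The following is a minimal sketch in MySQL Shell's Python mode, reusing the file and table names from the earlier examples:

try:
    util.import_table("/tmp/productrange.csv", {"schema": "mydb", "table": "products", "dialect": "csv-unix", "skipRows": 1})
except Exception as e:
    print("Import failed: {}".format(e))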

The parallel table import utility can also be invoked from the command line using the mysqlsh command interface. With this interface, you invoke the utility as in the following example:

mysqlsh mysql://root:@127.0.0.1:3366 --ssl-mode=DISABLED -- util import-table /r/mytable.dump --schema=mydb --table=regions --bytes-per-chunk=10M --linesTerminatedBy=$'\r\n'

When you use the mysqlsh command interface to invoke the parallel table import utility, the columns option is not supported because array values are not accepted, so the input lines in your data file must contain a matching field for every column in the target table. Also note that as shown in the above example, line feed characters must be passed using ANSI-C quoting in shells that support this function (such as bash, ksh, mksh, and zsh).

For information on this interface, see Section 5.8, “API Command Line Interface”.