WL#2637: Byte Order Mark for LOAD DATA INFILE and mysqlimport

Affects: Server-7.0   —   Status: Assigned

Sometimes Unicode files begin with a Byte Order Mark (BOM). 
We want to ignore the BOM when reading files using LOAD 
DATA INFILE or mysqlimport. 
 
We could support a new clause "IGNORE n BYTES" 
(analogous to "IGNORE n LINES") but that would 
not work well: sometimes there might be a BOM, 
sometimes not, and we only want to ignore the 
bytes if they are there. 
 
For a new LOAD DATA INFILE clause, we'll use the 
Oracle 10g SQL*Loader syntax "BYTEORDERMARK CHECK | NOCHECK". 
It's described in the Oracle 10g Utilities Manual: 
http://dbis.informatik.uni-freiburg.de/oracle-docs/doc1001/server.101/b10825/ldr_field_list.htm#i1011032

(Search for the word "BYTEORDERMARK".) 
We won't follow Oracle's clause order, though: 
the "BYTEORDERMARK" clause goes after "INTO TABLE tbl_name". 
 
The default is "BYTEORDERMARK NOCHECK". 
Thus, if some MySQL-supported character set treats 
the BOM byte sequences as meaningful data, such files 
still load unchanged by default. 
(This is not how Oracle handles the default.) 
 
For mysqlimport, the option is --byteordermark=check 
or --byteordermark=nocheck (the default). 
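
A minimal sketch of how the mysqlimport option could map onto the 
proposed clause, written in Python for illustration only. The function 
name load_data_statement is hypothetical, and the statement text follows 
the syntax proposed in this worklog, not a released server syntax: 

```python
def load_data_statement(path, table, byteordermark="nocheck"):
    """Build the proposed LOAD DATA statement text (hypothetical helper).

    Assumption: the BYTEORDERMARK clause comes right after
    "INTO TABLE tbl_name", as this worklog proposes. No quoting or
    escaping of path/table is attempted here; this only shows placement.
    """
    mode = byteordermark.lower()
    if mode not in ("check", "nocheck"):
        raise ValueError("byteordermark must be 'check' or 'nocheck'")
    return (
        f"LOAD DATA INFILE '{path}' "
        f"INTO TABLE {table} "
        f"BYTEORDERMARK {mode.upper()}"
    )


if __name__ == "__main__":
    # --byteordermark=check
    print(load_data_statement("/tmp/t1.txt", "t1", "check"))
    # option omitted: NOCHECK is the default
    print(load_data_statement("/tmp/t1.txt", "t1"))
```

Spelling out NOCHECK in the generated statement is a choice made for 
this sketch; omitting the clause would be equivalent given the default. 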
 
SELECT ... INTO OUTFILE does not generate BOMs. 
 
"BYTEORDERMARK CHECK" does not validate anything. 
All it means is: we skip the first 2 or 3 bytes of 
the file if they are equal to any of the following: 
0xEFBBBF                              UTF8 (3 bytes) 
0xFEFF                                UCS2 big-endian (2 bytes) 
0xFFFE                                UCS2 little-endian (2 bytes) 
Even if BYTEORDERMARK CHECK is on, the absence of a BOM 
is not an error. The first bytes are simply treated as data. 
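
The skip rule above can be sketched as follows. This is Python for 
illustration only, not the server implementation; the names BOMS, 
bom_length and strip_bom are hypothetical: 

```python
# BOM byte sequences listed in this worklog, with the encoding each marks.
BOMS = (
    (b"\xef\xbb\xbf", "UTF8"),
    (b"\xfe\xff", "UCS2 big-endian"),
    (b"\xff\xfe", "UCS2 little-endian"),
)


def bom_length(head):
    """Return the length of a leading BOM in `head`, or 0 if there is none."""
    for bom, _name in BOMS:
        if head.startswith(bom):
            return len(bom)
    return 0


def strip_bom(data, check=False):
    """Skip a leading BOM only when check is True (BYTEORDERMARK CHECK).

    With check=False (the NOCHECK default) the bytes pass through
    untouched. A missing BOM is never an error: the data is returned as is.
    """
    if not check:
        return data
    return data[bom_length(data):]


if __name__ == "__main__":
    print(strip_bom(b"\xef\xbb\xbf1,abc\n", check=True))  # BOM skipped
    print(strip_bom(b"\xef\xbb\xbf1,abc\n"))              # NOCHECK: kept
    print(strip_bom(b"1,abc\n", check=True))              # no BOM: data kept
```

Because the skip happens only on an exact match, the same call handles 
files with and without a BOM, which is what a fixed "IGNORE n BYTES" 
clause could not do. 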
 
See also: WL#993 "LOAD DATA INFILE and character sets". 

Feature requests:
BUG#4960 Mysql cmdline client fails on byte order mark

See also:
http://bugs.mysql.com/bug.php?id=10573
http://bugs.mysql.com/bug.php?id=29323