WL#8699: Bit-wise operations on binary data types
Affects: Server-8.0 — Status: Complete — Priority: Medium
WHAT ==== Existing bit-wise operations: & | ^ ~ << >> BIT_COUNT BIT_AND BIT_XOR BIT_OR currently take BIGINT (64-bit integer) arguments and return BIGINT. So they are limited to 64 bits. Task is to extend them to also work on binary([VAR]BINARY/[TINY|MEDIUM|LONG]BLOB) arguments of any length for ( & | ^ ~ << >> BIT_COUNT) and limited to 511 bytes ([VAR]BINARY/TINYBLOB)for the aggregate functions BIT_AND BIT_XOR BIT_OR. Need some background? - Bit operators and functions & | ^ ~ << >> BIT_COUNT : https://dev.mysql.com/doc/refman/5.7/en/bit-functions.html#operator_bitwise- and - Aggregate bit functions BIT_AND BIT_XOR BIT_OR: http://dev.mysql.com/doc/refman/5.7/en/group-by-functions.html WHERE ===== Scope of WL#8699 is Server 8.0. Some warnings, and only that, were added to Server 5.7 (WL#9015) and will be removed by this WL. WHY === Two reasons: UUID and IPv6. A very important customer repeatedly expressed the need to manipulate UUIDs more easily. Our plan is two-fold: * "Improve usability of UUID manipulations" introduces 3 functions to validate/store/retrieve/index UUIDs easily: they convert UUID between text form (like ''aab5d5fd-70c1-11e5-a4fb-b026b977eb28' - 36 characters) and BINARY(16) (16 bytes). * The present WL: extends our existing bit-wise operations ('bitwise AND', etc), which work with BIGINT, to also work with [VAR]BINARY/[TINY|MEDIUM|LONG]BLOB. Will allow to test/extract/recombine parts of a UUID, as an alternative to the SUBSTR proposed in https://www.percona.com/blog/2014/12/19/store-uuid-optimized-way/ . IPv6 manipulation is the other reason: we already feature INET6_ATON and INET6_NTOA functions which convert IPv6 adresses between text form (like 'fe80::226:b9ff:fe77:eb17') and BINARY(16). The present WL will allow to test/extract/recombine parts of an IPv6 address. Using bitwise operations for IPv*4* addresses (which are convertible to BIGINT) is already an established practice among MySQL users. These users want bit ops on more than 64 bits and would be helped by this WL: http://stackoverflow.com/questions/18217664/mysql-bitwise-and-256-bit-BINARY- values http://stackoverflow.com/questions/7021549/bitwise-and-or-with-a-varBINARY255- in-mysql http://dba.stackexchange.com/questions/53351/xor-BINARY-values-greater-than- 64bit http://stackoverflow.com/questions/11015628/mysql-xor-a-string-with-a-key ANY CATCH? ========== Yes, though we hope it's small and acceptable; please read "Backward-compatibility and incompabilities" further down.
Functional requirements ======================= F-1: When their arguments are [VAR]BINARY/[TINY|MEDIUM|LONG]BLOB, & | ^ ~ << >> BIT_COUNT BIT_AND BIT_XOR BIT_OR must produce VARBINARY output or generate an error, not BIGINT anymore. See detailed rules in the HLS. F-2: in other cases & | ^ ~ << >> BIT_COUNT BIT_AND BIT_XOR BIT_OR must work as before. F-3: In MySQL 5.7.11+, the cases of F-1 should emit a warning to inform users of the changing-in-5.8 behavior, thus giving them an opportunity to rewrite their query so it doesn't fall into F-1 anymore. Nonfunctional requirements ========================== NF-1: existing applications, existing views and routines, and 5.7->5.8 statement-based replication, must work unchanged unless they involve queries falling into F-1.
Problem: ======== Let's take bit-wise '&' operator as example. In MySQL 5.7 (and older) this operation takes BIGINT arguments and returns BIGINT (BIGINT = 64-bit integer). Arguments are cast to BIGINT, possibly losing bits; for example: select x'1000000000000000000000000000000' & x'1000000000000000000000000000001'; returns 0; the most-significant bit is lost. If one has an IPv4 address and wants to test it against a network mask, he uses 'INET_ATON(address) & INET_ATON(network)' (INET_ATON returns BIGINT). But if one has an IPv6 address and wants to test it against a network mask, using 'INET6_ATON(address) & INET6_ATON(network)' will not work as INET6_ATON returns VARBINARY(16) (128 bits) which is lossily converted to BIGINT by the '&' operation. So we would like to extend the bit operations to apply to BINARY arguments. Solution: ========= Implement binary operations for [VAR]BINARY(M)/[TINY|MEDIUM|LONG]BLOB(M). Depending on the types of input arguments of & | ^ ~ << >> BIT_AND BIT_XOR BIT_OR BIT_COUNT: - determine the return type (today it is INT_RESULT): a hybrid of INT_RESULT and STRING_RESULT (like '+' does); this doesn't apply to BIT_COUNT which result type will remain INT_RESULT. - call val_int() or val_str() on arguments - do calculations (for integer arguments, continue using the native C++ bit-wise operators; for binary, do that byte-by-byte). The operation's return type will be determined as follows. 1. arg1 [ & | ^] arg2 (bit-and, bit-or, bit-xor) ------------------------------------------------- Currently in 5.7: both arguments are converted to BIGINT and the result of the operation is returned as BIGINT. Solution: 1.1. when arg1 is [VAR]BINARY(m)/[TINY|MEDIUM|LONG]BLOB(m) and arg2 is [VAR]BINARY(n)/[TINY|MEDIUM|LONG]BLOB(n) and at least one of them is not a hex/bit/NULL literal 1.1.a. Determine the length of each argument: l1=LENGTH(arg 1), l2=LENGTH(arg 2). Note that l1 =< m, l2 =< n; if types are BINARY there is equality. 1.1.b If l1 <> l2, throw error (anything else would require padding to left or right, and we can't reliably decide between left and right) 1.1.c Execute the bitwise operation. Return type is VARBINARY(the length of the result is l1 i.e. l1=l2 per 1.1.b). 1.2. otherwise, convert both arguments to BIGINT, execute the operation and return BIGINT; this is the same behaviour as MySQL 5.7. 2. arg1 [ << >> ] arg2 (bit shifting) ; ~(arg1) (bit-not) ---------------------------------------------------------- Currently in 5.7: arg1 is converted to BIGINT and the result of the operation is returned as BIGINT. Solution: 2.1. when arg1 is [VAR]BINARY(m)/[TINY|MEDIUM|LONG]BLOB(m) and is not a hex/bit/NULL literal, execute the operation and return the result as [VAR]BINARY(m). For shift operators, bits may be lost without warning (as is usual for the BIGINT case: 1>>1 is 0). 2.2. otherwise convert to BIGINT execute operation and return BIGINT. This is the same behavior as MySQL 5.7. 3. BIT_COUNT - always returns BIGINT. ------------ 4. BIT_AND, BIT_XOR, BIT_OR --------------------------- 4.1. when arg is [VAR]BINARY(m)/TINYBLOB(m) and is not a hex/bit/NULL literal 4.1.1. if arg's values have same length execute the operation and return the result as [VAR]BINARY(m) 4.1.2. if arg's values have different length throw error 4.2. when arg is [MEDIUM|LONG]BLOB(m) or [VAR]BINARY(n) with n exceeding 511 throw ER_INVALID_BITWISE_AGGREGATE_OPERANDS_SIZE error 4.3. when the arg is other types than those in 4.1 or 4.2 work as before => returns BIGINT *NULL values do not affect the result unless all the values are NULL, in which case the result will be the neutral element having the same length as the column definition length ex. CREATE TABLE t(a VARBINARY(6), group_id INT); insert into t values (1, NULL); insert into t values (1, NULL); insert into t values (2, NULL); insert into t values (2, 0x1234); SELECT HEX(BIT_AND(a)), HEX(BIT_XOR(a)), HEX(BIT_OR(a)) FROM t GROUP BY group_id 1 FFFFFFFFFFFF 000000000000 000000000000 2 1234 1234 1234 5. The special case of hex/bit/NULL literals --------------------------------------- Those are https://dev.mysql.com/doc/refman/5.7/en/hexadecimal-literals.html https://dev.mysql.com/doc/refman/5.7/en/bit-field-literals.html (unadorned; NOT prefixed with a charset introducer or the BINARY word) http://dev.mysql.com/doc/refman/5.7/en/null-values.html Today people can use hexadecimal constants as bitmasks, for example: set @mask= (x'01' | x'F0'), or set @mask= (x'01' << 10); This hex constant x'01' is currently of type VARBINARY(1). If (x'01' << 10) and (x'01' | x'F0') produced a BINARY in 5.8 (not a BIGINT anymore), this could lead to problems in existing applications. Likewise, the NULL literal has type BINARY(0), and today set @mask=(NULL | NULL) makes @mask a BIGINT with a NULL value. That is why for backward compatibility, rules 1.1 and 2.1 have exceptions for hex/bit/NULL literals. Thus, the SET examples above will work unchanged. If the user wants to produce BINARY masks, he can force one literal to BINARY with: set @mask= (BINARY x'01' | x'F0'), set @mask= (BINARY x'01' << 10), set @mask= (BINARY NULL | NULL). BINARY <expr> is equivalent to CAST (<expr> AS BINARY) (it's a function call). If <expr> is a literal, an alternative is _BINARY <literal> (_BINARY 'a', or _BINARY x'01'). _BINARY isn't documented at http://dev.mysql.com/doc/refman/5.7/en/charset-literal.html which only documents _<CHARSET> ; as BINARY isn't listed in the list of available charsets: http://dev.mysql.com/doc/refman/5.7/en/charset-mysql.html the user has no chance to know about _BINARY. We should document it. 6. Examples ------------ Example: if a BINARY(2) column col1 contains 0x0102 and another one, col2, contains 0x0408, the result of 'col1 | col2' is a BINARY(2) containing 0x050A. Example: Suppose a network with 64 subnets; so the subnet's number is between bit 48 and bit 54 (inspired by http://www.firewall.cx/networking-topics/protocols/877-ipv6- subnetting-how-to-subnet-ipv6.html ); to test if IPv6 address 2606:b400:8f0:82:8000::237 is in the 23th subnet of this network: SELECT INET6_ATON('2606:b400:8f0:82:8000::237') & INET6_ATON('0:0:0:FC00::') = INET6_ATON('0:0:0:5C00::'); In a more realistic example, the INET6_ATON-s would be replaced by a BINARY(16) column storing IPv6 addresses or by hex constants: SELECT my_col & my_mask = my_subnet; Example: "my_BINARY_column & x'01'". This hex constant x'01' has type VARBINARY(1). So if my_BINARY_column is BINARY(4) and contains x'12345678', the operation will fail with error (different lengths 4 and 1). One will have to write x'00000001' or lpad(x'01',4,x'00') (or rpad()). Example: if one has a column of VARBINARY(4) type, containing values having different lengths, and one wants to find rows which have the least-significant bit set, one would do: select ... where col & x'00000001'; (or: select ... where col & x'80000000', depending on how he ordered bits) but it would give an error if some value of 'col' has a length shorter than 4, so one should use: select ... where lpad(col,4,x'00') & x'00000001'; Example: x'01' << 10; will yield integer 1024 (as in 5.7). BINARY x'01' << 10; will yield VARBINARY(1) containing '0x00'; no warning for lost bits will be given because: - it's how MySQL works today with << on BIGINT - people may intentionally use << to move a low part in a certain position without any concern for the lost high part (example: switching high and low bytes of a number). 'CAST ... AS BINARY(n)', or LPAD/RPAD, can make x'01' wider to avoid bit loss. Or add more zeroes in the literal. Example: given a UUID stored in binary(16), swap its portions as in http://mysqlserverteam.com/storing-uuid-values-in-mysql-tables/ : set @u=uuid(); # returns '3A059CCB-70EA-11E5-A4FB-B026B977EB28' set @v=unhex(replace(@u,'-','')); # made it binary select hex( ((@v >> 64) << 112) | (((@v >> 80) << 112) >> 16) | (((@v >> 96) << 96) >> 32) | ((@v << 64) >> 64)) = '11E570EA3A059CCBA4FBB026B977EB28'; returns 'true'. uuid_to_bin (WL#8920) will allow to do this automatically, but this example shows it is possible to move, swap, recombine pieces in any desired way. The >> is used to eliminate the right bits, << is used to eliminate left bits. Backward-compatibility and incompabilities ========================================== The solutions above 1.1 and 2.1 will change behaviour of only those 5 expressions: BINARY-neither-hex-nor-null [ &|^ ] BINARY, BINARY [ &|^ ] BINARY-neither-hex-nor-null, BINARY-neither-hex-nor-null [ << >> ] anything, ~ BINARY-neither-hex-nor-null, AGGR_BIT_FUNC(BINARY-neither-hex-nor-null), where BINARY-neither-hex-nor-null is of [VAR]BINARY(m)/[TINY|MEDIUM|LONG]BLOB(m) type but not a hex/bit/NULL literal. These 5 cases all return BIGINT today, will return BINARY in 5.8. Note that they are unlikely to be used today, because: 1) bit operators and functions are documented to work on integers only. 2) In the entire MTR suite, there is only one query which falls into these cases, and it's an absurd one from Shane's query generator. 3) Certainly some client (PHP-PDO) may quote prepared statements' parameters, making SELECT ? & 123 become SELECT '456' & 123 (if the bound parameter has value 456), but this '456' string is of VARCHAR type, not VARBINARY, unless character_set_connection is set to BINARY (which it isn't by default). Having a VARCHAR argument, the return type of the bit operation will remain BIGINT - no change. These 5 cases can theoretically exist in: - an application's SQL code - a stored routine - a view's definition ('CREATE VIEW ... SELECT bit-op') - a statement-based binary log. For extra safety, we will add a warning to 5.7, in the 5 cases: "warning: Bitwise operations on BINARY will change behaviour in a future version, check the 'Bit functions' section in the manual.' For example, SELECT bin_col1 & bin_col2; will emit this warning. This warning will also be emitted when using a view, or stored routine, if it uses a query falling into the 5 cases. This will be done by WL#9015, please read it now. Let's call 5.7.n the version where we add the warnings. If the user follows this piece of advice and corrects the expression, then in the upgrade from 5.7.n to 5.8 the result of the corrected expression will not change, his application/view/routine will continue to work, his statement-based/mixed replication from 5.7.n to 5.8 will work. If he has ignored warnings, he shouldn't have. If is true that X->5.8 statement-based/mixed replication will not work reliably if X is prior to 5.7.n. But the problematic 5 cases are expected to be rare, and we can add a note ("incompatible change") in the manual. And we can see, when the DMR with this WL is released, if it causes real numerous problems, in which case we can back it off. So we think that 5.8 need NOT offer an option to preserve 5.7 behaviours of bit operations in those 5 cases. Offering that option would then force us to propagate this option in the binlog, and to deprecate it later. Other products ============== - Their limits on size of arguments: DB2: 113 bits; SQL Server: 64 bits (largest return type is INT); C++: 64 bits; Python: integers are of arbitrary size; Oracle: 128 bits (using NUMBER type). - I found no DBMS product supporting bit operations in BINARY types. PG has it on BIT type (where lengths of operands must match); casting BIT to larger BIT does right padding; casting from INT to BIT does left-padding. - This solution limits the BIT_AND, BIT_OR and BIT_XOR operator's length to 511(so [VAR]BINARY(m)|TINYBLOB) where M<=511. However it does not limit the operators size for & | ^ << >> ~ BIT_COUNT. How to do without any of the improvements above =============================================== substr()+concat() can also extract, move and recombine parts of BINARY(16), for ipv6/uuid use cases. If the boundaries of netmasks to test are multiples of 8 bits, no bit arithmetic is needed (as one can stay at the level of the byte). But we saw in the case of the 64 subnets above, that the subnet's number is over 6 bits - not a multiple of 8. For bit arithmetic, read on. An 8-byte part extracted with substr() can be made into a bigint like this: select convert(conv(hex( BINARY(8) value ),16,10), unsigned); then it can be used with existing bigint bit-ops, and converted back to BINARY(8) with: select unhex(hex(int value)); Doable, but not very simple.
- added mark_field_in_map in Item: to be able to mark the fields for read in result fields inheriting from Item_sum_hybrid_field - removed bitwise operations on integer from item_cmpfunc.cc - added bitwise operations on binary/integer in item_func.cc - extended Item_func_bit_count::val_int to work with binary too - changed the hierarchy for Item_func_bit(base class for: '~', '|', '^', '&', '>>', '<<'): used Item_func as direct parent(previously it was Item_int_func) and made it work with int and binary. - added Item_func_bit_two_param(inherits Item_func_bit) to work with functions that have two binary arguments - renamed check_deprecated_bin_op to bit_func_returns_binary and changed it's role to check if the Item should return integer or binary result (implementation moved to item_func.cc) - Item_sum_bit inherits now directly from Item_sum(previously Item_sum_int) and is capable of working with both binary strings and integers - added Item_sum_hybrid_field which inherits from Item_result_field and can produce different results + it implements mark_field_in_map to make it possible to use field->store() and field->val_str/.. - Item_sum_num_field and Item_sum_bit_field inherits now from Item_sum_hybrid_field - added a new Field of type FIELD_BIT_ITEM: Item_sum_bit_field which extracts the additional byte stored on field used during the aggregation(used to keep track if the lenght for the group was set)
Copyright (c) 2000, 2017, Oracle Corporation and/or its affiliates. All rights reserved.