WL#8699: Bit-wise operations on binary data types

Affects: Server-8.0   —   Status: Complete

WHAT
====

Existing bit-wise operations: & | ^ ~ << >> BIT_COUNT BIT_AND BIT_XOR BIT_OR
currently take BIGINT (64-bit integer) arguments and return BIGINT. So
they are limited to 64 bits.
Task is to extend them to also work on 
binary([VAR]BINARY/[TINY|MEDIUM|LONG]BLOB) arguments of any length for ( & | ^ ~ 
<< >> BIT_COUNT) and limited to 511 bytes ([VAR]BINARY/TINYBLOB)for the 
aggregate functions BIT_AND BIT_XOR BIT_OR.

Need some background?
- Bit operators and functions & | ^ ~ << >> BIT_COUNT :
  https://dev.mysql.com/doc/refman/5.7/en/bit-functions.html#operator_bitwise-
and
- Aggregate bit functions BIT_AND BIT_XOR BIT_OR:
  http://dev.mysql.com/doc/refman/5.7/en/group-by-functions.html

WHERE
=====

Scope of WL#8699 is Server 8.0.
Some warnings, and only that, were added to Server 5.7 (WL#9015) and will be 
removed by this WL.

WHY
===

Two reasons: UUID and IPv6.

A very important customer repeatedly expressed the need to manipulate
UUIDs more easily.

Our plan is two-fold:
* "Improve usability of UUID manipulations" introduces 3 functions to
validate/store/retrieve/index UUIDs easily: they convert UUID between text form
(like ''aab5d5fd-70c1-11e5-a4fb-b026b977eb28' - 36 characters) and BINARY(16)
(16 bytes).
* The present WL: extends our existing bit-wise operations ('bitwise
AND', etc), which work with BIGINT, to also work with 
[VAR]BINARY/[TINY|MEDIUM|LONG]BLOB. Will allow to test/extract/recombine parts 
of a UUID, as an alternative to the SUBSTR proposed in 
https://www.percona.com/blog/2014/12/19/store-uuid-optimized-way/ .


IPv6 manipulation is the other reason: we already feature INET6_ATON
and INET6_NTOA functions which convert IPv6 adresses between text form
(like 'fe80::226:b9ff:fe77:eb17') and BINARY(16). The present WL will
allow to test/extract/recombine parts of an IPv6 address. Using
bitwise operations for IPv*4* addresses (which are convertible to BIGINT) is
already an established practice among MySQL users.

These users want bit ops on more than 64 bits and would be helped by
this WL:
http://stackoverflow.com/questions/18217664/mysql-bitwise-and-256-bit-BINARY-
values
http://stackoverflow.com/questions/7021549/bitwise-and-or-with-a-varBINARY255-
in-mysql
http://dba.stackexchange.com/questions/53351/xor-BINARY-values-greater-than-
64bit
http://stackoverflow.com/questions/11015628/mysql-xor-a-string-with-a-key

ANY CATCH?
==========

Yes, though we hope it's small and acceptable; please read
"Backward-compatibility and incompabilities" further down.
Functional requirements
=======================

F-1: When their arguments are [VAR]BINARY/[TINY|MEDIUM|LONG]BLOB, & | ^ ~ << >> 
BIT_COUNT BIT_AND BIT_XOR BIT_OR must produce VARBINARY output or generate an 
error, not BIGINT anymore. See detailed rules in the HLS.

F-2: in other cases & | ^ ~ << >> BIT_COUNT BIT_AND BIT_XOR BIT_OR
must work as before.

F-3: In MySQL 5.7.11+, the cases of F-1 should emit a warning to inform
users of the changing-in-5.8 behavior, thus giving them an opportunity to
rewrite their query so it doesn't fall into F-1 anymore.

Nonfunctional requirements
==========================

NF-1: existing applications, existing views and routines, and 5.7->5.8
statement-based replication, must work unchanged unless they involve queries
falling into F-1.
Problem:
========

Let's take bit-wise '&' operator as example.
In MySQL 5.7 (and older) this operation takes BIGINT arguments and returns
BIGINT (BIGINT = 64-bit integer).
Arguments are cast to BIGINT, possibly losing bits; for example:
select x'1000000000000000000000000000000' & x'1000000000000000000000000000001';
returns 0; the most-significant bit is lost.

If one has an IPv4 address and wants to test it against a
network mask, he uses
'INET_ATON(address) & INET_ATON(network)' (INET_ATON returns BIGINT).
But if one has an IPv6 address and wants to test it against a
network mask, using
'INET6_ATON(address) & INET6_ATON(network)'
will not work as INET6_ATON returns VARBINARY(16) (128 bits) which is lossily 
converted to BIGINT by the '&' operation.

So we would like to extend the bit operations to apply to BINARY arguments.

Solution:
=========

Implement binary operations for [VAR]BINARY(M)/[TINY|MEDIUM|LONG]BLOB(M). 
Depending on the types of input arguments of & | ^ ~ << >> BIT_AND BIT_XOR 
BIT_OR BIT_COUNT:
- determine the return type (today it is INT_RESULT): a hybrid of INT_RESULT and
STRING_RESULT (like '+' does); this doesn't apply to BIT_COUNT which result type
will remain INT_RESULT.
- call val_int() or val_str() on arguments
- do calculations (for integer arguments, continue using the native C++ bit-wise
operators; for binary, do that byte-by-byte).
The operation's return type will be determined as follows.

1.  arg1 [ & | ^] arg2 (bit-and, bit-or, bit-xor)
-------------------------------------------------

Currently in 5.7: both arguments are converted to BIGINT and the result of the
operation is returned as BIGINT.

Solution:

1.1. when arg1 is [VAR]BINARY(m)/[TINY|MEDIUM|LONG]BLOB(m) and arg2 is 
[VAR]BINARY(n)/[TINY|MEDIUM|LONG]BLOB(n) and at least one of them is not a 
hex/bit/NULL literal
1.1.a. Determine the length of each argument:
l1=LENGTH(arg 1), l2=LENGTH(arg 2).
Note that l1 =< m, l2 =< n; if types are BINARY there is equality.
1.1.b If l1 <> l2, throw error (anything else would require padding to left or 
right, and we can't reliably decide between left and right)
1.1.c Execute the bitwise operation. Return type is VARBINARY(the length of the 
result is l1 i.e. l1=l2 per 1.1.b).

1.2. otherwise,
convert both arguments to BIGINT, execute the operation and return
BIGINT; this is the same behaviour as MySQL 5.7.

2.  arg1 [ << >> ] arg2 (bit shifting) ; ~(arg1) (bit-not)
----------------------------------------------------------

Currently in 5.7: arg1 is converted to BIGINT and the result of the
operation is returned as BIGINT.

Solution:

2.1. when arg1 is [VAR]BINARY(m)/[TINY|MEDIUM|LONG]BLOB(m) and is not a 
hex/bit/NULL literal, execute the operation and return the result as 
[VAR]BINARY(m). For shift operators, bits may be lost without warning (as is 
usual for the BIGINT case: 1>>1 is 0).

2.2. otherwise
convert to BIGINT execute operation and return BIGINT. This
is the same behavior as MySQL 5.7.

3. BIT_COUNT - always returns BIGINT.
------------

4. BIT_AND, BIT_XOR, BIT_OR 
---------------------------

4.1. when arg is [VAR]BINARY(m)/TINYBLOB(m) and is not a hex/bit/NULL 
literal
4.1.1. if arg's values have same length execute the operation and return the 
result as [VAR]BINARY(m)
4.1.2. if arg's values have different length throw error
4.2. when arg is [MEDIUM|LONG]BLOB(m) or [VAR]BINARY(n) with n exceeding 511 
throw ER_INVALID_BITWISE_AGGREGATE_OPERANDS_SIZE error 
4.3. when the arg is other types than those in 4.1 or 4.2 work as before => 
returns BIGINT

*NULL values do not affect the result unless all the values are NULL, in which 
case the result will be the neutral element having the same length as the column 
definition length

ex.
CREATE TABLE t(a VARBINARY(6), group_id INT);
insert into t values (1, NULL);
insert into t values (1, NULL);
insert into t values (2, NULL);
insert into t values (2, 0x1234);

SELECT HEX(BIT_AND(a)), HEX(BIT_XOR(a)), HEX(BIT_OR(a)) FROM t GROUP BY group_id
1	FFFFFFFFFFFF	000000000000	000000000000
2	1234	1234	1234

5. The special case of hex/bit/NULL literals
---------------------------------------

Those are
https://dev.mysql.com/doc/refman/5.7/en/hexadecimal-literals.html
https://dev.mysql.com/doc/refman/5.7/en/bit-field-literals.html
(unadorned; NOT prefixed with a charset introducer or the BINARY
word)
http://dev.mysql.com/doc/refman/5.7/en/null-values.html

Today people can use hexadecimal constants as bitmasks, for example:
set @mask= (x'01' | x'F0'), or
set @mask= (x'01' << 10);

This hex constant x'01' is currently of type VARBINARY(1).
If (x'01' << 10) and (x'01' | x'F0') produced a BINARY in 5.8 (not a
BIGINT anymore), this could lead to problems in existing
applications.

Likewise, the NULL literal has type BINARY(0),
and today
set @mask=(NULL | NULL)
makes @mask a BIGINT with a NULL value.

That is why for backward compatibility, rules 1.1 and 2.1 have
exceptions for hex/bit/NULL literals.

Thus, the SET examples above will work unchanged. If the user wants to 
produce BINARY masks, he can force one literal to BINARY with:

set @mask= (BINARY x'01' | x'F0'),
set @mask= (BINARY x'01' << 10),
set @mask= (BINARY NULL | NULL).

BINARY 
is equivalent to CAST ( AS BINARY) (it's a function call).
If  is a literal, an alternative is
_BINARY  (_BINARY 'a', or _BINARY x'01').
_BINARY isn't documented at
http://dev.mysql.com/doc/refman/5.7/en/charset-literal.html
which only documents _ ; as BINARY isn't listed in the list of
available charsets:
http://dev.mysql.com/doc/refman/5.7/en/charset-mysql.html
the user has no chance to know about _BINARY. We should document it.

6.  Examples
------------

Example:
if a BINARY(2) column col1 contains 0x0102 and another one, col2, contains 
0x0408,
the result of 'col1 | col2' is a BINARY(2) containing 0x050A.

Example:
Suppose a network with 64 subnets; so the subnet's number is between bit 48 and 
bit 54
(inspired by http://www.firewall.cx/networking-topics/protocols/877-ipv6-
subnetting-how-to-subnet-ipv6.html );
to test if IPv6 address 2606:b400:8f0:82:8000::237 is in the 23th subnet of this 
network:
SELECT INET6_ATON('2606:b400:8f0:82:8000::237') & INET6_ATON('0:0:0:FC00::') = 
INET6_ATON('0:0:0:5C00::');
In a more realistic example, the INET6_ATON-s would be replaced by a BINARY(16) 
column storing IPv6 addresses or by hex constants:
SELECT my_col & my_mask = my_subnet;

Example: "my_BINARY_column & x'01'".
This hex constant x'01' has type VARBINARY(1). So if my_BINARY_column
is BINARY(4) and contains  x'12345678', the operation will fail with
error (different lengths 4 and 1). One will have to write x'00000001' or
lpad(x'01',4,x'00') (or rpad()).

Example:
if one has a column of VARBINARY(4) type, containing values having different 
lengths, and one wants to find rows which have the least-significant bit 
set, one would do:
select ... where col & x'00000001';
(or: select ... where col & x'80000000', depending on how he ordered bits)
but it would give an error if some value of 'col' has a length shorter than 4, 
so one should use:
select ... where lpad(col,4,x'00') & x'00000001';

Example:
 x'01' << 10;
will yield integer 1024 (as in 5.7).
 BINARY x'01' << 10;
will yield VARBINARY(1) containing '0x00'; no warning for lost bits
will be given because:
- it's how MySQL works today with << on BIGINT
- people may intentionally use << to move a low part in a certain position 
without any concern for the lost high part (example: switching high and low 
bytes of a number).
'CAST ... AS BINARY(n)', or LPAD/RPAD, can make x'01' wider to avoid bit
loss. Or add more zeroes in the literal.

Example:
given a UUID stored in binary(16), swap its portions as in
http://mysqlserverteam.com/storing-uuid-values-in-mysql-tables/ :
set @u=uuid(); # returns '3A059CCB-70EA-11E5-A4FB-B026B977EB28'
set @v=unhex(replace(@u,'-','')); # made it binary
select hex( ((@v >> 64) << 112) | (((@v >> 80) << 112) >> 16) | (((@v >> 96) <<
96) >> 32) | ((@v << 64) >> 64)) = '11E570EA3A059CCBA4FBB026B977EB28';
returns 'true'.
uuid_to_bin (WL#8920) will allow to do this automatically, but this example
shows it is possible to move, swap, recombine pieces in any desired way. The >>
is used to eliminate the right bits, << is used to eliminate left bits.

Backward-compatibility and incompabilities
==========================================

The solutions above 1.1 and 2.1 will change behaviour of only those 5
expressions:

BINARY-neither-hex-nor-null [ &|^ ] BINARY,
BINARY  [ &|^ ] BINARY-neither-hex-nor-null,
BINARY-neither-hex-nor-null [ << >> ] anything,
~ BINARY-neither-hex-nor-null,
AGGR_BIT_FUNC(BINARY-neither-hex-nor-null),

where BINARY-neither-hex-nor-null is of [VAR]BINARY(m)/[TINY|MEDIUM|LONG]BLOB(m) 
type but not a hex/bit/NULL literal.

These 5 cases all return BIGINT today, will return BINARY in 5.8. Note
that they are unlikely to be used today, because:
1) bit operators and functions are documented to work on integers
only.
2) In the entire MTR suite, there is only one query which falls into
these cases, and it's an absurd one from Shane's query generator.
3) Certainly some client (PHP-PDO) may quote prepared statements'
parameters, making 
SELECT ? & 123
become
SELECT '456' & 123
(if the bound parameter has value 456), but this '456' string is of
VARCHAR type, not VARBINARY, unless character_set_connection is set to
BINARY (which it isn't by default). Having a VARCHAR argument, the
return type of the bit operation will remain BIGINT - no change.

These 5 cases can theoretically exist in:
- an application's SQL code
- a stored routine
- a view's definition ('CREATE VIEW ... SELECT bit-op')
- a statement-based binary log.

For extra safety, we will add a warning to 5.7, in the 5 cases:
"warning: Bitwise operations on BINARY will change behaviour in a
future version, check the 'Bit functions' section in the manual.'
For example, SELECT bin_col1 & bin_col2;
will emit this warning.
This warning will also be emitted when using a view, or stored routine, if it
uses a query falling into the 5 cases. This will be done by WL#9015, please read
it now.

Let's call 5.7.n the version where we add the warnings.

If the user follows this piece of advice and corrects the expression,
then in the upgrade from 5.7.n to 5.8 the result of the corrected
expression will not change, his application/view/routine will continue to work,
his statement-based/mixed replication from 5.7.n to 5.8 will work.

If he has ignored warnings, he shouldn't have.

If is true that X->5.8 statement-based/mixed replication will not work reliably
if X is prior to 5.7.n. But the problematic 5 cases are expected to be rare,
and we can add a note ("incompatible change") in the manual. And we
can see, when the DMR with this WL is released, if it causes real
numerous problems, in which case we can back it off.

So we think that 5.8 need NOT offer an option to preserve 5.7
behaviours of bit operations in those 5 cases. Offering that option
would then force us to propagate this option in the binlog, and
to deprecate it later.


Other products
==============

- Their limits on size of arguments: DB2: 113 bits; SQL Server: 64 bits (largest 
return type is INT); C++: 64 bits; Python: integers are of arbitrary size; 
Oracle: 128 bits (using NUMBER type).
- I found no DBMS product supporting bit operations in BINARY types. PG has it 
on BIT type (where lengths of operands must match); casting BIT to larger BIT
does right padding; casting from INT to BIT does left-padding.

- This solution limits the BIT_AND, BIT_OR and BIT_XOR operator's length to 
511(so [VAR]BINARY(m)|TINYBLOB) where M<=511. However it does not limit the 
operators size for & | ^  << >>  ~ BIT_COUNT.


How to do without any of the improvements above
===============================================

substr()+concat() can also extract, move and recombine parts of BINARY(16), for 
ipv6/uuid use cases. If the boundaries of netmasks to test are multiples of 8 
bits, no bit arithmetic is needed (as one can stay at the level of the byte). 
But we saw in the case of the 64 subnets above, that the subnet's number is over 
6 bits - not a multiple of 8. For bit arithmetic, read on.

An 8-byte part extracted with substr() can be made into a bigint like this:
select convert(conv(hex( BINARY(8) value ),16,10), unsigned);
then it can be used with existing bigint bit-ops, and converted back to 
BINARY(8) with:
select unhex(hex(int value));

Doable, but not very simple.
- added mark_field_in_map in Item: to be able to mark the fields for read in 
result fields inheriting from Item_sum_hybrid_field
- removed bitwise operations on integer from item_cmpfunc.cc
- added bitwise operations on binary/integer in item_func.cc
- extended Item_func_bit_count::val_int to work with binary too
- changed the hierarchy for Item_func_bit(base class for: '~', '|', '^', '&', 
'>>', '<<'): used Item_func as direct parent(previously it was Item_int_func) and 
made it work with int and binary.
- added Item_func_bit_two_param(inherits Item_func_bit) to work with functions 
that have two binary arguments
- renamed check_deprecated_bin_op to bit_func_returns_binary and changed it's role 
to check if the Item should return integer or binary result (implementation moved 
to item_func.cc)
- Item_sum_bit inherits now directly from Item_sum(previously Item_sum_int) and is 
capable of working with both binary strings and integers
- added Item_sum_hybrid_field which inherits from Item_result_field and can 
produce different results + it implements mark_field_in_map to make it possible to 
use field->store() and field->val_str/..
-  Item_sum_num_field and Item_sum_bit_field inherits now from 
Item_sum_hybrid_field
- added a new Field of type FIELD_BIT_ITEM: Item_sum_bit_field which extracts the 
additional byte stored on field used during the aggregation(used to keep track if 
the lenght for the group was set)