WL#3717: NDB blobs: varsize parts & new distribution option

Affects: Server-6.0   —   Status: Complete

Two blob improvements:
1) varsize inline and parts
2) option to distribute parts
   the same way as the table row
These are best done at the same time.

Attachment summary.txt lists
the changes.
The old blob format is V1, the new one V2.  Both
versions can exist in a given db.  Conversion
V1->V2 is done by

a) copy data to a new table, or
b) backup/restore (see section 4)

The following primary table is used as an example:

create table T (
  A int, B blob, C varchar(20), primary key (A, C) )
  partition by key (C) ;

1. formats
----------

1A. data types
--------------

The NDB API data type remains the same (Blob/Text).
V1/V2 is specified on NdbDictionary::Column as the
"blob version", with V2 the default.  Internally
the version is also visible via the "array type" of
the blob attribute in the primary table:

V1 : array type = FIXED
V2 : array type = MEDIUM_VAR (2 length bytes)

1B. blob parts table
--------------------

NDB$BLOB table V1 for B:

  PK    Unsigned[7] (4+1+20 bytes rounded up)
  DIST  Unsigned
  PART  Unsigned
  DATA  Binary(2000)
  primary key (PK, DIST, PART)
  distribution key (PK, DIST)
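The PK column size above follows from rounding the
packed key (4 bytes for A, 1 length byte plus 20
chars for C) up to whole 32-bit words.  A minimal
sketch of that arithmetic (the helper name is
illustrative, not an NDB API function):

```cpp
#include <cassert>

// Round a packed primary key of `totalBytes` bytes up to 32-bit words,
// as in PK Unsigned[7]: 4 (A) + 1 (length byte) + 20 (C) = 25 -> 7 words.
unsigned packedPkWords(unsigned totalBytes) {
  return (totalBytes + 3) / 4;
}
```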

Striping:

The purpose of DIST is to give "wider striping"
of blob data, e.g. 8k per node group.  As PART
goes 0,1,2,3,4,5,.. DIST goes e.g. 0,0,0,0,1,1,..
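The DIST progression above reduces to an integer
division, assuming a stripe of 4 consecutive parts
per DIST value (4 parts of 2000 bytes is roughly the
8k per node group mentioned; the helper is a sketch,
not an NDB API function):

```cpp
// Derive DIST from PART: each group of `partsPerStripe` consecutive
// parts shares one DIST value, so PART 0,1,2,3,4,5,.. maps to
// DIST 0,0,0,0,1,1,.. when partsPerStripe is 4.
unsigned blobDist(unsigned part, unsigned partsPerStripe) {
  return part / partsPerStripe;
}
```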

The problems with this format are

a) partitioning per A or C cannot be specified
b) PK is fixed size even though C is varchar
c) DATA is fixed size so last part wastes space
d) charset keys do not always work in NDB API

Note: d) happens not to affect the mysql level.
d) could be fixed on its own, but blobs V2 fixes
it anyway.

NDB$BLOB table V2 for B, by default:

  A         Int
  C         Varchar(20)
  NDB$DIST  Unsigned
  NDB$PART  Unsigned
  NDB$PKID  Unsigned
  NDB$DATA  Longvarbinary(2000)
  primary key (A, C, NDB$PART)
  distribution key (C)

This fixes problems a)-d).  But it introduces new
issues of its own, see sections 2A and 2B.

Striping:

The suggestion (by mikael@) was to omit NDB$DIST
from the default distribution key and use the same
distribution key as the primary table.  Then blob
data is stored in the same partition as the primary
table row.  This is good because:

x) the new behaviour is the primary motivation of this wl#
y) it plays well with e.g. "drop partition"

Partitioning will be described in section 3.

1C. blob attribute in primary table
-----------------------------------

This contains "blob head" and "inline bytes" (initial
bytes of blob value) in a single attribute.

Blob head V1 is 8 bytes:

  8 bytes blob length - native endian (of the NDB API client)

Blob head V2 is 16 bytes:

  2 bytes head+inline length - little-endian
  2 bytes reserved (zero)
  4 bytes NDB$PKID (see section 3) - little-endian
  8 bytes blob length - little-endian
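The V2 head layout can be sketched by packing the
fields little-endian into the 16 bytes (the struct
and function names are illustrative, not the NDB
API's):

```cpp
#include <cstdint>
#include <cstring>

// Sketch of the V2 blob head described above (all fields little-endian).
struct BlobHeadV2 {
  uint16_t headPlusInlineLength;  // bytes 0-1
  uint32_t pkid;                  // bytes 4-7 (NDB$PKID)
  uint64_t blobLength;            // bytes 8-15
};

void packHeadV2(const BlobHeadV2& h, unsigned char buf[16]) {
  std::memset(buf, 0, 16);  // bytes 2-3 stay reserved (zero)
  buf[0] = h.headPlusInlineLength & 0xff;
  buf[1] = (h.headPlusInlineLength >> 8) & 0xff;
  for (int i = 0; i < 4; i++)
    buf[4 + i] = (unsigned char)((h.pkid >> (8 * i)) & 0xff);
  for (int i = 0; i < 8; i++)
    buf[8 + i] = (unsigned char)((h.blobLength >> (8 * i)) & 0xff);
}
```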

Note 1: If we had external/internal mapping in NDB API
we could separate head+inline into several attributes.

Note 2: For "mixed-endian" purposes V1 must be
hacked specially but V2 is treated like a Varchar.

Note 3: The "initial bytes" exist for a) tinyblobs
and b) possible indexing in the future (if this
proves useful).  Their count can be set to 0
(though not from the mysql level).

1D. blob parts events
---------------------

Nothing special here, just another format to handle.

1A-1D are easy to get working with V2.  But as
always there are new problems, listed next.

2. problems
-----------

2A. partition key types
-----------------------

NDB$BLOB in V2 may expose a PK key type which is
not currently supported as a partition key.

This is already fixed in DBTC.  The remaining minor
task is removing error 745 from the NDB API.

2B. length bytes vs user buffer
-------------------------------

In V1 the user buffer is used directly to read or
write blob data.  The blob part ops can run in
parallel and there is no need to execute each op
at once.

In V2 this is not possible since the user obviously
neither wants nor provides the 2 length bytes for
each part.  Losing that parallelism would be a
major performance hit.

The solutions:

1) New getValue() and setValue() variants in NDB API
which take separate arguments for length and data.

2) New getValue() and setValue() variants in NDB API
which only accept "full sized" data, since we only
need this for the full sized blob parts in the middle.

Option 1) was chosen.  For multi-part data reads
the sizes cannot be verified (execution takes place
later), so 2) may be added as a flag.  Multi-part
event reads do verify the sizes.
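The length-byte problem can be illustrated by how a
V2 part sits in storage: 2 little-endian length
bytes immediately followed by the data, so a straight
read into the user's contiguous buffer would drag the
length bytes along (the helper is a sketch, not an
NDB API call):

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Encode one V2 (Longvarbinary) blob part: a 2-byte little-endian
// length prefix that the user neither wants nor provides, then the data.
std::vector<unsigned char> encodeVarPart(const unsigned char* data,
                                         uint16_t len) {
  std::vector<unsigned char> part(2 + len);
  part[0] = len & 0xff;         // low length byte
  part[1] = (len >> 8) & 0xff;  // high length byte
  std::memcpy(part.data() + 2, data, len);
  return part;
}
```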

2C. disk data does not support Var* types
-----------------------------------------

Here we create a blob version "1.5" where the parts
are not Var*.  The disk data property is set as follows:

1) the primary table blob attr is always in memory.
This allows using the array type to distinguish blobs
V1 and V2 (see section 1A).

Note: the inline size set from mysqld should be shrunk accordingly.

2) the storage type of the primary attr in the NDB
API actually contains the storage type of the blob
parts attr DATA (V1) or NDB$DATA (V2)

Implementing 2) requires a bit of trickery in 3
places in the NDB API:

a) at create table and its blob tables
b) at retrieve table and its blob tables
c) in the ndb_restore program

This requires no changes to DictTabInfo.  An
alternative would have been a new storage type.

3. partitioning
---------------

On the NDB metadata level there is the concept of
partition keys: the set of PK attrs hashed to get
the partition id.  By default the set is all PK attrs.
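As a toy illustration of this "hash the PK attrs"
scheme (NDB internally uses an MD5-based hash;
std::hash below is only a stand-in, and the function
name is invented):

```cpp
#include <functional>
#include <string>

// Toy "passive" partitioning: hash the concatenated partition-key
// attribute bytes and reduce modulo the partition count.  NDB's real
// scheme is MD5-based; std::hash is a placeholder.
unsigned partitionId(const std::string& keyBytes, unsigned numPartitions) {
  return (unsigned)(std::hash<std::string>{}(keyBytes) % numPartitions);
}
```

Rows (or blob parts) with equal partition-key bytes
always land in the same partition, which is exactly
what stripe size 0 relies on below.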

Partition keys provide default "passive" partitioning.
User defined partitioning may override it.

Blob table V2 partition keys include those of the
primary table (the PK attrs are exposed, see
section 1B).  There are 2 main types:

1) stripe size 0
2) stripe size not 0

Type 2) adds NDB$DIST to the partition keys.  This
type is not available from the mysql level.

Type 1) means blob data is stored in the same
partition as the primary table row.  There are 2
sub-types:

1a) default "passive" partitioning
1b) user-defined partitioning

In 1b) the user (mysql or an ndb api application)
controls partitioning for each operation.  Blob data
operations belonging to the primary operation use
the same partitioning.

4. backup/restore
-----------------

The only change needed here is noted in 2C c).

It is desirable to add an option to convert blobs
V1->V2, but this may become a separate task.

5. blob events
--------------

The PKID field is used to skip event merge in the
NDB API event code.  This may become another wl#;
for now the HLS here is sufficient.