WL#3717: NDB blobs: varsize parts & new distribution option
Affects: Server-6.0 — Status: Complete — Priority: Low
Two blob improvements: 1) varsize inline and parts, 2) option to distribute parts the same way as the table row. These are best done at the same time. The attachment summary.txt lists the changes.
Old blobs are V1, new blobs are V2.  Both versions can exist in a given db.
Conversion V1->V2 is by a) copying data to a new table, or b)
backup/restore (see section 4).

The following primary table is used as an example:

  create table T (
    A int,
    B blob,
    C varchar(20),
    primary key (A, C)
  )
  partition by key (C);

1. formats
----------

1A. data types
--------------

NDB API data type remains the same (Blob/Text).  V1/V2 is specified on
NdbDictionary::Column as "blob version", with V2 the default.

Internally the version is also seen via the "array type" of the blob
attribute in the primary table:

  V1 : array type = FIXED
  V2 : array type = MEDIUM_VAR (2 length bytes)

1B. blob parts table
--------------------

NDB$BLOB table V1 for B:

  PK    Unsigned (4+1+20 bytes rounded up)
  DIST  Unsigned
  PART  Unsigned
  DATA  Binary(2000)
  primary key (PK, DIST, PART)
  distribution key (PK, DIST)

Striping: The purpose of DIST is to give "wider striping" of blob data,
e.g. 8k per node group.  As PART goes 0,1,2,3,4,5,.. DIST goes
e.g. 0,0,0,0,1,1,..

The problems with this format are:

  a) partitioning per A or C cannot be specified
  b) PK is fixed size even though C is varchar
  c) DATA is fixed size so the last part wastes space
  d) charset keys do not always work in NDB API

Note: d) does not (by chance) affect mysql level.  d) could be fixed
but blobs V2 fixes it anyway.

NDB$BLOB table V2 for B, by default:

  A         Int
  C         Varchar(20)
  NDB$DIST  Unsigned
  NDB$PART  Unsigned
  NDB$PKID  Unsigned
  NDB$DATA  Longvarbinary(2000)
  primary key (A, C, NDB$PART)
  distribution key (C)

This fixes problems a)-d).  But there is something else to fix now,
see sections 2A and 2B.

Striping: The suggestion (by mikael@) was to omit NDB$DIST from the
default dk and set the same dk as the primary table.  Then blob data is
stored in the same partition as the primary table row.  This is good:

  x) The new behaviour is the primary cause of this wl#
  y) It plays well with e.g. "drop partition"

Partitioning will be described in section 3.

1C. blob attribute in primary table
-----------------------------------

This contains the "blob head" and "inline bytes" (initial bytes of the
blob value) in a single attribute.

Blob head V1 is 8 bytes:

  8 bytes of blob length - native endian (of ndb apis)

Blob head V2 is 16 bytes:

  2 bytes head+inline length - little-endian
  2 bytes reserved (zero)
  4 bytes NDB$PKID (see section 3) - little-endian
  8 bytes blob length - little-endian

Note 1: If we had external/internal mapping in NDB API we could
separate head+inline into several attributes.

Note 2: For "mixed-endian" purposes V1 must be hacked specially but
V2 is treated like a Varchar.

Note 3: The "initial bytes" exist for a) tinyblobs b) possible
indexing in future (if this is useful).  They can be set to 0 (not
from mysql level).

1D. blob parts events
---------------------

Nothing special here, just another format to handle.

1A-1D are easy to get working with V2.  But as always there are
new problems, listed next.

2. problems
-----------

2A. partition key types
-----------------------

NDB$BLOB in V2 may expose a PK key type which is not currently
supported as partition key.  This is already fixed in DBTC.
The minor problem is removing error 745 from NDB API.

2B. length bytes vs user buffer
-------------------------------

In V1, the user buffer is used directly to read or write blob data.
The blob part ops can be run in parallel and there is no need to
execute each op at once.

In V2 this is not possible since the user obviously neither wants nor
provides the 2 length bytes for each part.  This is a major
performance hit.  The solutions:

  1) New getValue() and setValue() variants in NDB API which take
     separate arguments for length and data.

  2) New getValue() and setValue() variants in NDB API which only
     accept "full sized" data, since we only need this for the full
     sized blob parts in the middle.

Chose 1).  For multi-part data reads we cannot verify the sizes
(execution takes place later).  So maybe add 2) as a flag.
Multi-part event read does verify.

2C. disk data does not support Var* types
-----------------------------------------

Here we create a blob version "1.5" where the parts are not Var*.
Disk data property is set like this:

  1) the primary table blob attr is always in memory.  This allows
     using array type to distinguish blobs V1 and V2 (see section 1A).
     Note: should shrink inline size set from mysqld.

  2) the storage type of the primary attr in NDB API actually contains
     the storage type of the blob parts attr DATA (V1) or NDB$DATA (V2)

Implementing 2) requires a bit of trickery in 3 places in NDB API:

  a) at create table and its blob tables
  b) at retrieve table and its blob tables
  c) in the ndb_restore program

This is with no changes in DictTabInfo.  An alternative might have
been a new storage type.

3. partitioning
---------------

On NDB metadata level there is the concept of partition keys.  This is
the set of PK attrs to hash to get the partition id.  By default the
set is all PK attrs.

Partition keys provide default "passive" partitioning.  User defined
partitioning may override it.

Blob table V2 partition keys include those of the primary table (the
PK attrs are exposed, see section 1B).  There are 2 main types:

  1) stripe size 0
  2) stripe size not 0

Type 2) adds NDB$DIST to the partition keys.  This type is not
available from mysql level.

Type 1) means blob data is stored in the same partition as the primary
table row.  There are 2 sub-types:

  1a) default "passive" partitioning
  1b) user-defined partitioning

In 1b) the user (mysql or ndb api application) controls partitioning
for each operation.  Blob data operations belonging to the primary
operation use the same partitioning.

4. backup/restore
-----------------

The only change needed here is noted in 2C c).

It is desirable to add an option to convert blobs V1->V2 but this may
become a separate task.

5. blob events
--------------

Using the PKID field to skip event merge in NDB API event code.
This may become another wl#.
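Illustration for section 1C: a minimal sketch of the 16-byte blob head V2 layout, written in Python only for brevity (the NDB API itself is C++). The names HEAD_V2, pack_head_v2 and unpack_head_v2 are hypothetical and not part of any NDB interface. How the "head+inline length" value is computed is not spelled out in section 1C, so the sketch takes it as a caller-supplied number.

```python
import struct

# Blob head V2 as listed in section 1C, all fields little-endian:
#   2 bytes head+inline length
#   2 bytes reserved (zero)
#   4 bytes NDB$PKID
#   8 bytes blob length
# "<" selects little-endian with no padding, so the size is exactly 16.
HEAD_V2 = struct.Struct("<HHIQ")


def pack_head_v2(varsize, pkid, blob_length):
    """Build a 16-byte V2 head (hypothetical helper).

    varsize is the head+inline length value, supplied by the caller.
    """
    return HEAD_V2.pack(varsize, 0, pkid, blob_length)


def unpack_head_v2(head):
    """Split a 16-byte V2 head into (varsize, pkid, blob_length)."""
    varsize, reserved, pkid, blob_length = HEAD_V2.unpack(head)
    if reserved != 0:
        raise ValueError("reserved bytes must be zero")
    return varsize, pkid, blob_length


head = pack_head_v2(270, 7, 5000)
print(len(head), unpack_head_v2(head))
```

Because every field has a fixed byte order, the same bytes can be decoded on any host, which is the point of Note 2 in section 1C.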
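Illustration for section 1B: a rough sketch of how a blob value splits into parts under V1 (fixed-size DATA) and V2 (varsize NDB$DATA), plus the V1 striping sequence. The function names, the inline size of 256 bytes and the stripe width of 4 parts are assumptions chosen for the example. PART // stripe is simply the plainest mapping that reproduces the 0,0,0,0,1,1,.. sequence quoted above; the real mapping may differ.

```python
def part_sizes(blob_len, inline_size, part_size, varsize):
    """Return the stored size of each blob part (hypothetical helper).

    The first inline_size bytes live in the primary table row; the rest
    is cut into parts of part_size bytes.  With varsize=False (V1) the
    last part still occupies a full part_size; with varsize=True (V2)
    it occupies only the bytes it actually holds.
    """
    rest = max(0, blob_len - inline_size)
    full, tail = divmod(rest, part_size)
    sizes = [part_size] * full
    if tail:
        sizes.append(tail if varsize else part_size)
    return sizes


def dist_for_part(part, stripe_parts):
    """Assumed V1 striping: every stripe_parts parts share one DIST value."""
    return part // stripe_parts


v1 = part_sizes(5000, 256, 2000, varsize=False)
v2 = part_sizes(5000, 256, 2000, varsize=True)
print(v1, v2, sum(v1) - sum(v2))
print([dist_for_part(p, 4) for p in range(6)])
```

For a 5000-byte value the difference between the two totals is the space problem c) refers to.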
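Illustration for section 3: a small sketch of the partition key rule for the V2 blob parts table. The function blob_partition_keys is hypothetical; it only restates the two types listed in section 3 and does not model user-defined partitioning (sub-type 1b).

```python
def blob_partition_keys(primary_partition_keys, stripe_size):
    """Partition keys of the V2 blob parts table (hypothetical helper).

    Type 1 (stripe size 0): same keys as the primary table, so parts
    land in the same partition as the primary table row.
    Type 2 (stripe size not 0): NDB$DIST is added to the keys.
    """
    keys = list(primary_partition_keys)
    if stripe_size != 0:
        keys.append("NDB$DIST")
    return keys


# Example table T is partitioned by key (C).
print(blob_partition_keys(["C"], 0))
print(blob_partition_keys(["C"], 4))
```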
The HLS is sufficient.
Copyright (c) 2000, 2015, Oracle Corporation and/or its affiliates. All rights reserved.