WL#6942: InnoDB GIS: Store GIS POINT datatype as fixed length column rather than BLOB

Status: Complete

Currently, we store GIS data as BLOB in InnoDB. However, if such GIS data type
is "POINT", then there is no need to store such datatype in a "BLOB" page, it
can be a fixed length datatype.

So the idea is that for datatype of POINT, we should store it just like any
other FIXED length datatypee.
Functional requirements

This worklog doesn't add any new functional features to innodb GIS work.
It focuses on optimization for specific data type POINT(Please note, the
data type MULTIPOINT is not the same with POINT, it's not included in this WL).
So for the functional requirements, IMO, it should be the almost the same with
FR of WL6968, except the followings:

F-1: A SESSION variable named 'innodb_optimize_point_storage' is introduced, to
     identify the format users want for POINT. When 
     innodb_optimize_point_storage is TRUE, we store POINT as fixed length,
     otherwise as var length.
     The default value is OFF.

F-2: We support storing both fixed length POINT and var length POINT in the
     same table, the type of POINT depends on the value of
     innodb_optimize_point_storage when the column is created.
     One exception is that if we do 'DROP COLUMN p, ADD COLUMN p', the format
     of p would keep as is, since we couldn't identify this case in InnoDB.

F-3: We can confirm the data type POINT is of length 25 by checking if the LEN
     field of table information_schema.innodb_sys_columns is 25(fixed length)
     or 12(var length). Also, the mtype of POINT could be 14 for fixed length
     or 15 for var length.
     For all other non-POINT geometry data types, the LEN shall be 12 while
     the mtype shall be 16.

F-4: If the POINT is a fixed length one, there shall be no padding here.
     We couldn't catch a overflow error, since the server layer shall always
     pass a valid POINT value(must be of length 25) to innodb currently.

F-5: DB_CANT_CREATE_GEOMETRY_OBJECT would be thrown if we insert a non-POINT
     data into a POINT field or update POINT field data to non-POINT data.

F-6: If we try to alter a geometrical table by modifying a non-POINT field to
     POINT field, the result depends.
     a. If the table is empty, or all the data in the original field is POINT,
        the modifying would be fine.
     b. If the table is not empty and some data in the original field is
        non-POINT, the modifying is forbidden and DB_CANT_CREATE_GEOMETRY_OBJECT
        would be thrown.

     innodb_optimize_point_storage is ON, we will store default POINIT
     (POINT(0 0)) for this new column; otherwise, we keep it as is(storing
     a value of length 0).

The bug 17371138 mentioned in HLS is independent of this worklog, so
while testing this worklog, we can assume that we always insert corret POINT
values but not non-POINT values into POINT fields.

Non-Functional requirements

Since this is a optimization related worklog, we do expect there would be
performance improvement after this worklog. These NF requirements don't need
tests from System QA, since the improvement could be minor currently. I observed
a disk space saving for a 5 million row POINT table with one spatial index, the
saving is nearly 1%.

NF-1: As described in the HLS, for fixed length POINT, every field of data type
      POINT would save 1 byte when it's on both clust index and sec index.
      So there would be space saving here.

      But the saving would be minor. For example, for the simplest case that
      a table with only one POINT column, in the clust index, there would be
      1 byte saving for the POINT column, and the space occupied by POINT would
      be around 38(7+25+6), so the saving ratio should be 1/38. The ratio would
      go down when the columns increase or in the secondary index. So it would
      be difficult for us to test the saving by writing MTR test.

NF-2: If the POINT is fixed length, we don't need to calc the offset, etc.
      So there would be more or less CPU saving as well. If there would be a
      benchmark testing for GIS work, we can see if there is any improvement.


Currently, POINT is defined as a 0-dimensional geometry, with only X and
Y-coordinate. The WKB value that corresponds to any POINT would consist of
a sequence of 21 bytes. The POINT field, which is the combination of SRID and
WKB value, shall still be an object of Field_geom which inherits from
Field_blob, but the data of POINT could be stored as a fixed length field
such as CHAR in InnoDB layer.

SESSION variable 'innodb_optimize_point_storage'

We don't store POINT only as fixed length field, we allow users to specify
the format(fixed length or var length) they want to use.

We introduce a session variable named 'innodb_optimize_point_storage' to
identify the format in which user want to store POINT, whose default value
is OFF.

This variable is a column level variable, that means each column's type
depends on the innodb_optimize_point_storage when it's created. So a table
could has mixture of fixed length POINT and var length POINT.

Data checking

In order to start this optimization, we shall fix the following bug first:
Because current POINT field accepts those none-POINT values, such as value of
LINESTRING, POLYGON, etc. which may be as long as possible, if we don't fix
the bug while are having no way to specify the max length of geom data types
currently, we can't store POINT inline as a fixed length field.

In case the above bug couldn't be fixed in time, we shall do a data length
checking before we insert any data into a POINT field. That can be done in
row_ins_index_entry_set_vals. DB_CANT_CREATE_GEOMETRY_OBJECT would be thrown
if the data is error.

Type of POINT in InnoDB

We use DATA_GEOMETRY for all geometry datatypes in InnoDB now. To optimize
POINT, we will introduce 2 more types for POINT, DATA_POINT and DATA_VAR_POINT.
So DATA_POINT is used for POINT data which would be stored as fixed length,
while DATA_VAR_POINT is used for var length POINT. For all other geom data
types, we still use DATA_GEOMETRY.

DATA_GEOMETRY_MTYPE is still used for all other geom data types, including

In Innodb layer, we treat DATA_POINT the same as CHAR while DATA_GEOMETRY
and DATA_VAR_POINT as BLOB as is. The only difference between POINT and CHAR
is CHAR is a ASCII-based datatype, but POINT is fixed length binary data
which doesn't need padding.

Data of POINT in InnoDB

Currently, all POINT data are essentially stored as binary, InnoDB doesn't
need to care about the format of the raw binary data. So we couldn't expect
the btree on POINT support sorting according to X, Y value of POINT.

In fact, both server and myisam don't support sorting POINT according to X, Y.
(I guess we can either implement this feature when server supports it or
create the incoming funcional index(if it would support X(point) and Y(point))
to meet the requirement.

Length of POINT in InnoDB

First, we consider the fixed length type DATA_POINT.

A Macro named DATA_POINT_LEN is defined as 25, which is the sum of SRID_SIZE(4)
+ WKB_HEADER_SIZE(1+4) + POINT_DATA_SIZE(8*2) as the fixed length mentioned
above. It will be used in innodb layer to represent the column length for
DATA_POINT. Thus, the length of DATA_POINT column in physical record is 25.
The fixed size/max size/min size of DATA_POINT column is always 25 as well.

When we create spatial index on any geom type data, we just store a MBR for
each tuple. A MBR consists of 4 double values, which is always 32 bytes.
Actually, it shall be sizeof(double) * SPDIMS * 2, supposing sizeof(double)
is 8, and SPDIMS is always 2 currently.

So when we create spatial index on POINT field, we need to store 32 bytes for
POINT, but not 25. As we know, we use dict_field_t to represent a field
used in some index. For POINT, the fixed_len of dict_field_t should depend
on the type of index we are using. If the POINT is indexed by B-TREE,
the fixed_len would still the original POINT data with length of 25, while
if it's indexed by R-TREE, the fixed_len would be 32 to save the MBR we need.
dict_index_add_col should know all these details, so that we don't need to
take care of this issue here and there.

If the POINT is a fixed length column, the MBR field is also a fixed length
field, so there's no extra byte for MBR field representing a DATA_POINT
to store the offset.

Second, for DATA_VAR_POINT, the lengths are the same as other DATA_GEOMETRYs.


When we execute ALTER TABLE ... ADD COLUMN p POINT NOT NULL; We will insert
a value of length 0 into the table before this WL. Since we could store fixed
length DATA_POINT into table in this WL, and we couldn't insert the empty value
into a fixed length column, we need a default POINT value here to address the
'ADD COLUMN' issue. The default value is POINT(0 0). For other geometry types
(including DATA_VAR_POINT), we keep the behaviour as is.

Return to MySQL layer

row_mysql_store_geometry and row_mysql_read_geometry shall deal with
DATA_POINT, DATA_VAR_POINT and other DATA_GEOMETRY in the same way as is.
Because the server layer treats all geom values as blob, when we return
value of DATA_POINT/DATA_VAR_POINT, we shall store it as a blob as well.

A flag 'templ_contains_fixed_point' is introduced in the prebuilt struct to
indicate if there is any column of DATA_POINT.