WL#6942: InnoDB GIS: Store GIS POINT datatype as fixed length column rather than BLOB
Status: Complete
Currently, we store GIS data as BLOB in InnoDB. However, if such GIS data type is "POINT", then there is no need to store such datatype in a "BLOB" page, it can be a fixed length datatype. So the idea is that for datatype of POINT, we should store it just like any other FIXED length datatypee.
Functional requirements ----------------------- This worklog doesn't add any new functional features to innodb GIS work. It focuses on optimization for specific data type POINT(Please note, the data type MULTIPOINT is not the same with POINT, it's not included in this WL). So for the functional requirements, IMO, it should be the almost the same with FR of WL6968, except the followings: F-1: A SESSION variable named 'innodb_optimize_point_storage' is introduced, to identify the format users want for POINT. When innodb_optimize_point_storage is TRUE, we store POINT as fixed length, otherwise as var length. The default value is OFF. F-2: We support storing both fixed length POINT and var length POINT in the same table, the type of POINT depends on the value of innodb_optimize_point_storage when the column is created. One exception is that if we do 'DROP COLUMN p, ADD COLUMN p', the format of p would keep as is, since we couldn't identify this case in InnoDB. F-3: We can confirm the data type POINT is of length 25 by checking if the LEN field of table information_schema.innodb_sys_columns is 25(fixed length) or 12(var length). Also, the mtype of POINT could be 14 for fixed length or 15 for var length. For all other non-POINT geometry data types, the LEN shall be 12 while the mtype shall be 16. F-4: If the POINT is a fixed length one, there shall be no padding here. We couldn't catch a overflow error, since the server layer shall always pass a valid POINT value(must be of length 25) to innodb currently. F-5: DB_CANT_CREATE_GEOMETRY_OBJECT would be thrown if we insert a non-POINT data into a POINT field or update POINT field data to non-POINT data. F-6: If we try to alter a geometrical table by modifying a non-POINT field to POINT field, the result depends. a. If the table is empty, or all the data in the original field is POINT, the modifying would be fine. b. If the table is not empty and some data in the original field is non-POINT, the modifying is forbidden and DB_CANT_CREATE_GEOMETRY_OBJECT would be thrown. F-7: For 'ALTER TABLE ADD COLUMN p POINT NOT NULL', if current innodb_optimize_point_storage is ON, we will store default POINIT (POINT(0 0)) for this new column; otherwise, we keep it as is(storing a value of length 0). The bug 17371138 mentioned in HLS is independent of this worklog, so while testing this worklog, we can assume that we always insert corret POINT values but not non-POINT values into POINT fields. Non-Functional requirements --------------------------- Since this is a optimization related worklog, we do expect there would be performance improvement after this worklog. These NF requirements don't need tests from System QA, since the improvement could be minor currently. I observed a disk space saving for a 5 million row POINT table with one spatial index, the saving is nearly 1%. NF-1: As described in the HLS, for fixed length POINT, every field of data type POINT would save 1 byte when it's on both clust index and sec index. So there would be space saving here. But the saving would be minor. For example, for the simplest case that a table with only one POINT column, in the clust index, there would be 1 byte saving for the POINT column, and the space occupied by POINT would be around 38(7+25+6), so the saving ratio should be 1/38. The ratio would go down when the columns increase or in the secondary index. So it would be difficult for us to test the saving by writing MTR test. NF-2: If the POINT is fixed length, we don't need to calc the offset, etc. So there would be more or less CPU saving as well. If there would be a benchmark testing for GIS work, we can see if there is any improvement.
Overall ------- Currently, POINT is defined as a 0-dimensional geometry, with only X and Y-coordinate. The WKB value that corresponds to any POINT would consist of a sequence of 21 bytes. The POINT field, which is the combination of SRID and WKB value, shall still be an object of Field_geom which inherits from Field_blob, but the data of POINT could be stored as a fixed length field such as CHAR in InnoDB layer. SESSION variable 'innodb_optimize_point_storage' ------------------------------------------------ We don't store POINT only as fixed length field, we allow users to specify the format(fixed length or var length) they want to use. We introduce a session variable named 'innodb_optimize_point_storage' to identify the format in which user want to store POINT, whose default value is OFF. This variable is a column level variable, that means each column's type depends on the innodb_optimize_point_storage when it's created. So a table could has mixture of fixed length POINT and var length POINT. Data checking ------------- In order to start this optimization, we shall fix the following bug first: BUG 17371138 - INNODB GIS: NONE-POINT DATA CAN BE STORED AS POINT Because current POINT field accepts those none-POINT values, such as value of LINESTRING, POLYGON, etc. which may be as long as possible, if we don't fix the bug while are having no way to specify the max length of geom data types currently, we can't store POINT inline as a fixed length field. In case the above bug couldn't be fixed in time, we shall do a data length checking before we insert any data into a POINT field. That can be done in row_ins_index_entry_set_vals. DB_CANT_CREATE_GEOMETRY_OBJECT would be thrown if the data is error. Type of POINT in InnoDB ----------------------- We use DATA_GEOMETRY for all geometry datatypes in InnoDB now. To optimize POINT, we will introduce 2 more types for POINT, DATA_POINT and DATA_VAR_POINT. So DATA_POINT is used for POINT data which would be stored as fixed length, while DATA_VAR_POINT is used for var length POINT. For all other geom data types, we still use DATA_GEOMETRY. DATA_GEOMETRY_MTYPE is still used for all other geom data types, including DATA_POINT and DATA_VAR_POINT, while DATA_POINT_MTYPE is only for DATA_POINT and DATA_VAR_POINT. In Innodb layer, we treat DATA_POINT the same as CHAR while DATA_GEOMETRY and DATA_VAR_POINT as BLOB as is. The only difference between POINT and CHAR is CHAR is a ASCII-based datatype, but POINT is fixed length binary data which doesn't need padding. Data of POINT in InnoDB ----------------------- Currently, all POINT data are essentially stored as binary, InnoDB doesn't need to care about the format of the raw binary data. So we couldn't expect the btree on POINT support sorting according to X, Y value of POINT. In fact, both server and myisam don't support sorting POINT according to X, Y. (I guess we can either implement this feature when server supports it or create the incoming funcional index(if it would support X(point) and Y(point)) to meet the requirement. Length of POINT in InnoDB ------------------------- First, we consider the fixed length type DATA_POINT. A Macro named DATA_POINT_LEN is defined as 25, which is the sum of SRID_SIZE(4) + WKB_HEADER_SIZE(1+4) + POINT_DATA_SIZE(8*2) as the fixed length mentioned above. It will be used in innodb layer to represent the column length for DATA_POINT. Thus, the length of DATA_POINT column in physical record is 25. The fixed size/max size/min size of DATA_POINT column is always 25 as well. When we create spatial index on any geom type data, we just store a MBR for each tuple. A MBR consists of 4 double values, which is always 32 bytes. Actually, it shall be sizeof(double) * SPDIMS * 2, supposing sizeof(double) is 8, and SPDIMS is always 2 currently. So when we create spatial index on POINT field, we need to store 32 bytes for POINT, but not 25. As we know, we use dict_field_t to represent a field used in some index. For POINT, the fixed_len of dict_field_t should depend on the type of index we are using. If the POINT is indexed by B-TREE, the fixed_len would still the original POINT data with length of 25, while if it's indexed by R-TREE, the fixed_len would be 32 to save the MBR we need. dict_index_add_col should know all these details, so that we don't need to take care of this issue here and there. If the POINT is a fixed length column, the MBR field is also a fixed length field, so there's no extra byte for MBR field representing a DATA_POINT to store the offset. Second, for DATA_VAR_POINT, the lengths are the same as other DATA_GEOMETRYs. ADD COLUMN issue ---------------- When we execute ALTER TABLE ... ADD COLUMN p POINT NOT NULL; We will insert a value of length 0 into the table before this WL. Since we could store fixed length DATA_POINT into table in this WL, and we couldn't insert the empty value into a fixed length column, we need a default POINT value here to address the 'ADD COLUMN' issue. The default value is POINT(0 0). For other geometry types (including DATA_VAR_POINT), we keep the behaviour as is. Return to MySQL layer --------------------- row_mysql_store_geometry and row_mysql_read_geometry shall deal with DATA_POINT, DATA_VAR_POINT and other DATA_GEOMETRY in the same way as is. Because the server layer treats all geom values as blob, when we return value of DATA_POINT/DATA_VAR_POINT, we shall store it as a blob as well. A flag 'templ_contains_fixed_point' is introduced in the prebuilt struct to indicate if there is any column of DATA_POINT.
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.