WL#9248: InnoDB: Allow multiple readers and one writer on InnoDB internal tmp table
Affects: Server-8.0 — Status: Complete — Priority: Medium
Necessary changes inside innodb to support CTE. Basically, this is to support multiple readers and single writer access the same intrinsic table and still maintain the context for them separately. The key is that these readers and writer are with the same thread, so the work is serialized: An example: reader a reads 1 reader b reads 1 writer c writes 2 reader a reads 2 writer c writes 3 read b reads 2 The current intrinsic table does not have mechanism to support multiple cursor positions. So this worklog is to make sure we put this support back (the normal table can support multiple cursor position). To do so, we will also need mechanism to tell readers to "reposition" if a page splits happens. Since the tree splits could move the records, including those already have cursors positioned on it. This worklog would also remove the consistent read requirement on intrinsic table. All the access are done with read dirty scheme. This worklog also puts an optimization that use statistics row count instead of a full table scan. Since there is only one thread working on intrinsic table, thus statistics count should be accurate.
CTE stores its result in an internal tmp table: this must work both with innodb intrinsic tables in-mem (for small-result CTEs) and on-disk (for big-result). Four problematic points, from (a), to (d), are identified below. For recursive CTE they are all blockers. For nonrecursive CTE only (a) applies and is a blocker. Example: the recursive CTE algorithm has, in a single thread, TABLE objects TW (for write) and TR1 and TR2 (for read), all three pointing to the same physical intrinsic table. A duplicate-eliminating secondary index may exist on the table. No updates, no deletes are to be done on the table. For the recursive algorithm, reads are scans: an initial rnd_init(), then rnd_next() calls. No index_init. Example query: with recursive qn as ( select R0 union all select <expression1> from qn # select1 union all select <expression2> from qn # select2 ) <select using qn, irrelevant here>; "qn" is the intrinsic table. An initial step has put a row R0 into the table. From then, the two "recursive" SELECTs will read the table, calculate new rows, store them, and loop (read the newly stored rows, calculate more new rows, etc), until they generate no more rows. TR1 is "qn" read by select1; TR2 is "qn" read by select2. select1 and select2 write their new rows to "qn" using TW. The Server layer keeps a count of records contained in TW. # new iteration (number 1) If count of records in TW is identical to the previous iteration's count, the previous iteration didn't produce rows, so terminate. TR1 read, sees R0 (... based on this R0, calculates a new row named R1 ; could even be multiple rows, or zero rows in which case the process terminates ...) TW write row R1 ( now feed R1 to next iteration ) # new iteration (number 2) If count of records in TW is identical to the previous iteration's count, then terminate. TR1 read, sees R1 (calculates a new row R2) TW write row R2 TR2 read, sees R1 (calculates a new row R2a) TW write row R2a ( now feed R2 and R2a to next iteration ) # new iteration (3) If count of records in TW is identical to the previous iteration's count, then terminate. TR1 read, sees R2 (calculates new rows R3 and R3a) TW write row R3 TW write row R3a TR1 read, sees R2a (calculates new rows R3b and R3c) TW write row R3b TW write row R3c TR2 read, sees R2 (calculates a new row R3d) TW write row R3d TR2 read, sees R2a (calculates no new row, for some reason) ( now feed R3, R3a/b/c/d to next iteration ) etc. So it is critical that a reader: (a) is not disturbed by the other readers' reads (b) is not disturbed by the writer's write_row() (c) has visibility of rows which the writer has previously written: reader must be able to catch up continuously, and must receive rows in the order they were inserted (otherwise it will hopelessly miss rows or process rows twice). This includes the special case where the reader hits EOF, then the writer writes one or more rows (this can happen if join buffering is used or there are more than one recursive query block): reader must see these rows. (d) is not disturbed by a conversion from too-large in-mem to on-disk (create_ondisk_from_heap()). In a nonrecursive CTE, all writes are done first (single one-time materialization), then multiple TABLEs may read, example: with qn as (select R0) select * from qn as foo, qn as bar; there will be a TABLE TW to write R0 to the intrinsic table "qn", then a TABLE TR1 will do reads for the first reference to "qn" ("foo") and TR2 for the second reference to "qn" ("bar"). So for nonrecursive CTE, points break down like this: - (a) is necessary; but the access methods are NOT necessarily table scans anymore, they can also be index scan/lookup on any index. - (b) (c) (d) don't apply as all writes happen before all reads. - however, TR1/TR2 objects are created *before* the writes, and their cursor needs visibility to the written rows. This is probably going to work, assuming that if TR1/TR2 do rnd_init() *after* the writes they will see the rows. Jimmy to confirm. Quick facts about InnoDB ======================== * For on-disk tables. If a table is created without a primary key, InnoDB adds one: it generates a rowid and indexes it (for intrinsic tables it's a per-table integer counter starting at 0 when the table is created, and incremented by 1 before every row write by dict_table_get_next_table_sess_row_id). position() then returns the rowid. rnd_init/rnd_next always return rows in PK order. So if we have the rowid as PK, we can say that rnd_next returns rows in insertion order. If the table has a UNIQUE NOT NULL index the Server may promote this to be the PK, in that case our algorithms won't work, as insertion will not be in PK order while rnd_next will be in PK order. So: we should either verify there's never a UNIQUE NOT NULL on intrinsic tables, or block promotion. * For in-mem tables. There is NO autogenerated PK. position() returns a pair (page_id, number_of_record_in_page), i.e. a rowid but not one which increments by 1. Does rnd_next return rows in insertion order (Jimmy to confirm)? Point (a) ========= Description: "a reader is not disturbed by the other readers' reads; any reader may have any access method". Does not work in 8.0.0 as intrinsic table code assumes a single reader and thus stores the "current read position" not into the TABLE but into prebuilt->index->last_sel_cur which is common to all readers. Fix has been implemented in trunk-wl883 tree: <description> Point (b) ========= Description: "a reader (table scan) is not disturbed by the writer's write_row()" With innodb on-disk: Such table stores data into a clustered index. An insert may change the index tree, causing the reader to suddenly lose position, so (b) won't work. Fix has been implemented in trunk-wl883 tree: <description> Point (c) ======== Description: "a reader (table scan) has visibility of rows which the writer has previously written: reader must be able to catch up continuously, and must receive rows in the order they were inserted (otherwise it will hopelessly miss rows or process rows twice). This includes the special case where the reader hits EOF, then the writer writes one or more rows (this can happen if join buffering is used or there are more than one recursive query block): reader must see these rows." Fix has been implemented in trunk-wl883 tree: <description> Point (d) ========= Description: "a reader (table scan) is not disturbed by a conversion from too-large in-mem to on-disk (create_ondisk_from_heap())" when an in-mem intrinsic table becomes too big, it is converted to an on-disk one (rows are sequentially read from the in-mem and inserted in same order into the on-disk table). The readers will lose position (as TABLE and "handler" objects change). It is not possible that the readers save position() before the conversion happens, then re-position themselves where they were before with rnd_pos(the saved position). Because position() in in-mem is based on location in memory, while in on-disk it's based on incremented-by-1 rowid. Fix has been implemented in trunk-wl883 tree: readers maintain info "I have read N rows so far"; after the overflow process has filled the on-disk table, it re-positions each reader on "row with PK=N" which is exactly the same as the in-mem row which the reader was on.
Copyright (c) 2000, 2017, Oracle Corporation and/or its affiliates. All rights reserved.