Necessary changes inside innodb to support CTE.
Basically, this is to support multiple readers and single writer access the same
intrinsic table and still maintain the context for them separately.
The key is that these readers and writer are with the same thread, so the work
is serialized:
An example:
reader a reads 1
reader b reads 1
writer c writes 2
reader a reads 2
writer c writes 3
read b reads 2
The current intrinsic table does not have mechanism to support multiple cursor
positions. So this worklog is to make sure we put this support back (the normal
table can support multiple cursor position).
To do so, we will also need mechanism to tell readers to "reposition" if a page
splits happens. Since the tree splits could move the records, including those
already have cursors positioned on it.
This worklog would also remove the consistent read requirement on intrinsic
table. All the access are done with read dirty scheme.
This worklog also puts an optimization that use statistics row count instead of
a full table scan. Since there is only one thread working on intrinsic table,
thus statistics count should be accurate.
CTE stores its result in an internal tmp table: this must
work both with innodb intrinsic tables in-mem (for small-result CTEs) and
on-disk (for big-result).
Four problematic points, from (a), to (d), are identified below.
For recursive CTE they are all blockers.
For nonrecursive CTE only (a) applies and is a blocker.
Example: the recursive CTE algorithm has, in a single thread, TABLE objects TW
(for write)
and TR1 and TR2 (for read), all three pointing to the same physical intrinsic
table. A duplicate-eliminating secondary index may exist on the table. No
updates, no deletes are to be done
on the table. For the recursive algorithm, reads are scans: an initial
rnd_init(), then rnd_next() calls. No index_init.
Example query:
with recursive qn as (
select R0
union all
select from qn # select1
union all
select from qn # select2
)
;
"qn" is the intrinsic table.
An initial step has put a row R0 into the table. From then, the two "recursive"
SELECTs will read the table, calculate new rows, store them, and loop (read the
newly stored rows, calculate more new rows, etc), until they generate no more
rows. TR1 is "qn" read by select1; TR2 is "qn" read by select2. select1 and
select2 write their new rows to "qn" using TW.
The Server layer keeps a count of records contained in TW.
# new iteration (number 1)
If count of records in TW is identical to the previous iteration's count,
the previous iteration didn't produce rows, so terminate.
TR1 read, sees R0
(... based on this R0, calculates a new row named R1 ; could even be multiple
rows, or zero rows in which case the process terminates ...)
TW write row R1
( now feed R1 to next iteration )
# new iteration (number 2)
If count of records in TW is identical to the previous iteration's count, then
terminate.
TR1 read, sees R1
(calculates a new row R2)
TW write row R2
TR2 read, sees R1
(calculates a new row R2a)
TW write row R2a
( now feed R2 and R2a to next iteration )
# new iteration (3)
If count of records in TW is identical to the previous iteration's count, then
terminate.
TR1 read, sees R2
(calculates new rows R3 and R3a)
TW write row R3
TW write row R3a
TR1 read, sees R2a
(calculates new rows R3b and R3c)
TW write row R3b
TW write row R3c
TR2 read, sees R2
(calculates a new row R3d)
TW write row R3d
TR2 read, sees R2a
(calculates no new row, for some reason)
( now feed R3, R3a/b/c/d to next iteration )
etc.
So it is critical that a reader:
(a) is not disturbed by the other readers' reads
(b) is not disturbed by the writer's write_row()
(c) has visibility of rows which the writer has previously written: reader must
be able to catch up continuously, and must receive rows in the order
they were inserted (otherwise it will hopelessly miss rows
or process rows twice). This includes the special case where the reader hits
EOF, then the writer writes one or more rows (this can happen if join buffering
is used or there are more than one recursive query block): reader must see these
rows.
(d) is not disturbed by a conversion from too-large in-mem to on-disk
(create_ondisk_from_heap()).
In a nonrecursive CTE, all writes are done first (single one-time
materialization), then multiple TABLEs may read, example:
with qn as (select R0) select * from qn as foo, qn as bar;
there will be a TABLE TW to write R0 to the intrinsic table "qn", then a TABLE
TR1 will do reads for the first reference to "qn" ("foo") and TR2 for the second
reference to "qn" ("bar").
So for nonrecursive CTE, points break down like this:
- (a) is necessary; but the access methods are NOT necessarily table scans
anymore, they can also be index scan/lookup on any index.
- (b) (c) (d) don't apply as all writes happen before all reads.
- however, TR1/TR2 objects are created *before* the writes, and their cursor
needs visibility to the written rows. This is probably going to work, assuming
that if TR1/TR2 do rnd_init() *after* the writes they will see the rows. Jimmy
to confirm.
Quick facts about InnoDB
========================
* For on-disk tables.
If a table is created without a primary key, InnoDB adds one: it generates a
rowid and indexes it (for intrinsic tables it's a per-table integer counter
starting at 0 when the table is created, and incremented by 1 before every row
write by dict_table_get_next_table_sess_row_id). position() then returns the rowid.
rnd_init/rnd_next always return rows in PK order.
So if we have the rowid as PK, we can say that rnd_next returns rows in
insertion order.
If the table has a UNIQUE NOT NULL index the Server may promote this to be the
PK, in that case our algorithms won't work, as insertion will not be in PK order
while rnd_next will be in PK order. So: we should either verify there's never a
UNIQUE NOT NULL on intrinsic tables, or block promotion.
* For in-mem tables.
There is NO autogenerated PK. position() returns a pair (page_id,
number_of_record_in_page), i.e. a rowid but not one which increments by 1.
Does rnd_next return rows in insertion order (Jimmy to confirm)?
Point (a)
=========
Description:
"a reader is not disturbed by the other readers' reads; any reader may have any
access method".
Does not work in 8.0.0 as intrinsic table code assumes a single reader and thus
stores the "current read position" not into the TABLE but into
prebuilt->index->last_sel_cur which is common to all readers.
Fix has been implemented in trunk-wl883 tree:
Point (b)
=========
Description:
"a reader (table scan) is not disturbed by the writer's write_row()"
With innodb on-disk:
Such table stores data into a clustered index. An insert may change the index
tree, causing the reader to suddenly lose position, so (b) won't work.
Fix has been implemented in trunk-wl883 tree:
Point (c)
========
Description:
"a reader (table scan) has visibility of rows which the writer has previously
written: reader must
be able to catch up continuously, and must receive rows in the order
they were inserted (otherwise it will hopelessly miss rows
or process rows twice). This includes the special case where the reader hits
EOF, then the writer writes one or more rows (this can happen if join buffering
is used or there are more than one recursive query block): reader must see these
rows."
Fix has been implemented in trunk-wl883 tree:
Point (d)
=========
Description:
"a reader (table scan) is not disturbed by a conversion from too-large in-mem to
on-disk (create_ondisk_from_heap())"
when an in-mem intrinsic table becomes too big, it is converted to an on-disk
one (rows are sequentially read from the in-mem and inserted in same order into
the on-disk table). The readers will lose position (as TABLE and "handler"
objects change).
It is not possible that the readers save position() before the
conversion happens, then re-position themselves where they were before with
rnd_pos(the saved position). Because position() in in-mem is based on location
in memory, while in on-disk it's based on incremented-by-1 rowid.
Fix has been implemented in trunk-wl883 tree: readers maintain info "I have read
N rows so far"; after the overflow process has filled the on-disk table, it
re-positions each reader on "row with PK=N" which is exactly the same as the
in-mem row which the reader was on.