InnoDB stores all records inside a fixed-size
unit which is commonly called a "page" (though
InnoDB sometimes calls it a "block" instead).
Currently all pages are the same size, 16KB.
A page contains records, but it also contains headers and trailers. I'll start this description with a high-altitude view of a page's parts, then I'll describe each part of a page. Finally, I'll show an example. This discussion deals only with the most common format, for the leaf page of a data file.
[+/-]
An InnoDB page has seven parts:
Fil Header
Page Header
Infimum + Supremum Records
User Records
Free Space
Page Directory
Fil Trailer
As you can see, a page has two header/trailer pairs. The inner pair, "Page Header" and "Page Directory", are mostly the concern of the \page program group, while the outer pair, "Fil Header" and "Fil Trailer", are mostly the concern of the \fil program group. The "Fil" header also goes by the name of "File Page Header".
Sandwiched between the headers and trailers, are the records and the free (unused) space. A page always begins with two unchanging records called the Infimum and the Supremum. Then come the user records. Between the user records (which grow downwards) and the page directory (which grows upwards) there is space for new records.
The Fil Header has eight parts, as follows:
Name |
Size |
Remarks |
|
4 |
4 ID of the space the page is in |
|
4 |
ordinal page number from start of space |
|
4 |
offset of previous page in key order |
|
4 |
offset of next page in key order |
|
8 |
log serial number of page's latest log record |
|
2 |
current defined types are:
|
|
8 |
"the file has been flushed to disk at least up to this lsn" (log serial number), valid only on the first page of the file |
|
4 |
the latest archived log file number at the time that
|
FIL_PAGE_SPACE is a necessary
identifier because different pages might belong to
different (table) spaces within the same file. The word
"space" is generic jargon for either "log" or
"tablespace".
FIL_PAGE_PREV and
FIL_PAGE_NEXT are the page's "backward"
and "forward" pointers. To show what they're about, I'll
draw a two-level B-tree.
--------
- root -
--------
|
----------------------
| |
| |
-------- --------
- leaf - <--> - leaf -
-------- --------
Everyone has seen a B-tree and knows that the entries in the
root page point to the leaf pages. (I indicate those pointers
with vertical '|' bars in the drawing.) But sometimes people
miss the detail that leaf pages can also point to each other
(I indicate those pointers with a horizontal two-way pointer
'<-->' in the drawing). This feature allows
InnoDB to navigate from leaf to leaf
without having to back up to the root level. This is a
sophistication which you won't find in the classic B-tree,
which is why InnoDB should perhaps be
called a B+-tree instead.
The fields FIL_PAGE_FILE_FLUSH_LSN,
FIL_PAGE_PREV, and
FIL_PAGE_NEXT all have to do with logs,
so I'll refer you to my article “How Logs Work With
MySQL And InnoDB” on
devarticles.com.
FIL_PAGE_FILE_FLUSH_LSN and
FIL_PAGE_ARCH_LOG_NO are valid only for
the first page of a data file.
The Page Header has 14 parts, as follows:
Name |
Size |
Remarks |
|
2 |
number of directory slots in the Page Directory part; initial value = 2 |
|
2 |
record pointer to first record in heap |
|
2 |
number of heap records; initial value = 2 |
|
2 |
record pointer to first free record |
|
2 |
"number of bytes in deleted records" |
|
2 |
record pointer to the last inserted record |
|
2 |
either |
|
2 |
number of consecutive inserts in the same direction, for example, "last 5 were all to the left" |
|
2 |
number of user records |
|
8 |
the highest ID of a transaction which might have changed a record on the page (only set for secondary indexes) |
|
2 |
level within the index (0 for a leaf page) |
|
8 |
identifier of the index the page belongs to |
|
10 |
"file segment header for the leaf pages in a B-tree" (this is irrelevant here) |
|
10 |
"file segment header for the non-leaf pages in a B-tree" (this is irrelevant here) |
(Note: I'll clarify what a "heap" is when I discuss the User Records part of the page.)
Some of the Page Header parts require further explanation:
PAGE_FREE :
Records which have been freed (due to deletion or migration)
are in a one-way linked list. The PAGE_FREE
pointer in the page header points to the first record in the
list. The "next" pointer in the record header (specifically,
in the record's Extra Bytes) points to the next record in the
list.
PAGE_DIRECTION and
PAGE_N_DIRECTION:
It's useful to know whether inserts are coming in a constantly
ascending sequence. That can affect
InnoDB's efficiency.
PAGE_HEAP_TOP and
PAGE_FREE and
PAGE_LAST_INSERT:
Warning: Like all record pointers, these point not to the beginning of the record but to its Origin (see the earlier discussion of Record Structure).
PAGE_BTR_SEG_LEAF and
PAGE_BTR_SEG_TOP:
These variables contain information (space ID, page number,
and byte offset) about index node file segments.
InnoDB uses the information for allocating
new pages. There are two different variables because
InnoDB allocates separately for leaf pages
and upper-level pages.
"Infimum" and "supremum" are real English words but they are
found only in arcane mathematical treatises, and in
InnoDB comments. To
InnoDB, an infimum is lower than the lowest
possible real value (negative infinity) and a supremum is
greater than the greatest possible real value (positive
infinity). InnoDB sets up an infimum record
and a supremum record automatically at page-create time, and
never deletes them. They make a useful barrier to navigation
so that "get-prev" won't pass the beginning and "get-next"
won't pass the end. Also, the infimum record can be a dummy
target for temporary record locks.
The InnoDB code comments distinguish
between "the infimum and supremum records" and the "user
records" (all other kinds).
It's sometimes unclear whether InnoDB
considers the infimum and supremum to be part of the header or
not. Their size is fixed and their position is fixed, so I
guess so.
In the User Records part of a page, you'll find all the records that the user inserted.
There are two ways to navigate through the user records, depending whether you want to think of their organization as an unordered or an ordered list.
An unordered list is often called a "heap". If you make a pile
of stones by saying "whichever one I happen to pick up next
will go on top" — rather than organizing them according
to size and colour — then you end up with a heap.
Similarly, InnoDB does not want to insert
new rows according to the B-tree's key order (that would
involve expensive shifting of large amounts of data), so it
inserts new rows right after the end of the existing rows (at
the top of the Free Space part) or wherever there's space left
by a deleted row.
But by definition the records of a B-tree must be accessible
in order by key value, so there is a record pointer in each
record (the "next" field in the Extra Bytes) which points to
the next record in key order. In other words, the records are
a one-way linked list. So InnoDB can access
rows in key order when searching.
I think it's clear what the Free Space part of a page is, from the discussion of other parts.
The Page Directory part of a page has a variable number of
record pointers. Sometimes the record pointers are called
"slots" or "directory slots". Unlike other DBMSs,
InnoDB does not have a slot for every
record in the page. Instead it keeps a sparse directory. In a
fullish page, there will be one slot for every six records.
The slots track the records' logical order (the order by key
rather than the order by placement on the heap). Therefore, if
the records are 'A''B''F''D' the slots will
be (pointer to 'A') (pointer to 'B') (pointer to 'D')
(pointer to 'F'). Because the slots are in key
order, and each slot has a fixed size, it's easy to do a
binary search of the records on the page via the slots.
(Since the Page Directory does not have a slot for every
record, binary search can only give a rough position and then
InnoDB must follow the "next" record
pointers. InnoDB's "sparse slots" policy
also accounts for the n_owned field in the Extra Bytes part of
a record: n_owned indicates how many more records must be gone
through because they don't have their own slots.)
The Fil Trailer has one part, as follows:
Name |
Size |
Remarks |
|
8 |
low 4 bytes = checksum of page, last 4 bytes = same
as |
The final part of a page, the fil trailer (or File Page
Trailer), exists because InnoDB's architect
worried about integrity. It's impossible for a page to be only
half-written, or corrupted by crashes, because the
log-recovery mechanism restores to a consistent state. But if
something goes really wrong, then it's nice to have a
checksum, and to have a value at the very end of the page
which must be the same as a value at the very beginning of the
page.
For this example, I used Borland's TDUMP again, as I did for the earlier chapter on Record Format. This is what a page looked like:
Address Values in Hexadecimal |
Values in ASCII |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
?? |
|
?? |
|
|
|
|
Let's skip past the first 38 bytes, which are Fil Header. The
bytes of the Page Header start at location 0d4026
hexadecimal:
Location |
Name |
Description |
|
|
There are 5 directory slots. |
|
|
At location |
|
|
There are 18 (hexadecimal 12) records in the page. |
|
|
There are zero free (deleted) records. |
|
|
There are zero bytes in deleted records. |
|
|
The last record was inserted at location
|
|
|
A glance at page0page.h will tell you that 2 is the
#defined value for |
|
|
The last 15 (hexadecimal 0F) inserts were all done "to the right" because I was inserting in ascending order. |
|
|
There are 16 (hexadecimal 10) user records. Notice
that |
|
|
?? |
|
|
Zero because this is a leaf page. |
|
|
This is index number 20. |
|
|
?? |
|
|
?? |
Immediately after the page header are the infimum and supremum records. Looking at the "Values In ASCII" column in the hexadecimal dump, you will see that the contents are in fact the words "infimum" and "supremum" respectively.
Skipping past the User Records and the Free Space, many bytes later, is the end of the 16KB page. The values shown there are the two trailers.
The first trailer (00 74, 02 47, 01 AA, 01 0A, 00
65) is the page directory. It has 5 entries,
because the header field PAGE_N_DIR_SLOTS
says there are 5.
The next trailer (3A E0 AA 71, 00 00 E2
64) is the fil trailer. Notice that the last four
bytes, 00 00 E2 64, appeared before in
the fil header.
