WL#6965: InnoDB: Truncate UNDO logs/tablespace

Status: Complete

Let me first clarify difference between TRUNCATE and SHRINK.

TRUNCATE:
- Truncating undo tablespace to make it equivaluent to what it was when it was
created new. The physical size of undo tablespace file will be reset to its
default initially configured size and all the undo logs within the tablespace
will be removed

SHRINKING:
- Shrinking undo tablespace to reduce the file size.
(New size of undo tablespace can be anything between least possible size and
current size. The undo records up to the shrink size are maintained and are valid.

This WL aims at truncating (and not shrinkng) of UNDO tablespace.

UNDO logs and UNDO tablespaces are used interchangebly.

-----------------

When transaction performs any DML activity it would result in generation of UNDO
logs. Once transaction is committed then INSERT generated UNDO logs are cleared
immediately but UPDATE/DELETE generated UNDO logs are kept to service MVCC.

Purge tracks the limit and once it identifies that the said UNDO log entries are
no more needed it removes it from the UNDO logs.
Removal here means pages related to these entries can be marked as free and
ready-for-reallocation. This activity doesn't result in reducing the size of
UNDO logs/tablespace.

On a heavily loaded system even though we re-use the UNDO log pages.
In order to implement MVCC (and purge) new pages areadded to the UNDO logs at a
higher rate than purge can truncate (remove). This can result in an increase in
the undo log file size and can consume a significant size on disk.

WL will try to address this problem by periodically truncating undo
logs/tablespace. Truncate will be done online without affecting server operation
While the undo tablespace is taken offline for truncate it would not be
available for attaching to new transactions and so the total transaction
handling capacity of the server will reduced till TRUNCATE is complete. The
truncate of the UNDO tablespace is a millisecond operation and therefore no
significant user impact expected. Note: Other UNDO tablespaces will be available
to service transaction so it will not stall the server except we can see some
slowness or delayed allocation.

One of the tablespace will be selected for truncate and will be made inactive
for that time period. During this time-period there should be another UNDO
tablespace to service the new upcoming transactions.

Requirement for UNDO tablespace truncate:

- There should be at-least 2 UNDO tablespaces and innobase_undo_logs should be
configured accordingly so that there are enough rsegs.
- Server should be configured to use UNDO tablespace (if used with system
tablespace it will never be truncated.)
- Functional Requirement:

  - Server should be started with at-least 2 UNDO tablespaces and
    innodb_undo_logs set to a value such that there are at-least 2 REDO rollback
    segments. (--innodb_undo_tablespace=2 --innodb_undo_logs=35)
    UNDO log count of 35 suggest following
    1st rseg resides in system tablespace that is 0
    2nd - 33rd rseg resides in system temporary tablespace (these are non-redo
    rsegs)
    34th and 35th rsegs resides in undo tablespaces (normally wih space-ids
    1, 2 .... etc)

  - Server should keep periodic watch on size of UNDO tablespace.
    If size of UNDO tablespace > threshold then Server should
    kick-start TRUNCATE. If tablespace is made inactive for TRUNCATE
    then number of transactions, server can support, will reduce.
    Based on the load the threshold should be configured taking into
    consideraion above criteria of reduced transaction handling.
    Note: threshold is configurable. Check HLS for more details.

  - On truncate, size of tablespace should be resorted to what it
    was while was created.
    (NOTE: Verification of this could be bit tricky if tablespace get re-used
     before user go to the needed directory to check the size.)

  - Till date, purge of rsegs use to happen once in 128 times of purge action
    invocation. With this WL we would introduce a variable 
    innodb_purge_rseg_truncate_frequency that can help control how frequenty 
    this action should take place.
    Permissible Vaue: 1 - 128.

  - Complete undo-truncate action is controlled using a flag name
    innodb_undo_log_truncate (on/off or 0/1). Default it is turned-off.
    Adminsitrator can trun it on based on workload. Only when it is on
    UNDO-Tablespace will be marked for truncate.

- Non-Functional Requirement:
  - TRUNCATE shouldn't affect MVCC correctness.

Understanding tablespace-rseg-transaction relationship
======================================================

Let's first understand how these 3 components are inter-related.

To start with, scope of the WL is limited only if undo logs resides in undo
tablespace (not residing in system tablespace) and so discussion will only
consider related scenario.

- User can specify undo tablespaces/logs to create using
  --innodb_undo_tablespaces=n
  --innodb_undo_logs=m

- If m > n (m >= 34) then from rsegs from (34..to.. m) are allocated tablespaces
in round-robbin fashion.

  (Note: rseg id start with 0 indices)

  0	 -> rseg resides always in system tablespace
  1..32  -> rseg resides in shared temporary tablespace (ibtmp1)

  33     -> undo-tablespace-0
  34     -> undo-tablespace-1
  ....
  33+n-1 -> undo-tablespace-n-1 (undo tablespace indexes from 0 so n-1 is last).
  33+n   -> undo-tablespace-0
  33+n+1 -> undo-tablespace-1
  ....
  m      -> undo-tablespace-x

  That means, there is many-one mapping between undo logs and tablespaces.
  (There is one-one between undo logs and rseg so both terms can be used 
   interchangeably)

- Besides this, each undo logs/rseg has 1024 slots (0 .... 1023).
  That means 1024 t   can share same rseg.

  	txn		rseg		tablespace

	1
	2		 0		0
	3

	4		 1		1
	5 

	6		 2		0
	7
	8

	9		 4		0
	10
	11

	Many	-> 	One
			Many	->	One

  This means to flag a tablespace as free for truncate all the trxs that are 
  mapped to that tablespace should be complete, besides, ensuring that undo
  logs of these trxs are not needed by purge or MVCC.

  For INSERT, UNDO logs are immediately freed (on commit)
  For UPDATE (INCLUDING DELETE), UNDO logs are not freed on trx_commit
  (as they might be needed by MVCC)

  Instead, purge looks at the limit and try to free undo log pages once
  it has consumed the undo logs and limit (enforced using trx->id)
  signals it that these undo logs are not needed for MVCC.

-------------------------------------------------------------------------------

Approach
========

  1. Detect if UNDO log needs truncate. This could be based on some size 
     threshold. (TODO: Size threshold should be dynamically configurable.)

  2. Check if UNDO logs can be truncated. There has to be some minimum 
     number of undo logs in order to continue smooth operation of server.
     For example: if there is only 1 undo tablespace then we can't schedule
     it for truncate else it would halt complete server operation.

  3. Once identified, mark it so that rsegs associated with it will not
     be allocated to new upcoming transactions.
   
  4. As part of purge workload check if any UNDO tablespace is marked for 
     truncate If found, then try to see if all the pages are free.
     (purge enforces this using the limit (trx->id) clause).
     All free pages means that UNDO log can be truncated now.
     If there is cached page then check if undo log record are needed based
     on limit as they are not purged by the purge thread.

  5. Initiate truncate. Truncating the tablespace to the original size.

  6. Once complete re-enable the rsegs for allocation.


-------------------------------------------------------------------------------

Changes List:
============

New enviornment variable introduced:
------------------------------------

    - innodb_max_undo_logs_size: Threshold after which UNDO tablespace
      will qualify for truncate.
	  default: 1024 MB / 1 GB
	  min: 10 MM
	  max: ULONGLONG_MAX

    - innodb_purge_rseg_truncate_frequency:
      Rate at which undo log purge should be invoked
      as part of purge action. Value = n means invoke undo log purge on every
      nth iteration of purge invocation.
           default: 128
           min: 1
           max: 128

   - innodb_undo_log_truncate: Enable/Disable undo-tablespace truncate.
     default: off (boolean variable. possible value on/off or 0/1).
     If ON only then undo tablespace will be marked for truncate.
     Note: If variable is truned-off while a undo-tablespace is already
     marked for truncate then truncate action for such undo-tablespace
     is not aborted. It continues but no new tablespace is selected unless
     user turn it on again.

   Both variables are dynamically configurable and have global scope.

Identify UNDO tablespace to truncate:
-------------------------------------
   
 1. Iterate over all the undo-tablespaces and check size of each tablespace.
    If size of tablespace is > threshold then mark the tablespace active for
    truncate.

    Ensure that there are other usuable undo-tablespace(s) and rsegs
    (besides marked one) so that operation of server doesn't halt.

 2. Once tablespace is selected, then mark all the rseg residing in that
    tablespace as inactive for allocation. This will avoid allocation of these
    rsegs to new read-write transactions.
    Existing transaction will continue to use these rsegs.


Purge action:
-------------

  1. Purge thread based on innodb_purge_rseg_truncate_frequency will invoke 
     undo truncate action. If UNDO tablespace is not marked for truncate then 
     purge thread will free the undo log segment as it does now.
     If UNDO tablespace is marked for truncate then purge thread will avoid
     freeing of undo log segment instead just maintain the stat that this
     undo log segment can be freed. In short it does delay free as anyway
     tablespace is going to get truncated.


|---------------|             |--------------| 
| ONLINE        |  undo tblsp | OFFLINE      |   all rsegs
| rseg          |--> size --->| rseg not     |--> in undo tblsp
| available     | > threshold |available     |    are free
| for allocation|             |for allocation|      |
|---------------|             |--------------|      |
     A                            A                 |
     |                            | (truncate fail) v
     |                         |------------------------------------|
     |  (truncate success)     |      TRUNCATE                      |
     |-------------------------| Truncate undo tablespace which in  |
                               | turn trucates rsegs residing in it |
                               | Reinitialize rsegs.                |
                               |----------------------------------- |


Truncate action:
----------------

  1. Check if all the rsegs residing in the marked UNDO tablespace are free.
     Free here means: All the undo records residing in all the rsegs are no more
     needed for MVCC.

     Once this is ensured a truncate action can be initiated.

  2. Stop all operation on UNDO tablespace. Ensure that pages from buffer pool
     are either flushed or released.

  3. log-checkpoint. This is to ensure that REDO records related to UNDO
     tablepsace are no more accessible/valid once we truncate the undo 
     tablespace.

  4. DDL log to safeguard truncate action from crash.
     DDL log file name undo_<space_id>_trunc.log is created in srv-log-dir.

  5. Truncate the file back to original size that exist when undo tablespace is
     created new. Re-initialize tablespace and rseg headers.
  
  6. Remove DDL log marking action complete.

  7. Re-enable all the rsegs making them avaiable for re-allocation.


Fixup action during server startup:
-----------------------------------

  1. During server startup all the available UNDO tablespaces are opened.
 
  2. As part of the operation, check if there is corresponding UNDO tablespace 
     DDL log file. If present, then initiate the truncate action as it 
     indicates that truncate action was not successful.

  3. Once done, normal server flow will open then new file and initialize rsegs 
     accordingly. By default, all rsegs are in active mode.

Finding Optimal value of params:
-------------------------------

- If undo-truncate is going in parallel as quoted above we would surely see a
drop in TPS. How much drop in TPS depends on lot of varying factors. Let me list
them here.

- Number of tablespaces. More the better
- Number of undo logs. More the better
- Threshold. More the better
- innodb_purge_rseg_truncate_frequency. More the better
- Machine configuration and IO sub-system. Truncate is fil-system operation.
  Depending on how fast the IO sub-system is.
- Length of parallel running trxn. If any undo-tablespace hold an active data
  then it can't be truncated even if active data occupies few bytes in 
  undo-tablespace.
- Invocation/Scheduling of Purge thread. Truncate action is sub-action of purge 
  thread and purge thread is invoked only when it has significant work to do.
- We recently did a study to understand this effect and here are results from
  aupra01/06.

======================================================================

Machine Used: SURPA-01

--------

Loaded data using:
./sysbench --test=oltp --mysql-table-engine=innodb --oltp-table-size=1000000
--mysql-socket=/export-c/kbauskar/mysql.sock prepare --mysql-user=root

workload:
./sysbench --test=oltp --init-rng=on  --oltp-dist-type=uniform
--mysql-socket=/export-c/kbauskar/mysql.sock --mysql-user=root --max-requests=0
--mysql-table-engine=innodb --max-time=60 --oltp-table-size=1000000
--num-threads=xxxxx --oltp-skip-trx=off run

Strategy:
5 runs for each thread. Ignore first 2 runs and then take average of last 3 runs.

--------
Configuration used:
Baseline: --innodb_undo_tablespaces=2 --innodb_undo_logs=35
WL:  --innodb_undo_log_truncate=1 --innodb_undo_tablespaces=2
--innodb_undo_logs=35 --innodb_max_undo_log_size=10M
--innodb_purge_rseg_truncate_frequency=1

Results:
                Baseline (mysql-trunk)        WL#6965          regression
threads

1                    297                292                   

2                    520                534

4                    1033                1049

8                    1880                1888

16                  3454                3452

32                  6177                6153       

64                  7072                6647        6%

128                6974                5924        15%


- No regression for thread 1-32 as there is no undo-truncate taking place.

- Regression starts from 64 threads as undo-log size goes beyond 10M threshold.

- For 128 threads almost at any given time one of the undo tablespaces (out of
configured 2) is being truncated.
  So an expected regression of 15%.

--------

Configuration used:
Baseline: --innodb_undo_tablespaces=2 --innodb_undo_logs=35
WL: --innodb_undo_log_truncate=1 --innodb_undo_tablespaces=2
--innodb_undo_logs=35 --innodb_max_undo_log_size=100M

Results:

threads    baseline        WL            regression
128            6974            6192        11%

- Regression went down by 4% moment we reduced the frequency of undo-truncate.
(by increasing threshold and keeping the rseg-purge-frequency to default of 128).

--------

Configuration used:
Baseline: --innodb_undo_tablespaces=8 --innodb_undo_logs=128
WL: --innodb_undo_log_truncate=1 --innodb_undo_tablespaces=8
--innodb_undo_logs=128 --innodb_max_undo_log_size=10M
--innodb_purge_rseg_truncate_frequency=1

Results:
threads    baseline        WL            regression
128            7095            6888        2.91%

- Regression down to 2-3% with 8 undo tablespace and 128 undo logs. Note:
threshold is still 10M so frequency of undo-truncate is still higher.

--------

Configuration used:
Baseline: --innodb_undo_tablespaces=8 --innodb_undo_logs=128
WL: --innodb_undo_log_truncate=1 --innodb_undo_tablespaces=8
--innodb_undo_logs=128 --innodb_max_undo_log_size=100M

Results:
threads    baseline        WL            regression
128            7095            7027        0.009%

- No regression if we increase the threshold from 10M -> 100M. Default value of
threshold if 1G given the frequency of undo-truncate that user might need.

--------

Configuration used:
Baseline: --innodb_undo_tablespaces=2 --innodb_undo_logs=128
WL: --innodb_undo_log_truncate=1 --innodb_undo_tablespaces=2
--innodb_undo_logs=128 --innodb_max_undo_log_size=100M

Results:
threads    baseline        WL            regression
128            7155            6882         3.8%

--------

Machine Used: SURPA-06

Baseline:
--innodb_undo_tablespaces=2 --innodb_undo_logs=35
WL: --innodb_undo_log_truncate=1 --innodb_undo_tablespaces=2
--innodb_undo_logs=35 --innodb_max_undo_log_size=10M
--innodb_purge_rseg_truncate_frequency=1

Baseline WL

thrds

1 314 344

2 580 589

4 1286 1213

8 2507 2379

16 4725 4607

32 8080 8046

64 7076 6331 10%

128 5625 2105 62%

---------------

Baseline: --innodb_undo_tablespaces=8 --innodb_undo_logs=128
WL: --innodb_undo_log_truncate=1 --innodb_undo_tablespaces=8
--innodb_undo_logs=128 --innodb_max_undo_log_size=10M
--innodb_purge_rseg_truncate_frequency=1

thrds Baseline WL %regression

128 6919 6831 1.28%

----------------

Effect of using innodb_purge_rseg_truncate_frequency 1 or 128

WL

innodb_purge_rseg_truncate_frequency = 1 .... TPS: 7075

innodb_purge_rseg_truncate_frequency = 128 (default and constant on Baseline)
 .... TPS: 7178

So if we reduce the frequency of freeing undo-pages then we see increase in tps
(with all other setting remaining unchanged).

With value 1 of 1 rollback segments are pages are freed on every invocation of
purge. Check the significance of this variable for more details.

------------------

Effect of undo-tablespace

Baseline: --innodb_undo_tablespaces=128 --innodb_undo_logs=128
WL: innodb_undo_log_truncate=1 --innodb_undo_tablespaces=128
--innodb_undo_logs=128 --innodb_max_undo_log_size=10M
--innodb_purge_rseg_truncate_frequency=1

Note: 128 undo-tablespace
Baseline (tps): 6918
WL (tps): 6984

======================================================================

Let's take a case to understand the issue above.

1. When we ran workload with 2 undo-tablespace and 35 undo-logs (which means 2
redo logs as 32 are reserved for non-redo) performance gap was higher.

2. When we ran workload with 8 undo-tablespace and 128 undo-logs (which emans 95
redo logs) performance gap reduced.

Why ? 

a. Amount of generated undo information continue to remain same in both the
cases mentioned above. With threshold of 10M it tend to fillout 2
undo-tablespaces quickly and that means at any given time one of the tablespace
has size > 10M causing truncate to be active.

b. If 1 undo tablespace (out of 2) goes down it is like lossing 50% of your
resources vs if 1 undo-tablespace goes down (out of 8) it is like loosing 12.5%
of your resources. 

c. We also observed that if threshold is increased from 10 -> 100M then
frequency of truncate reduces and so gap difference is reduced significantly.

d. innodb_purge_rseg_truncate_frequency plays a key role in deciding when to
clearing up pages in undo-tablespace. For baseline this value is constant to
128. If set to 1 in WL it means for every instance of purge invocation it will
try to free rollback segment. (Note: Freeing of rollback segment is different
from truncate of rollback segment. Free will make those pages re-usable.)

========================================

Based on our existing testing we observed following optimal value.
(note: optimal value defintion may change based on other factor and machine.)

- undo-tablespace >= 8 (more the better)

- undo-logs = 128

- innodb_purge_rseg_truncate_frequency = default

- innodb_max_undo_log_size >= 1G (more the better)

Let's understand the low level changes by describing the change list
depicted above.

1. Handler to track truncate of undo tablespace:
------------------------------------------------

/** Track UNDO tablespace mark for truncate. */
class undo_trunc_t {

......

public:
        undo_trunc_logger_t     undo_logger;

private:
        /** UNDO tablespace is mark for truncate. */
        ulint                   m_undo_for_trunc;

        /** rseg that resides in UNDO tablespace marked for truncate. */
        rseg_for_trunc_t        m_rseg_for_trunc;

        /** Start scanning for UNDO tablespace from this space_id.
        This is to avoid bias selection of one tablespace always. */
        ulint                   m_scan_start;
};

Reference to this class will be stored in purge handler and it would be
used for tracking the tablespace that is marked for truncate along with
related rsegs.

2. Handler to track ddl log needed to safeguard truncate action from crash:
---------------------------------------------------------------------------

/** UNDO log truncate logger. Needed to track state of truncate
during crash. A DDL_LOG kind of file will be created that will be removed
on successful truncate but if server crashes before successful truncate
this file will be used to initiate fix-up action during server start. */

class undo_trunc_logger_t {

......

public:
        /** Magic Number to indicate truncate action is complete. */
        const static ib_uint32_t        s_magic;

        /** Truncate Log file Prefix. */
        const static char*              s_log_prefix;

        /** Truncate Log file Extension. */
        const static char*              s_log_ext;

private:
        char*                           m_log_file_name;
};


3. Selecting UNOD tablespace for truncate:
------------------------------------------

/** Iterate over all the UNDO tablespaces and check if any of the UNDO
tablespace qualifies for TRUNCATE (size > threshold).
@param[in,out]  undo_trunc      undo truncate tracker */
static
void
trx_purge_mark_undo_for_truncate(
        undo_trunc_t*   undo_trunc);

Important validation checks done by the API

        /* Validation/Qualification checks
        a. At-least 2 UNDO tablespaces so even if one UNDO tablespace
           is being truncated server can continue to operate.
        b. At-least 2 UNDO redo rseg/undo logs (besides the default rseg-0)
        b. At-least 1 UNDO tablespace size > threshold. */

Besides this API will mark the rsegs residing the selected tablespace as inactive.


4. Making rseg inactive:
------------------------

If UNDO tablespace is being truncated then rsegs residing in that tablespace 
needs to be marked too so that we can avoid allocation of them to new booting
transaction. skip_allocation helps in tracking this.
pages_marked_freed is use to cache the count of undo segment that can be freed
but we delay the free action as tablespace is being truncated anyway.

struct trx_rseg_t {

.....

        /** if true then skip allocating this rseg as the tablespace
        it resides in is marked for truncate. */
        bool                            skip_allocation;

        /** track the segment that can be freed. If UNDO tablespace
        where the segment resides is marked for truncate then we delay
        free operation and instead let the truncate do the actual free. */
        ulint                           pages_marked_freed;
};


5. Actual truncate action:
--------------------------

/** Iterate over selected UNDO tablespace and check if all the rsegs
that resides in the tablespace are free.
@param[in]      limit           truncate_limit
@param[in,out]  undo_trunc      undo truncate tracker */
static
void
trx_purge_initiate_truncate(
        purge_iter_t*   limit,
        undo_trunc_t*   undo_trunc);


   - API will check if all the rsegs are free based on their size.

   - rseg->curr_size count is adjusted to consider pages_marked_freed.

                ulint   size_of_rsegs =
                        rseg->curr_size - rseg->pages_marked_freed;

   - Also, if there are cached undo segment then ensure that records within
     these segment qualifies for purge. This can done by comparing their trx_no
     with purge view limit.

                        for (undo = UT_LIST_GET_FIRST(rseg->update_undo_cached);
                             undo != NULL && all_free;
                             undo = UT_LIST_GET_NEXT(undo_list, undo)) {

                                if (limit->trx_no < undo->trx_id) {
                                        // segment might be needed for MVCC.
                                }
                        }
    Same for other cached segment.


   - Write DDL log using undo_logger.
   - Truncate the tablespace. This would include physical file truncate and 
     adjusting the space handler (fil_space_t) for the size. Note: space_id is 
     not changing.
   - Write the header and Re-initialize rseg. skip_allocation will be reset to 
     false and so post truncate immediately rseg can be allocated to new 
     booting transaction.

/** Truncate UNDO tablespace, reinitialize header and rseg. 
@param[in]      undo_trunc      UNDO tablespace handler
@return true if success else false. */

bool
trx_undo_truncate_tablespace(
        undo_trunc_t*   undo_trunc);

   - Remove the DDL log.

6. Fixup action during crash:
-----------------------------

   - During server start srv_undo_tablespaces_init() api will open all the
     available undo tablespace.

   - Before opening any undo tablespace check if any undo tablespace needs
     fix-up. This can be determined by checking for presence of ddl log file.

     if (undo_trunc.needs_fix_up(space_id)) {

		...

		srv_undo_tablespace_create
		fsp_header_init
     }

     needs_fix_up will check for presence of log file and if log file is present
     then for presence of magic-number. If magic-number if present then it 
     suggest that truncate action was complete but unlink of file was not so 
     simply remove the file and report that file didn't exist.

     If file exist and magic-number if not present it suggest that truncate 
     action was not complete. So complete the truncate action.
 
     Instead of truncating the file we can simply delete and re-create the file
     as file is yet to open.