WL#3085: Cluster should support OPTIMIZE

Status: Code-Review   —   Priority: Medium

OPTIMIZE TABLE should be used if you have deleted a large part of a table or if
you have made many changes to a table with variable-length rows (tables that
have VARCHAR, BLOB, or TEXT columns). Deleted records are maintained in a linked
list and subsequent INSERT operations reuse old record positions. You can use
OPTIMIZE TABLE to reclaim the unused space and to defragment the data file.


This may not have been needed for Memory tables and may be why it is not
supported for NDBCluster, but will be needed for Disk Data and should be looked at.
1. What's the benefit from "optimize table" for ndb engine?
To reclaim the mem hole from one table and free some pages, then other
tables can use the free pages.

2. Why?
Currently, in ndb kernel, memory is allocated per table, that means
memory is NOT shared among tables.   
Once allocated to a table it is never returned to the main memory pool.

e.g., 
If there are a large amount of rows in a table A, and after a lot of rows
deletion, or update a lot of varsize columns from large size to small size,
there will be empty slots in mem pages. Even these empty slots
will be re-used again by inserting new rows into talbe A later, but the 
problem is if there are few rows insertion then. So that means the mem 
pages are occupied by table A would not shared and used by other tables 
even there are a lot of empty holes in table A.

There are two separate problems to solve, fixed-size pages and variable-size
pages (the latter currently not for disk tables).

3. Limitation
   Now only support to optimize memory table.

4. How to reclaim "mem hole" in fixed-size pages for a table?
1) move rows into hole, then dispersed "mem hole" would re-organized to 
free pages. 
  (this is implemented by me through NDBAPI)
  a) Scan the table forward in TUP order, inspecting rowids to look for holes. A
     hole is detected when a returned rowid is different from one row past the
     rowid returned last. Note that this needs to be kept track of on a
     per-fragment basis, as we scan fragments in parallel.
  b) Simultaneously, scan the table backwards, looking for rows to move.
  c) For each hole found in a), move a row from b) into the table until there are
     no "mem hole".
  d) If the move in c) fails, just try again with the next hole. We cannot lock
     a hole, so no way to prevent another transaction from filling the hole
     before us.

2) flow to move fix part in kernel
  a) how to allocate a new tuple key ("logical page id" and page index) to store
fix part on primary fragment and other fragments in one node group ?
   Since tuple key (logical page id and page index) are same on all fragments in
one node group, so we need do this by two-phase-commit protocol.
   And also need to maintain the map between logical page id and physical page
id respectively on primary fragments and other fragments.
  b) copy old fixpart to new fixpart
  c) update new tuple key in dbacc on all fragments in one node group.
  d) delete old fixpart entry in fix page

5. How to reclaim "mem hole" in variable size table ?
1) The variable part of a row is stored independently of the fixed part, and
there is no way to scan in "varpart order", so we cannot scan for holes.
2) Instead, implement a flag for TCKEYREQ update request, which will make kernel
   move the variable part of the row to another page (normally the kernel only
does this if the update causes it to no longer fit in current page).
3) NDB API part of optimize table sends this new flag in an update of _every_
   row in the table.


6. free empty pages 
  a) for both fix pages and var pages, if there are no entries in the page, then
 we need to reclaim the page to global page pools.
  b) Implemented in kernel (and not restricted to optimize table).

7. steps of optimizing a table inside.
  When we do optimizing one table, actually, we do:
  a) optimize the fix pages and var size pages of the original table;
  b) optimize all unique index tables of the original table;
     Need to optimize fix pages, and whether need to optimize var pages is
decided by whether the unique key has var sized columns and primary key of
original table has var sized columns
  c) optimize all BLOB tables of the original tables;
     Need to optimize fix pages and var pages because BLOB table has both fix
sized and var sized columns.
1. Basic hierarchy:
1) mysqld server level: 
   to write a ndb handler function ha_ndbcluster::ndb_optimize_talbe(),
2) NDBAPI level:
   to extend some NDBAPI methods to fill up holes in mem page.
3) NDB kernel level:
   scan page by page to find mem holes;
   scan table backwards;
   move fixed part of a tuple to the ROWID of holes;
   move var part of a tuple to where it should be;
   to reclaim free pages of the table;

2. related issues should be thought:
1) how to look up mem holes in NDBD?
   a) get page no of the free fix page(free here means at least 1 free record
inside the page)
   get from free pages list for fix pages(the pages have at least 1 free record)
   b) get page index inside the fix page we found
   get free fix tuple from next free index in the fix page header
2) how to scan table backwards(in reversed tup order) ?
   Now, a normal scan tuple from 0 to maximal logical page no.;
   Then a descending tuple scan should be from maximal logical page no. to 0;
3) how to do the optimization online ?
   With scanning full table, readTuples() method support a LM_Exclusive which
use ACC_SCAN to lock the tuple;
   During update OPTIMIZE pseudo column, we can lock the tuple with updating;
   And the updating operation is transaction-based;
4) how to get the tradeoff between availability and performance ?
   Currently, Martin already improved cursor based optimize table NDBAPI.
   a) On MySQL server side, we can use a parameter to slow down. 
   b) can suspend, continue or abort the whole optimizing processing and which
is agile.

5) need a off-line optimize table tool for a big table and speed up ? 
   
6) how to reclaim free pages after holes are filled up in kernel? 
   Actually, Jonas has implemented a patch which will release the free page to
global pages pool; Of course, this worklog is based on Jonas' page release patch.
   Basically release page patch is to check whether page is empty after the free
size is equal to the page size;
   And other tricky things are lated to redo log and logical-real page map
rebuilding after NR/SR;

7) error analyze and processing(e.g when execute() or others)
   Need to check transaction temporary error and to do some retries. 

8) ndbapi test case
  Martin wrote a test case to verify the effect of optimizing table for varpart
tuple moving. 
  Basically, we create a table with var size columns and populate the table,
then do random deletions; 
  Next do the optimizing table, and compare the memory utilization before and
after the optimization;
  We finally evaluation the optimizing effect by check the percent of reclaiming
because we are not sure about the exact memory bytes are optimized.   

9) ...


3. detailed flow
Discussed with Tomas, on ndbapi side, optimize_table() would step up with 
two versions, one focus on moving varpart of row to varpart page hole, the
other focus on moving fixpart of row to the hole of a new ROWID.

4. version #1
Discussed with Martin and Kristian.
There will be a single scan using the existing "tup order" and readTuples() 
with exclusive locks, then inform kernel to move varpart on each tuple.
1) add a pseudo column "OPTIMIZE"
2) add new NDBAPI functions optimizeTable...(), e.g.:
     /**
     * Optimize table given defined table object
     * @param t Object of table to optimize
     * @param delay in milliseconds for each batch of rows 
     * @return 0 if successful otherwise -1.
     */
    int optimizeTable(const Table &t, Uint32 delay = 0);

    /**
     * Optimize table given table name
     * @param name Name of table to optimize
     * @param delay in milliseconds for each batch of rows 
     * @return 0 if successful otherwise -1.
     */
    int optimizeTable(const char * name, Uint32 delay = 0);

    /**
     * Optimize table given defined table object
     * @param t Object of global table to optimize
     * @param delay in milliseconds for each batch of rows 
     * @return 0 if successful otherwise -1.
     */
    int optimizeTableGlobal(const Table &t, Uint32 delay = 0);

    /**
     * Optimize table given table name
     * @param name Name of global table to optimize
     * @param delay in milliseconds for each batch of rows 
     * @return 0 if successful otherwise -1.
     */
    int optimizeTableGlobal(const char * name, Uint32 delay = 0);

    /**
     * Optimize index given defined index object
     * @param ind Object of index to optimize
     * @param delay in milliseconds for each batch of rows 
     * @return 0 if successful otherwise -1.
     */
    int optimizeIndex(const Index &ind, Uint32 delay = 0);

    /**
     * Optimize index given table name and index name
     * @param idx_name Internal name of index
     * @param tab_name Name of table
     * @param delay in milliseconds for each batch of rows 
     * @return 0 if successful otherwise -1.
     */
    int optimizeIndex(const char * idx_name, 
                      const char * tab_name,
                      Uint32 delay = 0);

    /**
     * Optimize index given defined global index object
     * @param ind Object of index to optimize
     * @param delay in milliseconds for each batch of rows 
     * @return 0 if successful otherwise -1.
     */
    int optimizeIndexGlobal(const Index &ind, Uint32 delay = 0);

    /**
     * Optimize index given table name and index name
     * @param idx_name Internal name of global index
     * @param tab_name Name of table
     * @param delay in milliseconds for each batch of rows 
     * @return 0 if successful otherwise -1.
     */
    int optimizeIndexGlobal(const char * idx_name,
                            const char * tab_name, 
                            Uint32 delay = 0);

3) flow of __optimizeTable():
  a)do a full table scan with readTuples(NdbOperation::LM_Exclusive)) 
  b)in scan loop, update each row's pseudo column "OPTIMIZE" with optimize options
  c)sleep a configured parameter to slow down the scan speed, then decrease the
overload of optimizing.

4) in dbtup block: 
  a)check whether "OPTIMIZE" column is updated
  b)if yes, check what optimize options are set; 
  c)if varpart option, then move the tuple to fit position
  d)how to get the fit position is that we basically move tuple from
large-free-size var page lists into small-free-size var page lists.
  e)after move the tuple, we check the page whether it's empty, if so, we
reclaim the page to global page pool, then other table can reuse them.
5) in mysql server:
  add an handler function to meet SQL "optimize table", the handler function
will invoke the new NDBAPI optimizeTable().  

5. version #2
  It's to move fixpart of tuple, is global operation because primary fragment
and backup fragments need has same ROWID of a tuple and the moving should be
transaction based operation.
  We will introduce a new moveTuple() operation alike updateTuple(), related
to NDBAPI/TC/LQH/TUP blocks.
  Then in moveTuple(), we can moving the fixpart tuple. Basically, moving
fixpart do NOT change the original data of the tuple, just change the tuple
location in the fix pages.
  Below are basic steps to implement version #2
    a) add a new operation moveTuple(), both ndbapi and kernel will be involved.
       --dbtup block (Jonas)
       --ndbapi, dbtc, dblqh  (Justin, to do)
    b) moving fixpart in DBTUP (Jonas)
    c) update ACC key with new tuple location (Jonas)
    d) update ordered index with new tuple location (Jonas)  
    e) introduce a descending tuple scan, 
       both ndbapi and dbtupScan (Justin, done)
    f) in NDBAPI to find the ROWID of memory hole in NDBD, 
       and using two-scan to move varpart and fixpart (Justin, done)

In version 2 there will be _two_ scans, one forward and one backwards, forward
scan can will look up the ROWID of mem holes, backward scan will move the tuple
to the mem hole with the ROWID we found.
1) in foward scan, to get ROWID of the memory holes one by one; meanwhile,
optimize varpart.
2) in backward scan, pass a parameter to moveTuple(Local_key
destination_rowid), then kernel can know where to move the fix part of the
tuple.