WL#4044: Online backup: Large data test

Status: Complete

BACKUP BOOT CAMP: PRIORITY = P1

BLOCK: Was partially blocked by BUG#36571, a duplicate of BUG#36624 (now fixed)

RATIONALE
---------
Test of online backup with a large database.

DESCRIPTION
-----------
Perhaps something like this (can be modified):

1. Create a data generator, perhaps 10GB of data in 30,000 tables.
2. Do a backup
3. Compute some checksum of the data
4. Shut down the system
5. Clear the database
6. Start the system
7. Restore the data
8. Verify that the checksum is correct
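
A minimal sketch of steps 2, 3, 7 and 8, using the BACKUP/RESTORE statements of
the 6.0 online backup feature together with CHECKSUM TABLE (database, table and
file names are placeholders):

    -- step 2: take the online backup
    BACKUP DATABASE db1 TO '/data0/db1.backup';

    -- step 3: reference checksums of the original data
    CHECKSUM TABLE db1.t1, db1.t2;

    -- steps 7/8: after the datadir has been cleared, restore and re-check
    RESTORE FROM '/data0/db1.backup';
    CHECKSUM TABLE db1.t1, db1.t2;
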
NOTE: as much of this worklog overlaps with WL#4406, the scope of this worklog is
      reduced to touch on 'basic' scenarios that will on one side prevent overlap
      with WL#4406 and on the other will provide a 'faster' turnaround. As such,
      for this 'reduced' phase of implementing this task, the flightstat database
      will be used with the following permutations:

 - Database loaded into the different engines (MyISAM, InnoDB, Falcon)
 - Backup/Restore of idle system
 - Backup while activity is performed on the database (concurrency might be
   limited in the case of MyISAM, as getting a read lock has a lower priority;
   see [30 Apr 16:08] Chuck Bell in BUG#44568)

Phase 1
-------
- Tests will be performed on the 'Flight Stat' database (10 tables, total of
  40M rows). The 'Flight Stat' database (http://flightstats.us/) is a collection
  of flight information collected by the FAA over a number of years until 2005.
- The schema will be augmented with triggers/SPs in order to increase the
  concurrent load on the system while backups are being taken (the focus is not
  to try and scale on the number of objects - this will be done as part of
  WL#4406)
- The 'ontime' table (37M rows) will be partitioned (24 partitions), as
  sketched below
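
  A sketch of such a partitioning (the worklog does not specify the partitioning
  function; hash partitioning on the month of an assumed flight-date column is
  shown only as an illustration):

    -- 'flight_date' is an assumed column name, not taken from the schema
    ALTER TABLE ontime
      PARTITION BY HASH (MONTH(flight_date))
      PARTITIONS 24;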

- The database schema will be created using the MyISAM, InnoDB and Falcon engines

- Backups will be taken when the system is idle and when there is activity 
  against the server

- Activity will be generated using the mysql-stress-test framework and will
  include up to 100 concurrent users performing random insert/update/delete
  scenarios using direct SQL and stored procedures.
  * Insert:
    A random row will be selected from 'ontime_1week' and inserted into the
    'ontime_all' table using a stored procedure
  * Update:
    A random row will be selected from 'ontime_all' and its distance value
    will be incremented by one using direct SQL.
  * Delete:
    A random row will be deleted from the 'ontime_all' table
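
  A minimal sketch of the insert and update scenarios above ('id' as a key
  column and the ORDER BY RAND() row selection are assumptions made only for
  illustration):

    -- insert scenario (stored procedure): copy a random row from
    -- 'ontime_1week' into 'ontime_all'
    CREATE PROCEDURE insert_random_row()
      INSERT INTO ontime_all
        SELECT * FROM ontime_1week ORDER BY RAND() LIMIT 1;

    -- update scenario (direct SQL): pick a random row and increment
    -- its distance value by one
    SET @rid = (SELECT id FROM ontime_all ORDER BY RAND() LIMIT 1);
    UPDATE ontime_all SET distance = distance + 1 WHERE id = @rid;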

- Additional insert activity will be generated using triggers on the 'ontime_all'
  table that will insert rows into a log table following each insert/update/delete
  operation on 'ontime_all'.
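
  A sketch of such a trigger (the log table layout is an assumption; only the
  mechanism matters here):

    CREATE TABLE ontime_log (
      logged_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
      action    VARCHAR(10)
    );

    CREATE TRIGGER ontime_all_ai AFTER INSERT ON ontime_all FOR EACH ROW
      INSERT INTO ontime_log (action) VALUES ('insert');

  Analogous AFTER UPDATE and AFTER DELETE triggers add further write load while
  the backup runs.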

- Backup integrity will be checked by restoring the backup snapshot on a
  separate server and comparing it to a slave replica that has replicated from
  the master up to the binlog file/position of when the backup was taken:
  - System running with binlogging and IUDS activity
  - Backup initiated (binlog file and position recorded)
  - Slave replicating from the master up to the file and position of the backup
  - Backup restored on a new server
  - Restored database compared to the slave database
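
  A sketch of how the slave is stopped at the backup coordinates (the file name
  and position shown are placeholder values):

    -- on the master, when the backup is initiated
    SHOW MASTER STATUS;   -- e.g. File: master-bin.000007, Position: 4211

    -- on the slave, replicate only up to those coordinates
    START SLAVE UNTIL
      MASTER_LOG_FILE = 'master-bin.000007',
      MASTER_LOG_POS  = 4211;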

Overview of test:
1) Start a server
2) Load the base FlightStat database
3) Make a copy of the database to be used for the slave
4) Start client load activity
5) After a short wait, start a backup
6) Start replication to a slave, stopping at the file/position of the backup
7) Restore the backup on a new system
8) Compare the content of the 'Flight Stat' database in the slave replica and
   the restored backup (although the HLD mentioned a checksum comparison,
   comparing dumps of both databases is a valid comparison as well).


As running these tests requires a large amount of disk space and is currently
limited to a specific machine (ml-370.mysql.com), a lower priority is given to
writing the tests in a portable way.


Phase #2
--------
(This phase is written after a set of email discussions with development (Rafal)
following feedback/questions from tests done in phase 1)

 - Tests will be performed on an existing copy of the TPC-H
   database (160G MyISAM / 360G InnoDB)
 - Some of the table structures will be altered / duplicated to allow for a
   combination of both MyISAM and InnoDB tables in a single database (see the
   sketch after this list)
 - Backup / Restore / verification of the restore will be performed on an
   'idle' database.
 - Work on running backup under activity will continue under WL#4922
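
 A sketch of how a MyISAM copy of an existing InnoDB TPC-H table could be
 created ('lineitem' is only an example table):

    CREATE TABLE lineitem_myisam LIKE lineitem;
    ALTER TABLE lineitem_myisam ENGINE = MyISAM;
    INSERT INTO lineitem_myisam SELECT * FROM lineitem;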

Overview of test:
1) One time setup
1.1) Set up a vardir directory with an existing InnoDB TPC-H database
1.2) Start the server
1.3) Create 'myisam' tables from some of the existing tables
1.4) Back up the vardir for future use
2) Running
2.1) Run a backup
2.2) Start a second server
2.3) Restore the backup file to the second server
2.4) Compare the content of the original database backed up with the restored
     one (although the HLD mentioned a checksum comparison, comparing
     dumps of both databases is a valid comparison as well)
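
A sketch of the comparison in step 2.4, using per-table checksums on both
servers (database and table names are examples):

    -- on the original server, after the backup has been taken
    CHECKSUM TABLE tpch.lineitem, tpch.orders EXTENDED;

    -- on the second server, after the restore
    CHECKSUM TABLE tpch.lineitem, tpch.orders EXTENDED;

    -- the comparison passes when the corresponding checksums match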

Note: It is not clear how long the restore of the TPC-H database will take.
      Based on the initial backup of the TPC-H database (47 min) and the
      backup/restore times of the flight-stat database in phase #1
      (4 min / 9.5 hours, i.e. a restore roughly 140 times slower than the
      backup), the restore is estimated to take around five days (~114 hours).


Implementation details
-----------------------

IMPORTANT NOTE: Only phase #2 (backup/restore on an idle database) has been
implemented. 

The script has been pushed into mysql-test-extra tree:
mysql-test-extra/mysql-test/test_tools/scripts/backup/backup_restore.pl

Usage: ./backup_restore.pl
                        --datadir=
                        --backup-file=
                        --vardir=
                        --mysql-basedir=
                        --mtr-build-thread=
                        --data-volume=[K|B|G]


        datadir:          (MANDATORY) location of the datadir for the 1st server
                                      (where the database is stored locally)
        backup-file:      (MANDATORY) target location for the backup file -- if
                                      the file already exists, it will be deleted
        vardir:           (MANDATORY) location of the datadir for the 2nd server
                                      (which is empty at startup, but grows as
                                      the restore is being performed). If the
                                      folder exists, all contents will be removed!
        mysql-basedir:    (MANDATORY) location of the MySQL binaries directory
                                      (containing mysql-test, bin, etc.)
        mtr-build-thread: (OPTIONAL)  MTR_BUILD_THREAD value defining the port
                                      number for servers, default 300 (port 13000)
        data-volume:      (OPTIONAL)  approximate volume of the datadir; it is
                                      used to tune innodb* parameters; if not set,
                                      some medium values will be used, which
                                      might not be optimal
        help:             print usage and exit


How to invoke:
$> perl ./backup_restore.pl --datadir=/data1/backup_data \
--backup-file=/data0/20090923.backup --vardir=/data0/var \
--mysql-basedir=/data0/mysql-6.0 \
--mtr-build-thread=10 --data-volume=500G

Please make sure that you have enough space for the backup file and server vardir.

Since creating a database of the desired size takes too long to be done
each time, some manual work is required for running the test.

The database should be pre-created and stored locally.
The script takes the database location as a parameter, starts a server
with the defined datadir, runs BACKUP, stores table checksums,
stops the server, starts a server with an empty datadir, runs RESTORE,
compares new checksums with the stored ones, and stops the server.

The test passes if BACKUP and RESTORE finished successfully, and
all corresponding checksums are equal.

Note: while starting servers, the script attempts to adjust some server
parameters to optimize restore performance. If the --data-volume option
is set incorrectly or is missing, the chosen values might be totally wrong.