WL#2354: Online backup: NDB native driver

Affects: Server-6.x   —   Status: Un-Assigned

In WL#1613 Online Backup (all storage engines), a generic backup
interface will need to use the ndbcluster backup.


--

Some ideas from Stewart (copied from an email):

Okay, we should really decide what we want to do regarding NDB in:
- the initial implementation
- the final, complete implementation

Especially considering that some work may have to be done in NDB
itself.

I don't think that anybody is really going to mind (at least
initially) if NDB is not consistent with other storage engines, just
that it's consistent within itself (which it is).

However, we should be able to make NDB consistent with the others, at
least in the future.

The lock on the binlog is so that transactions don't complete, right?
As in, you expect running transactions to wait on that mutex before
being committed? This isn't true for NDB, as committed rows come back
from the storage nodes via the event API asynchronously.

We can, however, use the cluster GCP (global checkpoint) as a
synchronisation point, as this is the point that the NDB online backup
synchronises to, the point we restore to after a system failure, and a
single, identifiable transaction in the binlog (at least currently)[1].

[1] In the future, to support replication with higher transaction loads
for cluster, we may end up having to have several masters and several
slaves replicating different parts of the db. How we integrate this
with backup will be fun. We may have to push the "are we all at the
same point in binlogging" question down to the engine so it can decide
with the other masters.
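
To make the GCP synchronisation point above concrete, here is a
minimal sketch of waiting for a target GCP via the NDB event API. It
assumes an Ndb object whose event subscription has already been
created and executed; wait_for_gcp and target_gci are illustrative
names, not existing code:

  #include <NdbApi.hpp>

  // Committed changes arrive asynchronously from the data nodes, each
  // tagged with the GCI (global checkpoint index) of the GCP it was
  // committed in.  Once pollEvents() reports a latest GCI at or past
  // the target, everything up to that GCP has been delivered.
  bool wait_for_gcp(Ndb* ndb, Uint64 target_gci)
  {
    Uint64 latest_gci = 0;
    while (latest_gci < target_gci)
    {
      // Block up to 1000 ms for new event data; latest_gci is set to
      // the newest completely received global checkpoint.
      if (ndb->pollEvents(1000, &latest_gci) < 0)
        return false;                 // event buffer error
      while (NdbEventOperation* op = ndb->nextEvent())
        (void)op->getGCI();           // row handling elided
    }
    return true;                      // synchronised up to target_gci
  }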

Does mysqlbackup just start transactions and dump? Or does it lock and
copy table files? The former will work for something like federated;
the latter won't. Although, should federated tables be backed up
locally at all? I'd argue it should be an option... but everybody is
welcome to say I'm being stupid (diplomatically).

DDL is also a storage engine issue - e.g. with cluster (and probably
federated). We lock out DICT operations during backup, but it's
possible (I'd have to think about it more) that we need to hold that
lock for longer. If so, we need some NDB code. Shouldn't be too hard
though... (as long as you're willing to lose that lock on a node
failure).

Also, NDB can abort the backup due to a node failure, so it's possible
that the backup needs to be restarted for NDB. I'm guessing other
engines could have this too? (aborting a backup under extreme load).
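
As a sketch of restarting on abort: the MGM API call
ndb_mgm_start_backup() can simply be retried when the backup fails;
the three-attempt policy here is illustrative, not a proposal:

  #include <mgmapi.h>

  // Start an NDB native backup, retrying if it is aborted (e.g. by a
  // node failure).  wait_completed = 2 means block until the backup
  // has completed (or failed) on the data nodes.
  int start_backup_with_retry(NdbMgmHandle h, unsigned int* backup_id)
  {
    struct ndb_mgm_reply reply;
    for (int attempt = 0; attempt < 3; attempt++)
    {
      if (ndb_mgm_start_backup(h, 2, backup_id, &reply) == 0)
        return 0;                // backup completed, id in *backup_id
      // aborted or failed to start; try again
    }
    return -1;                   // give up after three attempts
  }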

An interesting (and good) feature of the current way NDB does backup
is that the backups are written to the local disk of the nodes. This
means we don't impose interfering network traffic and collectively
have more disk bandwidth.

We possibly want a way to configure which network interface (or just
address) we should use to get backup data from the cluster. For NDB,
arguably the best way is to back up to local disk FIRST and then pull
the contents over the wire into the MySQL Streaming Backup. This way
we get the cluster backup done at the maximum rate (smaller logs,
shorter replay time, faster restore) and can then control how we
interfere with network traffic (at a well-defined rate, or over a
separate link).
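
The pull phase could then be a rate-limited copy loop. This is purely
illustrative: send_to_stream() stands in for whatever writer the
generic backup interface ends up exposing, and max_bytes_per_sec is
the "well-defined rate" above:

  #include <cstdio>
  #include <unistd.h>

  // Stream one completed backup file from a node's local disk into
  // the backup stream, throttled to max_bytes_per_sec.
  void pull_backup_file(FILE* f, long max_bytes_per_sec,
                        void (*send_to_stream)(const char*, size_t))
  {
    char buf[65536];
    size_t n;
    while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
    {
      send_to_stream(buf, n);
      // crude throttle: sleep off this chunk's share of the
      // per-second budget
      usleep((useconds_t)(1000000.0 * n / (double)max_bytes_per_sec));
    }
  }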

Either way requires some NDB code, but nothing too invasive. Although
writing the backup directly to the MySQL server, rather than to local
disk on all nodes, could mean the backup never completes on any
reasonably loaded cluster[2].

There is almost no doubt that people WILL (want to) USE this generic
backup interface just to get the cluster backup in one place on disk.

[2] Replication currently doesn't work on these either.

It could be a good idea to put global metadata next to the server-id.
In theory, then, in the future, we could back up the (possibly
different) metadata for all MySQL servers connected to the cluster. By
running restore on the different server-ids, each would restore its
own set of metadata (and any local tables) but the cluster tables only
once. I imagine the implementation of this would use the backup
streaming stuff to talk to the other MySQL servers in the cluster.
(Something for the future, though.)

Personally, I think per-table checksums could be useful too. No doubt
we'll get somebody with a partially corrupted backup who just wants to
get *something* out of it. Maybe we can then look like miracle
workers :)
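
A sketch of the checksum idea using zlib's crc32() (the helpers and
where they hook into the backup stream are hypothetical): accumulate a
CRC per table as row images are written, store the final value in the
table's backup metadata, and on restore salvage any table whose
checksum still matches:

  #include <zlib.h>
  #include <cstddef>

  // Start value for a fresh per-table checksum.
  unsigned long table_checksum_init()
  {
    return crc32(0L, Z_NULL, 0);
  }

  // Fold one row image into the running checksum as it is written to
  // the backup stream.
  unsigned long table_checksum_update(unsigned long crc,
                                      const unsigned char* row,
                                      size_t len)
  {
    return crc32(crc, row, (uInt)len);
  }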

On parallelism: NDB can run in parallel even with a single thread, as
all the data nodes can be doing work in the background and blocks can
just be received whenever. So it's possible to interleave this with
other engines.
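
Structurally, that interleaving could look like the following sketch
(backup_other_engines_step() is a hypothetical placeholder for driving
one unit of the other engines' backup streams):

  #include <mgmapi.h>

  bool backup_other_engines_step();  // hypothetical, defined elsewhere

  // Kick off the NDB backup without waiting for completion; the data
  // nodes then work in the background while this single thread drives
  // the other engines.
  void interleaved_backup(NdbMgmHandle h)
  {
    struct ndb_mgm_reply reply;
    unsigned int backup_id;
    // wait_completed = 1: return as soon as the backup has started
    if (ndb_mgm_start_backup(h, 1, &backup_id, &reply) != 0)
      return;                      // failed to start
    while (backup_other_engines_step())
      ;                            // NDB keeps writing asynchronously
  }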

In regard to log backup: treating it as an engine sounds good. Seeing
as logs-in-CSV-tables is around... could be neat.