
Auto recovery in case of locally-missing remote snapshot, other improvements #13

Open: wants to merge 3 commits into master
Conversation

mvitale1989

This solves a few issues and adds a bit more automation to remote-local snapshot matching:

  • Implemented auto recovery in case of locally-missing remote snapshot
  • `zfs hold` the local snapshot before the send operation, and `zfs release` it on program exit, to ensure its existence and retention.
  • Improved consistency: if the target machine is localhost, ssh is not used at all (`ssh -n` is no longer necessary)
  • Improved compatibility: use sh instead of ksh, fall back to sudo if pfexec is not found, and the zfs command location is now auto-discovered
  • Removed redundant code: now $RECENT is used programmatically, instead of branching
  • Documented configurable parameters in config file, and added reminder to set properties on datasets

The auto-recovery feature only changes how the base of the incremental send is chosen: if the latest remote snapshot is missing on the local machine, instead of going straight into maintenance mode, zfs-backup tries to find an older remote snapshot that does exist locally, and uses that as the `zfs send -I` base instead. Up to $MAXAGE-1 snapshots older than the latest remote are checked; you can configure MAXAGE in the config file.
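The search described above can be sketched as follows. This is a minimal illustration, not the script's actual code: the helper name and the space-separated snapshot lists are hypothetical, and in the real script they would be derived from `zfs list -t snapshot` output.

```shell
#!/bin/sh
# $1: remote snapshot names, newest first
# $2: local snapshot names
# $3: MAXAGE (0 = check all remote snapshots)
find_incremental_base() {
    remote_snaps=$1 local_snaps=$2 maxage=$3
    i=0
    for snap in $remote_snaps; do
        i=$((i + 1))
        # stop after $maxage candidates, unless maxage is 0 (unlimited)
        [ "$maxage" -ne 0 ] && [ "$i" -gt "$maxage" ] && break
        for l in $local_snaps; do
            # the newest remote snapshot that also exists locally wins
            [ "$l" = "$snap" ] && { printf '%s\n' "$snap"; return 0; }
        done
    done
    return 1  # no common snapshot found: fall back to maintenance mode
}
```

With MAXAGE=2 and locals missing the newest remote, the second-newest remote is returned as the incremental base; if no common snapshot is found within the window, the function fails and the caller would enter maintenance mode as before.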

- Improved consistency: if target machine is "localhost", ssh is not used at all (solved "ssh -n" issue)
- Added description of parameters expected in config file, and a reminder to set properties on datasets
- `zfs` command location is now auto-discovered
- Added sudo fallback, if your machine does not have pfexec
- Added an automatic recovery option, in case the latest remote snapshot does not exist locally (added MAXAGE parameter).
  If enabled, up to $MAXAGE-1 older remote snapshots are searched among the local ones, to act as the new incremental base.
  This prevents maintenance mode, but be aware that non-matched snapshots newer than the matching one will be destroyed
  before being replaced by even newer ones.
  Cloning the non-matched remote snapshots is one (arguably unclean) way around this, but it has not been implemented.
- Added `zfs hold` on the local snapshot, to ensure retention before the send operation.
- Removed redundant code (now $RECENT is used programmatically, instead of branching)
@mvitale1989
Author

Sorry, I noticed there was a bug while testing with multiple datasets (the trap is called only once). I'll fix it ASAP and open another pull request; for now I'll leave this open in case there's anything you'd like to change before then.

- Fix regression in last commit, which executed `zfs hold` for each
  dataset, but `zfs release` only for the last one. Now it is called
  right after the send/receive, and then the trap is cleared.
- Schedule `release` before even calling the `hold`, to guarantee its
  execution.
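The ordering described in this commit message can be sketched as below. The `zfs` stub stands in for the real binary so the sketch runs anywhere; the function and dataset names are illustrative, not the script's actual code.

```shell
#!/bin/sh
# Stub that just echoes its invocation; the real script calls the zfs binary.
zfs() { echo "zfs $*"; }

backup_one_dataset() {
    snap=$1
    # Schedule the release *before* taking the hold, so the snapshot is
    # released even if the script dies between hold and send.
    trap "zfs release zfs-backup $snap" EXIT
    zfs hold zfs-backup "$snap"
    zfs send "$snap"                  # send/receive would happen here
    zfs release zfs-backup "$snap"    # release right after the send...
    trap - EXIT                       # ...then clear the trap
}
```

Because the release runs inline after each dataset's send and the trap is then cleared, every dataset gets its own hold/release pair instead of only the last one.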
@adaugherity
Owner

Thank you for your contribution!

I've done a quick read-through and seen enough little things here and there that I don't want to merge as-is, and I need to take time to read the recovery logic in depth. It's a good idea, but we need to make sure it is doing what we expect. Also, I'm not sure MAXAGE is the right name if it's based on the number of snapshots rather than their age?

Quick notes:

  • "No need to change this" isn't 100% accurate for LOCK and PID: if you have multiple backup jobs (e.g. local and offsite), you want different LOCK and PID files for each. There are also a couple of typos in various comments.
  • When running from cron, directories like /sbin may not be in the PATH, so we shouldn't rely on finding zfs with which. Having sensible defaults based on OS (for at least Solaris/FreeBSD/Linux) is on my to-do list.
  • I believe zfs send automatically does its own hold, so I don't know if yours is necessary, but if so, you probably want zfs hold -r (and zfs release -r). I would have to check but it's possible the auto-hold is only on the snapshot being sent at a particular moment (and not its children), so a recursive hold may be desirable.
  • Please explain removing ssh -n. Note that the if/else block you eliminated also outputs different error messages for local vs. remote, which I thought would be helpful. And I'm pretty sure ssh is already skipped for localhost?
  • Storing the entire snapshot listing in a variable (e.g. $all_locals, $all_remotes) could hit variable length limits, meaning some snapshots are missed (vs. piping to grep and head/awk before assigning the variable). From a quick Google this appears to be platform-dependent on ARG_MAX, e.g. 256k on FreeBSD. It's probably an edge case that would be unlikely to be encountered, but I'd rather not introduce potential bugs.
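One way to address the PATH point above, sketched here with a hypothetical `find_cmd` helper (not code from either party), is to probe a few well-known locations explicitly instead of relying on the caller's PATH:

```shell
#!/bin/sh
# Print the first candidate that resolves to an executable, trying plain
# names (PATH lookup) and absolute paths alike; fail if none resolve.
find_cmd() {
    for candidate in "$@"; do
        if command -v "$candidate" >/dev/null 2>&1; then
            command -v "$candidate"
            return 0
        fi
    done
    return 1
}

# e.g.: ZFS=$(find_cmd zfs /sbin/zfs /usr/sbin/zfs) || exit 1
```

This keeps cron-safe absolute paths as fallbacks while still preferring whatever PATH provides, which could later be seeded with per-OS defaults as the to-do list suggests.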

@mvitale1989
Author

The recovery logic is very simple: if the latest remote snapshot does not exist locally, check whether the next-older remote one does, and retry up to $MAXAGE times (or until all remote snapshots have been checked, if MAXAGE=0). I agree that it's not the best name; maybe something like MAXREMOTERECENT, since it resembles your RECENT variable in some ways?

About those issues:

  • True, my bad. I just haven't needed to back up to multiple remotes yet, so leaving it as it is seemed obvious, but others have different needs.
  • True as well. A quick test also rules out `/usr/bin/env which`, as it does not include /sbin either. A solution could be a simple `which zfs || which /sbin/zfs || which /usr/sbin/zfs`.
  • I introduced the hold as a more consistent way of error checking. The previous behaviour was just to re-check the existence of $snap2 after $snap1 had been determined, and then do the send. In theory the local snapshot could still be rotated out after that check but before the send, which cannot happen with the hold, as it is atomic. Adding the -r would also be good.
  • Wow, I actually missed that the localhost branch already avoided the ssh -n, sorry. That change is superfluous then, though I think that using $REMZFS_CMD consistently instead of branching is more DRY-friendly.
  • An edge case, but it could still happen. Still, doing the information retrieval up front, instead of piping command outputs, has the advantage of letting you use that information in more than one place while minimizing calls to the zfs command. For example, while checking one by one whether the remote snapshots exist locally, at every iteration the local host's zfs list is grepped for the remote snapshot name. You can save time by buffering that command's output in a variable or, to avoid hitting the variable limit, in a temporary file. What do you think?
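The temp-file variant of that buffering idea could look like the sketch below. The snapshot names are illustrative, and in the real script the listing would come from `zfs list -H -o name -t snapshot` rather than a hard-coded `printf`.

```shell
#!/bin/sh
# Capture the local snapshot listing once, then grep the file per remote
# snapshot instead of re-running `zfs list` on every iteration.
locals=$(mktemp) || exit 1
trap 'rm -f "$locals"' EXIT

# Stand-in for: zfs list -H -o name -t snapshot > "$locals"
printf '%s\n' "pool/fs@daily-1" "pool/fs@daily-2" > "$locals"

snapshot_exists_locally() {
    grep -qxF "$1" "$locals"   # -x: whole-line match, -F: literal string
}
```

The `-x` and `-F` flags avoid false positives when one snapshot name is a prefix or substring of another, and the temp file sidesteps any shell variable length limit.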
