Issue with ZFS Replication Using Syncoid #937

Open
Mikesco3 opened this issue Jul 13, 2024 · 3 comments

@Mikesco3

Description

I am experiencing recurring issues with ZFS replication using syncoid.

I have a script scheduled to run every two hours that replicates datasets from an SSD pool to a pool of hard drives, and also to another server.

The script often fails during the zfs send and zfs receive processes with errors like:

  • cannot restore to rpool/_VMs/vm-10111-disk-1@autosnap_2024-07-13_02:00:18_hourly: destination already exists
  • Use of uninitialized value $existing in string eq at /usr/sbin/syncoid line 750.
  • Broken pipe (reported by mbuffer)
  • CRITICAL ERROR ... failed: 256 (reported by syncoid)

My Script:

Here is the script that I have scheduled:

#!/usr/bin/bash

## tfh-fs00 Server to pve2
/usr/sbin/syncoid --force-delete --identifier=pve2 fast200/_VMs/vm-111-disk-0 pve2:tank100/vm-10111-disk-0 && \
/usr/sbin/syncoid --force-delete --identifier=pve2 fast200/_VMs/vm-111-disk-1 pve2:tank100/vm-10111-disk-1 && \
/usr/sbin/syncoid --force-delete --identifier=pve2 fast200/_VMs/vm-111-disk-2 pve2:tank100/vm-10111-disk-2

## tfh-fs00 Server from SSD fast200 to HDs on rpool
/usr/sbin/syncoid --force-delete --identifier=rpool fast200/_VMs/vm-111-disk-0 rpool/_VMs/vm-10111-disk-0 && \
/usr/sbin/syncoid --force-delete --identifier=rpool fast200/_VMs/vm-111-disk-1 rpool/_VMs/vm-10111-disk-1 && \
/usr/sbin/syncoid --force-delete --identifier=rpool fast200/_VMs/vm-111-disk-2 rpool/_VMs/vm-10111-disk-2 
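Because each group above is chained with &&, a failure on one disk skips the remaining disks in that group. A minimal sketch of an alternative, assuming bash and the same dataset names as the script above, runs each disk independently and reports which one failed. SYNCOID and replicate_all are illustrative names, not part of syncoid.

```shell
# Sketch only: replicate each disk on its own so that one failure does not
# skip the others. SYNCOID and replicate_all are hypothetical names.
SYNCOID="${SYNCOID:-/usr/sbin/syncoid}"

# replicate_all IDENTIFIER DEST_PREFIX
# Runs syncoid for disks 0..2 and returns 1 if any of them failed.
replicate_all() {
    local identifier="$1" dest_prefix="$2" failed=0 n
    for n in 0 1 2; do
        if ! "$SYNCOID" --force-delete --identifier="$identifier" \
                "fast200/_VMs/vm-111-disk-$n" "${dest_prefix}/vm-10111-disk-$n"; then
            echo "replication of vm-111-disk-$n to $dest_prefix failed" >&2
            failed=1
        fi
    done
    return "$failed"
}

# Example calls (commented out so the sketch has no side effects):
# replicate_all pve2  pve2:tank100
# replicate_all rpool rpool/_VMs
```

The function still attempts every disk and only reports an overall failure at the end, which makes it easier to see whether the errors follow one particular zvol.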

Schedule

The script runs every two hours via cron:

17 */2 * * * (/root/mirrorVMs_to-PVE2-Shadows.sh) > /dev/null
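The cron entry above discards standard output, so the INFO lines that precede a failure are lost. A minimal sketch of a logging wrapper, assuming a POSIX-style shell; run_logged and the log path in the usage comment are hypothetical names chosen for illustration.

```shell
# run_logged LOGFILE COMMAND [ARGS...]
# Appends a timestamped header, the command's stdout and stderr, and its
# exit status to LOGFILE, then returns the command's exit status.
run_logged() {
    local logfile="$1" status=0
    shift
    {
        echo "=== $(date '+%Y-%m-%d %H:%M:%S') start: $*"
        "$@" || status=$?
        echo "=== $(date '+%Y-%m-%d %H:%M:%S') exit status: $status"
    } >> "$logfile" 2>&1
    return "$status"
}

# Example (hypothetical log path):
# run_logged /var/log/mirrorVMs.log /root/mirrorVMs_to-PVE2-Shadows.sh
```

Keeping both streams in one file preserves the order of the zfs receive message and the mbuffer errors that follow it.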

Error Sample

mbuffer: error: outputThread: error writing to <stdout> at offset 0x40000: Broken pipe
mbuffer: warning: error during output to <stdout>: Broken pipe
CRITICAL ERROR:  zfs send  -I 'fast200/_VMs/vm-111-disk-2'@'syncoid_rpool_tfh-pve1_2024-07-11:04:23:02-GMT-05:00' 'fast200/_VMs/vm-111-disk-2'@'syncoid_rpool_tfh-pve1_2024-07-12:20:20:47-GMT-05:00' | mbuffer  -q -s 128k -m 16M | pv -p -t -e -r -b -s 950304232 |  zfs receive  -s -F 'rpool/_VMs/vm-10111-disk-2' 2>&1 failed: 256

mbuffer: error: outputThread: error writing to <stdout> at offset 0x20000: Broken pipe
mbuffer: warning: error during output to <stdout>: Broken pipe
CRITICAL ERROR:  zfs send  -I 'fast200/_VMs/vm-111-disk-1'@'syncoid_rpool_tfh-pve1_2024-07-12:20:17:45-GMT-05:00' 'fast200/_VMs/vm-111-disk-1'@'syncoid_rpool_tfh-pve1_2024-07-12:22:17:22-GMT-05:00' | mbuffer  -q -s 128k -m 16M | pv -p -t -e -r -b -s 117088184 |  zfs receive  -s -F 'rpool/_VMs/vm-10111-disk-1' 2>&1 failed: 256

CRITICAL ERROR:  zfs send  -I 'fast200/_VMs/vm-111-disk-0'@'syncoid_rpool_tfh-pve1_2024-07-11:06:17:17-GMT-05:00' 'fast200/_VMs/vm-111-disk-0'@'syncoid_rpool_tfh-pve1_2024-07-12:18:17:27-GMT-05:00' | mbuffer  -q -s 128k -m 16M | pv -p -t -e -r -b -s 34944 |  zfs receive  -s -F 'rpool/_VMs/vm-10111-disk-0' 2>&1 failed: 256
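The trailing "failed: 256" in these messages looks like a raw wait status rather than an exit code. Assuming syncoid prints Perl's $? unchanged (an assumption about its internals, not something stated in this thread), the exit code is in the high byte and the terminating signal in the low seven bits. decode_wait_status is a hypothetical helper that shows the arithmetic.

```shell
# decode_wait_status STATUS
# Splits a 16-bit wait status into exit code (high byte) and signal number
# (low 7 bits). Assumes STATUS is a raw wait status such as Perl's $?.
decode_wait_status() {
    status="$1"
    echo "exit_code=$(( status >> 8 )) signal=$(( status & 127 ))"
}

decode_wait_status 256   # prints: exit_code=1 signal=0
```

Under that assumption, 256 means the pipeline exited with status 1 and was not killed by a signal.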

Reproduction

When I run the syncoid lines manually, some go through fine, and then one throws this error:

/usr/sbin/syncoid --force-delete --identifier=rpool fast200/_VMs/vm-111-disk-1 rpool/_VMs/vm-10111-disk-1
INFO: Sending incremental fast200/_VMs/vm-111-disk-1@syncoid_rpool_tfh-pve1_2024-07-12:20:17:45-GMT-05:00 ... syncoid_rpool_tfh-pve1_2024-07-12:23:17:37-GMT-05:00 to rpool/_VMs/vm-10111-disk-1 (~ 1.8 GB):
cannot restore to rpool/_VMs/vm-10111-disk-1@autosnap_2024-07-13_02:00:18_hourly: destination already exists
64.0KiB 0:00:00 [ 423KiB/s] [>                                                                            ]  0%            
mbuffer: error: outputThread: error writing to <stdout> at offset 0x30000: Broken pipe
mbuffer: warning: error during output to <stdout>: Broken pipe
CRITICAL ERROR:  zfs send  -I 'fast200/_VMs/vm-111-disk-1'@'syncoid_rpool_tfh-pve1_2024-07-12:20:17:45-GMT-05:00' 'fast200/_VMs/vm-111-disk-1'@'syncoid_rpool_tfh-pve1_2024-07-12:23:17:37-GMT-05:00' | mbuffer  -q -s 128k -m 16M | pv -p -t -e -r -b -s 1968025760 |  zfs receive  -s -F 'rpool/_VMs/vm-10111-disk-1' 2>&1 failed: 256
Use of uninitialized value $existing in string eq at /usr/sbin/syncoid line 750.
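The "destination already exists" message above names an autosnap snapshot on the destination. One possible explanation, not confirmed in this thread, is that a snapshot with that name exists on both sides but was created independently on each. The sketch below is a diagnostic under that assumption: it prints snapshot names that appear on both datasets with different guid values. snapshot_guids and conflicting_snapshots are hypothetical helper names; zfs get is the only real command involved.

```shell
# snapshot_guids DATASET
# Prints "snapname<TAB>guid" for each snapshot directly under DATASET.
snapshot_guids() {
    zfs get -H -d 1 -t snapshot -o name,value guid "$1" | sed 's/^[^@]*@//'
}

# conflicting_snapshots SOURCE DEST
# Prints names found on both sides whose guids differ, i.e. snapshots that
# share a name but were not replicated from one another.
conflicting_snapshots() {
    {
        snapshot_guids "$1" | awk '{ print "src\t" $0 }'
        snapshot_guids "$2" | awk '{ print "dst\t" $0 }'
    } | awk -F'\t' '
        $1 == "src" { guid[$2] = $3; next }
        ($2 in guid) && guid[$2] != $3 { print $2 }'
}

# Example (dataset names taken from the command above):
# conflicting_snapshots fast200/_VMs/vm-111-disk-1 rpool/_VMs/vm-10111-disk-1
```

An empty result would argue against this explanation; any name printed is a candidate for the snapshot that zfs receive refuses to overwrite.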

Troubleshooting I've attempted:

  • Making sure both zpools have been upgraded
  • Deleting the destination datasets and then trying again
    (it goes fine for a bit and then the errors come back)
  • Downloading the latest syncoid from GitHub

Additionally:

Here is the portion of my sanoid.conf that is relevant to this:


[fast200/_VMs]
	use_template = production
	recursive = yes

[rpool]
	use_template = production
	recursive = yes

[rpool/_VMs]
	use_template = production
	recursive = yes

[fast200/_VMs/vm-112-disk-0]
	use_template = ignore

[fast200/_VMs/vm-112-disk-1]
	use_template = ignore

[rpool/_VMs/vm-112-disk-0]
	use_template = ignore

## This is for the replica VM of tfh-fs00
[rpool/_Shadows]
	use_template = shadows

[rpool/_VMs/vm-10111-disk-0]
	use_template = shadows

[rpool/_VMs/vm-10111-disk-1]
	use_template = shadows

[rpool/_VMs/vm-10111-disk-2]
	use_template = shadows


#############################
# templates below this line #
#############################

[template_production]
	frequently = 0
	hourly = 36
	daily = 8
	monthly = 1
	yearly = 0
	autosnap = yes
	autoprune = yes

[template_backup]
	autoprune = yes
	frequently = 0
	hourly = 0
	daily = 31
	monthly = 6
	yearly = 0

	### don't take new snapshots - snapshots on backup
	### datasets are replicated in from source, not
	### generated locally
	autosnap = no

	### monitor hourlies and dailies, but don't warn or
	### crit until they're over 48h old, since replication
	### is typically daily only
	hourly_warn = 2880
	hourly_crit = 3600
	daily_warn = 48
	daily_crit = 60

[template_shadows]
	autoprune = yes
	frequently = 0
#	hourly = 0
	daily = 31
	monthly = 6
	yearly = 0

[template_hotspare]
	autoprune = yes
	frequently = 0
	hourly = 30
	daily = 9
	monthly = 0
	yearly = 0

	### don't take new snapshots - snapshots on backup
	### datasets are replicated in from source, not
	### generated locally
	autosnap = no

	### monitor hourlies and dailies, but don't warn or
	### crit until they're over 4h old, since replication
	### is typically hourly only
	hourly_warn = 4h
	hourly_crit = 6h
	daily_warn = 2d
	daily_crit = 4d

[template_scripts]
	### information about the snapshot will be supplied as environment variables,
	### see the README.md file for details about what is passed when.
	### run script before snapshot
	pre_snapshot_script = /path/to/script.sh
	### run script after snapshot
	post_snapshot_script = /path/to/script.sh
	### run script after pruning snapshot
	pruning_script = /path/to/script.sh
	### don't take an inconsistent snapshot (skip if pre script fails)
	#no_inconsistent_snapshot = yes
	### run post_snapshot_script when pre_snapshot_script is failing
	#force_post_snapshot_script = yes
	### limit allowed execution time of scripts before continuing (<= 0: infinite)
	script_timeout = 5

[template_ignore]
	autoprune = no
	autosnap = no
	monitor = no

I'm practically pulling my hair out; I don't have this issue on any of my other Proxmox servers.

@Mikesco3

Update:

I copied over the executables from version 2.1.0 and my problem disappeared....

@Mikesco3

Mikesco3 commented Aug 1, 2024

I think I may have found the issue.
I must have forgotten to enable sanoid.timer.
I ran systemctl enable --now sanoid.timer
sanoid wasn't pruning the old snapshots. I now see it pruning a bunch of old snapshots, so I'm cautiously optimistic.
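To confirm that pruning is actually happening, one option is to count the autosnap snapshots on a dataset before and after a timer run. A minimal sketch, assuming snapshots follow the autosnap_ naming seen in the errors above; count_autosnaps is a hypothetical helper name.

```shell
# count_autosnaps DATASET
# Prints how many snapshots directly under DATASET have names starting with
# "autosnap_". Prints 0 (and still succeeds) when there are none.
count_autosnaps() {
    zfs list -H -t snapshot -o name -d 1 "$1" | grep -c '@autosnap_' || true
}

# Example checks (commented out so the sketch has no side effects):
# systemctl is-enabled sanoid.timer
# systemctl list-timers sanoid.timer
# count_autosnaps rpool/_VMs/vm-10111-disk-1
```

A count that keeps growing across runs would suggest the timer is enabled but pruning is still not taking effect for that dataset.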

@ameyp

ameyp commented Jan 13, 2025

Even if sanoid prunes the older snapshots, I don't see that as a fix for syncoid. I'm running into what might be the same issue as @Mikesco3, except that syncoid fails every time for me with:

CRITICAL ERROR: zfs send -R -w -c -v -I 'source'@'snap_1' 'source'@'snap_2' | pv -p -t -e -r -b -s 487323760 | lzop | mbuffer -q -s 128k -m 16M | ssh -i /path/to/key -S /tmp/whatever [email protected] ' mbuffer -q -s 128k -m 16M | lzop -dfc | zfs receive -v -s -F 'destination/syncoid' 2>&1' failed: 256

The only output preceding the CRITICAL ERROR is a series of "receiving incremental stream of source@snap-x into destination/syncoid@snap-x" and "received 1.31K stream in 0.61 seconds (2.14K/sec)" lines. I can provide the full logs, but there really isn't anything useful in them.

I'm using the nix-wrapped build of syncoid 2.2.0. My hunch is that this isn't nix-related, but I could be wrong.
