btrfs send/receive - Failure to destroy quota group can prevent promoting oldest snapshot during replication #2901
Could this be due to changes to the qgroup maintenance within Rockstor, or related to newer btrfs developments that added additional constraints? Per the btrfs documentation:

while Rockstor uses the
@Hooverdan96 Post #2911, I've had another go at reproducing this issue: without success. This time with a slightly less trivial 2 GB data payload on the replicated share. This is in the context of the following: #2902 (comment). It may be that in some contexts the error report detailed in this issue is another emergence, with a mutual cause now addressed in #2911. But I'm not convinced of this. And your notes here regarding our having no `btrfs qgroup destroy ...` counterpart to our custom 2015/* parent/child qgroup creation/assign are informative. However, without an exact reproducer it's tricky to know if we are doing what is necessary. And, as below, there is now (in master) successful repclone function, albeit with only a slightly non-trivial data payload. Receiver:

I know there have been some improvements more recently in btrfs regarding qgroup housekeeping; I'm just not sure whether it's something we have yet to respond to. In the above log there appears to be no blocker. I'll have a little more of a look at what we are doing at this stage and make notes accordingly in this issue if anything stands out. I'm mostly focused on show-stoppers currently, but this did qualify; however, without a reproducer it's tricky.
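For reference, a minimal sketch (with a hypothetical pool mount point and example qgroup IDs, not taken from any log above) of what the create/assign side looks like in plain btrfs-progs terms, alongside the tear-down commands for which we appear to have no counterpart:

```bash
# Sketch only: mount point and qgroup IDs are illustrative.
MNT=/mnt2/example_pool      # hypothetical pool mount point
PQGROUP=2015/4              # hypothetical Rockstor-style parent qgroup
SUBVOL_QG=0/261             # hypothetical subvolume (level 0) qgroup

# Creation/assign side (the side Rockstor does perform):
btrfs qgroup create "$PQGROUP" "$MNT"
btrfs qgroup assign "$SUBVOL_QG" "$PQGROUP" "$MNT"

# Tear-down side (the missing `btrfs qgroup destroy ...` counterpart discussed above):
btrfs qgroup remove "$SUBVOL_QG" "$PQGROUP" "$MNT"   # drop the parent/child relation
btrfs qgroup destroy "$PQGROUP" "$MNT"               # then destroy the parent qgroup
```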
In the interest of attempting to trigger this issue, I added an additional many-files 2 GB payload to the source share in between replication events (a sketch of generating such a payload follows the test notes below).

On Receiving system:
On Sender:
On Receiver:

The above artificially created anomaly (receiver end) is successfully detected & logged. The receiver then re-establishes the future receiving share and lists the first
As is evident, we get a restore to a stable state on the Receiver: the replication share is re-created with 3 snapshots, cascading oldest-first to supplant their replication-created share, over the subsequent 4-5 events. More pertinent to this issue, the snapshot-to-share promotion (this time with around 4 GB of small-files payload) again did not give us the reproducer that would be favoured here. These tests were performed on low-end VM instances where a single full replication transfer lasted 3 minutes; a 5 minute replication interval was used.
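For anyone else attempting a reproducer, a rough sketch of how a many-small-files payload of this sort of size can be generated on the source share between replication events (path and file counts are illustrative, not the exact setup used above):

```bash
# Sketch only: illustrative path and sizes for a ~2 GiB many-small-files payload.
SRC=/mnt2/example_share                 # hypothetical replicated source share
DEST="$SRC/payload-$(date +%s)"         # new directory per replication interval
mkdir -p "$DEST"
for i in $(seq 1 2048); do
    # 2048 x 1 MiB of pseudo-random (incompressible) data ~= 2 GiB
    dd if=/dev/urandom of="$DEST/file_$i" bs=1M count=1 status=none
done
```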
As observed in the scenario described on the Rockstor community forum (users stevek, Hooverdan, phillxnet), when quotas are enabled on the receiving system, it can happen that a quota group cannot be destroyed during the receive process while trying to promote a snapshot:
https://forum.rockstor.com/t/disk-structure-under-mnt2-and-replication-question/9720/21
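Not from the forum thread itself, but as a hedged sketch of how the receiver's qgroup state can be inspected when such a destroy failure occurs (mount point and qgroup IDs are examples):

```bash
# Sketch only: example mount point; qgroup IDs in the comments are illustrative.
MNT=/mnt2/example_pool          # hypothetical receiving pool mount

# Confirm quotas are enabled and list all qgroups with their parent/child relations:
btrfs qgroup show -pc "$MNT"

# Assumption (not confirmed by the forum thread): a qgroup that still has child
# qgroups assigned to it can refuse to be destroyed, so one manual recovery path
# is to drop the relation first and then retry the destroy, e.g.:
#   btrfs qgroup remove 0/260 2015/3 "$MNT"
#   btrfs qgroup destroy 2015/3 "$MNT"
```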