RST delete ci run #1435

Closed
wants to merge 80 commits into from

Commits on Oct 17, 2024

  1. btrfs: don't take dev_replace rwsem on task already holding it

    Running fstests btrfs/011 with MKFS_OPTIONS="-O rst" to force the usage of
    the RAID stripe-tree, we get the following splat from lockdep:
    
     BTRFS info (device sdd): dev_replace from /dev/sdd (devid 1) to /dev/sdb started
    
     ============================================
     WARNING: possible recursive locking detected
     6.11.0-rc3-btrfs-for-next #599 Not tainted
     --------------------------------------------
     btrfs/2326 is trying to acquire lock:
     ffff88810f215c98 (&fs_info->dev_replace.rwsem){++++}-{3:3}, at: btrfs_map_block+0x39f/0x2250
    
     but task is already holding lock:
     ffff88810f215c98 (&fs_info->dev_replace.rwsem){++++}-{3:3}, at: btrfs_map_block+0x39f/0x2250
    
     other info that might help us debug this:
      Possible unsafe locking scenario:
    
            CPU0
            ----
       lock(&fs_info->dev_replace.rwsem);
       lock(&fs_info->dev_replace.rwsem);
    
      *** DEADLOCK ***
    
      May be due to missing lock nesting notation
    
     1 lock held by btrfs/2326:
      #0: ffff88810f215c98 (&fs_info->dev_replace.rwsem){++++}-{3:3}, at: btrfs_map_block+0x39f/0x2250
    
     stack backtrace:
     CPU: 1 UID: 0 PID: 2326 Comm: btrfs Not tainted 6.11.0-rc3-btrfs-for-next #599
     Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
     Call Trace:
      <TASK>
      dump_stack_lvl+0x5b/0x80
      __lock_acquire+0x2798/0x69d0
      ? __pfx___lock_acquire+0x10/0x10
      ? __pfx___lock_acquire+0x10/0x10
      lock_acquire+0x19d/0x4a0
      ? btrfs_map_block+0x39f/0x2250
      ? __pfx_lock_acquire+0x10/0x10
      ? find_held_lock+0x2d/0x110
      ? lock_is_held_type+0x8f/0x100
      down_read+0x8e/0x440
      ? btrfs_map_block+0x39f/0x2250
      ? __pfx_down_read+0x10/0x10
      ? do_raw_read_unlock+0x44/0x70
      ? _raw_read_unlock+0x23/0x40
      btrfs_map_block+0x39f/0x2250
      ? btrfs_dev_replace_by_ioctl+0xd69/0x1d00
      ? btrfs_bio_counter_inc_blocked+0xd9/0x2e0
      ? __kasan_slab_alloc+0x6e/0x70
      ? __pfx_btrfs_map_block+0x10/0x10
      ? __pfx_btrfs_bio_counter_inc_blocked+0x10/0x10
      ? kmem_cache_alloc_noprof+0x1f2/0x300
      ? mempool_alloc_noprof+0xed/0x2b0
      btrfs_submit_chunk+0x28d/0x17e0
      ? __pfx_btrfs_submit_chunk+0x10/0x10
      ? bvec_alloc+0xd7/0x1b0
      ? bio_add_folio+0x171/0x270
      ? __pfx_bio_add_folio+0x10/0x10
      ? __kasan_check_read+0x20/0x20
      btrfs_submit_bio+0x37/0x80
      read_extent_buffer_pages+0x3df/0x6c0
      btrfs_read_extent_buffer+0x13e/0x5f0
      read_tree_block+0x81/0xe0
      read_block_for_search+0x4bd/0x7a0
      ? __pfx_read_block_for_search+0x10/0x10
      btrfs_search_slot+0x78d/0x2720
      ? __pfx_btrfs_search_slot+0x10/0x10
      ? lock_is_held_type+0x8f/0x100
      ? kasan_save_track+0x14/0x30
      ? __kasan_slab_alloc+0x6e/0x70
      ? kmem_cache_alloc_noprof+0x1f2/0x300
      btrfs_get_raid_extent_offset+0x181/0x820
      ? __pfx_lock_acquire+0x10/0x10
      ? __pfx_btrfs_get_raid_extent_offset+0x10/0x10
      ? down_read+0x194/0x440
      ? __pfx_down_read+0x10/0x10
      ? do_raw_read_unlock+0x44/0x70
      ? _raw_read_unlock+0x23/0x40
      btrfs_map_block+0x5b5/0x2250
      ? __pfx_btrfs_map_block+0x10/0x10
      scrub_submit_initial_read+0x8fe/0x11b0
      ? __pfx_scrub_submit_initial_read+0x10/0x10
      submit_initial_group_read+0x161/0x3a0
      ? lock_release+0x20e/0x710
      ? __pfx_submit_initial_group_read+0x10/0x10
      ? __pfx_lock_release+0x10/0x10
      scrub_simple_mirror.isra.0+0x3eb/0x580
      scrub_stripe+0xe4d/0x1440
      ? lock_release+0x20e/0x710
      ? __pfx_scrub_stripe+0x10/0x10
      ? __pfx_lock_release+0x10/0x10
      ? do_raw_read_unlock+0x44/0x70
      ? _raw_read_unlock+0x23/0x40
      scrub_chunk+0x257/0x4a0
      scrub_enumerate_chunks+0x64c/0xf70
      ? __mutex_unlock_slowpath+0x147/0x5f0
      ? __pfx_scrub_enumerate_chunks+0x10/0x10
      ? bit_wait_timeout+0xb0/0x170
      ? __up_read+0x189/0x700
      ? scrub_workers_get+0x231/0x300
      ? up_write+0x490/0x4f0
      btrfs_scrub_dev+0x52e/0xcd0
      ? create_pending_snapshots+0x230/0x250
      ? __pfx_btrfs_scrub_dev+0x10/0x10
      btrfs_dev_replace_by_ioctl+0xd69/0x1d00
      ? lock_acquire+0x19d/0x4a0
      ? __pfx_btrfs_dev_replace_by_ioctl+0x10/0x10
      ? lock_release+0x20e/0x710
      ? btrfs_ioctl+0xa09/0x74f0
      ? __pfx_lock_release+0x10/0x10
      ? do_raw_spin_lock+0x11e/0x240
      ? __pfx_do_raw_spin_lock+0x10/0x10
      btrfs_ioctl+0xa14/0x74f0
      ? lock_acquire+0x19d/0x4a0
      ? find_held_lock+0x2d/0x110
      ? __pfx_btrfs_ioctl+0x10/0x10
      ? lock_release+0x20e/0x710
      ? do_sigaction+0x3f0/0x860
      ? __pfx_do_vfs_ioctl+0x10/0x10
      ? do_raw_spin_lock+0x11e/0x240
      ? lockdep_hardirqs_on_prepare+0x270/0x3e0
      ? _raw_spin_unlock_irq+0x28/0x50
      ? do_sigaction+0x3f0/0x860
      ? __pfx_do_sigaction+0x10/0x10
      ? __x64_sys_rt_sigaction+0x18e/0x1e0
      ? __pfx___x64_sys_rt_sigaction+0x10/0x10
      ? __x64_sys_close+0x7c/0xd0
      __x64_sys_ioctl+0x137/0x190
      do_syscall_64+0x71/0x140
      entry_SYSCALL_64_after_hwframe+0x76/0x7e
     RIP: 0033:0x7f0bd1114f9b
     Code: Unable to access opcode bytes at 0x7f0bd1114f71.
     RSP: 002b:00007ffc8a8c3130 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
     RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f0bd1114f9b
     RDX: 00007ffc8a8c35e0 RSI: 00000000ca289435 RDI: 0000000000000003
     RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000007
     R10: 0000000000000008 R11: 0000000000000246 R12: 00007ffc8a8c6c85
     R13: 00000000398e72a0 R14: 0000000000004361 R15: 0000000000000004
      </TASK>
    
    This happens because on RAID stripe-tree filesystems we recurse back into
    btrfs_map_block() on scrub to perform the logical to device physical
    mapping.
    
    But as the device replace task is already holding the dev_replace::rwsem
    we deadlock.
    
    So don't take the dev_replace::rwsem in case our task is the task performing
    the device replace.
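
To illustrate the fix, here is a minimal user-space sketch in C, not the kernel code. The lock is modelled as a plain counter, and `struct task`, `struct replace_state`, `map_block_lock()` and `map_block_unlock()` are hypothetical names invented for this example. It assumes the task running the replace is recorded somewhere the mapping path can compare against the current task.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names are the kernel's. */
struct task { int pid; };

struct replace_state {
    const struct task *replace_task; /* task running the replace, or NULL */
    int read_depth;                  /* how many times the lock is held */
};

/* Take the (modelled) lock unless the caller is the replace task itself. */
bool map_block_lock(struct replace_state *rs, const struct task *cur)
{
    if (rs->replace_task == cur)
        return false;   /* caller already holds it: do not recurse */
    rs->read_depth++;
    return true;
}

/* Release only what map_block_lock() actually took. */
void map_block_unlock(struct replace_state *rs, bool locked)
{
    if (locked)
        rs->read_depth--;
}
```

With this shape, the nested call made on behalf of the replace task returns without touching the lock, while any other task still takes and releases it as before.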
    
    Suggested-by: Filipe Manana <[email protected]>
    Signed-off-by: Johannes Thumshirn <[email protected]>
    Reviewed-by: Filipe Manana <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    morbidrsa authored and kdave committed Oct 17, 2024
    Commit fe46c64
  2. btrfs: make assert_rbio() to only check CONFIG_BTRFS_ASSERT

    According to the description, CONFIG_BTRFS_DEBUG is only for extra
    debug info, meanwhile sanity checks should be managed by
    CONFIG_BTRFS_ASSERT.
    
    There is no need to check both to enable assert_rbio().
    
    Just remove the check for CONFIG_BTRFS_DEBUG.
    
    Reviewed-by: Johannes Thumshirn <[email protected]>
    Signed-off-by: Qu Wenruo <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    adam900710 authored and kdave committed Oct 17, 2024
    Commit 5b3b62a
  3. btrfs: split out CONFIG_BTRFS_EXPERIMENTAL from CONFIG_BTRFS_DEBUG

    Currently CONFIG_BTRFS_DEBUG is not only for the extra debugging
    output, but also for experimental features.
    
    This makes it hard to distinguish planned but not yet stable features
    from those purely designed for debugging.
    
    This patch splits the following features into CONFIG_BTRFS_EXPERIMENTAL:
    
    - Extent map shrinker
      This seems to be the first one to exit experimental.
    
    - Extent tree v2
      This seems to be the last one to graduate from experimental.
    
    - Raid stripe tree
    - Csum offload mode
    - Send protocol v3
    
    Reviewed-by: Johannes Thumshirn <[email protected]>
    Signed-off-by: Qu Wenruo <[email protected]>
    Reviewed-by: David Sterba <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    adam900710 authored and kdave committed Oct 17, 2024
    Commit 8f83607
  4. btrfs: zlib: make the compression path to handle sector size < page size

    Inside zlib_compress_folios(), each time we switch the input page cache,
    the @start is increased by PAGE_SIZE.
    
    But for the incoming compression support for sector size < page size
    (previously we support compression only when the range is fully page
    aligned), this is not going to handle the following case:
    
        0          32K         64K          96K
        |          |///////////||///////////|
    
    @start has the initial value 32K, indicating the start filepos of the
    to-be-compressed range.
    
    And when grabbing the first page as input, we always call "start +=
    PAGE_SIZE;".
    
    But since @start begins at 32K, it is increased by 64K and becomes 96K
    for the next range, causing an incorrect input range and corruption for
    the future subpage compression.
    
    Fix it by increasing @start only by the input size.
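
The stepping problem can be reproduced with a few lines of arithmetic. The sketch below is a stand-alone model, not the zlib_compress_folios() code: `MODEL_PAGE_SIZE` (assumed 64K), `input_len_in_page()`, `next_start_buggy()` and `next_start_fixed()` are hypothetical names used only here.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE (64u * 1024u)  /* assumed 64K pages */

/* Bytes usable as input from the page containing @start, capped by @end. */
uint64_t input_len_in_page(uint64_t start, uint64_t end)
{
    uint64_t page_end = (start / MODEL_PAGE_SIZE + 1) * MODEL_PAGE_SIZE;

    return (end < page_end ? end : page_end) - start;
}

/* Old behavior: always step by a whole page. */
uint64_t next_start_buggy(uint64_t start)
{
    return start + MODEL_PAGE_SIZE;
}

/* Fixed behavior: step only by what was consumed from this page. */
uint64_t next_start_fixed(uint64_t start, uint64_t end)
{
    return start + input_len_in_page(start, end);
}
```

For the range [32K, 96K) from the diagram, the buggy step jumps from 32K straight to 96K and skips the second page, while the fixed step lands on 64K and then on 96K.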
    
    Signed-off-by: Qu Wenruo <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    adam900710 authored and kdave committed Oct 17, 2024
    Commit 18c2be4
  5. btrfs: zstd: make the compression path to handle sector size < page size

    Inside zstd_compress_folios(), after exhausting one input page, we need
    to switch to the next page as input.
    
    However when counting the total input bytes (@tot_in), we always increase
    it by PAGE_SIZE.
    
    For the following case, it can cause incorrect value:
    
            0          32K         64K          96K
            |          |///////////||///////////|
    
    After compressing range [32K, 64K), we switch to the next page and
    increase @tot_in by 64K, while we only read 32K.
    
    This causes @total_in to be returned with a value larger than the input
    length.
    
    Fix it by increasing @tot_in only by the input size.
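
The accounting error can be shown with a small loop that walks the range page by page. This is a simplified model under the assumption of 64K pages, not the zstd_compress_folios() code; `MODEL_PAGE_SIZE` and `count_total_in()` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE (64u * 1024u)  /* assumed 64K pages */

/*
 * Walk [start, start + len) one page at a time and count the input bytes.
 * With @fixed == 0 every visited page is charged as a full page (the bug);
 * with @fixed != 0 only the bytes actually read are charged.
 */
uint64_t count_total_in(uint64_t start, uint64_t len, int fixed)
{
    uint64_t end = start + len, cur = start, tot_in = 0;

    while (cur < end) {
        uint64_t page_end = (cur / MODEL_PAGE_SIZE + 1) * MODEL_PAGE_SIZE;
        uint64_t in_len = (end < page_end ? end : page_end) - cur;

        tot_in += fixed ? in_len : MODEL_PAGE_SIZE;
        cur += in_len;
    }
    return tot_in;
}
```

For a 64K long range starting at 32K, the fixed accounting yields exactly 64K, whereas the per-page accounting overshoots the input length.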
    
    Signed-off-by: Qu Wenruo <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    adam900710 authored and kdave committed Oct 17, 2024
    Commit 17a51a0
  6. btrfs: compression: add an ASSERT() to ensure the read-in length is sane

    There are already two bugs (one in zlib, one in zstd) where the
    compression path does not handle sector size < page size cases well,
    resulting in @total_in being larger than the to-be-compressed range
    length.
    
    That is enough reason to add an ASSERT() to make sure the total read-in
    length reported by btrfs_compress_folios() doesn't exceed the input
    length.
    
    Signed-off-by: Qu Wenruo <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    adam900710 authored and kdave committed Oct 17, 2024
    Commit 6eb293f
  7. btrfs: wait for writeback if sector size is smaller than page size

    [PROBLEM]
    If sector perfect compression is enabled for the sector size < page size
    case, the following layout can lead to dirty ranges not being written back:
    
         0     32K     64K     96K     128K
         |     |///////||//////|     |/|
                                     124K
    
    In the above example, the page size is 64K, and we need to write back the
    above two pages.
    
    - Submit for page 0 (main thread)
      We found delalloc range [32K, 96K), which can be compressed.
      So we queue an async range for [32K, 96K).
      This means, the page unlock/clearing dirty/setting writeback will
      all happen in a workqueue context.
    
    - The compression is done, and compressed range is submitted (workqueue)
      Since the compression is done asynchronously, it can finish before
      the main thread submits page 64K.
    
      Now the whole range [32K, 96K), involving two pages, will be marked
      writeback.
    
    - Submit for page 64K (main thread)
      extent_write_cache_pages() sees that wbc->sync_mode is WB_SYNC_NONE,
      so it skips the writeback wait.
    
      It then unlocks the page and exits. This means the dirty range
      [124K, 128K) will not be submitted until the next writeback happens
      for page 64K.
    
    This will never happen for previous kernels because:
    
    - For sector size == page size case
      Since one page is one sector, if a page is marked writeback it will
      not have dirty flags.
      So this corner case will never hit.
    
    - For sector size < page size case
      We never do subpage compression, a range can only be submitted for
      compression if the range is fully page aligned.
      This change makes the subpage behavior mostly the same as non-subpage
      cases.
    
    [ENHANCEMENT]
    Instead of relying on the WB_SYNC_NONE check only, always wait for the
    writeback flags if it's a subpage case.
    
    Signed-off-by: Qu Wenruo <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    adam900710 authored and kdave committed Oct 17, 2024
    Commit 50e2162
  8. btrfs: make extent_range_clear_dirty_for_io() to handle sector size < page size cases
    
    For btrfs with sector size < page size (e.g. 4K sector size, 64K page
    size) and sector perfect compression support enabled, the following
    dirty range can lead to problems:
    
       0     32K     64K     96K    128K
       |     |///////||//////|    |/|
                                  124K
    
    In the above case, if we start writeback for that inode, the last dirty
    range [124K, 128K) will not be submitted, causing reserved space
    leakage:
    
    - Start writeback for page 0
      We find the range [32K, 96K) is suitable for compression, and queue it
      into a workqueue to do the delayed compression and submission.
    
    - Compression happens for range [32K, 96K)
      Function extent_range_clear_dirty_for_io() is called, however it
      only does full page handling, not considering any of the extra bitmaps
      for subpage cases.
    
      That function will clear page dirty for both page 0 and page 64K.
    
    - Writeback for the inode is done
      Because page 64K has its dirty flag cleared, it will not be considered
      as a writeback target.
    
    This means the range [124K, 128K) will not be submitted, and reserved
    space for it will be leaked.
    
    Fix this problem by using the subpage helper to clear the dirty flag.
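
The difference between full page handling and bitmap-aware handling can be sketched as follows. This is a simplified model assuming 4K sectors and 64K pages as in the example. `struct model_page` and the helper names are hypothetical and are not the btrfs subpage helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SECTOR_SIZE      (4u * 1024u)   /* assumed 4K sectors */
#define MODEL_PAGE_SIZE  (64u * 1024u)  /* assumed 64K pages, 16 sectors */

struct model_page {
    uint16_t dirty_bitmap;  /* one bit per sector inside the page */
    bool page_dirty;        /* page-level dirty flag */
};

/* Bit mask covering the sectors of [off, off + len) inside one page. */
static uint16_t range_mask(uint32_t off, uint32_t len)
{
    uint32_t first = off / SECTOR_SIZE, nr = len / SECTOR_SIZE;

    return (uint16_t)(((1u << nr) - 1u) << first);
}

void set_dirty(struct model_page *p, uint32_t off, uint32_t len)
{
    p->dirty_bitmap |= range_mask(off, len);
    p->page_dirty = true;
}

/* Full page handling: drops the page flag no matter what else is dirty. */
void clear_dirty_full_page(struct model_page *p)
{
    p->page_dirty = false;
}

/* Bitmap-aware: clear only the range; keep the flag while any bit remains. */
void clear_dirty_subpage(struct model_page *p, uint32_t off, uint32_t len)
{
    p->dirty_bitmap &= (uint16_t)~range_mask(off, len);
    p->page_dirty = (p->dirty_bitmap != 0);
}
```

Applied to page 64K from the diagram, clearing the compressed part [64K, 96K) with the bitmap-aware helper leaves the sector at [124K, 128K) set, so the page stays a writeback target.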
    
    Signed-off-by: Qu Wenruo <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    adam900710 authored and kdave committed Oct 17, 2024
    Commit 2923eaf
  9. btrfs: do not assume the full page range is not dirty in extent_writepage_io()
    
    The function extent_writepage_io() will submit the dirty sectors inside
    the page for the write.
    
    But recently, to co-operate with the incoming subpage compression
    enhancement, a new bitmap, btrfs_bio_ctrl::submit_bitmap, was
    introduced so that only a subset of the dirty range is submitted.
    
    This is because we can have the following cases with 64K page size:
    
        0      16K       32K       48K       64K
        |      |/////////|         |/|
                                     52K
    
    For range [16K, 32K), we queue the dirty range for compression, which is
    run in a delayed workqueue.
    Then for range [48K, 52K), we go through the regular submission path.
    
    In that case, our btrfs_bio_ctrl::submit_bitmap will exclude the range
    [16K, 32K).
    
    The dirty flags for the range [16K, 32K) are only cleared when the
    compression is done, by the extent_clear_unlock_delalloc() call inside
    submit_one_async_extent().
    
    This patch fixes the false alert by removing the
    btrfs_folio_assert_not_dirty() check, since it's no longer correct for
    subpage compression cases.
    
    Signed-off-by: Qu Wenruo <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    adam900710 authored and kdave committed Oct 17, 2024
    Commit 5ce7471
  10. btrfs: move the delalloc range bitmap search into extent_io.c

    Currently for subpage (sector size < page size) cases, we reuse subpage
    locked bitmap to find out all delalloc ranges we have locked, and run
    all those found ranges.
    
    However such reuse is not perfect, e.g.:
    
        0       32K      64K      96K       128K
        |       |////////||///////|    |////|
                                       120K
    
    For the above range, writepage_delalloc() for page 0 will handle the range
    [32K, 96K); note that a delalloc range can go beyond the page boundary.
    
    But writepage_delalloc() for page 64K will only handle range [120K,
    128K), as the previous run on page 0 has already handled range [64K,
    96K).
    Meanwhile for the writeback we should expect range [64K, 96K) to also be
    locked, which leads to a mismatch between the locked bitmap and the
    delalloc range.
    
    This is not causing problems yet, but it's still an inconsistent
    behavior.
    
    So instead of relying on the subpage locked bitmap, do the delalloc
    range search using a local @delalloc_bitmap, so that we can remove the
    existing btrfs_folio_find_writer_locked().
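
Searching a per-page bitmap for contiguous ranges can be sketched as below. This is an illustrative stand-alone helper, assuming 16 sectors per page (64K page, 4K sectors). `next_set_range()` is a hypothetical name and is not the helper used in extent_io.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SECTORS_PER_PAGE 16u  /* assumed: 64K page, 4K sectors */

/*
 * Find the next run of set bits at or after bit @from.
 * Returns false when no set bit remains; otherwise stores the first bit
 * of the run in @start and the run length in @nr.
 */
bool next_set_range(uint16_t bitmap, unsigned from,
                    unsigned *start, unsigned *nr)
{
    unsigned i = from;

    while (i < SECTORS_PER_PAGE && !(bitmap & (1u << i)))
        i++;
    if (i >= SECTORS_PER_PAGE)
        return false;
    *start = i;
    while (i < SECTORS_PER_PAGE && (bitmap & (1u << i)))
        i++;
    *nr = i - *start;
    return true;
}
```

A caller would loop, passing `start + nr` as the next `from`, until the function returns false. For page 64K in the diagram, a bitmap with sectors 0 to 7 and 14 to 15 set yields two runs.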
    
    Signed-off-by: Qu Wenruo <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    adam900710 authored and kdave committed Oct 17, 2024
    Commit 938449b
  11. btrfs: mark all dirty sectors as locked inside writepage_delalloc()

    Currently we only mark sectors as locked if there is a *NEW* delalloc
    range for it.
    
    But a NEW delalloc range is not the same as the dirty sectors we want to
    submit, e.g.:
    
            0       32K      64K      96K       128K
            |       |////////||///////|    |////|
                                           120K
    
    For the above 64K page size case, writepage_delalloc() for page 0 will find
    and lock the delalloc range [32K, 96K), which is beyond the page
    boundary.
    
    Then when writepage_delalloc() is called for the page 64K, since [64K,
    96K) is already locked, only [120K, 128K) will be locked.
    
    This means, although range [64K, 96K) is dirty and will be submitted
    later by extent_writepage_io(), it will not be marked as locked.
    
    This is fine for now, as we call btrfs_folio_end_writer_lock_bitmap() to
    free every non-compressed sector, and compression is only allowed for
    full page range.
    
    But this is not safe for future sector perfect compression support, as
    this can lead to double folio unlock:
    
                  Thread A                 |           Thread B
    ---------------------------------------+--------------------------------
                                           | submit_one_async_extent()
                                           | |- extent_clear_unlock_delalloc()
    extent_writepage()                     |    |- btrfs_folio_end_writer_lock()
    |- btrfs_folio_end_writer_lock_bitmap()|       |- btrfs_subpage_end_and_test_writer()
       |                                   |       |  |- atomic_sub_and_test()
       |                                   |       |     /* Now the atomic value is 0 */
       |- if (atomic_read() == 0)          |       |
       |- folio_unlock()                   |       |- folio_unlock()
    
    The root cause is the above range [64K, 96K) is dirtied and should also
    be locked but it isn't.
    
    So to make everything more consistent and prepare for the incoming
    sector perfect compression, mark all dirty sectors as locked.
    
    Signed-off-by: Qu Wenruo <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    adam900710 authored and kdave committed Oct 17, 2024
    Commit 81b8cc5
  12. btrfs: allow compression even if the range is not page aligned

    Previously for btrfs with sector size smaller than page size (subpage),
    we only allowed compression if the range was fully page aligned.
    
    This was to work around the asynchronous submission of compressed ranges,
    which delays the page unlock and writeback into a workqueue;
    furthermore, asynchronous submission can lock multiple sector ranges
    across page boundaries.
    
    Such asynchronous submission makes it very hard to co-operate with other
    regular writes.
    
    With the recent changes to the subpage folio unlock path, now
    asynchronous submission of compressed pages can co-operate with regular
    submission, so enable sector perfect compression if it's an experimental
    build.
    
    The ETA for moving this feature out of experimental is 6.15, and I hope
    all remaining corner cases can be exposed before that.
    
    Signed-off-by: Qu Wenruo <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    adam900710 authored and kdave committed Oct 17, 2024
    Commit 6326713
  13. btrfs: avoid unnecessary device path update for the same device

    [PROBLEM]
    It is very common for udev to trigger a device scan, and every time a
    mounted btrfs device gets re-scanned from different soft links, we get
    some unnecessary device path updates. This is especially common
    for LVM based storage:
    
     # lvs
      scratch1 test -wi-ao---- 10.00g
      scratch2 test -wi-a----- 10.00g
      scratch3 test -wi-a----- 10.00g
      scratch4 test -wi-a----- 10.00g
      scratch5 test -wi-a----- 10.00g
      test     test -wi-a----- 10.00g
    
     # mkfs.btrfs -f /dev/test/scratch1
     # mount /dev/test/scratch1 /mnt/btrfs
     # dmesg -c
     [  205.705234] BTRFS: device fsid 7be2602f-9e35-4ecf-a6ff-9e91d2c182c9 devid 1 transid 6 /dev/mapper/test-scratch1 (253:4) scanned by mount (1154)
     [  205.710864] BTRFS info (device dm-4): first mount of filesystem 7be2602f-9e35-4ecf-a6ff-9e91d2c182c9
     [  205.711923] BTRFS info (device dm-4): using crc32c (crc32c-intel) checksum algorithm
     [  205.713856] BTRFS info (device dm-4): using free-space-tree
     [  205.722324] BTRFS info (device dm-4): checking UUID tree
    
    So far so good, but even if we just touch any soft link of
    "dm-4", we get several unnecessary device path updates.
    
     # touch /dev/mapper/test-scratch1
     # dmesg -c
     [  469.295796] BTRFS info: devid 1 device path /dev/mapper/test-scratch1 changed to /dev/dm-4 scanned by (udev-worker) (1221)
     [  469.300494] BTRFS info: devid 1 device path /dev/dm-4 changed to /dev/mapper/test-scratch1 scanned by (udev-worker) (1221)
    
    Such device path rename is unnecessary and can lead to random path
    change due to the udev race.
    
    [CAUSE]
    Inside device_list_add(), we are using a very primitive way of checking
    if the device has changed: strcmp().
    
    That can never handle links well, no matter if they are hard or soft
    links.
    
    So every different link of the same device will be treated as a different
    device, causing the unnecessary device path update.
    
    [FIX]
    Introduce a helper, is_same_device(), and use path_equal() to properly
    detect the same block device, so that different soft links won't trigger
    the rename race.
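
A user-space analogy shows why comparing names is not enough. The sketch below swaps the in-kernel path_equal() comparison for a stat() based one, which is an assumption made purely for illustration. `same_name()` and `same_target()` are hypothetical helpers and are not is_same_device().

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>
#include <sys/stat.h>

/* strcmp() style check: every distinct name looks like a different device. */
bool same_name(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}

/* Follow links and compare what the two names actually resolve to. */
bool same_target(const char *a, const char *b)
{
    struct stat sa, sb;

    if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
        return false;  /* unresolvable: treat as different */
    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

Two spellings of the same object, such as "/" and "/.", differ by name but compare equal once resolved. The same reasoning applies to "/dev/dm-4" and its links under "/dev/mapper/".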
    
    Reviewed-by: Filipe Manana <[email protected]>
    Link: https://bugzilla.suse.com/show_bug.cgi?id=1230641
    Reported-by: Fabian Vogt <[email protected]>
    Signed-off-by: Qu Wenruo <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    adam900710 authored and kdave committed Oct 17, 2024
    Commit f7f6d8e
  14. btrfs: canonicalize the device path before adding it

    [PROBLEM]
    Currently btrfs accepts any file path for its device, resulting in some
    weird situations:
    
     # ./mount_by_fd /dev/test/scratch1  /mnt/btrfs/
    
    The program has the following source code:
    
     #include <fcntl.h>
     #include <stdio.h>
     #include <sys/mount.h>
    
     int main(int argc, char *argv[]) {
    	int fd = open(argv[1], O_RDWR);
    	char path[256];
    	snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
    	return mount(path, argv[2], "btrfs", 0, NULL);
     }
    
    Then we can have the following weird device path:
    
     BTRFS: device fsid 2378be81-fe12-46d2-a9e8-68cf08dd98d5 devid 1 transid 7 /proc/self/fd/3 (253:2) scanned by mount_by_fd (18440)
    
    Normally it's not a big deal, and later udev can trigger a device path
    rename. But if udev didn't trigger, the device path "/proc/self/fd/3"
    will show up in mtab.
    
    [CAUSE]
    The filename "/proc/self/fd/3" refers to the opened file descriptor 3.
    In the above case, it's exactly the device we want to open, i.e. it points
    to "/dev/test/scratch1", which is another symlink pointing to "/dev/dm-2".
    
    Inside the kernel we resolve the mount source using LOOKUP_FOLLOW, which
    follows the symbolic link and grabs the proper block device.
    
    But inside btrfs we also save the filename into btrfs_device::name, and
    utilize that member to report our mount source, which leads to the above
    situation.
    
    [FIX]
    Instead of unconditionally trusting the path, check if the original file
    (not following the symbolic link) is inside "/dev/". If not, then
    manually look up the path to its final destination, and use that as our
    device path.
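
The decision described above can be approximated in user space as follows. This is only a sketch: it uses a plain string prefix test and realpath() in place of the kernel's path lookup, and `canonical_device_path()` is a hypothetical name, not the function added by the patch.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Store the device name to report in @out (PATH_MAX bytes); 0 on success. */
int canonical_device_path(const char *path, char out[PATH_MAX])
{
    /* Names given under /dev/ are kept as-is, links included. */
    if (strncmp(path, "/dev/", 5) == 0) {
        snprintf(out, PATH_MAX, "%s", path);
        return 0;
    }
    /* Anything else is resolved to its final destination. */
    return realpath(path, out) ? 0 : -1;
}
```

Under these assumptions a name like "/dev/mapper/test-scratch1" is reported unchanged, while a name outside "/dev/" is replaced by whatever it finally resolves to. A prefix test on the raw string is weaker than a real lookup, so this should be read as an illustration of the policy only.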
    
    This allows us to still use symbolic links, like
    "/dev/mapper/test-scratch" from LVM2, which is required for fstests runs
    with LVM2 setup.
    
    And really weird names, like in the above case, are resolved to
    "/dev/dm-2" instead.
    
    Reviewed-by: Filipe Manana <[email protected]>
    Link: https://bugzilla.suse.com/show_bug.cgi?id=1230641
    Reported-by: Fabian Vogt <[email protected]>
    Signed-off-by: Qu Wenruo <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    adam900710 authored and kdave committed Oct 17, 2024
    Commit 4f0ed68
  15. btrfs: remove code duplication in ordered extent finishing

    Remove the duplicated transaction joining, block reserve setting and raid
    extent inserting in btrfs_finish_ordered_extent().
    
    While at it, also abort the transaction in case inserting a RAID
    stripe-tree entry fails.
    
    Suggested-by: Naohiro Aota <[email protected]>
    Reviewed-by: Filipe Manana <[email protected]>
    Signed-off-by: Johannes Thumshirn <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    morbidrsa authored and kdave committed Oct 17, 2024
    Commit 9078516
  16. btrfs: qgroups: remove bytenr field from struct btrfs_qgroup_extent_record
    
    Now that we track qgroup extent records in an xarray we don't need to have
    a "bytenr" field in struct btrfs_qgroup_extent_record, since we can get
    it from the index of the record in the xarray.
    
    So remove the field and grab the bytenr from either the index key or any
    other place where it's available (delayed refs). This reduces the size of
    struct btrfs_qgroup_extent_record from 40 bytes down to 32 bytes, meaning
    that we now can store 128 instances of this structure instead of 102 per
    4K page.
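
The per-page figures follow from integer division of the page size by the structure size. A tiny check, where `records_per_page()` is a hypothetical helper used only to spell out the arithmetic:

```c
#include <assert.h>
#include <stddef.h>

/* How many fixed-size records fit in one page (integer division). */
size_t records_per_page(size_t page_size, size_t record_size)
{
    return page_size / record_size;
}
```

With a 4096 byte page, 40 byte records give 102 per page with 16 bytes left over, and 32 byte records give 128 with nothing left over.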
    
    Reviewed-by: Qu Wenruo <[email protected]>
    Signed-off-by: Filipe Manana <[email protected]>
    Reviewed-by: David Sterba <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    fdmanana authored and kdave committed Oct 17, 2024
    Commit d217a8f
  17. btrfs: store fs_info in a local variable at btrfs_qgroup_trace_extent_post()
    
    Instead of extracting fs_info from the transaction multiple times, store
    it in a local variable and use it.
    
    Reviewed-by: Qu Wenruo <[email protected]>
    Signed-off-by: Filipe Manana <[email protected]>
    Reviewed-by: David Sterba <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    fdmanana authored and kdave committed Oct 17, 2024
    Commit bc10046
  18. btrfs: remove unnecessary delayed refs locking at btrfs_qgroup_trace_extent()
    
    There's no need to hold the delayed refs spinlock when calling
    btrfs_qgroup_trace_extent_nolock() from btrfs_qgroup_trace_extent(), since
    it doesn't change anything in delayed refs and it only changes the xarray
    used to track qgroup extent records, which is protected by the xarray's
    lock.
    
    Holding the lock is only adding unnecessary lock contention with other
    tasks that actually need to take the lock to add/remove/change delayed
    references. So remove the locking.
    
    Reviewed-by: Qu Wenruo <[email protected]>
    Signed-off-by: Filipe Manana <[email protected]>
    Reviewed-by: David Sterba <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    fdmanana authored and kdave committed Oct 17, 2024
    Commit 7d2835e
  19. btrfs: always use delayed_refs local variable at btrfs_qgroup_trace_extent()
    
    Instead of dereferencing the delayed refs from the transaction multiple
    times, store it early in the local variable and then always use the
    variable.
    
    Reviewed-by: Qu Wenruo <[email protected]>
    Signed-off-by: Filipe Manana <[email protected]>
    Reviewed-by: David Sterba <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    fdmanana authored and kdave committed Oct 17, 2024
    Commit ad0bb2c
  20. btrfs: remove pointless initialization at btrfs_qgroup_trace_extent()

    The qgroup record was allocated with kzalloc(), so it's pointless to set
    its old_roots member to NULL. Remove the assignment.
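
A minimal userspace sketch of why the assignment is redundant, using calloc() as a stand-in for kzalloc(). The struct and the helper alloc_record() are hypothetical, not the kernel definitions, and the sketch assumes the usual platforms where all-bits-zero is a null pointer:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the qgroup extent record; not the kernel struct. */
struct record {
	unsigned long long bytenr;
	void *old_roots;
};

/* Userspace analogue of a kzalloc()-based allocation: calloc() zero-fills. */
struct record *alloc_record(unsigned long long bytenr)
{
	struct record *rec = calloc(1, sizeof(*rec));

	if (!rec)
		return NULL;
	rec->bytenr = bytenr;
	/* No "rec->old_roots = NULL;" here: the memory is already zeroed. */
	return rec;
}
```

Any member that is not explicitly set after the zeroing allocation reads back as zero, so only the non-zero fields need an assignment.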
    
    Reviewed-by: Qu Wenruo <[email protected]>
    Signed-off-by: Filipe Manana <[email protected]>
    Reviewed-by: David Sterba <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    fdmanana authored and kdave committed Oct 17, 2024
    c8421bc
  21. btrfs: remove redundant stop_loop variable in scrub_stripe()

    The variable stop_loop was originally introduced in commit 625f1c8
    ("Btrfs: improve the loop of scrub_stripe"). It was initialized to 0 in
    commit 3b080b2 ("Btrfs: scrub raid56 stripes in the right way").
    However, in a later commit 18d30ab ("btrfs: scrub: use
    scrub_simple_mirror() to handle RAID56 data stripe scrub"), the code
    that modified stop_loop was removed, making the variable redundant.
    
    Currently, stop_loop is only initialized with 0 and is never used or
    modified within the scrub_stripe() function. As a result, this patch
    removes the stop_loop variable to clean up the code and eliminate
    unnecessary redundancy.
    
    This change has no impact on functionality, as stop_loop was never
    utilized in any meaningful way in the final version of the code.
    
    Reviewed-by: Filipe Manana <[email protected]>
    Signed-off-by: Riyan Dhiman <[email protected]>
    Reviewed-by: David Sterba <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    Ryand1234 authored and kdave committed Oct 17, 2024
    e7f6492
  22. btrfs: remove unused page_to_inode and page_to_fs_info macros

    These macros are no longer used after the "btrfs: Cleaned up folio->page
    conversion" series patch [1] was applied, so remove them.
    
    [1]: https://patchwork.kernel.org/project/linux-btrfs/cover/[email protected]/
    
    Reviewed-by: Neal Gompa <[email protected]>
    Signed-off-by: Youling Tang <[email protected]>
    Reviewed-by: David Sterba <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    Youling Tang authored and kdave committed Oct 17, 2024
    21ac0bf
  23. btrfs: correct typos in multiple comments across various files

    Fix some confusing spelling errors that were recently identified. The
    details are as follows:
    
    	block-group.c: 2800: 	uncompressible 	==> incompressible
    	extent-tree.c: 3131:	EXTEMT		==> EXTENT
    	extent_io.c: 3124: 	utlizing 	==> utilizing
    	extent_map.c: 1323: 	ealier		==> earlier
    	extent_map.c: 1325:	possiblity	==> possibility
    	fiemap.c: 189:		emmitted	==> emitted
    	fiemap.c: 197:		emmitted	==> emitted
    	fiemap.c: 203:		emmitted	==> emitted
    	transaction.h: 36:	trasaction	==> transaction
    	volumes.c: 5312:	filesysmte	==> filesystem
    	zoned.c: 1977:		trasnsaction	==> transaction
    
    Signed-off-by: Shen Lichuan <[email protected]>
    Reviewed-by: David Sterba <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    Shen Lichuan authored and kdave committed Oct 17, 2024
    0647aa3
  24. btrfs: tests: add selftests for raid-stripe-tree

    Add a first batch of very basic self tests for the RAID stripe-tree.
    
    More test cases will follow exercising the tree.
    
    Reviewed-by: Filipe Manana <[email protected]>
    Signed-off-by: Johannes Thumshirn <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    morbidrsa authored and kdave committed Oct 17, 2024
    9c74f2c
  25. btrfs: remove unused btrfs_free_squota_rsv()

    btrfs_free_squota_rsv() was added in commit
    e85a0ad ("btrfs: ensure releasing squota reserve on head refs")
    but has remained unused since then.
    Remove it, as we don't seem to need it and it was probably a leftover.
    
    Reviewed-by: Qu Wenruo <[email protected]>
    Signed-off-by: Dr. David Alan Gilbert <[email protected]>
    Reviewed-by: David Sterba <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    Dr. David Alan Gilbert authored and kdave committed Oct 17, 2024
    1bbafcc
  26. btrfs: remove unused btrfs_is_parity_mirror()

    btrfs_is_parity_mirror() has been unused since commit 4886ff7
    ("btrfs: introduce a new helper to submit write bio for repair").
    Remove it as the code was refactored and we don't need the helper
    anymore.
    
    Reviewed-by: Qu Wenruo <[email protected]>
    Signed-off-by: Dr. David Alan Gilbert <[email protected]>
    Reviewed-by: David Sterba <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    Dr. David Alan Gilbert authored and kdave committed Oct 17, 2024
    3d8ac55
  27. btrfs: remove unused btrfs_try_tree_write_lock()

    btrfs_try_tree_write_lock() has been unused since commit
    50b21d7 ("btrfs: submit a writeback bio per extent_buffer").
    Remove it as we don't need it anymore.
    
    Reviewed-by: Christoph Hellwig <[email protected]>
    Reviewed-by: Qu Wenruo <[email protected]>
    Signed-off-by: Dr. David Alan Gilbert <[email protected]>
    Reviewed-by: David Sterba <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    Dr. David Alan Gilbert authored and kdave committed Oct 17, 2024
    7f92863
  28. btrfs: remove the dirty_page local variable

    Inside btrfs_buffered_write(), we have a local variable @dirty_pages,
    recording the number of pages we dirtied in the current iteration.
    
    However we do not really need that variable, since it can be calculated
    from @pos and @copied.
    
    In fact there is already a problem inside the short copy path, where we
    use @dirty_pages to calculate the range we need to release.
    But that usage assumes sectorsize == PAGE_SIZE, which is no longer true.
    
    Instead of keeping @dirty_pages and causing incorrect usage, just
    calculate the number of dirtied pages inside btrfs_dirty_pages().
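
As a rough illustration of deriving the page count from @pos and @copied, the arithmetic can be sketched as below. The helper name dirtied_pages() and the explicit page_size argument are assumptions for this example; the kernel code may round to sector boundaries differently:

```c
#include <assert.h>
#include <stddef.h>

/* Number of pages touched by the byte range [pos, pos + copied). */
size_t dirtied_pages(unsigned long long pos, size_t copied, size_t page_size)
{
	assert(page_size != 0);

	/* A zero-length copy dirties nothing. */
	if (copied == 0)
		return 0;

	/* Offset of the first byte inside its page. */
	size_t offset = (size_t)(pos % page_size);

	/* Round the span from the start of the first page up to whole pages. */
	return (offset + copied + page_size - 1) / page_size;
}
```

Because the count follows directly from the two values the caller already has, no separate counter needs to be carried through the loop.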
    
    Reviewed-by: Josef Bacik <[email protected]>
    Reviewed-by: Johannes Thumshirn <[email protected]>
    Signed-off-by: Qu Wenruo <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    adam900710 authored and kdave committed Oct 17, 2024
    3b7324c
  29. btrfs: simplify the page uptodate preparation for prepare_pages()

    Currently inside prepare_pages(), we handle the leading and trailing pages
    differently, and skip the middle pages (if any).  This is to avoid
    reading pages which are fully covered by the dirty range.
    
    Refactor the code by moving all checks (alignment check, range check,
    force read check) into prepare_uptodate_page().
    
    That way, prepare_pages() only needs to iterate over all the pages
    unconditionally.
    
    And since we're here, also update prepare_uptodate_page() to use the
    folio API rather than the old page API.
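
The per-page decision described above could look roughly like the following userspace model. The function page_needs_read() and its parameters are hypothetical, and the sketch assumes the page overlaps the written range; it is not the kernel's prepare_uptodate_page():

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model of the read-before-write decision for one page.
 * [pos, pos + len) is the range being written; the page covers
 * [page_start, page_start + page_size).
 */
bool page_needs_read(uint64_t page_start, uint64_t page_size,
		     uint64_t pos, uint64_t len,
		     bool uptodate, bool force_read)
{
	assert(page_size != 0);

	/* Already up to date: nothing to read. */
	if (uptodate)
		return false;
	/* The caller insists on a read. */
	if (force_read)
		return true;
	/* Fully covered by the write: every byte gets overwritten. */
	if (pos <= page_start && pos + len >= page_start + page_size)
		return false;
	/* Partially covered (leading or trailing page): read first. */
	return true;
}
```

With all checks inside one helper, the caller can invoke it for every page in the range and the middle pages simply fall out through the "fully covered" branch.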
    
    Reviewed-by: Johannes Thumshirn <[email protected]>
    Signed-off-by: Qu Wenruo <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    adam900710 authored and kdave committed Oct 17, 2024
    a85e63f
  30. btrfs: handle empty list of NOCOW ordered extents with checksum list

    Currently we BUG_ON() in btrfs_finish_one_ordered() if we are finishing
    an ordered extent that is flagged as NOCOW, but its checksum list is
    not empty.
    
    This is clearly a logic error which we can recover from by aborting the
    transaction.
    
    For developer builds which enable CONFIG_BTRFS_ASSERT, also ASSERT()
    that the list is empty.
    
    Suggested-by: Filipe Manana <[email protected]>
    Reviewed-by: Qu Wenruo <[email protected]>
    Reviewed-by: Filipe Manana <[email protected]>
    Signed-off-by: Johannes Thumshirn <[email protected]>
    Reviewed-by: David Sterba <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    morbidrsa authored and kdave committed Oct 17, 2024
    df5af25
  31. btrfs: return ENODATA in case RST lookup fails

    In case a lookup in the RAID stripe-tree fails, return ENODATA instead of
    ENOENT to better distinguish stripe-tree lookups from other code paths
    where we return ENOENT.
    
    Suggested-by: Josef Bacik <[email protected]>
    Reviewed-by: Josef Bacik <[email protected]>
    Signed-off-by: Johannes Thumshirn <[email protected]>
    Reviewed-by: David Sterba <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    morbidrsa authored and kdave committed Oct 17, 2024
    659f41e
  32. btrfs: scrub: skip initial RST lookup errors

    Performing the initial extent sector read on a RAID stripe-tree backed
    filesystem with pre-allocated extents will cause the RAID stripe-tree
    lookup code to return ENODATA, as pre-allocated extents do not have any
    on-disk bytes and thus no RAID stripe-tree entries.
    
    But the current scrub read code marks these extents as errors, because
    the lookup fails.
    
    If btrfs_map_block() returns -ENODATA, it means that the call to
    btrfs_get_raid_extent_offset() returned -ENODATA, because there is no
    entry for the corresponding range in the RAID stripe-tree. But as this
    range is in the extent tree it means we've hit a pre-allocated extent. In
    this case, don't mark the sector in the stripe's error bitmaps as faulty
    and carry on to the next.
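
The resulting rule can be sketched as a small predicate. The helper name lookup_marks_sector_faulty() is made up for illustration; only the -ENODATA convention comes from the commit message:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical helper: given the kernel-style return value of the block
 * mapping (0 or a negative errno), decide whether scrub should flag the
 * sector as faulty.
 */
bool lookup_marks_sector_faulty(int ret)
{
	assert(ret <= 0);

	/* Mapping succeeded: nothing to flag. */
	if (ret == 0)
		return false;
	/* No stripe-tree entry: a pre-allocated extent, so skip it. */
	if (ret == -ENODATA)
		return false;
	/* Any other failure is a real error. */
	return true;
}
```

Returning a dedicated errno from the stripe-tree lookup is what makes this distinction possible; with -ENOENT the skip case could not be told apart from other lookup failures.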
    
    Reviewed-by: Josef Bacik <[email protected]>
    Signed-off-by: Johannes Thumshirn <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    morbidrsa authored and kdave committed Oct 17, 2024
    e081590
  33. btrfs: qgroup: run delayed iputs after ordered extent completion

    When trying to flush qgroups in order to release space we run delayed
    iputs in order to release space from recently deleted files (their link
    count reached zero), and then we start delalloc and wait for any
    existing ordered extents to complete.
    
    However there's a time window here where we end up not doing the final
    iput on a deleted file which could release necessary space:
    
    1) An unlink operation starts;
    
    2) During the unlink, or right before it completes, delalloc is flushed
       and an ordered extent is created;
    
    3) When the ordered extent is created, the inode's ref count is
       incremented (with igrab() at alloc_ordered_extent());
    
    4) When the unlink finishes it doesn't drop the last reference on the
       inode and so it doesn't trigger inode eviction to delete all of
       the inode's items in its root and drop all references on its data
       extents;
    
    5) Another task enters try_flush_qgroup() to try to release space,
       it runs all delayed iputs, but there's no delayed iput yet for that
       deleted file because the ordered extent hasn't completed yet;
    
    6) Then at try_flush_qgroup() we wait for the ordered extent to complete
       and that results in adding a delayed iput at btrfs_put_ordered_extent()
       when called from btrfs_finish_one_ordered();
    
    7) Adding the delayed iput results in waking the cleaner kthread if it's
       not running already. However it may take some time for it to be
       scheduled, or it may be running but busy running auto defrag, dropping
       deleted snapshots or doing other work, so by the time we return from
       try_flush_qgroup() the space for the deleted file isn't released.
    
    Improve on this by running delayed iputs only after flushing delalloc
    and waiting for ordered extent completion.
    
    Reviewed-by: Qu Wenruo <[email protected]>
    Signed-off-by: Filipe Manana <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    fdmanana authored and kdave committed Oct 17, 2024
    8162aaa
  34. btrfs: remove btrfs_set_range_writeback()

    The function btrfs_set_range_writeback() was originally a callback for
    metadata and data, to mark a range with writeback flag.
    
    Then it was converted into a common function call for both metadata and
    data.
    
    From the very beginning, the function was only called on a full page,
    and it was later converted to handle a range inside a page.
    
    But it never needed to handle multiple pages, and since commit
    8189197 ("btrfs: refactor __extent_writepage_io() to do
    sector-by-sector submission") the function was only called on a
    sector-by-sector basis.
    
    This makes the function unnecessary, and it can be replaced with a simple
    btrfs_folio_set_writeback() call instead.
    
    Signed-off-by: Qu Wenruo <[email protected]>
    Reviewed-by: David Sterba <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    adam900710 authored and kdave committed Oct 17, 2024
    0b2308e
  35. btrfs: zstd: assert the timer pointer in callback

    Make sure we got the right timer struct for the zstd workspace reclaim
    work.
    
    Reviewed-by: Anand Jain <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    kdave committed Oct 17, 2024
    bde9f20
  36. btrfs: drop unused parameter path from btrfs_tree_mod_log_rewind()

    The path parameter was used for our own locking, that got converted to
    rwsem eventually. Last usage in ac5887c ("btrfs: locking: remove
    all the blocking helpers").
    
    Reviewed-by: Anand Jain <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    kdave committed Oct 17, 2024
    1d19c12
  37. btrfs: drop unused parameter ctx from batch_delete_dir_index_items()

    The ctx parameter is not used, we can drop it.
    
    Reviewed-by: Anand Jain <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    kdave committed Oct 17, 2024
    3994109
  38. btrfs: drop unused parameter fs_info from wait_reserve_ticket()

    The parameter is not used, we can also reach it from the space info if
    needed in the future.
    
    Reviewed-by: Anand Jain <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    kdave committed Oct 17, 2024
    4995054
  39. btrfs: drop unused parameter fs_info from do_reclaim_sweep()

    The parameter is unused and we can get it from space info if needed.
    
    Reviewed-by: Anand Jain <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    kdave committed Oct 17, 2024
    51f6c3a
  40. btrfs: send: drop unused parameter num from iterate_inode_ref_t callbacks
    
    None of the ref iteration callbacks needs the num parameter (this is for
    the directory item iteration), so we can drop it.
    
    Reviewed-by: Anand Jain <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    kdave committed Oct 17, 2024
    8721e68
  41. btrfs: send: drop unused parameter index from iterate_inode_ref_t callbacks
    
    None of the ref iteration callbacks needs the index parameter (this is
    for the directory item iteration), so we can drop it.
    
    Reviewed-by: Anand Jain <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    kdave committed Oct 17, 2024
    3555fea
  42. btrfs: scrub: drop unused parameter sctx from scrub_submit_extent_sector_read()
    
    The parameter is unused and we can reach sctx from scrub stripe if
    needed.
    
    Reviewed-by: Anand Jain <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    kdave committed Oct 17, 2024
    2aa366d
  43. btrfs: drop unused parameter map from scrub_simple_mirror()

    The parameter map used to be passed to scrub_extent() until
    e02ee89 ("btrfs: scrub: switch scrub_simple_mirror() to
    scrub_stripe infrastructure"), where the scrub implementation was
    completely reworked.
    
    Reviewed-by: Anand Jain <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    kdave committed Oct 17, 2024
    e6c00f4
  44. btrfs: qgroup: drop unused parameter fs_info from __del_qgroup_rb()

    We don't need fs_info here, everything is reachable from qgroup.
    
    Reviewed-by: Anand Jain <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    kdave committed Oct 17, 2024
    f9bd555
  45. btrfs: drop unused transaction parameter from btrfs_qgroup_add_swapped_blocks()
    
    The caller replace_path() runs under transaction but we don't need it in
    btrfs_qgroup_add_swapped_blocks().
    
    Reviewed-by: Anand Jain <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    kdave committed Oct 17, 2024
    e3c79ce
  46. btrfs: lzo: drop unused parameter level from lzo_alloc_workspace()

    The LZO compression has only one level, we don't need to pass the
    parameter.
    
    Reviewed-by: Anand Jain <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    kdave committed Oct 17, 2024
    3beb0db
  47. btrfs: drop unused parameter argp from btrfs_ioctl_quota_rescan_wait()

    We don't need the user passed parameter, rescan is a filesystem
    operation so fs_info is sufficient.
    
    Reviewed-by: Anand Jain <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    kdave committed Oct 17, 2024
    7e22750
  48. btrfs: drop unused parameter inode from read_inline_extent()

    We don't need the inode pointer to read inline extent, it's all
    accessible from the path pointer.
    
    Reviewed-by: Anand Jain <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    kdave committed Oct 17, 2024
    24f6fd6
  49. btrfs: drop unused parameter offset from __cow_file_range_inline()

    We don't need offset for inline extents, they always start from 0.
    
    Reviewed-by: Anand Jain <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    kdave committed Oct 17, 2024
    05d2682
  50. btrfs: drop unused parameter file_offset from btrfs_encoded_read_regular_fill_pages()
    
    The file_offset parameter used to be passed to encoded read struct but
    was removed in commit b665aff ("btrfs: remove unused members from
    struct btrfs_encoded_read_private").
    
    Reviewed-by: Anand Jain <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    kdave committed Oct 17, 2024
    2c007d2
  51. btrfs: drop unused parameter iov_iter from btrfs_write_check()

    The parameter 'from' has never been used since commit b8d8e1f
    ("btrfs: introduce btrfs_write_check()"), this is for buffered write.
    Direct io write needs it so it was probably an interface thing, but we
    can drop it.
    
    Reviewed-by: Anand Jain <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    kdave committed Oct 17, 2024
    3c09d3b
  52. btrfs: drop unused parameter refs from visit_node_for_delete()

    The parameter duplicates what can be effectively obtained from
    wc->refs[level - 1] and this is what's actually used inside. Added in
    commit 2b73c7e ("btrfs: unify logic to decide if we need to walk
    down into a node during snapshot delete").
    
    Reviewed-by: Anand Jain <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    kdave committed Oct 17, 2024
    afe6a70
  53. btrfs: drop unused parameter mask from try_release_extent_state()

    The mask parameter used for allocations got unified to GFP_NOFS and
    removed from relevant functions in 1d12680 ("btrfs: drop gfp from
    parameter extent state helpers").
    
    Reviewed-by: Anand Jain <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    kdave committed Oct 17, 2024
    ce8e39d
  54. btrfs: drop unused parameter fs_info from folio_range_has_eb()

    The parameter was added in 8ff8466 ("btrfs: support subpage for
    extent buffer page release") for page but hasn't been used since, so we
    can drop it.
    
    Reviewed-by: Anand Jain <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    kdave committed Oct 17, 2024
    2f3f009
  55. btrfs: drop unused parameter options from open_ctree()

    Since the new mount option parser in commit ad21f15 ("btrfs:
    switch to the new mount API") we don't pass the options like that
    anymore.
    
    Reviewed-by: Anand Jain <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    kdave committed Oct 17, 2024
    4836659
  56. btrfs: drop unused parameter data from btrfs_fill_super()

    The only caller passes NULL, we can drop the parameter. This is since
    the new mount option parser done in 3bb17a2 ("btrfs: add get_tree
    callback for new mount API").
    
    Reviewed-by: Anand Jain <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    kdave committed Oct 17, 2024
    3fcafae
  57. btrfs: drop unused parameter transaction from alloc_log_tree()

    The function got split in commit 6ab6ebb ("btrfs: split
    alloc_log_tree()") and since then transaction parameter has been unused.
    
    Reviewed-by: Anand Jain <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    kdave committed Oct 17, 2024
    bea54c5
  58. btrfs: drop unused parameter fs_info from btrfs_match_dir_item_name()

    Cascaded removal of fs_info that is not needed in several functions.
    
    Reviewed-by: Anand Jain <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    kdave committed Oct 17, 2024
    9c0eded
  59. btrfs: drop unused parameter level from alloc_heuristic_ws()

    The compression heuristic pass does not need a level, so we can drop the
    parameter.
    
    Reviewed-by: Anand Jain <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    kdave committed Oct 17, 2024
    caebb14
  60. btrfs: zoned: fix zone unusable accounting for freed reserved extent

    When btrfs reserves an extent and does not use it (e.g., due to an error), it
    calls btrfs_free_reserved_extent() to free the reserved extent. In the
    process, it calls btrfs_add_free_space() and then it accounts the region
    bytes as block_group->zone_unusable.
    
    However, it leaves the space_info->bytes_zone_unusable side not updated. As
    a result, ENOSPC can happen even though a space_info reservation succeeded.
    The reservation succeeds because the freed region is not added to
    space_info->bytes_zone_unusable, leaving that space as "free". On the other
    hand, the corresponding block group counts it as zone_unusable and its
    allocation pointer is not rewound, so we cannot allocate an extent from that
    block group. That will also negate space_info's async/sync reclaim process,
    and cause an ENOSPC error from the extent allocation process.
    
    Fix that by returning the space to space_info->bytes_zone_unusable.
    Ideally, since a bio is not submitted for this reserved region, we should
    return the space to free space and rewind the allocation pointer. However,
    that needs a rework of the extent allocation handling, so let it work this
    way for now.
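
The mismatch between the two counters can be modelled with a toy example. The field names follow the commit message, but the struct layouts, the helpers and the `fixed` switch are invented for this sketch and do not mirror the kernel code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy counters; field names follow the commit message, the layout is made up. */
struct space_info {
	uint64_t total_bytes;
	uint64_t bytes_reserved;
	uint64_t bytes_zone_unusable;
};

struct block_group {
	struct space_info *space_info;
	uint64_t zone_unusable;
};

/* Space the space_info level believes can still be reserved. */
uint64_t space_info_available(const struct space_info *si)
{
	return si->total_bytes - si->bytes_reserved - si->bytes_zone_unusable;
}

/* Free an unused reservation on a zoned block group. */
void free_reserved_extent(struct block_group *bg, uint64_t len, bool fixed)
{
	assert(bg->space_info->bytes_reserved >= len);
	bg->space_info->bytes_reserved -= len;
	/* The allocation pointer is not rewound, so the region is unusable. */
	bg->zone_unusable += len;
	/* The fix: account the same bytes on the space_info side too. */
	if (fixed)
		bg->space_info->bytes_zone_unusable += len;
}
```

Without the last step the space_info reports the freed bytes as available while the block group cannot hand them out, which is the situation that leads to the unexpected ENOSPC.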
    
    Fixes: 169e0da ("btrfs: zoned: track unusable bytes for zones")
    CC: [email protected] # 5.15+
    Reviewed-by: Johannes Thumshirn <[email protected]>
    Signed-off-by: Naohiro Aota <[email protected]>
    Reviewed-by: David Sterba <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    naota authored and kdave committed Oct 17, 2024
    9bbc899
  61. btrfs: fix error propagation of split bios

    The purpose of btrfs_bbio_propagate_error() is to propagate an error
    of a split bio to its original btrfs_bio, and report the error to the upper
    layer. However, it does not work well in some cases.
    
    * Case 1. Immediate (or quick) end_bio with an error
    
    When btrfs sends btrfs_bio to mirrored devices, btrfs calls
    btrfs_bio_end_io() when all the mirroring bios are completed. If that
    btrfs_bio was split, it is from btrfs_clone_bioset and its end_io function
    is btrfs_orig_write_end_io. For this case, btrfs_bbio_propagate_error()
    accesses the orig_bbio's bio context to increase the error count.
    
    That works well in most cases. However, if the end_io is called fast
    enough, orig_bbio's (remaining part after split) bio context may not be
    properly set at that time. Since the bio context is set when the orig_bbio
    (the last btrfs_bio) is sent to devices, that might be too late for earlier
    split btrfs_bio's completion.  That will result in a NULL pointer
    dereference.
    
    That bug is easily reproducible by running btrfs/146 on zoned devices [1]
    and it shows the following trace.
    
    [1] You need the raid-stripe-tree feature, as it creates a "-d raid0 -m raid1" FS.
    
      BUG: kernel NULL pointer dereference, address: 0000000000000020
      #PF: supervisor read access in kernel mode
      #PF: error_code(0x0000) - not-present page
      PGD 0 P4D 0
      Oops: Oops: 0000 [#1] PREEMPT SMP PTI
      CPU: 1 UID: 0 PID: 13 Comm: kworker/u32:1 Not tainted 6.11.0-rc7-BTRFS-ZNS+ #474
      Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
      Workqueue: writeback wb_workfn (flush-btrfs-5)
      RIP: 0010:btrfs_bio_end_io+0xae/0xc0 [btrfs]
      BTRFS error (device dm-0): bdev /dev/mapper/error-test errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
      RSP: 0018:ffffc9000006f248 EFLAGS: 00010246
      RAX: 0000000000000000 RBX: ffff888005a7f080 RCX: ffffc9000006f1dc
      RDX: 0000000000000000 RSI: 000000000000000a RDI: ffff888005a7f080
      RBP: ffff888011dfc540 R08: 0000000000000000 R09: 0000000000000001
      R10: ffffffff82e508e0 R11: 0000000000000005 R12: ffff88800ddfbe58
      R13: ffff888005a7f080 R14: ffff888005a7f158 R15: ffff888005a7f158
      FS:  0000000000000000(0000) GS:ffff88803ea80000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000020 CR3: 0000000002e22006 CR4: 0000000000370ef0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <TASK>
       ? __die_body.cold+0x19/0x26
       ? page_fault_oops+0x13e/0x2b0
       ? _printk+0x58/0x73
       ? do_user_addr_fault+0x5f/0x750
       ? exc_page_fault+0x76/0x240
       ? asm_exc_page_fault+0x22/0x30
       ? btrfs_bio_end_io+0xae/0xc0 [btrfs]
       ? btrfs_log_dev_io_error+0x7f/0x90 [btrfs]
       btrfs_orig_write_end_io+0x51/0x90 [btrfs]
       dm_submit_bio+0x5c2/0xa50 [dm_mod]
       ? find_held_lock+0x2b/0x80
       ? blk_try_enter_queue+0x90/0x1e0
       __submit_bio+0xe0/0x130
       ? ktime_get+0x10a/0x160
       ? lockdep_hardirqs_on+0x74/0x100
       submit_bio_noacct_nocheck+0x199/0x410
       btrfs_submit_bio+0x7d/0x150 [btrfs]
       btrfs_submit_chunk+0x1a1/0x6d0 [btrfs]
       ? lockdep_hardirqs_on+0x74/0x100
       ? __folio_start_writeback+0x10/0x2c0
       btrfs_submit_bbio+0x1c/0x40 [btrfs]
       submit_one_bio+0x44/0x60 [btrfs]
       submit_extent_folio+0x13f/0x330 [btrfs]
       ? btrfs_set_range_writeback+0xa3/0xd0 [btrfs]
       extent_writepage_io+0x18b/0x360 [btrfs]
       extent_write_locked_range+0x17c/0x340 [btrfs]
       ? __pfx_end_bbio_data_write+0x10/0x10 [btrfs]
       run_delalloc_cow+0x71/0xd0 [btrfs]
       btrfs_run_delalloc_range+0x176/0x500 [btrfs]
       ? find_lock_delalloc_range+0x119/0x260 [btrfs]
       writepage_delalloc+0x2ab/0x480 [btrfs]
       extent_write_cache_pages+0x236/0x7d0 [btrfs]
       btrfs_writepages+0x72/0x130 [btrfs]
       do_writepages+0xd4/0x240
       ? find_held_lock+0x2b/0x80
       ? wbc_attach_and_unlock_inode+0x12c/0x290
       ? wbc_attach_and_unlock_inode+0x12c/0x290
       __writeback_single_inode+0x5c/0x4c0
       ? do_raw_spin_unlock+0x49/0xb0
       writeback_sb_inodes+0x22c/0x560
       __writeback_inodes_wb+0x4c/0xe0
       wb_writeback+0x1d6/0x3f0
       wb_workfn+0x334/0x520
       process_one_work+0x1ee/0x570
       ? lock_is_held_type+0xc6/0x130
       worker_thread+0x1d1/0x3b0
       ? __pfx_worker_thread+0x10/0x10
       kthread+0xee/0x120
       ? __pfx_kthread+0x10/0x10
       ret_from_fork+0x30/0x50
       ? __pfx_kthread+0x10/0x10
       ret_from_fork_asm+0x1a/0x30
       </TASK>
      Modules linked in: dm_mod btrfs blake2b_generic xor raid6_pq rapl
      CR2: 0000000000000020
    
    * Case 2. Earlier completion of orig_bbio for mirrored btrfs_bios
    
    btrfs_bbio_propagate_error() assumes the end_io function for orig_bbio is
    called last among the split bios. In that case, btrfs_orig_write_end_io()
    sets bio->bi_status to BLK_STS_IOERR after seeing bioc->error [2].
    Otherwise, nobody checks the increased bioc->error of the orig_bio, and
    BLK_STS_OK is returned to the upper layer.
    
    [2] Actually, this is not true. Because we only increase orig_bioc->errors
    by max_errors, the condition "atomic_read(&bioc->error) > bioc->max_errors"
    is still not met if only one split btrfs_bio fails.
    
    * Case 3. Later completion of orig_bbio for un-mirrored btrfs_bios
    
    In contrast to the above case, btrfs_bbio_propagate_error() does not work
    well if an un-mirrored orig_bbio is completed last. It sets
    orig_bbio->bio.bi_status to the btrfs_bio's error, but that is easily
    overwritten by orig_bbio's own completion status. If that status is
    BLK_STS_OK, the upper layer never learns about the failure.
    
    * Solution
    
    Considering the above cases, we can only save the error status in the
    orig_bbio (remaining part after split) itself as it is always
    available. Also, the saved error status should be propagated when all the
    split btrfs_bios are finished (i.e., bbio->pending_ios == 0).
    
    This commit introduces "status" to btrfs_bbio and saves the first error of
    split bios to original btrfs_bio's "status" variable. When all the split
    bios are finished, the saved status is loaded into original btrfs_bio's
    status.
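
    The idea can be sketched in user space as follows. This is a simplified
    model, not the kernel code: struct model_bbio, its fields and the STS_*
    values are hypothetical stand-ins, and C11 atomics take the place of the
    kernel's atomic helpers. Every finished part records the first error it
    sees, and whichever part finishes last publishes the saved status.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-ins for block-layer status codes; the values are illustrative. */
enum { STS_OK = 0, STS_IOERR = 10 };

/* Hypothetical, simplified model of the original bbio. */
struct model_bbio {
	atomic_int pending_ios; /* parts (splits plus remainder) still in flight */
	atomic_int status;      /* first error seen so far, or STS_OK */
	int final_status;       /* what the upper layer finally observes */
	bool completed;
};

static void model_bbio_init(struct model_bbio *orig, int nr_ios)
{
	atomic_init(&orig->pending_ios, nr_ios);
	atomic_init(&orig->status, STS_OK);
	orig->final_status = STS_OK;
	orig->completed = false;
}

/* Called once for every finished part, in any order. */
static void model_bbio_end_io(struct model_bbio *orig, int status)
{
	if (status != STS_OK) {
		int expected = STS_OK;

		/* Only the first error is kept; later ones are ignored. */
		atomic_compare_exchange_strong(&orig->status, &expected, status);
	}

	/* Whoever drops the counter to zero publishes the saved status. */
	if (atomic_fetch_sub(&orig->pending_ios, 1) == 1) {
		assert(!orig->completed);
		orig->final_status = atomic_load(&orig->status);
		orig->completed = true;
	}
}
```

    Because the status lives in the original bbio and is only read once the
    counter reaches zero, the result does not depend on which part completes
    last.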
    
    With this commit, btrfs/146 on zoned devices does not hit the NULL pointer
    dereference anymore.
    
    Fixes: 852eee6 ("btrfs: allow btrfs_submit_bio to split bios")
    CC: [email protected] # 6.6+
    Reviewed-by: Qu Wenruo <[email protected]>
    Reviewed-by: Christoph Hellwig <[email protected]>
    Reviewed-by: Johannes Thumshirn <[email protected]>
    Signed-off-by: Naohiro Aota <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    naota authored and kdave committed Oct 17, 2024
    467b190
  62. btrfs: clear force-compress on remount when compress mount option is given
    
    After the migration to use fs context for processing mount options we had
    a slight change in the semantics for remounting a filesystem that was
    mounted with compress-force. Before, we could clear compress-force by
    passing only "-o compress[=algo]" during a remount. After that change this
    no longer works: compress-force is still present and one needs to pass
    "-o compress-force=no,compress[=algo]" to the mount command.
    
    For example, when running on a 6.8+ kernel:
    
      $ mount -o compress-force=zlib:9 /dev/sdi /mnt/sdi
      $ mount | grep sdi
      /dev/sdi on /mnt/sdi type btrfs (rw,relatime,compress-force=zlib:9,discard=async,space_cache=v2,subvolid=5,subvol=/)
    
      $ mount -o remount,compress=zlib:5 /mnt/sdi
      $ mount | grep sdi
      /dev/sdi on /mnt/sdi type btrfs (rw,relatime,compress-force=zlib:5,discard=async,space_cache=v2,subvolid=5,subvol=/)
    
    On a 6.7 kernel (or older):
    
      $ mount -o compress-force=zlib:9 /dev/sdi /mnt/sdi
      $ mount | grep sdi
      /dev/sdi on /mnt/sdi type btrfs (rw,relatime,compress-force=zlib:9,discard=async,space_cache=v2,subvolid=5,subvol=/)
    
      $ mount -o remount,compress=zlib:5 /mnt/sdi
      $ mount | grep sdi
      /dev/sdi on /mnt/sdi type btrfs (rw,relatime,compress=zlib:5,discard=async,space_cache=v2,subvolid=5,subvol=/)
    
    So update btrfs_parse_param() to clear "compress-force" when "compress" is
    given, providing the same semantics as kernel 6.7 and older.
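
    The restored semantics can be modelled with a small sketch. It is not the
    body of btrfs_parse_param(): struct model_opts and model_parse_param() are
    hypothetical names, and algorithm or level arguments are left out. The
    only point shown is that a plain "compress" resets the force flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the compression-related mount state. */
struct model_opts {
	bool compress;
	bool force_compress;
};

/* Apply one option key; returns false for keys this sketch does not know. */
static bool model_parse_param(struct model_opts *o, const char *key)
{
	assert(o && key);

	if (strcmp(key, "compress") == 0) {
		o->compress = true;
		o->force_compress = false; /* plain compress clears force */
		return true;
	}
	if (strcmp(key, "compress-force") == 0) {
		o->compress = true;
		o->force_compress = true;
		return true;
	}
	return false;
}
```

    Applying "compress-force" and then "compress" to the same model_opts
    leaves compress set and force_compress cleared, which matches the 6.7
    output shown above.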
    
    Reported-by: Roman Mamedov <[email protected]>
    Link: https://lore.kernel.org/linux-btrfs/20241014182416.13d0f8b0@nvm/
    CC: [email protected] # 6.8+
    Signed-off-by: Filipe Manana <[email protected]>
    Reviewed-by: David Sterba <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    fdmanana authored and kdave committed Oct 17, 2024
    b54ce3d
  63. btrfs: reduce lock contention when eb cache miss for btree search

    When crawling a btree, if an eb cache miss occurs, we change to use the eb
    read lock and release all previous locks (including the parent lock) to
    reduce lock contention.
    
    If an eb cache miss occurs in a leaf and IO is needed, before this change
    we released locks only from level 2 and up, and read the leaf's content
    from disk while holding a lock on its parent (level 1), causing
    unnecessary lock contention on the parent. After this change we release
    locks from level 1 and up, lock level 0, and then read the leaf's content
    from disk.
    
    Because we have prepared the check parameters and hold the read lock of
    the eb, we can ensure that no race occurs during the check and causes
    unexpected errors.
    
    Reviewed-by: Filipe Manana <[email protected]>
    Signed-off-by: Robbie Ko <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    Robbie Ko authored and kdave committed Oct 17, 2024
    b18732b
  64. btrfs: add and use helper to remove extent map from its inode's tree

    Move the common code to remove an extent map from its inode's tree into a
    helper function and use it, reducing duplicated code.
    
    Reviewed-by: Josef Bacik <[email protected]>
    Signed-off-by: Filipe Manana <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    fdmanana authored and kdave committed Oct 17, 2024
    7e1135c
  65. btrfs: make the extent map shrinker run asynchronously as a work queue job
    
    Currently the extent map shrinker is run synchronously for kswapd tasks
    that end up calling the fs shrinker (fs/super.c:super_cache_scan()).
    This has some disadvantages and for some heavy workloads with memory
    pressure it can cause some delays and stalls that make a machine
    unresponsive for some periods. This happens because:
    
    1) We can have several kswapd tasks on machines with multiple NUMA zones,
       and running the extent map shrinker concurrently can cause high
       contention on some spin locks, namely the spin locks that protect
       the radix tree that tracks roots, the per root xarray that tracks
       open inodes and the list of delayed iputs. This not only delays the
       shrinker but also causes high CPU consumption and makes the task
       running the shrinker monopolize a core, resulting in the symptoms
       of an unresponsive system. This was noted in previous commits such as
       commit ae1e766 ("btrfs: only run the extent map shrinker from
       kswapd tasks");
    
    2) The extent map shrinker's iteration over inodes can often be slow, even
       after changing the data structure that tracks open inodes for a root
       from a red black tree (up to kernel 6.10) to an xarray (kernel 6.10+).
       While the transition to the xarray made things a bit faster, it's
       still somewhat slow - for example in a test scenario with 10000 inodes
       that have no extent maps loaded, the extent map shrinker took between
       5ms and 8ms, using a release, non-debug kernel. Iterating over the
       extent maps of an inode can also be slow if we have an inode with many
       thousands of extent maps, since we use a red black tree to track and
       search extent maps. So having the extent map shrinker run synchronously
       adds extra delay for other things a kswapd task does.
    
    So make the extent map shrinker run asynchronously as a job for the
    system unbounded workqueue, just like what we do for data and metadata
    space reclaim jobs.
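
    The hand-off can be pictured with a single-threaded model. It is a sketch
    under stated assumptions, not the kernel workqueue API: struct
    model_shrinker and both functions are hypothetical. The caller only
    records a request and returns at once, and a request made while a job is
    still pending is dropped rather than queued a second time.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model: one shrinker job, queued at most once at a time. */
struct model_shrinker {
	atomic_bool queued;  /* true while a job is waiting or running */
	long nr_to_scan;     /* request recorded for the worker */
	long scanned_total;  /* work done by the worker so far */
};

/* Caller side (memory reclaim): never does the scan itself. */
static bool model_queue_shrinker(struct model_shrinker *s, long nr_to_scan)
{
	bool expected = false;

	if (!atomic_compare_exchange_strong(&s->queued, &expected, true))
		return false; /* a job is already pending; nothing to do */
	s->nr_to_scan = nr_to_scan;
	return true;
}

/* Worker side: runs later, outside the caller's context. */
static void model_shrinker_worker(struct model_shrinker *s)
{
	assert(atomic_load(&s->queued));
	s->scanned_total += s->nr_to_scan;
	s->nr_to_scan = 0;
	atomic_store(&s->queued, false);
}
```

    The model runs the worker by hand; it only illustrates that the requesting
    task is never delayed by the scan itself.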
    
    Reviewed-by: Josef Bacik <[email protected]>
    Signed-off-by: Filipe Manana <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    fdmanana authored and kdave committed Oct 17, 2024
    d4eefbc
  66. btrfs: simplify tracking progress for the extent map shrinker

    Now that the extent map shrinker can only be run by a single task (as a
    work queue item) there is no need to keep the progress of the shrinker
    protected by a spinlock or to pass the progress to trace events as
    parameters. So remove the lock and simplify the arguments for the trace
    events.
    
    Reviewed-by: Josef Bacik <[email protected]>
    Signed-off-by: Filipe Manana <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    fdmanana authored and kdave committed Oct 17, 2024
    ad6f27e
  67. btrfs: rename extent map shrinker members from struct btrfs_fs_info

    The names for the members of struct btrfs_fs_info related to the extent
    map shrinker are a bit too long, so rename them to be shorter by replacing
    the "extent_map_" prefix with the "em_" prefix.
    
    Reviewed-by: Josef Bacik <[email protected]>
    Signed-off-by: Filipe Manana <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    fdmanana authored and kdave committed Oct 17, 2024
    94a09da
  68. btrfs: re-enable the extent map shrinker

    Now that the extent map shrinker can only be run by a single task and runs
    asynchronously as a work queue job, enable it as it can no longer cause
    stalls on tasks allocating memory and entering the extent map shrinker
    through the fs shrinker (implemented by btrfs_free_cached_objects()).
    
    This is crucial to prevent exhaustion of memory due to unbounded extent
    map creation, primarily with direct IO but also for buffered IO on files
    with holes. This problem, for the direct IO case, was first reported in
    the Link tag below. That report was added to a Link tag of the first patch
    that introduced the extent map shrinker, commit 956a17d ("btrfs: add
    a shrinker for extent maps"), however the Link tag disappeared somehow
    from the committed patch (but was included in the submitted patch to the
    mailing list), so adding it below for future reference.
    
    Link: https://lore.kernel.org/linux-btrfs/[email protected]/
    Reviewed-by: Josef Bacik <[email protected]>
    Signed-off-by: Filipe Manana <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    fdmanana authored and kdave committed Oct 17, 2024
    5a6cada
  69. btrfs: remove redundant level argument from read_block_for_search()

    The level parameter passed to read_block_for_search() always matches the
    level of the extent buffer passed in the "eb_ret" parameter, which we are
    also extracting into the "parent_level" local variable.
    
    So remove the level parameter and instead use the "parent_level" variable
    which in fact has a better name (it's the level of the parent node from
    which we are reading a child node/leaf).
    
    Signed-off-by: Filipe Manana <[email protected]>
    Reviewed-by: David Sterba <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    fdmanana authored and kdave committed Oct 17, 2024
    0f9677a
  70. btrfs: simplify arguments for btrfs_verify_level_key()

    The only caller of btrfs_verify_level_key() is read_block_for_search() and
    it's passing 3 arguments to it that can be extracted from its on stack
    variable of type struct btrfs_tree_parent_check.
    
    So change btrfs_verify_level_key() to accept an argument of type
    struct btrfs_tree_parent_check instead of level, first key and parent
    transid arguments.
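
    The shape of the change can be illustrated with stand-in types. All names
    below (model_key, model_parent_check, model_eb, model_verify_level_key)
    are hypothetical and the checks are simplified; the sketch only shows one
    structure replacing three separate arguments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for the kernel types named in the commit text. */
struct model_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

struct model_parent_check {
	uint64_t transid;           /* expected generation of the child */
	struct model_key first_key; /* expected first key of the child */
	bool has_first_key;
	uint8_t level;              /* expected level of the child */
};

struct model_eb {
	uint64_t generation;
	uint8_t level;
	struct model_key first_key;
};

/* Returns 0 when the buffer matches what the parent expects, -1 otherwise. */
static int model_verify_level_key(const struct model_eb *eb,
				  const struct model_parent_check *check)
{
	assert(eb && check);

	if (eb->level != check->level)
		return -1;
	if (eb->generation != check->transid)
		return -1;
	if (!check->has_first_key)
		return 0;
	if (eb->first_key.objectid != check->first_key.objectid ||
	    eb->first_key.type != check->first_key.type ||
	    eb->first_key.offset != check->first_key.offset)
		return -1;
	return 0;
}
```

    A caller that already keeps such a structure on its stack passes its
    address instead of copying three fields out of it.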
    
    Signed-off-by: Filipe Manana <[email protected]>
    Reviewed-by: David Sterba <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    fdmanana authored and kdave committed Oct 17, 2024
    7f066ce
  71. btrfs: remove redundant initializations for struct btrfs_tree_parent_check
    
    It's pointless to initialize the has_first_key field of the stack local
    btrfs_tree_parent_check structure at btrfs_tree_parent_check() and at
    btrfs_qgroup_trace_subtree() since all fields not explicitly initialized
    are zeroed out. In the case of the first function it's a bit odd because
    we are assigning 0 and the field is of type bool, however not incorrect
    since a 0 is converted to false.
    
    Just remove the explicit initializations due to their redundancy.
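
    The C rule this relies on is that members left out of an initializer list
    are implicitly zeroed. A minimal sketch with a hypothetical stand-in
    structure (demo_parent_check is not the kernel type):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in with only the fields needed for the example. */
struct demo_parent_check {
	uint64_t transid;
	bool has_first_key;
	uint8_t level;
};

/* Builds a check the way a caller with a stack-local structure might. */
static struct demo_parent_check demo_make_check(uint8_t level, uint64_t transid)
{
	/* has_first_key is not named here, so it is zero-initialized (false). */
	struct demo_parent_check check = {
		.level = level,
		.transid = transid,
	};

	assert(!check.has_first_key);
	return check;
}
```

    Writing ".has_first_key = false" (or "= 0") in such an initializer changes
    nothing, which is why it can be dropped.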
    
    Signed-off-by: Filipe Manana <[email protected]>
    Reviewed-by: David Sterba <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    fdmanana authored and kdave committed Oct 17, 2024
    28cea0d
  72. btrfs: remove local generation variable from read_block_for_search()

    It's redundant to have the 'gen' variable since we already have the same
    value in the local btrfs_tree_parent_check structure. So remove it and
    instead use the structure's field.
    
    Signed-off-by: Filipe Manana <[email protected]>
    Reviewed-by: David Sterba <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    fdmanana authored and kdave committed Oct 17, 2024
    7e3a5ca
  73. btrfs: do not clear read-only when adding sprout device

    If you follow the seed/sprout wiki, it suggests the following workflow:
    
    btrfstune -S 1 seed_dev
    mount seed_dev mnt
    btrfs device add sprout_dev
    mount -o remount,rw mnt
    
    The first mount mounts the FS readonly, which results in not setting
    BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add
    somewhat surprisingly clears the readonly bit on the sb (though the
    mount is still practically readonly, from the user's perspective...).
    Finally, the remount checks the readonly bit on the sb against the flag
    and sees no change, so it does not run the code intended to run on
    ro->rw transitions, leaving BTRFS_FS_OPEN unset.
    
    As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and
    does no work. This results in leaking deleted snapshots until we run out
    of space.
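
    The sequence can be replayed with a small state model. Everything here is
    a hypothetical simplification: struct model_fs tracks only the read-only
    bit on the sb and a flag standing in for BTRFS_FS_OPEN, and the
    clear_rdonly parameter switches between the behaviour described above and
    the proposed fix.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the two pieces of state discussed above. */
struct model_fs {
	bool sb_rdonly; /* read-only bit on the super block */
	bool fs_open;   /* stands in for BTRFS_FS_OPEN */
};

/* First mount of a seed device: read-only, BTRFS_FS_OPEN not set. */
static void model_mount_seed(struct model_fs *fs)
{
	assert(fs);
	fs->sb_rdonly = true;
	fs->fs_open = false;
}

/* clear_rdonly = true models the old behaviour, false the fixed one. */
static void model_device_add(struct model_fs *fs, bool clear_rdonly)
{
	if (clear_rdonly)
		fs->sb_rdonly = false;
}

/* Remount rw runs the ro->rw transition only if the bit actually changes. */
static void model_remount_rw(struct model_fs *fs)
{
	if (!fs->sb_rdonly)
		return; /* "no change": the transition code is skipped */
	fs->sb_rdonly = false;
	fs->fs_open = true; /* set as part of the ro->rw transition */
}

/* The cleaner does work only when the filesystem is marked open. */
static bool model_cleaner_runs(const struct model_fs *fs)
{
	return fs->fs_open;
}
```

    With clear_rdonly set, the remount sees no change and the cleaner never
    runs; with it cleared, the remount performs the transition and the
    cleaner can do its work.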
    
    I propose fixing it at the first departure from what feels reasonable:
    when we clear the readonly bit on the sb during device add.
    
    A new fstest I have written reproduces the bug and confirms the fix.
    
    Reviewed-by: Qu Wenruo <[email protected]>
    Signed-off-by: Boris Burkov <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    boryas authored and kdave committed Oct 17, 2024
    78a423d
  74. btrfs: qgroup: set a more sane default value for subtree drop threshold

    Since commit 011b46c ("btrfs: skip subtree scan if it's too high to
    avoid low stall in btrfs_commit_transaction()"), btrfs qgroup can
    automatically skip large subtree scan at the cost of marking qgroup
    inconsistent.
    
    It's designed to address the final performance problem of snapshot drop
    with qgroup enabled, but to be safe the default value is
    BTRFS_MAX_LEVEL, requiring a user space daemon to set a different value
    to make it work.
    
    I'd say it's not a good idea to rely on a user space tool to set this
    default value, especially when some operations (snapshot dropping) can
    be triggered immediately after mount, leaving a very small window to
    use that sysfs interface.
    
    So instead of disabling this new feature by default, enable it with a
    low threshold (3), so that large subvolume tree drop at mount time won't
    cause huge qgroup workload.
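
    Why the old default effectively disabled the feature can be seen in a
    short sketch. The function name and the exact comparison are assumptions
    made for illustration; the only facts used are that BTRFS_MAX_LEVEL is 8
    and that tree levels therefore range from 0 to 7.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MAX_LEVEL 8 /* mirrors BTRFS_MAX_LEVEL; valid levels are 0..7 */

/*
 * Returns true when the subtree scan should be skipped (at the cost of
 * marking qgroups inconsistent). Assumed rule: skip once the subtree root
 * level reaches the threshold.
 */
static bool model_skip_subtree_scan(int drop_subtree_thres, int root_level)
{
	assert(root_level >= 0 && root_level < MODEL_MAX_LEVEL);
	return root_level >= drop_subtree_thres;
}
```

    With a threshold of MODEL_MAX_LEVEL no valid level ever reaches it, so
    nothing is skipped. With a threshold of 3, subtrees rooted at level 3 or
    higher are skipped while smaller ones are still accounted.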
    
    CC: [email protected] # 6.1
    Signed-off-by: Qu Wenruo <[email protected]>
    Reviewed-by: David Sterba <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    adam900710 authored and kdave committed Oct 17, 2024
    8a40d12
  75. btrfs: fix the delalloc range locking if sector size < page size

    Inside lock_delalloc_folios(), there are several problems related to
    sector size < page size handling:
    
    - Set the writer locks without checking if the folio is still valid
      We call btrfs_folio_start_writer_lock() just like it's folio_lock().
      But since the folio may not even be the folio of the current mapping,
      we can easily screw up the folio->private.
    
    - The range is not clamped inside the page
      This means we can overwrite other bitmaps if the start/len is not
      properly handled, and trigger the btrfs_subpage_assert().
    
    - @processed_end is always rounded up to page end
      If the delalloc range is not page aligned, and we need to retry
      (returning -EAGAIN), then we will unlock to the page end.
    
      Thankfully this is not a huge problem, as now
      btrfs_folio_end_writer_lock() can handle a range larger than the locked
      range, and only unlock what is already locked.
    
    Fix all these problems by:
    
    - Lock and check the folio first, then call
      btrfs_folio_set_writer_lock()
      So that if we got a folio not belonging to the inode, we won't
      touch folio->private.
    
    - Properly truncate the range inside the page
    
    - Update @processed_end to the locked range end
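
    The range truncation in the second fix can be sketched as a plain clamp.
    The helper below is hypothetical and not the code in
    lock_delalloc_folios(); it assumes a 64K page and an inclusive range end.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 65536ULL /* assumed: 64K pages with smaller sectors */

struct model_range {
	uint64_t start;
	uint64_t len;
};

/*
 * Clamp the delalloc range [start, end] (end inclusive) to the page that
 * begins at page_start. The caller guarantees that the two overlap.
 */
static struct model_range model_clamp_to_page(uint64_t page_start,
					      uint64_t start, uint64_t end)
{
	uint64_t page_end = page_start + MODEL_PAGE_SIZE - 1;
	uint64_t s = start > page_start ? start : page_start;
	uint64_t e = end < page_end ? end : page_end;
	struct model_range r;

	assert(s <= e); /* the range must overlap the page */
	r.start = s;
	r.len = e - s + 1;
	return r;
}
```

    Only the clamped start and length are then used for the subpage bitmap,
    so a delalloc range that spans several pages cannot touch bits outside
    the current one.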
    
    Fixes: 1e1de38 ("btrfs: make process_one_page() to handle subpage locking")
    Signed-off-by: Qu Wenruo <[email protected]>
    Reviewed-by: David Sterba <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    adam900710 authored and kdave committed Oct 17, 2024
    cd9721f
  76. btrfs: remove unused btrfs_folio_start_writer_lock()

    This function is not really suitable to lock a folio, as it lacks the
    proper mapping checks, thus the locked folio may not even belong to
    btrfs.
    
    For that reason the last user, inside lock_delalloc_folios(), has already
    been removed, so we can remove this function.
    
    Signed-off-by: Qu Wenruo <[email protected]>
    Reviewed-by: David Sterba <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    adam900710 authored and kdave committed Oct 17, 2024
    3efe27c
  77. btrfs: unify to use writer locks for subpage locking

    Since commit d7172f5 ("btrfs: use per-buffer locking for
    extent_buffer reading"), metadata read no longer relies on the subpage
    reader locking.
    
    This means we do not need to maintain a different metadata/data split
    for locking, so we can convert the existing reader lock users by:
    
    - add_ra_bio_pages()
      Convert to btrfs_folio_set_writer_lock()
    
    - end_folio_read()
      Convert to btrfs_folio_end_writer_lock()
    
    - begin_folio_read()
      Convert to btrfs_folio_set_writer_lock()
    
    - folio_range_has_eb()
      Remove the subpage->readers checks, since it is always 0.
    
    - Remove btrfs_subpage_start_reader() and btrfs_subpage_end_reader()
    
    Signed-off-by: Qu Wenruo <[email protected]>
    Reviewed-by: David Sterba <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    adam900710 authored and kdave committed Oct 17, 2024
    9d64856
  78. btrfs: rename btrfs_folio_(set|start|end)_writer_lock()

    Since there is no user of reader locks, rename the writer locks into a
    more generic name, by removing the "_writer" part from the name.
    
    And also rename btrfs_subpage::writer into btrfs_subpage::locked.
    
    Signed-off-by: Qu Wenruo <[email protected]>
    Reviewed-by: David Sterba <[email protected]>
    Signed-off-by: David Sterba <[email protected]>
    adam900710 authored and kdave committed Oct 17, 2024
    2be4c90

Commits on Oct 21, 2024

  1. btrfs: implement partial deletion of RAID stripe extents

    In our CI system, the RAID stripe tree configuration sometimes fails with
    the following ASSERT():
    
     assertion failed: found_start >= start && found_end <= end, in fs/btrfs/raid-stripe-tree.c:64
    
    This ASSERT() triggers because, for the initial design of the RAID
    stripe-tree, I had the "one ordered-extent equals one bio" rule of zoned
    btrfs in mind.
    
    But for a RAID stripe-tree based system that is hosted on a regular
    device rather than on a zoned storage device, this rule doesn't apply.
    
    So in case the range we want to delete starts in the middle of the
    previous item, grab the item and "truncate" its length. That is, clone
    the item, subtract the deleted portion from the key's offset, delete the
    old item and insert the new one.
    
    In case the range to delete ends in the middle of an item, we have to
    adjust both the item's key as well as the stripe extents and then
    re-insert the modified clone into the tree after deleting the old stripe
    extent.
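
    The two adjustments can be sketched on a simplified item. The structure
    below is a hypothetical model with a single stride, where the key is
    (start, len) and the value is one physical address; the tree operations
    (clone, delete, insert) are left out and only the arithmetic is shown.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical one-stride item: key is (start, len), value is physical. */
struct model_stripe_extent {
	uint64_t start;    /* logical start (key objectid) */
	uint64_t len;      /* logical length (key offset) */
	uint64_t physical; /* physical address of the first byte */
};

/* Deletion begins inside the item and covers its tail: keep the front. */
static struct model_stripe_extent
model_truncate_tail(struct model_stripe_extent item, uint64_t del_start)
{
	assert(del_start > item.start && del_start < item.start + item.len);
	item.len = del_start - item.start;
	return item;
}

/* Deletion covers the head and ends (exclusive) inside the item: keep the back. */
static struct model_stripe_extent
model_truncate_head(struct model_stripe_extent item, uint64_t del_end)
{
	uint64_t diff;

	assert(del_end > item.start && del_end < item.start + item.len);
	diff = del_end - item.start;
	item.start += diff;    /* the key moves forward */
	item.len -= diff;      /* and covers less */
	item.physical += diff; /* the stride follows the new start */
	return item;
}
```

    In the first case only the key changes, while in the second both the key
    and the stride have to move, which is why the modified clone is
    re-inserted after the old item is deleted.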
    
    Signed-off-by: Johannes Thumshirn <[email protected]>
    morbidrsa committed Oct 21, 2024
    6eba246
  2. btrfs: implement self-tests for partial RAID stripe-tree delete

    Implement self-tests for partial deletion of RAID stripe-tree entries.
    
    These two new tests cover both the deletion of the front of a RAID
    stripe-tree stripe extent as well as truncation of an item to make it
    smaller.
    
    Signed-off-by: Johannes Thumshirn <[email protected]>
    morbidrsa committed Oct 21, 2024
    7269e1c