
Regular for-next build test #1157

Open

kdave wants to merge 10,000 commits into build from for-next
Conversation

@kdave (Member) commented Feb 21, 2024

Keep this open, the build tests are hosted on github CI.

@kdave changed the title from Post rc5 build test to Regular for-next build test on Feb 22, 2024
@kdave force-pushed the for-next branch 6 times, most recently from 2d4aefb to c9e380a on February 28, 2024 14:37
@kdave force-pushed the for-next branch 6 times, most recently from c56343b to 1cab137 on March 5, 2024 17:23
@kdave force-pushed the for-next branch 2 times, most recently from 6613f3c to b30a0ce on March 15, 2024 01:05
@kdave force-pushed the for-next branch 6 times, most recently from d205ebd to c0bd9d9 on March 25, 2024 17:48
@kdave force-pushed the for-next branch 4 times, most recently from 15022b1 to c22750c on March 28, 2024 02:04
@kdave force-pushed the for-next branch 3 times, most recently from 28d9855 to e18d8ce on April 4, 2024 19:30
fdmanana and others added 27 commits April 7, 2025 15:40
Instead of using __clear_extent_bit() we can use clear_extent_bits() since
we pass a NULL value for the cached and changeset arguments.

Reviewed-by: Boris Burkov <[email protected]>
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
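For illustration, a minimal before/after sketch of the call pattern described
above, assuming the usual extent-io-tree helper signatures; the io_tree and
the bit used here are placeholders only:

   /* before: both the cached state and the changeset arguments are NULL */
   __clear_extent_bit(&inode->io_tree, start, end, EXTENT_DELALLOC, NULL, NULL);

   /* after: the wrapper expresses the same operation without the NULLs */
   clear_extent_bits(&inode->io_tree, start, end, EXTENT_DELALLOC);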
…ssible

Several places are using clear_extent_bit() and passing a NULL value for
the 'cached' argument, which is pointless as they can use
clear_extent_bits() instead.

Reviewed-by: Boris Burkov <[email protected]>
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Instead of keeping track of the minimum start offset of the next record
and detecting overflow every time we update that offset to be the sum of
the current record's end offset plus one, we can simply exit when the
current record ends at or beyond our end offset. This avoids updating the
start offset on every iteration and testing it at the top of the loop,
and makes both the source code and assembly code simpler, more efficient
and shorter (reducing the object text size).

Also remove the pointless initialization to NULL of the state variable, as
we don't use it before the first assignment to it. This may help avoid
some warnings with clang tools such as the one reported/fixed by commit
966de47 ("btrfs: remove redundant initialization of variables in
log_new_ancestors").

Reviewed-by: Boris Burkov <[email protected]>
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
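As a rough illustration of the simplification described above (a sketch, not
the exact function; next_state() and the record fields are assumed from the
extent state iteration code):

   /* before: advance a tracked start offset and guard against u64 overflow */
   while (state) {
           /* ... process the record ... */
           if (state->end == (u64)-1)
                   break;
           start = state->end + 1;
           if (start > end)
                   break;
           state = next_state(state);
   }

   /* after: stop as soon as the current record reaches the end offset */
   while (state) {
           /* ... process the record ... */
           if (state->end >= end)
                   break;
           state = next_state(state);
   }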
There are several things wrong with the documentation:

1) At the top it's only mentioned that we search for an entry containing
   the given offset, but when such an entry does not exist we search for
   the first entry that starts and ends after that offset;

2) It mentions that @node_ret and @parent_ret aren't changed if the
   returned entry contains the given offset - that is true only if the
   returned entry starts exactly at @offset, otherwise those arguments
   are changed;

3) It mentions that if no entry containing offset is found then we return
   the first entry ending before the offset - that is not true, we return
   the first entry that starts and ends after that offset;

4) It also mentions that NULL is never returned. This is false: in case
   there's no entry containing the offset, nor any entry that starts and
   ends after the offset, NULL is returned.

So fix the documentation.

Signed-off-by: Filipe Manana <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
The tree_search() function always returns an entry that either contains
the search offset or the first entry in the tree that starts after the
offset. So checking at find_first_extent_bit_state() if the returned
entry ends at or after the search offset is pointless. Remove the check.

Signed-off-by: Filipe Manana <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
The overflow detection for the start offset of the next record is not
really necessary, we can just stop iterating if the current record ends at
or after our end offset. This removes the need to test if the current
record end offset is (u64)-1 and to check if adding 1 to the current
end offset results in 0.

By testing only if the current record ends at or after the end offset, we
also no longer need to test the new start offset at the head of the while
loop.

This makes both the source code and assembly code simpler, more efficient
and shorter (reducing the object text size).

Also remove the pointless initialization to NULL of the state variable, as
we don't use it before the first assignment to it. This may help avoid
some warnings with clang tools such as the one reported/fixed by commit
966de47 ("btrfs: remove redundant initialization of variables in
log_new_ancestors").

Signed-off-by: Filipe Manana <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
It's pointless to check if the current record's start offset is greater
than the end offset, as just before that we tested if it was greater than
the start offset - and if it's not, it is less than or equal to the start
offset and therefore can not be greater than the end offset, since our
start offset is always smaller than the end offset.

So remove that check and also add an assertion to verify the start offset
is smaller than the end offset.

Signed-off-by: Filipe Manana <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
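A minimal sketch of the reasoning above (not the actual function; the record
fields follow the extent state records used elsewhere in this series):

   ASSERT(start < end);
   /* ... */
   if (state->start > start) {
           /* handle a record that starts after our start offset */
   } else {
           /*
            * state->start <= start < end, so the removed
            * "if (state->start > end)" check could never be true.
            */
   }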
The most trivial pattern for the auto freeing: the variable is declared
with the macro and the final btrfs_free_path() is removed. There are
almost no goto -> return conversions and there's no other function
cleanup.

Reviewed-by: Daniel Vacek <[email protected]>
Signed-off-by: David Sterba <[email protected]>
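For illustration, a minimal sketch of the pattern, assuming the
BTRFS_PATH_AUTO_FREE() declaration macro these patches use (the surrounding
function body is a placeholder):

   /* before */
   struct btrfs_path *path;

   path = btrfs_alloc_path();
   if (!path)
           return -ENOMEM;
   /* ... searches using path ... */
   btrfs_free_path(path);
   return ret;

   /* after: the path is freed automatically when it goes out of scope */
   BTRFS_PATH_AUTO_FREE(path);

   path = btrfs_alloc_path();
   if (!path)
           return -ENOMEM;
   /* ... searches using path ... */
   return ret;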
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.

Reviewed-by: Daniel Vacek <[email protected]>
Signed-off-by: David Sterba <[email protected]>
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.

Reviewed-by: Daniel Vacek <[email protected]>
Signed-off-by: David Sterba <[email protected]>
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.

Reviewed-by: Daniel Vacek <[email protected]>
Signed-off-by: David Sterba <[email protected]>
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.

Reviewed-by: Daniel Vacek <[email protected]>
Signed-off-by: David Sterba <[email protected]>
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.

Reviewed-by: Daniel Vacek <[email protected]>
Signed-off-by: David Sterba <[email protected]>
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.

Reviewed-by: Daniel Vacek <[email protected]>
Signed-off-by: David Sterba <[email protected]>
There was a bug report about a NULL pointer dereference in
__btrfs_add_free_space_zoned() that ultimately happens because of a
conversion from the default metadata profile DUP to a RAID1 profile on
two disks.

The stack trace has the following signature:

   BTRFS error (device sdc): zoned: write pointer offset mismatch of zones in raid1 profile
   BUG: kernel NULL pointer dereference, address: 0000000000000058
   #PF: supervisor read access in kernel mode
   #PF: error_code(0x0000) - not-present page
   PGD 0 P4D 0
   Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
   RIP: 0010:__btrfs_add_free_space_zoned.isra.0+0x61/0x1a0
   RSP: 0018:ffffa236b6f3f6d0 EFLAGS: 00010246
   RAX: 0000000000000000 RBX: ffff96c8132f3400 RCX: 0000000000000001
   RDX: 0000000010000000 RSI: 0000000000000000 RDI: ffff96c8132f3410
   RBP: 0000000010000000 R08: 0000000000000003 R09: 0000000000000000
   R10: 0000000000000000 R11: 00000000ffffffff R12: 0000000000000000
   R13: ffff96c758f65a40 R14: 0000000000000001 R15: 000011aac0000000
   FS: 00007fdab1cb2900(0000) GS:ffff96e60ca00000(0000) knlGS:0000000000000000
   CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
   CR2: 0000000000000058 CR3: 00000001a05ae000 CR4: 0000000000350ef0
   Call Trace:
   <TASK>
   ? __die_body.cold+0x19/0x27
   ? page_fault_oops+0x15c/0x2f0
   ? exc_page_fault+0x7e/0x180
   ? asm_exc_page_fault+0x26/0x30
   ? __btrfs_add_free_space_zoned.isra.0+0x61/0x1a0
   btrfs_add_free_space_async_trimmed+0x34/0x40
   btrfs_add_new_free_space+0x107/0x120
   btrfs_make_block_group+0x104/0x2b0
   btrfs_create_chunk+0x977/0xf20
   btrfs_chunk_alloc+0x174/0x510
   ? srso_return_thunk+0x5/0x5f
   btrfs_inc_block_group_ro+0x1b1/0x230
   btrfs_relocate_block_group+0x9e/0x410
   btrfs_relocate_chunk+0x3f/0x130
   btrfs_balance+0x8ac/0x12b0
   ? srso_return_thunk+0x5/0x5f
   ? srso_return_thunk+0x5/0x5f
   ? __kmalloc_cache_noprof+0x14c/0x3e0
   btrfs_ioctl+0x2686/0x2a80
   ? srso_return_thunk+0x5/0x5f
   ? ioctl_has_perm.constprop.0.isra.0+0xd2/0x120
   __x64_sys_ioctl+0x97/0xc0
   do_syscall_64+0x82/0x160
   ? srso_return_thunk+0x5/0x5f
   ? __memcg_slab_free_hook+0x11a/0x170
   ? srso_return_thunk+0x5/0x5f
   ? kmem_cache_free+0x3f0/0x450
   ? srso_return_thunk+0x5/0x5f
   ? srso_return_thunk+0x5/0x5f
   ? syscall_exit_to_user_mode+0x10/0x210
   ? srso_return_thunk+0x5/0x5f
   ? do_syscall_64+0x8e/0x160
   ? sysfs_emit+0xaf/0xc0
   ? srso_return_thunk+0x5/0x5f
   ? srso_return_thunk+0x5/0x5f
   ? seq_read_iter+0x207/0x460
   ? srso_return_thunk+0x5/0x5f
   ? vfs_read+0x29c/0x370
   ? srso_return_thunk+0x5/0x5f
   ? srso_return_thunk+0x5/0x5f
   ? syscall_exit_to_user_mode+0x10/0x210
   ? srso_return_thunk+0x5/0x5f
   ? do_syscall_64+0x8e/0x160
   ? srso_return_thunk+0x5/0x5f
   ? exc_page_fault+0x7e/0x180
   entry_SYSCALL_64_after_hwframe+0x76/0x7e
   RIP: 0033:0x7fdab1e0ca6d
   RSP: 002b:00007ffeb2b60c80 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
   RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fdab1e0ca6d
   RDX: 00007ffeb2b60d80 RSI: 00000000c4009420 RDI: 0000000000000003
   RBP: 00007ffeb2b60cd0 R08: 0000000000000000 R09: 0000000000000013
   R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
   R13: 00007ffeb2b6343b R14: 00007ffeb2b60d80 R15: 0000000000000001
   </TASK>
   CR2: 0000000000000058
   ---[ end trace 0000000000000000 ]---

The 1st line is the most interesting here:

 BTRFS error (device sdc): zoned: write pointer offset mismatch of zones in raid1 profile

When a RAID1 block-group is created and a write pointer mismatch between
the disks in the RAID set is detected, btrfs sets the alloc_offset to the
length of the block group marking it as full. Afterwards the code expects
that a balance operation will evacuate the data in this block-group and
repair the problems.

But before this is possible, the new space of this block-group is
accounted in the free space cache. __btrfs_add_free_space_zoned() checks
whether this is the initial creation of a block group and, if not, makes
a reclaim decision. Whether the accounting counts as an initial creation
depends on the size of the added free space being the whole length of the
block-group and the allocation offset being 0.

But as btrfs_load_block_group_zone_info() sets the allocation offset to
the zone capacity (i.e. marks the block-group as full), this initial
condition is not met, and the reclaim path is taken while the space_info
pointer in the 'struct btrfs_block_group' has not yet been assigned.
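Roughly, based on the description above (a sketch, not the exact code of
__btrfs_add_free_space_zoned()):

   bool initial = (size == block_group->length) &&
                  (block_group->alloc_offset == 0);

   if (!initial) {
           /*
            * Reclaim decision path: it dereferences
            * block_group->space_info, which is still NULL while the
            * block group is being created -> NULL pointer oops.
            */
   }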

Fail creation of the block group and rely on manual user intervention to
re-balance the filesystem.

Afterwards the filesystem can be unmounted, mounted in degraded mode and
the missing device can be removed after a full balance of the filesystem.

Reported-by: 西木野羰基 <[email protected]>
Link: https://lore.kernel.org/linux-btrfs/CAB_b4sBhDe3tscz=duVyhc9hNE+gu=B8CrgLO152uMyanR8BEA@mail.gmail.com/
Fixes: b1934cd ("btrfs: zoned: handle broken write pointer on zones")
Reviewed-by: Anand Jain <[email protected]>
Signed-off-by: Johannes Thumshirn <[email protected]>
Signed-off-by: David Sterba <[email protected]>
The again label is there to retry getting the folio for the current index.
When jumping to that label, the iterator is not incremented.

So it can be replaced by a simple "continue", removing the again label.

Signed-off-by: Qu Wenruo <[email protected]>
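A condensed sketch of the pattern (the loop body and the helper used here are
placeholders, not the actual function):

   /* before */
   while (index <= last_index) {
   again:
           folio = filemap_get_folio(mapping, index);
           if (IS_ERR(folio)) {
                   /* ... */
                   goto again;
           }
           /* ... copy data ... */
           index++;
   }

   /* after: "continue" re-checks the condition without touching index */
   while (index <= last_index) {
           folio = filemap_get_folio(mapping, index);
           if (IS_ERR(folio)) {
                   /* ... */
                   continue;
           }
           /* ... copy data ... */
           index++;
   }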
Currently the function put_file_data() can only accept page sized folios.

However the function itself is not that complex: it just copies data from
a filemap folio into the send buffer.

So make it support larger data folios by:

- Change the loop to use file offset instead of page index

- Calculate @pg_offset and @cur_len after getting the folio

- Remove the "WARN_ON(folio_order(folio));" line

Signed-off-by: Qu Wenruo <[email protected]>
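For illustration, a rough offset-based loop along the lines of the list above
(error handling omitted; the variable names are placeholders):

   while (offset < end) {
           struct folio *folio = filemap_get_folio(mapping, offset >> PAGE_SHIFT);
           size_t pg_offset;
           size_t cur_len;

           /* only now do we know how large the folio actually is */
           pg_offset = offset_in_folio(folio, offset);
           cur_len = min_t(u64, end - offset, folio_size(folio) - pg_offset);

           /* ... copy cur_len bytes at pg_offset into the send buffer ... */

           folio_put(folio);
           offset += cur_len;
   }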
The function btrfs_page_mkwrite() has an explicit ASSERT() checking the
folio order.

To make it support larger data folios, we need to:

- Remove the ASSERT(folio_order(folio) == 0)

- Use folio_contains() to check if the folio covers the last page

Otherwise the code is already supporting larger folios well.

Signed-off-by: Qu Wenruo <[email protected]>
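A minimal sketch of the folio_contains() based check mentioned above (the
variable names are placeholders):

   pgoff_t last_index = (i_size_read(inode) - 1) >> PAGE_SHIFT;

   /* does this folio cover the last page of the file? */
   if (folio_contains(folio, last_index)) {
           /* ... zero the part of the folio beyond EOF ... */
   }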
The only blocker is the ASSERT() rejecting larger folios, so just remove
it.

Signed-off-by: Qu Wenruo <[email protected]>
The function does an ASSERT() checking the folio order, but all the
functions called later handle larger folios properly, thus we can safely
remove that ASSERT().

Signed-off-by: Qu Wenruo <[email protected]>
The subpage handling code has two locations not supporting larger
folios:

- btrfs_attach_subpage()
  It does a metadata specific ASSERT() check.

  But for the future larger data folio support, that check is too
  generic.
  Since it's metadata specific, only do the ASSERT() for metadata.

- btrfs_subpage_assert()
  Just remove the "ASSERT(folio_order(folio) == 0)" check.

Signed-off-by: Qu Wenruo <[email protected]>
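Roughly, for the btrfs_attach_subpage() part (a sketch assuming the existing
subpage type enum):

   /* keep the order-0 expectation only for metadata folios */
   if (type == BTRFS_SUBPAGE_METADATA)
           ASSERT(folio_order(folio) == 0);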
The function itself already takes larger folios into consideration, so
just remove the ASSERT(!folio_test_large()) line.

Signed-off-by: Qu Wenruo <[email protected]>
Instead of open coding btrfs_root_id() to get the ID of a root, use the
helper in the trace points, which also makes the code less verbose.

Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
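For illustration, the open coded form versus the helper (the trace event
field name is a placeholder):

   /* before: open coded root ID */
   __entry->root_objectid = root->root_key.objectid;

   /* after: use the helper */
   __entry->root_objectid = btrfs_root_id(root);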
If we have a failure at create_reloc_inode(), under the 'out' label we
assign an error pointer to the 'inode' variable and then return a weird
pointer because we return the expression "&inode->vfs_inode":

   static noinline_for_stack struct inode *create_reloc_inode(
                                    const struct btrfs_block_group *group)
   {
       (...)
   out:
       (...)
       if (ret) {
            if (inode)
                  iput(&inode->vfs_inode);
            inode = ERR_PTR(ret);
       }
       return &inode->vfs_inode;
   }

This can make us return a pointer that is not an error pointer, making the
caller proceed as if an error didn't happen, which can later result in an
invalid memory access when dereferencing the inode pointer.
Syzbot reported such a case with the following stack trace:

   R10: 0000000000000002 R11: 0000000000000246 R12: 0000000000000000
   R13: 0000000000000000 R14: 431bde82d7b634db R15: 00007ffc55de5790
    </TASK>
   BTRFS info (device loop0): relocating block group 6881280 flags data|metadata
   Oops: general protection fault, probably for non-canonical address 0xdffffc0000000045: 0000 [#1] SMP KASAN NOPTI
   KASAN: null-ptr-deref in range [0x0000000000000228-0x000000000000022f]
   CPU: 0 UID: 0 PID: 5332 Comm: syz-executor215 Not tainted 6.14.0-syzkaller-13423-ga8662bcd2ff1 #0 PREEMPT(full)
   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
   RIP: 0010:relocate_file_extent_cluster+0xe7/0x1750 fs/btrfs/relocation.c:2971
   Code: 00 74 08 (...)
   RSP: 0018:ffffc9000d3375e0 EFLAGS: 00010203
   RAX: 0000000000000045 RBX: 000000000000022c RCX: ffff888000562440
   RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8880452db000
   RBP: ffffc9000d337870 R08: ffffffff84089251 R09: 0000000000000000
   R10: 0000000000000000 R11: 0000000000000000 R12: dffffc0000000000
   R13: ffffffff9368a020 R14: 0000000000000394 R15: ffff8880452db000
   FS:  000055558bc7b380(0000) GS:ffff88808c596000(0000) knlGS:0000000000000000
   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
   CR2: 000055a7a192e740 CR3: 0000000036e2e000 CR4: 0000000000352ef0
   DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
   DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
   Call Trace:
    <TASK>
    relocate_block_group+0xa1e/0xd50 fs/btrfs/relocation.c:3657
    btrfs_relocate_block_group+0x777/0xd80 fs/btrfs/relocation.c:4011
    btrfs_relocate_chunk+0x12c/0x3b0 fs/btrfs/volumes.c:3511
    __btrfs_balance+0x1a93/0x25e0 fs/btrfs/volumes.c:4292
    btrfs_balance+0xbde/0x10c0 fs/btrfs/volumes.c:4669
    btrfs_ioctl_balance+0x3f5/0x660 fs/btrfs/ioctl.c:3586
    vfs_ioctl fs/ioctl.c:51 [inline]
    __do_sys_ioctl fs/ioctl.c:906 [inline]
    __se_sys_ioctl+0xf1/0x160 fs/ioctl.c:892
    do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
    do_syscall_64+0xf3/0x230 arch/x86/entry/syscall_64.c:94
    entry_SYSCALL_64_after_hwframe+0x77/0x7f
   RIP: 0033:0x7fb4ef537dd9
   Code: 28 00 00 (...)
   RSP: 002b:00007ffc55de5728 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
   RAX: ffffffffffffffda RBX: 00007ffc55de5750 RCX: 00007fb4ef537dd9
   RDX: 0000200000000440 RSI: 00000000c4009420 RDI: 0000000000000003
   RBP: 0000000000000002 R08: 00007ffc55de54c6 R09: 00007ffc55de5770
   R10: 0000000000000002 R11: 0000000000000246 R12: 0000000000000000
   R13: 0000000000000000 R14: 431bde82d7b634db R15: 00007ffc55de5790
    </TASK>
   Modules linked in:
   ---[ end trace 0000000000000000 ]---
   RIP: 0010:relocate_file_extent_cluster+0xe7/0x1750 fs/btrfs/relocation.c:2971
   Code: 00 74 08 (...)
   RSP: 0018:ffffc9000d3375e0 EFLAGS: 00010203
   RAX: 0000000000000045 RBX: 000000000000022c RCX: ffff888000562440
   RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8880452db000
   RBP: ffffc9000d337870 R08: ffffffff84089251 R09: 0000000000000000
   R10: 0000000000000000 R11: 0000000000000000 R12: dffffc0000000000
   R13: ffffffff9368a020 R14: 0000000000000394 R15: ffff8880452db000
   FS:  000055558bc7b380(0000) GS:ffff88808c596000(0000) knlGS:0000000000000000
   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
   CR2: 000055a7a192e740 CR3: 0000000036e2e000 CR4: 0000000000352ef0
   DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
   DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
   ----------------
   Code disassembly (best guess):
      0:	00 74 08 48          	add    %dh,0x48(%rax,%rcx,1)
      4:	89 df                	mov    %ebx,%edi
      6:	e8 f8 36 24 fe       	call   0xfe243703
      b:	48 89 9c 24 30 01 00 	mov    %rbx,0x130(%rsp)
     12:	00
     13:	4c 89 74 24 28       	mov    %r14,0x28(%rsp)
     18:	4d 8b 76 10          	mov    0x10(%r14),%r14
     1c:	49 8d 9e 98 fe ff ff 	lea    -0x168(%r14),%rbx
     23:	48 89 d8             	mov    %rbx,%rax
     26:	48 c1 e8 03          	shr    $0x3,%rax
   * 2a:	42 80 3c 20 00       	cmpb   $0x0,(%rax,%r12,1) <-- trapping instruction
     2f:	74 08                	je     0x39
     31:	48 89 df             	mov    %rbx,%rdi
     34:	e8 ca 36 24 fe       	call   0xfe243703
     39:	4c 8b 3b             	mov    (%rbx),%r15
     3c:	48                   	rex.W
     3d:	8b                   	.byte 0x8b
     3e:	44                   	rex.R
     3f:	24                   	.byte 0x24

So fix this by returning the error immediately.
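That is, along the lines of the following (a sketch mirroring the snippet
quoted above, not necessarily the exact patch):

   out:
       (...)
       if (ret) {
            if (inode)
                  iput(&inode->vfs_inode);
            /*
             * Return the error pointer directly instead of going
             * through &inode->vfs_inode below.
             */
            return ERR_PTR(ret);
       }
       return &inode->vfs_inode;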

Reported-by: [email protected]
Link: https://lore.kernel.org/linux-btrfs/[email protected]/
Fixes: 00aad50 ("btrfs: make btrfs_iget() return a btrfs inode instead")
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Inside functions unlock_delalloc_folio() and lock_delalloc_folios(), we
have the following early exits:

	if (index == locked_folio->index && end_index == index)
		return;

This allows us to exit early if the range is inside the same locked
folio.

However the current check relies on page sized folios: if we have a large
folio that contains @index but whose folio->index is not @index, then the
early exit will no longer trigger.

Furthermore without the above early check, the existing code can handle it
well, as both __process_folios_contig() and lock_delalloc_folios() will
skip any folio page lock/unlock if it's on the locked folio.

Here we remove the early exits and let the existing code handle the
same-index case, making the code a little simpler.

Reviewed-by: Filipe Manana <[email protected]>
Signed-off-by: Qu Wenruo <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Currently we use the following pattern to detect if the folio contains
the end of a file:

	if (folio->index == end_index)
		folio_zero_range();

But that only works if the folio is page sized.

For the following case, it will not work and leave the range beyond EOF
uninitialized:

  The page size is 4K, and the fs block size is also 4K.

	16K        20K   22K      24K
	|          |     |        |
	                 |
	                 EOF at 22K

And we have a large folio sized 8K at file offset 16K.

In that case, the old "folio->index == end_index" will not work, thus
the range [22K, 24K) will not be zeroed out.

Fix the following call sites which use the above pattern:

- add_ra_bio_pages()

- extent_writepage()
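For both call sites, a folio-size aware version of the check looks roughly
like this (a sketch assuming the folio straddles EOF; i_size and the folio
helpers are the usual ones):

   if (folio_pos(folio) + folio_size(folio) > i_size) {
           size_t zero_start = offset_in_folio(folio, i_size);

           /* zero everything in the folio that lies beyond EOF */
           folio_zero_range(folio, zero_start, folio_size(folio) - zero_start);
   }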

Reviewed-by: Filipe Manana <[email protected]>
Signed-off-by: Qu Wenruo <[email protected]>
Signed-off-by: David Sterba <[email protected]>
With large folios, filemap_get_folios_contig() can return duplicated
folios, for example:

	704K                     768K	                          1M
        |<-- 64K sized folio --->|<------- 256K sized folio ----->|
	                      |          |
			      764K       800K

If we call lock_delalloc_folios() with range [764K, 800K) on the above
layout, with the locked folio at 704K, we hit the following sequence:

1. filemap_get_folios_contig() returns 1 entry for range [764K, 768K),
   as 768K is a folio boundary.

   The returned folio is the folio at 704K.

   Since that folio is already locked, we will not lock it again.

2. filemap_get_folios_contig() returns 8 entries for range [768K, 800K)
   All the entries are the same folio at 768K.

3. We lock folio 768K for slot 0 of the fbatch

4. We lock folio 768K for slot 1 of the fbatch
   We deadlock, as we have already locked the same folio at step 3.

The filemap_get_folios_contig() behavior is a bug, which has been
reported to the block layer:

 https://lore.kernel.org/linux-btrfs/[email protected]/

And the author is working actively on a fix to address the behavior:

 https://lore.kernel.org/linux-btrfs/Z-8s1-kiIDkzgRbc@fedora/

On the other hand, we do not really need the filemap_get_folios_contig()
call either:

- There is no error handling if there is a hole anyway
  filemap_get_folios_contig() will return immediately at a hole in the
  index range.

  In that case, we're in big trouble as our io tree is no longer in sync
  with the filemap.
  So no matter if we use filemap_get_folios_contig() or
  filemap_get_folios(), the result will be the same.

- filemap_get_folios() results already have ascending indices
  Thus @processed_end is still updated in the correct order.

- No difference in error handling
  Both the lock and unlock paths use filemap_get_folios(), meaning even
  if there is a hole and we need to unlock and retry, we will still get
  the same batch of folios to unlock.

So get rid of the buggy filemap_get_folios_contig() and use regular
filemap_get_folios() instead; this fixes the above deadlock for large
folios.
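A rough sketch of the locking loop using filemap_get_folios() (names such as
locked_folio are placeholders; error handling and the unlock-on-failure path
are omitted):

   struct folio_batch fbatch;
   pgoff_t index = start_index;

   folio_batch_init(&fbatch);
   while (index <= end_index) {
           unsigned int found, i;

           found = filemap_get_folios(mapping, &index, end_index, &fbatch);
           if (found == 0)
                   break;

           for (i = 0; i < found; i++) {
                   struct folio *folio = fbatch.folios[i];

                   /* each folio shows up only once, so no double locking */
                   if (folio != locked_folio)
                           folio_lock(folio);
           }
           folio_batch_release(&fbatch);
   }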

Reviewed-by: Filipe Manana <[email protected]>
Signed-off-by: Qu Wenruo <[email protected]>
Signed-off-by: David Sterba <[email protected]>
kdave added 2 commits April 7, 2025 20:34
Add more unlikely annotations to branches that lead to EUCLEAN. Overall,
in the tree checker this helps to reorder instructions for the no-error
case.

Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: David Sterba <[email protected]>
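A generic example of the annotated pattern (the item type and size check are
placeholders, not a specific check from this series):

   if (unlikely(btrfs_item_size(leaf, slot) < sizeof(struct btrfs_dir_item))) {
           /* ... report the corruption ... */
           return -EUCLEAN;
   }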
The whole tree checker returns EUCLEAN, except the one check in
btrfs_verify_level_key(). This was inherited from the function that was
moved from disk-io.c in 2cac5af ("btrfs: move
btrfs_verify_level_key into tree-checker.c") but this should be unified
with the rest.

Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: David Sterba <[email protected]>