Test for-next (regular) #1268

Open · kdave wants to merge 10,000 commits into ci (base) from test-for-next
Conversation

@kdave (Member) commented May 16, 2024

No description provided.

@kdave force-pushed the test-for-next branch 3 times, most recently from 1e76edd to 6d347e7 on May 24, 2024 at 21:18
@kdave changed the title from "Test for-next rc0" to "Test for-next rc1" on May 28, 2024
@kdave changed the title from "Test for-next rc1" to "Test for-next (6.10-rc2)" on Jun 3, 2024
@kdave changed the title from "Test for-next (6.10-rc2)" to "Test for-next (6.10-rc3)" on Jun 13, 2024
@kdave force-pushed the test-for-next branch 2 times, most recently from 68f1b4b to 4a4d10c on June 24, 2024 at 15:33
@kdave changed the title from "Test for-next (6.10-rc3)" to "Test for-next (6.10-rc5)" on Jun 24, 2024
@kdave force-pushed the test-for-next branch 4 times, most recently from a817d1a to dde24c7 on July 2, 2024 at 14:58
@kdave changed the title from "Test for-next (6.10-rc5)" to "Test for-next (regular)" on Jul 18, 2024
@kdave force-pushed the test-for-next branch 3 times, most recently from d6b8894 to fc58d4b on August 5, 2024 at 16:49
@kdave force-pushed the test-for-next branch 3 times, most recently from 3c3dc5e to 7b9aa31 on August 15, 2024 at 18:34
@kdave force-pushed the test-for-next branch 2 times, most recently from 7f78dd5 to 00782aa on August 29, 2024 at 16:46
@kdave force-pushed the test-for-next branch 2 times, most recently from 5278b7a to b7b74b3 on September 4, 2024 at 21:31
fdmanana and others added 29 commits on April 2, 2025 at 17:30
Drop the reference to pages from the comment, since the function is fully
folio-aware and works regardless of how many pages are in the folio. Also,
while at it, capitalize the first word and make it more explicit that
release_folio is a callback from struct address_space_operations.

Signed-off-by: Filipe Manana <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
When the release_folio callback (from struct address_space_operations) is
invoked, we don't allow the folio to be released if its range is currently
locked in the inode's io_tree, as that may indicate the folio is needed by
the task that locked the range.

However, if the range is locked because an ordered extent is finishing,
then we can safely allow the folio to be released, because ordered extent
completion doesn't need to use the folio at all.

When we are under memory pressure, the kernel starts writeback of dirty
pages (folios) with the goal of releasing them from the page cache after
writeback completes. However, this is often not possible on btrfs
because:

  * Once the writeback completes we queue the ordered extent completion;

  * Once the ordered extent completion starts, we lock the range in the
    inode's io_tree (at btrfs_finish_one_ordered());

  * If the release_folio callback is called while the folio's range is
    locked in the inode's io_tree, we don't allow the folio to be
    released, so the kernel has to try to release memory elsewhere,
    which may result in triggering more writeback or releasing other
    pages from the page cache which may be more useful to have around
    for applications.

In contrast, when the release_folio callback is invoked after writeback
finishes and before ordered extent completion starts or locks the range,
we allow the folio to be released, as well as when the release_folio
callback is invoked after ordered extent completion unlocks the range.

Improve on this by detecting whether the range is locked for ordered
extent completion and, if it is, allowing the folio to be released. This
detection is achieved by adding a new extent flag in the io_tree that is
set when the range is locked during ordered extent completion, as in the
sketch below.
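
A minimal sketch of the idea in kernel-style C, assuming a hypothetical
flag name EXTENT_FINISHING_ORDERED and a simplified helper (the in-tree
names and signatures may differ):

  /* Hypothetical bit, set on the io_tree range while an ordered
   * extent is being finished (the flag name is an assumption). */
  #define EXTENT_FINISHING_ORDERED        (1U << 15)

  static bool can_release_folio_range(struct extent_io_tree *tree,
                                      u64 start, u64 end)
  {
          /*
           * A range locked only because ordered extent completion is
           * running does not need the folio, so releasing it is safe.
           */
          return test_range_bit(tree, start, end,
                                EXTENT_FINISHING_ORDERED, NULL);
  }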

Signed-off-by: Filipe Manana <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Allow get_range_bits() to take an extent state pointer-to-pointer argument
so that we can cache the first extent state record in the target range,
letting a caller use it for subsequent operations without doing a full
tree search. Currently the only user is try_release_extent_state(), which
then calls __clear_extent_bit(), which can use such a cached state
record.
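
A sketch of the reworked helper, simplified and with hypothetical
internals; the point is the extra pointer-to-pointer output parameter:

  /* Sketch only: locking and the exact search details omitted. */
  static void get_range_bits(struct extent_io_tree *tree, u64 start,
                             u64 end, u32 *bits,
                             struct extent_state **cached_state)
  {
          struct extent_state *state = tree_search(tree, start);

          *bits = 0;
          while (state && state->start <= end) {
                  if (cached_state && !*cached_state) {
                          /* Cache the first record in the range so the
                           * caller can reuse it without a new search. */
                          *cached_state = state;
                          refcount_inc(&state->refs);
                  }
                  *bits |= state->state;
                  state = next_state(state);
          }
  }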

Signed-off-by: Filipe Manana <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Simplify conditionally reading an rb_entry(): there's the
rb_entry_safe() helper that checks the node pointer for NULL, so we don't
have to write the check explicitly.
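
For illustration, the before/after pattern (struct tree_entry and the
field names here are placeholders):

  /* Before: open-coded NULL check. */
  entry = node ? rb_entry(node, struct tree_entry, rb_node) : NULL;

  /* After: rb_entry_safe() from <linux/rbtree.h> does the check. */
  entry = rb_entry_safe(node, struct tree_entry, rb_node);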

Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: David Sterba <[email protected]>
After enabling large data folios for tests, I hit the ASSERT() inside
GET_SUBPAGE_BITMAP() where blocks_per_folio matches BITS_PER_LONG.

The ASSERT() itself was based only on the original subpage fs block
sizes, where we have at most 16 blocks per page, thus
"ASSERT(blocks_per_folio < BITS_PER_LONG)".

However, the experimental large data folio support sets the max folio
order according to BITS_PER_LONG, so we can have a case where a large
folio contains exactly BITS_PER_LONG blocks.

So the ASSERT() is too strict; change it to
"ASSERT(blocks_per_folio <= BITS_PER_LONG)" to avoid the false alert.

Reviewed-by: Filipe Manana <[email protected]>
Reviewed-by: Sweet Tea Dorminy <[email protected]>
Reviewed-by: Boris Burkov <[email protected]>
Signed-off-by: Qu Wenruo <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
…ge()

[BUG WITH EXPERIMENTAL LARGE FOLIOS]
When testing the experimental large data folio support with compression,
there are several ASSERT()s triggered from btrfs_decompress_buf2page()
when running fsstress with compress=zstd mount option:

- ASSERT(copy_len) from btrfs_decompress_buf2page()
- VM_BUG_ON(offset + len > PAGE_SIZE) from memcpy_to_page()

[CAUSE]
Inside btrfs_decompress_buf2page(), we need to grab the file offset from
the current bvec.bv_page to check if we even need to copy data into the
bio.

With single-page bvecs and no large folios, every page inside a folio has
its index properly set up.

But when large folios are involved, only the first page (aka the head
page) of a large folio has its index properly initialized. The other
pages inside the large folio do not.

Thus the page_offset() call inside btrfs_decompress_buf2page() results in
garbage and completely screws up the @copy_len calculation.

[FIX]
Instead of using page->index directly, go with page_pgoff(), which can
handle non-head pages correctly.

So introduce a helper, file_offset_from_bvec(), to get the file offset
from a single-page bio_vec, so the copy_len calculation can be done
correctly.
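
A sketch of such a helper, closely following the description above (an
approximation, not necessarily the exact patch):

  /*
   * Derive the file offset covered by a single-page bvec. Using
   * page_pgoff() instead of page->index makes this correct for
   * non-head pages of a large folio as well.
   */
  static u64 file_offset_from_bvec(const struct bio_vec *bvec)
  {
          const struct page *page = bvec->bv_page;
          const struct folio *folio = page_folio(page);

          return ((u64)page_pgoff(folio, page) << PAGE_SHIFT) +
                 bvec->bv_offset;
  }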

Reviewed-by: Filipe Manana <[email protected]>
Reviewed-by: Sweet Tea Dorminy <[email protected]>
Reviewed-by: Boris Burkov <[email protected]>
Signed-off-by: Qu Wenruo <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Instead of using __clear_extent_bit() we can use clear_extent_bit() since
we pass a NULL value for the changeset argument.

Reviewed-by: Boris Burkov <[email protected]>
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Instead of using __clear_extent_bit() we can use clear_extent_bits() since
we pass a NULL value for the cached and changeset arguments.

Reviewed-by: Boris Burkov <[email protected]>
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
…ssible

Several places are using clear_extent_bit() and passing a NULL value for
the 'cached' argument, which is pointless as they can instead use
clear_extent_bits().
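
For context, the helpers in this family nest roughly like this
(simplified from fs/btrfs/extent-io-tree.h, signatures approximate); this
and the previous two patches simply call the highest-level wrapper whose
defaults match the NULL arguments being passed:

  /* Full-featured variant: cached state and changeset are optional. */
  int __clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
                         u32 bits, struct extent_state **cached,
                         struct extent_changeset *changeset);

  /* No changeset. */
  static inline int clear_extent_bit(struct extent_io_tree *tree,
                                     u64 start, u64 end, u32 bits,
                                     struct extent_state **cached)
  {
          return __clear_extent_bit(tree, start, end, bits, cached, NULL);
  }

  /* No changeset and no cached state. */
  static inline int clear_extent_bits(struct extent_io_tree *tree,
                                      u64 start, u64 end, u32 bits)
  {
          return clear_extent_bit(tree, start, end, bits, NULL);
  }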

Reviewed-by: Boris Burkov <[email protected]>
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Instead of keeping track of the minimum start offset of the next record,
and detecting overflow every time we update that offset to the current
record's end offset plus one, we can simply exit when the current record
ends at or beyond our end offset. That way we no longer need to update
the start offset on every iteration and test it at the top of the loop.
This makes both the source code and assembly code simpler, more efficient
and shorter (reducing the object text size).

Also remove the pointless initialization to NULL of the state variable, as
we don't use it before the first assignment to it. This may help avoid
some warnings with clang tools such as the one reported/fixed by commit
966de47 ("btrfs: remove redundant initialization of variables in
log_new_ancestors").

Reviewed-by: Boris Burkov <[email protected]>
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
There are several things wrong with the documentation:

1) At the top it's only mentioned that we search for an entry containing
   the given offset, but when such an entry does not exist we search for
   the first entry that starts and ends after that offset;

2) It mentions that @node_ret and @parent_ret aren't changed if the
   returned entry contains the given offset - that is true only if the
   returned entry starts exactly at @offset, otherwise those arguments
   are changed;

3) It mentions that if no entry containing the offset is found then we
   return the first entry ending before the offset - that is not true,
   we return the first entry that starts and ends after that offset;

4) It also mentions that NULL is never returned. This is false, as when
   there's no entry containing the offset nor any entry that starts and
   ends after the offset, NULL is returned.

So fix the documentation.

Signed-off-by: Filipe Manana <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
The tree_search() function always returns either an entry that contains
the search offset or the first entry in the tree that starts after the
offset. So checking at find_first_extent_bit_state() whether the returned
entry ends at or after the search offset is pointless. Remove the check.

Signed-off-by: Filipe Manana <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
The overflow detection for the start offset of the next record is not
really necessary; we can just stop iterating if the current record ends
at or after our end offset. This removes the need to test if the current
record's end offset is (u64)-1 and to check if adding 1 to the current
end offset results in 0.

By testing only if the current record ends at or after the end offset, we
also no longer need to test the new start offset at the head of the while
loop.

This makes both the source code and assembly code simpler, more efficient
and shorter (reducing the object text size).

Also remove the pointless initialization to NULL of the state variable, as
we don't use it before the first assignment to it. This may help avoid
some warnings with clang tools such as the one reported/fixed by commit
966de47 ("btrfs: remove redundant initialization of variables in
log_new_ancestors").

Signed-off-by: Filipe Manana <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
It's pointless to check if the current record's start offset is greater
than the end offset: we just tested whether it was greater than the start
offset, and if it's not, it is less than or equal to the start offset, so
it cannot be greater than the end offset, since our start offset is
always smaller than the end offset.

So remove that check and also add an assertion to verify the start offset
is smaller than the end offset.

Signed-off-by: Filipe Manana <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
This is the most trivial pattern for the auto freeing: the variable is
declared with the macro and the final btrfs_free_path() is removed.
There are almost no goto -> return conversions and there's no other
function cleanup. A sketch of the pattern follows.
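
For reference, the pattern is built on the kernel's scope-based cleanup
helpers from <linux/cleanup.h>; a rough sketch (the btrfs macro looks
approximately like this, and the two functions are hypothetical
examples):

  DEFINE_FREE(btrfs_free_path, struct btrfs_path *, btrfs_free_path(_T))
  #define BTRFS_PATH_AUTO_FREE(path_name) \
          struct btrfs_path *path_name __free(btrfs_free_path) = NULL

  /* Before: every exit path must free explicitly. */
  static int some_func_before(struct btrfs_root *root)
  {
          struct btrfs_path *path = btrfs_alloc_path();
          int ret = 0;

          /* ... use path ... */
          btrfs_free_path(path);
          return ret;
  }

  /* After: the path is freed automatically when it goes out of scope
   * (btrfs_free_path() itself tolerates NULL). */
  static int some_func_after(struct btrfs_root *root)
  {
          BTRFS_PATH_AUTO_FREE(path);
          int ret = 0;

          path = btrfs_alloc_path();
          /* ... use path ... */
          return ret;
  }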

Reviewed-by: Daniel Vacek <[email protected]>
Signed-off-by: David Sterba <[email protected]>
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.

Reviewed-by: Daniel Vacek <[email protected]>
Signed-off-by: David Sterba <[email protected]>
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.

Reviewed-by: Daniel Vacek <[email protected]>
Signed-off-by: David Sterba <[email protected]>
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.

Reviewed-by: Daniel Vacek <[email protected]>
Signed-off-by: David Sterba <[email protected]>
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.

Reviewed-by: Daniel Vacek <[email protected]>
Signed-off-by: David Sterba <[email protected]>
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.

Reviewed-by: Daniel Vacek <[email protected]>
Signed-off-by: David Sterba <[email protected]>
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.

Reviewed-by: Daniel Vacek <[email protected]>
Signed-off-by: David Sterba <[email protected]>
There was a bug report about a NULL pointer dereference in
__btrfs_add_free_space_zoned() that ultimately happens because of a
conversion from the default metadata profile DUP to a RAID1 profile on
two disks.

The stack trace has the following signature:

   BTRFS error (device sdc): zoned: write pointer offset mismatch of zones in raid1 profile
   BUG: kernel NULL pointer dereference, address: 0000000000000058
   #PF: supervisor read access in kernel mode
   #PF: error_code(0x0000) - not-present page
   PGD 0 P4D 0
   Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
   RIP: 0010:__btrfs_add_free_space_zoned.isra.0+0x61/0x1a0
   RSP: 0018:ffffa236b6f3f6d0 EFLAGS: 00010246
   RAX: 0000000000000000 RBX: ffff96c8132f3400 RCX: 0000000000000001
   RDX: 0000000010000000 RSI: 0000000000000000 RDI: ffff96c8132f3410
   RBP: 0000000010000000 R08: 0000000000000003 R09: 0000000000000000
   R10: 0000000000000000 R11: 00000000ffffffff R12: 0000000000000000
   R13: ffff96c758f65a40 R14: 0000000000000001 R15: 000011aac0000000
   FS: 00007fdab1cb2900(0000) GS:ffff96e60ca00000(0000) knlGS:0000000000000000
   CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
   CR2: 0000000000000058 CR3: 00000001a05ae000 CR4: 0000000000350ef0
   Call Trace:
   <TASK>
   ? __die_body.cold+0x19/0x27
   ? page_fault_oops+0x15c/0x2f0
   ? exc_page_fault+0x7e/0x180
   ? asm_exc_page_fault+0x26/0x30
   ? __btrfs_add_free_space_zoned.isra.0+0x61/0x1a0
   btrfs_add_free_space_async_trimmed+0x34/0x40
   btrfs_add_new_free_space+0x107/0x120
   btrfs_make_block_group+0x104/0x2b0
   btrfs_create_chunk+0x977/0xf20
   btrfs_chunk_alloc+0x174/0x510
   ? srso_return_thunk+0x5/0x5f
   btrfs_inc_block_group_ro+0x1b1/0x230
   btrfs_relocate_block_group+0x9e/0x410
   btrfs_relocate_chunk+0x3f/0x130
   btrfs_balance+0x8ac/0x12b0
   ? srso_return_thunk+0x5/0x5f
   ? srso_return_thunk+0x5/0x5f
   ? __kmalloc_cache_noprof+0x14c/0x3e0
   btrfs_ioctl+0x2686/0x2a80
   ? srso_return_thunk+0x5/0x5f
   ? ioctl_has_perm.constprop.0.isra.0+0xd2/0x120
   __x64_sys_ioctl+0x97/0xc0
   do_syscall_64+0x82/0x160
   ? srso_return_thunk+0x5/0x5f
   ? __memcg_slab_free_hook+0x11a/0x170
   ? srso_return_thunk+0x5/0x5f
   ? kmem_cache_free+0x3f0/0x450
   ? srso_return_thunk+0x5/0x5f
   ? srso_return_thunk+0x5/0x5f
   ? syscall_exit_to_user_mode+0x10/0x210
   ? srso_return_thunk+0x5/0x5f
   ? do_syscall_64+0x8e/0x160
   ? sysfs_emit+0xaf/0xc0
   ? srso_return_thunk+0x5/0x5f
   ? srso_return_thunk+0x5/0x5f
   ? seq_read_iter+0x207/0x460
   ? srso_return_thunk+0x5/0x5f
   ? vfs_read+0x29c/0x370
   ? srso_return_thunk+0x5/0x5f
   ? srso_return_thunk+0x5/0x5f
   ? syscall_exit_to_user_mode+0x10/0x210
   ? srso_return_thunk+0x5/0x5f
   ? do_syscall_64+0x8e/0x160
   ? srso_return_thunk+0x5/0x5f
   ? exc_page_fault+0x7e/0x180
   entry_SYSCALL_64_after_hwframe+0x76/0x7e
   RIP: 0033:0x7fdab1e0ca6d
   RSP: 002b:00007ffeb2b60c80 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
   RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fdab1e0ca6d
   RDX: 00007ffeb2b60d80 RSI: 00000000c4009420 RDI: 0000000000000003
   RBP: 00007ffeb2b60cd0 R08: 0000000000000000 R09: 0000000000000013
   R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
   R13: 00007ffeb2b6343b R14: 00007ffeb2b60d80 R15: 0000000000000001
   </TASK>
   CR2: 0000000000000058
   ---[ end trace 0000000000000000 ]---

The 1st line is the most interesting here:

 BTRFS error (device sdc): zoned: write pointer offset mismatch of zones in raid1 profile

When a RAID1 block group is created and a write pointer mismatch between
the disks in the RAID set is detected, btrfs sets the alloc_offset to the
length of the block group, marking it as full. Afterwards the code
expects that a balance operation will evacuate the data in this block
group and repair the problems.

But before this is possible, the new space of this block group will be
accounted in the free space cache. In __btrfs_add_free_space_zoned() it
is checked whether this is the initial creation of a block group, and if
not, a reclaim decision is made. Whether a block group's free space
accounting is treated as an initial creation depends on the size of the
added free space being the whole length of the block group and the
allocation offset being 0.

But as btrfs_load_block_group_zone_info() sets the allocation offset to
the zone capacity (i.e. marks the block group as full), this
initial-creation condition is not met, and the space_info pointer in
'struct btrfs_block_group' has not yet been assigned, hence the NULL
pointer dereference.

Fail creation of the block group and rely on manual user intervention to
re-balance the filesystem.

Afterwards the filesystem can be unmounted, mounted in degraded mode and
the missing device can be removed after a full balance of the filesystem.
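
A heavily simplified sketch of the behavioral change inside
btrfs_load_block_group_zone_info() (not the literal diff):

  if (zone_info[0].alloc_offset != zone_info[1].alloc_offset) {
          btrfs_err(fs_info,
  "zoned: write pointer offset mismatch of zones in raid1 profile");
          /*
           * Before: the block group was marked full and left for a
           * later balance to repair, but free space accounting then
           * ran before bg->space_info was assigned and crashed.
           *
           * After: refuse to create the block group.
           */
          return -EIO;
  }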

Reported-by: 西木野羰基 <[email protected]>
Link: https://lore.kernel.org/linux-btrfs/CAB_b4sBhDe3tscz=duVyhc9hNE+gu=B8CrgLO152uMyanR8BEA@mail.gmail.com/
Fixes: b1934cd ("btrfs: zoned: handle broken write pointer on zones")
Reviewed-by: Anand Jain <[email protected]>
Signed-off-by: Johannes Thumshirn <[email protected]>
Signed-off-by: David Sterba <[email protected]>
The again label is here to retry getting the folio for the current index.
When jumping to that label, the iterator is not incremented, so the label
can be replaced by a simple "continue" and removed, as sketched below.
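
The shape of the change, sketched with hypothetical loop variables; the
index is only advanced at the bottom of the loop, which is why "continue"
is equivalent to the goto:

  /* Before */
  while (index <= last_index) {
  again:
          folio = filemap_get_folio(mapping, index);
          if (IS_ERR(folio))
                  goto again;     /* retry the same index */
          /* ... use folio ... */
          index++;
  }

  /* After */
  while (index <= last_index) {
          folio = filemap_get_folio(mapping, index);
          if (IS_ERR(folio))
                  continue;       /* index unchanged, same retry */
          /* ... use folio ... */
          index++;
  }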

Signed-off-by: Qu Wenruo <[email protected]>
Currently the function put_file_data() can only accept page-sized folios.

However, the function itself is not that complex; it's just copying data
from a filemap folio into the send buffer.

So make it support larger data folios (a sketch follows the list) by:

- Change the loop to use file offset instead of page index

- Calculate @pg_offset and @cur_len after getting the folio

- Remove the "WARN_ON(folio_order(folio));" line
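
A sketch of the reworked loop under those three points (simplified,
error handling elided; sctx->send_buf and sctx->send_size are the send
context fields used by send.c, the rest is an approximation):

  u64 cur = offset;

  while (cur < offset + len) {
          struct folio *folio;
          size_t pg_offset;
          size_t cur_len;

          /* May be a large folio; may also fail (handling elided). */
          folio = filemap_get_folio(mapping, cur >> PAGE_SHIFT);

          /* Compute position and length inside whatever folio we got. */
          pg_offset = offset_in_folio(folio, cur);
          cur_len = min_t(u64, offset + len - cur,
                          folio_size(folio) - pg_offset);

          memcpy_from_folio(sctx->send_buf + sctx->send_size,
                            folio, pg_offset, cur_len);
          folio_put(folio);

          sctx->send_size += cur_len;
          cur += cur_len;
  }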

Signed-off-by: Qu Wenruo <[email protected]>
The function btrfs_page_mkwrite() has an explicit ASSERT() checking the
folio order.

To make it support larger data folios, we need to:

- Remove the ASSERT(folio_order(folio) == 0)

- Use folio_contains() to check if the folio covers the last page

Otherwise the code already supports larger folios well; the membership
test is sketched below.
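
The membership test, sketched (the surrounding logic and the
zero_beyond_eof variable are hypothetical; only the check matters):

  pgoff_t last_index = (i_size_read(inode) - 1) >> PAGE_SHIFT;

  /* Before: single-page folios only, so compare the index directly. */
  ASSERT(folio_order(folio) == 0);
  if (folio->index == last_index)
          zero_beyond_eof = true;

  /* After: a large folio may cover the last page without starting at
   * it, so ask the folio whether it contains that index. */
  if (folio_contains(folio, last_index))
          zero_beyond_eof = true;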

Signed-off-by: Qu Wenruo <[email protected]>
The only blocker is the ASSERT() rejecting larger folios; just remove
it.

Signed-off-by: Qu Wenruo <[email protected]>
The function does an ASSERT() checking the folio order, but all later
functions handle larger folios properly, thus we can safely remove that
ASSERT().

Signed-off-by: Qu Wenruo <[email protected]>
The subpage handling code has two locations not supporting larger
folios:

- btrfs_attach_subpage()
  This does a metadata-specific ASSERT() check.

  But for the future larger data folio support, that check is too
  generic. Since it's metadata specific, only do the ASSERT() for
  metadata (see the sketch below).

- btrfs_subpage_assert()
  Just remove the "ASSERT(folio_order(folio) == 0)" check.
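
The narrowed check in btrfs_attach_subpage(), sketched (type is the
function's enum btrfs_subpage_type parameter; surrounding code omitted):

  /* Only metadata folios are still required to be single-page here;
   * data folios may be large. */
  if (type == BTRFS_SUBPAGE_METADATA)
          ASSERT(folio_order(folio) == 0);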

Signed-off-by: Qu Wenruo <[email protected]>
The function itself already takes larger folios into consideration; just
remove the ASSERT(!folio_test_large()) line.

Signed-off-by: Qu Wenruo <[email protected]>