Rst delete ci #1447
Closed
Conversation
Fix a typo in comments. Signed-off-by: Andrew Kreimer <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Carlos Maiolino <[email protected]>
The VLAN table is shared memory between the two ports/slices in an ICSSG cluster, and this may lead to a race condition when the common code paths for both ports are executed on different CPUs. Fix the race condition by locking the shared memory accesses. Fixes: 487f732 ("net: ti: icssg-prueth: Add helper functions to configure FDB") Signed-off-by: MD Danish Anwar <[email protected]> Reviewed-by: Roger Quadros <[email protected]> Signed-off-by: David S. Miller <[email protected]>
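A minimal sketch of the locking pattern described here, with hypothetical structure and field names (the actual driver keeps the table in ICSSG shared memory and uses its own symbols):

```c
#include <linux/spinlock.h>

/* Hypothetical layout: one VLAN table shared by both slices of a cluster. */
struct icssg_shared {
	spinlock_t vtbl_lock;	/* added by the fix: serializes table access */
	u8 vlan_table[4096];	/* shared memory, one entry per VLAN ID */
};

static void vtbl_update_entry(struct icssg_shared *sh, u16 vid, u8 port_mask)
{
	/* Without the lock, both ports can interleave this read-modify-write
	 * when the common code path runs on different CPUs. */
	spin_lock(&sh->vtbl_lock);
	sh->vlan_table[vid] |= port_mask;
	spin_unlock(&sh->vtbl_lock);
}
```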
…one info At btrfs_load_zone_info() we have an error path that dereferences the name of a device, which is an RCU string, but we are not holding an RCU read lock, which is incorrect. Fix this by using btrfs_err_in_rcu() instead of btrfs_err(). The problem has been there since commit 08e11a3 ("btrfs: zoned: load zone's allocation offset"), back then at btrfs_load_block_group_zone_info(), but later that code was factored out into the helper btrfs_load_zone_info() by commit 09a4672 ("btrfs: zoned: factor out per-zone logic from btrfs_load_block_group_zone_info"). Fixes: 08e11a3 ("btrfs: zoned: load zone's allocation offset") Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Qu Wenruo <[email protected]> Reviewed-by: Naohiro Aota <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
This commit is a replay of commit 6252690 ("btrfs: fix invalid mapping of extent xarray state"). We need to call btrfs_folio_clear_dirty() before btrfs_set_range_writeback(), so that the xarray DIRTY tag is cleared. The refactoring in commit 8189197 ("btrfs: refactor __extent_writepage_io() to do sector-by-sector submission") reversed the order again, causing the same hang. Fix the ordering now in submit_one_sector(). Fixes: 8189197 ("btrfs: refactor __extent_writepage_io() to do sector-by-sector submission") Reviewed-by: Qu Wenruo <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Signed-off-by: Naohiro Aota <[email protected]> Signed-off-by: David Sterba <[email protected]>
All MMD reads return 0 for the RTL8126A-integrated PHY. Therefore phylib assumes it doesn't support EEE, which results in higher power consumption, and a significantly higher chip temperature in my case. To fix this, split out the PHY driver for the RTL8126A-integrated PHY and set the read_mmd/write_mmd callbacks to read from vendor-specific registers. Fixes: 5befa37 ("net: phy: realtek: add support for RTL8126A-integrated 5Gbps PHY") Cc: [email protected] Signed-off-by: Heiner Kallweit <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Currently this driver prints this line with what looks like a rogue format specifier when the device is probed: [ 2.840000] eth%d: MVME147 at 0xfffe1800, irq 12, Hardware Address xx:xx:xx:xx:xx:xx Replace the printk() with netdev_info() and move it to after registration has completed, so that it prints the interface name properly. Signed-off-by: Daniel Palmer <[email protected]> Reviewed-by: Simon Horman <[email protected]> Signed-off-by: David S. Miller <[email protected]>
…n_start If hashing fails in sctp_listen_start(), the socket remains in the LISTENING state even though it was not added to the hash table. This can lead to a scenario where a socket appears to be listening without actually being accessible. This patch ensures that if the hashing operation fails, the sk_state is set back to CLOSED before returning an error. Note that there is no need to undo the autobind operation if hashing fails, as the bound port can still be used for the next listen() call on the same socket. Fixes: 76c6d98 ("sctp: add sock_reuseport for the sock in __sctp_hash_endpoint") Reported-by: Marcelo Ricardo Leitner <[email protected]> Signed-off-by: Xin Long <[email protected]> Acked-by: Marcelo Ricardo Leitner <[email protected]> Signed-off-by: David S. Miller <[email protected]>
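A condensed sketch of the corrected error path; SCTP_SS_LISTENING, SCTP_SS_CLOSED and sctp_hash_endpoint() are the real kernel symbols, while the surrounding function shape is simplified:

```c
/* Simplified from the shape of sctp_listen_start() after the fix. */
static int listen_start_sketch(struct sock *sk, struct sctp_endpoint *ep)
{
	int err;

	sk->sk_state = SCTP_SS_LISTENING;
	err = sctp_hash_endpoint(ep);
	if (err) {
		/* The fix: the socket was never hashed, so it must not
		 * stay in the LISTENING state. The autobound port is kept
		 * for a later listen() on the same socket. */
		sk->sk_state = SCTP_SS_CLOSED;
		return err;
	}
	return 0;
}
```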
Yisen Zhuang left the company in September. Jian Shen will be responsible for maintaining the hns3/hns driver's code in the future, so add Jian Shen to the hns3/hns driver's maintainer list. Signed-off-by: Jijie Shao <[email protected]> Reviewed-by: Simon Horman <[email protected]> Signed-off-by: David S. Miller <[email protected]>
The boot mapped ring buffer has its buffer mapped at a fixed location found at boot up. It is not dynamic. It cannot grow or be expanded when new CPUs come online. Do not hook fixed memory mapped ring buffers to the CPU hotplug callback, otherwise it can cause a crash when it tries to add the buffer to the memory that is already fully occupied. Cc: Masami Hiramatsu <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Link: https://lore.kernel.org/[email protected] Fixes: be68d63 ("ring-buffer: Add ring_buffer_alloc_range()") Signed-off-by: Steven Rostedt (Google) <[email protected]>
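A sketch of the guard this implies in the buffer allocation path; the field and hotplug-state names follow the upstream ring buffer code but should be read as illustrative:

```c
/* Only dynamically allocated buffers can grow when a CPU comes online;
 * a boot-mapped buffer (range_addr_start != 0) occupies fixed memory. */
if (!buffer->range_addr_start) {
	ret = cpuhp_state_add_instance(CPUHP_TRACE_RB_PREPARE,
				       &buffer->node);
	if (ret < 0)
		goto fail_free_buffers;
}
```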
…ation A user reported that commit aa3998d ("ata: libata-scsi: Disable scsi device manage_system_start_stop") introduced a spin down + immediate spin up of the disk both when entering and when resuming from hibernation. This behavior was not there before, and causes an increased latency both when entering and when resuming from hibernation. Hibernation is done by three consecutive PM events, in the following order: 1) PM_EVENT_FREEZE 2) PM_EVENT_THAW 3) PM_EVENT_HIBERNATE Commit aa3998d ("ata: libata-scsi: Disable scsi device manage_system_start_stop") modified ata_eh_handle_port_suspend() to call ata_dev_power_set_standby() (which spins down the disk), for both event PM_EVENT_FREEZE and event PM_EVENT_HIBERNATE. Documentation/driver-api/pm/devices.rst, section "Entering Hibernation", explicitly mentions that PM_EVENT_FREEZE does not have to put the device in a low-power state, and actually recommends not doing so. Thus, let's not spin down the disk on PM_EVENT_FREEZE. (The disk will instead be spun down during the subsequent PM_EVENT_HIBERNATE event.) This way, PM_EVENT_FREEZE will behave as it did before commit aa3998d ("ata: libata-scsi: Disable scsi device manage_system_start_stop"), while PM_EVENT_HIBERNATE will continue to spin down the disk. This will avoid the superfluous spin down + spin up when entering and resuming from hibernation, while still making sure that the disk is spun down before actually entering hibernation. Cc: [email protected] # v6.6+ Fixes: aa3998d ("ata: libata-scsi: Disable scsi device manage_system_start_stop") Reviewed-by: Damien Le Moal <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Niklas Cassel <[email protected]>
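The resulting logic in the suspend path can be pictured like this (a sketch; the real check lives inside ata_eh_handle_port_suspend() and uses the libata iteration macros shown):

```c
/* Spin the disks down for suspend/hibernate, but not for the FREEZE
 * phase of hibernation: THAW follows immediately, and the subsequent
 * HIBERNATE event performs the spin-down that actually matters. */
if (ap->pm_mesg.event != PM_EVENT_FREEZE) {
	ata_for_each_link(link, ap, HOST_FIRST)
		ata_for_each_dev(dev, link, ENABLED)
			ata_dev_power_set_standby(dev);
}
```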
Using the device-managed version allows simplifying the cleanup in the probe() error path. Additionally, the device-managed version ensures proper cleanup, which helps to resolve memory errors, page faults, btrfs going read-only, and btrfs disk corruption. Fixes: 4b2c53d ("SFH:Transport Driver to add support of AMD Sensor Fusion Hub (SFH)") Tested-by: Chris Hixon <[email protected]> Tested-by: Richard <[email protected]> Tested-by: Skyler <[email protected]> Reported-by: Chris Hixon <[email protected]> Closes: https://lore.kernel.org/all/[email protected]/ Link: https://bugzilla.kernel.org/show_bug.cgi?id=219331 Signed-off-by: Basavaraj Natikar <[email protected]> Signed-off-by: Jiri Kosina <[email protected]>
The sched_ext selftest suite is missing proper cross-compilation support, a proper target entry, and out-of-tree build support. When building the kselftest suite, e.g.: make ARCH=riscv CROSS_COMPILE=riscv64-linux-gnu- \ TARGETS=sched_ext SKIP_TARGETS="" O=/output/foo \ -C tools/testing/selftests install or: make ARCH=arm64 LLVM=1 TARGETS=sched_ext SKIP_TARGETS="" \ O=/output/foo -C tools/testing/selftests install The expectation is that the sched_ext suite is included and cross-built, that the correct toolchain is picked up, and that the artifacts are placed into /output/foo. In contrast to the BPF selftests, the sched_ext suite does not use bpftool at test run-time, so it is sufficient to build bpftool for the build host only. Add ARCH, CROSS_COMPILE, OUTPUT, and TARGETS support to the sched_ext selftest. Also, remove some variables that were unused by the Makefile. Signed-off-by: Björn Töpel <[email protected]> Reviewed-by: Shuah Khan <[email protected]> Acked-by: David Vernet <[email protected]> Tested-by: Mark Brown <[email protected]> Reviewed-by: Mark Brown <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
We don't need to handle them separately. Instead, just let them decompose/casefold to themselves. Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
…nux/kernel/git/krisman/unicode Pull unicode fix from Gabriel Krisman Bertazi: - Handle code-points with the Ignorable property as regular characters instead of treating them as an empty string (me) * tag 'unicode-fixes-6.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/krisman/unicode: unicode: Don't special case ignorable code points
After the delegation is returned to the NFS server remove it from the server's delegations list to reduce the time it takes to scan this list. Network trace captured while running the below script shows the time taken to service the CB_RECALL increases gradually due to the overhead of traversing the delegation list in nfs_delegation_find_inode_server. The NFS server in this test is a Solaris server which issues CB_RECALL when receiving the all-zero stateid in the SETATTR. mount=/mnt/data for i in $(seq 1 20) do echo $i mkdir $mount/testtarfile$i time tar -C $mount/testtarfile$i -xf 5000_files.tar done Signed-off-by: Dai Ngo <[email protected]> Reviewed-by: Trond Myklebust <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
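The change amounts to unhooking the delegation once it has been returned, so that nfs_delegation_find_inode_server() scans a shorter list. A sketch, assuming the usual cl_lock/super_list naming in the NFS client (treat the exact symbols as illustrative):

```c
/* After the DELEGRETURN completes, detach the delegation from the
 * server's list so later CB_RECALL lookups skip it entirely. */
spin_lock(&clp->cl_lock);
list_del_rcu(&delegation->super_list);
spin_unlock(&clp->cl_lock);
```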
Disabling preemption in the GRU driver is unnecessary, and clashes with sleeping locks in several code paths. Remove preempt_disable and preempt_enable from the GRU driver. Signed-off-by: Dimitri Sivanich <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
Patch series "remove PF_MEMALLOC_NORECLAIM" v3. This patch (of 2): bch2_new_inode relies on PF_MEMALLOC_NORECLAIM to try to allocate a new inode to achieve GFP_NOWAIT semantics while holding locks. If this allocation fails it will drop locks and use a GFP_NOFS allocation context. We would like to drop PF_MEMALLOC_NORECLAIM because it is really dangerous to use if the caller doesn't control the full call chain with this flag set. E.g. if any of the functions down the chain needed a GFP_NOFAIL request, PF_MEMALLOC_NORECLAIM would override this and cause an unexpected failure. While this is not the case here, using the scoped gfp semantics is not really needed because we can easily push the allocation context down the chain without too much clutter. [[email protected]: fix kerneldoc warnings] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Dave Chinner <[email protected]> Reviewed-by: Jan Kara <[email protected]> # For vfs changes Cc: Al Viro <[email protected]> Cc: Christian Brauner <[email protected]> Cc: James Morris <[email protected]> Cc: Kent Overstreet <[email protected]> Cc: Paul Moore <[email protected]> Cc: Serge E. Hallyn <[email protected]> Cc: Yafang Shao <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
This reverts commit eab0af9. There is no existing user of those flags. PF_MEMALLOC_NOWARN is dangerous because a nested allocation context can use GFP_NOFAIL which could cause unexpected failure. Such code would be hard to maintain because it could be deeper in the call chain. PF_MEMALLOC_NORECLAIM has been added even when it was pointed out [1] that such an allocation context is inherently unsafe if the context doesn't fully control all allocations called from this context. While PF_MEMALLOC_NOWARN is not dangerous the way PF_MEMALLOC_NORECLAIM is, it doesn't have any users, and as Matthew has pointed out we are running out of those flags, so better to reclaim it while it has no real users. [1] https://lore.kernel.org/all/ZcM0xtlKbAOFjv5n@tiehlicka/ Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Reviewed-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Dave Chinner <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Cc: Al Viro <[email protected]> Cc: Christian Brauner <[email protected]> Cc: James Morris <[email protected]> Cc: Jan Kara <[email protected]> Cc: Kent Overstreet <[email protected]> Cc: Paul Moore <[email protected]> Cc: Serge E. Hallyn <[email protected]> Cc: Yafang Shao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
Calling into kthread unparking unconditionally is mostly harmless when the kthread is already unparked. The wake up is then simply ignored because the target is not in TASK_PARKED state. However if the kthread is per CPU, the wake up is preceded by a call to kthread_bind() which expects the task to be inactive and in TASK_PARKED state, which obviously isn't the case if it is unparked. As a result, calling kthread_stop() on an unparked per-cpu kthread triggers such a warning: WARNING: CPU: 0 PID: 11 at kernel/kthread.c:525 __kthread_bind_mask kernel/kthread.c:525 <TASK> kthread_stop+0x17a/0x630 kernel/kthread.c:707 destroy_workqueue+0x136/0xc40 kernel/workqueue.c:5810 wg_destruct+0x1e2/0x2e0 drivers/net/wireguard/device.c:257 netdev_run_todo+0xe1a/0x1000 net/core/dev.c:10693 default_device_exit_batch+0xa14/0xa90 net/core/dev.c:11769 ops_exit_list net/core/net_namespace.c:178 [inline] cleanup_net+0x89d/0xcc0 net/core/net_namespace.c:640 process_one_work kernel/workqueue.c:3231 [inline] process_scheduled_works+0xa2c/0x1830 kernel/workqueue.c:3312 worker_thread+0x86d/0xd70 kernel/workqueue.c:3393 kthread+0x2f0/0x390 kernel/kthread.c:389 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 </TASK> Fix this by skipping the unnecessary unparking while stopping a kthread. Link: https://lkml.kernel.org/r/[email protected] Fixes: 5c25b5f ("workqueue: Tag bound workers with KTHREAD_IS_PER_CPU") Signed-off-by: Frederic Weisbecker <[email protected]> Reported-by: [email protected] Tested-by: [email protected] Suggested-by: Thomas Gleixner <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Tejun Heo <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
pgoff should be aligned using ALIGN_DOWN() instead of ALIGN(). Otherwise, a vmf->address not aligned to fault_size will be aligned to the next alignment, which can result in memory-failure getting the wrong address. It's a subtle situation that can only be observed in page_mapped_in_vma() after the page fault is handled by dev_dax_huge_fault. Generally, there is little chance to perform page_mapped_in_vma on a dev-dax page unless in specific error injection to the dax device to trigger an MCE - memory-failure. In that case, page_mapped_in_vma() will be triggered to determine which task is accessing the failure address and kill that task in the end. We used a self-developed dax device (with a 2M-aligned mapping) to perform error injection to random addresses. It turned out that errors injected to non-2M-aligned addresses were causing endless MCEs until panic, because page_mapped_in_vma() kept returning the wrong address and the task accessing the failure address was never killed properly: [ 3783.719419] Memory failure: 0x200c9742: recovery action for dax page: Recovered [ 3784.049006] mce: Uncorrected hardware memory error in user-access at 200c9742380 [ 3784.049190] Memory failure: 0x200c9742: recovery action for dax page: Recovered [ 3784.448042] mce: Uncorrected hardware memory error in user-access at 200c9742380 [ 3784.448186] Memory failure: 0x200c9742: recovery action for dax page: Recovered [ 3784.792026] mce: Uncorrected hardware memory error in user-access at 200c9742380 [ 3784.792179] Memory failure: 0x200c9742: recovery action for dax page: Recovered [ 3785.162502] mce: Uncorrected hardware memory error in user-access at 200c9742380 [ 3785.162633] Memory failure: 0x200c9742: recovery action for dax page: Recovered [ 3785.461116] mce: Uncorrected hardware memory error in user-access at 200c9742380 [ 3785.461247] Memory failure: 0x200c9742: recovery action for dax page: Recovered [ 3785.764730] mce: Uncorrected hardware memory error in user-access at 200c9742380 [ 3785.764859] Memory failure: 0x200c9742: recovery action for dax page: Recovered [ 3786.042128] mce: Uncorrected hardware memory error in user-access at 200c9742380 [ 3786.042259] Memory failure: 0x200c9742: recovery action for dax page: Recovered [ 3786.464293] mce: Uncorrected hardware memory error in user-access at 200c9742380 [ 3786.464423] Memory failure: 0x200c9742: recovery action for dax page: Recovered [ 3786.818090] mce: Uncorrected hardware memory error in user-access at 200c9742380 [ 3786.818217] Memory failure: 0x200c9742: recovery action for dax page: Recovered [ 3787.085297] mce: Uncorrected hardware memory error in user-access at 200c9742380 [ 3787.085424] Memory failure: 0x200c9742: recovery action for dax page: Recovered It took us several weeks to pinpoint this problem, but we eventually used bpftrace to trace the page fault and MCE address and successfully identified the issue. Joao added: "Likely we never reproduce in production because we always pin device-dax regions in the region align they provide (Qemu does similarly with prealloc in hugetlb/file backed memory). I think this bug requires that we touch *unpinned* device-dax regions unaligned to the device-dax selected alignment (page size i.e. 4K/2M/1G)" Link: https://lkml.kernel.org/r/23c02a03e8d666fef11bbe13e85c69c8b4ca0624.1727421694.git.llfl@linux.alibaba.com Fixes: b9b5777 ("device-dax: use ALIGN() for determining pgoff") Signed-off-by: Kun(llfl) <[email protected]> Tested-by: JianXiong Zhao <[email protected]> Reviewed-by: Joao Martins <[email protected]> Cc: Dan Williams <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
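The arithmetic behind the fix is easy to demonstrate in userspace. This small standalone program (using the kernel's macro definitions and the fault address from the log above) shows how ALIGN() jumps past the faulting address while ALIGN_DOWN() stays inside the containing 2M page:

```c
#include <stdio.h>

#define ALIGN(x, a)       (((x) + (a) - 1) & ~((a) - 1))
#define ALIGN_DOWN(x, a)  ((x) & ~((a) - 1))

int main(void)
{
	unsigned long fault_size = 2UL << 20;	/* 2M dev-dax alignment */
	unsigned long addr = 0x200c9742380UL;	/* unaligned fault address */

	/* ALIGN() rounds up to the *next* 2M boundary: wrong page reported. */
	printf("ALIGN:      %#lx\n", ALIGN(addr, fault_size));
	/* ALIGN_DOWN() yields the start of the huge page that faulted. */
	printf("ALIGN_DOWN: %#lx\n", ALIGN_DOWN(addr, fault_size));
	return 0;
}
```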
The hmm2 double_map test was failing due to an incorrect buffer->mirror size. The buffer->mirror size was 6, while buffer->ptr size was 6 * PAGE_SIZE. The test failed because the kernel's copy_to_user function was attempting to copy a 6 * PAGE_SIZE buffer to buffer->mirror. Since the size of buffer->mirror was incorrect, copy_to_user failed. This patch corrects the buffer->mirror size to 6 * PAGE_SIZE. Test Result without this patch ============================== # RUN hmm2.hmm2_device_private.double_map ... # hmm-tests.c:1680:double_map:Expected ret (-14) == 0 (0) # double_map: Test terminated by assertion # FAIL hmm2.hmm2_device_private.double_map not ok 53 hmm2.hmm2_device_private.double_map Test Result with this patch =========================== # RUN hmm2.hmm2_device_private.double_map ... # OK hmm2.hmm2_device_private.double_map ok 53 hmm2.hmm2_device_private.double_map Link: https://lkml.kernel.org/r/[email protected] Fixes: fee9f6d ("mm/hmm/test: add selftests for HMM") Signed-off-by: Donet Tom <[email protected]> Reviewed-by: Muhammad Usama Anjum <[email protected]> Cc: Jérôme Glisse <[email protected]> Cc: Kees Cook <[email protected]> Cc: Mark Brown <[email protected]> Cc: Przemek Kitszel <[email protected]> Cc: Ritesh Harjani (IBM) <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Ralph Campbell <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
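In selftest terms the fix is a one-liner; a sketch of the allocation as it should look, assuming the test's usual page-size bookkeeping (names approximate the kselftest harness fixture):

```c
/* The mirror must hold everything copy_to_user() writes back,
 * i.e. the full six pages, not six bytes. */
size = 6 * self->page_size;
buffer->ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
buffer->mirror = malloc(size);	/* was effectively malloc(6): too small */
```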
When /proc/kcore is read, an attempt to read the first two pages results in a HW-specific page swap on s390, and other (so-called prefix) pages are accessed instead. That leads to a wrong read. Allow architecture-specific translation of memory addresses using kc_xlate_dev_mem_ptr() and kc_unxlate_dev_mem_ptr() callbacks, similarly to the /dev/mem xlate_dev_mem_ptr() and unxlate_dev_mem_ptr() callbacks. That way an architecture can deal with specific physical memory ranges. Re-use the existing /dev/mem callback implementation on s390, which handles the described prefix pages swapping correctly. For other architectures the default callback is basically a NOP. It is expected the condition (vaddr == __va(__pa(vaddr))) always holds true for KCORE_RAM memory type. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Alexander Gordeev <[email protected]> Suggested-by: Heiko Carstens <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
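The default callbacks are a plain identity mapping; this sketch mirrors the pattern the text describes (an architecture such as s390 overrides the macros to point at its prefix-swapping /dev/mem helpers):

```c
/* Fallback for architectures that need no address translation. */
#ifndef kc_xlate_dev_mem_ptr
#define kc_xlate_dev_mem_ptr kc_xlate_dev_mem_ptr
static inline void *kc_xlate_dev_mem_ptr(phys_addr_t phys)
{
	return __va(phys);	/* identity: vaddr == __va(__pa(vaddr)) */
}
#endif

#ifndef kc_unxlate_dev_mem_ptr
#define kc_unxlate_dev_mem_ptr kc_unxlate_dev_mem_ptr
static inline void kc_unxlate_dev_mem_ptr(phys_addr_t phys, void *virt)
{
	/* nothing to undo for the identity translation */
}
#endif
```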
…ts() In resource_test_insert_resource(), the pointer is used in an error message after kfree(). This is a use-after-free. To fix this, we need to call kunit_add_action_or_reset() to schedule the memory freeing after usage. But kunit_add_action_or_reset() itself may fail and free the memory. So, its return value should be checked, and the test aborted on failure. Then, we found that other usage of kunit_add_action_or_reset() in resource_test_region_intersects() needs to be fixed too. We fix all these use-after-free bugs in this patch. Link: https://lkml.kernel.org/r/[email protected] Fixes: 99185c1 ("resource, kunit: add test case for region_intersects()") Signed-off-by: "Huang, Ying" <[email protected]> Reported-by: Kees Bakker <[email protected]> Closes: https://lore.kernel.org/lkml/[email protected]/ Cc: Dan Williams <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Bjorn Helgaas <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
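The pattern is worth spelling out, since kunit_add_action_or_reset() frees the resource itself when it fails. A sketch, where kfree_wrapper() stands in for a small adapter matching the kunit_action_t signature:

```c
struct resource *res;
int ret;

res = kzalloc(sizeof(*res), GFP_KERNEL);
KUNIT_ASSERT_NOT_NULL(test, res);

/* Defer the kfree() until after the test, so later error messages can
 * still dereference 'res'. If deferring fails, 'res' has already been
 * freed, so the test must abort rather than keep using it. */
ret = kunit_add_action_or_reset(test, kfree_wrapper, res);
KUNIT_ASSERT_EQ(test, ret, 0);
```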
We should only check for pmd_special() after we made sure that we have a present PMD. For example, if we have a migration PMD, pmd_special() might indicate that we have a special PMD although we really don't. This fixes confusing migration entries as PFN mappings, and not doing what we are supposed to do in the "is_swap_pmd()" case further down in the function -- including messing up COW, page table handling and accounting. Link: https://lkml.kernel.org/r/[email protected] Fixes: bc02afb ("mm/fork: accept huge pfnmap entries") Signed-off-by: David Hildenbrand <[email protected]> Reported-by: [email protected] Closes: https://lore.kernel.org/lkml/[email protected]/ Reviewed-by: Peter Xu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
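A sketch of the corrected ordering in the copy path; the two helpers are hypothetical stand-ins for the existing swap-PMD and PFN-mapping branches:

```c
/* A migration/swap entry has no valid PFN bits, so pmd_special() is
 * only meaningful once the entry is known to be present. */
if (unlikely(!pmd_present(pmd)))
	return copy_swap_pmd(dst_vma, src_vma, pmd);	/* hypothetical */

if (pmd_special(pmd))
	return copy_pfn_pmd(dst_vma, src_vma, pmd);	/* hypothetical */
```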
I'm leaving Google. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Fangrui Song <[email protected]> Acked-by: Nathan Chancellor <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
Return -ENOSYS from the memfd_secret() syscall if !can_set_direct_map(). This is the case for example on some arm64 configurations, where marking 4k PTEs in the direct map not present can only be done if the direct map is set up at 4k granularity in the first place (as ARM's break-before-make semantics do not easily allow breaking apart large/gigantic pages). More precisely, on arm64 systems with !can_set_direct_map(), set_direct_map_invalid_noflush() is a no-op, however it returns success (0) instead of an error. This means that memfd_secret will seemingly "work" (e.g. the syscall succeeds, you can mmap the fd and fault in pages), but it does not actually achieve its goal of removing its memory from the direct map. Note that with this patch, memfd_secret() will start erroring on systems where can_set_direct_map() returns false (arm64 with CONFIG_RODATA_FULL_DEFAULT_ENABLED=n, CONFIG_DEBUG_PAGEALLOC=n and CONFIG_KFENCE=n), but that still seems better than the current silent failure. Since CONFIG_RODATA_FULL_DEFAULT_ENABLED defaults to 'y', most arm64 systems actually have a working memfd_secret() and aren't affected. From going through the iterations of the original memfd_secret patch series, it seems that disabling the syscall in these scenarios was the intended behavior [1] (preferred over having set_direct_map_invalid_noflush return an error as that would result in SIGBUSes at page-fault time), however the check for it got dropped between v16 [2] and v17 [3], when secretmem moved away from CMA allocations. [1]: https://lore.kernel.org/lkml/[email protected]/ [2]: https://lore.kernel.org/lkml/[email protected]/#t [3]: https://lore.kernel.org/lkml/[email protected]/ Link: https://lkml.kernel.org/r/[email protected] Fixes: 1507f51 ("mm: introduce memfd_secret system call to create "secret" memory areas") Signed-off-by: Patrick Roy <[email protected]> Reviewed-by: Mike Rapoport (Microsoft) <[email protected]> Cc: Alexander Graf <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: James Gowans <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
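The shape of the check at syscall entry, sketched; do_memfd_secret() is a hypothetical stand-in for the remainder of the existing syscall body, while can_set_direct_map() is the real helper named above:

```c
SYSCALL_DEFINE1(memfd_secret, unsigned int, flags)
{
	/* The fix: if the direct map cannot be fragmented on this
	 * configuration, fail loudly instead of handing out an fd
	 * whose memory silently stays in the direct map. */
	if (!can_set_direct_map())
		return -ENOSYS;

	return do_memfd_secret(flags);	/* hypothetical: existing fd creation */
}
```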
Re-sort a few misplaced entries in the CREDITS file. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Krzysztof Kozlowski <[email protected]> Cc: Arnd Bergmann <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
Made a minor edit in the comments for 'struct zswap_entry' to delete the description of the 'value' member that was deleted in commit 20a5532 ("mm: remove code to handle same filled pages"). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kanchana P Sridhar <[email protected]> Fixes: 20a5532 ("mm: remove code to handle same filled pages") Reviewed-by: Nhat Pham <[email protected]> Acked-by: Yosry Ahmed <[email protected]> Reviewed-by: Usama Arif <[email protected]> Cc: Chengming Zhou <[email protected]> Cc: Huang Ying <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kanchana P Sridhar <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Wajdi Feghali <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
There's an inherent race in taking a snapshot while an unlinked file is open, and then reattaching it in the child snapshot. In the interior snapshot node the file will appear unlinked, as though it should be deleted - it's not referenced by anything in that snapshot - but we can't delete it, because the file data is referenced by the child snapshot. This was being handled incorrectly with propagate_key_to_snapshot_leaves() - but that doesn't resolve the fundamental inconsistency of "this file looks like it should be deleted according to normal rules, but - ". To fix this, we need to fix the rule for when an inode is deleted. The previous rule, ignoring snapshots (there was no well-defined rule for with snapshots) was: Unlinked, non open files are deleted, either at recovery time or during online fsck. The new rule is: Unlinked, non open files, that do not exist in child snapshots, are deleted. To make this work transactionally, we add a new inode flag, BCH_INODE_has_child_snapshot; it overrides BCH_INODE_unlinked when considering whether to delete an inode, or put it on the deleted list. For transactional consistency, clearing it is handled by the inode trigger: when deleting an inode we check if there are parent inodes which can now have the BCH_INODE_has_child_snapshot flag cleared. Signed-off-by: Kent Overstreet <[email protected]>
Dead code now. Signed-off-by: Kent Overstreet <[email protected]>
The mask parameter used for allocations got unified to GFP_NOFS and removed from relevant functions in 1d12680 ("btrfs: drop gfp from parameter extent state helpers"). Reviewed-by: Anand Jain <[email protected]> Signed-off-by: David Sterba <[email protected]>
The parameter was added in 8ff8466 ("btrfs: support subpage for extent buffer page release") for page but hasn't been used since, so we can drop it. Reviewed-by: Anand Jain <[email protected]> Signed-off-by: David Sterba <[email protected]>
Since the new mount option parser in commit ad21f15 ("btrfs: switch to the new mount API") we don't pass the options like that anymore. Reviewed-by: Anand Jain <[email protected]> Signed-off-by: David Sterba <[email protected]>
The only caller passes NULL, we can drop the parameter. This is since the new mount option parser done in 3bb17a2 ("btrfs: add get_tree callback for new mount API"). Reviewed-by: Anand Jain <[email protected]> Signed-off-by: David Sterba <[email protected]>
The function got split in commit 6ab6ebb ("btrfs: split alloc_log_tree()") and since then transaction parameter has been unused. Reviewed-by: Anand Jain <[email protected]> Signed-off-by: David Sterba <[email protected]>
Cascaded removal of fs_info that is not needed in several functions. Reviewed-by: Anand Jain <[email protected]> Signed-off-by: David Sterba <[email protected]>
The compression heuristic pass does not need a level, so we can drop the parameter. Reviewed-by: Anand Jain <[email protected]> Signed-off-by: David Sterba <[email protected]>
When btrfs reserves an extent and does not use it (e.g., due to an error), it calls btrfs_free_reserved_extent() to free the reserved extent. In the process, it calls btrfs_add_free_space() and then it accounts the region bytes as block_group->zone_unusable. However, it leaves the space_info->bytes_zone_unusable side not updated. As a result, ENOSPC can happen while a space_info reservation succeeded. The reservation is fine because the freed region is not added in space_info->bytes_zone_unusable, leaving that space as "free". OTOH, the corresponding block group counts it as zone_unusable, and since its allocation pointer is not rewound, we cannot allocate an extent from that block group. That will also negate space_info's async/sync reclaim process, and cause an ENOSPC error from the extent allocation process. Fix that by returning the space to space_info->bytes_zone_unusable. Ideally, since a bio is not submitted for this reserved region, we should return the space to free space and rewind the allocation pointer. But, it needs rework on extent allocation handling, so let it work in this way for now. Fixes: 169e0da ("btrfs: zoned: track unusable bytes for zones") CC: [email protected] # 5.15+ Reviewed-by: Johannes Thumshirn <[email protected]> Signed-off-by: Naohiro Aota <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
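A sketch of the accounting fix, keeping the two counters in lockstep (the locking order and surrounding context are simplified; the point is that the space_info side must mirror the block group's zone_unusable bump):

```c
/* Freed-but-reserved space on a zoned FS cannot be reused until the
 * zone is reset, so account it as zone_unusable in *both* views. */
spin_lock(&block_group->lock);
spin_lock(&sinfo->lock);
block_group->zone_unusable += len;
sinfo->bytes_zone_unusable += len;	/* the missing side of the ledger */
spin_unlock(&sinfo->lock);
spin_unlock(&block_group->lock);
```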
The purpose of btrfs_bbio_propagate_error() is to propagate an error from a split bio to its original btrfs_bio, and to report the error to the upper layer. However, it does not work well in some cases. * Case 1. Immediate (or quick) end_bio with an error When btrfs sends a btrfs_bio to mirrored devices, btrfs calls btrfs_bio_end_io() when all the mirroring bios are completed. If that btrfs_bio was split, it is from btrfs_clone_bioset and its end_io function is btrfs_orig_write_end_io. For this case, btrfs_bbio_propagate_error() accesses the orig_bbio's bio context to increase the error count. That works well in most cases. However, if the end_io is called quickly enough, the orig_bbio's (remaining part after split) bio context may not be properly set at that time. Since the bio context is set when the orig_bbio (the last btrfs_bio) is sent to devices, that might be too late for an earlier split btrfs_bio's completion. That will result in a NULL pointer dereference. The bug is easily reproducible by running btrfs/146 on zoned devices [1] and it shows the following trace. [1] You need the raid-stripe-tree feature as it creates a "-d raid0 -m raid1" FS. BUG: kernel NULL pointer dereference, address: 0000000000000020 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: Oops: 0000 [#1] PREEMPT SMP PTI CPU: 1 UID: 0 PID: 13 Comm: kworker/u32:1 Not tainted 6.11.0-rc7-BTRFS-ZNS+ #474 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 Workqueue: writeback wb_workfn (flush-btrfs-5) RIP: 0010:btrfs_bio_end_io+0xae/0xc0 [btrfs] BTRFS error (device dm-0): bdev /dev/mapper/error-test errs: wr 2, rd 0, flush 0, corrupt 0, gen 0 RSP: 0018:ffffc9000006f248 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff888005a7f080 RCX: ffffc9000006f1dc RDX: 0000000000000000 RSI: 000000000000000a RDI: ffff888005a7f080 RBP: ffff888011dfc540 R08: 0000000000000000 R09: 0000000000000001 R10: ffffffff82e508e0 R11: 0000000000000005 R12: ffff88800ddfbe58 R13: ffff888005a7f080 R14: ffff888005a7f158 R15: ffff888005a7f158 FS: 0000000000000000(0000) GS:ffff88803ea80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000020 CR3: 0000000002e22006 CR4: 0000000000370ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ? __die_body.cold+0x19/0x26 ? page_fault_oops+0x13e/0x2b0 ? _printk+0x58/0x73 ? do_user_addr_fault+0x5f/0x750 ? exc_page_fault+0x76/0x240 ? asm_exc_page_fault+0x22/0x30 ? btrfs_bio_end_io+0xae/0xc0 [btrfs] ? btrfs_log_dev_io_error+0x7f/0x90 [btrfs] btrfs_orig_write_end_io+0x51/0x90 [btrfs] dm_submit_bio+0x5c2/0xa50 [dm_mod] ? find_held_lock+0x2b/0x80 ? blk_try_enter_queue+0x90/0x1e0 __submit_bio+0xe0/0x130 ? ktime_get+0x10a/0x160 ? lockdep_hardirqs_on+0x74/0x100 submit_bio_noacct_nocheck+0x199/0x410 btrfs_submit_bio+0x7d/0x150 [btrfs] btrfs_submit_chunk+0x1a1/0x6d0 [btrfs] ? lockdep_hardirqs_on+0x74/0x100 ? __folio_start_writeback+0x10/0x2c0 btrfs_submit_bbio+0x1c/0x40 [btrfs] submit_one_bio+0x44/0x60 [btrfs] submit_extent_folio+0x13f/0x330 [btrfs] ? btrfs_set_range_writeback+0xa3/0xd0 [btrfs] extent_writepage_io+0x18b/0x360 [btrfs] extent_write_locked_range+0x17c/0x340 [btrfs] ? __pfx_end_bbio_data_write+0x10/0x10 [btrfs] run_delalloc_cow+0x71/0xd0 [btrfs] btrfs_run_delalloc_range+0x176/0x500 [btrfs] ? find_lock_delalloc_range+0x119/0x260 [btrfs] writepage_delalloc+0x2ab/0x480 [btrfs] extent_write_cache_pages+0x236/0x7d0 [btrfs] btrfs_writepages+0x72/0x130 [btrfs] do_writepages+0xd4/0x240 ? find_held_lock+0x2b/0x80 ? wbc_attach_and_unlock_inode+0x12c/0x290 ? wbc_attach_and_unlock_inode+0x12c/0x290 __writeback_single_inode+0x5c/0x4c0 ? do_raw_spin_unlock+0x49/0xb0 writeback_sb_inodes+0x22c/0x560 __writeback_inodes_wb+0x4c/0xe0 wb_writeback+0x1d6/0x3f0 wb_workfn+0x334/0x520 process_one_work+0x1ee/0x570 ? lock_is_held_type+0xc6/0x130 worker_thread+0x1d1/0x3b0 ? __pfx_worker_thread+0x10/0x10 kthread+0xee/0x120 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x30/0x50 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK> Modules linked in: dm_mod btrfs blake2b_generic xor raid6_pq rapl CR2: 0000000000000020 * Case 2. Earlier completion of orig_bbio for mirrored btrfs_bios btrfs_bbio_propagate_error() assumes the end_io function for the orig_bbio is called last among split bios. In that case, btrfs_orig_write_end_io() sets the bio->bi_status to BLK_STS_IOERR by seeing the bioc->error [2]. Otherwise, the increased orig_bio's bioc->error is not checked by anyone and we return BLK_STS_OK to the upper layer. [2] Actually, this is not true. Because we only increase orig_bioc->errors by max_errors, the condition "atomic_read(&bioc->error) > bioc->max_errors" is still not met if only one split btrfs_bio fails. * Case 3. Later completion of orig_bbio for un-mirrored btrfs_bios In contrast to the above case, btrfs_bbio_propagate_error() does not work well if the un-mirrored orig_bbio is completed last. It sets orig_bbio->bio.bi_status to the btrfs_bio's error. But that is easily overwritten by the orig_bbio's completion status. If the status is BLK_STS_OK, the upper layer would not know about the failure. * Solution Considering the above cases, we can only save the error status in the orig_bbio (remaining part after split) itself, as it is always available. Also, the saved error status should be propagated when all the split btrfs_bios are finished (i.e., bbio->pending_ios == 0). This commit introduces "status" to btrfs_bbio and saves the first error of split bios to the original btrfs_bio's "status" variable. When all the split bios are finished, the saved status is loaded into the original btrfs_bio's status. With this commit, btrfs/146 on zoned devices does not hit the NULL pointer dereference anymore. Fixes: 852eee6 ("btrfs: allow btrfs_submit_bio to split bios") CC: [email protected] # 6.6+ Reviewed-by: Qu Wenruo <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Signed-off-by: Naohiro Aota <[email protected]> Signed-off-by: David Sterba <[email protected]>
…given After the migration to use fs context for processing mount options we had a slight change in the semantics for remounting a filesystem that was mounted with compress-force. Before we could clear compress-force by passing only "-o compress[=algo]" during a remount, but after that change that does not work anymore, force-compress is still present and one needs to pass "-o compress-force=no,compress[=algo]" to the mount command. Example, when running on a kernel 6.8+: $ mount -o compress-force=zlib:9 /dev/sdi /mnt/sdi $ mount | grep sdi /dev/sdi on /mnt/sdi type btrfs (rw,relatime,compress-force=zlib:9,discard=async,space_cache=v2,subvolid=5,subvol=/) $ mount -o remount,compress=zlib:5 /mnt/sdi $ mount | grep sdi /dev/sdi on /mnt/sdi type btrfs (rw,relatime,compress-force=zlib:5,discard=async,space_cache=v2,subvolid=5,subvol=/) On a 6.7 kernel (or older): $ mount -o compress-force=zlib:9 /dev/sdi /mnt/sdi $ mount | grep sdi /dev/sdi on /mnt/sdi type btrfs (rw,relatime,compress-force=zlib:9,discard=async,space_cache=v2,subvolid=5,subvol=/) $ mount -o remount,compress=zlib:5 /mnt/sdi $ mount | grep sdi /dev/sdi on /mnt/sdi type btrfs (rw,relatime,compress=zlib:5,discard=async,space_cache=v2,subvolid=5,subvol=/) So update btrfs_parse_param() to clear "compress-force" when "compress" is given, providing the same semantics as kernel 6.7 and older. Reported-by: Roman Mamedov <[email protected]> Link: https://lore.kernel.org/linux-btrfs/20241014182416.13d0f8b0@nvm/ CC: [email protected] # 6.8+ Signed-off-by: Filipe Manana <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
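A sketch of the restored parser behaviour inside btrfs_parse_param()'s option switch (condensed; btrfs_set_opt()/btrfs_clear_opt() are the real option macros, and the compression-algorithm handling is omitted):

```c
switch (opt) {
case Opt_compress:
	/* Plain "compress" now also clears a previously set
	 * compress-force, restoring the pre-6.8 remount semantics. */
	btrfs_set_opt(ctx->mount_opt, COMPRESS);
	btrfs_clear_opt(ctx->mount_opt, FORCE_COMPRESS);
	break;
case Opt_compress_force:
	btrfs_set_opt(ctx->mount_opt, COMPRESS);
	btrfs_set_opt(ctx->mount_opt, FORCE_COMPRESS);
	break;
}
```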
When crawling the btree, if an eb cache miss occurs, we switch to using the eb read lock and release all previous locks (including the parent lock) to reduce lock contention. If an eb cache miss occurs in a leaf and IO needs to be executed, before this change we released locks only from level 2 and up, reading a leaf's content from disk while holding a lock on its parent (level 1) and causing unnecessary lock contention on the parent; after this change we release locks from level 1 and up, lock level 0, and read the leaf's content from disk. Because we have prepared the check parameters and hold the eb's read lock, we can ensure that no race will occur during the check and cause unexpected errors. Reviewed-by: Filipe Manana <[email protected]> Signed-off-by: Robbie Ko <[email protected]> Signed-off-by: David Sterba <[email protected]>
Move the common code to remove an extent map from its inode's tree into a helper function and use it, reducing duplicated code. Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]>
…e job Currently the extent map shrinker is run synchronously for kswapd tasks that end up calling the fs shrinker (fs/super.c:super_cache_scan()). This has some disadvantages, and for some heavy workloads with memory pressure it can cause delays and stalls that make a machine unresponsive for some periods. This happens because: 1) We can have several kswapd tasks on machines with multiple NUMA zones, and running the extent map shrinker concurrently can cause high contention on some spin locks, namely the spin locks that protect the radix tree that tracks roots, the per root xarray that tracks open inodes and the list of delayed iputs. This not only delays the shrinker but also causes high CPU consumption and makes the task running the shrinker monopolize a core, resulting in the symptoms of an unresponsive system. This was noted in previous commits such as commit ae1e766 ("btrfs: only run the extent map shrinker from kswapd tasks"); 2) The extent map shrinker's iteration over inodes can often be slow, even after changing the data structure that tracks open inodes for a root from a red black tree (up to kernel 6.10) to an xarray (kernel 6.10+). While the transition to the xarray made things a bit faster, it's still somewhat slow - for example in a test scenario with 10000 inodes that have no extent maps loaded, the extent map shrinker took between 5ms to 8ms, using a release, non-debug kernel. Iterating over the extent maps of an inode can also be slow if we have an inode with many thousands of extent maps, since we use a red black tree to track and search extent maps. So having the extent map shrinker run synchronously adds extra delay for other things a kswapd task does. So make the extent map shrinker run asynchronously as a job for the system unbound workqueue, just like what we do for data and metadata space reclaim jobs. Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]>
Now that the extent map shrinker can only be run by a single task (as a work queue item) there is no need to keep the progress of the shrinker protected by a spinlock and passing the progress to trace events as parameters. So remove the lock and simplify the arguments for the trace events. Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]>
The names for the members of struct btrfs_fs_info related to the extent map shrinker are a bit too long, so rename them to be shorter by replacing the "extent_map_" prefix with the "em_" prefix. Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]>
Now that the extent map shrinker can only be run by a single task and runs asynchronously as a work queue job, enable it as it can no longer cause stalls on tasks allocating memory and entering the extent map shrinker through the fs shrinker (implemented by btrfs_free_cached_objects()). This is crucial to prevent exhaustion of memory due to unbounded extent map creation, primarily with direct IO but also for buffered IO on files with holes. This problem, for the direct IO case, was first reported in the Link tag below. That report was added to a Link tag of the first patch that introduced the extent map shrinker, commit 956a17d ("btrfs: add a shrinker for extent maps"), however the Link tag disappeared somehow from the committed patch (but was included in the submitted patch to the mailing list), so adding it below for future reference. Link: https://lore.kernel.org/linux-btrfs/[email protected]/ Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]>
The level parameter passed to read_block_for_search() always matches the level of the extent buffer passed in the "eb_ret" parameter, which we are also extracting into the "parent_level" local variable. So remove the level parameter and instead use the "parent_level" variable which in fact has a better name (it's the level of the parent node from which we are reading a child node/leaf). Signed-off-by: Filipe Manana <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
The only caller of btrfs_verify_level_key() is read_block_for_search() and it's passing 3 arguments to it that can be extracted from its on stack variable of type struct btrfs_tree_parent_check. So change btrfs_verify_level_key() to accept an argument of type struct btrfs_tree_parent_check instead of level, first key and parent transid arguments. Signed-off-by: Filipe Manana <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
…check It's pointless to initialize the has_first_key field of the stack local btrfs_tree_parent_check structure at btrfs_tree_parent_check() and at btrfs_qgroup_trace_subtree() since all fields not explicitly initialized are zeroed out. In the case of the first function it's a bit odd because we are assigning 0 and the field is of type bool, however it is not incorrect since a 0 is converted to false. Just remove the explicit initializations due to their redundancy. Signed-off-by: Filipe Manana <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
It's redundant to have the 'gen' variable since we already have the same value in the local btrfs_tree_parent_check structure. So remove it and instead use the structure's field. Signed-off-by: Filipe Manana <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
If you follow the seed/sprout wiki, it suggests the following workflow: btrfstune -S 1 seed_dev mount seed_dev mnt btrfs device add sprout_dev mount -o remount,rw mnt The first mount mounts the FS readonly, which results in not setting BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add somewhat surprisingly clears the readonly bit on the sb (though the mount is still practically readonly, from the user's perspective...). Finally, the remount checks the readonly bit on the sb against the flag and sees no change, so it does not run the code intended to run on ro->rw transitions, leaving BTRFS_FS_OPEN unset. As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and does no work. This results in leaking deleted snapshots until we run out of space. I propose fixing it at the first departure from what feels reasonable: when we clear the readonly bit on the sb during device add. A new fstest I have written reproduces the bug and confirms the fix. Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: Boris Burkov <[email protected]> Signed-off-by: David Sterba <[email protected]>
Since commit 011b46c ("btrfs: skip subtree scan if it's too high to avoid low stall in btrfs_commit_transaction()"), btrfs qgroup can automatically skip large subtree scans at the cost of marking the qgroup inconsistent. It's designed to address the final performance problem of snapshot drop with qgroup enabled, but to be safe the default value is BTRFS_MAX_LEVEL, requiring a user space daemon to set a different value to make it work. I'd say it's not a good idea to rely on a user space tool to set this default value, especially when some operations (snapshot dropping) can be triggered immediately after mount, leaving a very small window in which to set that sysfs interface. So instead of disabling this new feature by default, enable it with a low threshold (3), so that large subvolume tree drops at mount time won't cause a huge qgroup workload. CC: [email protected] # 6.1 Signed-off-by: Qu Wenruo <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
Inside lock_delalloc_folios(), there are several problems related to sector size < page size handling: - The writer locks are set without checking if the folio is still valid. We call btrfs_folio_start_writer_lock() just like it's folio_lock(), but since the folio may not even be a folio of the current mapping, we can easily screw up folio->private. - The range is not clamped inside the page. This means we can overwrite other bitmaps if the start/len is not properly handled, and trigger the btrfs_subpage_assert(). - @processed_end is always rounded up to the page end. If the delalloc range is not page aligned, and we need to retry (returning -EAGAIN), then we will unlock to the page end. Thankfully this is not a huge problem, as now btrfs_folio_end_writer_lock() can handle a range larger than the locked range, and only unlock what is already locked. Fix all these problems by: - Locking and checking the folio first, then calling btrfs_folio_set_writer_lock(), so that if we get a folio not belonging to the inode, we won't touch folio->private. - Properly truncating the range inside the page. - Updating @processed_end to the locked range end. Fixes: 1e1de38 ("btrfs: make process_one_page() to handle subpage locking") Signed-off-by: Qu Wenruo <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
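The clamping fix boils down to intersecting the delalloc range with the current folio before touching any subpage bitmap; a sketch using the helper the text itself names (surrounding context condensed):

```c
/* Clamp [start, start + len) to the folio before locking, so a
 * sector-size < page-size bitmap is never written out of bounds. */
u64 range_start = max_t(u64, folio_pos(folio), start);
u32 range_len = min_t(u64, folio_pos(folio) + folio_size(folio),
		      start + len) - range_start;

btrfs_folio_set_writer_lock(fs_info, folio, range_start, range_len);
```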
This function is not really suitable to lock a folio, as it lacks the proper mapping checks, thus the locked folio may not even belong to btrfs. And due to the above reason, the last user inside lock_delalloc_folios() is already removed, and we can remove this function. Signed-off-by: Qu Wenruo <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
Since commit d7172f5 ("btrfs: use per-buffer locking for extent_buffer reading"), metadata read no longer relies on the subpage reader locking. This means we do not need to maintain a different metadata/data split for locking, so we can convert the existing reader lock users by: - add_ra_bio_pages() Convert to btrfs_folio_set_writer_lock() - end_folio_read() Convert to btrfs_folio_end_writer_lock() - begin_folio_read() Convert to btrfs_folio_set_writer_lock() - folio_range_has_eb() Remove the subpage->readers checks, since it is always 0. - Remove btrfs_subpage_start_reader() and btrfs_subpage_end_reader() Signed-off-by: Qu Wenruo <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
Since there is no user of reader locks, rename the writer locks into a more generic name, by removing the "_writer" part from the name. And also rename btrfs_subpage::writer into btrfs_subpage::locked. Signed-off-by: Qu Wenruo <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
In our CI system, the RAID stripe tree configuration sometimes fails with the following ASSERT(): assertion failed: found_start >= start && found_end <= end, in fs/btrfs/raid-stripe-tree.c:64 This ASSERT()ion triggers because, for the initial design of RAID stripe-tree, I had the "one ordered-extent equals one bio" rule of zoned btrfs in mind. But for a RAID stripe-tree based system that is not hosted on a zoned storage device but on a regular device, this rule doesn't apply. So in case the range we want to delete starts in the middle of the previous item, grab the item and "truncate" its length. That is, clone the item, subtract the deleted portion from the key's offset, delete the old item and insert the new one. In case the range to delete ends in the middle of an item, we have to adjust both the item's key as well as the stripe extents and then re-insert the modified clone into the tree after deleting the old stripe extent. Signed-off-by: Johannes Thumshirn <[email protected]>
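For the front-truncation case, the clone-adjust-reinsert dance can be sketched in kernel style (simplified; a stripe extent's key uses the logical address as objectid and the length as offset, and stripe_extent/item_size stand for the cloned item data):

```c
struct btrfs_key new_key = old_key;
u64 trimmed = end - old_key.objectid;	/* front part being deleted */

new_key.objectid += trimmed;	/* start just past the deleted range */
new_key.offset -= trimmed;	/* shrink the stripe extent's length */

/* Drop the old item and insert the truncated clone in its place. */
ret = btrfs_del_item(trans, stripe_root, path);
if (!ret)
	ret = btrfs_insert_item(trans, stripe_root, &new_key,
				stripe_extent, item_size);
```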
Implement self-tests for partial deletion of RAID stripe-tree entries. These two new tests cover both the deletion of the front of a RAID stripe-tree stripe extent as well as truncation of an item to make it smaller. Signed-off-by: Johannes Thumshirn <[email protected]>