
Qgroup fix #1379

Closed
wants to merge 10,000 commits into from

Conversation

josefbacik

No description provided.

Jack Xiao and others added 30 commits August 13, 2024 13:04
Update mes12 api definition.

Signed-off-by: Jack Xiao <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
(cherry picked from commit 2ab5dc5)
Add multiple mes ring instances in mes structure to support
multiple mes pipes.

Signed-off-by: Jack Xiao <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
(cherry picked from commit c7d4355)
Enable unified mes firmware to load on pipe0 and pipe1.

Signed-off-by: Jack Xiao <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
(cherry picked from commit e69c2dd)
Add mes pipe switch to let caller choose pipe
to submit packet.

Signed-off-by: Jack Xiao <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
(cherry picked from commit b2dee08)
Adjust mes12 sw/hw initialization to enable both pipe0 and pipe1.
The two pipes are almost identical. Pipe0 behaves like schq and
pipe1 like kiq; pipe0 is mapped by pipe1.

Signed-off-by: Jack Xiao <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
(cherry picked from commit aa539da)
Configure two pipes with different hardware resources.

Signed-off-by: Jack Xiao <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
(cherry picked from commit ea5d6db)
Free memory for two pipes and unmap pipe0 via pipe1.

Signed-off-by: Jack Xiao <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
(cherry picked from commit 98cae69)
Use mes pipe to unmap kcq and kgq.

Signed-off-by: Jack Xiao <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
(cherry picked from commit f7fb9d6)
Add JPEG IB command parser to ensure registers
in the command are within the JPEG IP block.

Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: David (Ming Qiang) Wu <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
(cherry picked from commit a7f670d)
Cc: [email protected]
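For illustration, a minimal sketch of what such an IB parser can look
like; the function name and the register-window bounds (JPEG_REG_START,
JPEG_REG_END) are hypothetical stand-ins, not the exact upstream symbols:

  /* Sketch: walk the indirect buffer (register/value pairs) and reject
   * any register that falls outside the JPEG IP block's MMIO window. */
  static int jpeg_dec_ib_parse(struct amdgpu_ib *ib)
  {
      u32 i, reg;

      for (i = 0; i < ib->length_dw; i += 2) {
          reg = ib->ptr[i];

          if (reg < JPEG_REG_START || reg > JPEG_REG_END)
              return -EINVAL; /* register outside the JPEG block */
      }
      return 0;
  }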
kmd_fw_shared changed in VCN5

Signed-off-by: Yinjie Yao <[email protected]>
Reviewed-by: Ruijing Dong <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
(cherry picked from commit aa02486)
add HDP_SD support on gc 12.0.0/1

Signed-off-by: Kenneth Feng <[email protected]>
Reviewed-by: Yang Wang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
(cherry picked from commit 61cffac)
The regressing commit is new in 6.10. It assumed that anytime event->prog
is set bpf_overflow_handler() should be invoked to execute the attached bpf
program. This assumption is false for tracing events, and as a result the
regressing commit broke bpftrace by invoking the bpf handler with garbage
inputs on overflow.

Prior to the regression the overflow handlers formed a chain (of length 0,
1, or 2) and perf_event_set_bpf_handler() (the !tracing case) added
bpf_overflow_handler() to that chain, while perf_event_attach_bpf_prog()
(the tracing case) did not. Both set event->prog. The chain of overflow
handlers was replaced by a single overflow handler slot and a fixed call to
bpf_overflow_handler() when appropriate. This modifies the condition there
to check event->prog->type == BPF_PROG_TYPE_PERF_EVENT, restoring the
previous behavior and fixing bpftrace.

Signed-off-by: Kyle Huey <[email protected]>
Suggested-by: Andrii Nakryiko <[email protected]>
Reported-by: Joe Damato <[email protected]>
Closes: https://lore.kernel.org/lkml/ZpFfocvyF3KHaSzF@LQ3V64L9R2/
Fixes: f11f10b ("perf/bpf: Call BPF handler directly, not through overflow machinery")
Cc: [email protected]
Tested-by: Joe Damato <[email protected]> # bpftrace
Acked-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
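A minimal sketch of the restored condition in __perf_event_overflow(),
with the surrounding code elided:

  #ifdef CONFIG_BPF_SYSCALL
      /* Only BPF_PROG_TYPE_PERF_EVENT programs (attached via
       * perf_event_set_bpf_handler()) own the overflow path; tracing
       * programs also set event->prog but must not be invoked here. */
      if (event->prog && event->prog->type == BPF_PROG_TYPE_PERF_EVENT &&
          !bpf_overflow_handler(event, data, regs))
          goto out;
  #endif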
kmalloc is unreliable when allocating more than 8 pages of memory. It may
fail when there is plenty of free memory but the memory is fragmented.
Zdenek Kabelac observed such failure in his tests.

This commit changes kmalloc to kvmalloc - kvmalloc will fall back to
vmalloc if the large allocation fails.

Signed-off-by: Mikulas Patocka <[email protected]>
Reported-by: Zdenek Kabelac <[email protected]>
Reviewed-by: Mike Snitzer <[email protected]>
Cc: [email protected]
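The change boils down to the following pattern (a sketch, not the exact
dm hunk); note that kvmalloc'd memory must be released with kvfree():

  /* Before: needs physically contiguous pages, which may be
   * unavailable on a fragmented system once the size exceeds ~8 pages. */
  ptr = kmalloc(size, GFP_KERNEL);

  /* After: tries kmalloc first, transparently falls back to vmalloc
   * when the contiguous allocation fails. */
  ptr = kvmalloc(size, GFP_KERNEL);
  ...
  kvfree(ptr); /* correct for both kmalloc and vmalloc backing */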
When opening a file for exec via do_filp_open(), permission checking is
done against the file's metadata at that moment, and on success, a file
pointer is passed back. Much later in the execve() code path, the file
metadata (specifically mode, uid, and gid) is used to determine if/how
to set the uid and gid. However, those values may have changed since the
permissions check, meaning the execution may gain unintended privileges.

For example, if a file could change permissions from executable and not
set-id:

---------x 1 root root 16048 Aug  7 13:16 target

to set-id and non-executable:

---S------ 1 root root 16048 Aug  7 13:16 target

it is possible to gain root privileges when execution should have been
disallowed.

While this race condition is rare in real-world scenarios, it has been
observed (and proven exploitable) when package managers are updating
the setuid bits of installed programs. Such files start with being
world-executable but then are adjusted to be group-exec with a set-uid
bit. For example, "chmod o-x,u+s target" makes "target" executable only
by uid "root" and gid "cdrom", while also becoming setuid-root:

-rwxr-xr-x 1 root cdrom 16048 Aug  7 13:16 target

becomes:

-rwsr-xr-- 1 root cdrom 16048 Aug  7 13:16 target

But racing the chmod means users without group "cdrom" membership can
get the permission to execute "target" just before the chmod, and when
the chmod finishes, the exec reaches bprm_fill_uid(), and performs the
setuid to root, violating the expressed authorization of "only cdrom
group members can setuid to root".

Re-check that we still have execute permissions in case the metadata
has changed. It would be better to keep a copy from the perm-check time,
but until we can do that refactoring, the least-bad option is to do a
full inode_permission() call (under inode lock). It is understood that
this is safe against deadlocks, but hardly optimal.

Reported-by: Marco Vanotti <[email protected]>
Tested-by: Marco Vanotti <[email protected]>
Suggested-by: Linus Torvalds <[email protected]>
Cc: [email protected]
Cc: Eric Biederman <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Christian Brauner <[email protected]>
Signed-off-by: Kees Cook <[email protected]>
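A sketch of the re-check described above, assuming it lands in
bprm_fill_uid(); the exact placement and surrounding logic in the real
patch may differ:

  /* Re-check exec permission with the inode lock held, so a racing
   * chmod cannot trade executability for set-id bits between the
   * original permission check and the uid/gid decision. */
  inode_lock(inode);
  err = inode_permission(file_mnt_idmap(file), inode, MAY_EXEC);
  inode_unlock(inode);
  if (err)
      return; /* the exec bit vanished out from under us: give up */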
When of_irq_parse_raw() is invoked with a device address smaller than
the interrupt parent node (from #address-cells property), KASAN detects
the following out-of-bounds read when populating the initial match table
(dyndbg="func of_irq_parse_* +p"):

  OF: of_irq_parse_one: dev=/soc@0/picasso/watchdog, index=0
  OF:  parent=/soc@0/pci@878000000000/gpio0@17,0, intsize=2
  OF:  intspec=4
  OF: of_irq_parse_raw: ipar=/soc@0/pci@878000000000/gpio0@17,0, size=2
  OF:  -> addrsize=3
  ==================================================================
  BUG: KASAN: slab-out-of-bounds in of_irq_parse_raw+0x2b8/0x8d0
  Read of size 4 at addr ffffff81beca5608 by task bash/764

  CPU: 1 PID: 764 Comm: bash Tainted: G           O       6.1.67-484c613561-nokia_sm_arm64 #1
  Hardware name: Unknown Unknown Product/Unknown Product, BIOS 2023.01-12.24.03-dirty 01/01/2023
  Call trace:
   dump_backtrace+0xdc/0x130
   show_stack+0x1c/0x30
   dump_stack_lvl+0x6c/0x84
   print_report+0x150/0x448
   kasan_report+0x98/0x140
   __asan_load4+0x78/0xa0
   of_irq_parse_raw+0x2b8/0x8d0
   of_irq_parse_one+0x24c/0x270
   parse_interrupts+0xc0/0x120
   of_fwnode_add_links+0x100/0x2d0
   fw_devlink_parse_fwtree+0x64/0xc0
   device_add+0xb38/0xc30
   of_device_add+0x64/0x90
   of_platform_device_create_pdata+0xd0/0x170
   of_platform_bus_create+0x244/0x600
   of_platform_notify+0x1b0/0x254
   blocking_notifier_call_chain+0x9c/0xd0
   __of_changeset_entry_notify+0x1b8/0x230
   __of_changeset_apply_notify+0x54/0xe4
   of_overlay_fdt_apply+0xc04/0xd94
   ...

  The buggy address belongs to the object at ffffff81beca5600
   which belongs to the cache kmalloc-128 of size 128
  The buggy address is located 8 bytes inside of
   128-byte region [ffffff81beca5600, ffffff81beca5680)

  The buggy address belongs to the physical page:
  page:00000000230d3d03 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1beca4
  head:00000000230d3d03 order:1 compound_mapcount:0 compound_pincount:0
  flags: 0x8000000000010200(slab|head|zone=2)
  raw: 8000000000010200 0000000000000000 dead000000000122 ffffff810000c300
  raw: 0000000000000000 0000000000200020 00000001ffffffff 0000000000000000
  page dumped because: kasan: bad access detected

  Memory state around the buggy address:
   ffffff81beca5500: 04 fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
   ffffff81beca5580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
  >ffffff81beca5600: 00 fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
                        ^
   ffffff81beca5680: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
   ffffff81beca5700: 00 00 00 00 00 00 fc fc fc fc fc fc fc fc fc fc
  ==================================================================
  OF:  -> got it !

Prevent the out-of-bounds read by copying the device address into a
buffer of sufficient size.

Signed-off-by: Stefan Wiehler <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Rob Herring (Arm) <[email protected]>
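Conceptually the fix looks like the sketch below; the buffer size
constant is an assumption for illustration:

  /* Sketch: the device may provide fewer address cells than the
   * interrupt parent expects, so copy the address into a zeroed
   * buffer sized for the worst case rather than reading past the
   * end of the device's property. */
  u32 addr_buf[OF_MAX_ADDR_CELLS] = { 0 };

  if (addr)
      memcpy(addr_buf, addr,
             min_t(unsigned int, addr_len, OF_MAX_ADDR_CELLS) * sizeof(u32));
  addr = addr_buf; /* reads beyond addr_len now see zeroes */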
Add the missing geni_icc_disable() call before returning in the
geni_i2c_runtime_resume() function.

Commit 9ba48db ("i2c: qcom-geni: Add missing
geni_icc_disable in geni_i2c_runtime_resume") by Gaosheng missed
disabling the interconnect in one case.

Fixes: bf225ed ("i2c: i2c-qcom-geni: Add interconnect support")
Cc: Gaosheng Cui <[email protected]>
Cc: [email protected] # v5.9+
Signed-off-by: Andi Shyti <[email protected]>
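Schematically, the missed error path looks like this sketch (the exact
failing call in the real function may differ):

  static int geni_i2c_runtime_resume(struct device *dev)
  {
      struct geni_i2c_dev *gi2c = dev_get_drvdata(dev);
      int ret;

      ret = geni_icc_enable(&gi2c->se);
      if (ret)
          return ret;

      ret = clk_prepare_enable(gi2c->core_clk);
      if (ret) {
          geni_icc_disable(&gi2c->se); /* previously missing */
          return ret;
      }
      /* ... */
  }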
…rnel/git/kees/linux

Pull execve fixes from Kees Cook:

 - binfmt_flat: Fix corruption when not offsetting data start

 - exec: Fix ToCToU between perm check and set-uid/gid usage

* tag 'execve-v6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  exec: Fix ToCToU between perm check and set-uid/gid usage
  binfmt_flat: Fix corruption when not offsetting data start
ssn_offset field is u32 and is placed into the netlink response with
nla_put_u32(), but only 2 bytes are reserved for the attribute payload
in subflow_get_info_size() (even though it makes no difference
in the end, as it is aligned up to 4 bytes).  Supply the correct
argument to the relevant nla_total_size() call to make it less
confusing.

Fixes: 5147dfb ("mptcp: allow dumping subflow context to userspace")
Signed-off-by: Eugene Syromiatnikov <[email protected]>
Reviewed-by: Matthieu Baerts (NGI0) <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
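In subflow_get_info_size() terms the fix is a one-word change (sketch of
the relevant line):

  /* ssn_offset is emitted with nla_put_u32(), so account for the full
   * 4 bytes; reserving 2 was harmless only because the total gets
   * aligned up to 4 bytes anyway. */
  nla_total_size(sizeof(u32)) + /* MPTCP_SUBFLOW_ATTR_SSN_OFFSET, was sizeof(u16) */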
bch2_btree_path_traverse_cached() was previously checking if it could
just relock the path, which is a common idiom in path traversal.

However, it was using btree_node_relock(), not btree_path_relock();
btree_path_relock() only succeeds if the path was in state
BTREE_ITER_NEED_RELOCK.

If the path was in state BTREE_ITER_NEED_TRAVERSE a full traversal is
needed; this led to a null ptr deref in
bch2_btree_path_traverse_cached().

And the short circuit check here isn't needed, since it was already done
in the main bch2_btree_path_traverse_one().

Signed-off-by: Kent Overstreet <[email protected]>
bch2_trigger_alloc was assuming that the new key would always be newly
created and thus always an alloc_v4 key, but that's not the case when
called from btree_gc.

Signed-off-by: Kent Overstreet <[email protected]>
forward compat fix

Signed-off-by: Kent Overstreet <[email protected]>
for_each_btree_node() now works similarly to for_each_btree_key(), where
the loop body is passed as an argument and run inside lockrestart_do().

This now calls trans_begin() on every loop iteration - which fixes an
SRCU warning in backpointers fsck.

Signed-off-by: Kent Overstreet <[email protected]>
If we need to increase the tree depth, allocate a new node, and then
race with another thread that increased the tree depth before us, we'll
still have a preallocated node that might be used later.

If we then use that node for a new non-root node, it'll still have a
pointer to the old root instead of being zeroed - fix this by zeroing it
in the cmpxchg failure path.

Signed-off-by: Kent Overstreet <[email protected]>
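The shape of the fix, as a sketch with hypothetical field names:

  /* If we lose the race to install the new root, the node we
   * preallocated stays around and may later be reused as a non-root
   * node, so zero its stale pointer to the old root now. */
  if (cmpxchg(&b->root, old_root, new_root) != old_root)
      memset(&new_root->key, 0, sizeof(new_root->key));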
This helps ensure key cache reclaim isn't contending with threads
waiting for the key cache to be helped, and fixes a severe performance
bug.

Signed-off-by: Kent Overstreet <[email protected]>
include information about the state of the btree key cache

Signed-off-by: Kent Overstreet <[email protected]>
bkey_fsck_err() was added as an interface that looks like fsck_err(),
but previously all it did was ensure that the appropriate error counter
was incremented in the superblock.

This is a cleanup and bugfix patch that converts it to a wrapper around
fsck_err(). This is needed to fix an issue with the upgrade path to
disk_accounting_v3, where the "silent fix" error list now includes
bkey_fsck errors; fsck_err() handles this in a unified way, and since we
need to change printing of bkey fsck errors from the caller to the inner
bkey_fsck_err() calls, this ends up being a pretty big change.

Also, rename .invalid() methods to .validate() for clarity, while we're
changing the function signature anyway (to drop the printbuf argument).

Signed-off-by: Kent Overstreet <[email protected]>
josefbacik and others added 26 commits August 19, 2024 00:23
Update uncompress_inline to take a folio and update its usage
accordingly.

Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Instead of using a page, use a folio instead, take a folio as an
argument, and update the callers appropriately.

Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
We only pass this into read_inline_extent, change it to take a folio and
update the callers.

Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Now that btrfs_get_extent takes a folio, update __get_extent_map to
take a folio as well.

Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
We already use a folio in parts of this function; replace all page usage
with the folio and update the function to take the folio as an argument.

Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Currently this already uses a folio for most things, update it to take a
folio and update all the page usage with the corresponding folio usage.

Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
We already use a lot of functions here that use folios, update the
function to use __filemap_get_folio instead of find_get_page and then
use the folio directly.

Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
We only use a page to copy in the data for the inline extent.  Use a
folio for this instead.

Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Instead of getting a page and using that to clear dirty for io, use the
folio helper and use the appropriate folio functions.

Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
…lace

At the end of a device replace we must go over all the chunk maps and
update their stripes to point to the target device instead of the source
device. We iterate over the chunk maps while holding a write lock and
we never reschedule, which can result in monopolizing a CPU for too long
and blocking readers for too long (it's a rw lock, non-blocking).

So improve on this by rescheduling if necessary. This is safe because at
this point we are holding the chunk mutex, which means no new chunks can
be allocated and therefore we don't risk missing a new chunk map that
covers a range behind the last one we processed before rescheduling.

Signed-off-by: Filipe Manana <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
When iterating the chunk maps when a device replace finishes we are doing
a full rbtree search for each chunk map, which is not the most efficient
thing to do, wasting CPU time. As we are holding a write lock on the tree
during the whole iteration, we can simply start from the first node in the
tree and then move to the next chunk map by doing a rb_next() call - the
only exception is when we need to reschedule, in which case we have to do
a full rbtree search since we dropped the write lock and the tree may have
changed (chunk maps may have been removed and the tree got rebalanced).
So just do that.

Signed-off-by: Filipe Manana <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
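Taken together, the two changes above give an iteration pattern like
this sketch (names approximate btrfs internals; update_stripes() and
find_first_chunk_map_above() are hypothetical helpers):

  write_lock(&fs_info->mapping_tree_lock);
  node = rb_first_cached(&fs_info->mapping_tree);
  while (node) {
      struct btrfs_chunk_map *map =
          rb_entry(node, struct btrfs_chunk_map, rb_node);

      update_stripes(map); /* point stripes at the target device */

      if (need_resched()) {
          u64 next = map->start + map->chunk_len;

          /* Drop the lock so readers can make progress... */
          write_unlock(&fs_info->mapping_tree_lock);
          cond_resched();
          write_lock(&fs_info->mapping_tree_lock);
          /* ...and re-search, as the tree may have been rebalanced. */
          node = find_first_chunk_map_above(fs_info, next);
      } else {
          node = rb_next(node);
      }
  }
  write_unlock(&fs_info->mapping_tree_lock);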
Add a comment to document the complicated locked_page unlock logic in
cow_file_range_inline. The specifically tricky part is that a caller
just up the stack converts ret == 0 to ret == 1 and then another
caller far up the callstack handles ret == 1 as a success, AND returns
without cleanup in that case, both of which "feel" unnatural and led to
the original bug.

Try to document that somewhat specific callstack logic here to explain
the weird un-setting of locked_folio on success.

Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Boris Burkov <[email protected]>
Signed-off-by: David Sterba <[email protected]>
This just creates unnecessary noise and doesn't provide any insights into
debugging RAID stripe-tree related issues.

Reviewed-by: Josef Bacik <[email protected]>
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Johannes Thumshirn <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Rename 'btrfs_io_stripe::is_scrub' to 'rst_search_commit_root'. While
'is_scrub' describes the state of the io_stripe (it is a stripe submitted
by scrub) it does not describe the purpose, namely looking at the commit
root when searching RAID stripe-tree entries.

Renaming the stripe to rst_search_commit_root describes this purpose.

Reviewed-by: Josef Bacik <[email protected]>
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Johannes Thumshirn <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Set rst_search_commit_root in the btrfs_io_stripe we're passing to
btrfs_map_block() in case we're doing data relocation.

Reviewed-by: Josef Bacik <[email protected]>
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Johannes Thumshirn <[email protected]>
Signed-off-by: David Sterba <[email protected]>
On relocation we're doing readahead on the relocation inode, but if the
filesystem is backed by a RAID stripe tree we can get ENOENT (e.g. due to
preallocated extents not being mapped in the RST) from the lookup.

But readahead doesn't handle the error and submits invalid reads to the
device, causing an assertion in the scatter-gather list code:

  BTRFS info (device nvme1n1): balance: start -d -m -s
  BTRFS info (device nvme1n1): relocating block group 6480920576 flags data|raid0
  BTRFS error (device nvme1n1): cannot find raid-stripe for logical [6481928192, 6481969152] devid 2, profile raid0
  ------------[ cut here ]------------
  kernel BUG at include/linux/scatterlist.h:115!
  Oops: invalid opcode: 0000 [#1] PREEMPT SMP PTI
  CPU: 0 PID: 1012 Comm: btrfs Not tainted 6.10.0-rc7+ torvalds#567
  RIP: 0010:__blk_rq_map_sg+0x339/0x4a0
  RSP: 0018:ffffc90001a43820 EFLAGS: 00010202
  RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffea00045d4802
  RDX: 0000000117520000 RSI: 0000000000000000 RDI: ffff8881027d1000
  RBP: 0000000000003000 R08: ffffea00045d4902 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000001000 R12: ffff8881003d10b8
  R13: ffffc90001a438f0 R14: 0000000000000000 R15: 0000000000003000
  FS:  00007fcc048a6900(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 000000002cd11000 CR3: 00000001109ea001 CR4: 0000000000370eb0
  Call Trace:
   <TASK>
   ? __die_body.cold+0x14/0x25
   ? die+0x2e/0x50
   ? do_trap+0xca/0x110
   ? do_error_trap+0x65/0x80
   ? __blk_rq_map_sg+0x339/0x4a0
   ? exc_invalid_op+0x50/0x70
   ? __blk_rq_map_sg+0x339/0x4a0
   ? asm_exc_invalid_op+0x1a/0x20
   ? __blk_rq_map_sg+0x339/0x4a0
   nvme_prep_rq.part.0+0x9d/0x770
   nvme_queue_rq+0x7d/0x1e0
   __blk_mq_issue_directly+0x2a/0x90
   ? blk_mq_get_budget_and_tag+0x61/0x90
   blk_mq_try_issue_list_directly+0x56/0xf0
   blk_mq_flush_plug_list.part.0+0x52b/0x5d0
   __blk_flush_plug+0xc6/0x110
   blk_finish_plug+0x28/0x40
   read_pages+0x160/0x1c0
   page_cache_ra_unbounded+0x109/0x180
   relocate_file_extent_cluster+0x611/0x6a0
   ? btrfs_search_slot+0xba4/0xd20
   ? balance_dirty_pages_ratelimited_flags+0x26/0xb00
   relocate_data_extent.constprop.0+0x134/0x160
   relocate_block_group+0x3f2/0x500
   btrfs_relocate_block_group+0x250/0x430
   btrfs_relocate_chunk+0x3f/0x130
   btrfs_balance+0x71b/0xef0
   ? kmalloc_trace_noprof+0x13b/0x280
   btrfs_ioctl+0x2c2e/0x3030
   ? kvfree_call_rcu+0x1e6/0x340
   ? list_lru_add_obj+0x66/0x80
   ? mntput_no_expire+0x3a/0x220
   __x64_sys_ioctl+0x96/0xc0
   do_syscall_64+0x54/0x110
   entry_SYSCALL_64_after_hwframe+0x76/0x7e
  RIP: 0033:0x7fcc04514f9b
  Code: Unable to access opcode bytes at 0x7fcc04514f71.
  RSP: 002b:00007ffeba923370 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fcc04514f9b
  RDX: 00007ffeba923460 RSI: 00000000c4009420 RDI: 0000000000000003
  RBP: 0000000000000000 R08: 0000000000000013 R09: 0000000000000001
  R10: 00007fcc043fbba8 R11: 0000000000000246 R12: 00007ffeba924fc5
  R13: 00007ffeba923460 R14: 0000000000000002 R15: 00000000004d4bb0
   </TASK>
  Modules linked in:
  ---[ end trace 0000000000000000 ]---
  RIP: 0010:__blk_rq_map_sg+0x339/0x4a0
  RSP: 0018:ffffc90001a43820 EFLAGS: 00010202
  RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffea00045d4802
  RDX: 0000000117520000 RSI: 0000000000000000 RDI: ffff8881027d1000
  RBP: 0000000000003000 R08: ffffea00045d4902 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000001000 R12: ffff8881003d10b8
  R13: ffffc90001a438f0 R14: 0000000000000000 R15: 0000000000003000
  FS:  00007fcc048a6900(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007fcc04514f71 CR3: 00000001109ea001 CR4: 0000000000370eb0
  Kernel panic - not syncing: Fatal exception
  Kernel Offset: disabled
  ---[ end Kernel panic - not syncing: Fatal exception ]---

So in case of a relocation on a RAID stripe-tree based file system, skip
the readahead.

Reviewed-by: Josef Bacik <[email protected]>
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Johannes Thumshirn <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Now that RAID stripe-tree lookup failures are not treated as a fatal issue
any more, change the RAID stripe-tree lookup error message to debug level.

Reviewed-by: Josef Bacik <[email protected]>
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Johannes Thumshirn <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Btrfs only supports sectorsize 4K, 8K, 16K, 32K, 64K for now, thus for
systems with 4K page size, there is no way the fs is subpage (sectorsize
< PAGE_SIZE).

So here we define btrfs_is_subpage() different according to the
PAGE_SIZE:

- PAGE_SIZE > 4K
  We may hit real subpage cases, define btrfs_is_subpage() as a regular
  function and do the usual checks.

- PAGE_SIZE == 4K (no smaller PAGE_SIZE support AFAIK)
  There is no way the fs is subpage, so just define btrfs_is_subpage()
  as an inline function which always returns false.

This saves about 7K bytes for x86_64 debug builds:

	   text	   data	    bss	    dec	    hex	filename
Before:	1484452	 168693	  25776	1678921	 199e49	fs/btrfs/btrfs.ko
After:	1476605	 168445	  25776	1670826	 197eaa	fs/btrfs/btrfs.ko

Signed-off-by: Qu Wenruo <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
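The split is essentially the following header change (sketch):

  #if PAGE_SIZE > SZ_4K
  /* Real subpage is possible: keep the runtime sectorsize check. */
  bool btrfs_is_subpage(const struct btrfs_fs_info *fs_info,
                        struct address_space *mapping);
  #else
  /* With 4K pages the fs can never be subpage, so let the compiler
   * fold every subpage branch away. */
  static inline bool btrfs_is_subpage(const struct btrfs_fs_info *fs_info,
                                      struct address_space *mapping)
  {
      return false;
  }
  #endif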
The BTRFS_IOC_SYNC ioctl wants to wake up the cleaner kthread so that it
does any pending work (subvolume deletion, delayed iputs, etc), however
it is waking up the transaction kthread, which in turn wakes up the
cleaner. Since we don't have any transaction to commit, as any ongoing
transaction was already committed when it called btrfs_sync_fs() and
the goal is just to wake up the cleaner thread, directly wake up the
cleaner instead of the transaction kthread.

Reviewed-by: Boris Burkov <[email protected]>
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
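In the ioctl handler this amounts to a one-line change (sketch):

  case BTRFS_IOC_SYNC:
      ret = btrfs_sync_fs(inode->i_sb, 1);
      /* No transaction to commit here; wake the cleaner directly
       * instead of detouring through the transaction kthread. */
      wake_up_process(fs_info->cleaner_kthread);
      break;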
Currently we're calling btrfs_num_copies() before btrfs_get_chunk_map() in
btrfs_map_block(). But btrfs_num_copies() itself does a chunk map lookup
to be able to calculate the number of copies.

So split out the code getting the number of copies from btrfs_num_copies()
into a helper called btrfs_chunk_map_num_copies() and directly call it
from btrfs_map_block() and btrfs_num_copies().

This saves us one rbtree lookup per btrfs_map_block() invocation.

Reviewed-by: Qu Wenruo <[email protected]>
Reviewed-by: Filipe Manana <[email protected]>
Signed-off-by: Johannes Thumshirn <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
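A sketch of the split; the RAID handling mirrors the existing
btrfs_raid_array lookup:

  static int btrfs_chunk_map_num_copies(const struct btrfs_chunk_map *map)
  {
      if (map->type & BTRFS_BLOCK_GROUP_RAID5)
          return 2;
      if (map->type & BTRFS_BLOCK_GROUP_RAID6)
          return 3;
      return btrfs_raid_array[btrfs_bg_flags_to_raid_index(map->type)].ncopies;
  }

  int btrfs_num_copies(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
  {
      struct btrfs_chunk_map *map = btrfs_get_chunk_map(fs_info, logical, len);
      int ret;

      if (IS_ERR(map))
          return 1;
      ret = btrfs_chunk_map_num_copies(map);
      btrfs_free_chunk_map(map); /* drop the lookup reference */
      return ret;
  }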
…sion

Unlike the bitmap usage inside raid56, for __extent_writepage_io() we
handle the subpage submission not sector-by-sector, but for each dirty
range we found.

This is not a big deal normally, as the complex subpage code is already
mostly optimized out by the compiler for x86_64.

However for the sake of consistency and for the future of subpage
sector-perfect compression support, this patch does:

- Extract the sector submission code into submit_one_sector()

- Add the needed code to extract the dirty bitmap for subpage case
  There is a small pitfall for the non-subpage case: since we clear
  page dirty before starting writeback, we have to manually set the
  default dirty_bitmap to 1 in that case.

- Use bitmap_and() to calculate the target sectors we need to submit
  This is done for both subpage and non-subpage cases, and will later
  be expanded to skip inline/compression ranges.

For x86_64, the dirty bitmap will be fixed to 1, with the length of 1,
so we're still doing the same workload per sector.

For larger page sizes, the overhead will be a little larger, as
previously we only needed one extent_map lookup per dirty range, but now
it will be one extent_map lookup per sector.

But that is the same frequency as x86_64, so we're just aligning the
behavior to x86_64.

Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Qu Wenruo <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
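The submission loop then looks roughly like this sketch (subpage details
elided; get_subpage_dirty_bitmap() is a hypothetical stand-in, and
first_bit/nbits_in_range describe the dirty range being written):

  unsigned long range_bitmap = 0;
  unsigned long dirty_bitmap = 1; /* non-subpage: one implicit sector */
  const unsigned int nbits = PAGE_SIZE >> fs_info->sectorsize_bits;
  unsigned int bit;

  if (btrfs_is_subpage(fs_info, folio->mapping))
      dirty_bitmap = get_subpage_dirty_bitmap(fs_info, folio);

  /* Restrict submission to the dirty sectors inside this range. */
  bitmap_set(&range_bitmap, first_bit, nbits_in_range);
  bitmap_and(&dirty_bitmap, &dirty_bitmap, &range_bitmap, nbits);

  for_each_set_bit(bit, &dirty_bitmap, nbits)
      submit_one_sector(inode, folio,
                        folio_pos(folio) + (bit << fs_info->sectorsize_bits));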
Clean up resources using goto to get rid of repeated code.

Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Junchao Sun <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Use xarray to track dirty extents to reduce the size of the struct
btrfs_qgroup_extent_record from 64 bytes to 40 bytes.  The xarray is
more cache line friendly, and it reduces the complexity of the
insertion and search code compared to the rb-tree.

Another change introduced is about error handling.  Before this patch,
btrfs_qgroup_trace_extent_nolock() always succeeded.  Now, because the
function calls xa_store(), which can fail, we mark the qgroup as
inconsistent if an error happens and then free the preallocated memory.
We also preallocate the memory before taking the spin_lock(); if the
preallocation fails, error handling is the same as in the existing code.

Suggested-by: Qu Wenruo <[email protected]>
Signed-off-by: Junchao Sun <[email protected]>
Signed-off-by: David Sterba <[email protected]>
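Schematically, the insertion path becomes the following (a sketch; the
xarray index and the error-handling helper names are approximations):

  /* Preallocate outside the lock, since the allocation may sleep. */
  record = kzalloc(sizeof(*record), GFP_NOFS);
  if (!record)
      return -ENOMEM;

  spin_lock(&delayed_refs->lock);
  ret = xa_insert(&delayed_refs->dirty_extents,
                  bytenr >> fs_info->sectorsize_bits, record, GFP_ATOMIC);
  spin_unlock(&delayed_refs->lock);
  if (ret) {
      /* e.g. -ENOMEM for an internal xarray node: accounting can no
       * longer be trusted, so flag qgroups as inconsistent. */
      kfree(record);
      qgroup_mark_inconsistent(fs_info);
  }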
Fix a few obvious grammar mistakes: a -> an, then -> than.

Signed-off-by: Thorsten Blum <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
The parameter @nr_ret is used to tell the caller how many sectors have
been submitted for IO.

Then callers check the @nr_ret value to determine if we need to manually
clear PAGECACHE_TAG_DIRTY, as if we submitted no sector (e.g. all
sectors are beyond i_size) there is no folio_start_writeback() call, thus
the PAGECACHE_TAG_DIRTY tag will not be cleared.

Remove this parameter by:

- Moving the btrfs_folio_clear_writeback() call into
  __extent_writepage_io()
  So that if we didn't submit any IO, we manually call
  btrfs_folio_set_writeback() to clear PAGECACHE_TAG_DIRTY when
  the page is no longer dirty.

- Use a bool to record if we have submitted any sector
  Instead of an int.

- Use subpage compatible helpers to end folio writeback.
  This brings no change to the behavior, just for the sake of consistency.

  As for the call site inside __extent_writepage(), we're always called
  for the whole page, so the existing full page helper
  folio_(start|end)_writeback() is totally fine.

  For the call site inside extent_write_locked_range(), although we can
  have subpage range, folio_start_writeback() will only clear
  PAGECACHE_TAG_DIRTY if the page is no longer dirty, and the full folio
  will still be dirty if there is any subpage dirty range.
  Only when the last dirty subpage sector is cleared, the
  folio_start_writeback() will clear PAGECACHE_TAG_DIRTY.

  So no matter if we call the full page or subpage helper, the result
  is still the same, then just use the subpage helpers for consistency.

Signed-off-by: Qu Wenruo <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
We have transient failures with btrfs/301, specifically in the part
where we do

for i in $(seq 0 10); do
	write 50m to file
	rm -f file
done

Sometimes this will result in a transient quota error, and it's because
sometimes we start writeback on the file which results in a delayed
iput, and thus the rm doesn't actually clean the file up.  When we're
flushing the quota space we need to run the delayed iputs to make sure
all the unlinks that we think have completed have actually completed.
This removes the small window where we could fail to find enough space
in our quota.

Signed-off-by: Josef Bacik <[email protected]>
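The fix is then to run and wait for delayed iputs in the qgroup flush
path, roughly as below (sketch; the enclosing function name is assumed):

  static int try_flush_qgroup(struct btrfs_root *root)
  {
      struct btrfs_fs_info *fs_info = root->fs_info;

      /* ... flush delalloc and wait for ordered extents ... */

      /* A freshly unlinked file can still pin its quota space via a
       * delayed iput; process them so the space is really released
       * before we retry the reservation. */
      btrfs_run_delayed_iputs(fs_info);
      btrfs_wait_on_delayed_iputs(fs_info);

      /* ... commit the transaction ... */
      return 0;
  }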