Skip to content

Linux 6.15 compat #17229

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: master
Choose a base branch
from
Open

Linux 6.15 compat #17229

wants to merge 3 commits into from

Conversation

robn
Copy link
Member

@robn robn commented Apr 8, 2025

Motivation and Context

"I wish it need not have happened in my time," said Frodo. "So do I," said Gandalf, "and so do all who live to see such times. But that is not for them to decide. All we have to decide is what to do with the time that is given us."

6.15-rc1 released, so off we go again.

Description

First two changes are benign: minor signature change for iops->mkdir, and a rename for del_timer_sync. Nothing to see here.

The last one is more complicated, and I would like a subject-matter expert or two to have a look, because I'm not that and this is educated guesswork.

For background, since I had to go figure it out myself: secpolicy_zfs_proc() is used to ensure that a given process (or credential from it) is allowed to perform some operation. It's used by various async admin tasks (eg taking a snapshot), where the calling process (eg zfs snapshot in userspace) is not the one that actually performs the work (some kernel thread), and of particular interest, I think, is the use in channel programs, where we need to consider the permissions of the zfs program process as different Lua functions are called.

On Linux, most of the capability-checking functions check permissions for the current task, not some arbitrary task. We have used has_capability(), which can take a credential from a target process; exactly what we want. However, this function has been removed in 6.15, and there's no easy equivalent we can just drop in:

  • has_capabaility_noaudit() would drop in, but as it says, doesn't post an audit event, which isn't something I just want to get rid of unless we have no choice
  • has_ns_capability(), security_capable() and other low-level functions aren't exported to modules
  • file_ns_capable() is exported, but it's a quite awful to cook a struct file with only a struct cred inside

So, I've taken a slightly different approach. There's a set of APIs for swapping a struct cred into the current task, and reverting it. I'm now using this to get the cred from the target task, swap it in, do the check, swap it all back out and return. This appears to at least not break anything in the test suite, but I don't know to what extent it's actually exercised, so I'd like someone who knows this subsystem or has the ability to test properly to do so and report back.

I don't love this, in part because of the warning in the kernel headers for get_task_cred() that "the caller must make sure task doesn't go away". However, it seems that if this is a problem, then it's already a problem, in that the stashed task we pass to secpolicy_zfs_proc() could have disappeared since the original call from userspace. I'm assuming that's not already happening for now.

Incidentally, if this does pan out, it might be the way to change the API to only need the cred_t, like on FreeBSD, by calling prepare_kernel_cred() wherever we would call CRED() now. This creates a struct cred explicitly designed to allow a kernel thread to do something on behalf of a userspace program. It does change thing a little, as we need a release function (the returned cred is heap-allocated), but it might allow us to not have to pass the task around with the cred. I'm not going to do this (for now) as I really don't know enough about it all, but there's a small project if someone wants it.

Tiny spanner: get_task_cred() exists but was not exported in upstream 4.19 (though I'd be surprised if it were not backported to an EL8 kernel). Regardless, it is trivial to implement a wrapper, or to just fall back to has_capability() there. The reason I haven't kept has_capability() for <6.15 is because I would rather not have two entirely different methods on "active" kernel releases unless there's a real need. Too many places for bugs and quirks to hide otherwise.

How Has This Been Tested?

Compiled checked on all LTS kernel series >= 4.19, and all 6.x kernel series.

Sanity run (create pool, fio, export, import, scrub) run on >= 5.10, completed successfully. (4.19 skipped for now, as above, and something weird booting 5.5 on my test VMs, unrelated)

ZTS now running on 6.15-rc1. I'll report back tomorrow!

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

robn added 3 commits April 8, 2025 20:20
The intent is that the filesystem may have a reference to an "old"
version of the new directory, eg if it was keeping it alive because a
remote NFS client still had it open.

We don't need anything like that, so this really just changes things so
we return error codes encoded in pointers.

Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <[email protected]>
Renamed in 6.2, and the compat wrapper removed in 6.15. No signature or
functional change apart from that, so a very minimal update for us.

Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <[email protected]>
Linux 6.15 removes has_capability(), as nothing uses it internally.
Unfortunately, there are no readily available alternatives available to
us, forcing use of a different approach.

Instead, we borrow the credential from the target process, swap it into
the current process, then do a standard capability check for the current
process. We revert the swap, release the cred, and return the chec
result.

Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <[email protected]>
@amotin amotin added the Status: Code Review Needed Ready for review and testing label Apr 8, 2025
return (0);
const struct cred *c = get_task_cred(proc);
const struct cred *old = override_creds(c);
bool hascap = capable(CAP_SYS_ADMIN);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since has_capability() was just:

bool has_capability(struct task_struct *t, int cap)
{
	return has_ns_capability(t, &init_user_ns, cap);
}
EXPORT_SYMBOL(has_capability);

before, can't we just use has_ns_capability() in the same way?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

init_user_ns is GPL'd. Could we use cr->user_ns instead?

int                                                                              
secpolicy_zfs_proc(const cred_t *cr, proc_t *proc)                               
{                                                                                
    if (!has_ns_capability(proc, cr->user_ns, CAP_SYS_ADMIN))                    
        return (EACCES);                                                         
    return (0);                                                                  
} 

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

has_ns_capability is not exported.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(incidentally, if you did ever need init_user_ns, I'm pretty sure you can always get it through init_task.init_cred.user_ns).

@robn
Copy link
Member Author

robn commented Apr 9, 2025

Died overnight during the ZTS run with a hard panic:

[ 5253.763882] Kernel panic - not syncing: buffer modified while frozen!
[ 5253.764100] CPU: 3 UID: 0 PID: 891629 Comm: rm Tainted: P           OE       6.15.0-rc1 #1 PREEMPT(voluntary)
[ 5253.764424] Tainted: [P]=PROPRIETARY_MODULE, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[ 5253.764664] Hardware name: FreeBSD BHYVE/BHYVE, BIOS 14.0 10/17/2021
[ 5253.764872] Call Trace:
[ 5253.764961]  <TASK>
[ 5253.765039]  panic+0x33f/0x3a0
[ 5253.765155]  arc_cksum_verify+0x20e/0x210 [zfs]
[ 5253.765554]  arc_buf_destroy_impl+0x38/0x330 [zfs]
[ 5253.765840]  arc_buf_destroy+0xf7/0x370 [zfs]
[ 5253.766113]  dbuf_destroy+0x5d/0x790 [zfs]
[ 5253.766384]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 5253.766433]  ? aggsum_add+0x1eb/0x310 [zfs]
[ 5253.766433]  dbuf_free_range+0x286/0x710 [zfs]
[ 5253.766433]  dnode_free_range+0x1f5/0x8c0 [zfs]
[ 5253.766433]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 5253.766433]  ? dmu_tx_try_assign+0x3c5/0x470 [zfs]
[ 5253.766433]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 5253.766433]  ? txg_rele_to_quiesce+0x40/0x90 [zfs]
[ 5253.766433]  dmu_free_long_range_impl+0x2b0/0x4e0 [zfs]
[ 5253.766433]  dmu_free_long_range+0x76/0xc0 [zfs]
[ 5253.766433]  zfs_rmnode+0x3a8/0x5b0 [zfs]
[ 5253.766433]  zfs_inactive+0x1e9/0x300 [zfs]
[ 5253.766433]  ? down_read+0xe/0xa0
[ 5253.766433]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 5253.766433]  zpl_evict_inode+0x36/0x50 [zfs]
[ 5253.766433]  evict+0x113/0x280
[ 5253.766433]  do_unlinkat+0x29a/0x2f0
[ 5253.766433]  __x64_sys_unlinkat+0x31/0x60
[ 5253.766433]  do_syscall_64+0x7e/0x190
[ 5253.766433]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 5253.766433]  ? __do_sys_newfstatat+0x4d/0x80
[ 5253.766433]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 5253.766433]  ? syscall_exit_to_user_mode+0x49/0x210
[ 5253.766433]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 5253.766433]  ? do_syscall_64+0x8a/0x190
[ 5253.766433]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 5253.766433]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 5253.766433]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 5253.766433] RIP: 0033:0x7f978d243ab7
[ 5253.766433] Code: 73 01 c3 48 8b 0d 49 83 0d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 07 01 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 19 83 0d 00 f7 d8 64 89 01 48
[ 5253.766433] RSP: 002b:00007ffca3d4ad88 EFLAGS: 00000206 ORIG_RAX: 0000000000000107
[ 5253.766433] RAX: ffffffffffffffda RBX: 0000561e9ecb75a0 RCX: 00007f978d243ab7
[ 5253.766433] RDX: 0000000000000000 RSI: 0000561e9ecb6380 RDI: 00000000ffffff9c
[ 5253.766433] RBP: 0000561e9ecb62f0 R08: 0000000000000003 R09: 0000000000000000
[ 5253.766433] R10: 00007f978d162f80 R11: 0000000000000206 R12: 0000000000000000
[ 5253.766433] R13: 00007ffca3d4af70 R14: 0000000000000003 R15: 0000561e9ecb75a0
[ 5253.766433]  </TASK>
[ 5253.766433] Kernel Offset: 0x14800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 5253.766433] ---[ end Kernel panic - not syncing: buffer modified while frozen! ]---

Last log says it was in cleanup for cli_root/zpool_resilver, but nothing interesting had happened up until that point, and no failures. My gut feeling is that it's nothing to do with any particular test, but rather, something in inode cleanup. I'll set it running again to see if it blows up in the same way and/or the same spot, and later today I'll see if there were any changes around inode cleanup that didn't change the API but did change behaviour.

@robn
Copy link
Member Author

robn commented Apr 9, 2025

Second run ran to completion, all successful. I'll do a few, see if I can run into it again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Code Review Needed Ready for review and testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants