Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.
I confirm that this does not happen with the proprietary driver package.
Operating System and Version
Container-Optimized OS
Kernel Release
6.6.72
Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.
I am running on a stable kernel release.
Hardware: GPU
B200
Describe the bug
When attempting to load the NVIDIA driver on a BTI-enabled ARM64 kernel, a kernel panic occurs with the following call trace:
The crash occurs within the _portMemAllocatorAllocNonPagedWrapper function. Disassembly of this function shows that BTI instructions are missing where expected.
To Reproduce
We are able to reproduce this crash by setting CONFIG_ARM64_BTI and CONFIG_ARM64_BTI_KERNEL in our v6.6.72-based arm64 kernel, which is cross-compiled on an x86_64 host for an arm64 target. The crash is consistent and happens with all versions of open-gpu-kernel-modules available at this time.
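For anyone trying to reproduce this, a quick way to confirm that a kernel configuration actually has both options enabled is to parse the `.config` text. The following is a minimal sketch, not part of the driver or the kernel tree; the function name `enabled_options` and the sample configuration are hypothetical and only illustrate the usual `CONFIG_FOO=y` / `# CONFIG_FOO is not set` layout.

```python
def enabled_options(config_text, names):
    """Return {name: bool} saying whether each option is set to 'y'.

    Assumes the usual kernel .config layout: enabled options appear as
    'CONFIG_FOO=y', disabled ones as '# CONFIG_FOO is not set' or are absent.
    """
    values = {}
    for line in config_text.splitlines():
        line = line.strip()
        # Skip blank lines and comments (including "is not set" lines).
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            values[key] = value
    return {name: values.get(name) == "y" for name in names}


# Hypothetical sample, not taken from the reporter's kernel.
SAMPLE_CONFIG = """\
CONFIG_ARM64=y
CONFIG_ARM64_BTI=y
# CONFIG_ARM64_BTI_KERNEL is not set
"""

if __name__ == "__main__":
    wanted = ["CONFIG_ARM64_BTI", "CONFIG_ARM64_BTI_KERNEL"]
    for name, on in enabled_options(SAMPLE_CONFIG, wanted).items():
        print(f"{name}: {'enabled' if on else 'not enabled'}")
```

In the sample above only CONFIG_ARM64_BTI is enabled, so that kernel would not enforce BTI for kernel-mode code and would not be expected to hit this crash.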
Bug Incidence
Always
nvidia-bug-report.log.gz
The kernel crash prevents us from capturing the nvidia-bug-report.log.gz.
More Info
We believe the driver build system is not generating the bti instructions in all the right places. The "src" directory is not compiled with -mbranch-protection=bti, but the "kernel-open" directory is. This may be because "kernel-open" contains the Kbuild file and uses the kernel's configuration, while "src" doesn't seem to. Hardcoding the flag in the build seems to generate the bti instructions in all the right places, which supports this explanation.
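One way to check which functions were built without the flag is to scan a disassembly listing for function entries that do not start with a landing pad. The sketch below assumes the text layout printed by GNU `objdump -d` for AArch64 and treats `bti c`, `bti jc`, `paciasp`, and `pacibsp` as valid call targets. The helper name `functions_missing_bti` and the sample listing are hypothetical and are not taken from the actual driver binary.

```python
import re

# "0000000000000000 <name>:" starts a function in objdump -d output.
LABEL_RE = re.compile(r"^[0-9a-f]+ <([^>]+)>:\s*$")
# "   0:<tab>d503245f <tab>bti<tab>c" is one decoded instruction.
INSN_RE = re.compile(r"^\s*[0-9a-f]+:\s+[0-9a-f]{8}\s+(\S+)(?:\s+(.*))?$")

# First instructions that are acceptable targets for an indirect call.
LANDING_PADS = {("bti", "c"), ("bti", "jc"), ("paciasp", ""), ("pacibsp", "")}


def functions_missing_bti(disasm):
    """Return names of functions whose first instruction is not a landing pad."""
    missing = []
    current = None  # function whose first instruction we still need to see
    for line in disasm.splitlines():
        label = LABEL_RE.match(line)
        if label:
            current = label.group(1)
            continue
        insn = INSN_RE.match(line)
        if insn and current is not None:
            mnemonic = insn.group(1)
            # Drop any trailing "// comment" that objdump may append.
            operands = (insn.group(2) or "").split("//")[0].strip()
            if (mnemonic, operands) not in LANDING_PADS:
                missing.append(current)
            current = None  # only the first instruction matters
    return missing


# Hypothetical listing for illustration only.
SAMPLE = """
0000000000000000 <with_bti>:
   0:\td503245f \tbti\tc
   4:\td65f03c0 \tret

0000000000000008 <without_bti>:
   8:\td10083ff \tsub\tsp, sp, #0x20
   c:\td65f03c0 \tret
"""

if __name__ == "__main__":
    print(functions_missing_bti(SAMPLE))
```

Not every function needs a landing pad, since only targets of indirect branches are checked by the hardware, so the result is a list of candidates to inspect rather than a list of definite bugs.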
@arnav-kansal I believe this issue is along the same lines as an issue that was opened for CFI violations. @aritger has a good write-up on our build system design and what users can do to work with it.
For whatever it is worth, there are a few motivations for the split:
(1) Historically, the non-kbuild part (the part that produces nv-kernel.o) was built internally to NVIDIA and is what was distributed as binary-only. Code not built for a specific target kernel cannot use kbuild.
(2) With the advent of open-gpu-kernel-modules, we chose to retain that split so that users installing the driver wouldn't be required to build all of the kernel module at install time. I.e., the NVIDIA .run file contains a pre-built open-gpu-kernel-modules nv-kernel.o. We can only do that because nv-kernel.o is not kernel-specific. Currently, open-gpu-kernel-modules takes about 10 minutes to build if single-threaded. Much of that can be covered with a parallel build, but we didn't want to add that install time for every user installing from the .run file if we didn't need to.
The big disadvantage of the split is of course that you need to match these sorts of compiler flags across the split if doing instrumentation like RAP.
Maybe the benefits of (2) are outweighed by the downsides and we should revisit that decision.
That is at least the context. So, I don't know if we can immediately move to an all kbuild-native build.
The nv_encode_caching() bug is a good catch. Thanks for that. Does nv-mmap.c not include nv-proto.h? If not, that is a bug, too. Even with the current split, I would expect the compiler to complain if the prototype and implementation mismatch.
For the near term, would it be acceptable to pass these additional CFLAGS on the make command line? Maybe the makefiles need more variable plumbing to facilitate that. But I think it will be easier to get traction with something like that than to require kbuild-ifying the entirety of the open-gpu-kernel-modules build. The code changes for that wouldn't be difficult, but the hard part would be the packaging/installation implications of that choice.
NVIDIA Open GPU Kernel Modules Version
all