Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AMDVLK out of bounds write causes memory corruption/crash when an attached display monitor has more than 64 video modes. #179

Open
kleinerm opened this issue Oct 1, 2024 · 4 comments

Comments

@kleinerm
Copy link
Contributor

kleinerm commented Oct 1, 2024

Short summary:

If a Vulkan client on Linux tries to call vkGetDisplayModePropertiesKHR() on a video output which exposes more than 64 video modes then amdvlk experiences memory and stack corruption due to out-of-bounds writes, leading to client crash/termination, e.g., "Stack smashing detected" abort().

Explanation of how/why this happens:

If a Vulkan client on Linux tries to call vkGetDisplayModePropertiesKHR() on a VKDisplayKHR (running on a virtual terminal, or created from a RandR leased output or similar), then the following call-chain is executed in XGL's vk_instance.cpp:

vk_physical_device.cpp: vkGetDisplayModePropertiesKHR() -> vk_physical_device.cpp: PhysicalDevice::GetDisplayModeProperties()-> vk_instance.cpp: Instance::GetScreenModeList() -> Multiple calls to pScreen->GetScreenModeList() implemented in PAL, which will then call into the Linux OS specific amdgpuDevice.cpp : Device::QueryScreenModesForConnector()

Device::QueryScreenModesForConnector() returns the number of video modes supported on the Linux DRM connector for the video output associated with the VKDisplayKHR handle, or a list of all the video modes. It returns the full count, or the full list of modes.

This is a problem, because the current implementation of XGL uses multiple arrays for storing or caching these returned video modes, which have a fixed maximum size of Pal::MaxModePerScreen = 64 elements, as defined in palPlatform.h.

For example pModeList in struct vk_instance.h ScreenObject

Pal::ScreenMode* pModeList[Pal::MaxModePerScreen];
and in
Pal::ScreenMode* pScreenMode[Pal::MaxModePerScreen];
are arrays which are involved in the above call sequence and have a fixed maximum capacity of 64 video modes.

Therefore calling vkGetDisplayModePropertiesKHR() on a display with > 64 modes will cause out-of-bounds writes into these arrays, corrupting data structures inside the VKInstance and also stack corruption, leasing to malfunction or crash of the client app.

Possible solutions?

Ideally there wouldn't be fixed size arrays like those mentioned above. Or at least the limit of Pal::MaxModePerScreen inside
https://github.com/GPUOpen-Drivers/pal/blob/31f6a70d0e9ee22894daab5305dd91b3ebb599a5/inc/core/palPlatform.h#L88
should be raised to a much larger number.

Then the code inside Instance::GetScreenModeList() should do checking on the modecount values returned by Pal and clamp the returned mode counts and internally cached modes to a maximum of Pal::MaxModePerScreen to prevent future out-of-bounds writes due to array overflows.

VkResult Instance::GetScreenModeList(

Thanks.

@jinjianrong
Copy link
Member

The fix will be in next release

@kleinerm
Copy link
Contributor Author

kleinerm commented Nov 4, 2024

Great! Did you bump the Pal::MaxModePerScreen to a higher value and implement suitable bounds checking/clamping to prevent a future repetition of the bug? Or even switch to dynamic allocation?

For reference, a user of my software which first reported this crash, experienced it with a HDR projector that exposed 107 video modes. So if you just bumped the static limit, but didn't switch to dynamic allocation, I would propose choosing at least a limit of 256 modes at a minimum, or maybe even a more generous limit of 512 modes to be on the relative safe side.

Thanks!

@jinjianrong
Copy link
Member

Here is the fix : #180

kleinerm added a commit to kleinerm/Psychtoolbox-3 that referenced this issue Nov 5, 2024
…ure drivers.

AMDVLK v-2024.Q4.2 will have a suitable bug fix, so the workaround
ain't needed anymore on Navi+ / RDNA gpu's supported by that driver.

The fix is untested and not testable by myself as I don't have a
RDNA+ gpu, but I performed a full code review and conclude that
the bug should be fixed by that fix, which now implements dynamic
allocation of suitable data structures, so no hard mode count limit
anymore and future proof.

Bug report: GPUOpen-Drivers/xgl#179
Bugfix merged 5.11.2024: GPUOpen-Drivers/xgl#180
@kleinerm
Copy link
Contributor Author

kleinerm commented Nov 6, 2024

I reviewed the code of your fix and it looks good to me. Dynamic allocation, great!
I can't test this myself easily anymore, as the current driver no longer supports the Polaris and Vega gpu's I have, sadly only RDNA. But I will let my user who has a RDNA gpu know about the fix. Maybe he can test it already and then I can close this issue.
Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants