Module unload failure following IPC timeout #4749

Closed
andyross opened this issue Dec 16, 2023 · 7 comments
Labels
bug (Something isn't working), MTL (Applies to Meteor Lake platform)

Comments

@andyross

After seeing an IPC timeout from the firmware, the snd_sof_pci_intel_mtl module hangs trying to unload, leaving the kernel in an unrecoverable state that requires a reboot to resume audio. (See #8638 for an easy recipe for causing a timeout).

This is the script I cooked up to get the full module stack reloaded with correct dependency ordering (at least in the kernel I'm using). It runs fine when not in an error state, but after the IPC failure it only gets as far as the MTL module then hangs.

#!/bin/sh
rmmod snd_soc_sof_rt5682
rmmod snd_soc_rt5645
rmmod snd_soc_hdac_hdmi
rmmod snd_soc_intel_hda_dsp_common
rmmod snd_soc_intel_sof_maxim_common
rmmod snd_soc_intel_sof_realtek_common
rmmod snd_soc_intel_sof_ssp_common
rmmod snd_soc_rt5682
rmmod snd_sof_probes
rmmod snd_soc_rl6231
rmmod snd_hda_codec_hdmi
rmmod snd_soc_dmic
rmmod snd_sof_pci_intel_mtl
rmmod snd_sof_intel_hda_common
rmmod snd_sof_intel_hda
rmmod soundwire_intel
rmmod soundwire_generic_allocation
rmmod snd_sof_intel_hda_mlink
rmmod soundwire_cadence
rmmod snd_sof_pci
rmmod snd_sof_xtensa_dsp
rmmod snd_soc_hdac_hda
rmmod snd_soc_acpi_intel_match
rmmod snd_soc_acpi
rmmod snd_hda_ext_core
rmmod snd_sof
rmmod snd_sof_utils
rmmod soundwire_bus
rmmod snd_intel_dspcfg
rmmod snd_intel_sdw_acpi
rmmod snd_hda_codec
rmmod snd_hwdep
rmmod snd_hda_core
rmmod snd_soc_rt5682s
rmmod snd_soc_max98357a

modprobe snd_soc_rt5682s
modprobe snd_soc_hdac_hdmi
modprobe snd_soc_max98357a
modprobe snd_sof_pci_intel_mtl
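
A hand-ordered list like the one above goes stale as dependencies change between kernels. As a rough sketch of an alternative, the helper below reads input in `/proc/modules` format and prints the audio modules whose use count is currently zero, which are the ones `rmmod` should accept at that moment. The function name and the `snd`/`soundwire` prefix filter are assumptions for illustration, not something taken from the script above or from sof-test.

```shell
#!/bin/sh
# Print audio-related modules whose use count is zero.
# Input is in /proc/modules format:
#   name size use_count dependents state address
# Hypothetical helper; the prefix filter is an assumption based on the
# module names in the list above.
unloadable_audio_modules() {
    awk '$3 == 0 && $1 ~ /^(snd|soundwire)/ { print $1 }' "${1:-/proc/modules}"
}
```

Calling it repeatedly and running `rmmod` on each name it prints, until it prints nothing, would peel the stack from the leaves inward. If it keeps printing the same module, that module is the one that refuses to unload.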

The immediate impact to me is just debugging speed (it's really annoying to wait for a reboot). But in general this kind of "module with no dependencies won't unload" issue is accompanied by more serious things like dangling pointers or memory leaks. Needs attention at reasonably high priority.

@andyross
Author

Sorry, issue is in different project. See the SOF bug 8638 for a reproduction recipe: thesofproject/sof#8638

@kv2019i added the bug and MTL labels Dec 22, 2023
@kv2019i
Collaborator

kv2019i commented Dec 22, 2023

@andyross Is this still an issue, or was this specific to thesofproject/sof#8638 (given that the fix for that was in the end on the Linux kernel side)? I agree this is a mechanism that should work, and basically we depend on this in our CI. We use the scripts at https://github.com/thesofproject/sof-test/tree/main/tools/kmod to unload/reload with all dependencies sorted out.

Of course, mileage may vary depending on how badly the DSP fails, but in the typical case, the module reload will work.

If this still occurs, can you share kernel logs from a case where it fails? And can you double-check with "lsof" that no user-space entity is holding on to driver resources? Having mtrace-reader.py running will also block kernel module unload. But you have probably already checked these.
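
For reference, the "lsof" check could look roughly like the sketch below. It assumes the driver's user-space handles are the ALSA device nodes under the standard `/dev/snd` directory; the function name and the optional directory argument are made up for illustration.

```shell
#!/bin/sh
# List processes that hold ALSA device nodes open. Empty output suggests
# user space is not what keeps the driver busy.
# Hypothetical helper; /dev/snd is the usual ALSA device directory and the
# argument only exists so another directory can be inspected.
audio_device_holders() {
    dir="${1:-/dev/snd}"
    command -v lsof >/dev/null 2>&1 || { echo "lsof not found" >&2; return 2; }
    # lsof exits non-zero when nothing is open, so its status is ignored.
    lsof "$dir"/* 2>/dev/null || true
}
```

A separate `pgrep -f mtrace-reader.py` would show whether the trace reader mentioned above is still running.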

@andyross
Author

I literally just validated, and indeed the fix referenced (which, it should be noted, was merged before the bug report; my image was about two weeks stale) fixes the DSP PM management and unblocks this. I can reload successfully now.

But this is a separate issue. The proximate cause is the DSP hang due to a kernel bug, but it could have been anything. You can imagine the DSP deliberately doing this (checking for an IDC handling a comp_free for a DP component) and then just running arch_irq_lock(); while(1);. To the kernel this would look identical, and it would be stuck and unable to recover audio without a reboot. The kernel needs to be able to bounce the DSP and recover state in all circumstances.

To be clear: this isn't currently a ChromeOS recovery method, but it might be. Fixing it isn't high priority, but we should at least get to some kind of affirmative analysis that says this is benign (i.e. that the only symptom is a hung rmmod and that there isn't a crash bug in there somewhere due to the removed dependencies).

@kv2019i
Collaborator

kv2019i commented Dec 22, 2023

Ack, that scenario should work. In fact, I can confirm we regularly have such cases where the DSP panics in a CI run, and the Linux kernel does recover. We have some debug options that interfere with this a bit (e.g. with CONFIG_SND_SOC_SOF_DEBUG_RETAIN_DSP_CONTEXT, the DSP is left powered on after a crash; you can remove the SOF kernel modules, but a runtime PM ref is leaked on purpose, so runtime suspend will no longer work even if you reload).
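
To rule that debug option in or out on a given image, something like the following sketch could be used. The function name is hypothetical, and the default config path is an assumption: many distros ship `/boot/config-$(uname -r)`, while others only expose `/proc/config.gz`, which would need to be decompressed first.

```shell
#!/bin/sh
# Succeed (exit status 0) if the given kernel config file enables the
# retain-DSP-context debug option named above.
# Hypothetical helper; the default path is an assumption and varies by distro.
retain_dsp_context_enabled() {
    grep -q '^CONFIG_SND_SOC_SOF_DEBUG_RETAIN_DSP_CONTEXT=y' \
        "${1:-/boot/config-$(uname -r)}"
}
```

If the option is set, the `power/runtime_status` attribute of the audio PCI device in sysfs is a reasonable place to look for the device never reaching a suspended state after a reload.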

So in short, if this still happens, this is a valid bug and affects the SOF CI as well.

@plbossart
Member

@andyross @kv2019i FYI we listed recovery as a needed capability back in 2018 (#452) and again in 2020 (#1675).

It's been done before on legacy Intel drivers but we don't have a signaling mechanism (heartbeat or something) at the firmware level nor a detection/recovery on the host side. And we'd also need to signal a reset to userspace.

I guess once we have the IMR context save we'll probably need something anyway; for now we reset the state when going back to D0, but that will no longer be true with MTL+.

@plbossart
Member

@andyross should we close this issue?

@plbossart
Member

no information, closing
