Module unload failure following IPC timeout #4749
Comments
Sorry, the issue is in a different project. See SOF bug 8638 for a reproduction recipe: thesofproject/sof#8638
@andyross Is this still an issue, or was it specific to thesofproject/sof#8638 (given that the fix for that ended up on the Linux kernel side)? I agree this is a mechanism that should work, and basically we depend on it in our CI. We use the scripts at https://github.com/thesofproject/sof-test/tree/main/tools/kmod to unload/reload with all dependencies sorted out. Of course, mileage may vary depending on how badly the DSP fails, but in the typical case the module reload will work. If this still occurs, can you share kernel logs from a case where it fails? And can you double-check with "lsof" that no user-space entity is holding on to driver resources? Having mtrace-reader.py running will also block kernel module unload. But you have probably checked these already.
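For illustration, the "is user space still holding the device?" check can be sketched without lsof by walking the per-process fd links. This is a minimal sketch under stated assumptions, not something taken from the sof-test scripts: the function name `snd_holders` and its two optional arguments are hypothetical, it only sees processes whose fd directories the caller may read (so it generally needs root), and it matches on a path prefix only.

```shell
#!/bin/sh
# snd_holders [PROC_ROOT] [PREFIX] (hypothetical helper, illustration only):
# print "PID TARGET" for every open file descriptor whose link target
# starts with PREFIX. PROC_ROOT defaults to /proc, PREFIX to /dev/snd/.
snd_holders() {
    proc_root=${1:-/proc}
    prefix=${2:-/dev/snd/}
    for fd in "$proc_root"/[0-9]*/fd/*; do
        # Unreadable or vanished descriptors are skipped silently.
        target=$(readlink "$fd" 2>/dev/null) || continue
        case $target in
            "$prefix"*)
                pid=${fd#"$proc_root"/}   # strip the proc root
                pid=${pid%%/*}            # keep only the PID component
                echo "$pid $target"
                ;;
        esac
    done
}
```

Running `snd_holders` as root before the unload prints one line per open descriptor; empty output suggests no process has a sound device node open. A different PREFIX could be passed to look for holders of other driver files.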
I literally just validated this, and indeed the fix referenced (which, it should be noted, was merged before the bug report; my image was about two weeks stale) fixes the DSP power management and unblocks this. I can reload successfully now. But this is a separate issue. The proximate cause is the DSP hang due to a kernel bug, but it could have been anything. You can imagine the DSP deliberately doing this (checking for an IDC handling a comp_free for a DP component) and then just ... To be clear: this isn't currently a ChromeOS recovery method, but it might be. Fixing it isn't high priority, but we should at least get to some kind of affirmative analysis that says this is benign (i.e. that the only symptom is a hung rmmod and that there isn't a crash bug in there somewhere due to the removed dependencies).
Ack, that scenario should work. In fact, I can confirm that we regularly have cases where the DSP panics in a CI run and the Linux kernel does recover. We have some debug options that interfere with this a bit (e.g. with CONFIG_SND_SOC_SOF_DEBUG_RETAIN_DSP_CONTEXT, the DSP is left powered on after a crash; you can remove the SOF kernel modules, but a runtime PM ref is leaked on purpose, so runtime suspend will no longer work even if you reload). So in short, if this still happens, this is a valid bug and affects the SOF CI as well.
@andyross @kv2019i FYI, we listed recovery as a needed capability back in 2018 (#452) and again in 2020 (#1675). It's been done before on legacy Intel drivers, but we don't have a signaling mechanism (heartbeat or something) at the firmware level, nor detection/recovery on the host side. And we'd also need to signal a reset to userspace. I guess once we have the IMR context save we'll probably need something anyway; for now we reset the state when going back to D0, but that will no longer be true with MTL+.
@andyross should we close this issue?
no information, closing |
After seeing an IPC timeout from the firmware, the snd_sof_pci_intel_mtl module hangs trying to unload, leaving the kernel in an unrecoverable state that requires a reboot to resume audio. (See #8638 for an easy recipe for causing a timeout.)
This is the script I cooked up to get the full module stack reloaded with correct dependency ordering (at least in the kernel I'm using). It runs fine when not in an error state, but after the IPC failure it only gets as far as the MTL module and then hangs.
#!/bin/sh
rmmod snd_soc_sof_rt5682
rmmod snd_soc_rt5645
rmmod snd_soc_hdac_hdmi
rmmod snd_soc_intel_hda_dsp_common
rmmod snd_soc_intel_sof_maxim_common
rmmod snd_soc_intel_sof_realtek_common
rmmod snd_soc_intel_sof_ssp_common
rmmod snd_soc_rt5682
rmmod snd_sof_probes
rmmod snd_soc_rl6231
rmmod snd_hda_codec_hdmi
rmmod snd_soc_dmic
rmmod snd_sof_pci_intel_mtl
rmmod snd_sof_intel_hda_common
rmmod snd_sof_intel_hda
rmmod soundwire_intel
rmmod soundwire_generic_allocation
rmmod snd_sof_intel_hda_mlink
rmmod soundwire_cadence
rmmod snd_sof_pci
rmmod snd_sof_xtensa_dsp
rmmod snd_soc_hdac_hda
rmmod snd_soc_acpi_intel_match
rmmod snd_soc_acpi
rmmod snd_hda_ext_core
rmmod snd_sof
rmmod snd_sof_utils
rmmod soundwire_bus
rmmod snd_intel_dspcfg
rmmod snd_intel_sdw_acpi
rmmod snd_hda_codec
rmmod snd_hwdep
rmmod snd_hda_core
rmmod snd_soc_rt5682s
rmmod snd_soc_max98357a
modprobe snd_soc_rt5682s
modprobe snd_soc_hdac_hdmi
modprobe snd_soc_max98357a
modprobe snd_sof_pci_intel_mtl
The immediate impact to me is just debugging speed (it's really annoying to wait for a reboot). But in general this kind of "module with no dependencies won't unload" issue is accompanied by more serious things like dangling pointers or memory leaks. Needs attention at reasonably high priority.
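As an aside on the hand-ordered script above: the ordering can in principle be derived from the fourth field of /proc/modules, which lists the modules currently using each module. The following is a rough sketch under stated assumptions, not the method used by the sof-test kmod scripts: the function name `unload_order` is hypothetical, and a module is treated as blocked for as long as any listed user has not been printed yet, which includes users missing from the input and a `[permanent]` marker.

```shell
#!/bin/sh
# unload_order (hypothetical helper, illustration only): read
# /proc/modules-style lines on stdin and print module names so that each
# module appears only after every module listed as using it.
# Exits non-zero if some modules stay blocked (cycle, missing user, ...).
unload_order() {
    awk '
    {
        name[NR] = $1
        # Field 4 is a comma-separated list of users, or "-" if none.
        users[$1] = ($4 == "-") ? "" : $4
    }
    END {
        remaining = NR
        while (remaining > 0) {
            progress = 0
            for (i = 1; i <= NR; i++) {
                m = name[i]
                if (m in done) continue
                blocked = 0
                n = split(users[m], u, ",")
                for (j = 1; j <= n; j++)
                    if (u[j] != "" && !(u[j] in done)) blocked = 1
                if (!blocked) {
                    print m
                    done[m] = 1
                    remaining--
                    progress = 1
                }
            }
            if (!progress) exit 1
        }
    }'
}
```

One possible use is `grep -E '^(snd|soundwire)' /proc/modules | unload_order`, feeding the result to rmmod one name at a time. A non-zero exit then means some of the selected modules are still used by something outside the selection, which is itself a useful hint before attempting the unload.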