Firmware recovery #1675
Comments
ACK. This can first be tied into the IPC timeout handler code. We already have a developer Kconfig option that prevents D3 upon FW crash (via IPC timeout), so the default action would likely need to be recovery when this Kconfig option is not enabled.
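The policy described in this comment can be sketched as a small decision function. This is an illustration only, not the driver's code: `TimeoutAction`, `action_on_ipc_timeout` and the `debug_retain_context` flag are hypothetical names, with the flag standing in for the developer Kconfig option mentioned above.

```python
from enum import Enum, auto


class TimeoutAction(Enum):
    KEEP_DSP_STATE = auto()  # leave the DSP untouched (no D3) so it can be debugged
    RECOVER = auto()         # reinitialize the DSP and carry on


def action_on_ipc_timeout(debug_retain_context: bool) -> TimeoutAction:
    """Pick what to do when an IPC times out.

    debug_retain_context is a hypothetical stand-in for the developer
    Kconfig option; recovery is the default when it is not enabled.
    """
    if debug_retain_context:
        return TimeoutAction.KEEP_DSP_STATE
    return TimeoutAction.RECOVER
```

Keeping the debug behaviour behind an explicit flag means a developer can still inspect a crashed DSP, while ordinary builds would fall through to recovery.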
Right, this can work as a kernel-side detection only. But I was thinking here of an explicit watchdog or something that shows a sign of activity on the firmware side.
@plbossart This is a duplicate of #452
@ujfalusi the Skylake driver implemented a recovery mechanism that was used on ApolloLake, but I am not sure if that part was ever upstreamed. @mwasko would you happen to recall?
@plbossart, what has worked for me in the past is to call snd_pcm_stop_xrun(struct snd_pcm_substream *substream) to escalate the event up to the application. I need to take a look at the options, but instant detection of a firmware crash can only be done if the crashed firmware gives us an interrupt/event saying that it crashed, which I don't think it is going to do.
In the Skylake driver we implemented FW recovery for specific ApolloLake releases, but I don't think it was upstreamed to the Linux kernel. However, the recovery itself was not based on a watchdog-like mechanism but on critical errors such as IPC timeouts, or FW errors such as memory issues during FW boot.
@mwasko, thanks for the details. IPC timeout is a good indication of a firmware error. With the additional watchdog (when it applies) we can also catch crashes which happen without much IPC traffic, like audio operation when we just receive the periodic elapsed events. I'll keep these in mind.
@ujfalusi And to further complicate things, SOF supports SNDRV_PCM_HW_PARAMS_NO_PERIOD_WAKEUP, so you might not get any regular IPCs or period-elapsed IRQs. If a DSP stops in this configuration, you can detect this at XRUN, but it's not very straightforward -- some hw_ptr liveness check could work, but it needs to differentiate DSP stalls from application-side errors. I believe distinct FW failure events (a DSP panic or an error in response to an IPC) are a good starting point.
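One way to picture the hw_ptr check mentioned in this comment is the minimal sketch below. It is a user-space model, not kernel code: `HwPtrLivenessCheck`, `stall_timeout_s` and the `running` flag are hypothetical. It assumes the caller passes `running=True` only while the stream is in its running state, so that a stream already stopped by an application-side underrun is not counted as a DSP stall. When and how often to sample is left open.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HwPtrLivenessCheck:
    """Flags a stall when hw_ptr stops advancing on a running stream."""

    stall_timeout_s: float
    _last_ptr: Optional[int] = None
    _last_change_s: float = 0.0

    def sample(self, now_s: float, hw_ptr: int, running: bool) -> bool:
        """Record one sample; return True if the DSP looks stalled."""
        if not running or self._last_ptr is None or hw_ptr != self._last_ptr:
            # Stream not running, first sample, or the pointer moved:
            # restart the stall timer from this sample.
            self._last_ptr = hw_ptr if running else None
            self._last_change_s = now_s
            return False
        # Pointer unchanged on a running stream: stalled once the timeout expires.
        return (now_s - self._last_change_s) >= self.stall_timeout_s
```

A positive result here is only a hint that the DSP stopped moving data; treating it as a firmware failure would still need confirmation from a distinct failure event.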
@kv2019i, an IPC timeout can be used to check NO_PERIOD_WAKEUP streams from a periodic timer (read position, status, heartbeat, something small and trivial), but it would defeat the whole purpose of NO_PERIOD_WAKEUP. I would leave this for now.
This suspend/resume failure below did not just TIMEOUT; for a change, the system stayed alive long enough to send logs. What I found interesting and relevant is:
So I suspect the kernel successfully rebooted the DSP but didn't properly clean up something in its internal state.
EDIT: same again in https://sof-ci.01.org/sofpr/PR8743/build1905/devicetest/index.html?model=TGLU_RVP_NOCODEC-ipc4&testcase=check-suspend-resume-with-playback / thesofproject/sof#8743
We really need a watchdog-like mechanism where the host driver can detect that the DSP is no longer responding, recover by reinitializing the hardware, and notify apps.
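The detect, recover and notify sequence requested here can be modelled roughly as follows. This is a sketch under stated assumptions, not an existing SOF interface: `DspWatchdog`, `heartbeat`, `check` and the two callbacks are hypothetical names. `recover` stands for reinitializing the hardware, and `notify` for whatever escalation reaches applications (for instance something along the lines of the `snd_pcm_stop_xrun()` call mentioned in the comments).

```python
from typing import Callable


class DspWatchdog:
    """Host-side watchdog: trips when no firmware activity is seen in time."""

    def __init__(self, timeout_s: float,
                 recover: Callable[[], None],
                 notify: Callable[[], None]) -> None:
        self.timeout_s = timeout_s
        self.recover = recover      # hypothetical: reinitialize the DSP
        self.notify = notify        # hypothetical: tell applications about it
        self.last_seen_s = 0.0
        self.tripped = False

    def heartbeat(self, now_s: float) -> None:
        """Call on any sign of firmware activity (IPC reply, IRQ, ...)."""
        self.last_seen_s = now_s
        self.tripped = False

    def check(self, now_s: float) -> bool:
        """Call periodically; returns True when recovery was triggered."""
        if self.tripped or (now_s - self.last_seen_s) < self.timeout_s:
            return False
        self.tripped = True         # avoid re-triggering until activity resumes
        self.recover()
        self.notify()
        return True
```

The `tripped` latch keeps a dead DSP from causing repeated recovery attempts on every check; a new heartbeat re-arms the watchdog.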