-
Notifications
You must be signed in to change notification settings - Fork 133
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Driver support for firmware recovery #452
Comments
extended work of thesofproject/sof#710? |
On 12/17/18 5:29 PM, Michael Fu wrote:
extended work of thesofproject/sof#710
<thesofproject/sof#710>?
No, different problem. recovery on firmware stuck means reset and
redownload. recovery on underflow means closing and reopening the PCM
device.
|
@plbossart looking into the hda_dsp_core_reset_power_down() method in hds-dsp.c, we seem to be placing the core in reset before powering down and when the firmware is unresponsive, this also seems to fail. This is required when we want to shutdown gracefully. In the case of panic, would it be acceptable to just power down the cores instead of placing them in reset? |
On 12/21/18 1:14 PM, Ranjani Sridharan wrote:
@plbossart <https://github.com/plbossart> looking into the
hda_dsp_core_reset_power_down() method in hds-dsp.c. We seem to be
placing the core in reset before powering down and when the firmware
is unresponsive, this also seems to fail. This is required when we
want to shutdown gracefully.
In the case of panic, would it be acceptable to just power down the
cores instead of placing them in reset?
Sorry, no idea. The current behavior is I believe tracking what was
implemented for the Skylake driver, not sure though if it was based on
experiments or guidance from hardware.
… |
@plbossart @ranj063 This now brings some difficulty in debug as all the logs and dumped exception info is gone with that suspend/resume. As for the recovery mode, I think we may need to add flag or debugfs controller to disable the FW recovery for DEBUG usage. |
On 2/10/19 9:01 AM, xiulipan wrote:
@plbossart <https://github.com/plbossart> @ranj063
<https://github.com/ranj063>
Actually now when a panic happened. If runtime_pm is not disabled, it
will automatically recovery the DSP by runtime_suspend and runtime_resume.
This now brings some difficulty in debug as all the logs and dumped
exception info is gone with that suspend/resume.
As for the recovery mode, I think we may need to add flag or debugfs
controller to disable the FW recovery for DEBUG usage.
Panic is one aspect, but the more critical part is timeouts when the DSP
gets stuck or irresponsive. You are correct that when recovery happens
we should keep a trace of what happened before the recovery, otherwise
it's just awfully hard to test.
|
@kv2019i , would you be able to take this one? thanks! |
@wenqingfu wrote:
Ack. I'm currently looking at tracing issues during suspend and runtime supsned (old issue #248 |
Hello, I now pushed initial patchset to address this issue (PR841). I ended up taking a slightly different path than what you've discussed earlier, so please check the code and see whether this is on the right track or not. |
rcu_read_lock is needed to protect access to psock inside sock_map_unref when tearing down the map. However, we can't afford to sleep in lock_sock while in RCU read-side critical section. Grab the RCU lock only after we have locked the socket. This fixes RCU warnings triggerable on a VM with 1 vCPU when free'ing a sockmap/sockhash that contains at least one socket: | ============================= | WARNING: suspicious RCU usage | 5.5.0-04005-g8fc91b972b73 #450 Not tainted | ----------------------------- | include/linux/rcupdate.h:272 Illegal context switch in RCU read-side critical section! | | other info that might help us debug this: | | | rcu_scheduler_active = 2, debug_locks = 1 | 4 locks held by kworker/0:1/62: | #0: ffff88813b019748 ((wq_completion)events){+.+.}, at: process_one_work+0x1d7/0x5e0 | #1: ffffc900000abe50 ((work_completion)(&map->work)){+.+.}, at: process_one_work+0x1d7/0x5e0 | #2: ffffffff82065d20 (rcu_read_lock){....}, at: sock_map_free+0x5/0x170 | #3: ffff8881368c5df8 (&stab->lock){+...}, at: sock_map_free+0x64/0x170 | | stack backtrace: | CPU: 0 PID: 62 Comm: kworker/0:1 Not tainted 5.5.0-04005-g8fc91b972b73 #450 | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014 | Workqueue: events bpf_map_free_deferred | Call Trace: | dump_stack+0x71/0xa0 | ___might_sleep+0x105/0x190 | lock_sock_nested+0x28/0x90 | sock_map_free+0x95/0x170 | bpf_map_free_deferred+0x58/0x80 | process_one_work+0x260/0x5e0 | worker_thread+0x4d/0x3e0 | kthread+0x108/0x140 | ? process_one_work+0x5e0/0x5e0 | ? kthread_park+0x90/0x90 | ret_from_fork+0x3a/0x50 | ============================= | WARNING: suspicious RCU usage | 5.5.0-04005-g8fc91b972b73-dirty #452 Not tainted | ----------------------------- | include/linux/rcupdate.h:272 Illegal context switch in RCU read-side critical section! | | other info that might help us debug this: | | | rcu_scheduler_active = 2, debug_locks = 1 | 4 locks held by kworker/0:1/62: | #0: ffff88813b019748 ((wq_completion)events){+.+.}, at: process_one_work+0x1d7/0x5e0 | #1: ffffc900000abe50 ((work_completion)(&map->work)){+.+.}, at: process_one_work+0x1d7/0x5e0 | #2: ffffffff82065d20 (rcu_read_lock){....}, at: sock_hash_free+0x5/0x1d0 | #3: ffff888139966e00 (&htab->buckets[i].lock){+...}, at: sock_hash_free+0x92/0x1d0 | | stack backtrace: | CPU: 0 PID: 62 Comm: kworker/0:1 Not tainted 5.5.0-04005-g8fc91b972b73-dirty #452 | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014 | Workqueue: events bpf_map_free_deferred | Call Trace: | dump_stack+0x71/0xa0 | ___might_sleep+0x105/0x190 | lock_sock_nested+0x28/0x90 | sock_hash_free+0xec/0x1d0 | bpf_map_free_deferred+0x58/0x80 | process_one_work+0x260/0x5e0 | worker_thread+0x4d/0x3e0 | kthread+0x108/0x140 | ? process_one_work+0x5e0/0x5e0 | ? kthread_park+0x90/0x90 | ret_from_fork+0x3a/0x50 Fixes: 7e81a35 ("bpf: Sockmap, ensure sock lock held during tear down") Signed-off-by: Jakub Sitnicki <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: John Fastabend <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
Fixed following WARNINGs: for thesofproject#355, removed this unnecessary log print. for thesofproject#364, added a variable to "struct es8316_priv", and tried to let function es8316_enable_jack_detect() align to mainline code. for thesofproject#452, removed it since we did not verified the patch on other platform. WARNING: Unnecessary ftrace-like logging - prefer using ftrace thesofproject#355: FILE: sound/soc/codecs/es8316.c:751: + dev_dbg(comp->dev, "Exit %s\n", __func__); ERROR: do not initialise statics to 0 thesofproject#364: FILE: sound/soc/codecs/es8316.c:760: + static int initial = 0; WARNING: DT compatible string "everest,es8336" appears un-documented -- check ./Documentation/devicetree/bindings/ thesofproject#452: FILE: sound/soc/codecs/es8316.c:932: + { .compatible = "everest,es8336", }, Signed-off-by: Huajun Li <[email protected]>
duplicate of #1675 |
When the firmware gets stuck or unresponsive (IPC timeouts), the driver just bails out and the only way to recover is to reboot. That's not acceptable for a number of products.
The SOF driver needs to implement a recovery mechanism where it detects the firmware has gone off tracks (e.g. with a 5s watchdog), and restores the configuration (hardware reset, firmware download, pipelines and controls restored). We already have this sequence for resume so it shouldn't be too hard.
Note that this request is not asking for seamless operation. Applications will likely time out and likely stop, so collaboration with userspace is likely required to reenable playback/capture. The ask is really to avoid a reboot and restart cleanly.
The text was updated successfully, but these errors were encountered: