-
Notifications
You must be signed in to change notification settings - Fork 132
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] FW reported error: 104 - Other failure of module instance initialization request #4832
Comments
Found in ADLP_SKU0B00_SDCA also from today's daily test. Intel internal daily test link: |
Not reproduced for 3 days |
This issue happened again: |
happen again
planresultdetail/31496?model=MTLP_RVP_SDW-ace1_0-ipc4&testcase=check-signal-stop-start-playback-50 |
We haven't observed this issue in v2.7 stress test. |
Tried a few days test but can't reproduce it |
@fredoh9 any recent sightings at your end ? |
I haven't seen the same issue for a week+. |
ok, pls close if not seen again by end of week. |
Found today again with MTLP_SDW_AIOC Intel Internal result: planresultdetail/32587?model=MTLP_SDW_AIOC-ace1_0-ipc4&testcase=check-suspend-resume-with-playback-5 |
Also reproduced today Oct 3rd in https://sof-ci.01.org/sofpr/PR8273/build13627/devicetest/index.html?model=MTLP_RVP_SDW&testcase=check-signal-stop-start-playback-10 |
Reproduced today. Intel internal daily test result: |
@fredoh9 there's nothing in the mtrace that can be traced to the 104 error code so it's really hard to provide feedback. |
No reproductions in CI. Closing this bug. |
Observed this issue again on MTLP_SDW_AIOC platform. Test ID:34965. dmesg
|
P2 so not gating SOF2.8, pushing out to v2.9. |
all the error happened with dai copier |
It happened again today. Intel internal test link: |
Reproduced on ww51.3 on MTLP_SDW_AIOC, report ID 36158. |
@RanderWang For LNL, we don't set dai_index. Is it required for FW? @plbossart Any thought? |
I just checked with @RanderWang. The issue is seen on TGL, not LNL. So, need to look deeper into the issue. @plbossart |
@bardliao @RanderWang I have no idea what this index of 7 refers to, I also don't know why this would be specific to TGL, if there was really an issue it would be seen for MTL as well? |
It was also reproduced on MTL on above comments like #4832 |
The reason is a little tricky. Headphone and Deepbuffer share the same BE (dai-copier), when Headphone stream is being closed and Deepbuffer is being set up, the error will happen. The node_id is set up by [ 595.659311] kernel: sof_ipc4_dai_config+0x4b/0x170 [snd_sof]
[ 595.659348] kernel: hda_dai_config+0x62/0xa0 [snd_sof_intel_hda_common]
[ 595.659381] kernel: sdw_params_stream+0x51/0x70 [snd_sof_intel_hda_generic]
[ 595.659394] kernel: intel_hw_params+0x128/0x250 [soundwire_intel]
[ 595.659414] kernel: snd_soc_dai_hw_params+0x39/0xa0 [snd_soc_core]
[ 595.659461] kernel: __soc_pcm_hw_params+0x553/0x720 [snd_soc_core]
[ 595.659549] kernel: dpcm_be_dai_hw_params+0x270/0x3f0 [snd_soc_core]
if ((be->dpcm[stream].state != SND_SOC_DPCM_STATE_OPEN) &&
(be->dpcm[stream].state != SND_SOC_DPCM_STATE_HW_PARAMS) &&
(be->dpcm[stream].state != SND_SOC_DPCM_STATE_HW_FREE))
continue;
[ 595.659638] kernel: dpcm_fe_dai_hw_params+0xd0/0x220 [snd_soc_core]
[ 595.659675] kernel: snd_pcm_hw_params+0x3ba/0x6a0 [snd_pcm] (1) HPD (headphone stream) is being closed and Deepbuffer is being setup |
Why we can't rely on use_count to share the sof widget between HPD & Deepbuffer stream ? Solution: set up widget list in pcm_open function and add use_count here and initialize widget in hw_param function |
No this method is not feasible since FW will free shared widgets for the use count in fw is also set up by hw_param function |
why clear node id for alh ? According to @bardliao the node_id is fixed for a special BE, so we don't need to clear since it will be set up with the same id again. static void sof_ipc4_unprepare_copier_module(struct snd_sof_widget *swidget) {
if (ipc4_copier->dai_type == SOF_DAI_INTEL_ALH) {
......
/* clear the node ID */
copier_data->gtw_cfg.node_id &= ~SOF_IPC4_NODE_INDEX_MASK;
}
} |
To be clear, each BE has its own unique intel_alh_id aka node_index based on the link number and pdi number. So that node_id will not change for a specific BE. For example, the intel_alh_id will be always 0x2 for sdw0 pin2. |
Still seeing this almost every day |
@RanderWang Can you provide more traces and specifically log transitions of the variable "be->dpcm[stream].users" and which stages this variable is updated (startup, hw_params, prepare, hw_free, shutdown). In theory this refcount is used precisely to prevent transitions on shared BEs, e.g. /* do not free hw if this BE is used by other FE */
if (be->dpcm[stream].users > 1)
continue; My take is that there is probably something not quite right in soc-pcm.c, if there is a race condition we should fix the root cause. Changing how the node_id is cleared seems a bit like a hack to me. |
Sure, it is also a hack to me. What I think is to set it by topology or rework our SOF but I have other high priority tasks to do. |
@plbossart the problem is not be free. The node_id is set by BE component( SDW dai). When the first stream A is free, the FE dai widget is free by SOF, so the node_id is cleared, at this time the BE state is SND_SOC_DPCM_STATE_STOP. And the be_hw_params for second stream B want to call sdw_hw_params to set node_id, but it found the state is SND_SOC_DPCM_STATE_STOP, it will skip this BE hw_params, so the node_id will not be set. int dpcm_be_dai_hw_params(struct snd_soc_pcm_runtime *fe, int stream)
{
......
for_each_dpcm_be(fe, stream, dpcm) {
.....
if ((be->dpcm[stream].state != SND_SOC_DPCM_STATE_OPEN) &&
(be->dpcm[stream].state != SND_SOC_DPCM_STATE_HW_PARAMS) &&
(be->dpcm[stream].state != SND_SOC_DPCM_STATE_HW_FREE))
continue;
}
} The bug is very random and I don't reproduce it today. |
My plan was to keep FE dai widget alive in SOF since it was used by two steams, but we need to do much work to do it |
I must admit I don't recall why we only added locking for the trigger cases. I don't know how we can protect the state of a BE which can be modified by separate FEs. Using the user counts and states is by construction racy, I can't think of a simple fix, unless I am gravely mistaken it looks like all the soc-pcm.c code needs to be revisited. The path of least resistance for a quick fix would be to avoid clearing stuff and keeping the node_id constant for shared BEs. |
@tiwai shares my analysis that the Linux BE locking isn't really up-to-snuff, we would need to come-up with a better mechanism to make sure FE transitions don't end-up corrupting the state of a shared BE. This sort of change is tricky and will likely require multiple weeks and lots of testing before we get it right. |
It looks like this a kernel issue, not firmware. Should be transferred to https://github.com/thesofproject/linux ? |
@marc-hb yes this is a kernel issue and I have a good feeling that the linux PR #4829 would also fix this issue |
Transferring this issue since Github does nice, automagic redirections that don't break URLs. |
There was a permission cleanup recently and I lost write access to https://github.com/thesofproject/linux/issues which is probably why I can't do this transfer. EDIT: fixed. |
I haven't seen this in recent times. I guess some unrelated change just made the race described by Rander above less likely. |
I haven't seen this problem lately. |
closing, let's reopen if that occurs again |
Describe the bug
During check-suspend-resume-with-playback-5.sh test, FW reported error: 104 is occurred.
Intel internal daily test link:
To Reproduce
TPLG=/lib/firmware/intel/sof-ace-tplg/sof-mtl-rt711-4ch.tplg MODEL=MTLP_RVP_SDW SOF_TEST_INTERVAL=5 ~/sof-test/test-case/check-suspend-resume-with-audio.sh -l 5 -m playback
Reproduction Rate
Starting from today. It happened in MTLP_RVP_SDW and MTLP_SDW_AIOC.
Environment
Screenshots or console output
The text was updated successfully, but these errors were encountered: