Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] FW reported error: 104 - Other failure of module instance initialization request #4832

Closed
fredoh9 opened this issue Aug 17, 2023 · 48 comments
Assignees
Labels
ADL Applies to Alder Lake platform bug Something isn't working MTL Applies to Meteor Lake platform. P2 Critical bugs or normal features

Comments

@fredoh9
Copy link
Collaborator

fredoh9 commented Aug 17, 2023

Describe the bug
During check-suspend-resume-with-playback-5.sh test, FW reported error: 104 is occurred.

[ 4410.777465] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-mtl 0000:00:1f.3: ipc tx reply: 0x60000068|0x10036: MOD_INIT_INSTANCE
[ 4410.777468] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: FW reported error: 104 - Other failure of module instance initialization request
[ 4410.777519] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: ipc error for msg 0x40010004|0x10036
[ 4410.777548] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: failed to create module alh-copier.SDW0-Playback.0

Intel internal daily test link:

  • planresultdetail/30296?model=MTLP_RVP_SDW&testcase=check-suspend-resume-with-playback-5
  • planresultdetail/30296?model=MTLP_SDW_AIOC&testcase=check-suspend-resume-with-playback-5

To Reproduce
TPLG=/lib/firmware/intel/sof-ace-tplg/sof-mtl-rt711-4ch.tplg MODEL=MTLP_RVP_SDW SOF_TEST_INTERVAL=5 ~/sof-test/test-case/check-suspend-resume-with-audio.sh -l 5 -m playback

Reproduction Rate
Starting from today. It happened in MTLP_RVP_SDW and MTLP_SDW_AIOC.

Environment

Screenshots or console output

@fredoh9 fredoh9 added bug Something isn't working MTL Applies to Meteor Lake platform. labels Aug 17, 2023
@abonislawski abonislawski added the P2 Critical bugs or normal features label Aug 22, 2023
@fredoh9
Copy link
Collaborator Author

fredoh9 commented Aug 29, 2023

Found in ADLP_SKU0B00_SDCA also from today's daily test.

Intel internal daily test link:
planresultdetail/30853?model=ADLP_SKU0B00_SDCA-cavs2_5-ipc4&testcase=check-suspend-resume-with-playback-5

@fredoh9 fredoh9 added the ADL Applies to Alder Lake platform label Aug 29, 2023
@fredoh9
Copy link
Collaborator Author

fredoh9 commented Sep 1, 2023

Not reproduced for 3 days

@keqiaozhang
Copy link
Collaborator

keqiaozhang commented Sep 5, 2023

This issue happened again:
planresultdetail/31237?model=MTLP_SDW_AIOC-ace1_0-ipc4&testcase=check-suspend-resume-with-playback-5
planresultdetail/31270?model=MTLP_RVP_SDW-ace1_0-ipc4&testcase=check-signal-stop-start-playback-50

@fredoh9
Copy link
Collaborator Author

fredoh9 commented Sep 8, 2023

happen again

[ 1445.894941] kernel: snd_sof:sof_ipc4_widget_setup: sof-audio-pci-intel-mtl 0000:00:1f.3: Create widget alh-copier.SDW0-Playback.0 instance 1 - pipe 1 - core 0
[ 1445.894944] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-mtl 0000:00:1f.3: ipc tx      : 0x40010004|0x10036: MOD_INIT_INSTANCE [data size: 216]
[ 1445.896264] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-mtl 0000:00:1f.3: ipc tx reply: 0x60000068|0x10036: MOD_INIT_INSTANCE
[ 1445.896267] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: FW reported error: 104 - Other failure of module instance initialization request
[ 1445.896280] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: ipc error for msg 0x40010004|0x10036
[ 1445.896287] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: failed to create module alh-copier.SDW0-Playback.0
[ 1445.896293] kernel: snd_sof:sof_widget_free_unlocked: sof-audio-pci-intel-mtl 0000:00:1f.3: widget gain.1.1 freed
[ 1445.896297] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-mtl 0000:00:1f.3: ipc tx      : 0x12010000|0x0: GLB_DELETE_PIPELINE
[ 1445.896673] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-mtl 0000:00:1f.3: ipc tx reply: 0x32000000|0x0: GLB_DELETE_PIPELINE
[ 1445.896683] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-mtl 0000:00:1f.3: ipc tx done : 0x12010000|0x0: GLB_DELETE_PIPELINE
[ 1445.896686] kernel: snd_sof:sof_widget_free_unlocked: sof-audio-pci-intel-mtl 0000:00:1f.3: widget pipeline.1 freed
[ 1445.896688] kernel: snd_sof:sof_widget_free_unlocked: sof-audio-pci-intel-mtl 0000:00:1f.3: widget mixout.1.1 freed
[ 1445.896691] kernel: snd_sof:sof_widget_free_unlocked: sof-audio-pci-intel-mtl 0000:00:1f.3: widget mixin.15.1 freed
[ 1445.896693] kernel: snd_sof:sof_widget_free_unlocked: sof-audio-pci-intel-mtl 0000:00:1f.3: widget gain.15.1 freed
[ 1445.896696] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-mtl 0000:00:1f.3: ipc tx      : 0x12000000|0x0: GLB_DELETE_PIPELINE
[ 1445.897189] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-mtl 0000:00:1f.3: ipc tx reply: 0x32000000|0x0: GLB_DELETE_PIPELINE
[ 1445.897198] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-mtl 0000:00:1f.3: ipc tx done : 0x12000000|0x0: GLB_DELETE_PIPELINE
[ 1445.897201] kernel: snd_sof:sof_widget_free_unlocked: sof-audio-pci-intel-mtl 0000:00:1f.3: widget pipeline.15 freed
[ 1445.897203] kernel: snd_sof:sof_widget_free_unlocked: sof-audio-pci-intel-mtl 0000:00:1f.3: widget host-copier.31.playback freed
[ 1445.897205] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: Failed to set up connected widgets
[ 1445.897210] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: error: failed widget list set up for pcm 31 dir 0
[ 1445.897214] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: ASoC: error at snd_soc_pcm_component_hw_params on 0000:00:1f.3: -22

planresultdetail/31496?model=MTLP_RVP_SDW-ace1_0-ipc4&testcase=check-signal-stop-start-playback-50

@mengdonglin
Copy link
Collaborator

We haven't observed this issue in v2.7 stress test.

@RanderWang
Copy link

Tried a few days test but can't reproduce it

@lgirdwood
Copy link
Member

@fredoh9 any recent sightings at your end ?

@fredoh9
Copy link
Collaborator Author

fredoh9 commented Sep 29, 2023

I haven't seen the same issue for a week+.

@lgirdwood
Copy link
Member

I haven't seen the same issue for a week+.

ok, pls close if not seen again by end of week.

@fredoh9
Copy link
Collaborator Author

fredoh9 commented Oct 2, 2023

Found today again with MTLP_SDW_AIOC

Intel Internal result: planresultdetail/32587?model=MTLP_SDW_AIOC-ace1_0-ipc4&testcase=check-suspend-resume-with-playback-5

@marc-hb
Copy link
Collaborator

marc-hb commented Oct 3, 2023

@fredoh9
Copy link
Collaborator Author

fredoh9 commented Nov 1, 2023

Reproduced today.

Intel internal daily test result:
planresultdetail/34066?model=MTLP_SDW_AIOC&testcase=check-suspend-resume-with-playback-5

@plbossart
Copy link
Member

@fredoh9 there's nothing in the mtrace that can be traced to the 104 error code so it's really hard to provide feedback.

@keqiaozhang
Copy link
Collaborator

No reproductions in CI. Closing this bug.

@keqiaozhang
Copy link
Collaborator

Observed this issue again on MTLP_SDW_AIOC platform. Test ID:34965.

dmesg

[ 6026.369056] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-mtl 0000:00:1f.3: ipc tx      : 0x40010004|0x10036: MOD_INIT_INSTANCE [data size: 216]
[ 6026.371202] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-mtl 0000:00:1f.3: ipc tx reply: 0x60000068|0x10036: MOD_INIT_INSTANCE
[ 6026.371206] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: FW reported error: 104 - Other failure of module instance initialization request
[ 6026.371223] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: ipc error for msg 0x40010004|0x10036
[ 6026.371226] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: failed to create module alh-copier.SDW0-Playback.0
[ 6026.371231] kernel: snd_sof:sof_widget_free_unlocked: sof-audio-pci-intel-mtl 0000:00:1f.3: widget gain.1.1 freed
[ 6026.371235] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-mtl 0000:00:1f.3: ipc tx      : 0x12010000|0x0: GLB_DELETE_PIPELINE
[ 6026.371619] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-mtl 0000:00:1f.3: ipc tx reply: 0x32000000|0x0: GLB_DELETE_PIPELINE
[ 6026.371640] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-mtl 0000:00:1f.3: ipc tx done : 0x12010000|0x0: GLB_DELETE_PIPELINE
[ 6026.371646] kernel: snd_sof:sof_widget_free_unlocked: sof-audio-pci-intel-mtl 0000:00:1f.3: widget pipeline.1 freed
[ 6026.371650] kernel: snd_sof:sof_widget_free_unlocked: sof-audio-pci-intel-mtl 0000:00:1f.3: widget mixout.1.1 freed
[ 6026.371655] kernel: snd_sof:sof_widget_free_unlocked: sof-audio-pci-intel-mtl 0000:00:1f.3: widget mixin.15.1 freed
[ 6026.371658] kernel: snd_sof:sof_widget_free_unlocked: sof-audio-pci-intel-mtl 0000:00:1f.3: widget gain.15.1 freed
[ 6026.371661] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-mtl 0000:00:1f.3: ipc tx      : 0x12000000|0x0: GLB_DELETE_PIPELINE
[ 6026.372035] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-mtl 0000:00:1f.3: ipc tx reply: 0x32000000|0x0: GLB_DELETE_PIPELINE
[ 6026.372049] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-mtl 0000:00:1f.3: ipc tx done : 0x12000000|0x0: GLB_DELETE_PIPELINE
[ 6026.372051] kernel: snd_sof:sof_widget_free_unlocked: sof-audio-pci-intel-mtl 0000:00:1f.3: widget pipeline.15 freed
[ 6026.372053] kernel: snd_sof:sof_widget_free_unlocked: sof-audio-pci-intel-mtl 0000:00:1f.3: widget host-copier.31.playback freed
[ 6026.372055] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: Failed to set up connected widgets
[ 6026.372059] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: error: failed widget list set up for pcm 31 dir 0
[ 6026.372062] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: ASoC: error at snd_soc_pcm_component_hw_params on 0000:00:1f.3: -22

@kv2019i
Copy link
Collaborator

kv2019i commented Dec 7, 2023

P2 so not gating SOF2.8, pushing out to v2.9.

@RanderWang
Copy link

all the error happened with dai copier

@fredoh9
Copy link
Collaborator Author

fredoh9 commented Dec 15, 2023

It happened again today.

Intel internal test link:
planresultdetail/35987?model=MTLP_SDW_AIOC&testcase=check-suspend-resume-with-playback-5

@mengdonglin
Copy link
Collaborator

Reproduced on ww51.3 on MTLP_SDW_AIOC, report ID 36158.
Failed case: check-signal-stop-start-playback-50

@bardliao
Copy link
Collaborator

@RanderWang For LNL, we don't set dai_index. Is it required for FW? @plbossart Any thought?

@bardliao
Copy link
Collaborator

@RanderWang For LNL, we don't set dai_index. Is it required for FW? @plbossart Any thought?

I just checked with @RanderWang. The issue is seen on TGL, not LNL. So, need to look deeper into the issue. @plbossart

@plbossart
Copy link
Member

@bardliao @RanderWang I have no idea what this index of 7 refers to,

I also don't know why this would be specific to TGL, if there was really an issue it would be seen for MTL as well?

@RanderWang
Copy link

@bardliao @RanderWang I have no idea what this index of 7 refers to,

I also don't know why this would be specific to TGL, if there was really an issue it would be seen for MTL as well?

It was also reproduced on MTL on above comments like #4832

@RanderWang
Copy link

RanderWang commented Jan 19, 2024

The reason is a little tricky. Headphone and Deepbuffer share the same BE (dai-copier), when Headphone stream is being closed and Deepbuffer is being set up, the error will happen.

The node_id is set up by

[  595.659311] kernel:  sof_ipc4_dai_config+0x4b/0x170 [snd_sof]
[  595.659348] kernel:  hda_dai_config+0x62/0xa0 [snd_sof_intel_hda_common]
[  595.659381] kernel:  sdw_params_stream+0x51/0x70 [snd_sof_intel_hda_generic]
[  595.659394] kernel:  intel_hw_params+0x128/0x250 [soundwire_intel]
[  595.659414] kernel:  snd_soc_dai_hw_params+0x39/0xa0 [snd_soc_core]
[  595.659461] kernel:  __soc_pcm_hw_params+0x553/0x720 [snd_soc_core]
[  595.659549] kernel:  dpcm_be_dai_hw_params+0x270/0x3f0 [snd_soc_core]

		if ((be->dpcm[stream].state != SND_SOC_DPCM_STATE_OPEN) &&
		    (be->dpcm[stream].state != SND_SOC_DPCM_STATE_HW_PARAMS) &&
		    (be->dpcm[stream].state != SND_SOC_DPCM_STATE_HW_FREE))
			continue;

[  595.659638] kernel:  dpcm_fe_dai_hw_params+0xd0/0x220 [snd_soc_core]
[  595.659675] kernel:  snd_pcm_hw_params+0x3ba/0x6a0 [snd_pcm]

(1) HPD (headphone stream) is being closed and Deepbuffer is being setup
(2) HPD pipeline is free by SOF and also its dai copier (node_id is clear to 0)
(3) when checking BE for Deepbuffer in dpcm_be_dai_hw_params, find the BE is also used by HPD and the state is SND_SOC_DPCM_STATE_STOP, so skip above intel_hw_params-> sdw_params_stream ... > sof_ipc4_dai_config
(4) Deepbuffer stream sends dai-copier to fw (node_id is not set for (3))
(5) HPD BE is set to SND_SOC_DPCM_STATE_HW_FREE

@RanderWang
Copy link

RanderWang commented Jan 19, 2024

Why we can't rely on use_count to share the sof widget between HPD & Deepbuffer stream ?
The problem is that: we set up widget list in hw_params function but Deepbuffer stream is just open and before hw_params, so the use_count can't be increase at this time

Solution: set up widget list in pcm_open function and add use_count here and initialize widget in hw_param function

@RanderWang
Copy link

Why we can't rely on use_count to share the sof widget between HPD & Deepbuffer stream ? The problem is that: we set up widget list in hw_params function but Deepbuffer stream is just open and before hw_params, so the use_count can't be increase at this time

Solution: set up widget list in pcm_open function and add use_count here and initialize widget in hw_param function

No this method is not feasible since FW will free shared widgets for the use count in fw is also set up by hw_param function

@RanderWang
Copy link

why clear node id for alh ? According to @bardliao the node_id is fixed for a special BE, so we don't need to clear since it will be set up with the same id again.

static void sof_ipc4_unprepare_copier_module(struct snd_sof_widget *swidget) {

           if (ipc4_copier->dai_type == SOF_DAI_INTEL_ALH) {
                       ......
			/* clear the node ID */
			copier_data->gtw_cfg.node_id &= ~SOF_IPC4_NODE_INDEX_MASK;
            }
}

@ranj063 do you have any idea ? 5e9a856

@bardliao
Copy link
Collaborator

why clear node id for alh ? According to @bardliao the node_id is fixed for a special BE, so we don't need to clear since it will be set up with the same id again.

To be clear, each BE has its own unique intel_alh_id aka node_index based on the link number and pdi number. So that node_id will not change for a specific BE. For example, the intel_alh_id will be always 0x2 for sdw0 pin2.

@plbossart
Copy link
Member

plbossart commented Jan 29, 2024

@RanderWang Can you provide more traces and specifically log transitions of the variable "be->dpcm[stream].users" and which stages this variable is updated (startup, hw_params, prepare, hw_free, shutdown).

In theory this refcount is used precisely to prevent transitions on shared BEs, e.g.

/* do not free hw if this BE is used by other FE */
		if (be->dpcm[stream].users > 1)
			continue;

My take is that there is probably something not quite right in soc-pcm.c, if there is a race condition we should fix the root cause. Changing how the node_id is cleared seems a bit like a hack to me.

@RanderWang
Copy link

Sure, it is also a hack to me. What I think is to set it by topology or rework our SOF but I have other high priority tasks to do.

@RanderWang
Copy link

@RanderWang Can you provide more traces and specifically log transitions of the variable "be->dpcm[stream].users" and which stages this variable is updated (startup, hw_params, prepare, hw_free, shutdown).

In theory this refcount is used precisely to prevent transitions on shared BEs, e.g.

/* do not free hw if this BE is used by other FE */
		if (be->dpcm[stream].users > 1)
			continue;

My take is that there is probably something not quite right in soc-pcm.c, if there is a race condition we should fix the root cause. Changing how the node_id is cleared seems a bit like a hack to me.

@plbossart the problem is not be free. The node_id is set by BE component( SDW dai). When the first stream A is free, the FE dai widget is free by SOF, so the node_id is cleared, at this time the BE state is SND_SOC_DPCM_STATE_STOP. And the be_hw_params for second stream B want to call sdw_hw_params to set node_id, but it found the state is SND_SOC_DPCM_STATE_STOP, it will skip this BE hw_params, so the node_id will not be set.

int dpcm_be_dai_hw_params(struct snd_soc_pcm_runtime *fe, int stream)
{
               ......
	for_each_dpcm_be(fe, stream, dpcm) {
                .....

		if ((be->dpcm[stream].state != SND_SOC_DPCM_STATE_OPEN) &&
		    (be->dpcm[stream].state != SND_SOC_DPCM_STATE_HW_PARAMS) &&
		    (be->dpcm[stream].state != SND_SOC_DPCM_STATE_HW_FREE))
			continue;
        }
}

The bug is very random and I don't reproduce it today.

@RanderWang
Copy link

My plan was to keep FE dai widget alive in SOF since it was used by two steams, but we need to do much work to do it

@plbossart
Copy link
Member

I must admit I don't recall why we only added locking for the trigger cases. I don't know how we can protect the state of a BE which can be modified by separate FEs. Using the user counts and states is by construction racy, I can't think of a simple fix, unless I am gravely mistaken it looks like all the soc-pcm.c code needs to be revisited.

The path of least resistance for a quick fix would be to avoid clearing stuff and keeping the node_id constant for shared BEs.

@plbossart
Copy link
Member

@tiwai shares my analysis that the Linux BE locking isn't really up-to-snuff, we would need to come-up with a better mechanism to make sure FE transitions don't end-up corrupting the state of a shared BE.

This sort of change is tricky and will likely require multiple weeks and lots of testing before we get it right.

@marc-hb
Copy link
Collaborator

marc-hb commented Jan 31, 2024

It looks like this a kernel issue, not firmware. Should be transferred to https://github.com/thesofproject/linux ?

@ranj063
Copy link
Collaborator

ranj063 commented Feb 21, 2024

It looks like this a kernel issue, not firmware. Should be transferred to https://github.com/thesofproject/linux ?

@marc-hb yes this is a kernel issue and I have a good feeling that the linux PR #4829 would also fix this issue

@marc-hb
Copy link
Collaborator

marc-hb commented Feb 21, 2024

Transferring this issue since Github does nice, automagic redirections that don't break URLs.

@marc-hb
Copy link
Collaborator

marc-hb commented Feb 21, 2024

Transferring this issue since Github does nice, automagic redirections that don't break URLs.

There was a permission cleanup recently and I lost write access to https://github.com/thesofproject/linux/issues which is probably why I can't do this transfer.

EDIT: fixed.

@plbossart
Copy link
Member

I've really lost the plot on this issue @marc-hb @fredoh9, do we still see this in any CI/daily tests now?

@marc-hb
Copy link
Collaborator

marc-hb commented Apr 24, 2024

I haven't seen this in recent times. I guess some unrelated change just made the race described by Rander above less likely.

@fredoh9
Copy link
Collaborator Author

fredoh9 commented Apr 25, 2024

I haven't seen this problem lately.
I lost my test result database to confirm.

@plbossart
Copy link
Member

closing, let's reopen if that occurs again

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ADL Applies to Alder Lake platform bug Something isn't working MTL Applies to Meteor Lake platform. P2 Critical bugs or normal features
Projects
None yet
Development

Successfully merging a pull request may close this issue.