

[BUG][MTL] Firmware boot failure due to timeout during suspend/resume stress test (ROM status 0x50000005, ROM error 0x0) #8148

Closed
keqiaozhang opened this issue Sep 5, 2023 · 8 comments
Labels
  • Boot: Firmware boot or code signing related.
  • bug: Something isn't working as expected
  • MTL: Applies to Meteor Lake platform
  • P2: Critical bugs or normal features
  • suspend-resume: Issues observed when doing system suspend and resume

Comments

@keqiaozhang
Collaborator

keqiaozhang commented Sep 5, 2023

Describe the bug
This issue occurred while testing suspend/resume with audio. It happened only once during the stress test.

dmesg

[51369.325401] kernel: snd_sof_intel_hda_common:mtl_dsp_cl_init: sof-audio-pci-intel-mtl 0000:00:1f.3: Primary core power up successful
[51369.325408] kernel: snd_sof_intel_hda_common:mtl_dsp_cl_init: sof-audio-pci-intel-mtl 0000:00:1f.3: FW Poll Status: reg[0x73214]=0x80000000 successful
[51369.325420] kernel: snd_sof_intel_hda_common:mtl_enable_interrupts: sof-audio-pci-intel-mtl 0000:00:1f.3: FW Poll Status: reg[0x1800]=0x41 successful
[51369.325429] kernel: snd_sof_intel_hda_common:mtl_enable_interrupts: sof-audio-pci-intel-mtl 0000:00:1f.3: FW Poll Status: reg[0x1140]=0x1 successful
[51369.337199] kernel: nvme nvme0: 8/0/0 default/read/poll queues
[51371.343143] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: ------------[ DSP dump start ]------------
[51371.343150] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: Firmware boot failure due to timeout
[51371.343153] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: fw_state: SOF_FW_BOOT_IN_PROGRESS (3)
[51371.343197] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: ROM status: 0x0, ROM error: 0x0
[51371.343200] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: ROM debug status: 0x50000005, ROM debug error: 0x0
[51371.343205] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: ROM feature bit enabled
[51371.343207] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: ------------[ DSP dump end ]------------
[51371.343209] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: error: failed to boot DSP firmware after resume -5
[51371.343213] kernel: snd_sof:sof_set_fw_state: sof-audio-pci-intel-mtl 0000:00:1f.3: fw_state change: 3 -> 4
[51371.343216] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: PM: dpm_run_callback(): pci_pm_resume+0x0/0xe0 returns -5
[51371.343230] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: PM: failed to resume async: error -5
[51371.348200] kernel: OOM killer enabled.
[51371.348203] kernel: Restarting tasks ... done.

To Reproduce
~/sof-test/test-case/check-suspend-resume-with-audio.sh -l 100 -m playback

Reproduction Rate
TBD; more tests are needed to confirm it.
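To put a number on the reproduction rate, the dmesg captured from a stress run can be scanned for the failure signature. The following is a minimal Python sketch, not part of sof-test; the helper name scan_dmesg is hypothetical, and the two patterns are copied from the kernel messages quoted above.

```python
import re
from typing import Iterable, List, Tuple

# Signatures copied from the kernel messages quoted in this issue.
BOOT_TIMEOUT = re.compile(r"Firmware boot failure due to timeout")
ROM_DEBUG = re.compile(
    r"ROM debug status: (0x[0-9a-fA-F]+), ROM debug error: (0x[0-9a-fA-F]+)"
)


def scan_dmesg(lines: Iterable[str]) -> Tuple[int, List[Tuple[int, int]]]:
    """Hypothetical helper: count DSP boot timeouts in a dmesg capture and
    collect every (ROM debug status, ROM debug error) pair in the DSP dumps."""
    timeouts = 0
    rom_debug: List[Tuple[int, int]] = []
    for line in lines:
        if BOOT_TIMEOUT.search(line):
            timeouts += 1
        match = ROM_DEBUG.search(line)
        if match:
            # Parse the hex strings, e.g. "0x50000005", into integers.
            rom_debug.append((int(match.group(1), 16), int(match.group(2), 16)))
    return timeouts, rom_debug
```

For example, `with open("dmesg.txt") as f: count, codes = scan_dmesg(f)` returns the number of boot timeouts in that log together with the ROM debug codes reported for them; dividing the count by the number of suspend/resume iterations run gives a rough failure rate.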

Environment

  1. Branch name and commit hash of the 2 repositories: sof (firmware/topology) and linux (kernel driver).
  2. Name of the topology file
    • Topology: {development/sof-mtl-nocodec.tplg}
  3. Name of the platform(s) on which the bug is observed.

dmesg.txt

mtrace.txt

@keqiaozhang keqiaozhang added bug Something isn't working as expected suspend-resume Issues observed when doing system suspend and resume MTL Applies to Meteor Lake platform labels Sep 5, 2023
@mengdonglin mengdonglin added the Boot Firmware boot or code signing related. label Sep 5, 2023
@keqiaozhang
Collaborator Author

keqiaozhang commented Sep 5, 2023

This is a boot failure similar to #7866 (comment).

@mengdonglin
Collaborator

@kv2019i This issue cannot be reproduced with v2.7-rc1 on either MTL or cAVS 2.5 platforms, as confirmed by @keqiaozhang through additional stress tests on MTLP_RVP_NOCODEC and ADLP_RVP_NOCODEC.

v2.7-rc1 disabled IMR context save by default.

@keqiaozhang If we disable IMR context save on main branch, can we reproduce this issue?

@keqiaozhang
Collaborator Author

If we disable IMR context save on main branch, can we reproduce this issue?

Confirmed that this issue cannot be reproduced on the main branch when IMR context save is disabled.

@keqiaozhang keqiaozhang added the P2 Critical bugs or normal features label Sep 18, 2023
@fredoh9
Contributor

fredoh9 commented Oct 5, 2023

Today's daily test hit the same issue on MTLP_RVP_NOCODEC (multicore).

[ 1048.207985] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: ------------[ DSP dump start ]------------
[ 1048.207994] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: Firmware boot failure due to timeout
[ 1048.207997] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: fw_state: SOF_FW_BOOT_IN_PROGRESS (3)
[ 1048.208039] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: ROM status: 0x0, ROM error: 0x0
[ 1048.208042] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: ROM debug status: 0x50000005, ROM debug error: 0x0
[ 1048.208048] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: ROM feature bit enabled
[ 1048.208091] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: ------------[ DSP dump end ]------------
[ 1048.208094] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: error: failed to boot DSP firmware after resume -5
[ 1048.208102] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: PM: dpm_run_callback(): pci_pm_resume+0x0/0xe0 returns -5
[ 1048.208119] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: PM: failed to resume async: error -5
[ 1048.322203] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: ipc4_tx_msg_unlocked: ipc message send for 0x11000005|0x0 failed: -19
[ 1048.322233] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: failed to create module pipeline.1
[ 1048.322252] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: Failed to set up connected widgets
[ 1048.322271] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: error: failed widget list set up for pcm 0 dir 0
[ 1048.322291] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: error: set pcm hw_params after resume
[ 1048.322308] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: ASoC: error at snd_soc_pcm_component_prepare on 0000:00:1f.3: -19
[ 1048.322330] kernel:  Port0: ASoC: error at __soc_pcm_prepare on Port0: -19
[ 1048.322346] kernel:  Port0: ASoC: error at dpcm_fe_dai_prepare on Port0: -19

Internal Intel Daily test link:
planresultdetail/32700?model=MTLP_RVP_NOCODEC-ace1_0-multicore-3cores&testcase=check-suspend-resume-with-playback-5

@mengdonglin mengdonglin changed the title [BUG][MTL] Firmware boot failure due to timeout during suspend/resume stress test [BUG][MTL] Firmware boot failure due to timeout during suspend/resume stress test (ROM status 0x50000005, ROM error 0x0) Oct 22, 2023
@lrudyX

lrudyX commented Nov 24, 2023

@tmleman Please check

@tmleman
Contributor

tmleman commented Nov 24, 2023

@keqiaozhang It looks very similar to #7866 and probably has the same root cause. Why are we tracking it in a separate issue?

@keqiaozhang
Collaborator Author

This issue cannot be reproduced on the main branch. Closing this bug.

@tmleman
Contributor

tmleman commented Jan 19, 2024

@keqiaozhang Is there any update? The latest comment mentions that the problem is not reproducing, but the issue is still open.
