Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG][MTL] S0ix long run failure, Firmware boot failure due to timeout (ROM status: 0x50000005, ROM error: 0x0) #7866

Closed
RDharageswari opened this issue Jun 27, 2023 · 76 comments
Assignees
Labels
Boot Firmware boot or code signing related. bug Something isn't working as expected IMR context save IMR Related to IMR (Isolated Memory Region) on Intel platforms IPC timeout IPC timeout observed MTL Applies to Meteor Lake platform mtl-005 mtl-006 P2 Critical bugs or normal features suspend-resume Issues observed when doing system suspend and resume

Comments

@RDharageswari
Copy link

RDharageswari commented Jun 27, 2023

EDIT: the kernel part of this issue has been transferred to newer:


We see this error reported in multiple scenario:
One was long run S0ix testing.

[  352.971033] sof-audio-pci-intel-mtl 0000:00:1f.3: ipc timed out for 0x47000000|0x0
[  352.979537] sof-audio-pci-intel-mtl 0000:00:1f.3: ------------[ IPC dump start ]------------
[  352.988991] sof-audio-pci-intel-mtl 0000:00:1f.3: Host IPC initiator: 0xc7000000|0x0|0x0, target: 0x67000000|0x0|0x0, ctl: 0x3
[  353.001735] sof-audio-pci-intel-mtl 0000:00:1f.3: ------------[ IPC dump end ]------------
[  353.010979] sof-audio-pci-intel-mtl 0000:00:1f.3: ------------[ DSP dump start ]------------
[  353.020423] sof-audio-pci-intel-mtl 0000:00:1f.3: IPC timeout
[  353.026868] sof-audio-pci-intel-mtl 0000:00:1f.3: fw_state: SOF_FW_BOOT_COMPLETE (7)
[  353.035552] sof-audio-pci-intel-mtl 0000:00:1f.3: ROM status: 0xffffffff, ROM error: 0xffffffff
[  353.045287] sof-audio-pci-intel-mtl 0000:00:1f.3: ROM debug status: 0x50000005, ROM debug error: 0x0
[  353.055497] sof-audio-pci-intel-mtl 0000:00:1f.3: ROM feature bit not enabled
[  353.063477] sof-audio-pci-intel-mtl 0000:00:1f.3: ------------[ DSP dump end ]------------
[  353.072721] sof-audio-pci-intel-mtl 0000:00:1f.3: ctx_save IPC error: -110, proceeding with suspend
[  689.764439] sof-audio-pci-intel-mtl 0000:00:1f.3: GAIN (UUID: 61BCA9A8-18D0-4A18-8E7B-2639219804B7): No CPC match in the firmware file's manifest (ibs/obs: 768/768)
[  692.922594] sof-audio-pci-intel-mtl 0000:00:1f.3: ipc timed out for 0x47000000|0x0
[  692.931113] sof-audio-pci-intel-mtl 0000:00:1f.3: ------------[ IPC dump start ]------------
[  692.940571] sof-audio-pci-intel-mtl 0000:00:1f.3: Host IPC initiator: 0xc7000000|0x0|0x0, target: 0x67000000|0x0|0x0, ctl: 0x3
[  692.953325] sof-audio-pci-intel-mtl 0000:00:1f.3: ------------[ IPC dump end ]------------
[  692.962576] sof-audio-pci-intel-mtl 0000:00:1f.3: ------------[ DSP dump start ]---------

cc:

@RDharageswari RDharageswari added bug Something isn't working as expected MTL Applies to Meteor Lake platform mtl-005 labels Jun 27, 2023
@marc-hb
Copy link
Collaborator

marc-hb commented Jun 27, 2023

@RDharageswari can you please not ignore the entire bug template? It has a number of specific and important fields.

@RDharageswari
Copy link
Author

RDharageswari commented Jun 29, 2023

Hi Marc,
Sure.I am trying to get the exact scenario, hence did not update it right.Will update the entire configuration once available

@lgirdwood lgirdwood added this to the v2.7 milestone Jul 5, 2023
@mengdonglin
Copy link
Collaborator

@RDharageswari Can you still reproduce this issue with mtl-005.0.3 (Hot Fix 3) release?

@mengdonglin mengdonglin added the IPC timeout IPC timeout observed label Jul 31, 2023
@RDharageswari
Copy link
Author

@mengdonglin: No reports of this in the recent days..But @yongzhi1 was able to repro this issue in the PnP set up with mtl-005-hotfix3 release

@macchian
Copy link
Contributor

macchian commented Aug 16, 2023

@mengdonglin , I also found the ipc time out during suspend resume test on customer MTL board.
The issue can be reproduced on fw branch mtl-005-drop-stable and mtl-005-hotfix3 as well.

@yongzhi1 , I saw that your linux PR about page fault
ASoC: SOF: Intel: MTL: catch invalid IRQ IP ptr #4344
Is it relative to this?

...
<3>[  124.809909] sof-audio-pci-intel-mtl 0000:00:1f.3: ipc timed out for 0x47000000|0x0
<3>[  124.818406] sof-audio-pci-intel-mtl 0000:00:1f.3: ------------[ IPC dump start ]------------
<3>[  124.830259] sof-audio-pci-intel-mtl 0000:00:1f.3: Host IPC initiator: 0xc7000000|0x0|0x0, target: 0xe7000000|0x0|0x0, ctl: 0x3
<3>[  124.856211] sof-audio-pci-intel-mtl 0000:00:1f.3: ------------[ IPC dump end ]------------
<3>[  124.874460] sof-audio-pci-intel-mtl 0000:00:1f.3: ------------[ DSP dump start ]------------
<3>[  124.893443] sof-audio-pci-intel-mtl 0000:00:1f.3: IPC timeout
<3>[  124.906717] sof-audio-pci-intel-mtl 0000:00:1f.3: fw_state: SOF_FW_BOOT_COMPLETE (7)
<3>[  124.924700] sof-audio-pci-intel-mtl 0000:00:1f.3: ROM status: 0xffffffff, ROM error: 0xffffffff
<3>[  124.944330] sof-audio-pci-intel-mtl 0000:00:1f.3: ROM debug status: 0x50000005, ROM debug error: 0x0
<3>[  124.966038] sof-audio-pci-intel-mtl 0000:00:1f.3: ROM feature bit not enabled
<3>[  124.982946] sof-audio-pci-intel-mtl 0000:00:1f.3: ------------[ DSP dump end ]------------
<4>[  125.001601] sof-audio-pci-intel-mtl 0000:00:1f.3: ctx_save IPC error: -110, proceeding with suspend
<1>[  125.199974] BUG: unable to handle page fault for address: ffffbffbcb40001f
<1>[  125.207668] #PF: supervisor read access in kernel mode

@macchian
Copy link
Contributor

attach the crash logs.
kernel.20230815.094120.64594.txt

@mengdonglin
Copy link
Collaborator

mengdonglin commented Aug 16, 2023

@macchian @RDharageswari @yongzhi1 This issue seems similar to #7990 and #8028 that happens with mtl-005.0.3 when CPC-based clock selection and IMR context save are both enabled.

Now for main branch, IMR context save is already enabled and PR of CPC-based clock selection is under review #8019

We found on main branch, with both IMR context save and CPC-based clock selection (PR) enabled, we cannot reproduce #7990 and #8028. Please find details and recipe in #7990 (comment)

Maybe you can try this FW recipe of main branch as well?

Or you may try this kernel test PR #8007 with MTL-005.0.3 FW? This kernel PR will make FW always run at highest clock by always sending ZERO CPC values to FW?

@mengdonglin mengdonglin added the P1 Blocker bugs or important features label Aug 16, 2023
@mengdonglin mengdonglin reopened this Aug 16, 2023
@macchian
Copy link
Contributor

We found on main branch, with both IMR context save and CPC-based clock selection (PR) enabled, we cannot reproduce #7990 and #8028. Please find details and recipe in #7990 (comment)

Maybe you can try this FW recipe of main branch as well?
@mengdonglin, thanks for your suggestions.
Yes, I provided a test mtl fw with PRs #8019 on mtl-005-hotfix3 branch.
So far the result is positive! The suspend test has run 600+ times until now.
Afterwards ODM will run more DUTs when other devices are available.

Or you may try this kernel test PR #8007 with MTL-005.0.3 FW? This kernel PR will make FW always run at highest clock by always sending ZERO CPC values to FW?
Just a thought, Is #8007 a desired fix by always sending ZERO CPC values to FW ?

@lrudyX
Copy link

lrudyX commented Aug 18, 2023

@macchian Could you provide some more information how to reproduce the issue?

@macchian
Copy link
Contributor

@macchian Could you provide some more information how to reproduce the issue?

@lrudyX , the suspend_stress_test tool is specific ChromeOS test tool. I am not sure if the CI relevant check-suspend-resume-with-* can isolate the issue or not.

#suspend_stress_test -c 1500 --record_dmesg_dir=/usr/local/agingLogs --suspend_min=15 --suspend_max=20"

@macchian
Copy link
Contributor

macchian commented Aug 22, 2023

We found on main branch, with both IMR context save and CPC-based clock selection (PR) enabled, we cannot reproduce #7990 and #8028. Please find details and recipe in #7990 (comment)
Maybe you can try this FW recipe of main branch as well?
@mengdonglin, thanks for your suggestions.
Yes, I provided a test mtl fw with PRs #8019 on mtl-005-hotfix3 branch.
So far the result is positive! The suspend test has run 600+ times until now.
Afterwards ODM will run more DUTs when other devices are available.

@mengdonglin, from latest report, one DUT reproduce it highly reproducible rate. Attach the logs.

kernel.20230815.205835.87076.0.txt

@mengdonglin
Copy link
Collaborator

@macchian Thank you for testing the main branch! @abonislawski confirmed that main branch will always run at highest clock as it lack rimage update for CPC values as mtl-005.0.3. Can we say that except for the specific DUT, the reproduce rate of this issue is lower with main branch FW than mtl-005.0.3? And what's the approximate reproduction rate for this specific DUT?

@macchian
Copy link
Contributor

@macchian Thank you for testing the main branch! @abonislawski confirmed that main branch will always run at highest clock as it lack rimage update for CPC values as mtl-005.0.3. Can we say that except for the specific DUT, the reproduce rate of this issue is lower with main branch FW than mtl-005.0.3? And what's the approximate reproduction rate for this specific DUT?

@mengdonglin , I'm afraid that 2 DUTs were randomly failure within 20 cycles from suspend_resume. The 3rd DUT will almost certainly appear. This problem can be easily repeated from customer reports.

Do you recommend the kernel test PR #8007 worth to tr yor other PRs recommendation?

@mengdonglin
Copy link
Collaborator

@mengdonglin , I'm afraid that 2 DUTs were randomly failure within 20 cycles from suspend_resume. The 3rd DUT will almost certainly appear. This problem can be easily repeated from customer reports.

Do you recommend the kernel test PR #8007 worth to tr yor other PRs recommendation?
@macchian You needn't try PR #8007 with mtl-005.0.3. Because you've already tried with main branch FW that DSP runs at the highest clock, and the issue can still be reproduced. PR #8007 just make DSP runs at highest clock with mtl-005.0.3

@kv2019i
Copy link
Collaborator

kv2019i commented Aug 30, 2023

@macchian @mengdonglin It is worth testing with #7994 (disables IMR context save)

With the context-save, e.g. a leak in DSP resources will be compounding and lead to errors. So if 7994 has impact to repro rate, this can reveal a lot more information about the bug and help in debug.

@macchian
Copy link
Contributor

@

@macchian @mengdonglin It is worth testing with #7994 (disables IMR context save)

With the context-save, e.g. a leak in DSP resources will be compounding and lead to errors. So if 7994 has impact to repro rate, this can reveal a lot more information about the bug and help in debug.

@kv2019i , I setup one device and it's very high reproduce rate no matter enabled or disabled IMR. Almost every time during 20 cycles, the DSP dump reproduce. I could share the remote access to you if you need something to check.

[ 131.013601] sof-audio-pci-intel-mtl 0000:00:1f.3: ------------[ DSP dump start ]------------
[ 131.023065] sof-audio-pci-intel-mtl 0000:00:1f.3: Firmware boot failure due to timeout
[ 131.031919] sof-audio-pci-intel-mtl 0000:00:1f.3: fw_state: SOF_FW_BOOT_IN_PROGRESS (3)
[ 131.040875] sof-audio-pci-intel-mtl 0000:00:1f.3: ROM status: 0xffffffff, ROM error: 0xffffffff
[ 131.050595] sof-audio-pci-intel-mtl 0000:00:1f.3: ROM debug status: 0xd000001c, ROM debug error: 0x2328
[ 131.061093] sof-audio-pci-intel-mtl 0000:00:1f.3: ROM feature bit not enabled
[ 131.066218] pcieport 0000:00:06.1: pciehp: Slot(8): No link
[ 131.069065] sof-audio-pci-intel-mtl 0000:00:1f.3: ------------[ DSP dump end ]------------

@macchian
Copy link
Contributor

dsp_dump.log

@kv2019i
Copy link
Collaborator

kv2019i commented Aug 31, 2023

Thanks @macchian this now looks a bit different. The DSP fails to boot (versus a IPC timeout before) and there's a ROM debug code:

[  131.023065] sof-audio-pci-intel-mtl 0000:00:1f.3: Firmware boot failure due to timeout
[  131.031919] sof-audio-pci-intel-mtl 0000:00:1f.3: fw_state: SOF_FW_BOOT_IN_PROGRESS (3)
[  131.040875] sof-audio-pci-intel-mtl 0000:00:1f.3: ROM status: 0xffffffff, ROM error: 0xffffffff
[  131.050595] sof-audio-pci-intel-mtl 0000:00:1f.3: ROM debug status: 0xd000001c, ROM debug error: 0x2328
```

@mengdonglin
Copy link
Collaborator

@tmleman Can you help check this failure?

@tmleman
Copy link
Contributor

tmleman commented Nov 24, 2023

While debugging I found one problem that may be causing this issue.

If FW receive SET_DX message (HOST->DSP) before receiving ACK for previously sent message/notification (DSP->HOST), IPC device is in busy state. Zephyr PM during system suspend skip suspension of busy devices. In result, the IPC device is neither put to suspend nor resume during power-flow.

From the host perspective, it manifests itself in the fact that the ROM reports FW is loaded, but we do not receive FW Ready. When you connect via gdb to core 0, FW is in idle state.

@lgirdwood
Copy link
Member

While debugging I found one problem that may be causing this issue.

If FW receive SET_DX message (HOST->DSP) before receiving ACK for previously sent message/notification (DSP->HOST), IPC device is in busy state. Zephyr PM during system suspend skip suspension of busy devices. In result, the IPC device is neither put to suspend nor resume during power-flow.

From the host perspective, it manifests itself in the fact that the ROM reports FW is loaded, but we do not receive FW Ready. When you connect via gdb to core 0, FW is in idle state.

Good work @tmleman ! Do you have a proposal for a fix ?
@kv2019i @ujfalusi fyi

@kv2019i kv2019i modified the milestones: v2.8, v2.9 Nov 24, 2023
@kv2019i
Copy link
Collaborator

kv2019i commented Nov 24, 2023

Context save not enabled in stable-v2.8, moving this to v2.9 (this is gating enabling context-save in mainline).

@lgirdwood
Copy link
Member

@tmleman any update on debug - my feeling is same root cause could also cause thesofproject/linux#4832 @kv2019i fyi.

@tmleman
Copy link
Contributor

tmleman commented Nov 30, 2023

@lgirdwood I haven't been able to reproduce this particular issue in my local environment. I need someone to confirm that the bug found is also the cause of this issue.

I also need to determine how FW should behave in such a situation. The easiest solution is to ignore this last ACK and force the IPC device to sleep while going into the D3 state. Another solution would be to return an error in the SET_DX response, and fix would need to be done in linux driver.

@lgirdwood
Copy link
Member

@plbossart @ujfalusi @RanderWang @ranj063 any comments here ? Is driver able to do this ?

@ranj063
Copy link
Collaborator

ranj063 commented Dec 4, 2023

If FW receive SET_DX message (HOST->DSP) before receiving ACK for previously sent message/notification (DSP->HOST), IPC device is in busy state. Zephyr PM during system suspend skip suspension of busy devices. In result, the IPC device is neither put to suspend nor resume during power-flow.

@tmleman I'm a bit skeptical about the possibility of this happening in the linux driver. Basically, if the DSP has send a message/notification to the host, we hold a spinlock until the ACK has been sent back to the DSP before initializing a new IPC.

@tmleman
Copy link
Contributor

tmleman commented Dec 5, 2023

@lgirdwood I spoke with @mmaka1 about this and we agreed that this pending ACK should not be an obstacle in D3 transition. This is because the device is not actually suspended but reset and from a hardware perspective there is no trace of this ACK. I have pushed fix for this issue to review #8573.

@tmleman
Copy link
Contributor

tmleman commented Dec 5, 2023

I'm a bit skeptical about the possibility of this happening in the linux driver. Basically, if the DSP has send a message/notification to the host, we hold a spinlock until the ACK has been sent back to the DSP before initializing a new IPC.

@ranj063 maybe this is a situation when FW sends notifications at the same time when the HOST sends the SET_DX message? I can prepare a build with my changes and if the problem described in this issue does not reproduce, we will be able to consider it as confirmation of the root-cause. I've already asked @keqiaozhang for this.

@ujfalusi
Copy link
Contributor

ujfalusi commented Dec 5, 2023

@tmleman I'm a bit skeptical about the possibility of this happening in the linux driver. Basically, if the DSP has send a message/notification to the host, we hold a spinlock until the ACK has been sent back to the DSP before initializing a new IPC.

@ranj063, I don't think we hold spinlock for notification, that would not work.
We can receive and process notifications while waiting for a reply, we only protect the sending.
@tmleman, do you know what is the notification that did not received an ACK? LOG_BUFFER_STATUS?

The host can only ack the notification/reply after it took the message out from mailbox.

I think the fw should wait for the ACK (that it is possible to send a message to host) in any case. Linux does that for all message.
See:
fw sends a notification
host receives notification (ACK is not cleared)
host starts to process it (taking it out from mailbox, etc) (ACK is not cleared)
host sends SET_DX (ACK is not cleared)
fw receives message (ACK is not cleared)
fw ignores that host is not yet ready to receive message and sends reply to SET_DX
host finishes with the notification and clears the ACK (and the notification data might be corrupted by the reply from fw)
The reply to SET_DX is lost

Linux always checks the DSP BUSY (ack from fw side) before sending message, if it is not clear then the message is moved to deferred 'list' and it is going to be sent when the ACK is received from DSP side that the fw is ready to receive new message.

I think similar 'deferred' sending should be done on the fw side as well?

@tmleman
Copy link
Contributor

tmleman commented Dec 5, 2023

@tmleman, do you know what is the notification that did not received an ACK? LOG_BUFFER_STATUS?

@ujfalusi I used our internal tests for reproduction (FW behavior is similar in these cases). I suspect that problem in this scenario is caused by the LOG_BUFFER_STATUS notification.

@lgirdwood
Copy link
Member

@tmleman @ujfalusi any consensus here ?
@RDharageswari is this still an issue, many fixes have upstreamed since initial report.

@tmleman
Copy link
Contributor

tmleman commented Dec 22, 2023

@lgirdwood @ujfalusi @RDharageswari I proposed a fix (pull request to Zephyr zephyrproject-rtos/zephyr#66135), it's one of many possible solutions. I encourage you to discuss it.

I don't have confirmation that this fix resolves this issue. I would need assistance in verifying this.

@tmleman
Copy link
Contributor

tmleman commented Jan 11, 2024

@RDharageswari can you check if the issue still reproduces on the main branch?

@lgirdwood
Copy link
Member

@RDharageswari can you check if the issue still reproduces on the main branch?

@tmleman I guess we need a west update to pick up the Zephyr commit now its merged ?

@tmleman
Copy link
Contributor

tmleman commented Jan 19, 2024

@lgirdwood The patch I'm interested in has already been integrated. The kernel version in SOF has also been updated earlier.

@abonislawski
Copy link
Member

@macchian could you help in fix verification on main branch?

@wszypelt
Copy link

wszypelt commented Feb 9, 2024

Due to lack of response, reduces to P2
@RDharageswari Can you confirm that the issue no longer occurs?

@wszypelt wszypelt added P2 Critical bugs or normal features and removed P1 Blocker bugs or important features urgent labels Feb 9, 2024
@wszypelt wszypelt assigned RDharageswari and unassigned tmleman Feb 9, 2024
@kv2019i kv2019i removed this from the v2.9 milestone Mar 13, 2024
@wszypelt
Copy link

@RDharageswari Can you confirm that the issue no longer occurs?

@wszypelt
Copy link

Due to the lack of response from the reporting person, I am closing the task

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Boot Firmware boot or code signing related. bug Something isn't working as expected IMR context save IMR Related to IMR (Isolated Memory Region) on Intel platforms IPC timeout IPC timeout observed MTL Applies to Meteor Lake platform mtl-005 mtl-006 P2 Critical bugs or normal features suspend-resume Issues observed when doing system suspend and resume
Projects
None yet
Development

No branches or pull requests