[BUG][MTL] S0ix long run failure, Firmware boot failure due to timeout (ROM status: 0x50000005, ROM error: 0x0) #7866

RDharageswari · 2023-06-27T14:55:22Z

EDIT: the kernel part of this issue has been transferred to a newer issue:

[BUG][MTL] S0ix long run failure - timed out for 0x47000000 then kernel oops on page fault linux#4608

We see this error reported in multiple scenarios.
One was long-run S0ix testing.

[  352.971033] sof-audio-pci-intel-mtl 0000:00:1f.3: ipc timed out for 0x47000000|0x0
[  352.979537] sof-audio-pci-intel-mtl 0000:00:1f.3: ------------[ IPC dump start ]------------
[  352.988991] sof-audio-pci-intel-mtl 0000:00:1f.3: Host IPC initiator: 0xc7000000|0x0|0x0, target: 0x67000000|0x0|0x0, ctl: 0x3
[  353.001735] sof-audio-pci-intel-mtl 0000:00:1f.3: ------------[ IPC dump end ]------------
[  353.010979] sof-audio-pci-intel-mtl 0000:00:1f.3: ------------[ DSP dump start ]------------
[  353.020423] sof-audio-pci-intel-mtl 0000:00:1f.3: IPC timeout
[  353.026868] sof-audio-pci-intel-mtl 0000:00:1f.3: fw_state: SOF_FW_BOOT_COMPLETE (7)
[  353.035552] sof-audio-pci-intel-mtl 0000:00:1f.3: ROM status: 0xffffffff, ROM error: 0xffffffff
[  353.045287] sof-audio-pci-intel-mtl 0000:00:1f.3: ROM debug status: 0x50000005, ROM debug error: 0x0
[  353.055497] sof-audio-pci-intel-mtl 0000:00:1f.3: ROM feature bit not enabled
[  353.063477] sof-audio-pci-intel-mtl 0000:00:1f.3: ------------[ DSP dump end ]------------
[  353.072721] sof-audio-pci-intel-mtl 0000:00:1f.3: ctx_save IPC error: -110, proceeding with suspend
[  689.764439] sof-audio-pci-intel-mtl 0000:00:1f.3: GAIN (UUID: 61BCA9A8-18D0-4A18-8E7B-2639219804B7): No CPC match in the firmware file's manifest (ibs/obs: 768/768)
[  692.922594] sof-audio-pci-intel-mtl 0000:00:1f.3: ipc timed out for 0x47000000|0x0
[  692.931113] sof-audio-pci-intel-mtl 0000:00:1f.3: ------------[ IPC dump start ]------------
[  692.940571] sof-audio-pci-intel-mtl 0000:00:1f.3: Host IPC initiator: 0xc7000000|0x0|0x0, target: 0x67000000|0x0|0x0, ctl: 0x3
[  692.953325] sof-audio-pci-intel-mtl 0000:00:1f.3: ------------[ IPC dump end ]------------
[  692.962576] sof-audio-pci-intel-mtl 0000:00:1f.3: ------------[ DSP dump start ]---------
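
The header values in the dump above can be split into fields, which makes it easier to see which message timed out. The following is only a sketch under stated assumptions: the bit positions (busy flag at bit 31, module-message target at bit 30, reply direction at bit 29, message type in bits 28:24) and the value 7 for MOD_SET_DX follow the IPC4 header layout as understood here and should be checked against the kernel headers. `decode_ipc_header` is a hypothetical helper name.

```python
# Assumed IPC4 primary header layout; verify against the driver headers.
BUSY_BIT = 1 << 31    # assumed: busy/doorbell flag of the IPC register
TARGET_BIT = 1 << 30  # assumed: 1 = module message, 0 = global message
REPLY_BIT = 1 << 29   # assumed: 1 = reply, 0 = request
TYPE_SHIFT = 24
TYPE_MASK = 0x1F      # assumed: message type occupies bits 28:24

# Only the module message type discussed in this thread is named.
MODULE_TYPES = {7: "MOD_SET_DX"}


def decode_ipc_header(value):
    """Split a 32-bit IPC header into the assumed fields."""
    msg_type = (value >> TYPE_SHIFT) & TYPE_MASK
    is_module = bool(value & TARGET_BIT)
    if is_module:
        name = MODULE_TYPES.get(msg_type, "module type %d" % msg_type)
    else:
        name = "global type %d" % msg_type
    return {
        "busy": bool(value & BUSY_BIT),
        "module": is_module,
        "reply": bool(value & REPLY_BIT),
        "type": msg_type,
        "name": name,
    }


if __name__ == "__main__":
    # Values taken from the log lines quoted in this issue.
    for header in (0x47000000, 0xC7000000, 0x67000000):
        print(hex(header), decode_ipc_header(header))
```

Under these assumptions, the timed-out message 0x47000000 would be a SET_DX request, and the target value 0x67000000 would be a SET_DX reply.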

marc-hb · 2023-06-27T17:09:21Z

@RDharageswari can you please not ignore the entire bug template? It has a number of specific and important fields.

RDharageswari · 2023-06-29T16:37:30Z

Hi Marc,
Sure. I am trying to get the exact scenario, hence did not update it right. Will update the entire configuration once available.

mengdonglin · 2023-07-27T00:57:35Z

@RDharageswari Can you still reproduce this issue with mtl-005.0.3 (Hot Fix 3) release?

RDharageswari · 2023-08-14T20:57:14Z

@mengdonglin: No reports of this in recent days. But @yongzhi1 was able to repro this issue in the PnP setup with the mtl-005-hotfix3 release.

macchian · 2023-08-16T07:54:47Z

@mengdonglin, I also hit the IPC timeout during a suspend/resume test on a customer MTL board.
The issue can be reproduced on FW branches mtl-005-drop-stable and mtl-005-hotfix3 as well.

@yongzhi1, I saw your linux PR about the page fault:
ASoC: SOF: Intel: MTL: catch invalid IRQ IP ptr #4344
Is it related to this?

...
<3>[  124.809909] sof-audio-pci-intel-mtl 0000:00:1f.3: ipc timed out for 0x47000000|0x0
<3>[  124.818406] sof-audio-pci-intel-mtl 0000:00:1f.3: ------------[ IPC dump start ]------------
<3>[  124.830259] sof-audio-pci-intel-mtl 0000:00:1f.3: Host IPC initiator: 0xc7000000|0x0|0x0, target: 0xe7000000|0x0|0x0, ctl: 0x3
<3>[  124.856211] sof-audio-pci-intel-mtl 0000:00:1f.3: ------------[ IPC dump end ]------------
<3>[  124.874460] sof-audio-pci-intel-mtl 0000:00:1f.3: ------------[ DSP dump start ]------------
<3>[  124.893443] sof-audio-pci-intel-mtl 0000:00:1f.3: IPC timeout
<3>[  124.906717] sof-audio-pci-intel-mtl 0000:00:1f.3: fw_state: SOF_FW_BOOT_COMPLETE (7)
<3>[  124.924700] sof-audio-pci-intel-mtl 0000:00:1f.3: ROM status: 0xffffffff, ROM error: 0xffffffff
<3>[  124.944330] sof-audio-pci-intel-mtl 0000:00:1f.3: ROM debug status: 0x50000005, ROM debug error: 0x0
<3>[  124.966038] sof-audio-pci-intel-mtl 0000:00:1f.3: ROM feature bit not enabled
<3>[  124.982946] sof-audio-pci-intel-mtl 0000:00:1f.3: ------------[ DSP dump end ]------------
<4>[  125.001601] sof-audio-pci-intel-mtl 0000:00:1f.3: ctx_save IPC error: -110, proceeding with suspend
<1>[  125.199974] BUG: unable to handle page fault for address: ffffbffbcb40001f
<1>[  125.207668] #PF: supervisor read access in kernel mode

macchian · 2023-08-16T07:57:20Z

Attaching the crash logs.
kernel.20230815.094120.64594.txt

mengdonglin · 2023-08-16T10:05:54Z

@macchian @RDharageswari @yongzhi1 This issue seems similar to #7990 and #8028, which happen with mtl-005.0.3 when CPC-based clock selection and IMR context save are both enabled.

On the main branch, IMR context save is already enabled and the PR for CPC-based clock selection (#8019) is under review.

We found that on the main branch, with both IMR context save and CPC-based clock selection (PR) enabled, we cannot reproduce #7990 and #8028. Please find details and the recipe in #7990 (comment)

Maybe you can try this FW recipe of the main branch as well?

Or you may try the kernel test PR #8007 with MTL-005.0.3 FW. This kernel PR makes the FW always run at the highest clock by always sending ZERO CPC values to the FW.

macchian · 2023-08-18T03:52:10Z

> We found on main branch, with both IMR context save and CPC-based clock selection (PR) enabled, we cannot reproduce #7990 and #8028. Please find details and recipe in #7990 (comment)
>
> Maybe you can try this FW recipe of main branch as well?

@mengdonglin, thanks for your suggestions.
Yes, I provided a test MTL FW with PR #8019 applied on the mtl-005-hotfix3 branch.
So far the result is positive! The suspend test has run 600+ times.
Afterwards the ODM will run more DUTs when other devices are available.

> Or you may try this kernel test PR #8007 with MTL-005.0.3 FW? This kernel PR will make FW always run at highest clock by always sending ZERO CPC values to FW?

Just a thought: is #8007, which always sends ZERO CPC values to the FW, the desired fix?

lrudyX · 2023-08-18T08:50:29Z

@macchian Could you provide some more information on how to reproduce the issue?

macchian · 2023-08-22T04:06:07Z

> @macchian Could you provide some more information on how to reproduce the issue?

@lrudyX, suspend_stress_test is a ChromeOS-specific test tool. I am not sure whether the related CI check-suspend-resume-with-* tests can isolate the issue.

#suspend_stress_test -c 1500 --record_dmesg_dir=/usr/local/agingLogs --suspend_min=15 --suspend_max=20

macchian · 2023-08-22T04:09:44Z

> > We found on main branch, with both IMR context save and CPC-based clock selection (PR) enabled, we cannot reproduce #7990 and #8028. Please find details and recipe in #7990 (comment)
> > Maybe you can try this FW recipe of main branch as well?
>
> @mengdonglin, thanks for your suggestions.
> Yes, I provided a test MTL FW with PR #8019 applied on the mtl-005-hotfix3 branch.
> So far the result is positive! The suspend test has run 600+ times.
> Afterwards the ODM will run more DUTs when other devices are available.

@mengdonglin, according to the latest report, one DUT reproduces it at a high rate. Logs attached.

kernel.20230815.205835.87076.0.txt

mengdonglin · 2023-08-22T08:57:25Z

@macchian Thank you for testing the main branch! @abonislawski confirmed that the main branch will always run at the highest clock, as it lacks the rimage update for CPC values that mtl-005.0.3 has. Can we say that, except for the specific DUT, the reproduction rate of this issue is lower with main branch FW than with mtl-005.0.3? And what's the approximate reproduction rate for this specific DUT?

macchian · 2023-08-22T09:10:24Z

> @macchian Thank you for testing the main branch! @abonislawski confirmed that main branch will always run at highest clock as it lack rimage update for CPC values as mtl-005.0.3. Can we say that except for the specific DUT, the reproduce rate of this issue is lower with main branch FW than mtl-005.0.3? And what's the approximate reproduction rate for this specific DUT?

@mengdonglin, I'm afraid that 2 DUTs failed randomly within 20 suspend/resume cycles. On the 3rd DUT the failure appears almost every time. According to customer reports, the problem is easy to reproduce.

Do you think the kernel test PR #8007 is worth trying, or do you recommend other PRs?

mengdonglin · 2023-08-29T08:01:40Z

> @mengdonglin, I'm afraid that 2 DUTs failed randomly within 20 suspend/resume cycles. On the 3rd DUT the failure appears almost every time. According to customer reports, the problem is easy to reproduce.
>
> Do you think the kernel test PR #8007 is worth trying, or do you recommend other PRs?

@macchian You needn't try PR #8007 with mtl-005.0.3, because you've already tried main branch FW, where the DSP runs at the highest clock, and the issue can still be reproduced. PR #8007 just makes the DSP run at the highest clock with mtl-005.0.3.

kv2019i · 2023-08-30T11:25:08Z

@macchian @mengdonglin It is worth testing with #7994 (disables IMR context save).

With context save, e.g. a leak in DSP resources will compound and lead to errors. So if #7994 has an impact on the repro rate, this can reveal a lot more information about the bug and help in debugging.

macchian · 2023-08-31T11:41:47Z

> @macchian @mengdonglin It is worth testing with #7994 (disables IMR context save)
>
> With the context-save, e.g. a leak in DSP resources will be compounding and lead to errors. So if 7994 has impact to repro rate, this can reveal a lot more information about the bug and help in debug.

@kv2019i, I set up one device and it has a very high reproduction rate with IMR context save either enabled or disabled. The DSP dump reproduces almost every time within 20 cycles. I can share remote access with you if you need to check something.

[ 131.013601] sof-audio-pci-intel-mtl 0000:00:1f.3: ------------[ DSP dump start ]------------
[ 131.023065] sof-audio-pci-intel-mtl 0000:00:1f.3: Firmware boot failure due to timeout
[ 131.031919] sof-audio-pci-intel-mtl 0000:00:1f.3: fw_state: SOF_FW_BOOT_IN_PROGRESS (3)
[ 131.040875] sof-audio-pci-intel-mtl 0000:00:1f.3: ROM status: 0xffffffff, ROM error: 0xffffffff
[ 131.050595] sof-audio-pci-intel-mtl 0000:00:1f.3: ROM debug status: 0xd000001c, ROM debug error: 0x2328
[ 131.061093] sof-audio-pci-intel-mtl 0000:00:1f.3: ROM feature bit not enabled
[ 131.066218] pcieport 0000:00:06.1: pciehp: Slot(8): No link
[ 131.069065] sof-audio-pci-intel-mtl 0000:00:1f.3: ------------[ DSP dump end ]------------

macchian · 2023-08-31T11:42:03Z

dsp_dump.log

kv2019i · 2023-08-31T11:51:51Z

Thanks @macchian, this now looks a bit different. The DSP fails to boot (versus an IPC timeout before) and there's a ROM debug code:

```
[  131.023065] sof-audio-pci-intel-mtl 0000:00:1f.3: Firmware boot failure due to timeout
[  131.031919] sof-audio-pci-intel-mtl 0000:00:1f.3: fw_state: SOF_FW_BOOT_IN_PROGRESS (3)
[  131.040875] sof-audio-pci-intel-mtl 0000:00:1f.3: ROM status: 0xffffffff, ROM error: 0xffffffff
[  131.050595] sof-audio-pci-intel-mtl 0000:00:1f.3: ROM debug status: 0xd000001c, ROM debug error: 0x2328
```
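
Since the thread now contains two different failure signatures (IPC timeout versus boot timeout), it can help to sort recorded dmesg logs by signature automatically. The sketch below is one possible way to do that, relying only on the log strings quoted in this issue; `summarize_dsp_dumps` and the dictionary keys are hypothetical names, and the exact dmesg wording may differ between kernel versions.

```python
import re

# Patterns copied from the dmesg lines quoted in this thread.
FW_STATE_RE = re.compile(r"fw_state: (\w+) \((\d+)\)")
ROM_DEBUG_RE = re.compile(
    r"ROM debug status: (0x[0-9a-fA-F]+), ROM debug error: (0x[0-9a-fA-F]+)"
)


def summarize_dsp_dumps(log_text):
    """Return one summary dict per complete DSP dump found in log_text."""
    dumps = []
    current = None  # summary being built while inside a dump
    for line in log_text.splitlines():
        if "[ DSP dump start ]" in line:
            current = {"reason": None, "fw_state": None,
                       "rom_debug_status": None, "rom_debug_error": None}
        elif current is None:
            continue  # not inside a dump
        elif "[ DSP dump end ]" in line:
            dumps.append(current)
            current = None
        elif "Firmware boot failure due to timeout" in line:
            current["reason"] = "boot timeout"
        elif "IPC timeout" in line:
            current["reason"] = "ipc timeout"
        else:
            state = FW_STATE_RE.search(line)
            rom = ROM_DEBUG_RE.search(line)
            if state:
                current["fw_state"] = state.group(1)
            if rom:
                current["rom_debug_status"] = rom.group(1)
                current["rom_debug_error"] = rom.group(2)
    return dumps  # a dump truncated before its end marker is dropped
```

Feeding each file recorded by the stress test through this function and counting the `reason` values would give a rough per-signature reproduction rate.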

mengdonglin · 2023-09-05T02:20:33Z

@tmleman Can you help check this failure?

tmleman · 2023-11-24T09:22:20Z

While debugging I found one problem that may be causing this issue.

If the FW receives a SET_DX message (HOST->DSP) before receiving the ACK for a previously sent message/notification (DSP->HOST), the IPC device is in the busy state. During system suspend, Zephyr PM skips suspension of busy devices. As a result, the IPC device is neither suspended nor resumed during the power flow.

From the host perspective, this manifests as the ROM reporting that the FW is loaded while we never receive FW Ready. When you connect via gdb to core 0, the FW is in the idle state.
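
The mechanism described above can be illustrated with a toy model. This is not Zephyr code and uses no Zephyr API; `Device`, `system_suspend` and `system_resume` are hypothetical names, and the assumption that FW Ready depends on the IPC device having been resumed is an interpretation of the description, not something confirmed in this thread.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    busy: bool = False       # e.g. IPC device still waiting for the host ACK
    suspended: bool = False
    resumed: bool = False    # set only when the resume hook actually ran


def system_suspend(devices):
    """Suspend idle devices and skip busy ones, as described for Zephyr PM."""
    suspended = []
    for dev in devices:
        if dev.busy:
            continue  # busy device is left out of the power flow entirely
        dev.suspended = True
        suspended.append(dev)
    return suspended


def system_resume(suspended):
    """Resume only the devices that were suspended earlier."""
    for dev in suspended:
        dev.suspended = False
        dev.resumed = True


if __name__ == "__main__":
    ipc = Device("ipc", busy=True)   # ACK for a notification still pending
    dma = Device("dma")
    system_resume(system_suspend([ipc, dma]))
    # The IPC device never went through resume, so in this model it is
    # not re-initialized after the power cycle.
    print(ipc, dma)
```

In this model the skipped device comes out of the power cycle without its resume hook having run, which matches the symptom of a loaded FW that never announces FW Ready.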

lgirdwood · 2023-11-24T16:50:07Z

> While debugging I found one problem that may be causing this issue.
>
> If the FW receives a SET_DX message (HOST->DSP) before receiving the ACK for a previously sent message/notification (DSP->HOST), the IPC device is in the busy state. During system suspend, Zephyr PM skips suspension of busy devices. As a result, the IPC device is neither suspended nor resumed during the power flow.
>
> From the host perspective, this manifests as the ROM reporting that the FW is loaded while we never receive FW Ready. When you connect via gdb to core 0, the FW is in the idle state.

Good work @tmleman! Do you have a proposal for a fix?
@kv2019i @ujfalusi fyi

kv2019i · 2023-11-24T17:31:47Z

Context save not enabled in stable-v2.8, moving this to v2.9 (this is gating enabling context-save in mainline).

lgirdwood · 2023-11-29T16:55:35Z

@tmleman any update on the debug? My feeling is that the same root cause could also be behind thesofproject/linux#4832. @kv2019i fyi.

tmleman · 2023-11-30T09:32:34Z

@lgirdwood I haven't been able to reproduce this particular issue in my local environment. I need someone to confirm that the bug found is also the cause of this issue.

I also need to determine how the FW should behave in such a situation. The easiest solution is to ignore this last ACK and force the IPC device to sleep while going into the D3 state. Another solution would be to return an error in the SET_DX response, in which case the fix would need to be done in the Linux driver.
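
The two options can be contrasted in a few lines. This is a simplified sketch, not firmware code: `IpcDevice` and both handler names are hypothetical, and the use of `EBUSY` as the error value is an assumption made only for illustration.

```python
import errno
from dataclasses import dataclass


@dataclass
class IpcDevice:
    awaiting_host_ack: bool = False  # a DSP->HOST message is not yet ACKed
    suspended: bool = False


def set_dx_force_sleep(ipc):
    """Option 1: ignore the pending ACK and suspend the IPC device anyway."""
    ipc.awaiting_host_ack = False
    ipc.suspended = True
    return 0  # SET_DX succeeds


def set_dx_reject_if_busy(ipc):
    """Option 2: report an error so the host driver has to handle it."""
    if ipc.awaiting_host_ack:
        return -errno.EBUSY  # illustrative error code, not the real reply
    ipc.suspended = True
    return 0
```

Option 1 keeps the change inside the firmware, while option 2 leaves the device awake and requires the host side to retry or recover.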

lgirdwood · 2023-12-04T15:53:14Z

@plbossart @ujfalusi @RanderWang @ranj063 any comments here? Is the driver able to do this?

ranj063 · 2023-12-04T17:11:21Z

> If the FW receives a SET_DX message (HOST->DSP) before receiving the ACK for a previously sent message/notification (DSP->HOST), the IPC device is in the busy state. During system suspend, Zephyr PM skips suspension of busy devices. As a result, the IPC device is neither suspended nor resumed during the power flow.

@tmleman I'm a bit skeptical about the possibility of this happening in the linux driver. Basically, if the DSP has sent a message/notification to the host, we hold a spinlock until the ACK has been sent back to the DSP before initializing a new IPC.

tmleman · 2023-12-05T11:36:44Z

@lgirdwood I spoke with @mmaka1 about this and we agreed that this pending ACK should not be an obstacle to the D3 transition. This is because the device is not actually suspended but reset, and from a hardware perspective there is no trace of this ACK. I have pushed a fix for this issue for review: #8573.

tmleman · 2023-12-05T11:43:40Z

> I'm a bit skeptical about the possibility of this happening in the linux driver. Basically, if the DSP has sent a message/notification to the host, we hold a spinlock until the ACK has been sent back to the DSP before initializing a new IPC.

@ranj063 maybe this is a situation where the FW sends a notification at the same time as the HOST sends the SET_DX message? I can prepare a build with my changes, and if the problem described in this issue does not reproduce, we will be able to consider it as confirmation of the root cause. I've already asked @keqiaozhang for this.

ujfalusi · 2023-12-05T12:00:17Z

> @tmleman I'm a bit skeptical about the possibility of this happening in the linux driver. Basically, if the DSP has sent a message/notification to the host, we hold a spinlock until the ACK has been sent back to the DSP before initializing a new IPC.

@ranj063, I don't think we hold a spinlock for notifications, that would not work.
We can receive and process notifications while waiting for a reply, we only protect the sending.
@tmleman, do you know which notification did not receive an ACK? LOG_BUFFER_STATUS?

The host can only ack the notification/reply after it has taken the message out of the mailbox.

I think the fw should wait for the ACK (that it is possible to send a message to host) in any case. Linux does that for all messages.
See:
1. fw sends a notification
2. host receives notification (ACK is not cleared)
3. host starts to process it (taking it out from mailbox, etc) (ACK is not cleared)
4. host sends SET_DX (ACK is not cleared)
5. fw receives message (ACK is not cleared)
6. fw ignores that host is not yet ready to receive message and sends reply to SET_DX
7. host finishes with the notification and clears the ACK (and the notification data might be corrupted by the reply from fw)
8. The reply to SET_DX is lost

Linux always checks the DSP BUSY (ack from fw side) before sending a message. If it is not clear, the message is moved to a deferred 'list' and is sent once the ACK arrives from the DSP side, indicating that the fw is ready to receive a new message.

I think similar 'deferred' sending should be done on the fw side as well?
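
The race in the sequence above, and the effect of deferred sending, can be shown with a toy one-slot mailbox. This is a simplified model and not driver or firmware code; `Mailbox`, `send` and `ack` are hypothetical names, and the model collapses "read the message" and "clear the ACK" into a single step.

```python
from collections import deque


class Mailbox:
    """One-slot mailbox; the receiver clears `busy` when it ACKs."""

    def __init__(self):
        self.busy = False
        self.slot = None
        self.deferred = deque()

    def send(self, msg, defer=True):
        """Write msg if the receiver is ready, otherwise defer it.

        With defer=False the sender ignores `busy` and overwrites the slot,
        which models step 6 of the sequence above.
        """
        if self.busy and defer:
            self.deferred.append(msg)
            return False
        self.slot = msg
        self.busy = True
        return True

    def ack(self):
        """Receiver finishes with the slot and returns what it found there."""
        msg = self.slot
        self.slot = None
        self.busy = False
        if self.deferred:
            self.send(self.deferred.popleft())  # flush one deferred message
        return msg


if __name__ == "__main__":
    unsafe = Mailbox()
    unsafe.send("LOG_BUFFER_STATUS")
    unsafe.send("SET_DX reply", defer=False)  # overwrites the notification
    print(unsafe.ack(), unsafe.ack())

    safe = Mailbox()
    safe.send("LOG_BUFFER_STATUS")
    safe.send("SET_DX reply")                 # queued until the ACK
    print(safe.ack(), safe.ack())
```

Without deferral the receiver sees the reply in place of the notification and then finds the slot empty, so one message is lost. With deferral both messages arrive in order.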

tmleman · 2023-12-05T12:55:01Z

> @tmleman, do you know which notification did not receive an ACK? LOG_BUFFER_STATUS?

@ujfalusi I used our internal tests for reproduction (FW behavior is similar in these cases). I suspect that the problem in this scenario is caused by the LOG_BUFFER_STATUS notification.

lgirdwood · 2023-12-18T13:16:25Z

@tmleman @ujfalusi any consensus here?
@RDharageswari is this still an issue? Many fixes have been upstreamed since the initial report.

tmleman · 2023-12-22T20:44:52Z

@lgirdwood @ujfalusi @RDharageswari I proposed a fix (pull request to Zephyr zephyrproject-rtos/zephyr#66135), it's one of many possible solutions. I encourage you to discuss it.

I don't have confirmation that this fix resolves this issue. I would need assistance in verifying this.

tmleman · 2024-01-11T13:38:03Z

@RDharageswari can you check if the issue still reproduces on the main branch?

lgirdwood · 2024-01-11T13:45:30Z

> @RDharageswari can you check if the issue still reproduces on the main branch?

@tmleman I guess we need a west update to pick up the Zephyr commit now that it's merged?

tmleman · 2024-01-19T09:46:45Z

@lgirdwood The patch I'm interested in has already been integrated. The kernel version in SOF has also been updated earlier.

abonislawski · 2024-01-23T09:09:52Z

@macchian could you help verify the fix on the main branch?

wszypelt · 2024-02-09T09:22:49Z

Due to lack of response, reducing priority to P2.
@RDharageswari Can you confirm that the issue no longer occurs?

wszypelt · 2024-04-12T08:41:33Z

@RDharageswari Can you confirm that the issue no longer occurs?

wszypelt · 2024-07-19T07:54:53Z

Due to the lack of response from the reporting person, I am closing the task

RDharageswari added bug Something isn't working as expected MTL Applies to Meteor Lake platform mtl-005 labels Jun 27, 2023

RDharageswari mentioned this issue Jun 27, 2023

[BUG] ipc timed out MOD_SET_DX in suspend-resume in TGL/MTL #7482

Closed

lgirdwood added this to the v2.7 milestone Jul 5, 2023

mengdonglin added the IPC timeout IPC timeout observed label Jul 31, 2023

mengdonglin added the P1 Blocker bugs or important features label Aug 16, 2023

mengdonglin closed this as completed Aug 16, 2023

mengdonglin reopened this Aug 16, 2023

mengdonglin assigned tmleman Sep 5, 2023

keqiaozhang mentioned this issue Sep 5, 2023

[BUG][MTL] Firmware boot failure due to timeout during suspend/resume stress test (ROM status 0x50000005, ROM error 0x0) #8148

Closed

mengdonglin added the IMR context save label Nov 16, 2023

kv2019i modified the milestones: v2.8, v2.9 Nov 24, 2023

wszypelt added P2 Critical bugs or normal features and removed P1 Blocker bugs or important features urgent labels Feb 9, 2024

wszypelt assigned RDharageswari and unassigned tmleman Feb 9, 2024

kv2019i removed this from the v2.9 milestone Mar 13, 2024

wszypelt closed this as completed Jul 19, 2024
