[BUG][MTL] S0ix long run failure, Firmware boot failure due to timeout (ROM status: 0x50000005, ROM error: 0x0) #7866
Comments
@RDharageswari can you please not ignore the entire bug template? It has a number of specific and important fields. |
Hi Marc, |
@RDharageswari Can you still reproduce this issue with mtl-005.0.3 (Hot Fix 3) release? |
@mengdonglin: No reports of this in recent days, but @yongzhi1 was able to reproduce this issue in the PnP setup with the mtl-005-hotfix3 release. |
@mengdonglin, I also found the IPC timeout during the suspend/resume test on a customer MTL board. @yongzhi1, I saw your Linux PR about page fault
attach the crash logs. |
@macchian @RDharageswari @yongzhi1 This issue seems similar to #7990 and #8028, which happen with mtl-005.0.3 when CPC-based clock selection and IMR context save are both enabled. On the main branch, IMR context save is already enabled and the PR for CPC-based clock selection is under review (#8019). We found that on the main branch, with both IMR context save and CPC-based clock selection (PR) enabled, we cannot reproduce #7990 and #8028. Please find details and the recipe in #7990 (comment). Maybe you can try this FW recipe of the main branch as well? Or you may try the kernel test PR #8007 with MTL-005.0.3 FW. This kernel PR makes the FW always run at the highest clock by always sending zero CPC values to the FW. |
@macchian Could you provide some more information how to reproduce the issue? |
@lrudyX, the suspend_stress_test tool is a ChromeOS-specific test tool. I am not sure whether the relevant CI check-suspend-resume-with-* tests can isolate the issue. The command used is: # suspend_stress_test -c 1500 --record_dmesg_dir=/usr/local/agingLogs --suspend_min=15 --suspend_max=20 |
@mengdonglin, per the latest report, one DUT reproduces it at a high rate. Logs attached. |
@macchian Thank you for testing the main branch! @abonislawski confirmed that the main branch will always run at the highest clock, as it lacks the rimage update for CPC values that mtl-005.0.3 has. Can we say that, except for the specific DUT, the reproduction rate of this issue is lower with main branch FW than with mtl-005.0.3? And what is the approximate reproduction rate for this specific DUT? |
@mengdonglin, I'm afraid 2 DUTs failed randomly within 20 suspend/resume cycles. On the 3rd DUT the failure will almost certainly appear. According to customer reports, this problem is easy to reproduce. Do you think the kernel test PR #8007 is worth trying, or do you recommend other PRs? |
@macchian @mengdonglin It is worth testing with #7994 (disables IMR context save). With context save enabled, a leak in DSP resources, for example, will compound across cycles and lead to errors. So if #7994 has an impact on the repro rate, this can reveal a lot more information about the bug and help in debugging. |
@kv2019i, I set up one device and it has a very high reproduction rate whether IMR context save is enabled or disabled. The DSP dump reproduces almost every time within 20 cycles. I can share remote access with you if you need to check something. [ 131.013601] sof-audio-pci-intel-mtl 0000:00:1f.3: ------------[ DSP dump start ]------------ |
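To compare reproduction rates across firmware builds, it can help to count the DSP dump markers in the recorded dmesg files. The following is a minimal sketch, assuming the logs are plain-text dmesg captures containing marker lines in the format quoted above. The function name `find_dsp_dumps` is hypothetical and is not part of any SOF or ChromeOS tool.

```python
import re

# Matches marker lines in the format quoted in the comment above, e.g.
# [ 131.013601] sof-audio-pci-intel-mtl 0000:00:1f.3: ------------[ DSP dump start ]------------
DUMP_START = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<driver>\S+)\s+(?P<bdf>\S+):\s+-+\[ DSP dump start \]-+"
)


def find_dsp_dumps(dmesg_text):
    """Return (timestamp, driver, pci_address) for every DSP dump marker found.

    Hypothetical helper for illustration only.
    """
    hits = []
    for line in dmesg_text.splitlines():
        match = DUMP_START.search(line)
        if match:
            hits.append(
                (float(match.group("ts")), match.group("driver"), match.group("bdf"))
            )
    return hits
```

Running this over each dmesg file saved by the stress test gives the number of dumps per run, which can then be divided by the number of suspend cycles in that run.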
Thanks @macchian, this now looks a bit different. The DSP fails to boot (versus an IPC timeout before) and there's a ROM debug code:
|
@tmleman Can you help check this failure? |
While debugging I found one problem that may be causing this issue. If the FW receives a SET_DX message (HOST->DSP) before receiving the ACK for a previously sent message/notification (DSP->HOST), the IPC device is in a busy state. During system suspend, Zephyr PM skips the suspension of busy devices. As a result, the IPC device is neither suspended nor resumed during the power flow. From the host perspective, this manifests as the ROM reporting that the FW is loaded while FW Ready is never received. When you connect via gdb to core 0, the FW is in the idle state. |
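The failure mode described above can be modelled in a few lines. This is a simplified Python model under stated assumptions, not Zephyr code: the `Device` class and the `system_suspend` and `system_resume` functions are hypothetical names that only mimic the described behaviour, where busy devices are skipped during suspend and therefore never see a resume either.

```python
from dataclasses import dataclass


@dataclass
class Device:
    """Hypothetical stand-in for a power-managed device."""
    name: str
    busy: bool = False        # True while a transaction is still in flight
    suspended: bool = False
    resume_calls: int = 0


def system_suspend(devices):
    """Suspend every idle device; busy devices are skipped, not waited for."""
    handled = []
    for dev in devices:
        if dev.busy:
            continue          # the IPC device takes this path in the bug
        dev.suspended = True
        handled.append(dev)
    return handled


def system_resume(handled):
    """Resume only the devices that were actually suspended."""
    for dev in reversed(handled):
        dev.suspended = False
        dev.resume_calls += 1


# The DSP->HOST notification has not been ACKed yet when SET_DX arrives.
ipc = Device("ipc", busy=True)
dma = Device("dma")

handled = system_suspend([ipc, dma])
system_resume(handled)
```

In this model the `ipc` device ends the cycle with `resume_calls == 0`, so any re-initialisation that lives in its resume path never runs, which is consistent with the host seeing the FW loaded but never receiving FW Ready.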
Good work @tmleman ! Do you have a proposal for a fix ? |
Context save not enabled in stable-v2.8, moving this to v2.9 (this is gating enabling context-save in mainline). |
@tmleman any update on the debug? My feeling is that the same root cause could also cause thesofproject/linux#4832. @kv2019i fyi. |
@lgirdwood I haven't been able to reproduce this particular issue in my local environment. I need someone to confirm that the bug found is also the cause of this issue. I also need to determine how FW should behave in such a situation. The easiest solution is to ignore this last ACK and force the IPC device to sleep while going into the D3 state. Another solution would be to return an error in the SET_DX response, and fix would need to be done in linux driver. |
@plbossart @ujfalusi @RanderWang @ranj063 any comments here ? Is driver able to do this ? |
@tmleman I'm a bit skeptical about the possibility of this happening in the Linux driver. Basically, if the DSP has sent a message/notification to the host, we hold a spinlock until the ACK has been sent back to the DSP before initiating a new IPC. |
@lgirdwood I spoke with @mmaka1 about this and we agreed that this pending ACK should not be an obstacle in the D3 transition. This is because the device is not actually suspended but reset, and from a hardware perspective there is no trace of this ACK. I have pushed a fix for this issue for review: #8573. |
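The idea of not letting a pending ACK block the D3 transition can be sketched as follows. This is an illustrative Python model only and does not reproduce the contents of #8573: the `IpcDevice` class, its `awaiting_host_ack` flag, and the `force` parameter are all hypothetical.

```python
class IpcDevice:
    """Hypothetical model of the firmware-side IPC device."""

    def __init__(self):
        self.awaiting_host_ack = False
        self.suspended = False

    def send_notification(self):
        # A DSP->HOST message is now outstanding until the host ACKs it.
        self.awaiting_host_ack = True

    def is_busy(self):
        return self.awaiting_host_ack

    def suspend(self, force=False):
        """Suspend the device; with force=True a pending ACK is discarded."""
        if self.is_busy():
            if not force:
                return False  # normal path: a busy device refuses to suspend
            # D3 path: the hardware is reset anyway, so the ACK is dropped.
            self.awaiting_host_ack = False
        self.suspended = True
        return True


ipc = IpcDevice()
ipc.send_notification()
normal_result = ipc.suspend()            # refused while the ACK is pending
d3_result = ipc.suspend(force=True)      # D3 transition ignores the ACK
```

The design choice modelled here is that the forced path is used only when entering D3, where the device is reset rather than suspended, so discarding the outstanding ACK loses no hardware state.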
@ranj063 maybe this is a situation when FW sends notifications at the same time when the HOST sends the SET_DX message? I can prepare a build with my changes and if the problem described in this issue does not reproduce, we will be able to consider it as confirmation of the root-cause. I've already asked @keqiaozhang for this. |
@ranj063, I don't think we hold a spinlock for notifications, that would not work. The host can only ack the notification/reply after it has taken the message out of the mailbox. I think the FW should wait for the ACK (which signals that it is possible to send a message to the host) in any case. Linux does that for all messages. Linux always checks DSP BUSY (the ack from the FW side) before sending a message; if it is not clear, the message is moved to a deferred 'list' and is sent once the ACK arrives from the DSP side, indicating that the FW is ready to receive a new message. I think similar 'deferred' sending should be done on the FW side as well? |
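The deferred-sending scheme described in the comment above can be sketched as a small queue model. This is a hedged illustration in Python, not the Linux driver code: `HostIpc`, `dsp_busy`, and `on_ack` are hypothetical names that only mirror the described behaviour of checking the busy flag, deferring, and flushing on ACK.

```python
from collections import deque


class HostIpc:
    """Hypothetical sender that defers messages while the peer is busy."""

    def __init__(self):
        self.dsp_busy = False     # set after a send, cleared by the peer's ACK
        self.deferred = deque()   # messages waiting for the busy flag to clear
        self.sent = []            # messages actually put on the wire, in order

    def _transmit(self, msg):
        self.sent.append(msg)
        self.dsp_busy = True

    def send(self, msg):
        """Send now if the peer is free, otherwise queue the message."""
        if self.dsp_busy:
            self.deferred.append(msg)
            return False
        self._transmit(msg)
        return True

    def on_ack(self):
        """Peer acknowledged the last message; flush one deferred message."""
        self.dsp_busy = False
        if self.deferred:
            self._transmit(self.deferred.popleft())


host = HostIpc()
first = host.send("MSG_A")      # goes out immediately
second = host.send("SET_DX")    # deferred, peer has not ACKed MSG_A yet
host.on_ack()                   # ACK for MSG_A arrives, SET_DX is sent now
```

The same structure could in principle be mirrored on the firmware side for DSP->HOST messages, which is what the comment proposes.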
@tmleman @ujfalusi any consensus here ? |
@lgirdwood @ujfalusi @RDharageswari I proposed a fix (pull request to Zephyr zephyrproject-rtos/zephyr#66135), it's one of many possible solutions. I encourage you to discuss it. I don't have confirmation that this fix resolves this issue. I would need assistance in verifying this. |
@RDharageswari can you check if the issue still reproduces on the main branch? |
@tmleman I guess we need a west update to pick up the Zephyr commit now that it's merged? |
@lgirdwood The patch I'm interested in has already been integrated. The kernel version in SOF has also been updated earlier. |
@macchian could you help in fix verification on main branch? |
Due to lack of response, reducing priority to P2. |
@RDharageswari Can you confirm that the issue no longer occurs? |
Due to the lack of response from the reporting person, I am closing the task |
EDIT: the kernel part of this issue has been transferred to newer:
We see this error reported in multiple scenarios:
One was long run S0ix testing.
cc: