Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Stress] Random FW loading failure #785

Closed
mengdonglin opened this issue Dec 31, 2018 · 10 comments
Closed

[Stress] Random FW loading failure #785

mengdonglin opened this issue Dec 31, 2018 · 10 comments
Assignees
Labels
bug Something isn't working as expected GLK Applies to Gemini Lake platform P1 Blocker bugs or important features

Comments

@mengdonglin
Copy link
Collaborator

mengdonglin commented Dec 31, 2018

We are seeing that the code loader DMA sometimes fails randomly during suspend stress test.

4809.407289] sof-audio sof-audio: loading firmware
[54809.407306] sof-audio sof-audio: booting DSP firmware
[54809.407577] sof-audio sof-audio: period_bytes:0x0
[54809.407579] sof-audio sof-audio: periods:1
[54809.429277] sof-audio sof-audio: unstall/run core: core_mask = 1
[54809.429295] sof-audio sof-audio: DSP core(s) enabled? 1 : core_mask 1
[54812.623148] sof-audio sof-audio: error: iteration 0 of load fw failed err: -62

Here's where the fw transfer fails for the first time.
So we try again...

[54812.645266] sof-audio sof-audio: DSP core(s) enabled? 0 : core_mask 3
[54812.645385] sof-audio sof-audio: period_bytes:0x0
[54812.645387] sof-audio sof-audio: periods:1
[54812.667277] sof-audio sof-audio: unstall/run core: core_mask = 1
[54812.667283] sof-audio sof-audio: DSP core(s) enabled? 1 : core_mask 1
[54813.301114] sof-audio sof-audio: unknown ROM status value 02000010

But the ROM fails to initialize properly the second time. What does the ROM status 02000010 indicate?

[54813.301133] sof-audio sof-audio: error: status = 0x00000000 panic = 0x00000000
[54813.301137] sof-audio sof-audio: unknown ROM status value 02000010
[54813.323273] sof-audio sof-audio: DSP core(s) enabled? 0 : core_mask 3
[54813.323292] sof-audio sof-audio: Error code=0xffffffff: FW status=0xffffffff
[54813.323295] sof-audio sof-audio: iteration 1 of Core En/ROM load fail:-5
[54813.323416] sof-audio sof-audio: period_bytes:0x0
[54813.323418] sof-audio sof-audio: periods:1
[54813.345270] sof-audio sof-audio: unstall/run core: core_mask = 1
[54813.345288] sof-audio sof-audio: DSP core(s) enabled? 1 : core_mask 1
[54813.369482] sof-audio sof-audio: error: rx list empty but received 0x4200 
[54813.369484] sof-audio sof-audio: error: can't find message header 0x1004200
[54813.979260] sof-audio sof-audio: unknown ROM status value 02000010
[54813.979278] sof-audio sof-audio: error: status = 0x00000000 panic = 0x00000000
[54813.979282] sof-audio sof-audio: unknown ROM status value 02000010

The third time fails with the same error. Can someone in the FW team please help us understand what is the right sequence to try in such cases?

[54814.001256] sof-audio sof-audio: DSP core(s) enabled? 0 : core_mask 3
[54814.001275] sof-audio sof-audio: Error code=0xffffffff: FW status=0xffffffff
[54814.001277] sof-audio sof-audio: iteration 2 of Core En/ROM load fail:-5
[54814.001293] sof-audio sof-audio: unknown ROM status value ffffffff
[54814.001309] sof-audio sof-audio: error: status = 0xffffffff panic = 0xffffffff
[54814.001325] sof-audio sof-audio: unknown ROM status value ffffffff
[54814.001341] sof-audio sof-audio: error: load fw failed after 3 attempts with err: -5
[54814.001343] sof-audio sof-audio: error: failed to reset DSP
[54814.001346] sof-audio sof-audio: error: failed to boot DSP firmware after resume -5
@mengdonglin mengdonglin added bug Something isn't working as expected P1 Blocker bugs or important features GLK Applies to Gemini Lake platform labels Dec 31, 2018
@jajanusz jajanusz assigned jajanusz and unassigned michalgrodzicki Dec 31, 2018
@jajanusz
Copy link
Contributor

-62 -> it’s timeout

02000010 -> your error log states ‘ROM status’ so I guess it’s first register of ROM FW Registers, then it says only that:
wait_state = WAIT_FOR_IPC_DONE (2)
state = FSR_ROM_BASEFW_CSE_IMR_REQUEST (10)

When you want to decode ROM registers values you should go to:
https://thesofproject.github.io/latest/getting_started/cavs-boot/apollolake/apl-boot-rom.html

@ranj063
Copy link
Collaborator

ranj063 commented Dec 31, 2018

-62 -> it’s timeout

02000010 -> your error log states ‘ROM status’ so I guess it’s first register of ROM FW Registers, then it says only that:
wait_state = WAIT_FOR_IPC_DONE (2)
state = FSR_ROM_BASEFW_CSE_IMR_REQUEST (10)

When you want to decode ROM registers values you should go to:
https://thesofproject.github.io/latest/getting_started/cavs-boot/apollolake/apl-boot-rom.html

@jajanusz does this mean the CSE has not responded or we missed the response. Its a bit of a mystery to me as why the first failure occurs and then why the second time the ROM fails to initialize the second time.

@ranj063
Copy link
Collaborator

ranj063 commented Jan 1, 2019

image007
This diagram explains what was happening with the skylake driver and it is likely what we're seeing with SOF as well.

Summary of the issue:

a. There is a race condition between host OS and CSE on FW auth. The starting point is DID (where CSE get access to system memory so it can start load CSE OS)

b. From this point, CSE load itself till it is ready to handle audio FW auth request – this takes about 255msec (and not 300msec as I reported before).

c. In parallel to CSE, host is running, Coreboot and the OS itself, till it sends the first auth commands to CSE FW.

d. With latest data I got from Suren, there is a gap of ~30-70msec. OS is sending the auth command 70msec before CSE is ready. See diagram below.

e. At this stage, audio identify the error, and start polling for 300msec, till it give up and restart the flow again, only then the flow success

@ranj063
Copy link
Collaborator

ranj063 commented Jan 1, 2019

@mengdonglin @keyonjie @jajanusz FYI the aboce scenario seems like a plausible explanation for what is happening.

@mengdonglin
Copy link
Collaborator Author

Thank you all for analysis!

Linked this issue to thesofproject/linux#87, since we need to harden kernel support for APL and GLK.

@ranj063
Copy link
Collaborator

ranj063 commented Jan 1, 2019

ranj063/linux@08486cf
ranj063/linux@ba92bac

@mengdonglin @keqiaozhang can we please test with the above 2 patches on top of the latest R72 branch?

@ranj063
Copy link
Collaborator

ranj063 commented Jan 1, 2019

Thank you all for analysis!

Not my analysis :). This was done when the skylake driver showed similar symptoms.

Linked this issue to thesofproject/linux#87, since we need to harden kernel support for APL and GLK.

@ranj063
Copy link
Collaborator

ranj063 commented Jan 4, 2019

@mengdonglin should we call this fixed now?

@lgirdwood
Copy link
Member

@mengdonglin ping, can we close ?

@wenqingfu
Copy link

wenqingfu commented Jan 10, 2019

ranj063/linux@08486cf
ranj063/linux@ba92bac

@mengdonglin @keqiaozhang can we please test with the above 2 patches on top of the latest R72 branch?

close as fixed by thesofproject/linux#507

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working as expected GLK Applies to Gemini Lake platform P1 Blocker bugs or important features
Projects
None yet
Development

No branches or pull requests

6 participants