dai_zephyr: Silence benign warnings #8621
Conversation
The DAI emits a flood of warnings when presented with empty buffers at copy time. That's not really a reasonable warning condition. There are multiple situations where components upstream may be paused or asynchronous, leading to starvation in any given cycle.
Earlier code has hit this with paused components, where the log messages are merely annoying.
One new situation is that when using the DP scheduler, updates are async and may happen at a different cadence than the pipeline the DAI is on; the upstream component will be presented with data in a (for example) 1ms pipeline tick, but then send it to a different component (echo cancellation, say) that batches it up into larger buffers (10ms) and releases it downstream only at the slower cadence.
In that situation the flood of messages is being emitted during an active stream, and tends to cause glitches all by itself after a few seconds (and even where it doesn't, it floods the Zephyr log backend to the extent that literally every message is dropped).
(I don't know that all such warnings are removed by this patch. These are only the ones I've seen in practice.)
Fixes thesofproject#4672
Signed-off-by: Andy Ross <[email protected]>
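To make the cadence mismatch concrete, here is a small standalone illustration (plain C, not SOF code; the 1 ms tick and 10 ms batch are just the example numbers from the description above). With these numbers, nine out of every ten DAI copy calls see zero frames even though the stream is perfectly healthy:

#include <stdio.h>

/* Toy model: the LL pipeline ticks every 1 ms, but an upstream DP
 * component (echo cancellation, say) releases audio only once it has
 * accumulated a full 10 ms batch. Every tick in between, the DAI copy
 * path sees an empty buffer and -- before this patch -- warned about it.
 */
#define LL_TICK_MS 1
#define DP_BATCH_MS 10

int main(void)
{
	int batched_ms = 0;

	for (int tick = 0; tick < 30; tick += LL_TICK_MS) {
		batched_ms += LL_TICK_MS;	/* upstream produces 1 ms per tick */

		int released_ms = 0;
		if (batched_ms >= DP_BATCH_MS) {
			released_ms = batched_ms;	/* DP releases the whole batch at once */
			batched_ms = 0;
		}

		if (!released_ms)
			printf("tick %2d: nothing to copy (would have warned)\n", tick);
		else
			printf("tick %2d: %d ms released downstream\n", tick, released_ms);
	}

	return 0;
}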
EDIT: unrelated panic bug moved to: Very interesting! The test run is very far from finished yet, but when peeking at the internal side of https://sof-ci.01.org/sofpr/PR8621/build1091/devicetest/index.html I can already see the same boot time panic observed in #8571:
Heh. Are you absolutely, positively sure this isn't just a flaky RVP? None of these lines are even going to be hit on firmware load, they're part of the copy path once a pipeline is started. The only delta you'd see at the stage you're seeing the panic would be code size & linker address choice. Is someone available to hand-build this patch and validate on a different but compatible board? Again, I have an equivalent MTL device (surely not the same board, and likely a different stepping) on my desk and it's been fine.
Not flaky no. The DSP panic does not happen on every test but it happens several times per test run[*] on every single device and the register values are the same every time it happens. [*] including check-runtime-pm.sh because that reboots the DSP. Basically every time the DSP boots.
Did you compare the checksums at the end of the build?
We don't have identical toolchains, so I really doubt that's going to work. I'm happy to try one you send though. Which topology are you passing? I can try that too, though again I have a different kernel so how that loads (or if it loads, depending on the linux-side widgets involved) will be slightly different.
Some thoughts inline. I have no explanation for the #8629 DSP panic. I agree the changes you make should have no relation to a DSP panic. It is possible you have hit a heisenbug that changes timing enough to trigger a bug elsewhere (or we merged something today that has a regression, but upon a quick browse of today's data, all test runs are good before this PR).
@@ -1286,8 +1286,6 @@ int dai_zephyr_multi_endpoint_copy(struct dai_data **dd, struct comp_dev *dev,

	/* return if nothing to copy */
	if (!frames) {
		comp_warn(dev, "dai_zephyr_multi_endpoint_copy(): nothing to copy");
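For context, with the warning dropped the early return becomes roughly the following (a sketch only; the return value and the surrounding body are assumed from the hunk context, not copied from the patch):

	/* return if nothing to copy -- an empty cycle is expected when an
	 * upstream component is paused or a DP module has not yet released
	 * its next batch, so there is nothing to warn about here
	 */
	if (!frames)
		return 0;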
I'd like @marcinszkudlinski to weigh in from a DP perspective.
These really are useful warnings as the DAI interface should in fact see a steady stream of data in LL ticks. DP modules may process in larger chunks and this is perfectly fine, but
these should be hidden by the DP queues that sit between the DP and LL domains. The audio interface (e.g. DAI) runs at a fixed rate, and if we see these warnings in volume, something is usually wrong. If e.g. the peer pipeline is paused, the copiers should handle these situations.
Granted, we do have a minimal ping-pong buffer towards the DAI, so it is possible to have a stable buffering scheme where this warning will come continuously, but so far this has not been the case (given the warnings are still there).
With DP and AEC this might indeed happen, so I'm ok if @marcinszkudlinski et al sign this off.
I would prefer to just downgrade to comp_dbg() first, to make this a little easier to bring back.
@kv2019i @marcinszkudlinski , as you pointed out, when DP is handling large chunks, delay is expected (> the large chunk period); however, it is not an error and it should not be printed out.
We may need to find a way to make it only report real under-runs or overruns.
Split the log: once you hit a period threshold, report an error? Otherwise this can just be a debug log.
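A sketch of what that split could look like at the copy site (illustrative only: the empty_copy_count field and the threshold value are made up for this example, not existing SOF code; comp_dbg()/comp_err() are the usual component log macros):

#define EMPTY_COPY_ERR_THRESHOLD 20	/* e.g. ~2 x 10 ms DP period at a 1 ms LL tick */

	if (!frames) {
		/* quiet for occasional empty cycles, escalate once the
		 * condition persists past a full DP processing period
		 */
		if (++dd->empty_copy_count >= EMPTY_COPY_ERR_THRESHOLD)
			comp_err(dev, "dai: no data for %u consecutive ticks",
				 dd->empty_copy_count);
		else
			comp_dbg(dev, "dai: nothing to copy");
		return 0;
	}
	dd->empty_copy_count = 0;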
@btian1 Right, but this delay should be handled within the DP domain and DP elements. These prints are from the DAI interface code. If we don't have data when DAI copy is called, we are at the mercy of the per-DAI buffers (depends on DAI type, but typically 2-4 1ms chunks of DAI-facing buffer). We should not rely on this buffering for data flow, but instead should always have data to read/write when the DAI needs it. We have had separate discussions about minimizing the latency, and if we have a DAI that can operate with the 2x1ms ping-pong buffer, that does not leave any margin for errors from audio pipeline code. The LL scheduler is run every 1ms tick, and it should be able to provide/consume data without exceptions.
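Rough numbers behind that margin, taking the 2x1ms ping-pong case as an illustration: the link consumes 1 ms of audio every 1 ms tick and only one extra 1 ms chunk sits ahead of the DMA, so a single LL tick that delivers nothing already leaves the DAI on its last buffered chunk, and a second empty tick underruns the hardware; even the 4x1ms case only tolerates about three missed ticks.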
From @marcinszkudlinski's comment stuck in the older #8571:
@andyross this message must not appear more than 2x the DP processing period - that means 20 times in the case of AEC during pipeline startup.
500 lines means processing is not working properly.
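For reference, the arithmetic behind that bound, using the 10 ms AEC DP period and 1 ms LL tick discussed in this thread: one DP period spans 10 LL ticks, so at most 10 empty DAI copies can occur per period, and the initial fill of the DP input takes up to two periods, hence roughly 2 x 10 = 20 messages at startup. Hundreds of lines therefore point to a genuine processing stall rather than normal batching.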
@andyross I would strongly recommend to change the message level to debug. There is a potential risk of a message flood that can affect FW operability. It might be tricky to determine at the DAI level whether having no data to copy was intended in a given situation.
Bummer. You can still compare. Speaking of @lyakh, I checked a number of samples and it looks like this new boot test has always been failing on MTL, new DSP panic or not (and always passing on CAVS).
We never looked at firmware errors so this was never noticed.
EDIT: unrelated bug moved to: Yup, that's what did it. Building this patch with SOF_BOOT_TEST=y I do indeed get a panic on load. It looks like this:
Note the failure to load a driver for some component (which I'm sure you can guess: that's the UUID for the Google AEC component). And the faulting PC points to this line in lib_manager.c: https://github.com/thesofproject/sof/blob/main/src/library_manager/lib_manager.c#L341 ...where it's trying to dereference a NULL "ext_lib" pointer it just tried to retrieve, unsuccessfully. Now, as it happens AEC isn't built in the image here (the firmware can only hear about that UUID via topology). Though it is in the main PR obviously, and I checked that the failure is identical in both spots. So somewhere there's a problem with the library_manager registry understanding what it can assume about modules it's trying to test.
Alternatively there's an API that components need to implement properly in order to show up correctly to ipc4_get_drv and I'm not aware of it. Anyway, I'm going to let you guys work this for a bit. Will return to this time-available, but there are some audio issues with the PR I want to get squashed first.
Also also, just to I-told-you-so a bit: it's not enough to put a test in CI and demand it pass. You need to document exactly how to reproduce the failures. In this case CI wasn't actually building the same firmware SOF produces, it was building that test thing. And as the scripts aren't (AFAICT?) checked into thesofproject anywhere, I never would have known had Marc not mentioned it.
You're preaching to the choir here :-) (me)
That test Kconfig is in git. So is the build script and the MTL build logs are public.
- This test has failed on MTL since it was enabled. Running tests that we systematically ignore the failure of is much worse than not running them because it provides a false impression of quality.
- This can cause DSP panics as seen in thesofproject#8621
Signed-off-by: Marc Herbert <[email protected]>
Trying to reproduce DSP panic thesofproject#8621
-------
(commit message identical to the PR description above)
Signed-off-by: Andy Ross <[email protected]>
(cherry picked from commit 514576e)
In #8624 I tested this PR + removing CONFIG_SOF_BOOT_TEST and everything passed.
@marc-hb yes, it's known that this test fails on MTL and no, it cannot fail on cAVS, because the VMH isn't available on cAVS. But what I don't understand is how is this related to any panics? |
SOFCI TEST
The test that caused fear and uncertainty yesterday is now passing -> https://sof-ci.01.org/sofpr/PR8621/build1159/devicetest/index.html (EDIT: because it was turned off by #8624). My other concerns on the PR remain, I'd like an ack from @marcinszkudlinski, and at minimum to leave the warnings as dbg().
Continuing from @marcinszkudlinski 's comments in the original PR, best to review here: These are the warnings I'm seeing, they're coming in a flood every 1ms pipeline tick. I'm obviously not claiming to be an expert, but it seems clear (-ish) that the dai_zephyr expects to see non-zero output every tick, or else it hits this warning. But that's not the architecture of the component. The DP AEC component batches input into a 10ms buffer, then calls the AEC processing, then releases all that data into the output sink all at once. It's never going to be a steady thing unless the DAI is able to rate-limit with some other component. And if it is, that part doesn't seem to be configured properly? Basically: what exactly are the requirements for a DP component that I'm violating? If you guys need to reproduce, it should work to just grab the code from the PR.
Don't forget
That's not true,
Then maybe that's backing up? I'm at a complete loss for why we're seeing these errors in that case. Are there docs anywhere for exactly how many bytes a DP component should be sourcing/sinking and when? Digression: isn't that design exactly backwards though? By rate-limiting downstream you're delaying the arrival at the host output by 10ms (actually 9 I guess). Once AEC finishes, there are microphone bytes available for DMA to the ALSA pipeline on the host. Why are we waiting artificially?
@marcinszkudlinski added a helpful example in https://github.com/thesofproject/sof/blob/main/src/schedule/zephyr_dp_schedule.c#L64 on how LL-DP-LL are expected to work. Higher level docs are at https://thesofproject.github.io/latest/architectures/firmware/sof-zephyr/mpp_layer/mpp_scheduling.html
I wonder if we are systematically late by one LL tick. You could try cherry-picking #7736 (which adds even more debug output in this case) to see how dd->total_data_processed moves when "no data" conditions are met.
In general this type of delaying is done to ensure continuity of the audio stream. If a module operating at 10ms chunk size is surrounded by real-time sinks and sources running at a 1ms tick, a delay has to be introduced, as the real-time sinks cannot handle gaps in the stream and data must be consumed/produced at a fixed rate. There are special cases (error concealment algorithms can be used to conceal gaps in audio data, there is a fast-drain mode used in SOF e.g. for wake-on-voice to transfer data to host, and ASRC can be used to adapt rates), but in general the principle applies. So after the first real 10ms chunk of data is output, from that point onward, there should be a steady stream of data every 1ms LL tick with no exceptions.
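A standalone sketch of that intended behaviour (plain C with illustrative numbers, not the actual DP queue code): the DP module still processes in 10 ms batches, but the queue toward the LL/DAI side hands out exactly 1 ms per tick, so after the first batch is ready the DAI never sees an empty cycle:

#include <stdio.h>

#define LL_TICK_MS 1
#define DP_BATCH_MS 10

int main(void)
{
	int dp_input_ms = 0;	/* accumulating toward the next 10 ms batch */
	int dp_output_ms = 0;	/* processed data queued toward the DAI side */

	for (int tick = 0; tick < 30; tick += LL_TICK_MS) {
		/* LL side feeds the DP input 1 ms per tick */
		dp_input_ms += LL_TICK_MS;
		if (dp_input_ms >= DP_BATCH_MS) {
			dp_output_ms += dp_input_ms;	/* DP processes a full batch */
			dp_input_ms = 0;
		}

		/* the DP->LL queue releases at most 1 ms per tick */
		if (dp_output_ms >= LL_TICK_MS) {
			dp_output_ms -= LL_TICK_MS;
			printf("tick %2d: DAI gets 1 ms\n", tick);
		} else {
			printf("tick %2d: DAI starved (expected only before the first batch)\n", tick);
		}
	}

	return 0;
}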
Uh... why? There's 10ms of buffering after the very first iteration. It's true that the first time through you need to accumulate on the upstream side, but once you're running we should be emitting data as fast as possible knowing that we have 9 chunks waiting behind us if we fail to keep up. Again, this sounds backwards to me. It makes the DP scheduler worse, latency-wise, than a synchronous 10ms pipeline would be: you need to buffer 10ms upstream in both designs, but DP then adds an additional 9ms on the downstream side because it won't move more than one block at a time?
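Spelling out the comparison with the thread's example numbers (not measurements): both designs accumulate about 10 ms on the input side before the first output, but the drip-feed scheme then releases only 1 ms per tick, so the last sample of each batch leaves roughly 9 ms later than it could, giving on the order of 19 ms worst-case path latency versus about 10 ms for a synchronous 10 ms pipeline.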
FWIW, the root cause of these particular errors turns out to have been bad stream metadata on the sources/sinks, see the workaround patch at the end of the series in the main PR linked above. So I'm happy to drop this if there's consensus these can't be hit in normal circumstances (though again, it seems like there's a pre-existing problem too). The latency issue we ended up discussing instead actually seems like a genuine (and kinda serious) problem though. Looking at that now, I'll submit a bug.
title typo “Silence benign warnings” -> begin?
Close this, latency issues are tracked in #8640. The existing bug will be sad, though.
Oops, actually hit the "close" button this time.
I think there is still a request to fix the spam here @andyross |
@cujomalainey wrote:
Compromise proposal -> #8649
Merged to make this a build-time option for debug.
[A more limited version of this patch was submitted in #8571]
The DAI emits a flood of warnings when presented with empty buffers at copy time. That's not really a reasonable warning condition. There are multiple situations where components upstream may be paused or asynchronous, leading to starvation in any given cycle.
Earlier code has hit this with paused components, where the log messages are merely annoying.
One new situation is that when using the DP scheduler, updates are async and may happen at a different cadence than the pipeline the DAI is on; the upstream component will be presented with data in a (for example) 1ms pipeline tick, but then send it to a different component (echo cancellation, say) that batches it up into larger buffers (10ms) and releases it downstream only at the slower cadence.
In that situation the flood of messages is being emitted during an active stream, and tends to cause glitches all by itself after a few seconds (and even where it doesn't, it floods the Zephyr log backend to the extent that literally every message is dropped).
(I don't know that all such warnings are removed by this patch. These are only the ones I've seen in practice.)
Fixes #4672