
dai_zephyr: Silence benign warnings #8621

Closed
wants to merge 1 commit into from

Conversation

andyross
Contributor

[A more limited version of this patch was submitted in #8571]

The DAI emits a flood of warnings when presented with empty buffers at copy time. That's not really a reasonable warning condition. There are multiple situations where components upstream may be paused or asynchronous, leading to starvation in any given cycle.

Earlier code has hit this with paused components, where the log messages are merely annoying.

One new situation is that when using the DP scheduler, updates are async and may happen at a different cadence than the pipeline the DAI is on; the upstream component will be presented with data in a (for example) 1ms pipeline tick, but then send it to a different component (echo cancellation, say) that batches it up into larger buffers (10ms) and releases it downstream only at the slower cadence.

In that situation the flood of messages is being emitted during an active stream, and tends to cause glitches all by itself after a few seconds (and even where it doesn't, it floods the Zephyr log backend to the extent that literally every message is dropped).

(I don't know that all such warnings are removed by this patch. These are only the ones I've seen in practice.)

Fixes #4672


Signed-off-by: Andy Ross <[email protected]>
@marc-hb
Collaborator

marc-hb commented Dec 13, 2023

EDIT: unrelated panic bug moved to:


Very interesting! The test run is very far from finished yet, but peeking at the internal side of https://sof-ci.01.org/sofpr/PR8621/build1091/devicetest/index.html I can already see the same boot-time panic observed in #8571:

[    6.396362] kernel: i915 display info: overlay_needs_physical: no
[    6.396363] kernel: i915 display info: supports_tv: no
[    6.396365] kernel: i915 display info: has_hdcp: yes
[    6.396366] kernel: i915 display info: has_dmc: yes
[    6.396367] kernel: i915 display info: has_dsc: yes
[    6.426531] kernel: fbcon: i915drmfb (fb0) is primary device
[    6.438103] kernel: Console: switching to colour frame buffer device 480x135
[    6.458504] kernel: i915 0000:00:02.0: [drm] fb0: i915drmfb frame buffer device
[    7.779320] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-mtl 0000:00:1f.3: ipc tx      : 0x44000000|0x3060004c: MOD_LARGE_CONFIG_SET [data size: 76]
[    7.779726] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-mtl 0000:00:1f.3: ipc tx reply: 0x64000000|0x3060004c: MOD_LARGE_CONFIG_SET
[    7.779759] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-mtl 0000:00:1f.3: ipc tx done : 0x44000000|0x3060004c: MOD_LARGE_CONFIG_SET [data size: 76]
[    7.779773] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-mtl 0000:00:1f.3: ipc tx      : 0x47000000|0x0: MOD_SET_DX [data size: 8]
[    7.882406] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-mtl 0000:00:1f.3: ipc rx      : 0x1b0a0000|0x0: GLB_NOTIFICATION|EXCEPTION_CAUGHT
[    7.882421] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: ------------[ DSP dump start ]------------
[    7.882432] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: DSP panic!
[    7.882437] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: fw_state: SOF_FW_BOOT_COMPLETE (7)
[    7.882468] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: ROM status: 0x5, ROM error: 0x0
[    7.882703] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: ROM debug status: 0x50000005, ROM debug error: 0x0
[    7.882727] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: ROM feature bit enabled
[    7.882776] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: Unknown toolchain is used
[    7.882783] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: error: DSP Firmware Oops
[    7.882787] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: error: Exception Cause: AllocaCause, MOVSP instruction, if caller’s registers are not in the register file
[    7.882793] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: EXCCAUSE 0x00000005 EXCVADDR 0xa010e7fc PS       0x00060420 SAR     0x00000001
[    7.882801] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: EPC1     0xa00327a5 EPC2     0x00000000 EPC3     0x00000000 EPC4    0x00000000
[    7.882807] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: EPC5     0x00000000 EPC6     0x00000000 EPC7     0x00000000 DEPC    0x00000000
[    7.882813] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: EPS2     0x00000000 EPS3     0x00000000 EPS4     0x00000000 EPS5    0x00000000
[    7.882818] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: EPS6     0x00000000 EPS7     0x00000000 INTENABL 0x00000000 INTERRU 0x00000000
[    7.882825] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: stack dump from 0x00000000
[    7.882831] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: AR registers:
[    7.882838] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: 0x0: a0036098 a010e7d0 00000001 a010e7fc
[    7.882844] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: 0x10: 00000001 a1131080 000a02e0 00000000
[    7.882850] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: 0x20: a0037cb8 a010e7a0 00000001 000002c0
[    7.882858] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: 0x30: a0037cb8 a010e7a0 00000001 000002c0
[    7.882867] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: ------------[ DSP dump end ]------------
[    7.882874] kernel: snd_sof:sof_set_fw_state: sof-audio-pci-intel-mtl 0000:00:1f.3: fw_state change: 7 -> 8
[    7.882884] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-mtl 0000:00:1f.3: ipc rx done : 0x1b0a0000|0x0: GLB_NOTIFICATION|EXCEPTION_CAUGHT
[    8.283733] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: ipc timed out for 0x47000000|0x0
[    8.283748] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: Attempting to prevent DSP from entering D3 state to preserve context
[    8.283753] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: ------------[ IPC dump start ]------------
[    8.283819] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: Host IPC initiator: 0xc7000000|0x0|0x0, target: 0x1b0a0000|0x0|0x0, ctl: 0x3
[    8.283824] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: ------------[ IPC dump end ]------------
[    8.283828] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: IPC timeout
[    8.283834] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: ctx_save IPC error: -110, proceeding with suspend
[    8.283932] kernel: snd_sof_pci_intel_mtl:mtl_enable_sdw_irq: sof-audio-pci-intel-mtl 0000:00:1f.3: FW Poll Status: reg[0x114c]=0x0 successful
[    8.283944] kernel: snd_sof_pci_intel_mtl:mtl_enable_interrupts: sof-audio-pci-intel-mtl 0000:00:1f.3: FW Poll Status: reg[0x1800]=0x0 successful
[    8.283952] kernel: snd_sof_pci_intel_mtl:mtl_enable_interrupts: sof-audio-pci-intel-mtl 0000:00:1f.3: FW Poll Status: reg[0x1140]=0x0 successful
[    8.285028] kernel: snd_sof_pci_intel_mtl:mtl_dsp_core_power_down: sof-audio-pci-intel-mtl 0000:00:1f.3: FW Poll Status: reg[0x178d04]=0x2000000 successful
[    8.286092] kernel: snd_sof_pci_intel_mtl:mtl_power_down_dsp: sof-audio-pci-intel-mtl 0000:00:1f.3: FW Poll Status: reg[0x1000]=0x0 successful
[    8.286171] kernel: snd_sof:snd_sof_pci_update_bits_unlocked: sof-audio-pci-intel-mtl 0000:00:1f.3: Debug PCIR: 00000000 at  00000044
[    8.286177] kernel: snd_sof:snd_sof_pci_update_bits_unlocked: sof-audio-pci-intel-mtl 0000:00:1f.3: Debug PCIW: 00000010 at  00000044
[    8.287223] kernel: snd_sof_intel_hda_common:hda_dsp_state_log: sof-audio-pci-intel-mtl 0000:00:1f.3: Current DSP power state: D3
[    8.287226] kernel: snd_sof:sof_set_fw_state: sof-audio-pci-intel-mtl 0000:00:1f.3: fw_state change: 8 -> 0
[    8.774103] kernel: ax88179_178a 4-1:1.0 enx000ec682d0ad: ax88179 - Link status is: 1
[   26.474094] kernel: systemd-udevd (431) used greatest stack depth: 11048 bytes left
[  104.292946] kernel: rfkill: input handler disabled

@andyross
Contributor Author

Heh. Are you absolutely, positively sure this isn't just a flaky RVP? None of these lines are even going to be hit on firmware load, they're part of the copy path once a pipeline is started. The only delta you'd see at the stage you're seeing the panic would be code size & linker address choice. Is someone available to hand-build this patch and validate on a different but compatible board?

Again, I have an equivalent MTL device (surely not the same board, and likely a different stepping) on my desk and it's been fine.

@marc-hb
Collaborator

marc-hb commented Dec 13, 2023

Heh. Are you absolutely, positively sure this isn't just a flaky RVP?

Not flaky no. The DSP panic does not happen on every test but it happens several times per test run[*] on every single device and the register values are the same every time it happens.

[*] including check-runtime-pm.sh because that reboots the DSP. Basically every time the DSP boots.

Again, I have an equivalent MTL device (surely not the same board, and likely a different stepping) on my desk and it's been fine.

Did you compare the checksums at the end of the build?

@andyross
Contributor Author

Did you compare the checksums at the end of the build?

We don't have identical toolchains, so I really doubt that's going to work. I'm happy to try one you send though. Which topology are you passing? I can try that too, though again I have a different kernel so how that loads (or if it loads, depending on the linux-side widgets involved) will be slightly different.

@marc-hb

This comment was marked as off-topic.

Collaborator

@kv2019i left a comment


Some thoughts inline. I have no explanation for the #8629 DSP panic. I agree the changes you make should have no relation to a DSP panic. It is possible you have hit a heisenbug that changes timing enough to trigger a bug elsewhere (or we merged something today that has a regression, but on a quick browse of today's data, all test runs are good before this PR).

@@ -1286,8 +1286,6 @@ int dai_zephyr_multi_endpoint_copy(struct dai_data **dd, struct comp_dev *dev,

	/* return if nothing to copy */
	if (!frames) {
		comp_warn(dev, "dai_zephyr_multi_endpoint_copy(): nothing to copy");
Collaborator

I'd like @marcinszkudlinski weigh in from DP perspective.

These really are useful warnings, as the DAI interface should in fact see a steady stream of data in LL ticks. DP modules may process in larger chunks, and this is perfectly fine, but
these cadence differences should be hidden by the DP queues that sit between the DP and LL domains. The audio interface (e.g. DAI) runs at a fixed rate, and if we see these warnings in volume, something is usually wrong. If e.g. a peer pipeline is paused, the copiers should handle these situations.

Granted, we do have a minimal ping-pong buffer towards the DAI, so it is possible to have a stable buffering scheme where this warning will come continuously, but so far this has not been the case (given the warnings are still there).

With DP and AEC this might indeed happen, so I'm ok if @marcinszkudlinski et al sign this off.

I would prefer to just downgrade to comp_dbg() first, to make this a little easier to bring back.

Contributor

@kv2019i @marcinszkudlinski, as you pointed out, when DP is handling large chunks a delay is expected (greater than one large-chunk period); however, it is not an error and should not be printed.

We may need to find a way to make it report only real under-runs or overruns.

Contributor

Split the log: once you hit a period threshold, report an error? Otherwise this can just be a debug log.
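The split being suggested could be sketched like this (illustrative only: `empty_copy_log_level`, the level enum, and the threshold value are made-up names for this sketch, not SOF's logging API):

```c
#include <assert.h>

enum empty_copy_level { EMPTY_COPY_DBG, EMPTY_COPY_ERR };

/* Hypothetical helper (not SOF API): choose a severity for an empty-copy
 * event from how many consecutive LL ticks have found nothing to copy.
 * Below the threshold this is expected batching/startup behavior and only
 * worth a debug message; at or past it, report a real underrun. */
enum empty_copy_level empty_copy_log_level(unsigned int consecutive_empty,
					   unsigned int threshold)
{
	return consecutive_empty >= threshold ? EMPTY_COPY_ERR : EMPTY_COPY_DBG;
}
```

A counter reset on every successful copy would drive this, so a healthy stream stays quiet and only sustained starvation escalates.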

Collaborator

@btian1 Right, but this delay should be handled within the DP domain and DP elements. These prints are from the DAI interface code. If we don't have data when DAI copy is called, we are at the mercy of the per-DAI buffers (depends on DAI type, but typically 2-4 chunks of 1ms DAI-facing buffers). We should not rely on this buffering for data flow, but instead should always have data to read/write when the DAI needs it. We have had separate discussions about minimizing the latency, and if we have a DAI that can operate with the 2x1ms ping-pong buffer, that does not leave any margin for errors from audio pipeline code. The LL scheduler is run every 1ms tick, and it should be able to provide/consume data without exceptions.

Collaborator

From @marcinszkudlinski, stuck in the older #8571:

@andyross this message must not appear for more than 2x the DP processing period, which means 20 times in the case of AEC during pipeline startup.
500 lines means processing is not working properly.
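That bound follows from the tick ratio (a sketch only; the 10ms DP period and 1ms LL tick are taken from the AEC example in this thread, and the function name is made up):

```c
/* Illustrative only: upper bound on legitimate "nothing to copy" ticks
 * during startup. Up to two DP processing periods' worth of LL ticks may
 * run dry while the DP module accumulates and releases its first batch. */
unsigned int max_expected_empty_ticks(unsigned int dp_period_ms,
				      unsigned int ll_tick_ms)
{
	return 2u * (dp_period_ms / ll_tick_ms);
}
```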

Contributor

@andyross I would strongly recommend changing the message level to debug. There is a potential risk of a message flood that can affect FW operability. It might be tricky to determine at the DAI level whether having no data to copy was intended in a given situation.

@marc-hb
Collaborator

marc-hb commented Dec 13, 2023

We don't have identical toolchains, so I really doubt that's going to work.

Bummer. You can still compare sof_versions.h and .config though, that's still very useful.

Speaking of .config, I just noticed this new sof_boot thing on every DSP boot. It's not specific to MTL. @andyross to maximize reproduction chances please make sure you have CONFIG_SOF_BOOT_TEST=y as well?

@lyakh I checked a number of samples and it looks like this new boot test has always been failing on MTL, new DSP panic or not (and always passing on CAVS)

[    0.106098] <inf> init: print_version_banner: FW ABI 0x301d000 DBG ABI 0x5003000 tags SOF:v2.5-stable-branch-1807-g7ee5c69d3555 zephyr:zephyr-v3.5.0-2009-gefc32081893d src hash 0x236add3c (ref hash 0x236add3c)
[    0.106116] <inf> clock: clock_set_freq: clock 0 set freq 393216000Hz freq_idx 1
[    0.106128] <inf> clock: clock_set_freq: clock 1 set freq 393216000Hz freq_idx 1
[    0.106138] <inf> clock: clock_set_freq: clock 2 set freq 393216000Hz freq_idx 1
*** Booting Zephyr OS build zephyr-v3.5.0-2009-gefc32081893d ***
[    0.106388] <inf> main: sof_app_main: SOF on intel_adsp_ace15_mtpm
[    0.106400] <inf> main: sof_app_main: SOF initialized
[    0.106986] <inf> ipc: ipc_cmd: rx	: 0x43000000|0x20701000
[    0.109031] <inf> ipc: ipc_cmd: rx	: 0x44000000|0x31400008
[    0.109661] <inf> ipc: ipc_cmd: rx	: 0x44000000|0x3060004c
[    2.272595] <inf> ipc: ipc_cmd: rx	: 0x44000000|0x3060004c
[    2.273046] <inf> ipc: ipc_cmd: rx	: 0x47000000|0x0
[    0.128908] <inf> ipc: ipc_cmd: rx	: 0x44000000|0x31400008
[    0.129560] <inf> ipc: ipc_cmd: rx	: 0x44000000|0x3060004c
[    0.405513] <inf> ipc: ipc_cmd: rx	: 0x11000005|0x0
[    0.405533] <inf> pipe: pipeline_new: pipeline new pipe_id 0 priority 0
Running TESTSUITE sof_boot
===================================================================
START - virtual_memory_heap
[    0.505716] <wrn> sof_boot_test: vmh_test_single: Ignoring failure to allocate 1616 in non-contiguous mode
[    0.505723] <err> sof_boot_test: vmh_test_single: Failed to allocate 26
[    0.505845] <err> sof_boot_test: vmh_test: Non-contiguous test error -12
[    0.505850] <err> sof_boot_test: sof_boot_virtual_memory_heap: virtual_memory_heap failed: -12
 FAIL - virtual_memory_heap in 0.001 seconds
===================================================================
TESTSUITE sof_boot failed.
[    0.606210] <inf> ipc: ipc_cmd: rx	: 0x40000004|0x15
[    0.606253] <inf> dma: dma_get: dma_get() ID 0 sref = 1 busy channels 0
[    0.607026] <inf> ipc: ipc_cmd: rx	: 0x40000006|0x10
[    0.607681] <inf> ipc: ipc_cmd: rx	: 0x44000006|0x30000018
[    0.608068] <inf> ipc: ipc_cmd: rx	: 0x40000002|0xa
[    0.608698] <inf> ipc: ipc_cmd: rx	: 0x11010006|0x0

We never looked at firmware errors, so this was never noticed.

@andyross
Contributor Author

andyross commented Dec 13, 2023

EDIT: unrelated bug moved to:


Yup, that's what done it. Building this patch with SOF_BOOT_TEST=y I do indeed get a panic on load. It looks like this:

[    0.288196] <err> ipc: ipc4_get_drv: get_drv(): the provided UUID (b780a0a6 466f269f df2377b4 58f75aa0) can't be found!
[    0.288218] <err> os: xtensa_excint1_c:  ** FATAL EXCEPTION
[    0.288225] <err> os: xtensa_excint1_c:  ** CPU 0 EXCCAUSE 13 (load/store PIF data error)
[    0.288233] <err> os: xtensa_excint1_c:  **  PC 0xa006ba74 VADDR 0x4
[    0.288235] <err> os: xtensa_excint1_c:  **  PS 0x60420
[    0.288238] <err> os: xtensa_excint1_c:  **    (INTLEVEL:0 EXCM: 0 UM:1 RING:0 WOE:1 OWB:4 CALLINC:2)
[    0.288241] <err> os: z_xtensa_dump_stack:  **  A0 0xa006c00c  SP 0xa00f2bc0  A2 0x17  A3 0x400feee0
[    0.288245] <err> os: z_xtensa_dump_stack:  **  A4 0x400fef54  A5 (nil)  A6 (nil)  A7 0xa00f2bc0
[    0.288248] <err> os: z_xtensa_dump_stack:  **  A8 0xa006ba68  A9 0x17 A10 0x400f6000 A11 0xce5
[    0.288250] <err> os: z_xtensa_dump_stack:  ** A12 0x400ff068 A13 0x10007 A14 0x400f8980 A15 0x10
[    0.288253] <err> os: z_xtensa_dump_stack:  ** LBEG 0xa007479a LEND 0xa00747d0 LCOUNT (nil)
[    0.288256] <err> os: z_xtensa_dump_stack:  ** SAR 0x10

Note the failure to load a driver for some component (which I'm sure you can guess: that's the UUID for the Google AEC component). And the faulting PC points to this line in lib_manager.c:

https://github.com/thesofproject/sof/blob/main/src/library_manager/lib_manager.c#L341

...where it's trying to dereference a NULL "ext_lib" pointer it just tried to retrieve, unsuccessfully.

Now, as it happens AEC isn't built in the image here (the firmware can only hear about that UUID via topology). Though it is in the main PR obviously, and I checked that the failure is identical in both spots. So somewhere there's a problem with the library_manager registry understanding what it can assume about modules it's trying to test.
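A defensive guard along the lines implied above might look like this (a sketch with stand-in mock types and names; `mock_ext_lib` and both functions are invented for illustration and are not the real lib_manager structures or API):

```c
#include <errno.h>
#include <stddef.h>

struct mock_ext_lib { int mod_count; };

/* Stand-in registry lookup that can legitimately fail, e.g. for a UUID
 * that arrived via topology but was never built into the image. */
struct mock_ext_lib *mock_lookup_ext_lib(int lib_id)
{
	static struct mock_ext_lib table[] = { { 3 }, { 5 } };

	if (lib_id < 0 || lib_id >= 2)
		return NULL;
	return &table[lib_id];
}

/* Guarded accessor: check the lookup result and return an error code
 * instead of dereferencing it unconditionally (the NULL dereference
 * pattern described above). */
int mock_ext_lib_mod_count(int lib_id)
{
	struct mock_ext_lib *lib = mock_lookup_ext_lib(lib_id);

	if (!lib)
		return -ENOENT;
	return lib->mod_count;
}
```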

@andyross
Contributor Author

Alternatively there's an API that components need to implement properly in order to show up correctly to ipc4_get_drv and I'm not aware of it.

Anyway, I'm going to let you guys work this for a bit. Will return to this time-available, but there are some audio issues with the PR I want to get squashed first.

@andyross
Contributor Author

Also also, just to I-told-you-so a bit: it's not enough to put a test in CI and demand it pass. You need to document exactly how to reproduce the failures. In this case CI wasn't actually building the same firmware SOF produces, it was building that test thing. And as the scripts aren't (AFAICT?) checked into thesofproject anywhere, I never would have known had Marc not mentioned it.

@marc-hb
Collaborator

marc-hb commented Dec 13, 2023

Also also, just to I-told-you-so a bit: it's not enough to put a test in CI and demand it pass. You need to document exactly how to reproduce the failures.

You're preaching to the choir here :-) (me)

In this case CI wasn't actually building the same firmware SOF produces, it was building that test thing. And as the scripts aren't (AFAICT?) checked into thesofproject anywhere, I never would have known had Marc not mentioned it.

That test Kconfig is in git.

So is the build script, and the MTL build logs are public.

marc-hb added a commit to marc-hb/sof that referenced this pull request Dec 13, 2023
- This test has failed on MTL since it was enabled.  Running tests that
  we systematically ignore the failure of is much worse than not running
  them because it provides a false impression of quality.

- This can cause DSP panics as seen in
  thesofproject#8621

Signed-off-by: Marc Herbert <[email protected]>
marc-hb pushed a commit to marc-hb/sof that referenced this pull request Dec 13, 2023
Trying to reproduce DSP panic thesofproject#8621


Signed-off-by: Andy Ross <[email protected]>
(cherry picked from commit 514576e)
@marc-hb
Collaborator

marc-hb commented Dec 14, 2023

In #8624 I tested this PR + removing CONFIG_SOF_BOOT_TEST and everything passed.

@lyakh
Collaborator

lyakh commented Dec 14, 2023

@lyakh I checked a number of samples and it looks like this new boot test has always been failing on MTL, new DSP panic or not (and always passing on CAVS)

@marc-hb yes, it's known that this test fails on MTL, and no, it cannot fail on cAVS, because the VMH isn't available on cAVS. But what I don't understand is how this is related to any panics?

kv2019i pushed a commit that referenced this pull request Dec 14, 2023
@kv2019i
Collaborator

kv2019i commented Dec 14, 2023

SOFCI TEST

@kv2019i
Collaborator

kv2019i commented Dec 14, 2023

The test that caused fear and uncertainty yesterday is now passing -> https://sof-ci.01.org/sofpr/PR8621/build1159/devicetest/index.html

( EDIT: because it was turned off by #8624 )

My other concerns on the PR remain: I'd like an ack from @marcinszkudlinski, and at minimum to keep the warnings as dbg() rather than remove them.

@andyross
Contributor Author

Continuing from @marcinszkudlinski 's comments in the original PR, best to review here:

These are the warnings I'm seeing, they're coming in a flood every 1ms pipeline tick.

I'm obviously not claiming to be an expert, but it seems clear (-ish) that the dai_zephyr expects to see non-zero output every tick, or else it hits this warning.

But that's not the architecture of the component. The DP AEC component batches input into a 10ms buffer, then calls the AEC processing, then releases all that data into the output sink all at once. It's never going to be a steady thing unless the DAI is able to rate-limit with some other component. And if it is, that part doesn't seem to be configured properly?

Basically: what exactly are the requirements for a DP component that I'm violating? If you guys need to reproduce, it should work to just grab the code from the PR.

[    0.627188] <wrn> dai_comp: dai_common_copy: comp:3 0x50004 dai_common_copy(): Copy_bytes 0 + free bytes 16 < period bytes 384, possible glitch
[    0.627208] <wrn> dai_comp: dai_common_copy: comp:3 0x50004 dai_zephyr_copy(): nothing to copy
[    0.627220] <wrn> dai_comp: dai_common_copy: comp:0 0x4 dai_common_copy(): Copy_bytes 0 + free bytes 192 < period bytes 768, possible glitch
[    0.627225] <wrn> dai_comp: dai_common_copy: comp:0 0x4 dai_zephyr_copy(): nothing to copy
[    0.628200] <wrn> dai_comp: dai_common_copy: comp:3 0x50004 dai_zephyr_copy(): nothing to copy
[    0.628213] <wrn> dai_comp: dai_common_copy: comp:0 0x4 dai_zephyr_copy(): nothing to copy
[    0.629188] <wrn> dai_comp: dai_common_copy: comp:3 0x50004 dai_zephyr_copy(): nothing to copy
[    0.629211] <wrn> dai_comp: dai_common_copy: comp:0 0x4 dai_zephyr_copy(): nothing to copy
[    0.630171] <wrn> dai_comp: dai_common_copy: comp:3 0x50004 dai_zephyr_copy(): nothing to copy
[    0.630183] <wrn> dai_comp: dai_common_copy: comp:0 0x4 dai_zephyr_copy(): nothing to copy
[    0.631186] <wrn> dai_comp: dai_common_copy: comp:3 0x50004 dai_common_copy(): Copy_bytes 0 + free bytes 16 < period bytes 384, possible glitch
[    0.631208] <wrn> dai_comp: dai_common_copy: comp:3 0x50004 dai_zephyr_copy(): nothing to copy
[    0.631220] <wrn> dai_comp: dai_common_copy: comp:0 0x4 dai_common_copy(): Copy_bytes 0 + free bytes 192 < period bytes 768, possible glitch
[    0.631225] <wrn> dai_comp: dai_common_copy: comp:0 0x4 dai_zephyr_copy(): nothing to copy
[    0.632200] <wrn> dai_comp: dai_common_copy: comp:3 0x50004 dai_zephyr_copy(): nothing to copy
[    0.632211] <wrn> dai_comp: dai_common_copy: comp:0 0x4 dai_zephyr_copy(): nothing to copy
[    0.633185] <wrn> dai_comp: dai_common_copy: comp:3 0x50004 dai_zephyr_copy(): nothing to copy
[    0.633206] <wrn> dai_comp: dai_common_copy: comp:0 0x4 dai_zephyr_copy(): nothing to copy
[    0.634186] <wrn> dai_comp: dai_common_copy: comp:3 0x50004 dai_zephyr_copy(): nothing to copy
[    0.634200] <wrn> dai_comp: dai_common_copy: comp:0 0x4 dai_zephyr_copy(): nothing to copy
[    0.635186] <wrn> dai_comp: dai_common_copy: comp:3 0x50004 dai_common_copy(): Copy_bytes 0 + free bytes 16 < period bytes 384, possible glitch
[    0.635208] <wrn> dai_comp: dai_common_copy: comp:3 0x50004 dai_zephyr_copy(): nothing to copy
[    0.635220] <wrn> dai_comp: dai_common_copy: comp:0 0x4 dai_common_copy(): Copy_bytes 0 + free bytes 192 < period bytes 768, possible glitch
[    0.635225] <wrn> dai_comp: dai_common_copy: comp:0 0x4 dai_zephyr_copy(): nothing to copy
[    0.636200] <wrn> dai_comp: dai_common_copy: comp:3 0x50004 dai_zephyr_copy(): nothing to copy
[    0.636211] <wrn> dai_comp: dai_common_copy: comp:0 0x4 dai_zephyr_copy(): nothing to copy
[    0.637188] <wrn> dai_comp: dai_common_copy: comp:3 0x50004 dai_zephyr_copy(): nothing to copy
[    0.637211] <wrn> dai_comp: dai_common_copy: comp:0 0x4 dai_zephyr_copy(): nothing to copy
[    0.638171] <wrn> dai_comp: dai_common_copy: comp:3 0x50004 dai_zephyr_copy(): nothing to copy
[    0.638183] <wrn> dai_comp: dai_common_copy: comp:0 0x4 dai_zephyr_copy(): nothing to copy

@marc-hb
Collaborator

marc-hb commented Dec 15, 2023

Continuing from @marcinszkudlinski 's comments in the original PR, best to review here:

Don't forget

@marcinszkudlinski
Contributor

marcinszkudlinski commented Dec 15, 2023

@andyross

The DP AEC component batches input into a 10ms buffer, then calls the AEC processing, then releases all that data into the output sink all at once. It's never going to be a steady thing unless the DAI is able to rate-limit with some other component

That's not true; see patch a55cfa5.
The LL component will receive data from DP in 1ms chunks.

@andyross
Contributor Author

Then maybe that's backing up? I'm at a complete loss for why we're seeing these errors in that case. Are there docs anywhere for exactly how many bytes a DP component should be sourcing/sinking and when?

Digression: isn't that design exactly backwards though? By rate-limiting downstream you're delaying the arrival at the host output by 10ms (actually 9 I guess). Once AEC finishes, there are microphone bytes available for DMA to the ALSA pipeline on the host. Why are we waiting artificially?

@kv2019i
Collaborator

kv2019i commented Dec 15, 2023

Then maybe that's backing up? I'm at a complete loss for why we're seeing these errors in that case. Are there docs anywhere for exactly how many bytes a DP component should be sourcing/sinking and when?

@marcinszkudlinski added a helpful example in https://github.com/thesofproject/sof/blob/main/src/schedule/zephyr_dp_schedule.c#L64 on how LL-DP-LL pipelines are expected to work. Higher-level docs are at https://thesofproject.github.io/latest/architectures/firmware/sof-zephyr/mpp_layer/mpp_scheduling.html

I wonder if we are systematically late by one LL tick. You could try cherry-picking #7736 (which adds even more debug output in this case) to see how dd->total_data_processed moves when "no data" conditions are met.

Digression: isn't that design exactly backwards though? By rate-limiting downstream you're delaying the arrival at the host output by 10ms (actually 9 I guess). Once AEC finishes, there are microphone bytes available for DMA to the ALSA pipeline on the host. Why are we waiting artificially?

In general this type of delaying is done to ensure continuity of the audio stream. If a module operating at a 10ms chunk size is surrounded by real-time sinks and sources running at a 1ms tick, a delay has to be introduced, as the real-time sinks cannot handle gaps in the stream and data must be consumed/produced at a fixed rate. There are special cases (error-concealment algorithms can be used to conceal gaps in audio data, there is a fast-drain mode used in SOF e.g. for wake-on-voice to transfer data to the host, and ASRC can be used to adapt rates), but in general the principle applies. So after the first real 10ms chunk of data is output, from that point onward there should be a steady stream of data every 1ms LL tick with no exceptions.
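The need for that initial delay can be seen in a toy model (not SOF code; the 10ms batch and 1ms tick follow the AEC example, and releasing the batch before the same tick's consume is an assumption of the sketch):

```c
/* Toy model: a DP module sits between a 1ms LL producer and a 1ms-rate
 * DAI. Each tick it accumulates 1ms of input; every 10th tick it releases
 * a 10ms batch to the output buffer; the DAI then consumes 1ms. Returns
 * how many ticks find the output empty ("nothing to copy"). */
int count_underrun_ticks(int prefill_ms, int ticks)
{
	int input_ms = 0, output_ms = prefill_ms, underruns = 0;

	for (int t = 0; t < ticks; t++) {
		if (++input_ms == 10) {	/* batch complete: release it */
			output_ms += 10;
			input_ms = 0;
		}
		if (output_ms > 0)
			output_ms--;	/* DAI copies 1ms */
		else
			underruns++;	/* DAI sees nothing to copy */
	}
	return underruns;
}
```

With no prefill the DAI starves for the first nine ticks and then runs clean; prefilling one batch minus one tick (9ms in this model) removes the gaps entirely, which is the delay being traded against latency.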

@andyross
Contributor Author

If a module operating at 10ms chunk size is surrounded by real-time sinks and sources running at 1ms tick,

Uh... why? There's 10ms of buffering after the very first iteration. It's true that the first time through you need to accumulate on the upstream side, but once you're running we should be emitting data as fast as possible knowing that we have 9 chunks waiting behind us if we fail to keep up.

Again, this sounds backwards to me. It makes the DP scheduler worse, latency-wise, than a synchronous 10ms pipeline would be: you need to buffer 10ms upstream in both designs, but DP then adds an additional 9ms on the downstream side because it won't move more than one block at a time?

@andyross
Contributor Author

FWIW, the root cause of these particular errors turns out to have been bad stream metadata on the sources/sinks, see the workaround patch at the end of the series in the main PR linked above. So I'm happy to drop this if there's consensus these can't be hit in normal circumstances (though again, it seems like there's a pre-existing problem too).

The latency issue we ended up discussing instead actually seems like a genuine (and kinda serious) problem though. Looking at that now; I'll submit a bug.

Contributor

@btian1 left a comment


title typo “Silence benign warnings” -> begin?

@andyross
Contributor Author

Closing this; latency issues are tracked in #8640.

The existing bug will be sad, though.

@andyross
Contributor Author

Oops, actually hit the "close" button this time.

@andyross andyross closed this Dec 18, 2023
@cujomalainey cujomalainey reopened this Dec 18, 2023
@cujomalainey
Contributor

I think there is still a request to fix the spam here @andyross

@kv2019i
Collaborator

kv2019i commented Dec 19, 2023

@cujomalainey wrote:

I think there is still a request to fix the spam here @andyross

Compromise proposal -> #8649

@lgirdwood
Member

@cujomalainey wrote:

I think there is still a request to fix the spam here @andyross

Compromise proposal -> #8649

Merged, to make this a build-time option for debug.


Successfully merging this pull request may close these issues.

[BUG] need to rate-limit 'nothing to copy' traces when mixer source is paused.
9 participants