dai_zephyr: Silence benign warnings #8621
Conversation
The DAI emits a flood of warnings when presented with empty buffers at copy time. That's not really a reasonable warning condition. There are multiple situations where components upstream may be paused or asynchronous, leading to starvation in any given cycle.
Earlier code has hit this with paused components, where the log messages are merely annoying.
One new situation is that when using the DP scheduler, updates are async and may happen at a different cadence than the pipeline the DAI is on; the upstream component will be presented with data in a (for example) 1ms pipeline tick, but then send it to a different component (echo cancellation, say) that batches it up into larger buffers (10ms) and releases it downstream only at the slower cadence.
In that situation the flood of messages is being emitted during an active stream, and tends to cause glitches all by itself after a few seconds (and even where it doesn't, it floods the Zephyr log backend to the extent that literally every message is dropped).
(I don't know that all such warnings are removed by this patch. These are only the ones I've seen in practice.)
Fixes thesofproject#4672
Signed-off-by: Andy Ross <[email protected]>
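To make the cadence mismatch concrete, here is a small standalone illustration (plain C, not SOF code; the 1 ms tick and 10 ms batch are just the example numbers from the description above). With these numbers, nine out of every ten DAI copy calls see zero frames even though the stream is perfectly healthy:

#include <stdio.h>

/* Toy model: the LL pipeline ticks every 1 ms, but an upstream DP
 * component (echo cancellation, say) releases audio only once it has
 * accumulated a full 10 ms batch. Every tick in between, the DAI copy
 * path sees an empty buffer and -- before this patch -- warned about it.
 */
#define LL_TICK_MS 1
#define DP_BATCH_MS 10

int main(void)
{
	int batched_ms = 0;

	for (int tick = 0; tick < 30; tick += LL_TICK_MS) {
		batched_ms += LL_TICK_MS;	/* upstream produces 1 ms per tick */

		int released_ms = 0;
		if (batched_ms >= DP_BATCH_MS) {
			released_ms = batched_ms;	/* DP releases the whole batch at once */
			batched_ms = 0;
		}

		if (!released_ms)
			printf("tick %2d: nothing to copy (would have warned)\n", tick);
		else
			printf("tick %2d: %d ms released downstream\n", tick, released_ms);
	}

	return 0;
}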
EDIT: unrelated panic bug moved to: Very interesting! The test run is very far from finished yet, but when peeking at the internal side of https://sof-ci.01.org/sofpr/PR8621/build1091/devicetest/index.html I can already see the same boot time panic observed in #8571:
Heh. Are you absolutely, positively sure this isn't just a flaky RVP? None of these lines are even going to be hit on firmware load, they're part of the copy path once a pipeline is started. The only delta you'd see at the stage you're seeing the panic would be code size & linker address choice. Is someone available to hand-build this patch and validate on a different but compatible board? Again, I have an equivalent MTL device (surely not the same board, and likely a different stepping) on my desk and it's been fine.
Not flaky no. The DSP panic does not happen on every test but it happens several times per test run[*] on every single device and the register values are the same every time it happens. [*] including check-runtime-pm.sh because that reboots the DSP. Basically every time the DSP boots.
Did you compare the checksums at the end of the build?
We don't have identical toolchains, so I really doubt that's going to work. I'm happy to try one you send though. Which topology are you passing? I can try that too, though again I have a different kernel so how that loads (or if it loads, depending on the linux-side widgets involved) will be slightly different.
Some thoughts inline. I have no explanation for the #8629 DSP panic. I agree the changes you make should have no relation to a DSP panic. It is possible you have hit a heisenbug that changes timing enough to trigger a bug elsewhere (or we merged something today that has a regression, but upon a quick browse of today's data, all test runs are good before this PR).
@@ -1286,8 +1286,6 @@ int dai_zephyr_multi_endpoint_copy(struct dai_data **dd, struct comp_dev *dev,

	/* return if nothing to copy */
	if (!frames) {
		comp_warn(dev, "dai_zephyr_multi_endpoint_copy(): nothing to copy");
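For context, with the warning dropped the early return becomes roughly the following (a sketch only; the return value and the surrounding body are assumed from the hunk context, not copied from the patch):

	/* return if nothing to copy -- an empty cycle is expected when an
	 * upstream component is paused or a DP module has not yet released
	 * its next batch, so there is nothing to warn about here
	 */
	if (!frames)
		return 0;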
I'd like @marcinszkudlinski to weigh in from a DP perspective.
These really are useful warnings as the DAI interface should in fact see a steady stream of data in LL ticks. DP modules may process in larger chunks and this is perfectly fine, but
these should be hidden by the DP queues that sit between the DP and LL domains. The audio interface (e.g. DAI) runs at a fixed rate, and if we see these warnings in volume, something is usually wrong. If e.g. the peer pipeline is paused, the copiers should handle these situations.
Granted, we do have a minimal ping-pong buffer towards the DAI, so it is possible to have a stable buffering scheme where this warning will come continuously, but so far this has not been the case (given the warnings are still there).
With DP and AEC this might indeed happen, so I'm ok if @marcinszkudlinski et al sign this off.
I would prefer to just downgrade to comp_dbg() first, to make this a little easier to bring back.
@kv2019i @marcinszkudlinski , as you pointed out, when DP is handling large chunks, delay is expected (> the large chunk period); however, it is not an error and it should not be printed out.
We may need to find a way to make it only report real under-runs or overruns.
Split the log: once you hit a period threshold, report an error? Otherwise this can just be a debug log.
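A sketch of what that split could look like at the copy site (illustrative only: the empty_copy_count field and the threshold value are made up for this example, not existing SOF code; comp_dbg()/comp_err() are the usual component log macros):

#define EMPTY_COPY_ERR_THRESHOLD 20	/* e.g. ~2 x 10 ms DP period at a 1 ms LL tick */

	if (!frames) {
		/* quiet for occasional empty cycles, escalate once the
		 * condition persists past a full DP processing period
		 */
		if (++dd->empty_copy_count >= EMPTY_COPY_ERR_THRESHOLD)
			comp_err(dev, "dai: no data for %u consecutive ticks",
				 dd->empty_copy_count);
		else
			comp_dbg(dev, "dai: nothing to copy");
		return 0;
	}
	dd->empty_copy_count = 0;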
@btian1 Right, but this delay should be handled within the DP domain and DP elements. These prints are from the DAI interface code. If we don't have data when DAI copy is called, we are at the mercy of the per-DAI buffers (depends on DAI type, but typically 2-4 1ms chunks of DAI-facing buffer). We should not rely on this buffering for data flow, but instead should always have data to read/write when the DAI needs it. We have had separate discussions about minimizing the latency, and if we have a DAI that can operate with the 2x1ms ping-pong buffer, that does not leave any margin for errors from audio pipeline code. The LL scheduler is run every 1ms tick, and it should be able to provide/consume data without exceptions.
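Rough numbers behind that margin, taking the 2x1ms ping-pong case as an illustration: the link consumes 1 ms of audio every 1 ms tick and only one extra 1 ms chunk sits ahead of the DMA, so a single LL tick that delivers nothing already leaves the DAI on its last buffered chunk, and a second empty tick underruns the hardware; even the 4x1ms case only tolerates about three missed ticks.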
From @marcinszkudlinski's comment stuck in the older #8571:
@andyross this message must not appear more than 2x the DP processing period - that means 20 times in the case of AEC during pipeline startup.
500 lines means processing is not working properly.
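For reference, the arithmetic behind that bound, using the 10 ms AEC DP period and 1 ms LL tick discussed in this thread: one DP period spans 10 LL ticks, so at most 10 empty DAI copies can occur per period, and the initial fill of the DP input takes up to two periods, hence roughly 2 x 10 = 20 messages at startup. Hundreds of lines therefore point to a genuine processing stall rather than normal batching.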
@andyross I would strongly recommend to change the message level to debug. There is a potential risk of a message flood that can affect FW operability. It might be tricky to determine at the DAI level whether having no data to copy was intended in a given situation.
Bummer. You can still compare. Speaking of @lyakh, I checked a number of samples and it looks like this new boot test has always been failing on MTL, new DSP panic or not (and always passing on CAVS).
We never looked at firmware errors so this was never noticed.
EDIT: unrelated bug moved to: Yup, that's what did it. Building this patch with SOF_BOOT_TEST=y I do indeed get a panic on load. It looks like this:
Note the failure to load a driver for some component (which I'm sure you can guess: that's the UUID for the Google AEC component). And the faulting PC points to this line in lib_manager.c: https://github.com/thesofproject/sof/blob/main/src/library_manager/lib_manager.c#L341 ...where it's trying to dereference a NULL "ext_lib" pointer it just tried to retrieve, unsuccessfully. Now, as it happens AEC isn't built in the image here (the firmware can only hear about that UUID via topology). Though it is in the main PR obviously, and I checked that the failure is identical in both spots. So somewhere there's a problem with the library_manager registry understanding what it can assume about modules it's trying to test.
Alternatively there's an API that components need to implement properly in order to show up correctly to ipc4_get_drv and I'm not aware of it. Anyway, I'm going to let you guys work this for a bit. Will return to this time-available, but there are some audio issues with the PR I want to get squashed first.
Also also, just to I-told-you-so a bit: it's not enough to put a test in CI and demand it pass. You need to document exactly how to reproduce the failures. In this case CI wasn't actually building the same firmware SOF produces, it was building that test thing. And as the scripts aren't (AFAICT?) checked into thesofproject anywhere, I never would have known had Marc not mentioned it.
You're preaching to the choir here :-) (me)
That test Kconfig is in git. So is the build script and the MTL build logs are public.
- This test has failed on MTL since it was enabled. Running tests that we systematically ignore the failure of is much worse than not running them because it provides a false impression of quality.
- This can cause DSP panics as seen in thesofproject#8621
Signed-off-by: Marc Herbert <[email protected]>
Trying to reproduce DSP panic thesofproject#8621
-------
(commit message identical to the PR description above)
Signed-off-by: Andy Ross <[email protected]>
(cherry picked from commit 514576e)
In #8624 I tested this PR + removing CONFIG_SOF_BOOT_TEST and everything passed.
@marc-hb yes, it's known that this test fails on MTL and no, it cannot fail on cAVS, because the VMH isn't available on cAVS. But what I don't understand is how is this related to any panics? |
SOFCI TEST
The test that caused fear and uncertainty yesterday is now passing -> https://sof-ci.01.org/sofpr/PR8621/build1159/devicetest/index.html (EDIT: because it was turned off by #8624). My other concerns on the PR remain, I'd like an ack from @marcinszkudlinski, and at minimum to leave the warnings as dbg().
Continuing from @marcinszkudlinski 's comments in the original PR, best to review here: These are the warnings I'm seeing, they're coming in a flood every 1ms pipeline tick. I'm obviously not claiming to be an expert, but it seems clear (-ish) that the dai_zephyr expects to see non-zero output every tick, or else it hits this warning. But that's not the architecture of the component. The DP AEC component batches input into a 10ms buffer, then calls the AEC processing, then releases all that data into the output sink all at once. It's never going to be a steady thing unless the DAI is able to rate-limit with some other component. And if it is, that part doesn't seem to be configured properly? Basically: what exactly are the requirements for a DP component that I'm violating? If you guys need to reproduce, it should work to just grab the code from the PR.
Don't forget
That's not true,
Then maybe that's backing up? I'm at a complete loss for why we're seeing these errors in that case. Are there docs anywhere for exactly how many bytes a DP component should be sourcing/sinking and when? Digression: isn't that design exactly backwards though? By rate-limiting downstream you're delaying the arrival at the host output by 10ms (actually 9 I guess). Once AEC finishes, there are microphone bytes available for DMA to the ALSA pipeline on the host. Why are we waiting artificially?
@marcinszkudlinski added a helpful example in https://github.com/thesofproject/sof/blob/main/src/schedule/zephyr_dp_schedule.c#L64 on how LL-DP-LL are expected to work. Higher level docs are at https://thesofproject.github.io/latest/architectures/firmware/sof-zephyr/mpp_layer/mpp_scheduling.html
I wonder if we are systematically late by one LL tick. You could try cherry-picking #7736 (which adds even more debug output in this case) to see how dd->total_data_processed moves when "no data" conditions are met.
In general this type of delaying is done to ensure continuity of the audio stream. If a module operating at 10ms chunk size is surrounded by real-time sinks and sources running at a 1ms tick, a delay has to be introduced, as the real-time sinks cannot handle gaps in the stream and data must be consumed/produced at a fixed rate. There are special cases (error concealment algorithms can be used to conceal gaps in audio data, there is a fast-drain mode used in SOF e.g. for wake-on-voice to transfer data to host, and ASRC can be used to adapt rates), but in general the principle applies. So after the first real 10ms chunk of data is output, from that point onward, there should be a steady stream of data every 1ms LL tick with no exceptions.
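A standalone sketch of that intended behaviour (plain C with illustrative numbers, not the actual DP queue code): the DP module still processes in 10 ms batches, but the queue toward the LL/DAI side hands out exactly 1 ms per tick, so after the first batch is ready the DAI never sees an empty cycle:

#include <stdio.h>

#define LL_TICK_MS 1
#define DP_BATCH_MS 10

int main(void)
{
	int dp_input_ms = 0;	/* accumulating toward the next 10 ms batch */
	int dp_output_ms = 0;	/* processed data queued toward the DAI side */

	for (int tick = 0; tick < 30; tick += LL_TICK_MS) {
		/* LL side feeds the DP input 1 ms per tick */
		dp_input_ms += LL_TICK_MS;
		if (dp_input_ms >= DP_BATCH_MS) {
			dp_output_ms += dp_input_ms;	/* DP processes a full batch */
			dp_input_ms = 0;
		}

		/* the DP->LL queue releases at most 1 ms per tick */
		if (dp_output_ms >= LL_TICK_MS) {
			dp_output_ms -= LL_TICK_MS;
			printf("tick %2d: DAI gets 1 ms\n", tick);
		} else {
			printf("tick %2d: DAI starved (expected only before the first batch)\n", tick);
		}
	}

	return 0;
}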
Uh... why? There's 10ms of buffering after the very first iteration. It's true that the first time through you need to accumulate on the upstream side, but once you're running we should be emitting data as fast as possible knowing that we have 9 chunks waiting behind us if we fail to keep up. Again, this sounds backwards to me. It makes the DP scheduler worse, latency-wise, than a synchronous 10ms pipeline would be: you need to buffer 10ms upstream in both designs, but DP then adds an additional 9ms on the downstream side because it won't move more than one block at a time?
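Spelling out the comparison with the thread's example numbers (not measurements): both designs accumulate about 10 ms on the input side before the first output, but the drip-feed scheme then releases only 1 ms per tick, so the last sample of each batch leaves roughly 9 ms later than it could, giving on the order of 19 ms worst-case path latency versus about 10 ms for a synchronous 10 ms pipeline.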
FWIW, the root cause of these particular errors turns out to have been bad stream metadata on the sources/sinks, see the workaround patch at the end of the series in the main PR linked above. So I'm happy to drop this if there's consensus these can't be hit in normal circumstances (though again, it seems like there's a pre-existing problem too). The latency issue we ended up discussing instead actually seems like a genuine (and kinda serious) problem though. Looking at that now, I'll submit a bug.
title typo “Silence benign warnings” -> begin?
Close this, latency issues are tracked in #8640. The existing bug will be sad, though.
Oops, actually hit the "close" button this time.
I think there is still a request to fix the spam here @andyross |
@cujomalainey wrote:
Compromise proposal -> #8649
Merged to make this a build-time option for debug.
[A more limited version of this patch was submitted in #8571]
The DAI emits a flood of warnings when presented with empty buffers at copy time. That's not really a reasonable warning condition. There are multiple situations where components upstream may be paused or asynchronous, leading to starvation in any given cycle.
Earlier code has hit this with paused components, where the log messages are merely annoying.
One new situation is that when using the DP scheduler, updates are async and may happen at a different cadence than the pipeline the DAI is on; the upstream component will be presented with data in a (for example) 1ms pipeline tick, but then send it to a different component (echo cancellation, say) that batches it up into larger buffers (10ms) and releases it downstream only at the slower cadence.
In that situation the flood of messages is being emitted during an active stream, and tends to cause glitches all by itself after a few seconds (and even where it doesn't, it floods the Zephyr log backend to the extent that literally every message is dropped).
(I don't know that all such warnings are removed by this patch. These are only the ones I've seen in practice.)
Fixes #4672