Audio stream / frame-based processing #584

georgezachos · 2022-01-05T13:50:49Z

Hello,

I'm trying to implement frame-by-frame processing of a trained DCCRNet model, so I can manually feed a stream of audio chunks as continuous input to the model, and get an output again as a stream of audio with the model's inherent latency (6 frames in this case). Is there a straightforward way to achieve this with asteroid? To my understanding, the implementation assumes that all the input data is available for evaluating the model.

I've tried LambdaOverlapAdd and a custom overlap-and-add method, but neither yielded good results: the output sounds almost unprocessed.

Digging more into this, I noticed that the LSTM layer's forward implementation is stateless, which can be problematic for frame-based processing. I changed it so that it keeps its final state and uses it as the initial state for the next input chunk. The output is different after this change, but it still sounds as if it's minimally processed.

Maybe other aspects of the encoders/decoders need to be re-implemented as well for frame-based processing, or am I missing something obvious?

Thanks!

Answered by JorisCos

JorisCos · 2022-01-05T14:30:45Z

Collaborator

Hello,

You are right: to use a network for streaming applications, it has to be causal (the network doesn't need information from the future to predict the present) and stateful (the network keeps a memory of the previous chunks to process the current one). Once this is done, there is no difference between processing a file all at once and processing it chunk by chunk.
So, to process chunks efficiently with DCCRNet, you would have to make the encoders, masker and decoders causal and stateful.
To do so:

- Make the 2D convolutions that operate on the encoded representation causal by padding on the left side (the past) of the time axis.
- Make them stateful by keeping a buffer with the data needed for the next chunk.
- Unidirectional LSTMs are already causal.
- LSTMs are made stateful by using the final state of the previous chunk as the initial state for the current chunk, as you said.
- Make the 2D transposed convolutions causal by removing the overlapping part of the output that will be completed by the next chunk.
- Make the 2D transposed convolutions stateful by adding the overlapping part from the previous chunk to the current one.
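To make the first two points concrete, here is a minimal sketch in plain Python. It is a 1-D simplification along the time axis only (the DCCRNet convolutions are 2-D), and the names `causal_conv` and `StreamingCausalConv` are made up for this illustration; they are not part of asteroid. It shows that left-padding plus a buffer of the last `kernel_size - 1` inputs gives the same output chunk by chunk as on the whole signal.

```python
def causal_conv(x, kernel):
    """Offline causal 1-D convolution (cross-correlation, as in deep learning).

    The input is left-padded with len(kernel) - 1 zeros, so output[t]
    depends only on x[t - k + 1 .. t], never on future samples.
    """
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(x)
    return [sum(kernel[j] * padded[t + j] for j in range(k))
            for t in range(len(x))]


class StreamingCausalConv:
    """Hypothetical streaming version of the same layer.

    The buffer holds the last k - 1 input samples. It starts as zeros,
    which plays the role of the left padding for the very first chunk.
    """

    def __init__(self, kernel):
        self.kernel = list(kernel)
        self.buffer = [0.0] * (len(self.kernel) - 1)

    def process(self, chunk):
        k = len(self.kernel)
        # Prepend the buffered past instead of zero padding.
        padded = self.buffer + list(chunk)
        out = [sum(self.kernel[j] * padded[t + j] for j in range(k))
               for t in range(len(chunk))]
        # Keep the last k - 1 inputs for the next call.
        self.buffer = padded[len(padded) - (k - 1):]
        return out
```

Feeding a signal through `StreamingCausalConv.process` in chunks of any size and concatenating the results should match `causal_conv` on the full signal.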

Note that, for the overlapping parts to line up correctly, the size of the chunks you process must be chosen according to the network parameters (such as kernel sizes and strides).
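The transposed-convolution part can be sketched the same way. Again this is a 1-D toy in plain Python with made-up names (`conv_transpose`, `StreamingConvTranspose`), assuming `kernel_size >= stride` and non-empty chunks. Each chunk of `n` input frames yields exactly `n * stride` finished output samples; the remaining `kernel_size - stride` samples are held back as a tail and added to the start of the next chunk's output, which is why the chunk size is tied to the layer parameters.

```python
def conv_transpose(x, kernel, stride):
    """Offline 1-D transposed convolution.

    Input frame t adds x[t] * kernel to output positions
    t * stride .. t * stride + len(kernel) - 1.
    """
    k = len(kernel)
    out = [0.0] * ((len(x) - 1) * stride + k) if x else []
    for t, v in enumerate(x):
        for j in range(k):
            out[t * stride + j] += v * kernel[j]
    return out


class StreamingConvTranspose:
    """Hypothetical streaming version using overlap-add with a carried tail."""

    def __init__(self, kernel, stride):
        assert len(kernel) >= stride, "sketch assumes kernel_size >= stride"
        self.kernel = list(kernel)
        self.stride = stride
        # Samples that the next chunk will still contribute to.
        self.tail = [0.0] * (len(self.kernel) - stride)

    def process(self, chunk):
        # Assumes a non-empty chunk.
        local = conv_transpose(chunk, self.kernel, self.stride)
        # Stateful: add the overlapping part left over from the previous chunk.
        for i, v in enumerate(self.tail):
            local[i] += v
        # Causal: emit only the samples no future frame can change,
        # and hold back the incomplete overlap for the next call.
        n_done = len(chunk) * self.stride
        self.tail = local[n_done:]
        return local[:n_done]
```

At the end of the stream, the remaining `tail` can be appended to the emitted samples to recover the full offline output.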

Sorry if this isn't really clear, but it is hard to explain with words alone. My advice is to take a sheet of paper and do the operations by hand to see what data to buffer, where to pad, etc.

1 reply

georgezachos Jan 5, 2022
Author

Yes, this is really clear, thanks!

mpariente · 2022-02-01T06:42:09Z

Maintainer

@EliasLum, I received a notification that you posted something?

If you resolved your issue, it's better to add the solution here as well rather than removing the comment altogether 😉

1 reply

EliasLum Feb 1, 2022

@mpariente, I indeed posted a question, since my frame-based implementation did not work as expected.

However, it turned out that the problem lay in the wrapper code, which is independent of asteroid and the model. So I figured that keeping the comment would only confuse potential readers or waste their time.

In any case, two things that might help others considering the frame-based approach:

- The changes @JorisCos proposed seem to work.
- Instead of altering the encoder, it is also possible to just call the model with 7-frame chunks and keep a rolling buffer outside the model.
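For the second point, a rough sketch of what such a rolling buffer could look like, in plain Python. Everything here is an assumption for illustration: `RollingBufferRunner` and `toy_model` are made-up names, the model is treated as a black-box function from a list of frames to a list of output frames, and which output frame to keep (`keep_index`, the last one by default) depends on the model and is not specified in the post above.

```python
from collections import deque


class RollingBufferRunner:
    """Hypothetical wrapper: keeps the last n_frames inputs outside the
    model and re-runs the (unmodified) model on that window for every
    new frame."""

    def __init__(self, model_fn, n_frames=7, keep_index=-1):
        self.model_fn = model_fn
        self.window = deque(maxlen=n_frames)  # oldest frame drops out automatically
        self.keep_index = keep_index          # assumption: keep the newest output frame

    def push(self, frame):
        self.window.append(frame)
        if len(self.window) < self.window.maxlen:
            return None  # window not full yet, nothing to emit
        out = self.model_fn(list(self.window))
        return out[self.keep_index]


def toy_model(frames):
    """Stand-in for a real model: a causal 3-frame moving average.
    Scalars stand in for audio frames here."""
    result = []
    for i in range(len(frames)):
        context = frames[max(0, i - 2):i + 1]
        result.append(sum(context) / len(context))
    return result
```

This recomputes the whole window for each new frame, so it trades extra computation for not having to touch the model internals. It only matches offline processing if the model's output for the kept frame depends on no more context than the window holds, which is true for the toy model above.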
