New Fork: Web client + WebSocket + own VAD impl. #105

marcinmatys · 2024-07-08T09:45:29Z

I have created a fork of whisper_streaming, so I took the liberty of writing about it here.
We may close this issue soon, as it is for information only.

I encourage you to check it out if you are interested in topics such as
a web browser-based client with WebSocket communication,
Voice Activity Detection (VAD), and silence processing.

If you have any comments, please write them here or check out the feedback section in my README.

vuduc153 · 2024-07-08T10:59:30Z

@marcinmatys Hi, thanks for the fork. It's really a godsend, since I was looking to put together something similar. :)
One thing I noticed is that the VAD seems to reset the timestamp to 0 every time it starts again after a silence period. Is this the expected behavior?

marcinmatys · 2024-07-08T11:15:58Z

@vuduc153 Thanks for your feedback.

When silence is detected, the OnlineASRProcessor finish() and init() methods are called to read the uncommitted transcription and clear the buffer. As a result, we lose context and output a transcription that is still uncommitted, but in my opinion this does not have a significant impact on quality. However, I must say that this implementation is just my experiment. You have to run the tests yourself and decide whether it is appropriate or not.

You could remove the online.init() line from the code below and check the difference.

if not silence_started:
    o = online.finish()
    online.init()
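
To illustrate why timestamps restart at 0 after init(), here is a minimal sketch. ToyProcessor is a hypothetical stand-in, not the real OnlineASRProcessor, and transcribe_segments is a made-up helper. It assumes timestamps are reported relative to the start of the current buffer, and shows one possible way for the caller to keep absolute times by tracking its own offset.

```python
class ToyProcessor:
    """Hypothetical stand-in for an online ASR processor (not whisper_streaming code)."""

    def __init__(self):
        self.init()

    def init(self):
        # Clearing the buffer also resets the time base to 0.
        self.buffer_seconds = 0.0

    def insert_audio_chunk(self, seconds):
        self.buffer_seconds += seconds

    def finish(self):
        # Timestamps are relative to the start of the current buffer.
        return (0.0, self.buffer_seconds)


def transcribe_segments(segments, gap_seconds):
    """Feed speech segments separated by silence; return absolute (start, end) pairs."""
    online = ToyProcessor()
    offset = 0.0  # absolute time at which the current buffer started
    results = []
    for duration in segments:
        online.insert_audio_chunk(duration)
        start, end = online.finish()   # relative times, start is always 0.0 here
        results.append((offset + start, offset + end))
        online.init()                  # buffer and relative clock are reset
        offset += duration + gap_seconds
    return results
```

Without the offset bookkeeping, every segment would start at 0.0 again, which matches the behavior described above.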

vuduc153 · 2024-07-08T14:53:09Z

@marcinmatys Thanks for the reply. I just wanted to confirm that this is indeed the intended logic.
There's also an issue with really long pauses (>10s) in the current code. Since rms is calculated as the root mean square of the ongoing silence_candidate_chunk, when speech starts again after a long pause, rms stays under SILENCE_THRESHOLD for a while, until the new data brings the mean back up above the threshold. In my experience it takes around 1/10 of the pause duration for the ASR to pick up again, which means the first sentence after a pause loses some words at the beginning.

Calculating rms per received audio chunk might be a better approach. I have slightly modified the logic in this section in a PR. Let me know what you think.
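
The lag can be reproduced with a small sketch. This is not the fork's code: the threshold, chunk size, and amplitudes are hypothetical values, and it uses plain Python instead of numpy to stay dependency-free. It compares rms over the whole accumulated buffer (playing the role of silence_candidate_chunk) with rms over only the latest chunk.

```python
import math

SILENCE_THRESHOLD = 0.01  # hypothetical value, not taken from the fork
CHUNK_SIZE = 160          # hypothetical number of samples per received chunk


def rms(samples):
    """Root mean square of a sequence of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def speech_chunks_before_detection(pause_chunks, cumulative):
    """Count the speech chunks consumed until rms first exceeds the threshold."""
    silence = [0.0] * CHUNK_SIZE
    speech = [0.05] * CHUNK_SIZE   # constant amplitude keeps the arithmetic simple
    buffer = []                    # accumulates everything since the pause started
    for _ in range(pause_chunks):
        buffer.extend(silence)
    for n in range(1, 1000):
        buffer.extend(speech)
        # Cumulative mode averages over the whole pause; per-chunk mode does not.
        window = buffer if cumulative else speech
        if rms(window) > SILENCE_THRESHOLD:
            return n
    return None
```

With these assumed numbers, the per-chunk variant reacts on the first speech chunk, while the cumulative variant needs more speech chunks the longer the pause was. The exact ratio depends on the threshold and the signal level, so it will not necessarily match the 1/10 observed above.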

Gldkslfmsd · 2024-07-08T19:33:45Z

Thanks for the nice work, @marcinmatys. I briefly looked at your README2 and found that you're using numpy sound intensity detection as "VAD". I think that this way you can detect silence vs. non-silence. What about noise vs. speech?

In the vad_streaming branch I'm using Silero VAD, a neural torch model, to detect non-voice (such as noise, silence, music, etc.) vs. voice. It should be more robust than your numpy approach. Silero is used in the default offline Whisper as VAD, and it was recommended to me in #39.

marcinmatys · 2024-07-09T11:48:59Z

@vuduc153 Thanks for this information and PR. You are right; there is probably an issue with long pauses. However, there is also a problem with your new logic. We need to improve your fix. I will write the details in the PR comment.

marcinmatys · 2024-07-11T12:01:11Z

@Gldkslfmsd Thank you for your response and explanations.
I need to look at and test the vad-streaming branch one more time and check your silence removal logic.
Do you have any plans to finally verify vad-streaming and merge it into the main branch?

Silero definitely has more capabilities, as you said, but I think numpy can also handle some cases. It depends on the environment we are in: whether there is noise around us, what kind of noise it is, and what microphone we are using.

We have two types of microphones:
Headset microphone: the microphone in a headset, positioned near the mouth.
Omnidirectional microphone: a microphone used in conference settings that captures sound from all directions.

I performed some tests using a headset microphone while playing some conversations (probably football match commentary) from another speaker on the desk next to me. The headset microphone did not pick up this noise, even when the other speaker was really close.

Do you think that numpy sound intensity detection could work more efficiently than Silero? Maybe there should be an option to choose between them. If we need a more robust tool, we use Silero; if not, we use simple numpy.

Gldkslfmsd · 2024-07-11T16:42:29Z

> @Gldkslfmsd Thank you for your response and explanations. I need to look at and test the vad-streaming branch one more time and check your silence removal logic.
>
> Do you have any plans to finally verify vad-streaming and merge it into the main branch?

It's verified and it works very well, but the code is ugly. It needs to be cleaned up, made transparent, and self-documented. Then it can be merged.

That's not in my schedule right now.

> Silero definitely has more capabilities, as you said, but I think numpy can also handle some cases. It depends on the environment we are in: whether there is noise around us, what kind of noise it is, and what microphone we are using.
>
> We have two types of microphones: Headset microphone: the microphone in a headset, positioned near the mouth. Omnidirectional microphone: a microphone used in conference settings that captures sound from all directions.
>
> I performed some tests using a headset microphone while playing some conversations (probably football match commentary) from another speaker on the desk next to me. The headset microphone did not pick up this noise, even when the other speaker was really close.
>
> Do you think that numpy sound intensity detection could work more efficiently than Silero? Maybe there should be an option to choose between them. If we need a more robust tool, we use Silero; if not, we use simple numpy.

I believe there are good reasons why Silero exists. Check their paper and other VAD papers. They may have tested it rigorously, and you can reproduce some of the tests.

Numpy may be faster, simpler to install, and good enough for many. If you present evidence, we can integrate it as an option.
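
As a rough sketch of what such an option could look like: all names below (VoiceDetector, energy_detector, neural_detector_placeholder, make_detector) are hypothetical and exist in neither project. The energy-based detector is a plain-Python rms check with an assumed threshold, and the neural one is left as a placeholder rather than a real Silero call.

```python
import math
from typing import Callable, Dict, Sequence

# A detector takes one audio chunk (float samples) and says whether it is "voice".
VoiceDetector = Callable[[Sequence[float]], bool]


def energy_detector(threshold: float = 0.01) -> VoiceDetector:
    """Sound-intensity check: anything louder than the threshold counts as voice."""
    def is_voice(chunk: Sequence[float]) -> bool:
        return math.sqrt(sum(s * s for s in chunk) / len(chunk)) > threshold
    return is_voice


def neural_detector_placeholder() -> VoiceDetector:
    """Placeholder where a neural VAD such as Silero could be plugged in."""
    def is_voice(chunk: Sequence[float]) -> bool:
        raise NotImplementedError("wrap a neural VAD model here")
    return is_voice


DETECTORS: Dict[str, Callable[[], VoiceDetector]] = {
    "energy": energy_detector,
    "neural": neural_detector_placeholder,
}


def make_detector(name: str) -> VoiceDetector:
    """Select a detector by name, e.g. from a command-line option."""
    return DETECTORS[name]()
```

Note that the energy detector reports any sufficiently loud chunk as voice, so loud noise or music passes it as well. That is the silence vs. non-silence limitation raised earlier in this thread.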

Gldkslfmsd · 2024-10-29T08:44:19Z

Hi @marcinmatys,
thanks for your work. The traffic in #134 and my colleague's project suggest that a websocket server would be a much-appreciated extension of Whisper-Streaming. However, it doesn't fit into this project, because I don't have the capacity to maintain it and not everybody needs websockets.

So I suggest you create a new repo, whisper_streaming_websocket or whatever. Definitely put your Web client + websocket server there; as for your VAD, I'm not sure, it's up to you.
Add this repo to yours as a submodule. If possible, don't duplicate any code or README, but reference it. Make it nicely commented and documented so that people can extend your websocket server as they need. Someone may need authorisation, parallel sessions, setting parameters through the client, ... but you can't satisfy everyone.

I will then reference your project from the README and give you credit.

Thanks!

Good luck!

marcinmatys · 2024-10-29T18:54:35Z

@Gldkslfmsd OK, sure.
But I need to clarify a few things first.

Will it be enough to rename the project from whisper_streaming to e.g. whisper-transcription but leave it as forked from whisper_streaming, or should I delete the fork and create a completely new project?
Do I have to create a git submodule, or can I simply copy selected files of your whisper_streaming into a separate package in my project?
In the future I am going to remove my VAD and use your better solution with Silero VAD 5, but I need to test it first. I haven't had time to do that yet.
I will probably name my project whisper-transcription, as I mentioned above, because I may add other examples of transcription (not real time) in the future. I mean simple transcription triggered by VAD for subsequent fragments of audio (we could call it nearly real time).

And finally, please give me some time for that, because I am currently engaged in other projects.

marcinmatys mentioned this issue Sep 2, 2024

Explanation of using VAD, VAC #117

Closed

marcinmatys mentioned this issue Oct 22, 2024

Feeding raw audio data to faster whisper over websockets #134

Closed

Gldkslfmsd mentioned this issue Dec 1, 2024

How to connect to whisper_online_server from web #144

Closed
