New Fork: Web client + WebSocket + own VAD impl. #105

Open
marcinmatys opened this issue Jul 8, 2024 · 9 comments

Comments

@marcinmatys

I have created a fork of whisper_streaming, so I took the liberty of writing about it here.
We may close this issue soon, as it is for information only.

I encourage you to check it out if you are interested in topics such as
a web browser-based client with WebSocket communication,
voice activity detection, and silence processing.

If you have any comments, please write here or check out the feedback section in my README.

@vuduc153

vuduc153 commented Jul 8, 2024

@marcinmatys Hi, thanks for the fork, it's really a godsend since I was looking to put together something similar. :)
One thing I noticed is that the VAD seems to reset the timestamp to 0 every time it starts again after a silence period. Is this the expected behavior?

@marcinmatys
Author

@vuduc153 Thanks for your feedback.

When silence is detected, the OnlineASRProcessor finish() and init() methods are called to read the uncommitted transcription and clear the buffer. We then lose the context and the uncommitted transcription, but in my opinion it does not have a significant impact on quality. However, I must say that this implementation is just my experiment. You have to run the tests yourself and decide whether it is appropriate or not.

You could remove the line online.init() from the code below and check the difference.

if not silence_started:
    o = online.finish()   # flush the uncommitted transcription
    online.init()         # clear the buffer (removing this line keeps the context)

@vuduc153

vuduc153 commented Jul 8, 2024

@marcinmatys Thanks for the reply, I just wanted to confirm that this is indeed the intended logic.
There's also an issue with really long pauses (>10 s) with the current code. Since rms is calculated as the root mean square of the ongoing silence_candidate_chunk, when speech starts again after a long pause, rms will still be under the SILENCE_THRESHOLD for a while, until the new data brings the mean back up above the threshold. From my experience it takes around 1/10 of the duration of the pause for the ASR to pick up again, which means the first sentence after a pause will lose some words at the beginning.

Calculating rms per received audio chunk might be a better way to approach this. I have slightly modified the logic in this section in a PR. Let me know what you think.
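
For illustration, here is a minimal numpy sketch of the two approaches being discussed; the names silence_candidate_chunk and SILENCE_THRESHOLD follow the discussion above, the threshold value is a placeholder, and the fork's actual code may differ:

```python
import numpy as np

SILENCE_THRESHOLD = 0.01  # placeholder value, not the fork's actual setting

def rms(samples: np.ndarray) -> float:
    """Root mean square of a block of float audio samples."""
    return float(np.sqrt(np.mean(samples ** 2)))

# Current approach: rms over the whole accumulated silence candidate.
# After a long pause the buffer is dominated by near-zero samples, so new
# speech raises the mean only slowly and detection lags behind the speaker.
def is_speech_accumulated(silence_candidate_chunk: np.ndarray) -> bool:
    return rms(silence_candidate_chunk) >= SILENCE_THRESHOLD

# Suggested approach: rms of each newly received chunk on its own, so the
# first chunk that actually contains speech crosses the threshold immediately.
def is_speech_per_chunk(chunk: np.ndarray) -> bool:
    return rms(chunk) >= SILENCE_THRESHOLD
```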

@Gldkslfmsd
Collaborator

Thanks for the nice work, @marcinmatys. I briefly looked at your README and found that you're using numpy sound-intensity detection as "VAD". I think that way you can detect silence vs. non-silence. What about noise vs. speech?

In the vad_streaming branch I'm using Silero VAD, a neural torch model, to detect non-voice (such as noise, silence, music, etc.) vs. voice. It should be more robust than your numpy approach. Silero is used as the VAD in the default offline Whisper, and it was recommended to me in #39.
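
For reference, a minimal sketch of how Silero VAD is typically loaded through torch.hub and queried per audio chunk; the vad_streaming branch may wire it up differently:

```python
import torch

# Download and load the Silero VAD model (cached by torch.hub after first use).
model, _utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')

def voice_probability(chunk: torch.Tensor, sample_rate: int = 16000) -> float:
    # Silero VAD expects short fixed-size chunks (e.g. 512 samples at 16 kHz)
    # and returns the probability that the chunk contains voice, not just sound.
    return model(chunk, sample_rate).item()

# Example: classify a 512-sample chunk of 16 kHz float32 audio.
# chunk = torch.from_numpy(audio_chunk)
# if voice_probability(chunk) > 0.5:
#     ...  # treat as speech
```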

@marcinmatys
Author

@vuduc153 Thanks for this information and the PR. You are right; there is probably an issue with long pauses. However, there is also a problem with your new logic, so we need to improve your fix. I will write the details in a PR comment.

@marcinmatys
Author

@Gldkslfmsd Thank you for your response and explanations.
I need to look at and test the vad_streaming branch one more time and check your silence removal logic.
Do you have any plans to eventually verify vad_streaming and merge it into the main branch?

Silero definitely has more capabilities, as you said, but in some cases I think numpy can also handle it. It depends on the environment we are in: whether we have noise around us, what kind of noise it is, and what microphone we are using.

We have two types of microphones: a headset microphone, positioned near the mouth, and an omnidirectional microphone, used in conference settings, which captures sound from all directions.

I performed some tests using a headset microphone while playing some conversations (probably football match commentary) from another speaker on the desk next to me. The headset microphone did not pick up this noise even when the other speaker was really close.

Do you think that numpy sound-intensity detection could work more efficiently than Silero? Maybe there should be an option to use either one: if we need a more robust tool, we use Silero, but if not, we use simple numpy.

@Gldkslfmsd
Collaborator

> @Gldkslfmsd Thank you for your response and explanations. I need to look at and test the vad_streaming branch one more time and check your silence removal logic.
>
> Do you have any plans to eventually verify vad_streaming and merge it into the main branch?

It's verified and it works very well, but the code is ugly. It needs to be cleaned up, made transparent and self-documented. Then it can be merged.

It's not in my schedule right now.

> Silero definitely has more capabilities, as you said, but in some cases I think numpy can also handle it. It depends on the environment we are in: whether we have noise around us, what kind of noise it is, and what microphone we are using.
>
> We have two types of microphones: a headset microphone, positioned near the mouth, and an omnidirectional microphone, used in conference settings, which captures sound from all directions.
>
> I performed some tests using a headset microphone while playing some conversations (probably football match commentary) from another speaker on the desk next to me. The headset microphone did not pick up this noise even when the other speaker was really close.
>
> Do you think that numpy sound-intensity detection could work more efficiently than Silero? Maybe there should be an option to use either one: if we need a more robust tool, we use Silero, but if not, we use simple numpy.

I believe there are some good reasons why Silero exists. Check their paper and other VAD papers. They may have tested it rigorously, and you can reproduce some of the tests.

Numpy may be faster, simpler to install, and good enough for many. If you present evidence, we can integrate it as an option.
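
If the numpy detector did get integrated as an option alongside Silero, one possible shape for it would be a small pluggable interface. This is only a hypothetical sketch, not code from either branch; the function names and the idea of a backend-selection option are assumptions:

```python
from typing import Callable
import numpy as np
import torch

SILENCE_THRESHOLD = 0.01  # placeholder, would need to be tuned per microphone

def numpy_vad(chunk: np.ndarray) -> bool:
    # Simple sound-intensity detector: cheap, no extra dependencies,
    # but it only separates silence from non-silence.
    return float(np.sqrt(np.mean(chunk ** 2))) >= SILENCE_THRESHOLD

# Neural detector: separates voice from noise, music and silence.
_silero_model, _ = torch.hub.load('snakers4/silero-vad', 'silero_vad')

def silero_vad(chunk: np.ndarray, sample_rate: int = 16000) -> bool:
    # chunk: short float32 mono audio block (e.g. 512 samples at 16 kHz)
    return _silero_model(torch.from_numpy(chunk), sample_rate).item() > 0.5

def get_vad(backend: str) -> Callable[[np.ndarray], bool]:
    # backend could be chosen by a hypothetical --vad-backend option
    return {'numpy': numpy_vad, 'silero': silero_vad}[backend]
```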

@Gldkslfmsd
Collaborator

Gldkslfmsd commented Oct 29, 2024

Hi @marcinmatys,
thanks for your work. The traffic in #134 and my colleague's project suggest that a websocket server would be a very appreciated extension of Whisper-Streaming, although it doesn't fit into this project, because I don't have the capacity to maintain it and not everybody needs websockets.

So I suggest you create a new repo, whisper_streaming_websocket or whatever. Definitely put your web client + websocket server there; about your VAD I'm not sure, it's up to you.
Add this repo to yours as a submodule -- don't duplicate any code or the README if possible, but reference it. Make it nicely commented and documented, so that people can extend your websocket server as they need -- someone may need authorisation, parallel sessions, setting parameters through the client, ... but you can't satisfy everything.

I will then reference your project from the README and give you credit.

Thanks!

Good luck!

@marcinmatys
Author

@Gldkslfmsd OK, sure.
But I need to clarify and make sure about a few things below.

  • Will it be enough to rename the project from whisper_streaming to e.g. whisper-transcription but leave it as a fork of whisper_streaming, or should I delete the fork and create a completely new project?
  • Do I have to create a git submodule, or can I simply copy your whisper_streaming (selected files) into a separate package in my project?
  • In the future I am going to remove my VAD and use your better solution with Silero VAD 5, but I need to test it first. I haven't had time to do it yet.
  • I will probably name my project whisper-transcription, as I mentioned above, because in the future I may add some other examples of transcription (not real time). I mean simple transcription triggered by VAD for subsequent fragments of audio (we could call it nearly real time).

And finally, please give me some time for that, because right now I am engaged in other projects...
