[websocket server] Sample rate impact : 8kHz vs 16kHz vs user-supplied #242

Open
GuillaumeV-cemea opened this issue Nov 4, 2023 · 3 comments

Comments

@GuillaumeV-cemea

Hi,

I've been using vosk-server for a while now, specifically the websocket server with the Dockerfile, at a 16 kHz sample rate (I don't remember exactly why, to be honest). I'm looking into developing a web extension that sends raw audio data to the websocket server, and I've noticed that most (if not all) of the examples use an 8 kHz sample rate.

Is there any benefit to using 8 kHz instead of 16 kHz (or any other sample rate), as long as I supply Kaldi's model with the correct sample rate, of course?

I'm asking because the websocket server allows runtime configuration of sample_rate (by sending a config message), and in my limited testing this works perfectly fine. For example, asking my browser to downsample the user's mic to 8 kHz and sending that to vosk-server gives me the same result as sending audio at whatever my browser's base sample rate is (usually 48 kHz) directly to vosk-server.
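A minimal sketch of that message sequence, assuming the protocol used by vosk-server's Python test client: a JSON text message with a `config` object, then binary chunks of 16-bit mono PCM, then a JSON `eof` message. The helper names `build_config_message` and `frames_for_stream` and the 8000-byte chunk size are made up for illustration.

```python
import json
from typing import Iterator, Union


def build_config_message(sample_rate: int) -> str:
    """JSON text frame announcing the sample rate of the PCM that follows."""
    return json.dumps({"config": {"sample_rate": sample_rate}})


def frames_for_stream(
    pcm: bytes, sample_rate: int, chunk_bytes: int = 8000
) -> Iterator[Union[str, bytes]]:
    """Yield frames in sending order: config (text), PCM chunks (binary), eof (text).

    Assumption: the server reads the config message before any audio arrives.
    """
    yield build_config_message(sample_rate)
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]
    yield json.dumps({"eof": 1})
```

Each `str` would go out as a text frame and each `bytes` as a binary frame, using whatever websocket client is already in place; only the `sample_rate` value changes between the 8 kHz and 48 kHz cases.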

So if I could avoid any kind of client-side downsampling (which is difficult because only Chrome does it natively, so I would have to come up with another solution for Firefox) and just send whatever input data I have to vosk-server, it would be much easier.
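One cost of skipping the downsampling step is bandwidth. As a rough back-of-the-envelope sketch, assuming uncompressed 16-bit mono PCM (the function name is hypothetical), the raw data rate scales linearly with the sample rate:

```python
def pcm_bytes_per_second(
    sample_rate_hz: int, bytes_per_sample: int = 2, channels: int = 1
) -> int:
    """Raw data rate of uncompressed PCM; defaults assume 16-bit mono."""
    return sample_rate_hz * bytes_per_sample * channels


for rate in (8000, 16000, 48000):
    print(f"{rate} Hz -> {pcm_bytes_per_second(rate)} bytes/s")
# 8000 Hz -> 16000 bytes/s
# 16000 Hz -> 32000 bytes/s
# 48000 Hz -> 96000 bytes/s
```

So sending 48 kHz audio as-is costs three times the upload of 16 kHz and six times that of 8 kHz, which may or may not matter depending on the connection.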

Cheers,

@nshmyrev
Contributor

nshmyrev commented Nov 4, 2023

For in-browser recognition it is much better to use the WebRTC server: it uses the Opus codec and is much more responsive. 8 kHz is significantly less accurate; if the browser records wideband audio, it is recommended to use wideband.

@GuillaumeV-cemea
Author

I avoided WebRTC because it seemed much more difficult to set up (I'd need a STUN/TURN server, if I understand correctly, and a bunch of port forwarding), whereas websockets seemed much easier (since I'm already using them in production).

What's the benefit of the Opus codec? From what I understand, the WebRTC server will have to convert it to WAV before sending it to Kaldi (since Kaldi only works on WAV input), so from a quality point of view it shouldn't be any different.

What do you mean by much more responsive? The delay between the user talking and the actual voice recognition?

> 8khz significantly less accurate, if browser records wideband audio it is recommended to use wideband.

If I understand correctly, it's better to record at as high a sampling rate as possible in the browser and send that directly to vosk-server, rather than downsampling it to 16 kHz (or whatever rate was chosen for vosk-server) before sending?

Does the same apply to "regular" audio files? For example, right now I'm parsing many kinds of audio files (different formats, different sources), using ffmpeg to convert them to 16 kHz WAV audio, and then sending that to vosk-server. Would I benefit from converting them to WAV at a higher sampling rate (say 44.1 kHz or 48 kHz) and sending that to vosk-server instead?
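For context, the conversion step described above might look like the following sketch. It assumes mono 16-bit PCM WAV output, which is not stated above, and the function names and file paths are hypothetical. Only the `-ar` value would change when trying 44.1 kHz or 48 kHz instead of 16 kHz.

```python
import subprocess
from typing import List


def ffmpeg_to_wav_cmd(src: str, dst: str, sample_rate: int = 16000) -> List[str]:
    """Build an ffmpeg command converting any input to mono 16-bit PCM WAV."""
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it exists
        "-i", src,                # input file, any format ffmpeg can decode
        "-ac", "1",               # downmix to one channel (assumption: mono)
        "-ar", str(sample_rate),  # resample to the target rate
        "-c:a", "pcm_s16le",      # 16-bit little-endian PCM
        dst,
    ]


def convert(src: str, dst: str, sample_rate: int = 16000) -> None:
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(ffmpeg_to_wav_cmd(src, dst, sample_rate), check=True)
```

For instance, `convert("input.mp3", "output.wav", sample_rate=48000)` would produce the higher-rate variant to compare against the 16 kHz one.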

@nshmyrev
Contributor

nshmyrev commented Nov 4, 2023 via email


2 participants