[websocket server] Sample rate impact : 8kHz vs 16kHz vs user-supplied #242
Comments
For in-browser recognition, it is much better to use the WebRTC server: it uses the Opus codec and is much more responsive. 8 kHz audio is significantly less accurate; if the browser records wideband audio, it is recommended to use wideband.
I avoided WebRTC because it seemed much more difficult to set up (I would need a STUN/TURN server, if I understand correctly, and a bunch of port forwarding), while websockets seemed much easier (I'm already using them in production). What's the benefit of the Opus codec? From what I understand, the WebRTC server has to convert the audio to WAV before sending it to Kaldi (since Kaldi only works on WAV), so from a quality point of view it shouldn't be any different. What do you mean by much more responsive? The delay between the user talking and the actual voice recognition?
Does the same apply to "regular" audio files? For example, right now I'm parsing many kinds of audio files (different formats, different sources) and using ffmpeg to convert them to 16 kHz WAV audio before sending them to vosk-server. Would I benefit from converting them to a higher sampling rate (say 44.1 kHz or 48 kHz) and sending that to vosk-server?
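The ffmpeg conversion step described above could look roughly like the sketch below. It only builds the command line; the helper name `ffmpeg_to_wav_cmd` and the file names are hypothetical, and the choice of mono 16-bit PCM is an assumption about what the server expects rather than something stated in this thread.

```python
def ffmpeg_to_wav_cmd(src, dst, sample_rate=16000):
    """Build an ffmpeg command converting `src` to mono 16-bit PCM WAV.

    Hypothetical helper: mono output and 16-bit samples are assumptions.
    """
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    return [
        "ffmpeg",
        "-nostdin",               # do not read from stdin (safe in scripts)
        "-y",                     # overwrite the output file if it exists
        "-i", src,                # input file, any format ffmpeg can decode
        "-ar", str(sample_rate),  # resample to the target rate
        "-ac", "1",               # downmix to a single channel
        "-c:a", "pcm_s16le",      # 16-bit little-endian PCM
        dst,
    ]


# Placeholder file names, for illustration only.
print(" ".join(ffmpeg_to_wav_cmd("input.mp3", "output.wav")))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where ffmpeg is installed.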
Opus compresses the data, so instead of sending 1 KB of WAV you send 100 bytes of Opus.
WebRTC also works over UDP, so it doesn't wait for a packet round trip; otherwise, if the network ping latency is 100 ms, you will have a 200 ms packet round-trip delay.
Opus decoding is done within the Vosk server; you can check the code.
You don't need STUN if your server is public. Many services use WebRTC, such as Zoom and others.
Vosk models are 16 kHz, so you won't benefit from converting to a 48 kHz sampling rate. In the future we might release 48 kHz models; then it will be better to send 48 kHz.
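To put rough numbers on the compression point, here is a small sketch of the arithmetic. It assumes uncompressed 16-bit mono PCM on the websocket side and an Opus bitrate of 24 kbit/s; that bitrate is an illustrative assumption, not a value taken from the Vosk server.

```python
def pcm_bytes_per_second(sample_rate, bits_per_sample=16, channels=1):
    """Raw PCM data rate in bytes per second."""
    return sample_rate * channels * bits_per_sample // 8


def compressed_bytes_per_second(bitrate_bps):
    """Data rate in bytes per second for a codec running at `bitrate_bps`."""
    return bitrate_bps // 8


# Assumption for illustration only: a plausible Opus speech bitrate.
OPUS_BITRATE_BPS = 24_000

for rate in (8000, 16000, 48000):
    pcm = pcm_bytes_per_second(rate)
    opus = compressed_bytes_per_second(OPUS_BITRATE_BPS)
    print(f"{rate} Hz PCM: {pcm} B/s, Opus: {opus} B/s, ratio {pcm / opus:.1f}x")
```

Under these assumptions the ratio at 16 kHz comes out at roughly ten to one, which is in line with the "1 KB versus 100 bytes" figure above. Sending 48 kHz PCM triples the raw data rate compared with 16 kHz.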
On Sat, Nov 4, 2023 at 5:32 PM GuillaumeV-cemea wrote:
"8 kHz is significantly less accurate; if the browser records wideband audio, it is recommended to use wideband."
If I understand correctly, it's better to record at as high a sampling rate as possible in the browser and send that directly to vosk-server, rather than downsampling it to 16 kHz (or whatever rate you chose for vosk-server) before sending it?
Hi,
I've been using vosk-server, specifically the websocket server with the Dockerfile, for a while now with a 16 kHz sample rate (I don't remember exactly why, to be honest). I'm looking into developing a web extension that sends raw audio data to the websocket server, and I've noticed that most (if not all) of the examples use an 8 kHz sample rate.
Is there any benefit to using 8 kHz instead of 16 kHz (or any other sample rate), as long as I supply Kaldi's model with the correct sample rate, of course?
I'm asking because the websocket server allows runtime configuration of sample_rate (by sending a config message), and from my limited testing this works perfectly fine. For example, asking my browser to downsample the user's mic to 8 kHz and sending that to vosk-server gives me the same result as sending audio at whatever my browser's base sample rate is (usually 48 kHz) directly to vosk-server.
So if I can avoid any kind of client-side downsampling (which is difficult because only Chrome does it natively, so I would have to come up with another solution for Firefox) and just send whatever input data I have to vosk-server, it would be much easier.
Cheers,
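For reference, the runtime configuration mentioned above can be sketched as the sequence of messages a client would send. This is a minimal sketch, not the server's documented API: the helper name `vosk_message_sequence` is hypothetical, and the JSON shapes (`config` with `sample_rate`, then a final `eof` message) are written from memory of the Python example client, so they should be checked against the vosk-server code.

```python
import json


def vosk_message_sequence(pcm, sample_rate, chunk_bytes=8000):
    """Yield the messages for one utterance: config, audio chunks, end marker.

    Assumption: message shapes mirror the vosk-server Python example client.
    """
    # Text message announcing the sample rate of the audio that follows.
    yield json.dumps({"config": {"sample_rate": sample_rate}})
    # Binary messages carrying raw 16-bit PCM, split into fixed-size chunks.
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]
    # Text message telling the server that the stream is finished.
    yield json.dumps({"eof": 1})


# One second of silence at 48 kHz, 16-bit mono, as stand-in audio.
silence = b"\x00" * (48000 * 2)
for message in vosk_message_sequence(silence, sample_rate=48000):
    kind = "text" if isinstance(message, str) else "binary"
    print(kind, len(message))
```

Each yielded item would be passed to the websocket's send call in order, with strings sent as text frames and bytes as binary frames.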