Using whisperfile vs whisper.cpp (or server) for Linux speech input #551
Replies: 2 comments 1 reply
Thanks Justine!
After sending the speech recording, however, the whisperfile server bombs out with a segmentation fault. I looked briefly at the source code of server.cpp in this repo and compared it against the upstream file, and I do not see substantial differences. Yet in server mode, and only with --gpu auto, whisperfile with the newly compiled ggml-cuda.so seems to misbehave.
Hello All,
A big fan of llamafile (thanks, Justine), I like the addition of whisperfile (thanks, CJ Pais) to the latest release. As the author of BlahST (Blah Speech to Text), a lean, low-resource tool for entering text from speech into any Linux window using whisper.cpp, I have just added whisperfile support to BlahST. This should make things easier for users who would otherwise have to compile whisper.cpp's main or server binaries on their Linux machines in order to use BlahST.
A few benchmarks: on an AMD64 machine (znver4) with 12 GB VRAM (CUDA 12.5), using 8 CPU threads, I am getting about 90x realtime (90xRT: 7.6 s of 16-bit audio returned as text in 84 ms) transcription of microphone speech with the whisper.cpp server. That is the fastest setup, partly because the model (base.en) is preloaded in VRAM and the GPU is ready.
It is about 23xRT with whisperfile.tiny.en and about 4xRT when using whisperfile.small. This compares favorably with whisper.cpp main (tiny.en), which transcribes at 21xRT.
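For anyone reproducing these numbers, the realtime factor above is simply the clip duration divided by the processing time. A minimal sketch of that arithmetic (the 7.6 s / 84 ms figures are taken from the benchmark above; the variable names are my own):

```shell
# Compute the realtime factor (xRT) of a transcription run:
# audio_s  = clip length in seconds, proc_ms = processing time in milliseconds.
audio_s=7.6
proc_ms=84
awk -v a="$audio_s" -v p="$proc_ms" 'BEGIN { printf "%.0fxRT\n", a / (p / 1000) }'
# prints "90xRT"
```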
The whisperfile runs are without the --gpu auto flag, since for the small models and short speech clips in this use case the GPU startup overhead dominates and there is no speedup. I have not yet tried setting up whisperfile as a server; that should be closer to the whisper.cpp server in performance. I will post here when I have a comparison of the servers with larger speech clips. Thanks again for this heavy-hitting, portable-executable concept and all the clever optimizations of the linear algebra code in ggml!