
insanely-fast-whisper backend #122

Closed
marziye-A opened this issue Sep 19, 2024 · 4 comments

Comments

@marziye-A

marziye-A commented Sep 19, 2024

Hi, thanks for your great work!
I want to use streaming mode with the insanely-fast-whisper backend. I am adding this backend, but I don't know what the ts_words function is. What is its purpose, what does it take as input, and does the output of the whisper backend need to have timestamps?

Can you please help me understand this function?
Any help is really appreciated.

@Gldkslfmsd
Collaborator

Hi, thanks. Why do you need insanely-fast-whisper? As far as I know, it uses faster-whisper, the same as we do.

Which ts_words function do you mean? Can you give a link to the line where it is specified?

And yes, whisper-streaming needs word-level timestamps.

@marziye-A
Author

Thank you for your answer.
I think it doesn't use the faster-whisper backend; it's based on Hugging Face Transformers and Flash Attention.

It is on this line for the faster-whisper backend:

def ts_words(self, segments):

and on this line for the OpenAI whisper backend:

def ts_words(self, segments):

I want to implement this function for the insanely-fast-whisper backend.

@Gldkslfmsd
Collaborator

Alright. ts_words is quite poorly documented here:

# return: transcribe result object to [(beg,end,"word1"), ...]

It converts the object returned by the transcribe function into a format that is the same for all backends: a list of tuples (beg, end, word), where beg and end are floats giving the seconds from the beginning of the recording in which the word was uttered.
word is a string. In faster-whisper it may be a subword; for example, "space-delimited" can come in two parts, " space" and "-delimited", which should not be joined with a space:

sep = " " # join transcribe words with this character (" " for whisper_timestamped,
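To illustrate the conversion described above, here is a minimal sketch of a ts_words implementation. It assumes a faster-whisper-style result where each segment exposes a .words list of objects with .start, .end, and .word attributes (as faster-whisper produces when word_timestamps=True); the named tuples below are stand-ins for those objects, not part of any real backend.

```python
from collections import namedtuple

# Stand-ins for the objects a faster-whisper-style transcribe() would return.
Word = namedtuple("Word", ["start", "end", "word"])
Segment = namedtuple("Segment", ["words"])

def ts_words(segments):
    """Flatten backend segments into [(beg, end, "word"), ...].

    beg and end are floats: seconds from the beginning of the recording.
    Words may be subwords (e.g. " space" and "-delimited") and are kept as-is,
    to be joined later with the backend's sep character.
    """
    out = []
    for segment in segments:
        for w in segment.words:
            out.append((w.start, w.end, w.word))
    return out

segments = [Segment(words=[Word(0.0, 0.4, " Hello"), Word(0.5, 0.9, " world")])]
print(ts_words(segments))  # [(0.0, 0.4, ' Hello'), (0.5, 0.9, ' world')]
```

For a new backend, the only work is mapping that backend's own result object (e.g. the chunks a Transformers pipeline returns) into this same list-of-tuples shape.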

@Gldkslfmsd
Collaborator

I think it doesn't use the faster-whisper backend; it's based on Hugging Face Transformers and Flash Attention.

OK. I think the speed of insanely-fast-whisper comes from using large memory and batching. That is applicable only in offline mode: you can chunk the whole long recording into small pieces and process them in parallel. In streaming mode you can use batching as in #55 and #42; it should give some speedup, but not much.

But anyway, feel free to try it and share your latency-quality test results compared to faster-whisper. Or make a PR and I may run the test.
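To make the offline-mode chunking idea above concrete, here is an illustrative sketch (not code from either project): splitting a long recording into fixed-size chunks that could then be transcribed as a batch. The function name and parameters are hypothetical.

```python
def chunk_audio(samples, chunk_seconds, sample_rate=16000):
    """Split a 1-D sequence of audio samples into chunks of chunk_seconds each.

    The last chunk may be shorter. These chunks could then be batched and
    transcribed in parallel, which is what makes the offline mode fast.
    """
    step = int(chunk_seconds * sample_rate)
    return [samples[i:i + step] for i in range(0, len(samples), step)]

audio = [0.0] * (16000 * 70)  # 70 seconds of silence at 16 kHz
chunks = chunk_audio(audio, 30)
print([len(c) / 16000 for c in chunks])  # [30.0, 30.0, 10.0]
```

In streaming mode this does not apply directly, since audio arrives incrementally and future chunks are not yet available for batching.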
