Recasepunc #9
But there are problems with pip install because of missing attributes
…/punctation # Conflicts: # scripts/transcribe.py
Thanks for the contribution. I've added a few comments to this. If anything is unclear, please don't hesitate to reach out and ask. In addition to the comments below, this pull request breaks the tests, which need to be fixed, and it doesn't quite fit the new structure.
recasepunc/utils.py
Outdated
@@ -0,0 +1,742 @@
import sys
This file very much looks like it was illegally taken from https://github.com/benob/recasepunc/blob/main/recasepunc.py
It's the same file, packaged as a Python package.
The whole recasepunc project is licensed under BSD-3-Clause, so we need to include the license to use this file.
Then everything should be fine.
I'm working on a solution to the things you mentioned :)
I reworked almost all of the things you mentioned, @lkiesow.
lkiesow
This file very much looks like it was illegally taken from https://github.com/benob/recasepunc/blob/main/recasepunc.py
vJan00
It's the same file, packaged as a Python package.
The whole recasepunc project is licensed under BSD-3-Clause, so we need to include the license to use this file.
Then everything should be fine.
That's correct. But you did not include the license. We definitely need to do that, and make it clear which part of the code the license applies to and where that code came from.
Related to this, I'm also wondering if we really want to copy this file or if we want to work with upstream to get it packaged on PyPI (if it isn't already) and just include it as a dependency. Do you have any thoughts on that or any reasoning why you went for copying it?
If we copy this, I'm also wondering if we should add this as a submodule instead of a separate Python module. But I'm not sure what makes more sense. I'm just wondering if otherwise we might conflict with the original project someday.
Linting still complains about some things. Take a look at the automated tests, or just run flake8 on the code yourself.
Finally, trying this out, it doesn't seem to work right now:
❯ python -m voskcli -i ~/videos/sintel_trailer-1080p.mp4 -o test.vtt -m vosk-model-en-us-0.22 -p en-pt
Start transcribing with model ./models/vosk-model-en-us-0.22
Finished transcribing...
Start punctuating with model ./models/en-pt
Downloading vocab.txt: 100%|██████████| 226k/226k [00:00<00:00, 586kB/s]
Downloading tokenizer_config.json: 100%|██████████| 28.0/28.0 [00:00<00:00, 18.0kB/s]
Downloading config.json: 100%|██████████| 570/570 [00:00<00:00, 298kB/s]
Downloading pytorch_model.bin: 100%|██████████| 420M/420M [00:21<00:00, 20.3MB/s]
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Finished punctuating...
Traceback (most recent call last):
  File "/usr/lib64/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib64/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/lars/dev/vosk-cli/dist/vosk-cli-0.1/voskcli/__main__.py", line 5, in <module>
    transcribe.main()
  File "/home/lars/dev/vosk-cli/dist/vosk-cli-0.1/voskcli/transcribe.py", line 233, in main
    transcribe(inputFile, outputFile, model, punc)
  File "/home/lars/dev/vosk-cli/dist/vosk-cli-0.1/voskcli/transcribe.py", line 186, in transcribe
    entry['word'] = case_result_list[word]
IndexError: list index out of range
Without -p:
❯ python -m voskcli -i ~/videos/sintel_trailer-1080p.mp4 -o test.vtt -m vosk-model-en-us-0.22
Start transcribing with model ./models/vosk-model-en-us-0.22
Finished transcribing...
No punctuating wished...
Finished writing. Saving WebVTT file...
WebVTT saved.
I didn't go through the code changes again.
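For reference, the IndexError in the traceback above is what you get when the list of recased words is shorter than the list of transcribed entries being indexed. Whether that is the actual cause here is an assumption. A minimal sketch of a more defensive loop, where `apply_recasing` and its arguments are hypothetical names and only `entry['word']` is taken from the traceback:

```python
def apply_recasing(entries, recased_words):
    """Copy recased words onto transcript entries, pairing them by position.

    `entries` is assumed to be a list of dicts with a 'word' key and
    `recased_words` a list of strings. Returns the number of entries that
    had no recased counterpart and were therefore left unchanged.
    """
    # zip() stops at the shorter list, so a length mismatch can no longer
    # raise an IndexError.
    for entry, word in zip(entries, recased_words):
        entry['word'] = word
    return max(0, len(entries) - len(recased_words))
```

A non-zero return value would be a hint that the punctuation step dropped or merged words and that the two lists need to be realigned before writing the WebVTT file.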
Could you send me the file?
Everything is fixed so far. One import has to be excluded from linting, because it is needed in the transcribe.py file but has to be placed in __init__.py. Otherwise it does not work.
Copying it in as a Python module was simply the easiest way. I don't think upstream intends to publish this on PyPI in the future, unless we take care of it ourselves.
Since the model_path method looks for a folder and not a file, and the files are all named Checkpoint.
You will find the media file at: https://data.lkiesow.io/opencast/test-media/
That sounds weird. Do you know why you cannot include it where it's needed? This sounds like a problem which may re-appear at any time if e.g. you install the modules in your system.
Seemingly fixed in the meantime.
Yes, it's a general problem with the model loader recasepunc uses.
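One plausible explanation for the import placement issue and the "missing attributes" error, assuming the checkpoint is a pickle-based file (the thread does not confirm this): pickle stores only the module path and name of a class, so loading fails if the class is not reachable under exactly that name. The sketch below demonstrates that behaviour with the standard library only. `Config` and the module name `trainer` are made-up stand-ins, not names from recasepunc.

```python
import pickle
import sys
import types


class Config:
    """Hypothetical stand-in for a class stored inside a checkpoint."""

    def __init__(self, lang):
        self.lang = lang


# Pretend the checkpoint was written by a script whose classes lived in a
# module named "trainer" (a made-up name for this sketch).
trainer = types.ModuleType("trainer")
Config.__module__ = "trainer"
Config.__qualname__ = "Config"
trainer.Config = Config
sys.modules["trainer"] = trainer

blob = pickle.dumps(Config("en"))

# Loading succeeds while "trainer.Config" can be resolved ...
restored = pickle.loads(blob)

# ... and fails once the class is no longer reachable under that name.
del trainer.Config
try:
    pickle.loads(blob)
    load_error = None
except AttributeError as exc:
    load_error = exc
```

If this is indeed the mechanism, the import has to sit wherever the stored module path points, which would explain why moving it next to its actual use breaks loading.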
# Conflicts: # README.md # voskcli/transcribe.py
This branch adds a method to vosk-cli to load and use punctuation models.
The model (checkpoint file) must be placed in a directory named after the matching three-character country code followed by -punctuation, in the same location as the language model. Example:
/usr/share/vosk/language/***-punctuation
In particular, it:
- adds an option to vosk-cli, namely -p, which enables punctuation.
- renames scripts to the more informative name transcriber.
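As a rough illustration of the description above, here is a minimal sketch of how the command line and the model lookup could fit together. The flags -i, -o, -m and -p are the ones used in the commands earlier in this thread. The function names, help texts, and the choice of which flags are required are assumptions, not the actual vosk-cli code.

```python
import argparse
from pathlib import Path


def build_parser():
    """Build a parser with the flags seen in this thread (sketch only)."""
    parser = argparse.ArgumentParser(prog="voskcli")
    parser.add_argument("-i", dest="input", required=True, help="input media file")
    parser.add_argument("-o", dest="output", required=True, help="output WebVTT file")
    parser.add_argument("-m", dest="model", required=True, help="Vosk language model")
    # Punctuation is optional: it is only applied when -p is given.
    parser.add_argument("-p", dest="punctuation", default=None, help="punctuation model")
    return parser


def punctuation_model_dir(base_dir, code):
    """Return <base_dir>/<code>-punctuation, or None if it is not a directory.

    Hypothetical helper: it mirrors the "***-punctuation" naming from the
    description and checks for a directory rather than a file.
    """
    candidate = Path(base_dir) / f"{code}-punctuation"
    return candidate if candidate.is_dir() else None
```

Returning None instead of raising lets the caller fall back to unpunctuated output with a clear message when the model directory is missing.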