Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recasepunc #9

Open
wants to merge 23 commits into
base: main
Choose a base branch
from
Open

Recasepunc #9

wants to merge 23 commits into from

Conversation

vJan00
Copy link

@vJan00 vJan00 commented Jul 23, 2022

This branch adds a method to vosk-cli to load and use punctuation models.
The model (Checkpoint file) must be placed under the matching three-character country code minus punctuation,
where the language model is also located. Example: /usr/share/vosk/language/***-punctuation
In particular, it:

  • ...adds a flag to vosk-cli namely -p, which enables punctuation.
  • ...updated the readme with a description and requirements necessary for recasepunc.
  • ...renames the Python module from the generic scripts to the more informative name transcriber

Copy link
Contributor

@lkiesow lkiesow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution. I've added a few comments to this. If anything is unclear, please don't hesitate to reach out and ask. Additional to the comments below, this pull request breaks the tests which needs to be fixed, and it doesn't quite fit the new structure.

@@ -0,0 +1,742 @@
import sys
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file very much looks like it was illegally taken from https://github.com/benob/recasepunc/blob/main/recasepunc.py

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its the same file, packed in a python Package.
The whole Recasepunc project falls under the BDS-3-Clause, so we need to include the License to use this file.
Then everything should be fine.

transcriber/utils.py Outdated Show resolved Hide resolved
transcriber/utils.py Outdated Show resolved Hide resolved
transcriber/utils.py Outdated Show resolved Hide resolved
transcriber/utils.py Outdated Show resolved Hide resolved
transcriber/utils.py Outdated Show resolved Hide resolved
transcriber/utils.py Outdated Show resolved Hide resolved
transcriber/utils.py Outdated Show resolved Hide resolved
transcriber/utils.py Outdated Show resolved Hide resolved
transcriber/utils.py Outdated Show resolved Hide resolved
@vJan00
Copy link
Author

vJan00 commented Aug 8, 2022

Im working on a solution to the things you mentioned :)

@owi92 owi92 mentioned this pull request Aug 19, 2022
11 tasks
@vJan00
Copy link
Author

vJan00 commented Aug 29, 2022

I reworked mostly all of the things you mentioned @lkiesow
Maybe give it a go - if you find something else just tell me :)

Copy link
Contributor

@lkiesow lkiesow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lkiesow
This file very much looks like it was illegally taken from https://github.com/benob/recasepunc/blob/main/recasepunc.py

vJan00
Its the same file, packed in a python Package.
The whole Recasepunc project falls under the BDS-3-Clause, so we need to include the License to use this file.
Then everything should be fine.

That's correct. But you did not include the license. That, we would definitely need to do and make sure it's clear that this applies to this part of the code and where it came from.

Related to this, I'm also wondering if we really want to copy this file or if we want to work with upstream to get it packaged in pypi (if it isn't already) and just include it as a dependency. Do you have any thoughts on that or any reasoning why you went for copying it?

If we copy this, I'm also wondering if we should add this as a submodule instead of a separate Python module. But I'm not sure what makes more sense. I'm just wondering if otherwise we might conflict with the original project someday.


Linting also does still complain about some things. Take a look at the automated tests, or just run flake8 on the code yourself.


Finally, trying this out, it doesn't seem to work right now:

❯ python -m voskcli -i ~/videos/sintel_trailer-1080p.mp4 -o test.vtt -m vosk-model-en-us-0.22 -p en-pt
Start transcribing with model ./models/vosk-model-en-us-0.22
Finished transcribing...
Start punctuating with model ./models/en-pt
Downloading vocab.txt: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 226k/226k [00:00<00:00, 586kB/s]Downloading tokenizer_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28.0/28.0 [00:00<00:00, 18.0kB/s]Downloading config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 570/570 [00:00<00:00, 298kB/s]Downloading pytorch_model.bin: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 420M/420M [00:21<00:00, 20.3MB/s]Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Finished punctuating...
Traceback (most recent call last):
  File "/usr/lib64/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib64/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/lars/dev/vosk-cli/dist/vosk-cli-0.1/voskcli/__main__.py", line 5, in <module>
    transcribe.main()
  File "/home/lars/dev/vosk-cli/dist/vosk-cli-0.1/voskcli/transcribe.py", line 233, in main
    transcribe(inputFile, outputFile, model, punc)
  File "/home/lars/dev/vosk-cli/dist/vosk-cli-0.1/voskcli/transcribe.py", line 186, in transcribe
    entry['word'] = case_result_list[word]
IndexError: list index out of range

Without -p:

❯ python -m voskcli -i ~/videos/sintel_trailer-1080p.mp4 -o test.vtt -m vosk-model-en-us-0.22         
Start transcribing with model ./models/vosk-model-en-us-0.22
Finished transcribing...
No punctuating wished...
Finished writing. Saving WebVTT file...
WebVTT saved.

I didn't go through the code changes again.

voskcli/transcribe.py Show resolved Hide resolved
@vJan00
Copy link
Author

vJan00 commented Sep 5, 2022

~/videos/sintel_trailer-1080p.mp4

Could you send me the file?
I can't find one in English or German that causes such an error.

Linting also does still complain about some things.

So far all fixed, one import must be ignored, because this is needed in the transcribe.py file but must be in init.py. Otherwise it does not work.

Related to this, I'm also wondering if we really want to copy this file or if we want to work with upstream to get it packaged in pypi (if it isn't already) and just include it as a dependency. Do you have any thoughts on that or any reasoning why you went for copying it?

It was the easiest way, just as a Python module. I don't think Upstream intends to provide this as PyPi in the future - unless we deal with it.

Trying to use this, this threw me off. Why modify the path users specified using the command line parameter -p?

Since the model_path method looks for a folder and not a file, and the files are all named Checkpoint.

@lkiesow
Copy link
Contributor

lkiesow commented Sep 7, 2022

You will find the media file at: https://data.lkiesow.io/opencast/test-media/
I used the vosk-model-en-us-0.22 and https://github.com/benob/recasepunc/releases/download/0.3/en.23000

one import must be ignored, because this is needed in the transcribe.py file but must be in init.py. Otherwise it does not work.

That sounds weird. Do you know why you cannot include it where it's needed? This sounds like a problem which may re-appear at any time if e.g. you install the modules in your system.

@vJan00
Copy link
Author

vJan00 commented Sep 12, 2022

You will find the media file at: https://data.lkiesow.io/opencast/test-media/

Seemingly fixed in the meantime.

That sounds weird. Do you know why you cannot include it where it's needed? This sounds like a problem which may re-appear at any time if e.g. you install the modules in your system.

Yes its an overall Problem with the Model loader recasepunc uses.
For explanation:
Recasepunc uses Torch to load models which uses unpickler which tries to dynamically find classes in a module that is saved in the checkpoint file. All checkpoint files provided by Alpha Cephei are build with the __main__ module saved in them and that's the module (no matter what) where unpickler tries to look for the import. So because of distutils and our setup.py the __main__ module is auto generated and cannot be customised for imports - just look into the Virtual Environment that runs vosk-cli, you will find it there. So the only workaround I found was to change the __main__ module while running transcribe.py and inserting the import in the referenced modules (in this case voskcli) __init__.py.
That's what those lines are doing (Recasepunc loads the model when calling CasePuncPredictor):

old_main = sys.modules['__main__']
sys.modules['__main__'] = voskcli
predictor = CasePuncPredictor(punc + '/checkpoint')
sys.modules['__main__'] = old_main

@vJan00 vJan00 requested a review from lkiesow September 30, 2022 12:24
Jan Mertens and others added 3 commits October 11, 2022 10:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants