Recasepunc #9

vJan00 · 2022-07-23T01:21:10Z

This branch adds a method to vosk-cli to load and use punctuation models.
The model (Checkpoint file) must be placed under the matching three-character country code minus punctuation,
where the language model is also located. Example: /usr/share/vosk/language/***-punctuation
In particular, it:

...adds a flag to vosk-cli namely -p, which enables punctuation.
...updated the readme with a description and requirements necessary for recasepunc.
...renames the Python module from the generic scripts to the more informative name transcriber

But there are problems with pip install because of missing attributes

…/punctation # Conflicts: # scripts/transcribe.py

…2022/punctation

lkiesow

Thanks for the contribution. I've added a few comments to this. If anything is unclear, please don't hesitate to reach out and ask. Additional to the comments below, this pull request breaks the tests which needs to be fixed, and it doesn't quite fit the new structure.

lkiesow · 2022-08-02T18:22:13Z

recasepunc/utils.py

@@ -0,0 +1,742 @@
+import sys


This file very much looks like it was illegally taken from https://github.com/benob/recasepunc/blob/main/recasepunc.py

Its the same file, packed in a python Package.
The whole Recasepunc project falls under the BDS-3-Clause, so we need to include the License to use this file.
Then everything should be fine.

transcriber/utils.py

vJan00 · 2022-08-08T07:57:20Z

Im working on a solution to the things you mentioned :)

Recasepunc structure

vJan00 · 2022-08-29T07:59:31Z

I reworked mostly all of the things you mentioned @lkiesow
Maybe give it a go - if you find something else just tell me :)

lkiesow

lkiesow
This file very much looks like it was illegally taken from https://github.com/benob/recasepunc/blob/main/recasepunc.py

vJan00
Its the same file, packed in a python Package.
The whole Recasepunc project falls under the BDS-3-Clause, so we need to include the License to use this file.
Then everything should be fine.

That's correct. But you did not include the license. That, we would definitely need to do and make sure it's clear that this applies to this part of the code and where it came from.

Related to this, I'm also wondering if we really want to copy this file or if we want to work with upstream to get it packaged in pypi (if it isn't already) and just include it as a dependency. Do you have any thoughts on that or any reasoning why you went for copying it?

If we copy this, I'm also wondering if we should add this as a submodule instead of a separate Python module. But I'm not sure what makes more sense. I'm just wondering if otherwise we might conflict with the original project someday.

Linting also does still complain about some things. Take a look at the automated tests, or just run flake8 on the code yourself.

Finally, trying this out, it doesn't seem to work right now:

❯ python -m voskcli -i ~/videos/sintel_trailer-1080p.mp4 -o test.vtt -m vosk-model-en-us-0.22 -p en-pt
Start transcribing with model ./models/vosk-model-en-us-0.22
Finished transcribing...
Start punctuating with model ./models/en-pt
Downloading vocab.txt: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 226k/226k [00:00<00:00, 586kB/s]Downloading tokenizer_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28.0/28.0 [00:00<00:00, 18.0kB/s]Downloading config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 570/570 [00:00<00:00, 298kB/s]Downloading pytorch_model.bin: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 420M/420M [00:21<00:00, 20.3MB/s]Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Finished punctuating...
Traceback (most recent call last):
  File "/usr/lib64/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib64/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/lars/dev/vosk-cli/dist/vosk-cli-0.1/voskcli/__main__.py", line 5, in <module>
    transcribe.main()
  File "/home/lars/dev/vosk-cli/dist/vosk-cli-0.1/voskcli/transcribe.py", line 233, in main
    transcribe(inputFile, outputFile, model, punc)
  File "/home/lars/dev/vosk-cli/dist/vosk-cli-0.1/voskcli/transcribe.py", line 186, in transcribe
    entry['word'] = case_result_list[word]
IndexError: list index out of range

Without -p:

❯ python -m voskcli -i ~/videos/sintel_trailer-1080p.mp4 -o test.vtt -m vosk-model-en-us-0.22         
Start transcribing with model ./models/vosk-model-en-us-0.22
Finished transcribing...
No punctuating wished...
Finished writing. Saving WebVTT file...
WebVTT saved.

I didn't go through the code changes again.

voskcli/transcribe.py

Recasepunc structure

vJan00 · 2022-09-05T09:50:36Z

~/videos/sintel_trailer-1080p.mp4

Could you send me the file?
I can't find one in English or German that causes such an error.

Linting also does still complain about some things.

So far all fixed, one import must be ignored, because this is needed in the transcribe.py file but must be in init.py. Otherwise it does not work.

Related to this, I'm also wondering if we really want to copy this file or if we want to work with upstream to get it packaged in pypi (if it isn't already) and just include it as a dependency. Do you have any thoughts on that or any reasoning why you went for copying it?

It was the easiest way, just as a Python module. I don't think Upstream intends to provide this as PyPi in the future - unless we deal with it.

Trying to use this, this threw me off. Why modify the path users specified using the command line parameter -p?

Since the model_path method looks for a folder and not a file, and the files are all named Checkpoint.

lkiesow · 2022-09-07T11:23:24Z

You will find the media file at: https://data.lkiesow.io/opencast/test-media/
I used the vosk-model-en-us-0.22 and https://github.com/benob/recasepunc/releases/download/0.3/en.23000

one import must be ignored, because this is needed in the transcribe.py file but must be in init.py. Otherwise it does not work.

That sounds weird. Do you know why you cannot include it where it's needed? This sounds like a problem which may re-appear at any time if e.g. you install the modules in your system.

vJan00 · 2022-09-12T09:48:54Z

You will find the media file at: https://data.lkiesow.io/opencast/test-media/

Seemingly fixed in the meantime.

That sounds weird. Do you know why you cannot include it where it's needed? This sounds like a problem which may re-appear at any time if e.g. you install the modules in your system.

Yes its an overall Problem with the Model loader recasepunc uses.
For explanation:
Recasepunc uses Torch to load models which uses unpickler which tries to dynamically find classes in a module that is saved in the checkpoint file. All checkpoint files provided by Alpha Cephei are build with the __main__ module saved in them and that's the module (no matter what) where unpickler tries to look for the import. So because of distutils and our setup.py the __main__ module is auto generated and cannot be customised for imports - just look into the Virtual Environment that runs vosk-cli, you will find it there. So the only workaround I found was to change the __main__ module while running transcribe.py and inserting the import in the referenced modules (in this case voskcli) __init__.py.
That's what those lines are doing (Recasepunc loads the model when calling CasePuncPredictor):

old_main = sys.modules['__main__']
sys.modules['__main__'] = voskcli
predictor = CasePuncPredictor(punc + '/checkpoint')
sys.modules['__main__'] = old_main

# Conflicts: # README.md # voskcli/transcribe.py

Jan Mertens added 11 commits June 15, 2022 18:52

Added Basic Punctation Methods

e09097a

But there are problems with pip install because of missing attributes

Added recasepunc.py

79f3ff9

Moved recasepunc to be a module

6685868

Added chosen_language_dir_path to CasePuncPredictor

7f43050

Added __main__ manipulation to load model

8fca33c

Added argument for optional punctation

1a9460a

Added dependencies and usage of punctation to README.md

24147bf

Merge remote-tracking branch 'origin/jm/2022/punctation' into jm/2022…

563fab3

…/punctation # Conflicts: # scripts/transcribe.py

Renamed scripts to transcriber

caa936c

Renamed scripts to transcriber

5a06eff

Merge remote-tracking branch 'origin_itz/jm/2022/punctation' into jm/…

20b9c89

…2022/punctation

lkiesow requested changes Aug 2, 2022

View reviewed changes

owi92 mentioned this pull request Aug 19, 2022

Collection of tasks #10

Open

11 tasks

vJan00 and others added 3 commits August 29, 2022 09:51

Resolve merge conflicts

49c77eb

Added __init__.py

cb2d594

Merge pull request #1 from vJan00/recasepunc_structure

6a7cab0

Recasepunc structure

lkiesow requested changes Sep 2, 2022

View reviewed changes

voskcli/transcribe.py Show resolved Hide resolved

vJan00 and others added 4 commits September 5, 2022 09:22

Remove Hardcoded Filename

468059f

Removed and added space characters to punctuation.

92d1424

Cleared flake8 warnings

8ee4f58

Merge pull request #2 from vJan00/recasepunc_structure

a0565d4

Recasepunc structure

vJan00 requested a review from lkiesow September 30, 2022 12:24

Jan Mertens and others added 3 commits October 11, 2022 10:17

Merge remote-tracking branch 'GitHub-Main/main' into recasepunc

bed3a89

# Conflicts: # README.md # voskcli/transcribe.py

Added LICENSE

a983f2b

Updated README.md

bf221e9

vJan00 added 2 commits October 11, 2022 11:19

Removed '-' from space prediction because of complications

8e27308

Removed flake8 errors and debug output

1db5647

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Recasepunc #9

Recasepunc #9

vJan00 commented Jul 23, 2022

lkiesow left a comment

lkiesow Aug 2, 2022

vJan00 Aug 8, 2022

vJan00 commented Aug 8, 2022

vJan00 commented Aug 29, 2022

lkiesow left a comment

vJan00 commented Sep 5, 2022

lkiesow commented Sep 7, 2022 •

edited

Loading

vJan00 commented Sep 12, 2022

Recasepunc #9

Are you sure you want to change the base?

Recasepunc #9

Conversation

vJan00 commented Jul 23, 2022

lkiesow left a comment

Choose a reason for hiding this comment

lkiesow Aug 2, 2022

Choose a reason for hiding this comment

vJan00 Aug 8, 2022

Choose a reason for hiding this comment

vJan00 commented Aug 8, 2022

vJan00 commented Aug 29, 2022

lkiesow left a comment

Choose a reason for hiding this comment

vJan00 commented Sep 5, 2022

lkiesow commented Sep 7, 2022 • edited Loading

vJan00 commented Sep 12, 2022

lkiesow commented Sep 7, 2022 •

edited

Loading