Commit 50848cc: Update How to create high-quality offline video transcriptions and subtitles using Whisper and Python.md

ookgezellig committed Nov 5, 2024
1 parent 7b84196 commit 50848cc
Showing 1 changed file with 8 additions and 7 deletions.
@@ -1,6 +1,7 @@
# How to create high-quality offline video transcriptions and subtitles using Whisper and Python

<image src="media/afbeelding1.png" width="400" hspace="10" align="right"/>
<br clear="all" />

I always thought that 'doing things with AI' was synonymous with smoking data centers, overheated servers, and massive cloud computing power.

@@ -30,21 +31,21 @@ As I work with ChatGPT regularly, I had heard of [Whisper, OpenAI’s speech-to-

After some research to see if this could suit my ASR (Automatic Speech Recognition) needs, I found out that [this model excels in Dutch](https://github.com/openai/whisper?tab=readme-ov-file#available-models-and-languages). But it also performs well in English.

-<image src="media/afbeelding2.png" width="400" hspace="10" align="right"/>

OK, that already sounds promising. But Whisper doesn’t have a user-friendly front end (as far as I know), so I had to work with the API and Python. Fortunately, I found [this short blog post](https://nicobytes.com/blog/en/how-to-use-whisper/) to help me get started, and, combined with the [documentation](https://platform.openai.com/docs/guides/speech-to-text), it was straightforward to set up.

+<image src="media/afbeelding2.png" width="400" hspace="10" align="right"/>

Further in this article, you’ll read about what I ultimately created with it and find ready-to-use Python code to try it out yourself.

## FFmpeg is needed
If you’re using the Whisper API with Python, you’ll need [FFmpeg](https://www.ffmpeg.org/) installed on your laptop. [Wikihow explains well](https://www.wikihow.com/Install-FFmpeg-on-Windows) how to set it up. Unfortunately, I couldn’t do this on my work laptop (no admin permissions!), so I did it at home on my personal laptop running Windows 10 Pro.

Here’s what it looks like on my home laptop. I followed the guide mentioned above.

-<image src="media/ffmpeg-installatie-win10.PNG" width="400" hspace="10" align="right"/>
+<image src="media/ffmpeg-installatie-win10.PNG" width="400" hspace="0" align="left"/>
<br clear="all" /><br>

-<image src="media/ffmpeg-path-win10.PNG" width="400" hspace="10" align="right"/>
+<image src="media/ffmpeg-path-win10.PNG" width="400" hspace="0" align="left"/>
<br clear="all" />
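If you want to check from Python whether FFmpeg is actually reachable before invoking Whisper, a small sketch like this works (the helper name `have_ffmpeg` is my own, not part of any library):

```python
import shutil

def have_ffmpeg() -> bool:
    """Return True if an ffmpeg executable is found on the PATH."""
    return shutil.which("ffmpeg") is not None

print("FFmpeg found:", have_ffmpeg())
```

If this prints `False`, revisit the installation guide above and make sure the folder containing `ffmpeg.exe` was added to your PATH environment variable.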

## Offline use, so privacy friendly
@@ -54,7 +55,7 @@ When you run this piece of Python code for the first time,

the ‘large’ model is downloaded to your machine once. (See here for [the available models](https://github.com/openai/whisper?tab=readme-ov-file#available-models-and-languages).) To my great surprise, it’s just a 3GB file that handles all speech-to-text tasks without needing any further internet connection. So no smoking data centers, overheated servers, or massive cloud computing power—just a file on your own computer that you can use offline. And it’s great for privacy since everything happens offline on your own machine. Here’s the model on my home laptop. What happens inside that .pt file is pure magic!

-<image src="media/whisper-models-location-win10.PNG" width="300" hspace="10" align="right"/>
+<image src="media/whisper-models-location-win10.PNG" width="400" hspace="0" align="left"/>
<br clear="all" />
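The Python side is only a few lines. As a sketch (not necessarily the exact code used here; the file name and `language="nl"` are placeholder assumptions), transcription with the open-source openai-whisper package looks like this:

```python
def transcribe_offline(path: str, model_name: str = "large") -> dict:
    """Transcribe a media file locally with the open-source Whisper package.
    On the first call, load_model() downloads the model weights once
    (about 3 GB for 'large'); after that, everything runs offline."""
    import whisper  # pip install openai-whisper (needs FFmpeg on the PATH)
    model = whisper.load_model(model_name)
    # language="nl" is an example; omit it to let Whisper auto-detect.
    return model.transcribe(path, language="nl")

# Hypothetical usage:
# result = transcribe_offline("workshop.webm")
# print(result["text"])
```

The returned dictionary contains the full text under `"text"` and timed segments under `"segments"`.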

## Speed
@@ -63,9 +64,9 @@ Does transcription go reasonably fast? The 'large-v2' model I use operates at about
## And such quality! With subtitles! Even with poor input!
Beyond offline use, I am utterly amazed by the quality of the generated text. I’ll show this best through this (rather dull and quite lengthy) test video where I used myself as the test subject:

-[https://commons.wikimedia.org/wiki/File:Wikidata_Workshop_-_Theoretical_part_-_Maastricht_University_-_15_October_2024.webm](https://commons.wikimedia.org/wiki/File:Wikidata_Workshop_-_Theoretical_part_-_Maastricht_University_-_15_October_2024.webm)
+https://commons.wikimedia.org/wiki/File:Wikidata_Workshop_-_Theoretical_part_-_Maastricht_University_-_15_October_2024.webm

-<image src="media/afbeelding5.png" width="400" hspace="10" align="right"/>
+<image src="media/afbeelding5.png" width="600" hspace="0" align="left"/>
<br clear="all" />

The unformatted block of text shown in the file description was fully created using Whisper, with only minimal post-correction. Pay particular attention to how well it generates all named entities and technical terms, including proper capitalization, etc. WOW!
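The subtitles are possible because Whisper's result also contains timed segments. As an illustration (my own sketch, not the code from this commit; the sample segment is invented), those segments can be rendered as SubRip/SRT text like this:

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments) -> str:
    """Render Whisper-style segments (dicts with start/end/text) as SRT."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)

# Invented sample segment:
print(segments_to_srt([{"start": 0.0, "end": 2.5, "text": " Welkom bij deze workshop."}]))
```

Writing the returned string to a `.srt` file gives you subtitles that most video players pick up automatically.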
