
Complete Rewrite #61

Open · dariox1337 wants to merge 30 commits into main
Conversation

@dariox1337 (Contributor) commented Aug 27, 2024

This is almost a different program that happens to use WhisperWriter assets. I'm not sure if you're interested in merging it, but so as not to seem ungrateful, I'm opening this pull request. My motivation was to add profiles (multiple transcription setups that can be switched with a dedicated shortcut), and while doing so, I decided to restructure the whole program flow. Here is the design doc that gives a high-level overview.

Key Features

  • Multiple Profiles: Configure and switch between different transcription setups on the fly.
  • Flexible Backends: Support for local (Faster Whisper) and API-based (OpenAI) transcription.
  • Customizable Shortcuts: Each profile can have its own activation shortcut.
  • Various Recording Modes: Choose from continuous, voice activity detection, press-to-toggle, or hold-to-record modes.
  • Post-Processing: Apply customizable post-processing scripts to refine transcription output.
  • Multiple Output Methods: Write to active window, clipboard, or custom output handlers.
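
To make the profile idea concrete, here is a purely illustrative sketch of what one profile might bundle together. The field names are hypothetical and are not taken from the fork's actual data model or config_schema.yaml:

from dataclasses import dataclass

# Purely illustrative: hypothetical field names, not the fork's real data model.
@dataclass
class Profile:
    name: str                    # e.g. "dictation" or "meeting-notes"
    activation_shortcut: str     # e.g. "ctrl+shift+d"; each profile has its own
    backend: str                 # "faster_whisper" (local) or "openai" (API)
    recording_mode: str          # "continuous", "vad", "press_to_toggle", "hold_to_record"
    post_processing_script: str  # file name of a script in the scripts directory, or ""
    output_method: str           # "active_window", "clipboard", or a custom handler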

P.S. 99.9% of the code was generated by AI.

UPDATE: Just for anyone interested, my fork now supports streaming transcription with Faster Whisper and VOSK (a new backend). The GUI has been updated to PyQt6, and the Python dependencies have been updated to support Python 3.12.

@oyhel commented Aug 27, 2024

I just stumbled upon this project and then this rewrite. Kudos to both: a very useful project, and very easy to get up and running! Some suggestions/questions for improvement:

  1. Would it be possible to add the ID of an OpenAI assistant to perform post-processing as part of the transcription? I understand this is possible using the scripts feature, but I assume this means the text would need to be passed back and forth to OpenAI?
  2. For a similar post-processing implementation using a local Ollama instance etc., I assume passing the data to Ollama for post-processing using a separate script would be the preferred approach?

@dariox1337 (Contributor, Author) commented Aug 27, 2024

I'll speak only for my implementation.

  1. I'm not that familiar with the OpenAI API. Does its Whisper API provide an option to redirect the transcription result to an assistant on the server side, without sending you the raw transcription?
  • If yes, it can be implemented. All you need to do is implement the logic in the openai backend; additional config parameters need to be added to config_schema.yaml.
  • If no, it can be implemented with post-processing scripts. Here is a quick example (NOT TESTED):
from openai import OpenAI
from post_processing_base import PostProcessor

class Processor(PostProcessor):
    def __init__(self):
        # Uses the openai>=1.0 client interface.
        # In a real-world scenario, you'd want to load the key from a secure config.
        self.client = OpenAI(api_key='your-api-key-here')

    def process(self, text: str) -> str:
        try:
            response = self.client.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=[
                    {"role": "system", "content": "You are a helpful assistant that corrects transcription errors."},
                    {"role": "user", "content": f"Please fix any obvious mistakes in this transcribed text, maintaining the original meaning: '{text}'"}
                ]
            )

            corrected_text = response.choices[0].message.content.strip()
            return corrected_text
        except Exception as e:
            print(f"Error in AI correction: {e}")
            # If there's an error, return the original text
            return text

Just save this script under a new name in scripts and it'll appear in the list of post-processing scripts.

  2. Ollama can also be implemented very easily. Here is a possible implementation (NOT TESTED):
import requests
import json
from post_processing_base import PostProcessor

class Processor(PostProcessor):
    def __init__(self):
        self.api_base = "http://localhost:11434/api"  # Default Ollama API address
        self.model = "llama2"  # Or whatever model you're using

    def process(self, text: str) -> str:
        try:
            response = requests.post(
                f"{self.api_base}/generate",
                json={
                    "model": self.model,
                    "prompt": f"Please fix any obvious mistakes in this transcribed text, maintaining the original meaning: '{text}'",
                    "stream": False
                }
            )
            response.raise_for_status()  # Raise an exception for bad status codes

            result = response.json()
            corrected_text = result['response'].strip()
            return corrected_text
        except requests.RequestException as e:
            print(f"Error in Ollama API call: {str(e)}")
            return text
        except json.JSONDecodeError as e:
            print(f"Error decoding Ollama API response: {str(e)}")
            return text
        except KeyError as e:
            print(f"Unexpected response format from Ollama API: {str(e)}")
            return text
        except Exception as e:
            print(f"Unexpected error in AI correction: {str(e)}")
            return text

Put it in "scripts" and it'll appear in settings under the file name you choose.

The only issue is testing this code. The implementation is very simple, but I don't have the means to test these things right now.
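
For reference, both examples subclass PostProcessor from post_processing_base. Judging only from how the snippets use it, the interface presumably looks roughly like this (a sketch inferred from the examples, not the actual module):

from abc import ABC, abstractmethod

# Inferred sketch of the interface the example scripts rely on;
# the real post_processing_base module may differ.
class PostProcessor(ABC):
    @abstractmethod
    def process(self, text: str) -> str:
        """Take the raw transcription and return the post-processed text."""
        ...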

@dariox1337 marked this pull request as draft on August 28, 2024 at 12:18
@dariox1337 (Contributor, Author) commented

While working on streaming transcription, I found a very tricky bug with keyboard simulation. Specifically, hotkeys affect simulated key presses.

If you use "ctrl+shift" as the trigger hotkey and the simulated keyboard tries to type "hello", your programs will register it as ctrl+shift+h, ctrl+shift+e, etc. This is most obvious in streaming mode, but it can be triggered in non-streaming mode as well: you press ctrl+shift to trigger recording, release shift or ctrl while still holding the other key, and transcription will begin; the output will be affected by whichever modifier you keep holding.

In an attempt to fix this, I already tried interacting with /dev/uinput directly, but it looks like the Linux input system combines modifier state from all keyboards (virtual and physical). I'm looking for a solution and keeping this PR as a draft until I find something.

Streaming works for the VOSK transcription backend. The general program flow is adjusted to support streaming properly. Also added a uinput output backend: it is the most dependency-free backend, interfacing directly with /dev/uinput via syscalls.

Make full use of newly introduced sentinel values in audio queues.
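
For anyone curious what "typing" through a virtual device looks like, here is a minimal sketch using the python-evdev library; it is only an illustration, since the backend described above talks to /dev/uinput via raw syscalls rather than through evdev:

from evdev import UInput, ecodes as e

# Minimal illustration only: the actual backend uses raw /dev/uinput syscalls;
# python-evdev is used here just to keep the sketch short. Requires permission
# to create uinput devices (typically root or an input/uinput group).
KEYS = [e.KEY_H, e.KEY_E, e.KEY_L, e.KEY_O]

ui = UInput({e.EV_KEY: KEYS}, name="virtual-typing-sketch")
try:
    for key in [e.KEY_H, e.KEY_E, e.KEY_L, e.KEY_L, e.KEY_O]:
        ui.write(e.EV_KEY, key, 1)   # key press
        ui.write(e.EV_KEY, key, 0)   # key release
        ui.syn()                     # flush the events to the kernel
finally:
    ui.close()

Note that any modifiers the user is still physically holding (e.g. the ctrl+shift trigger hotkey) are merged with these events by the kernel, which is exactly the interference described above.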
@go-run-jump commented

When I tried to use the OpenAI API functionality, it did not work because the audio was barely understandable. After listening to it, I was amazed at what the model could still decipher. There seems to be some missing normalization and conversion.

Using this worked:

        if audio_data.dtype == np.float32 and np.abs(audio_data).max() <= 1.0:
            # Data is already in the correct format
            pass
        elif audio_data.dtype == np.float32:
            # Data is float32 but may not be in [-1, 1] range
            audio_data = np.clip(audio_data, -1.0, 1.0)
        elif audio_data.dtype in [np.int16, np.int32]:
            # Convert integer PCM to float32
            audio_data = audio_data.astype(np.float32) / np.iinfo(audio_data.dtype).max
        else:
            raise ValueError(f"Unsupported audio format: {audio_data.dtype}")
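
Once the buffer is normalized float32 in [-1, 1], it still has to be packaged for the API. A minimal sketch of turning it into 16-bit PCM WAV bytes, using only numpy and the standard library (the helper name is made up for illustration):

import io
import wave

import numpy as np

def float32_to_wav_bytes(audio_data: np.ndarray, sample_rate: int = 16000) -> bytes:
    # Assumes mono float32 samples already clipped to [-1.0, 1.0]
    pcm16 = (np.clip(audio_data, -1.0, 1.0) * 32767.0).astype(np.int16)
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)            # mono
        wf.setsampwidth(2)            # 16-bit samples
        wf.setframerate(sample_rate)  # must match the capture rate
        wf.writeframes(pcm16.tobytes())
    return buf.getvalue()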

@dariox1337 (Contributor, Author) commented Sep 9, 2024

@go-run-jump Fixing the OpenAI API backend is tricky because I can't test it. But I tried saving the audio both before and after conversion in the Faster Whisper backend, and both files sounded completely normal. Maybe the issue is with your microphone?

I have two microphones. The one integrated into the laptop chassis records really shitty audio, especially when the fan is spinning fast. That's why I'm using a USB microphone.

Also, you mentioned in another thread that the audio sounded faster than real time for you; perhaps that's because you forgot to specify the sample rate? Recording is done at 16 kHz by default, so if you replay it at 44.1 kHz it will sound about 2.75x faster.

Anyway, I added the changes you proposed. I hope it helps.

@tpougy commented Oct 14, 2024

Hello @dariox1337, nice work on the rewrite of this already awesome project using an elegant architecture. Have you tried to pack your fork as a single executable file using something like PyInstaller? If not, do you think it is possible to do so? Maybe a better first step would be an executable that restricts the project to just the OpenAI API functionality and not local models, but I don't actually know if that would be needed.
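
For reference, PyInstaller can be driven from Python as well as from the command line; an untested sketch (the entry point run.py is a guess, and local-model backends would likely need extra hidden imports and data files) might look like:

import PyInstaller.__main__

# Untested sketch: "run.py" is a hypothetical entry point; bundling local-model
# backends may require additional --hidden-import / --add-data options.
PyInstaller.__main__.run([
    "run.py",
    "--onefile",    # produce a single executable file
    "--noconsole",  # no terminal window for the Qt GUI
    "--name", "whisperwriter",
])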
