
Complete Rewrite #61

Open · dariox1337 wants to merge 30 commits into main
Conversation

@dariox1337 (Contributor) commented Aug 27, 2024

This is almost a different program that happens to use WhisperWriter assets. I'm not sure if you're interested in merging it, but so as not to seem ungrateful, I'm opening this pull request. My motivation was to add profiles (multiple transcription setups that can be switched with a dedicated shortcut), and while doing so, I decided to restructure the whole program flow. Here is the design doc that gives a high-level overview.

Key Features

  • Multiple Profiles: Configure and switch between different transcription setups on the fly.
  • Flexible Backends: Support for local (Faster Whisper) and API-based (OpenAI) transcription.
  • Customizable Shortcuts: Each profile can have its own activation shortcut.
  • Various Recording Modes: Choose from continuous, voice activity detection, press-to-toggle, or hold-to-record modes.
  • Post-Processing: Apply customizable post-processing scripts to refine transcription output.
  • Multiple Output Methods: Write to active window, clipboard, or custom output handlers.
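
To make the profile idea concrete, here is a purely illustrative sketch of what one profile might bundle together. The field names are hypothetical and are not taken from the fork's actual data model or config_schema.yaml:

from dataclasses import dataclass

# Purely illustrative: hypothetical field names, not the fork's real data model.
@dataclass
class Profile:
    name: str                    # e.g. "dictation" or "meeting-notes"
    activation_shortcut: str     # e.g. "ctrl+shift+d"; each profile has its own
    backend: str                 # "faster_whisper" (local) or "openai" (API)
    recording_mode: str          # "continuous", "vad", "press_to_toggle", "hold_to_record"
    post_processing_script: str  # file name of a script in the scripts directory, or ""
    output_method: str           # "active_window", "clipboard", or a custom handler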

P.S. 99.9% of the code was generated by AI.

UPDATE: Just for anyone interested, my fork now supports streaming transcription with Faster Whisper and VOSK (a new backend). The GUI has been updated to PyQt6, and the Python dependencies have been updated to support Python 3.12.

@oyhel commented Aug 27, 2024

I just stumbled upon this project and then this rewrite. Kudos to both: a very useful project, and very easy to get up and running! Some suggestions/questions for improvement:

  1. Would it be possible to add the ID of an OpenAI assistant to perform post-processing as part of the transcription? I understand this is possible using the scripts feature, but I assume this means the text would need to be passed back and forth to OpenAI?
  2. For a similar post-processing implementation using a local Ollama instance etc., I assume passing the data to Ollama for post-processing using a separate script would be the preferred approach?

@dariox1337 (Contributor, Author) commented Aug 27, 2024

I'll speak only for my implementation.

  1. I'm not that familiar with the OpenAI API. Does its Whisper API provide an option to redirect the transcription result to an assistant on the server side, without sending you the raw transcription?
  • If yes, it can be implemented. All you need to do is implement the logic in the openai backend; additional config parameters need to be added to config_schema.yaml.
  • If no, it can be implemented with post-processing scripts. Here is a quick example (NOT TESTED):
from openai import OpenAI
from post_processing_base import PostProcessor

class Processor(PostProcessor):
    def __init__(self):
        # Uses the openai>=1.0 client interface.
        # In a real-world scenario, you'd want to load the key from a secure config.
        self.client = OpenAI(api_key='your-api-key-here')

    def process(self, text: str) -> str:
        try:
            response = self.client.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=[
                    {"role": "system", "content": "You are a helpful assistant that corrects transcription errors."},
                    {"role": "user", "content": f"Please fix any obvious mistakes in this transcribed text, maintaining the original meaning: '{text}'"}
                ]
            )

            corrected_text = response.choices[0].message.content.strip()
            return corrected_text
        except Exception as e:
            print(f"Error in AI correction: {e}")
            # If there's an error, return the original text
            return text

Just save this script under a new name in scripts and it'll appear in the list of post-processing scripts.

  2. Ollama can also be implemented very easily. Here is a possible implementation (NOT TESTED):
import requests
import json
from post_processing_base import PostProcessor

class Processor(PostProcessor):
    def __init__(self):
        self.api_base = "http://localhost:11434/api"  # Default Ollama API address
        self.model = "llama2"  # Or whatever model you're using

    def process(self, text: str) -> str:
        try:
            response = requests.post(
                f"{self.api_base}/generate",
                json={
                    "model": self.model,
                    "prompt": f"Please fix any obvious mistakes in this transcribed text, maintaining the original meaning: '{text}'",
                    "stream": False
                }
            )
            response.raise_for_status()  # Raise an exception for bad status codes

            result = response.json()
            corrected_text = result['response'].strip()
            return corrected_text
        except requests.RequestException as e:
            print(f"Error in Ollama API call: {str(e)}")
            return text
        except json.JSONDecodeError as e:
            print(f"Error decoding Ollama API response: {str(e)}")
            return text
        except KeyError as e:
            print(f"Unexpected response format from Ollama API: {str(e)}")
            return text
        except Exception as e:
            print(f"Unexpected error in AI correction: {str(e)}")
            return text

Put it in "scripts" and it'll appear in settings under the file name you choose.

The only issue is testing this code. The implementation is very simple, but I don't have the means to test these things right now.
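
For reference, both examples subclass PostProcessor from post_processing_base. Judging only from how the snippets use it, the interface presumably looks roughly like this (a sketch inferred from the examples, not the actual module):

from abc import ABC, abstractmethod

# Inferred sketch of the interface the example scripts rely on;
# the real post_processing_base module may differ.
class PostProcessor(ABC):
    @abstractmethod
    def process(self, text: str) -> str:
        """Take the raw transcription and return the post-processed text."""
        ...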

@dariox1337 marked this pull request as draft on August 28, 2024 at 12:18
@dariox1337 (Contributor, Author) commented

While working on streaming transcription, I found a very tricky bug with keyboard simulation. Specifically, hotkeys affect simulated key presses.

If you use "ctrl+shift" as the trigger hotkey and the simulated keyboard tries to type "hello", your programs will register it as ctrl+shift+h, ctrl+shift+e, etc. This is most obvious in streaming mode, but it can be triggered in non-streaming mode as well: you press ctrl+shift to trigger recording, release shift or ctrl while still holding the other key, and transcription will begin; the output will be affected by whichever modifier you keep holding.

In an attempt to fix this, I already tried interacting with /dev/uinput directly, but it looks like the Linux input system combines modifier state from all keyboards (virtual and physical). I'm looking for a solution and keeping this PR as a draft until I find something.

Streaming works for the VOSK transcription backend. The general program flow is adjusted to support streaming properly. Also added a uinput output backend: it is the most dependency-free backend, interfacing directly with /dev/uinput via syscalls.

Make full use of newly introduced sentinel values in audio queues.
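
For anyone curious what "typing" through a virtual device looks like, here is a minimal sketch using the python-evdev library; it is only an illustration, since the backend described above talks to /dev/uinput via raw syscalls rather than through evdev:

from evdev import UInput, ecodes as e

# Minimal illustration only: the actual backend uses raw /dev/uinput syscalls;
# python-evdev is used here just to keep the sketch short. Requires permission
# to create uinput devices (typically root or an input/uinput group).
KEYS = [e.KEY_H, e.KEY_E, e.KEY_L, e.KEY_O]

ui = UInput({e.EV_KEY: KEYS}, name="virtual-typing-sketch")
try:
    for key in [e.KEY_H, e.KEY_E, e.KEY_L, e.KEY_L, e.KEY_O]:
        ui.write(e.EV_KEY, key, 1)   # key press
        ui.write(e.EV_KEY, key, 0)   # key release
        ui.syn()                     # flush the events to the kernel
finally:
    ui.close()

Note that any modifiers the user is still physically holding (e.g. the ctrl+shift trigger hotkey) are merged with these events by the kernel, which is exactly the interference described above.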
@go-run-jump commented

When I tried to use the OpenAI API functionality, it did not work because the audio was barely understandable. After listening to it, I was amazed at what the model could still decipher. There seems to be some missing normalization and conversion.

Using this worked:

        if audio_data.dtype == np.float32 and np.abs(audio_data).max() <= 1.0:
            # Data is already in the correct format
            pass
        elif audio_data.dtype == np.float32:
            # Data is float32 but may not be in [-1, 1] range
            audio_data = np.clip(audio_data, -1.0, 1.0)
        elif audio_data.dtype in [np.int16, np.int32]:
            # Convert integer PCM to float32
            audio_data = audio_data.astype(np.float32) / np.iinfo(audio_data.dtype).max
        else:
            raise ValueError(f"Unsupported audio format: {audio_data.dtype}")
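
Once the buffer is normalized float32 in [-1, 1], it still has to be packaged for the API. A minimal sketch of turning it into 16-bit PCM WAV bytes, using only numpy and the standard library (the helper name is made up for illustration):

import io
import wave

import numpy as np

def float32_to_wav_bytes(audio_data: np.ndarray, sample_rate: int = 16000) -> bytes:
    # Assumes mono float32 samples already clipped to [-1.0, 1.0]
    pcm16 = (np.clip(audio_data, -1.0, 1.0) * 32767.0).astype(np.int16)
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)            # mono
        wf.setsampwidth(2)            # 16-bit samples
        wf.setframerate(sample_rate)  # must match the capture rate
        wf.writeframes(pcm16.tobytes())
    return buf.getvalue()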

@dariox1337 (Contributor, Author) commented Sep 9, 2024

@go-run-jump Fixing the OpenAI API backend is tricky because I can't test it. But I tried saving the audio both before and after conversion in the Faster Whisper backend, and both files sounded completely normal. Maybe the issue is with your microphone?

I have two microphones. The one integrated into the laptop chassis records really shitty audio, especially when the fan is spinning fast. That's why I'm using a USB microphone.

Also, you mentioned in another thread that the audio sounded faster than real time for you; perhaps that's because you forgot to specify the sample rate? Recording is done at 16 kHz by default, so if you replay it at 44.1 kHz it will sound about 2.75x faster.

Anyway, I added the changes you proposed. I hope it helps.

@tpougy commented Oct 14, 2024

Hello @dariox1337, nice work on the rewrite of this already awesome project using an elegant architecture. Have you tried to pack your fork as a single executable file using something like PyInstaller? If not, do you think it is possible to do so? Maybe a better first step would be an executable that restricts the project to just the OpenAI API functionality and not local models, but I don't actually know if that would be needed.
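
For reference, PyInstaller can be driven from Python as well as from the command line; an untested sketch (the entry point run.py is a guess, and local-model backends would likely need extra hidden imports and data files) might look like:

import PyInstaller.__main__

# Untested sketch: "run.py" is a hypothetical entry point; bundling local-model
# backends may require additional --hidden-import / --add-data options.
PyInstaller.__main__.run([
    "run.py",
    "--onefile",    # produce a single executable file
    "--noconsole",  # no terminal window for the Qt GUI
    "--name", "whisperwriter",
])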
