Complete Rewrite #61
base: main
Conversation
Settings aren't supported in the settings window yet; they need to be edited in YAML.
Also made extensive changes to ConfigManager to serve the needs of the new settings window.
Renamed input_simulator -> keyboard_simulator to avoid confusion with input listeners.
There are still bugs, but basic functionality is finally restored.
Implementing this required adding a Qt dependency to EventBus.
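On the last point, here is a minimal sketch of why pulling Qt into the event bus is useful (this is not the PR's actual EventBus; the class and signal names are assumptions): Qt signals give thread-safe delivery, so events published from worker threads reach subscribers through the receiving thread's event loop.

```python
from PyQt6.QtCore import QObject, pyqtSignal

class EventBus(QObject):
    # A single generic signal: (event name, payload). Qt handles cross-thread
    # delivery, so publishers on audio/transcription threads can safely reach
    # GUI-side subscribers.
    event = pyqtSignal(str, object)

    def subscribe(self, handler):
        self.event.connect(handler)

    def publish(self, name, payload=None):
        self.event.emit(name, payload)

# Usage
bus = EventBus()
bus.subscribe(lambda name, data: print(name, data))
bus.publish("transcription_done", {"text": "hello world"})
```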
I just stumbled upon this project and then this rewrite. Kudos to both: a very useful project, and very easy to get up and running! Some suggestions/questions for improvements.
I'll speak only for my implementation.
Just save this script under a new name in scripts and it'll appear in the list of post-processing scripts.
Put it in "scripts" and it'll appear in settings under the file name you choose. The implementation is very simple; the only issue is testing this code, and I don't have the means to test these things right now.
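For anyone wondering what such a script could look like, here is a minimal sketch with a hypothetical file name, assuming each script simply exposes a function that takes the transcribed text and returns the processed text (the exact interface isn't quoted in this thread):

```python
# scripts/remove_filler_words.py  (hypothetical example)

FILLERS = {"um", "uh", "erm"}

def process(text: str) -> str:
    """Drop common filler words from the transcribed text."""
    return " ".join(word for word in text.split() if word.lower() not in FILLERS)
```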
While working on streaming transcription, I found a very tricky bug with keyboard simulation: hotkeys affect simulated key presses. If you use "ctrl+shift" as the trigger hotkey and the simulated keyboard tries to type "hello", your programs will register it as ctrl+shift+h, ctrl+shift+e, etc. This is most obvious in streaming mode, but it can be triggered in non-streaming mode as well: press ctrl+shift to trigger recording, release shift or ctrl while still holding the other key, and transcription begins with the output affected by whichever modifier you keep holding. In an attempt to fix this, I already tried interacting with /dev/uinput directly, but it looks like the Linux input system merges modifiers from all keyboards (virtual and physical). I'm looking for a solution and keeping this PR as a draft until I find something.
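For the non-streaming case, one mitigation that is sometimes tried (not a fix settled on in this PR; the names below are illustrative and pynput is only an assumed dependency) is to delay simulated typing until the listener reports that the trigger modifiers have been physically released. This does not help streaming mode, where typing happens while the hotkey is still held.

```python
import time
from pynput import keyboard

# held_modifiers mirrors the ctrl/shift keys the user is physically holding.
MODIFIERS = {keyboard.Key.ctrl, keyboard.Key.ctrl_l, keyboard.Key.ctrl_r,
             keyboard.Key.shift, keyboard.Key.shift_l, keyboard.Key.shift_r}
held_modifiers = set()

def on_press(key):
    if key in MODIFIERS:
        held_modifiers.add(key)

def on_release(key):
    held_modifiers.discard(key)

listener = keyboard.Listener(on_press=on_press, on_release=on_release)
listener.start()

def type_when_safe(text: str, controller: keyboard.Controller = keyboard.Controller()):
    # Wait until ctrl/shift are no longer held, then simulate typing.
    while held_modifiers:
        time.sleep(0.01)
    controller.type(text)
```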
Streaming works for the VOSK transcription backend, and the general program flow has been adjusted to support streaming properly. Also added a uinput output backend: it's the most dependency-free backend, interfacing directly with /dev/uinput via syscalls.
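The PR's actual uinput backend isn't shown in this thread; for orientation, here is a minimal sketch of what talking to /dev/uinput directly looks like in Python without third-party dependencies. The ioctl numbers and struct layout below come from <linux/uinput.h> on x86-64 and are my assumptions, not code from this PR.

```python
import os, fcntl, struct, time

# ioctl request numbers and event codes from <linux/uinput.h> and
# <linux/input-event-codes.h> (x86-64 values; verify against your headers)
UI_SET_EVBIT, UI_SET_KEYBIT = 0x40045564, 0x40045565
UI_DEV_CREATE, UI_DEV_DESTROY = 0x5501, 0x5502
EV_SYN, EV_KEY, SYN_REPORT = 0x00, 0x01, 0
KEY_H = 35
BUS_USB = 0x03

def emit(fd, etype, code, value):
    # struct input_event: struct timeval (two longs), __u16 type, __u16 code, __s32 value
    os.write(fd, struct.pack("llHHi", 0, 0, etype, code, value))

fd = os.open("/dev/uinput", os.O_WRONLY | os.O_NONBLOCK)
fcntl.ioctl(fd, UI_SET_EVBIT, EV_KEY)
fcntl.ioctl(fd, UI_SET_KEYBIT, KEY_H)   # repeat for every key you want to type

# Legacy uinput_user_dev layout: 80-byte name, input_id (4 x __u16), ff_effects_max,
# then four __s32[64] abs arrays left zeroed (total size 1116 bytes).
dev = struct.pack("80sHHHHi", b"virtual-keyboard", BUS_USB, 0x1, 0x1, 1, 0)
os.write(fd, dev + b"\x00" * (1116 - len(dev)))
fcntl.ioctl(fd, UI_DEV_CREATE)
time.sleep(0.5)                          # let the desktop pick up the new device

emit(fd, EV_KEY, KEY_H, 1)               # key down
emit(fd, EV_SYN, SYN_REPORT, 0)
emit(fd, EV_KEY, KEY_H, 0)               # key up
emit(fd, EV_SYN, SYN_REPORT, 0)

fcntl.ioctl(fd, UI_DEV_DESTROY)
os.close(fd)
```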
Make full use of newly introduced sentinel values in audio queues.
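The sentinel pattern referred to here, sketched with hypothetical names (the queue wiring in the actual code may differ): the audio producer pushes None when recording stops, and the consumer treats it as end-of-utterance instead of guessing from timeouts.

```python
import queue

AUDIO_SENTINEL = None   # pushed by the recorder when the utterance ends

def drain_audio(audio_queue: queue.Queue, feed_chunk, finalize):
    """Consume audio chunks until the sentinel arrives, then finalize.

    feed_chunk and finalize are hypothetical callbacks into a transcription backend.
    """
    while True:
        chunk = audio_queue.get()
        if chunk is AUDIO_SENTINEL:
            finalize()          # flush partial results, emit the final text
            break
        feed_chunk(chunk)
```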
When I tried to use the OpenAI API functionality, it did not work because the audio was barely understandable. After listening to it, I was amazed at what the model could still decipher. There seems to be some missing normalization and conversion. Using this worked:
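The snippet from this comment isn't reproduced above; a rough sketch of the kind of normalization and conversion meant (float audio scaled and encoded as 16-bit PCM WAV at the recorded sample rate), assuming numpy and the standard-library wave module, might look like this:

```python
import io
import wave
import numpy as np

def to_wav_bytes(samples: np.ndarray, sample_rate: int = 16000) -> bytes:
    """Normalize float audio to [-1, 1] and encode it as 16-bit PCM WAV."""
    samples = samples.astype(np.float32)
    peak = float(np.max(np.abs(samples))) or 1.0
    pcm = (samples / peak * 32767).astype(np.int16)
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)        # 16-bit samples
        w.setframerate(sample_rate)
        w.writeframes(pcm.tobytes())
    return buf.getvalue()
```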
@go-run-jump Fixing the OAI API backend is tricky because I can't test it. But I tried saving audio both before and after conversion in the Faster Whisper backend, and both files sounded completely normal. Maybe the issue is with your microphone? I have two microphones: the one integrated in the laptop chassis records really shitty audio, especially when the fan is spinning fast, which is why I'm using a USB microphone. Also, you mentioned in another thread that the audio sounded faster than real time for you; perhaps that's because you forgot to specify the sample rate? Recording is done at 16 kHz by default, so if you replay it at 44 kHz it'll sound about 2.75x faster. Anyway, I added the changes you proposed. I hope it helps.
Hello @dariox1337, nice work on the rewrite of this already awesome project using an elegant architecture. Have you tried to package your fork as a single executable file using something like PyInstaller? If not, do you think it is possible to do so? Maybe as a first step it would be better to generate an executable that restricts the project to just the OAI API functionality and not local models, but I don't actually know whether that would be needed.
This is almost a different program that happens to use WhisperWriter assets. Not sure if you're interested in merging it, but so as not to be called ungrateful, I'm opening this pull request. My motivation was to add profiles (multiple transcription setups that can be switched with a dedicated shortcut), and while doing so, I decided to restructure the whole program flow. Here is the design doc that gives a high-level overview.
Key Features
P.S. 99.9% of the code is generated by AI.
UPDATE: Just for anyone interested, my fork now supports streaming transcription with Faster Whisper and VOSK (a new backend). The GUI has been updated to PyQt6, and the Python dependencies have been updated to support Python 3.12.