
Feature/real time stt #26

Closed
wants to merge 13 commits into from
4 changes: 4 additions & 0 deletions .gitignore
@@ -1,3 +1,7 @@
**/target
**/Cargo.lock
/.idea

# Vim
*.swp
*.swo
8 changes: 8 additions & 0 deletions CHANGELOG.md
@@ -8,6 +8,14 @@
* This also turns `WhisperContext` into an entirely immutable object, meaning it can be shared across threads and used concurrently, safely.
* Send feedback on these changes to the PR: https://github.com/tazz4843/whisper-rs/pull/33

# Version 0.5.0 (2023-03-27)
* Update convert_stereo_to_mono_audio to return a Result
  * Used to panic when the length of the provided slice was not a multiple of two.

# Version 0.4.0 (2023-02-08)

# Version 0.3.0 (2022-12-14)

# Version 0.2.0 (2022-10-28)
* Update upstream whisper.cpp to 2c281d190b7ec351b8128ba386d110f100993973.
* Fix breaking changes in update, which cascade to users:
2 changes: 1 addition & 1 deletion Cargo.toml
@@ -1,6 +1,6 @@
[workspace]
members = ["sys"]
exclude = ["examples/full_usage"]
exclude = ["examples/full_usage", "examples/stream"]

[package]
name = "whisper-rs"
14 changes: 14 additions & 0 deletions examples/stream/Cargo.toml
@@ -0,0 +1,14 @@
[package]
name = "stream"
version = "0.1.0"
edition = "2021"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
anyhow = "1.0.70"
cpal = "0.15.1"
itertools = "0.10.5"
ringbuf = "0.3.2"
webrtc-vad = "0.4"
whisper-rs = { path = "../.." }
282 changes: 282 additions & 0 deletions examples/stream/src/main.rs
@ynot01 ynot01 May 14, 2023

Testing this example file myself, all I could get was gibberish. After monkeying around with the code, I directed it to my headphones; the input seemed to be so high-pitched that it was indecipherable (seemingly a result of convert_stereo_to_mono_audio) and was being cut off instantly (perhaps intended?).

Removing convert_stereo_to_mono_audio and make_audio_louder did not resolve the gibberish issue; my best guess is that the cutting off is the culprit, but I do not know what causes it or how to fix it.
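If it helps narrow things down: as far as I can tell, convert_stereo_to_mono_audio just averages interleaved sample pairs, so feeding it audio that is already mono would halve the sample count and effectively double the playback speed, which would explain the high pitch. A minimal sketch of what I assume the helper does (stereo_to_mono is my name for it):

fn stereo_to_mono(samples: &[f32]) -> Result<Vec<f32>, &'static str> {
    // Interleaved stereo is [L, R, L, R, ...]; average each pair into one mono sample.
    if samples.len() % 2 != 0 {
        return Err("expected an even number of interleaved samples");
    }
    Ok(samples.chunks_exact(2).map(|p| (p[0] + p[1]) / 2.0).collect())
}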

Contributor Author

Interesting. By "cutting off" do you mean how it splits samples and transfers them over the ring buffer?
Where did you put the output audio stream? If you turned on the output stream that I left in the code, does it still sound like gibberish (before the signal goes through any processing)?

I was planning to remove convert_stereo_to_mono_audio as well, because whisper needs stereo data for diarization (when that becomes available).

Contributor Author
@bruskajp bruskajp May 15, 2023

P.S. I am working on a rewrite that removes the 5-second line break behavior, but I am only about halfway there.

@ynot01 ynot01

Interesting. By "cutting off" do you mean how it splits samples and transfers them over the ring buffer?

I can only describe it as "choppy"; however, after reproducing my alterations, I could not reproduce the cutting-off effect, which I took as a good thing. Unfortunately, Whisper still only took my words as gibberish, which leads me to believe I did not manage to reproduce whatever Whisper is hearing.

Where did you put the output audio stream? If you turned on the output stream that I left in the code, does it still sound like gibberish (before the signal goes through any processing)?

I used your leftover code, borrowed from https://github.com/RustAudio/cpal/blob/master/examples/feedback.rs to fill in some pieces (like pushing latency_samples worth of zeros to the producer), and removed the VAD code in the input callback, as it was causing input to "fall behind".
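Concretely, the priming step I borrowed looks roughly like this (latency_samples being the value the example already computes from LATENCY_MS):

// Fill the ring buffer with silence so the consumer starts latency_samples
// behind the producer, as in cpal's feedback example.
for _ in 0..latency_samples {
    producer.push(0.0).unwrap(); // buffer holds latency_samples * 2, so this should not fail
}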

@ynot01 ynot01

Does the latest commit of this PR work for you (and not produce gibberish)? It may be an OS issue, as I am running Windows 10.

@ynot01 ynot01

Unfortunately, it still does not seem to work on my machine; I made sure to check that it was using the correct recording device.

Contributor Author
@bruskajp bruskajp May 30, 2023

Interesting.
I'm going to assume this is a cpal issue with your hardware then.
Can you try running the cpal example here (by making your own project with a main.rs)?
https://github.com/RustAudio/cpal/blob/master/examples/ios-feedback/src/feedback.rs

Just check whether you can read in the audio and output it correctly. If that works, then I will try to figure out where it breaks between your audio input and the whisper code.

You will need to add a main function to the example code:
fn main() { run_example(); loop {} }

If this example doesn't work for you, you may want to try another microphone.
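Something like this should work as the wrapper project, assuming feedback.rs from that link is copied in next to main.rs (and that run_example is marked pub, if it isn't already):

// main.rs of a throwaway test project
mod feedback;

fn main() {
    let _ = feedback::run_example(); // builds and plays the input/output streams
    loop {} // keep the process (and the streams) alive
}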

@ynot01 ynot01

Sounds clear as day when I run the above project

Contributor Author

I created a branch on my repo that tries to output the sound AND classify at the same time.
Can you see if that works for you?
git clone https://github.com/bruskajp/whisper-rs.git
git checkout bug/ynot01

Also, can you try using a stronger model, or take the line let samples = make_audio_louder(&samples, 1.0); and use a value of 2.0? Maybe the audio is not loud enough for the model.
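In case it's missing from your checkout, make_audio_louder is meant to be nothing more than a clamped gain multiply, roughly:

fn make_audio_louder(samples: &[f32], gain: f32) -> Vec<f32> {
    // Scale every sample, clamping back into the valid [-1.0, 1.0] range.
    samples.iter().map(|s| (s * gain).clamp(-1.0, 1.0)).collect()
}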

Also, I just want to make sure (though I assume it is true) that you can properly run the full_usage example, correct?
Because if that doesn't work, then there is no way that this works.

@ynot01 ynot01 May 30, 2023

I created a branch on my repo that tries to output the sound AND classify at the same time. Can you see if that works for you? git clone https://github.com/bruskajp/whisper-rs.git git checkout bug/ynot01

ggml-tiny.en.bin results in gibberish (and eventually leads to a "half-sample" error)

Starting the input and output streams with `3000` milliseconds of latency.
 [impressive noise] have a explanation. [alingk] (humming)
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: "The stereo audio vector has an odd number of samples. This means a half-sample is missing somewhere"', src\main.rs:152:74

larger models (with sufficient latency, which appears to require around 15+ seconds) result in:

Starting the input and output streams with `20000` milliseconds of latency.
 [silence]
thread 'main' panicked at 'Classification got behind. It took to long. Try using a smaller model and/or more threads', src\main.rs:146:13
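If the half-sample panic really is pop_iter() draining the ring buffer in the middle of a stereo frame, a hypothetical guard that trims a trailing odd sample before the conversion would avoid it:

let samples: Vec<f32> = consumer.pop_iter().collect();
// Drop a trailing half-frame so the interleaved sample count stays even.
let even_len = samples.len() - (samples.len() % 2);
let samples = whisper_rs::convert_stereo_to_mono_audio(&samples[..even_len]).unwrap();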

Also, can you try using a stronger model, or take the line let samples = make_audio_louder(&samples, 1.0); and use a value of 2.0? Maybe the audio is not loud enough for the model.

In the supplied branch and in feature/RealTimeSTT, the dependent function make_audio_louder is missing

Also, I just want to make sure (though I assume it is true) that you can properly run the full_usage example, correct? Because if that doesn't work, then there is no way that this works.

Recording my own voice results in a clean output with the full_usage project.

Processor - Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz, 4008 Mhz, 4 Core(s), 8 Logical Processor(s)
GPU - NVIDIA GeForce GTX 1070
Installed Physical Memory (RAM) - 16.0 GB

@@ -0,0 +1,282 @@
#![allow(clippy::uninlined_format_args)]

//! Feeds back the input stream directly into the output stream.
//!
//! Assumes that the input and output devices can use the same stream configuration and that they
//! support the f32 sample format.
//!
//! Uses a delay of `LATENCY_MS` milliseconds in case the default input and output streams are not
//! precisely synchronised.

extern crate anyhow;
extern crate cpal;
extern crate ringbuf;

use cpal::traits::{DeviceTrait, HostTrait, StreamTrait};
use ringbuf::{LocalRb, Rb, SharedRb};
use std::path::Path;
use whisper_rs::{FullParams, SamplingStrategy, WhisperContext, WhisperToken};

use std::io::Write;
use std::time::{Duration, Instant};
use std::{cmp, thread};

const LATENCY_MS: f32 = 5000.0;
const NUM_ITERS: usize = 2;
const NUM_ITERS_SAVED: usize = 2;

// TODO: JPB: Add clean way to exit besides ctrl+C (which sometimes doesn't work)
// TODO: JPB: Make sure this works with other LATENCY_MS, NUM_ITERS, and NUM_ITERS_SAVED
// TODO: JPB: I think there is an issue where it doesn't compute fast enough and so it loses data

pub fn run_example() -> Result<(), anyhow::Error> {
    let host = cpal::default_host();

    // Default devices.
    let input_device = host
        .default_input_device()
        .expect("failed to get default input device");
    //let output_device = host
    //    .default_output_device()
    //    .expect("failed to get default output device");
    println!("Using default input device: \"{}\"", input_device.name()?);
    //println!("Using default output device: \"{}\"", output_device.name()?);

    // Top level variables
    let config: cpal::StreamConfig = input_device.default_input_config()?.into();
    let latency_frames = (LATENCY_MS / 1_000.0) * config.sample_rate.0 as f32;
    // Interleaved sample count, e.g. 48 kHz stereo at LATENCY_MS = 5000 gives 240_000 frames = 480_000 samples
    let latency_samples = latency_frames as usize * config.channels as usize;
    let sampling_freq = config.sample_rate.0 as f32 / 2.0; // TODO: JPB: Divide by 2 because of stereo to mono

    // The buffer to share samples
    let ring = SharedRb::new(latency_samples * 2);
    let (mut producer, mut consumer) = ring.split();

    // Setup microphone callback
    let input_data_fn = move |data: &[f32], _: &cpal::InputCallbackInfo| {
        let mut output_fell_behind = false;
        for &sample in data {
            if producer.push(sample).is_err() {
                output_fell_behind = true;
            }
        }
        if output_fell_behind {
            eprintln!("output stream fell behind: try increasing latency");
        }
    };

    //let output_data_fn = move |data: &mut [f32], _: &cpal::OutputCallbackInfo| {
    //    let mut input_fell_behind = None;
    //    for sample in data {
    //        *sample = match consumer.pop() {
    //            Some(s) => s,
    //            None => {
    //                input_fell_behind = Some("");
    //                0.0
    //            }
    //        };
    //    }
    //    if let Some(err) = input_fell_behind {
    //        eprintln!(
    //            "input stream fell behind: {:?}: try increasing latency",
    //            err
    //        );
    //    }
    //};

    // Setup whisper
    let arg1 = std::env::args()
        .nth(1)
        .expect("First argument should be path to Whisper model");
    let whisper_path = Path::new(&arg1);
    if !whisper_path.exists() || !whisper_path.is_file() {
        panic!("expected a path to a whisper model file")
    }
    let ctx = WhisperContext::new(&whisper_path.to_string_lossy()).expect("failed to open model");
    let mut state = ctx.create_state().expect("failed to create state");

    // Variables used across loop iterations
    let mut iter_samples = LocalRb::new(latency_samples * NUM_ITERS * 2);
    let mut iter_num_samples = LocalRb::new(NUM_ITERS);
    let mut iter_tokens = LocalRb::new(NUM_ITERS_SAVED);
    for _ in 0..NUM_ITERS {
        iter_num_samples
            .push(0)
            .expect("Error initializing iter_num_samples");
    }

    // Build streams.
    println!(
        "Attempting to build both streams with f32 samples and `{:?}`.",
        config
    );
    println!("Setup input stream");
    let input_stream = input_device.build_input_stream(&config, input_data_fn, err_fn, None)?;
    //println!("Setup output stream");
    //let output_stream = output_device.build_output_stream(&config, output_data_fn, err_fn, None)?;
    println!("Successfully built streams.");

    // Play the streams.
    println!(
        "Starting the input and output streams with `{}` milliseconds of latency.",
        LATENCY_MS
    );
    input_stream.play()?;
    //output_stream.play()?;

    // Remove the initial samples
    consumer.pop_iter().count();
    let mut start_time = Instant::now();

    // Main loop
    // TODO: JPB: Make this its own function (and the lines above it)
    let mut num_chars_to_delete = 0;
    let mut loop_num = 0;
    let mut words = "".to_owned();
    loop {
        loop_num += 1;

        // Only run every LATENCY_MS
        let duration = start_time.elapsed();
        let latency = Duration::from_millis(LATENCY_MS as u64);
        if duration < latency {
            let sleep_time = latency - duration;
            thread::sleep(sleep_time);
        } else {
            panic!("Classification got behind. It took too long. Try using a smaller model and/or more threads");
        }
        start_time = Instant::now();

        // Collect the samples
        let samples: Vec<_> = consumer.pop_iter().collect();
        let samples = whisper_rs::convert_stereo_to_mono_audio(&samples).unwrap();
        //let samples = make_audio_louder(&samples, 1.0);
        let num_samples_to_delete = iter_num_samples
            .push_overwrite(samples.len())
            .expect("Error: num_samples_to_delete is off");
        for _ in 0..num_samples_to_delete {
            iter_samples.pop();
        }
        iter_samples.push_iter(&mut samples.into_iter());
        let (head, tail) = iter_samples.as_slices();
        let current_samples = [head, tail].concat();

        // Get tokens to be deleted
        if loop_num > 1 {
            let num_tokens = state.full_n_tokens(0)?;
            let token_time_end = state.full_get_segment_t1(0)?;
            let token_time_per_ms =
                token_time_end as f32 / (LATENCY_MS * cmp::min(loop_num, NUM_ITERS) as f32); // token times are not in ms; they tick 150 times per second
            let ms_per_token_time = 1.0 / token_time_per_ms;

            let mut tokens_saved = vec![];
            // Skip the beginning and end tokens
            for i in 1..num_tokens - 1 {
                let token = state.full_get_token_data(0, i)?;
                let token_t0_ms = token.t0 as f32 * ms_per_token_time;
                let ms_to_delete = num_samples_to_delete as f32 / (sampling_freq / 1000.0);

                // Save tokens for whisper context
                if (loop_num > NUM_ITERS) && token_t0_ms < ms_to_delete {
                    tokens_saved.push(token.id);
                }
            }
            num_chars_to_delete = words.chars().count();
            if loop_num > NUM_ITERS {
                num_chars_to_delete -= tokens_saved
                    .iter()
                    .map(|x| ctx.token_to_str(*x).expect("Error"))
                    .collect::<String>()
                    .chars()
                    .count();
            }
            iter_tokens.push_overwrite(tokens_saved);
            //println!();
            //println!(
            //    "TOKENS_SAVED : {}",
            //    tokens_saved
            //        .iter()
            //        .map(|x| ctx.token_to_str(*x).unwrap())
            //        .collect::<Vec<_>>()
            //        .join("")
            //);
            //println!(
            //    "CHARS_DELETED: {}",
            //    words[words.len() - num_chars_to_delete..].to_owned()
            //);
            //println!(
            //    "ITER_TOKENS  : {}",
            //    iter_tokens
            //        .iter()
            //        .flatten()
            //        .map(|x| ctx.token_to_str(*x).unwrap())
            //        .collect::<Vec<_>>()
            //        .join("")
            //);
            //println!("WORDS        : {}", words);
            //println!("NUM_CHARS    : {} {}", words.len(), num_chars_to_delete);
        }

        // Make the model params
        let (head, tail) = iter_tokens.as_slices();
        let tokens = [head, tail]
            .concat()
            .into_iter()
            .flatten()
            .collect::<Vec<WhisperToken>>();
        let mut params = gen_whisper_params();
        params.set_tokens(&tokens);

        // Run the model
        state
            .full(params, &current_samples)
            .expect("failed to run the model");

        // Update the words on screen
        if num_chars_to_delete != 0 {
            // TODO: JPB: Potentially unneeded if statement
            print!(
                "\x1B[{}D{}\x1B[{}D",
                num_chars_to_delete,
                " ".repeat(num_chars_to_delete),
                num_chars_to_delete
            );
        }
        let num_tokens = state.full_n_tokens(0)?;
        words = (1..num_tokens - 1)
            .map(|i| state.full_get_token_text(0, i).expect("Error"))
            .collect::<String>();
        print!("{}", words);
        std::io::stdout().flush().unwrap();
    }
}

fn gen_whisper_params<'a>() -> FullParams<'a, 'a> {
    let mut params = FullParams::new(SamplingStrategy::default());
    params.set_print_progress(false);
    params.set_print_special(false);
    params.set_print_realtime(false);
    params.set_print_timestamps(false);
    params.set_suppress_blank(true);
    params.set_language(Some("en"));
    params.set_token_timestamps(true);
    params.set_duration_ms(LATENCY_MS as i32);
    params.set_no_context(true);
    //params.set_n_threads(4);

    //params.set_no_speech_thold(0.3);
    //params.set_split_on_word(true);

    // This impacts token times, don't use
    //params.set_single_segment(true);

    params
}

fn err_fn(err: cpal::StreamError) {
    eprintln!("an error occurred on stream: {}", err);
}

fn main() -> Result<(), anyhow::Error> {
    run_example()
}
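For anyone trying this example: the binary takes the model path as its first argument, so (with a downloaded ggml model at a path of your choosing, the one below is hypothetical) it can be run from examples/stream with:

cargo run --release -- /path/to/ggml-base.en.bin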