StyleTTS: Reference Audio ineffective #248

Open
andrewhowdencom opened this issue Jan 12, 2025 · 0 comments

Hello! I hope you're doing well.

I am trying to get the StyleTTS model working, but cannot for the life of me get the output to change at all based on the "reference audio". Without getting into too many details, the relevant snippet is here:

        # Check that the file in the voices path exists before starting the model. Otherwise, the model
        # somehow falls back to a "default voice".
        p = os.path.join(self.reference_audio_path, f"{voice}.wav")
        if not os.path.isfile(p):
            raise FileNotFoundError(f"Voice file not found: {p}")

        # Use a separate name so the voice object does not shadow the `voice` name string used above.
        styletts_voice = StyleTTSVoice(
            model_config_path=self.model_config_path,
            model_checkpoint_path=self.model_checkpoint_path,
            ref_audio_path=p,
        )

        engine = StyleTTSEngine(
            style_root=self.styletts_checkout,
            diffusion_steps=15,
            voice=styletts_voice,
        )

        stream = TextToAudioStream(
            engine=engine,
            tokenizer="stanza",
            muted=True,
        )

        with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp_wav:  # Create temp file for result
            stream.feed(_split_paragraph(sys.stdin.read())).play(
                output_wavfile=tmp_wav.name,
            )

        engine.shutdown()

        # Convert the temporary wav into the requested output. The output filename (`to`) is passed directly to
        # ffmpeg, so its extension implicitly lets the user choose the format (e.g. mp3).
        (
            ffmpeg.input(tmp_wav.name)
                .output(to)
                .overwrite_output()
                .run()
        )
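
The voice-file check at the top of the snippet can be pulled into a small standalone helper, which makes it easy to print and confirm exactly which file ends up in `ref_audio_path`. This is a minimal sketch using only the standard library; `resolve_voice_path` and its parameter names are hypothetical and not part of any TTS library:

```python
from pathlib import Path


def resolve_voice_path(reference_audio_dir: str, voice: str) -> Path:
    """Return the absolute path to `<voice>.wav`, or raise if the file is missing.

    Hypothetical helper for illustration; it mirrors the check in the snippet above.
    """
    path = (Path(reference_audio_dir) / f"{voice}.wav").resolve()
    if not path.is_file():
        raise FileNotFoundError(f"Voice file not found: {path}")
    return path
```

Logging the returned path just before constructing the voice object would rule out a wrong directory or a mistyped voice name as the cause.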

Is there something obvious I am missing? It all seems to work similarly to another StyleTTS version (NeuralVox), but I can't quite connect the dots here.

Is there a working example of this?
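
One thing that may be worth ruling out: in the upstream StyleTTS2 inference notebooks, the style vector used for synthesis is a blend of the style computed from the reference audio and a style sampled from the text, weighted by `alpha` (timbre half of the vector) and `beta` (prosody half). Values close to 1 mostly ignore the reference. The sketch below reproduces only that blending arithmetic in plain Python. `blend_style` is a hypothetical name, the defaults of 0.3 and 0.7 are the ones used in the upstream notebooks, and whether `StyleTTSEngine` applies or exposes the same weights is an assumption that has not been confirmed here:

```python
def blend_style(ref_style, predicted_style, alpha=0.3, beta=0.7, split=128):
    """Blend a reference style vector with a text-predicted one.

    The first `split` entries (timbre) are weighted by alpha, the rest
    (prosody) by beta. A weight of 0 uses only the reference audio;
    a weight of 1 uses only the text-predicted style.
    """
    if len(ref_style) != len(predicted_style):
        raise ValueError("style vectors must have the same length")
    timbre = [
        alpha * p + (1 - alpha) * r
        for p, r in zip(predicted_style[:split], ref_style[:split])
    ]
    prosody = [
        beta * p + (1 - beta) * r
        for p, r in zip(predicted_style[split:], ref_style[split:])
    ]
    return timbre + prosody
```

If the engine does accept such weights, rendering the same text with both set to 0 and then to 1 should show whether the reference file is being used at all.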
