Streamline PiperEngine.synthesize to allow use of medium- and high-quality models #244

Open · wants to merge 1 commit into master

Conversation

@InspectorCaracal commented Jan 7, 2025

The current implementation requires writing a temporary .wav file and forces the sample rate to 16000 via validation of that file. However:

  • The piper command, run through subprocess.run, can return the raw audio data the player needs, so the file write isn't necessary.
  • Most models are medium-quality and therefore output at 22050 Hz, so restricting the sample rate to 16000 is extremely limiting.
  • The model configs already indicate the sample rate to be used for playback, which the existing playback system handles correctly with no additional effort.

This PR modifies the synthesize method so that the file-output parameter given to Piper becomes raw output that can be added directly to the queue, and removes the WAV-file validation, since reading a WAV file is no longer necessary. The change is primarily intended to allow using all sizes of piper voices, but should also reduce I/O overhead.
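
For reference, the raw-output approach looks roughly like the sketch below. This is a simplification rather than the exact code in the commit, and the flag name --output_raw as well as the attribute and queue names are assumptions:

    import subprocess

    def synthesize(self, text: str) -> bool:
        # Ask piper for raw 16-bit PCM on stdout instead of writing a temp .wav.
        cmd = [
            self.piper_path,              # path to the piper executable
            "--model", self.model_path,   # hypothetical attribute names
            "--output_raw",               # raw audio to stdout (flag name assumed)
        ]
        result = subprocess.run(cmd, input=text.encode("utf-8"), capture_output=True)
        if result.returncode != 0:
            return False
        # The raw PCM bytes go straight onto the playback queue.
        self.queue.put(result.stdout)
        return True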

@KoljaB (Owner) commented Jan 7, 2025

That's a great simplification of the code, thank you for that!

I wasn't aware that piper does not always synthesize at 16000 Hz. This leaves one issue: every engine reports its exact sample rate by implementing get_stream_info so that the output stream gets initialized properly.

Currently this is still hardcoded to 16000 in the PiperEngine, like this:

    def get_stream_info(self):
        """
        Returns PyAudio stream configuration for Piper.

        Returns:
            tuple: (format, channels, rate)
        """
        return pyaudio.paInt16, 1, 16000

By removing the writing of the wav file, we lose the opportunity to read that sample rate directly from the wav.

I'd love to hear your opinion. How would you suggest we handle this? We could keep writing the wav file, which, as you correctly stated, introduces I/O overhead and is therefore ugly. We could also add a parameter to the PiperEngine constructor allowing the user to customize the sample rate. Also not perfect, since it requires active interaction.

Maybe there's a third option that I can't figure out right now. What do you think? Thanks again for the PR.
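
For illustration, the constructor-parameter option might look something like the following minimal sketch; the default value and attribute names here are assumptions, not the actual engine code:

    import pyaudio

    class PiperEngine:
        def __init__(self, sample_rate: int = 16000):
            # The caller would have to pass the rate matching their voice:
            # 16000 for low-quality models, 22050 for medium/high quality.
            self.sample_rate = sample_rate

        def get_stream_info(self):
            return pyaudio.paInt16, 1, self.sample_rate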

@InspectorCaracal (Author)

Ahh, interesting! I didn't notice that since I only tested that the final audio wasn't being played at the wrong speed.

Since piper requires a configuration file in JSON format and has a documented fallback location for when it isn't specified, the best way is probably reading that file and integrating the needed fields into the PiperVoice class, then referencing it from there. Should be straightforward; I'll try it out in a bit.
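
Something along the lines of this rough sketch; the config path convention (the .onnx.json file shipped next to the model) follows the piper voice downloads, while the attribute names are assumptions:

    import json
    import pyaudio

    def get_stream_info(self):
        """
        Returns PyAudio stream configuration for Piper.

        Returns:
            tuple: (format, channels, rate)
        """
        # Piper voice configs ship alongside the model, e.g.
        # en_US-lessac-medium.onnx.json, and include the playback sample rate.
        with open(f"{self.voice.model_file}.json", "r", encoding="utf-8") as f:
            config = json.load(f)
        sample_rate = config["audio"]["sample_rate"]  # e.g. 22050 for medium models
        return pyaudio.paInt16, 1, sample_rate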
