Order between UserStoppedSpeakingFrame from VAD and TranscriptionFrame from TTS #859

Open
zizhong opened this issue Dec 14, 2024 · 1 comment

Comments

zizhong commented Dec 14, 2024

Description

Our logic previously assumed that UserStoppedSpeakingFrame arrives after TranscriptionFrame. When the user stopped speaking, we collected the text from all TranscriptionFrames received so far and treated UserStoppedSpeakingFrame as the turn indicator.

However, I have found that this no longer holds: UserStoppedSpeakingFrame can arrive before TranscriptionFrame.
Is this a bug or expected behavior? If it is expected, what is the recommended turn indicator?

If reporting a bug, please fill out the following:

Environment

  • pipecat-ai version: 0.0.50
  • python version: 3.10
  • OS: ubuntu

Issue description

UserStoppedSpeakingFrame can come before TranscriptionFrame.

Repro steps

Pipeline with

  • FastAPIWebsocketTransport + SileroVADAnalyzer
  • STT service

Expected behavior

UserStoppedSpeakingFrame arrives after the TranscriptionFrames for the utterance, so it can be used as a turn indicator.

Actual behavior

UserStoppedSpeakingFrame can arrive before TranscriptionFrame, so it cannot be used as a turn indicator on its own.

Logs

chadbailey59 (Contributor) commented

The Context Aggregators should help solve this problem. They're used in many examples, including the interruptible example.

The logic of how they handle UserStoppedSpeakingFrames and TranscriptionFrames is explained in this comment. Are you using these aggregators, or something else?
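
For anyone not using the aggregators, here is a minimal sketch of order-independent turn detection. It is not pipecat's implementation and does not import pipecat: the event classes and `TurnAggregator` below are hypothetical stand-ins for the frames discussed in this issue. The idea is that a turn is complete once both a stop event and at least one transcription have been seen, whichever comes first.

```python
from dataclasses import dataclass
from typing import List, Optional


# Hypothetical stand-ins for the frames named in this issue (not pipecat classes).
@dataclass
class UserStartedSpeaking:
    pass


@dataclass
class UserStoppedSpeaking:
    pass


@dataclass
class Transcription:
    text: str


class TurnAggregator:
    """Emits the user's turn text regardless of whether the stop event
    arrives before or after the transcription."""

    def __init__(self) -> None:
        self._parts: List[str] = []
        self._stopped = False  # stop seen, still waiting for text

    def process(self, event: object) -> Optional[str]:
        """Returns the completed turn text, or None if the turn is not done."""
        if isinstance(event, UserStartedSpeaking):
            # A new utterance begins; forget any pending stop.
            self._stopped = False
            return None
        if isinstance(event, UserStoppedSpeaking):
            if self._parts:
                # Text arrived first: the stop event closes the turn.
                return self._flush()
            # Stop arrived first: wait for the transcription.
            self._stopped = True
            return None
        if isinstance(event, Transcription):
            self._parts.append(event.text)
            if self._stopped:
                # Stop was already seen: this text closes the turn.
                return self._flush()
            return None
        return None

    def _flush(self) -> str:
        text = " ".join(self._parts)
        self._parts = []
        self._stopped = False
        return text
```

Feeding it `UserStartedSpeaking`, `Transcription`, `UserStoppedSpeaking` yields the turn on the stop event; feeding it `UserStartedSpeaking`, `UserStoppedSpeaking`, `Transcription` yields the turn on the transcription. One limitation of this sketch: if the STT service sends several transcriptions after the stop event, only the first is attached to that turn, so a real pipeline would probably also need a short timeout before closing the turn.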
