Twilio Audio Output Causing Broken Audio Chunks #826

saitharunsai opened this issue Dec 11, 2024 · 0 comments

saitharunsai commented Dec 11, 2024

Description

Bug Report: Malformed Twilio Audio Output Causing Broken Audio Chunks

Environment

  • pipecat-ai version: 0.0.50
  • python version: 3.11.10
  • OS: macOS (Apple M2)

Issue description

The audio output sent back to Twilio is malformed, resulting in broken audio chunks during pipeline processing. This appears to happen in a WebSocket-based voice pipeline that integrates Twilio, Deepgram (for STT/TTS), and OpenAI's GPT-4o for conversation processing.

Repro steps

  1. Initialize a WebSocket connection with Twilio stream_sid
  2. Set up the voice pipeline with the following components:
    • FastAPIWebsocketTransport with TwilioFrameSerializer
    • SileroVADAnalyzer for voice activity detection
    • DeepgramSTTService for speech-to-text
    • OpenAILLMService for conversation processing
    • DeepgramTTSService for text-to-speech
  3. Start the pipeline with an initial system message
  4. Begin audio streaming

Expected behavior

  • Audio chunks should be properly formed and maintained throughout the pipeline
  • Smooth audio streaming without breaks or malformation
  • Proper serialization of audio frames by TwilioFrameSerializer

Actual behavior

Audio chunks are breaking during processing, suggesting potential issues with one or more of the following:

  • Frame serialization in TwilioFrameSerializer
  • Audio buffer handling in the WebSocket transport
  • VAD passthrough configuration
  • Sample rate or encoding mismatches (currently set to 8000 Hz mulaw); a quick check for this is sketched below
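To rule out malformed outbound frames, here is a minimal sketch (not part of the original report) that inspects outgoing Twilio media messages, assuming the standard Twilio Media Streams format (JSON messages whose `media.payload` is base64-encoded 8 kHz mulaw). The `check_twilio_media_message` helper is hypothetical:

```python
import base64
import json

# Inbound Twilio media arrives as 20 ms chunks of 8 kHz mulaw (160 bytes);
# empty or oddly sized outbound payloads are a cheap thing to flag, though
# Twilio itself does not require a fixed outbound chunk size.
MULAW_20MS_BYTES = 160


def check_twilio_media_message(raw_message: str) -> None:
    """Hypothetical debugging helper: flag media payloads that look malformed."""
    message = json.loads(raw_message)
    if message.get("event") != "media":
        return
    payload = base64.b64decode(message["media"]["payload"])
    if len(payload) == 0 or len(payload) % MULAW_20MS_BYTES != 0:
        print(f"suspicious media payload: {len(payload)} bytes")
```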

Code

```python
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.frames.frames import EndFrame, LLMMessagesFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
from pipecat.serializers.twilio import TwilioFrameSerializer
from pipecat.services.deepgram import DeepgramSTTService, DeepgramTTSService
from pipecat.services.openai import OpenAILLMService
from pipecat.transports.network.fastapi_websocket import (
    FastAPIWebsocketParams,
    FastAPIWebsocketTransport,
)

from app.core.config import settings


async def run_twilio_bot(websocket_client, stream_sid):
    transport = FastAPIWebsocketTransport(
        websocket=websocket_client,
        params=FastAPIWebsocketParams(
            audio_out_enabled=True,
            add_wav_header=False,
            vad_enabled=True,
            vad_analyzer=SileroVADAnalyzer(),
            vad_audio_passthrough=True,
            serializer=TwilioFrameSerializer(stream_sid),
        ),
    )

    llm = OpenAILLMService(api_key=settings.OPENAI_API_KEY, model="gpt-4o")

    stt = DeepgramSTTService(api_key=settings.DEEPGRAM_API_KEY)

    tts = DeepgramTTSService(
        api_key=settings.DEEPGRAM_API_KEY,
    )

    messages = [
        {
            "role": "system",
            "content": "You are a helpful LLM in an audio call. Your goal is to demonstrate your capabilities in a succinct way. Your output will be converted to audio so don't include special characters in your answers. Respond to what the user said in a creative and helpful way.",
        },
    ]

    context = OpenAILLMContext(messages)
    context_aggregator = llm.create_context_aggregator(context)

    pipeline = Pipeline(
        [
            transport.input(),  # Websocket input from client
            stt,  # Speech-To-Text
            context_aggregator.user(),
            llm,  # LLM
            tts,  # Text-To-Speech
            transport.output(),  # Websocket output to client
            context_aggregator.assistant(),
        ]
    )

    task = PipelineTask(pipeline, params=PipelineParams(enable_metrics=True))

    @transport.event_handler("on_client_connected")
    async def on_client_connected(transport, client):
        # Kick off the conversation.
        messages.append(
            {"role": "system", "content": "Please introduce yourself to the user."}
        )
        await task.queue_frames([LLMMessagesFrame(messages)])

    @transport.event_handler("on_client_disconnected")
    async def on_client_disconnected(transport, client):
        await task.queue_frames([EndFrame()])

    runner = PipelineRunner(handle_sigint=True)

    await runner.run(task)
```
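The WebSocket endpoint that obtains stream_sid (repro step 1) is not shown above. A minimal sketch of how it might be wired up with FastAPI, assuming the standard Twilio Media Streams handshake (a "connected" event followed by a "start" event carrying the streamSid); the route path and endpoint name are hypothetical:

```python
import json

from fastapi import FastAPI, WebSocket

app = FastAPI()


@app.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket):
    await websocket.accept()
    # Twilio sends a "connected" event first, then a "start" event
    # that carries the streamSid used by TwilioFrameSerializer.
    await websocket.receive_text()  # "connected" event
    start_message = json.loads(await websocket.receive_text())  # "start" event
    stream_sid = start_message["start"]["streamSid"]
    await run_twilio_bot(websocket, stream_sid)
```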

To help diagnose this issue, could you provide:

Logs
