When allow_interruptions is set to True, the bot may omit some sentences in the current turn and speak sentences left over from the previous turn. #754
Comments
I have found a reliable way to trigger this bug. If you speak and interrupt the bot just as the TTS module is processing speech chunks, the sentence being processed at that moment is the part played in the next turn. I think the reason may be the
This looks related to the work being done here: #721
Thank you very much for your guidance. I tried updating the code and running my program, but it didn't work. Your insights are very valuable, and I will try to find a solution to the problem.
I tried several approaches. My initial guess was that the TTS might not stop immediately after an interruption. To check this, I added logging to the TTS code to print every frame the TTS returns. The result was not what I expected: after an interruption occurs, my logs stop printing, which means no subsequent frames are being pushed. Here's where I set up the logs:

```python
class FilterAzureTTSService(AzureTTSService):
    async def run_tts(self, text: str) -> AsyncGenerator[Frame, None]:
        try:
            # Marker frame announcing the text that is about to be synthesized
            voice_text_frame = VoiceEndTextFrame(
                b'', self._settings["sample_rate"], 1, f"###VOICEPARTEND###{text}")
            await self.push_frame(voice_text_frame)
            async for frame in super().run_tts(text):
                # Yield each frame to the caller
                yield frame
                logger.debug(f"azure frame: {frame}, {frame.id}")
            # Marker frame signalling the end of this piece of speech
            voice_text_frame = VoiceEndTextFrame(
                b'', self._settings["sample_rate"], 1, "###VOICEPARTEND###")
            await self.push_frame(voice_text_frame)
        except asyncio.CancelledError:
            # Handle cleanup if necessary
            logger.debug("run_tts was cancelled.")
            raise  # Re-raise to handle or terminate properly
```

This is quite puzzling, because the audio being generated is indeed sent to the frontend and played in the next turn. Here's an example: in the previous turn, "I am happy to see you" is being generated, but I interrupted it during the TTS audio generation process. At this point, the
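One general asyncio point may be relevant to the logging result above: a log statement placed after `yield` inside an async generator only runs when the consumer asks for the next item. If the consumer stops iterating (for example because of an interruption), the logs go quiet even if something upstream keeps producing audio. The sketch below illustrates only that Python behavior. The `producer` coroutine and the queue are hypothetical stand-ins and are not taken from the Azure SDK or from this project's code.

```python
import asyncio


async def producer(queue: asyncio.Queue, produced: list) -> None:
    # Hypothetical stand-in for an upstream source that keeps delivering audio.
    for i in range(5):
        chunk = f"chunk-{i}"
        produced.append(chunk)
        await queue.put(chunk)
        await asyncio.sleep(0)  # give the consumer a chance to run


async def frames(queue: asyncio.Queue, logged: list):
    while True:
        chunk = await queue.get()
        yield chunk
        # This line runs only when the consumer requests the next item.
        logged.append(chunk)


async def demo():
    queue: asyncio.Queue = asyncio.Queue()
    produced, logged, consumed = [], [], []
    task = asyncio.create_task(producer(queue, produced))
    gen = frames(queue, logged)
    async for chunk in gen:
        consumed.append(chunk)
        if len(consumed) == 2:
            break  # simulated interruption: the consumer stops iterating
    await gen.aclose()
    await task  # the producer finishes regardless of the interruption
    leftover = []
    while not queue.empty():
        leftover.append(queue.get_nowait())
    return produced, logged, consumed, leftover


if __name__ == "__main__":
    produced, logged, consumed, leftover = asyncio.run(demo())
    print("logged:", logged)
    print("left in queue:", leftover)
```

In this toy setup the log list stops growing at the interruption while unconsumed chunks remain in the queue, so silent logs alone do not prove that nothing was produced after the interruption.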
@fiyen what transport are you using?
```python
async def run_pipeline(websocket: WebSocket):
    conn = aiohttp.TCPConnector(ssl=ssl_context)
    async with aiohttp.ClientSession(connector=conn) as session:
        transport = FastAPIWebsocketTransport(
            websocket=websocket,
            params=FastAPIWebsocketParams(
                audio_out_enabled=True,
                add_wav_header=False,
                vad_enabled=True,
                vad_analyzer=SileroVADAnalyzer(params=VADParams(
                    confidence=0.8,
                    start_secs=0.2,
                    stop_secs=0.4,
                    min_volume=0.7
                )),
                audio_out_sample_rate=16000,
                audio_in_sample_rate=16000,
                vad_audio_passthrough=True,
                serializer=ProtobufFrameSerializer(),
                audio_in_filter=NoisereduceFilter()
            )
        )
        llm = OpenAILLMService(
            api_key=os.getenv("OPENAI_APIKEY"),
            base_url=os.getenv("OPENAI_BASEURL"),
            model="fast-gpt-4o-mini",
            params=OpenAILLMService.InputParams(
                frequency_penalty=0.5,  # reduce word repetition
                presence_penalty=0,  # keep the topic consistent
                seed=31,  # fixed seed to avoid random variation between sessions
                temperature=0.3,  # reduce randomness
                top_p=0.8,
            ))
        tools = []
        for tool_name in tool_list:
            tool = tool_box.get_tool(tool_name)
            if tool_name == 'fetch_next_sentence':
                tool.add_tool_func(fetch_next_sentence_from_api)
            if tool is not None:
                llm.register_function(
                    None,
                    tool.tool_func,
                    start_callback=tool.tool_start_callback)
                tools.append(tool.tool_param)
        stt = AzureSTTService(
            api_key=os.getenv("AZURE_SPEECH_API"),
            region=os.getenv("AZURE_SPEECH_REGION"),
            sample_rate=16000
        )
        tts = FilterAzureTTSService(
            aiohttp_session=session,
            api_key=os.getenv("AZURE_SPEECH_API"),
            region=os.getenv("AZURE_SPEECH_REGION"),
            voice="en-US-EmmaMultilingualNeural",
            sample_rate=16000,
        )
        messages = []
        # avt = AudioVolumeTimer()
        # tl = TranscriptionTimingLogger(avt)
        context = OpenAILLMContext(messages, tools, tool_choice='auto')
        context_aggregator = llm.create_context_aggregator(context)
        pipeline = Pipeline([
            transport.input(),  # Websocket input from client
            # avt,  # Audio volume timer
            stt,  # Speech-To-Text
            # tl,  # Transcription timing logger
            context_aggregator.user(),
            llm,  # LLM
            tts,  # Text-To-Speech
            context_aggregator.assistant(),
            transport.output(),  # Websocket output to client
        ])
        task = PipelineTask(
            pipeline,
            PipelineParams(
                allow_interruptions=True
            ))
```
@aconchillo you might want to take a look.
Switching from Azure TTS to another TTS service, such as ElevenLabs or Deepgram, makes the problem go away. So the problem is probably caused by the Azure TTS module.
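For anyone who has to stay on Azure TTS in the meantime, one generic way to keep audio from an interrupted turn out of the next turn is to tag every audio chunk with a turn counter and drop chunks whose tag is stale. The sketch below is only an illustration of that idea. `AudioChunk` and `TurnGate` are hypothetical names, not part of this project's API, and this is not the fix adopted by the maintainers.

```python
from dataclasses import dataclass


@dataclass
class AudioChunk:
    turn_id: int  # turn during which synthesis of this chunk was requested
    data: bytes


class TurnGate:
    """Hypothetical helper that rejects audio from interrupted turns."""

    def __init__(self) -> None:
        self.current_turn = 0

    def interrupt(self) -> int:
        # Call on every user interruption; older chunks become stale.
        self.current_turn += 1
        return self.current_turn

    def accept(self, chunk: AudioChunk) -> bool:
        # Only chunks requested during the current turn may be played.
        return chunk.turn_id == self.current_turn


def playable(gate: TurnGate, chunks: list) -> list:
    # Filter a batch of chunks right before they are sent to the client.
    return [c.data for c in chunks if gate.accept(c)]


if __name__ == "__main__":
    gate = TurnGate()
    late = AudioChunk(gate.current_turn, b"I am happy to see you")
    gate.interrupt()  # the user interrupts while `late` is still in flight
    fresh = AudioChunk(gate.current_turn, b"Sure, go ahead")
    print(playable(gate, [late, fresh]))
```

The check has to happen as late as possible (just before output), because the stale chunk in this scenario arrives after the interruption has already been handled.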
Description
Is this reporting a bug or feature request?
A BUG
If reporting a bug, please fill out the following:
Environment
Issue description
When allow_interruptions is set to True, audio can lag during the conversation. When the user interrupts the bot while it is speaking, the bot's unfinished speech may continue after the user has finished speaking. The bot's current turn may then be missing a sentence, which is only spoken in the next turn. The mechanism causing this issue has not been identified so far.
Repro steps
Set allow_interruptions = True, use AzureTTSService as the TTS, and speak with the bot freely. The problem occurs randomly. I can't pin down exactly when it happens, but it will certainly happen if you keep talking for long enough.
Expected behavior
The bot says what it should say in the current turn and does not speak sentences from the previous turn.
Actual behavior
The bot omits one or more of the last sentences of the current turn, and speaks sentences that were interrupted in the previous turn.
Logs
No logs; I am not sure under what conditions it happens.