
Add a TranscriptProcessor and new frames #860

Merged
merged 10 commits into from
Dec 19, 2024
Conversation

@markbackman (Contributor) commented Dec 14, 2024:

Please describe the changes in your PR. If it is addressing an issue, please reference that as well.

Open issues to resolve:

  • Events are emitted in the wrong chronological order. The root cause is that user context frames come after assistant context frames, even though the user context frames are created first. I'm not sure of the best solution. @aconchillo any ideas?

Potential issues:

  • Anthropic requires that user messages be added to the context to direct the LLM. These will appear as transcribed speech. I don't know a good way around this; it's really an Anthropic problem.

Notable:

  • I'm using the to_standard_message function from each LLM service. This provides access to the context in the same message format. I had to update OpenAI to support the "standard message" format so I didn't have to add special-case handling. @kwindla and I talked about this, so this might change depending on what we decide.
  • I've created two new timestamp frames. I wanted to provide timestamps for each transcription message without changing the context, so these frames are pushed immediately after the context frame is pushed. The TranscriptProcessor processes these frames and assembles the timestamped transcript message. I think this works pretty well. I'm open to feedback on whether there should be one timestamp frame that includes the role, or two frames so that each can be handled directly by type. I liked the latter approach, which is what's implemented here. @aconchillo any thoughts?
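A minimal sketch of the two-frame approach described above; the frame names and fields here are illustrative stand-ins, not the actual Pipecat definitions:

```python
from dataclasses import dataclass


@dataclass
class UserUpdateTimestampFrame:
    """Marks when the user context was updated (hypothetical name)."""
    timestamp: str


@dataclass
class AssistantUpdateTimestampFrame:
    """Marks when the assistant context was updated (hypothetical name)."""
    timestamp: str


def handle_frame(frame):
    # Two frame types let a processor dispatch purely on type,
    # rather than inspecting a role field on a single frame type.
    if isinstance(frame, UserUpdateTimestampFrame):
        return ("user", frame.timestamp)
    if isinstance(frame, AssistantUpdateTimestampFrame):
        return ("assistant", frame.timestamp)
    return None
```

This is the trade-off being asked about: one frame type with a `role` field versus two types that can be handled by `isinstance` alone.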

UPDATE (12/16):

  • Based on Kwin's feedback, I'm reverting the changes to openai_llm_context for to_standard_messages and from_standard_message. Instead, I'll just handle both simple and list content in the TranscriptProcessor.
  • Also, based on conversations today, the design is now:
    • The user transcript processor just processes the TranscriptionFrame from the user and emits an event.
    • The assistant processor now just handles assistant messages from the OpenAILLMContextFrames.

@kwindla (Contributor) commented:

Left a few comments.

CHANGELOG.md Outdated
format.
- New examples: `28a-transcription-processor-openai.py`,
`28a-transcription-processor-openai.py`, and
`28a-transcription-processor-openai.py`.
Contributor:

Should be ...-anthropic.py and -gemini.py

Contributor Author:

Fixed.

async def on_first_participant_joined(transport, participant):
await transport.capture_participant_transcription(participant["id"])
# Kick off the conversation.
await task.queue_frames([LLMMessagesFrame(messages)])
Contributor:

I think we're trying to move away from sending an LLMMessagesFrame to start the conversation.

I've been trying to do this instead:

            await task.queue_frames([context_aggregator.user().get_context_frame()])

The reason is that some LLM operations require that the llm service have a persistent context object. Function calling is an example.

If we just send an LLMMessagesFrame the llm service creates a context to use temporarily. But then when the context_aggregator pushes the next frame, that context gets overwritten.

I've actually been wondering if we should formally deprecate starting a pipeline with an LLMMessagesFrame and throw a warning.

(It's fine to use an LLMMessagesFrame later, when there's already a context.)
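A toy illustration (not actual Pipecat code) of the failure mode described here: when the service builds a throwaway context from a messages frame, state attached to the persistent context (such as registered tools) is lost. Class names are invented for the sketch.

```python
class Context:
    """Stand-in for an LLM context: messages plus attached state like tools."""
    def __init__(self, messages=None, tools=None):
        self.messages = list(messages or [])
        self.tools = list(tools or [])


class LLMService:
    """Stand-in for an LLM service that tracks its current context."""
    def __init__(self):
        self.context = None

    def on_messages_frame(self, messages):
        # LLMMessagesFrame path: a temporary context is created, so any
        # tools registered on a previous context are dropped.
        self.context = Context(messages)

    def on_context_frame(self, context):
        # Context-frame path: the shared, persistent context is used.
        self.context = context


shared = Context(
    messages=[{"role": "system", "content": "Be brief."}],
    tools=["get_weather"],
)
svc = LLMService()
svc.on_messages_frame(shared.messages)
print(svc.context.tools)   # [] — tools were dropped
svc.on_context_frame(shared)
print(svc.context.tools)   # ['get_weather']
```

This is why starting the pipeline with `context_aggregator.user().get_context_frame()` keeps function calling working, while starting with an LLMMessagesFrame can silently discard that state.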

Contributor Author:

Good to know. If we want to move in this direction (and it seems we do), then we'll want to:

  • document it
  • (maybe) emit an error if it's used to initialize the conversation

Contributor Author:

Fixed.


This frame is emitted when new messages are added to the conversation history,
containing only the newly added messages rather than the full transcript.
Messages have normalized roles (user/assistant) regardless of the LLM service used.
Contributor:

Maybe explicitly say here that messages are always in the OpenAI standard messages format?

Contributor Author:

Updated.

@@ -112,11 +112,59 @@ def get_messages_for_logging(self) -> str:
msgs.append(msg)
return json.dumps(msgs)

def from_standard_message(self, message):
def from_standard_message(self, message) -> dict:
Contributor:

I think this change isn't necessary.

The OpenAI format allows content to be a string or a list. The list elements can be of the form { "type": "text", "text": text }.

If we do make a change like this, we have to handle non-text content parts properly. (Image, audio.)

Contributor Author:

Reverted + added docstring documenting behavior.

def to_standard_messages(self, obj) -> list:
"""Convert OpenAI message to standard structured format.
Contributor:

Same thing here. I don't think we want this change.

The approach I was taking with this is that everywhere in the whole Pipecat codebase wherever we deal with messages, we should always accept both:

  • the old-style OpenAI shortcut "content": text
  • the newer, more flexible "content": List[ContentPart]

This is kind of a pain. But there are too many corner cases in terms of where in code new messages might get defined and injected to do anything else, I think.

Contributor Author:

If we want this, then the standard message format is both:

  • the old-style OpenAI shortcut "content": text
  • the newer, more flexible "content": List[ContentPart]

That means code needs to handle both formats, which is suboptimal. But if there's no better option, so be it. I'll revert this change and the from_standard_message change. That will require handling both old and new formats in the application code.
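For reference, "handling both old and new formats in the application code" amounts to a small normalization helper along these lines; the function name is hypothetical:

```python
def content_to_text(content):
    """Extract text from an OpenAI-style message `content`, which may be
    either the old-style string shortcut or a list of content parts."""
    if isinstance(content, str):
        return content
    parts = []
    for part in content:
        # Only { "type": "text", "text": ... } parts carry text;
        # non-text parts (image_url, audio, ...) are skipped here.
        if isinstance(part, dict) and part.get("type") == "text":
            parts.append(part["text"])
    return " ".join(parts)
```

Callers then never need to know which of the two content shapes a given message uses.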

Contributor Author:


Reverted + added docstring documenting behavior.

@markbackman markbackman force-pushed the mb/transcription branch 4 times, most recently from 072e35b to 00af859 Compare December 16, 2024 20:34
@markbackman markbackman marked this pull request as ready for review December 16, 2024 20:37
@markbackman markbackman requested a review from kwindla December 16, 2024 20:37
@markbackman changed the title from "Add a TranscriptionProcessor and new frames" to "Add a TranscriptProcessor and new frames" Dec 16, 2024

role: Literal["user", "assistant"]
content: str
timestamp: str | None = None
Contributor:

This should go in transcript_processor since it's not a frame.

Contributor:

Oh, nevermind! 🤦

frame: Input frame to process
direction: Frame processing direction
"""
await super().process_frame(frame, direction)
Contributor:

we probably don't need this method since it doesn't do anything. the parent FrameProcessor one will be used.

from pipecat.processors.frame_processor import FrameDirection, FrameProcessor


class BaseTranscriptProcessor(FrameProcessor, ABC):
Contributor:

we don't need ABC

self._assistant_processor = AssistantTranscriptProcessor(**kwargs)
self._event_handlers = {}

def user(self) -> UserTranscriptProcessor:
Contributor:

we probably want **kwargs here and create the UserTranscriptProcessor here instead.

"""

def __init__(self, **kwargs):
Contributor:

We probably don't want to pass kwargs here. Both processors might have different arguments and we would not be able to pass them like this.

def __init__(self, **kwargs):
"""Initialize factory with user and assistant processors."""
self._user_processor = UserTranscriptProcessor(**kwargs)
self._assistant_processor = AssistantTranscriptProcessor(**kwargs)
Contributor:

Initialize to None
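The two suggestions above (initialize to None, and construct each processor inside `user()`/`assistant()` with its own kwargs) combine into a lazy-initialization pattern roughly like this; class and parameter names are simplified stand-ins for the Pipecat ones:

```python
class UserTranscriptProcessor:
    """Stand-in for the real processor; stores its kwargs for inspection."""
    def __init__(self, **kwargs):
        self.options = kwargs


class AssistantTranscriptProcessor:
    """Stand-in for the real processor; stores its kwargs for inspection."""
    def __init__(self, **kwargs):
        self.options = kwargs


class TranscriptProcessorFactory:
    def __init__(self):
        # Initialize to None; processors are created on first access so
        # each can receive its own, independent keyword arguments.
        self._user_processor = None
        self._assistant_processor = None

    def user(self, **kwargs):
        if self._user_processor is None:
            self._user_processor = UserTranscriptProcessor(**kwargs)
        # Note: kwargs on later calls are ignored once the processor exists.
        return self._user_processor

    def assistant(self, **kwargs):
        if self._assistant_processor is None:
            self._assistant_processor = AssistantTranscriptProcessor(**kwargs)
        return self._assistant_processor
```

This avoids the problem noted above with a single shared `**kwargs`: the two processors may take different arguments, so each accessor accepts its own.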

@aconchillo (Contributor):

This looks great! Just added a few comments.

@markbackman (Contributor Author):

@aconchillo thanks for the feedback! This is ready for review.

@aconchillo (Contributor):

LGTM!

@markbackman markbackman merged commit 6e0d3ae into main Dec 19, 2024
4 checks passed
@markbackman markbackman deleted the mb/transcription branch December 19, 2024 13:15