
Implement Google Gemini LLM service #145

Closed
kwindla opened this issue May 16, 2024 · 3 comments

@kwindla (Contributor) commented May 16, 2024

I'm working on a Google Gemini LLM service for Pipecat and interested in any feedback people have about the LLMMessagesFrame class.

All the other chat-tuned (multi-turn) LLMs I've worked with have adopted OpenAI's messages array format. Google's format is a bit different:

  • The role can only be user or model. Contrast with user, assistant, or system for OpenAI.
  • The message content shape is parts: [<string>, ...] instead of just content: <string>.
  • Inline image data is also typed differently.

https://ai.google.dev/api/python/google/ai/generativelanguage/Content

https://ai.google.dev/gemini-api/docs/get-started/tutorial?lang=python#encode_messages
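Concretely, here is the same two-turn exchange in each shape (the message text is illustrative, not from Pipecat):

```python
# Illustrative only: the same exchange expressed in each provider's shape.

# OpenAI-style messages array: role + single content string.
openai_messages = [
    {"role": "user", "content": "What is Pipecat?"},
    {"role": "assistant", "content": "A framework for real-time voice agents."},
]

# Gemini-style contents: "assistant" becomes "model", and each message
# carries a parts list rather than a single content string.
gemini_contents = [
    {"role": "user", "parts": ["What is Pipecat?"]},
    {"role": "model", "parts": ["A framework for real-time voice agents."]},
]
```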

We could do at least three different things.

  1. Implement the Gemini service so that it translates internally from the OpenAI data shape used by LLMMessage into the google.ai.generativelanguage data structures.
  2. Implement a new LLMMessage class/subclass for use with Gemini models.
  3. Design an abstraction that can represent higher-level concepts and that all of our LLM services will use.

I lean towards (1). I think it will be fairly straightforward and we can always do (2) later if we need to.
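A minimal sketch of what (1) could look like; the function name and role mapping here are hypothetical, not the actual Pipecat implementation (which would presumably build the google.ai.generativelanguage types rather than plain dicts):

```python
# Hypothetical sketch of option (1): translate OpenAI-shape messages into
# Gemini's role/parts shape. All names here are illustrative.

ROLE_MAP = {
    "user": "user",
    "assistant": "model",
    # Gemini has no "system" role; one common workaround is to fold
    # system prompts into a user turn.
    "system": "user",
}

def openai_to_gemini(messages: list[dict]) -> list[dict]:
    """Convert an OpenAI-style messages array into Gemini-style contents."""
    return [
        {"role": ROLE_MAP[m["role"]], "parts": [m["content"]]}
        for m in messages
    ]
```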

But I haven't yet gotten to the context management code here, and that may complicate things. Note: we can't use the Google library's convenience functions for multi-turn chat context management, because pipelines need to be interruptible. One important part of interruptibility is making sure that the LLM context includes only sentences that the LLM has "said" out loud to the user.

Any other thoughts here are welcome!

Also, Discord thread is here if people want to hash things out ephemerally before etching pixels into the stone tablet of an issue comment: https://discord.com/channels/1239284677165056021/1239284677823565826/1240682255584854027

@chadbailey59 (Contributor) commented:

I think (1) as well, as long as Google's alternate structure doesn't actually represent something different (which it doesn't).

@kwindla (Contributor, Author) commented May 17, 2024

Initial (draft) implementation: #150

https://x.com/kwindla/status/1791319660442611731

@aconchillo (Contributor) commented:

This is now available in 0.0.17. Closing.
