Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: add default indexing policy to chat history #36

Closed
wants to merge 3 commits into from

Conversation

jordanrfrazier
Copy link
Collaborator

Adds a default indexing policy to the chat message history class.

The reason for this is that the default policy when creating a vector store through AstraDBVectorStore uses this policy. Meaning, if you create a collection with the class, you cannot initialize chat history, as you'll get this error:

E               ValueError: Astra DB collection 'test' is detected as having the following indexing policy: {"allow": ["metadata"]}. This is incompatible with the requested indexing policy for this object. Consider reindexing anew on a fresh collection with the requested indexing policy, or alternatively align the requested indexing settings to the collection to keep using it.

I think it makes sense in this case to make the default indexing policy the same here.

DEFAULT_INDEXING_OPTIONS = {"allow": ["metadata"]}

@jordanrfrazier
Copy link
Collaborator Author

Open to discussions here - I didn't dig very deeply into this. I was writing integration tests for langflow and running into this error.

@hemidactylus
Copy link
Collaborator

hemidactylus commented Jun 18, 2024

A side comment: the reason the CI fails is unrelated to this PR. I encountered the same while upgrading the vectorize and came up with a fix (see #37 for a brief explanation). I suggest to bring that one in this branch once 37 is merged, to get smooth CI here as well.

Now to the main question. While tailoring the indexing policy for chat history to the use is in general a good thing (fight SAI waste!), I am not sure I get the reason why one should use the same collection for both vectorstore and chat history. The format of the documents in them is different as far as I remember, no?
Would there be a real-world situation (outside of tests perhaps) where a single collection might serve both purposes? Note: I might be missing some context here ¯\_(ツ)_/¯

@jordanrfrazier
Copy link
Collaborator Author

jordanrfrazier commented Jun 18, 2024

A side comment: the reason the CI fails is unrelated to this PR. I encountered the same while upgrading the vectorize and came up with a fix (see #37 for a brief explanation). I suggest to bring that one in this branch once 37 is merged, to get smooth CI here as well.

Now to the main question. While tailoring the indexing policy for chat history to the use is in general a good thing (fight SAI waste!), I am not sure I get the reason why one should use the same collection for both vectorstore and chat history. The format of the documents in them is different as far as I remember, no? Would there be a real-world situation (outside of tests perhaps) where a single collection might serve both purposes? Note: I might be missing some context here ¯\_(ツ)_/¯

I could probably imagine up some use-case where we'd want metadata attached to each message, such as "time sent" for accessing only the last five minutes of chat history, or "location" for location-based recommendations in a conversation, which would reduce the number of stores to query. However, it's true the main reason was during testing I ran into this - I wanted to use a single fixture to set up my Astra collections before running tests, which used the AstraDBVectorStore class (thus defaulting to {allow: metadata}). I ended up using the chat history in a separate fixture to set up that specific class, so very easy to avoid this issue.

I don't have a stronger argument than the above, so very happy to pass on this as "working-as-intended".

Edit: I'm leaning your direction as well, after looking again at the existing patterns.

            {
                "timestamp": time.time(),
                "session_id": self.session_id,
                "body_blob": json.dumps(message_to_dict(message)),
            }

The body_blob contains whatever metadata we want to add. Langflow adds the sender, type, etc here, and deconstructs it as needed on the read. So, I agree, not necessary. I'll close this PR on that note, and we can discuss later if the flow needs to be updated or clarified if people do try to use the AstraDB component for chat history.

@jordanrfrazier
Copy link
Collaborator Author

Closing per above comment

@hemidactylus hemidactylus deleted the add-default-indexing-policy branch August 22, 2024 10:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants