fix: add default indexing policy to chat history #36

jordanrfrazier · 2024-06-15T04:36:39Z

Adds a default indexing policy to the chat message history class.

The reason for this is that the default policy when creating a vector store through AstraDBVectorStore uses this policy. Meaning, if you create a collection with the class, you cannot initialize chat history, as you'll get this error:

E               ValueError: Astra DB collection 'test' is detected as having the following indexing policy: {"allow": ["metadata"]}. This is incompatible with the requested indexing policy for this object. Consider reindexing anew on a fresh collection with the requested indexing policy, or alternatively align the requested indexing settings to the collection to keep using it.

I think it makes sense in this case to make the default indexing policy the same here.

langchain-datastax/libs/astradb/langchain_astradb/vectorstores.py

Line 57 in 77d3d81

DEFAULT_INDEXING_OPTIONS = {"allow": ["metadata"]}

jordanrfrazier · 2024-06-15T04:37:22Z

Open to discussions here - I didn't dig very deeply into this. I was writing integration tests for langflow and running into this error.

hemidactylus · 2024-06-18T12:48:29Z

A side comment: the reason the CI fails is unrelated to this PR. I encountered the same while upgrading the vectorize and came up with a fix (see #37 for a brief explanation). I suggest to bring that one in this branch once 37 is merged, to get smooth CI here as well.

Now to the main question. While tailoring the indexing policy for chat history to the use is in general a good thing (fight SAI waste!), I am not sure I get the reason why one should use the same collection for both vectorstore and chat history. The format of the documents in them is different as far as I remember, no?
Would there be a real-world situation (outside of tests perhaps) where a single collection might serve both purposes? Note: I might be missing some context here ¯\_(ツ)_/¯

jordanrfrazier · 2024-06-18T22:31:21Z

A side comment: the reason the CI fails is unrelated to this PR. I encountered the same while upgrading the vectorize and came up with a fix (see #37 for a brief explanation). I suggest to bring that one in this branch once 37 is merged, to get smooth CI here as well.

Now to the main question. While tailoring the indexing policy for chat history to the use is in general a good thing (fight SAI waste!), I am not sure I get the reason why one should use the same collection for both vectorstore and chat history. The format of the documents in them is different as far as I remember, no? Would there be a real-world situation (outside of tests perhaps) where a single collection might serve both purposes? Note: I might be missing some context here ¯\_(ツ)_/¯

I could probably imagine up some use-case where we'd want metadata attached to each message, such as "time sent" for accessing only the last five minutes of chat history, or "location" for location-based recommendations in a conversation, which would reduce the number of stores to query. However, it's true the main reason was during testing I ran into this - I wanted to use a single fixture to set up my Astra collections before running tests, which used the AstraDBVectorStore class (thus defaulting to {allow: metadata}). I ended up using the chat history in a separate fixture to set up that specific class, so very easy to avoid this issue.

I don't have a stronger argument than the above, so very happy to pass on this as "working-as-intended".

Edit: I'm leaning your direction as well, after looking again at the existing patterns.

            {
                "timestamp": time.time(),
                "session_id": self.session_id,
                "body_blob": json.dumps(message_to_dict(message)),
            }

The body_blob contains whatever metadata we want to add. Langflow adds the sender, type, etc here, and deconstructs it as needed on the read. So, I agree, not necessary. I'll close this PR on that note, and we can discuss later if the flow needs to be updated or clarified if people do try to use the AstraDB component for chat history.

jordanrfrazier · 2024-06-18T22:36:43Z

Closing per above comment

Add default indexing policy to chat history

f1cc394

jordanrfrazier requested review from hemidactylus and nicoloboschi June 15, 2024 04:36

jordanrfrazier added 2 commits June 15, 2024 12:46

use correct var

fc5929d

Also adds requested indexing params

620a043

jordanrfrazier closed this Jun 18, 2024

hemidactylus deleted the add-default-indexing-policy branch August 22, 2024 10:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: add default indexing policy to chat history #36

fix: add default indexing policy to chat history #36

jordanrfrazier commented Jun 15, 2024

jordanrfrazier commented Jun 15, 2024

hemidactylus commented Jun 18, 2024 •

edited

Loading

jordanrfrazier commented Jun 18, 2024 •

edited

Loading

jordanrfrazier commented Jun 18, 2024

fix: add default indexing policy to chat history #36

fix: add default indexing policy to chat history #36

Conversation

jordanrfrazier commented Jun 15, 2024

jordanrfrazier commented Jun 15, 2024

hemidactylus commented Jun 18, 2024 • edited Loading

jordanrfrazier commented Jun 18, 2024 • edited Loading

jordanrfrazier commented Jun 18, 2024

hemidactylus commented Jun 18, 2024 •

edited

Loading

jordanrfrazier commented Jun 18, 2024 •

edited

Loading