Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

community: init signature revision for Cassandra LLM cache classes + small maintenance #17765

Merged
merged 38 commits into from
May 16, 2024

Conversation

hemidactylus
Copy link
Contributor

This PR improves on the CassandraCache and CassandraSemanticCache classes, mainly in the constructor signature, and also introduces several minor improvements around these classes.

Init signature

A (sigh) breaking change is tentatively introduced to the constructor. To me, the advantages outweigh the possible discomfort: the new syntax places the DB-connection objects session and keyspace later in the param list, so that they can be given a default value. This is what enables the pattern of not specifying them, provided one has previously initialized the Cassandra connection through the versatile utility method cassio.init(...).

In this way, a much less unwieldy instantiation can be done, such as CassandraCache() and CassandraSemanticCache(embedding=xyz), everything else falling back to defaults.

A downside is that, compared to the earlier signature, this might turn out to be breaking for those doing positional instantiation. As a way to mitigate this problem, this PR typechecks its first argument trying to detect the legacy usage.
(And to make this point less tricky in the future, most arguments are left to be keyword-only).

If this is considered too harsh, I'd like guidance on how to further smoothen this transition. Our plan is to make the pattern of optional session/keyspace a standard across all Cassandra classes, so that a repeatable strategy would be ideal. A possibility would be to keep positional arguments for legacy reasons but issue a deprecation warning if any of them is actually used, to later remove them with 0.2 - please advise on this point.

Other changes

  • class docstrings: enriched, completely moved to class level, added note on cassio.init(...) pattern, added tiny sample usage code.
  • semantic cache: revised terminology to never mention "distance" (it is in fact a similarity!). Kept the legacy constructor param with a deprecation warning if used.
  • llm_caching notebook: uniform flow with the Cassandra and Astra DB separate cases; better and Cassandra-first description; all imports made explicit and from community where appropriate.
  • cache integration tests moved to community (incl. the imported tools), env var bugfix for CASSANDRA_CONTACT_POINTS.

@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. Ɑ: memory Related to memory module 🔌: astradb Primarily related to AstraDB integrations 🤖:improvement Medium size change to existing code to handle new use-cases labels Feb 19, 2024
Copy link

vercel bot commented Feb 19, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
langchain ✅ Ready (Inspect) Visit Preview 💬 Add feedback May 16, 2024 5:22pm

@hemidactylus
Copy link
Contributor Author

@baskaryan Hello, I'd like your opinion on this one. You may remember we already briefly discussed about the init signature change (several months back).
I think "the time has come" - I'd like some advice on how to do it in the most compliant, respectful way.

"""

def __init__(
self,
table_name: str = CASSANDRA_CACHE_DEFAULT_TABLE_NAME,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a breaking change, would prefer to avoid it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is, in a sense, the point of this PR. Using the cassio.init() construct to set a globally-available database connection would enable a much easier constructor pattern where one can omit the session/keyspace arguments altogether when creating table-based LangChain objects.

I am aware this is a breaking change, on the other hand it would make it possible to use this pattern like CassandraCache("table") instead of the more unwieldy CassandraCache(None, None, "table"), something that would really be an advantage in the long run.

Would you have a suggestion on how to best make this transition happen by respecting legacy usage as much as possible?

Comment on lines 1239 to 1243
*,
session: Optional[CassandraSession] = None,
keyspace: Optional[str] = None,
distance_metric: Optional[str] = None,
similarity_measure: str = CASSANDRA_SEMANTIC_CACHE_DEFAULT_SIMILARITY,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

Copy link
Contributor Author

@hemidactylus hemidactylus Apr 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about marking these whole classes as deprecated and creating brand new ones with the newer init signature, would that be a better way?
(Note our plan would be to align all Cassandra-related classes to this kind of signature eventually)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 This would be an opportunity to rename it to CassandraVectorStore

@hemidactylus
Copy link
Contributor Author

hemidactylus commented May 2, 2024

Hello @baskaryan , we have revised our strategy to simplify the init signatures (in view of the global Cassandra connection available through CassIO).
In line with the similar PR #21209 by @cbornet (that one on the vector store class), I have modified this PR so as to be fully backward-compatible yet offer the shorter constructor form. Please let me know what you think.

(Note: I duplicated a utility module used in testing - the fake embeddings - to avoid introducing additional community dependencies within (core)langchain, in compliance with the ongoing effort to decouple them as much as possible).
Edit: reverted the above to have the test in the proper location - see below comment

libs/community/langchain_community/cache.py Outdated Show resolved Hide resolved
libs/community/langchain_community/cache.py Outdated Show resolved Hide resolved
libs/community/langchain_community/cache.py Outdated Show resolved Hide resolved
libs/community/langchain_community/cache.py Outdated Show resolved Hide resolved
libs/community/langchain_community/cache.py Outdated Show resolved Hide resolved
@hemidactylus
Copy link
Contributor Author

Update: I reverted the location of the Cassandra cache test file to libs/langchain since it requires imports from langchain.... So the duplicate fake embedding module is not necessary anymore.

On the other hand, I have bumped the cassio version in the pyproject.toml of libs/langchain to ensure the async usage (which uses awaitables for the embedding dimensions, among other things) is handled correctly by CassIO.

@dosubot dosubot bot added the lgtm PR looks good. Use to confirm that a PR is ready for merging. label May 16, 2024
@efriis efriis enabled auto-merge (squash) May 16, 2024 17:10
@efriis efriis merged commit 040597e into langchain-ai:master May 16, 2024
44 checks passed
hinthornw pushed a commit that referenced this pull request Jun 20, 2024
…small maintenance (#17765)

This PR improves on the `CassandraCache` and `CassandraSemanticCache`
classes, mainly in the constructor signature, and also introduces
several minor improvements around these classes.

### Init signature

A (sigh) breaking change is tentatively introduced to the constructor.
To me, the advantages outweigh the possible discomfort: the new syntax
places the DB-connection objects `session` and `keyspace` later in the
param list, so that they can be given a default value. This is what
enables the pattern of _not_ specifying them, provided one has
previously initialized the Cassandra connection through the versatile
utility method `cassio.init(...)`.

In this way, a much less unwieldy instantiation can be done, such as
`CassandraCache()` and `CassandraSemanticCache(embedding=xyz)`,
everything else falling back to defaults.

A downside is that, compared to the earlier signature, this might turn
out to be breaking for those doing positional instantiation. As a way to
mitigate this problem, this PR typechecks its first argument trying to
detect the legacy usage.
(And to make this point less tricky in the future, most arguments are
left to be keyword-only).

If this is considered too harsh, I'd like guidance on how to further
smoothen this transition. **Our plan is to make the pattern of optional
session/keyspace a standard across all Cassandra classes**, so that a
repeatable strategy would be ideal. A possibility would be to keep
positional arguments for legacy reasons but issue a deprecation warning
if any of them is actually used, to later remove them with 0.2 - please
advise on this point.

### Other changes

- class docstrings: enriched, completely moved to class level, added
note on `cassio.init(...)` pattern, added tiny sample usage code.
- semantic cache: revised terminology to never mention "distance" (it is
in fact a similarity!). Kept the legacy constructor param with a
deprecation warning if used.
- `llm_caching` notebook: uniform flow with the Cassandra and Astra DB
separate cases; better and Cassandra-first description; all imports made
explicit and from community where appropriate.
- cache integration tests moved to community (incl. the imported tools),
env var bugfix for `CASSANDRA_CONTACT_POINTS`.

---------

Co-authored-by: Erick Friis <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🔌: astradb Primarily related to AstraDB integrations 🤖:improvement Medium size change to existing code to handle new use-cases lgtm PR looks good. Use to confirm that a PR is ready for merging. Ɑ: memory Related to memory module size:L This PR changes 100-499 lines, ignoring generated files. template
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants