Pathway vectorstore and rag-pathway template #14859

Merged: 42 commits, Mar 29, 2024

janchorowski
Contributor

  • Description: Integration with pathway.com data processing pipeline acting as an always updated vectorstore
  • Issue: not applicable
  • Dependencies: optional dependency on pathway
  • Twitter handle: pathway_com

The PR provides an integration with pathway that offers an easy-to-use, always-updated vector store:

import os

import pathway as pw
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import PathwayVectorClient, PathwayVectorServer

# Source connectors whose documents will be indexed.
data_sources = []
data_sources.append(
    pw.io.gdrive.read(
        object_id="17H4YpBOAKQzEJ93xmC2z170l0bP2npMy",
        service_user_credentials_file="credentials.json",
        with_metadata=True,
    )
)

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
embeddings_model = OpenAIEmbeddings(openai_api_key=os.environ["OPENAI_API_KEY"])

# Server: builds the indexing pipeline over the data sources.
vector_server = PathwayVectorServer(
    *data_sources,
    embedder=embeddings_model,
    splitter=text_splitter,
)
vector_server.run_server(host="127.0.0.1", port="8765", threaded=True, with_cache=False)

# Client: queries the running server through the VectorStore interface.
client = PathwayVectorClient(
    host="127.0.0.1",
    port="8765",
)
query = "What is Pathway?"
docs = client.similarity_search(query)

The PathwayVectorServer builds a data processing pipeline that continuously scans documents in a given source connector (Google Drive, S3, ...) and builds a vector store. The PathwayVectorClient implements LangChain's VectorStore interface and connects to the server to retrieve documents.
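
To illustrate the "always updated" idea in isolation, here is a minimal toy sketch. It is not Pathway's implementation: `ToyLiveIndex`, `embed`, and `cosine` are hypothetical names, the source is a plain dict instead of a connector, and a bag-of-words count stands in for a real embedding model. It only shows the pattern of re-syncing an index against a changing source and then answering similarity queries.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words term counts (stand-in for a real embedder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyLiveIndex:
    """Hypothetical index that re-embeds only documents whose content changed."""

    def __init__(self) -> None:
        # doc_id -> (text, embedding)
        self._docs: Dict[str, Tuple[str, Counter]] = {}

    def sync(self, source: Dict[str, str]) -> int:
        """Bring the index in line with `source`; return the number of updates."""
        changed = 0
        for doc_id in list(self._docs):
            if doc_id not in source:  # document was deleted upstream
                del self._docs[doc_id]
                changed += 1
        for doc_id, text in source.items():
            if doc_id not in self._docs or self._docs[doc_id][0] != text:
                self._docs[doc_id] = (text, embed(text))  # new or modified
                changed += 1
        return changed

    def similarity_search(self, query: str, k: int = 1) -> List[str]:
        """Return the texts of the k documents most similar to the query."""
        q = embed(query)
        ranked = sorted(
            self._docs.values(), key=lambda d: cosine(q, d[1]), reverse=True
        )
        return [text for text, _ in ranked[:k]]


# Usage: the source changes between scans and the index follows it.
source = {"intro.txt": "Pathway is a data processing framework"}
index = ToyLiveIndex()
index.sync(source)  # initial scan
source["fruit.txt"] = "Bananas are yellow"
index.sync(source)  # picks up the new document
results = index.similarity_search("yellow bananas")
```

In the real integration the scanning loop runs on the server side and the client only issues queries; the toy merges both roles into one class for brevity.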

---------

Co-authored-by: mlewandowski <[email protected]>
Co-authored-by: Berke <[email protected]>
Co-authored-by: Jan Chorowski <[email protected]>
Co-authored-by: Adrian Kosowski <[email protected]>
@dosubot dosubot bot added the size:XL This PR changes 500-999 lines, ignoring generated files. label Dec 18, 2023
@dosubot dosubot bot added Ɑ: vector store Related to vector store module 🤖:enhancement A large net-new component, integration, or chain. Use sparingly. The largest features labels Dec 18, 2023
Contributor

@hwchase17 hwchase17 left a comment

Let's not import this into langchain; langchain should remain unchanged.

Only langchain-community should be updated, and we should import directly from there.


from typing import Callable, List, Optional

import pathway as pw
Contributor

this should be a conditional import

Contributor Author

Sorry about the omission, all should be optional now.
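
For readers unfamiliar with the pattern, a conditional (lazy) import of an optional dependency can be sketched roughly as below. This is an illustration of the general technique, not the code from this PR: `import_optional` and `LazyVectorServer` are hypothetical names, and the error message wording is an assumption.

```python
import importlib
from types import ModuleType
from typing import Optional


def import_optional(module_name: str, pip_name: Optional[str] = None) -> ModuleType:
    """Import `module_name` on demand, with a helpful error if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"Could not import {module_name}. "
            f"Install it with `pip install {pip_name or module_name}`."
        ) from exc


class LazyVectorServer:
    """Hypothetical class that needs an optional dependency only when used."""

    def __init__(self, module_name: str = "pathway") -> None:
        # The import runs at construction time, not when this file is imported,
        # so users without the optional package can still import the module.
        self._backend = import_optional(module_name)
```

With this shape, importing the module that defines `LazyVectorServer` never fails; the `ImportError` surfaces only when someone instantiates the class without the optional package installed.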

Change to a conditional import.

---------

Co-authored-by: mlewandowski <[email protected]>
@dosubot dosubot bot added size:XXL This PR changes 1000+ lines, ignoring generated files. and removed size:XL This PR changes 500-999 lines, ignoring generated files. labels Dec 19, 2023
Fix documentation markdown formatting.

---------

Co-authored-by: mlewandowski <[email protected]>
lewymati and others added 4 commits December 19, 2023 16:59
It was done as follows:
1. fetch fresh langchain master
2. `poetry add --optional pathway@latest --python ">=3.10"`
3. `poetry lock --no-update`
@janchorowski
Contributor Author

@hwchase17 we have fixed poetry lock and used type annotations suitable for Py3.8, can you re-trigger the CI run?

@janchorowski
Contributor Author

@efriis I tried to fix the formatting, now CI should be clean.

@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Mar 11, 2024
@janchorowski
Contributor Author

janchorowski commented Mar 11, 2024

@efriis we have simplified the PR, leaving only the client. The quick-start instructions now use a publicly available server and then point to instructions on how to run your own.

We have also removed the template and the pathway dependency, making the PR much leaner.

Please review!

@janchorowski
Contributor Author

@efriis I fixed linters

@janchorowski
Contributor Author

@efriis please trigger CI, we resolved a merge conflict

@baskaryan baskaryan enabled auto-merge (squash) March 28, 2024 00:22
@janchorowski janchorowski restored the update_imports branch March 29, 2024 14:17
auto-merge was automatically disabled March 29, 2024 14:20

Head branch was pushed to by a user without write access

@janchorowski
Contributor Author

@efriis @baskaryan sorry to bother you, I merged master again and reran the linters. Locally, `make lint` passes, but `make test` fails with `FAILED tests/unit_tests/callbacks/test_callback_manager.py::test_callback_manager_configure_context_vars - AttributeError: 'Client' object has no attribute 'tracing_queue'`, which seems unrelated and hopefully won't block this any further.

@baskaryan baskaryan merged commit b8b42cc into langchain-ai:master Mar 29, 2024
62 checks passed
@dosubot dosubot bot added the lgtm PR looks good. Use to confirm that a PR is ready for merging. label Mar 29, 2024
gkorland pushed a commit to FalkorDB/langchain that referenced this pull request Mar 30, 2024
Co-authored-by: Mateusz Lewandowski <[email protected]>
Co-authored-by: mlewandowski <[email protected]>
Co-authored-by: Berke <[email protected]>
Co-authored-by: Adrian Kosowski <[email protected]>
Co-authored-by: mlewandowski <[email protected]>
Co-authored-by: berkecanrizai <[email protected]>
Co-authored-by: Erick Friis <[email protected]>
Co-authored-by: Harrison Chase <[email protected]>
Co-authored-by: Bagatur <[email protected]>
Co-authored-by: mlewandowski <[email protected]>
Co-authored-by: Szymon Dudycz <[email protected]>
Co-authored-by: Szymon Dudycz <[email protected]>
Co-authored-by: Bagatur <[email protected]>
hinthornw pushed a commit that referenced this pull request Apr 26, 2024
Labels
🤖:enhancement A large net-new component, integration, or chain. Use sparingly. The largest features lgtm PR looks good. Use to confirm that a PR is ready for merging. size:L This PR changes 100-499 lines, ignoring generated files. template Ɑ: vector store Related to vector store module
8 participants