ArangoDB Integration #3

Draft
wants to merge 37 commits into
base: main

aMahanna
Member

@aMahanna aMahanna commented Oct 23, 2024

This PR tracks the in-progress & completed ArangoDB Microservices for GenAIComps

Depends on:

Status:

  1. Dataprep (ArangoDB: Dataprep #12)
  2. Retriever (ArangoDB: Retriever #2)
  3. Chat History (ArangoDB: Chathistory #10)
  4. Feedback Management (ArangoDB: Feedback management #11)
  5. Prompt Registry (ArangoDB: PromptRegistry #8)
  6. Vector Stores (ArangoDB: Vector Store #13)

Development Setup for using new LangChain functionality

Depends on arangoml/langchain#1

  1. Clone this repository

  2. Switch to the arangodb branch

  3. Create a virtual environment:

python -m venv .venv

source .venv/bin/activate

  4. Install the required packages:

pip install python-arango
pip install langchain_openai
pip install git+https://github.com/arangoml/langchain.git@arangodb#subdirectory=libs/community

Note: See arangoml/langchain#1 for details on the three LangChain classes used in this repo (ArangoGraph, ArangoGraphQAChain, and ArangoVector).

  5. Provision the ArangoDB image that includes the vector index preview:

For ARM:

docker create --name arangodb -p 8529:8529 -e ARANGO_ROOT_PASSWORD=test jbajic/arangodb-arm:vector-index-preview

docker start arangodb

For AMD:

docker create --name arangodb -p 8529:8529 -e ARANGO_ROOT_PASSWORD=test jbajic/arangodb:vector-index-preview

docker start arangodb

Note: This ArangoDB image is built from an ArangoDB PR that introduces vector indexing and vector similarity support via FAISS. Ask Anthony for more details.

  6. Set your OPENAI_API_KEY environment variable (contact Anthony for access)

  7. Run the test script to confirm LangChain is working:

python langchain_test.py
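
Before running the test script, the setup steps above can be sanity-checked. The following is a minimal sketch, not part of this repository: it assumes the container is reachable at `localhost:8529` with the `root` user and the `test` password from the `docker create` commands, and the helper names (`pick_image`, `version_request`, `missing_env`, `preflight`) are hypothetical. It uses only the Python standard library and ArangoDB's `GET /_api/version` HTTP endpoint.

```python
import base64
import os
import platform
import urllib.request

# Image tags copied from the setup steps above.
IMAGES = {
    "arm": "jbajic/arangodb-arm:vector-index-preview",
    "amd": "jbajic/arangodb:vector-index-preview",
}


def pick_image(machine: str) -> str:
    """Return the preview image for an architecture string such as platform.machine()."""
    m = machine.lower()
    if m == "aarch64" or m.startswith("arm"):
        return IMAGES["arm"]
    return IMAGES["amd"]


def version_request(host="http://localhost:8529", user="root", password="test"):
    """Build (but do not send) a GET /_api/version request with HTTP basic auth."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = urllib.request.Request(f"{host}/_api/version")
    req.add_header("Authorization", f"Basic {token}")
    return req


def missing_env(names=("OPENAI_API_KEY",), env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [n for n in names if not env.get(n)]


def preflight():
    """Print the suggested image, any missing variables, and the server's version reply."""
    print("image:", pick_image(platform.machine()))
    print("missing env:", missing_env())
    with urllib.request.urlopen(version_request(), timeout=5) as resp:
        print("arangodb:", resp.read().decode())
```

Calling `preflight()` once the container is running prints the server's version reply. A connection error most likely means the container has not been started or the port mapping differs from the one assumed here.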

@aMahanna aMahanna mentioned this pull request Oct 23, 2024
@aMahanna aMahanna marked this pull request as draft October 23, 2024 12:54
ajaykallepalli and others added 3 commits November 25, 2024 14:28
* initial commit

* updating feedback management readme to match arango

* Removing comments above import

* Working API test and updated readme

* Working docker compose file

* Docker compose creating network and docker image

* code review

* update readme & dev yaml

* delete dev files

* Delete arango_store.py

---------

Co-authored-by: Anthony Mahanna <[email protected]>
* Initial commit

* remove unnecessary files

* code review

* update: `prompt_search`

* new: `ARANGO_PROTOCOL`

* README

* cleanup

---------

Co-authored-by: lasyasn <[email protected]>
Co-authored-by: Anthony Mahanna <[email protected]>
aMahanna pushed a commit that referenced this pull request Nov 26, 2024
* Adds an endpoint for image ingestion

Signed-off-by: Melanie Buehler <[email protected]>

* Combined image and video endpoint

Signed-off-by: Melanie Buehler <[email protected]>

* Add test and update README

Signed-off-by: Melanie Buehler <[email protected]>

* fixed variable name for embedding model (#1)

Signed-off-by: okhleif-IL <[email protected]>

* Fixed test script

Signed-off-by: Melanie Buehler <[email protected]>

* Remove redundant function

Signed-off-by: Melanie Buehler <[email protected]>

* get_videos, delete_videos --> get_files, delete_files (#3)

Signed-off-by: okhleif-IL <[email protected]>

* Updates test per review feedback

Signed-off-by: Melanie Buehler <[email protected]>

* Fixed test

Signed-off-by: Melanie Buehler <[email protected]>

* Add support for audio files multimodal data ingestion (#4)

* Add support for audio files multimodal data ingestion

Signed-off-by: dmsuehir <[email protected]>

* Update function name

Signed-off-by: dmsuehir <[email protected]>

---------

Signed-off-by: dmsuehir <[email protected]>

* Change videos_with_transcripts to ingest_with_text

Signed-off-by: Melanie Buehler <[email protected]>

* Add image support to video ingestion with transcript functionality

Signed-off-by: Melanie Buehler <[email protected]>

* Update test and README

Signed-off-by: Melanie Buehler <[email protected]>

* Updated for review suggestions

Signed-off-by: Melanie Buehler <[email protected]>

* Add two tests for ingest_with_text

Signed-off-by: Melanie Buehler <[email protected]>

* LVM TGI Gaudi update for prompts without images (#7)

* LVM Gaudi TGI update for prompts without images

Signed-off-by: dmsuehir <[email protected]>

* Wording

Signed-off-by: dmsuehir <[email protected]>

* Add a test

Signed-off-by: dmsuehir <[email protected]>

---------

Signed-off-by: dmsuehir <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Change dummy image to be b64 encoded instead of the url (#9)

Signed-off-by: dmsuehir <[email protected]>

* Updates based on review feedback (#10)

Signed-off-by: dmsuehir <[email protected]>

* Test fix (#11)

Signed-off-by: dmsuehir <[email protected]>

---------

Signed-off-by: Melanie Buehler <[email protected]>
Signed-off-by: okhleif-IL <[email protected]>
Signed-off-by: dmsuehir <[email protected]>
Co-authored-by: dmsuehir <[email protected]>
Co-authored-by: Omar Khleif <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Abolfazl Shahbazi <[email protected]>
aMahanna and others added 7 commits November 26, 2024 18:12
* initial commit

* updating feedback management readme to match arango

* Removing comments above import

* Working API test and updated readme

* Working docker compose file

* Docker compose creating network and docker image

* code review

* update readme & dev yaml

* delete dev files

* Delete arango_store.py

---------

Co-authored-by: Anthony Mahanna <[email protected]>
* Initial commit

* remove unnecessary files

* code review

* update: `prompt_search`

* new: `ARANGO_PROTOCOL`

* README

* cleanup

---------

Co-authored-by: lasyasn <[email protected]>
Co-authored-by: Anthony Mahanna <[email protected]>
* Initial chat history implementation without API and docker implementation

* make copy and remove async

* API functionality matching MongoDB implementation

Working API functionality, update to dockerfile required, and additional checks when updating document required.

* Delete temp.py

* Push changes and reset repo

* Async definitions working in curl calls, updated read me to ArangoDB setup

* Working docker container with network

* Removing need for network to be created before docker compose

* Cleanup async files and backup files

* code review

* fix: typo

* revert mongo changes

---------

Co-authored-by: Anthony Mahanna <[email protected]>
@aMahanna aMahanna marked this pull request as ready for review November 27, 2024 13:56
@aMahanna aMahanna marked this pull request as draft November 27, 2024 13:56
ajaykallepalli and others added 12 commits November 27, 2024 09:46
* initial commit: rename arango envs

* fix comment
* initial commit

* fix: env

* Update README.md

* Revert "Update README.md"

This reverts commit 8f750e4.

* fix: create database

* cleanup

* new: chunk embedding generation

* new: `cithash` dep

* cleanup: `ingest_data_to_arango`

* new: envs in `config`

* fix: more envs

* more env cleanup

* fix: deprecated line

* fix: graph doc

* update dataprep-compose

* Dockerfile update and parametrized prepare_doc_arango.py (#15)

* Initial readme and prepare doc arango, with embeddings by Anthony

* Adding git to Dockerfile, tested dockerfile and dockercompose. Also parametrized variables in prepare_doc_arango.py

* Updating readme with adjustable parameters listed

* Only printing debug statements if log flag is on

* add review

* review pt 2

---------

Co-authored-by: Anthony Mahanna <[email protected]>

* update dataprep readme

---------

Co-authored-by: Ajay Kallepalli <[email protected]>
* wip: retriever

* rename: `arango`

* checkpoint

* cleanup

* fix: env

* update retriever compose

* add test file

* fix: config & dockerfile

* fix: embedding field name

* new: config variables

* new: traverse graph after similarity

* fix: string

* add `uniqueVertices`

* add filter

* infra

* fix: query

* remove: `similarity_distance_threshold`

* temp: replace `p`

* cleanup

* remove: `ARANGO_TRAVERSAL_MIN_DEPTH`

* update max_depth

* new: `fetch_neighborhoods`

* fix: test

* cleanup: `prepare_doc_arango.py`

* move `graph` & `vector_db` instantiation

* cleanup: dataprep readme

* cleanup: retriever

* fix: arango test scripts

* Update test_retrievers_arango_langchain.sh

* update `ARANGO_EMBEDDING_DIMENSION`

* fix: env vars

* cleanup: retriever port

* new: `test_dataprep_arango_langchain`

* new: retriever yaml

* Changing naming convention from arangodb to arango to ensure consistency between microservices, updated dockerfile to match and removed space in port

* fix: retriever name

* remove: `retriever_arangodb`

---------

Co-authored-by: Ajay Kallepalli <[email protected]>
* dataprep improvements

* fix: readme

* new: make embedding generation mandatory

* fix: exception handling

* add logs

* new: `ARANGO_USE_GRAPH_NAME`
aMahanna and others added 13 commits January 10, 2025 07:59
* retriever improvements

* new: `collection_count`

* new: `empty_result` object

* remove: `raise` no longer required

* set `LOGFLAG` to `True`

* Removing config variable ARANGO_EMBED_DIMENSION, getting embed dimension automatically from the db

* minor cleanup

* whitespace

* log cleanup

---------

Co-authored-by: Ajay Kallepalli <[email protected]>
…unk overlap, and process table. CURL command will supersede environment variables (#18)
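
The precedence described in the commit title above (a value sent with the request wins over the environment variable) can be sketched as follows. This is an illustration only, not the code from #18: the function name `resolve_setting` and the variable name `PROCESS_TABLE` are hypothetical.

```python
import os


def resolve_setting(name, request_value, env=None, default=None):
    """Return the request value if given, else the environment value, else the default."""
    env = os.environ if env is None else env
    if request_value is not None:
        # A value supplied in the request takes precedence over the environment.
        return request_value
    return env.get(name, default)


# Hypothetical usage: the request asked for "true" while the environment says "false".
chosen = resolve_setting("PROCESS_TABLE", "true", env={"PROCESS_TABLE": "false"})
```

Checking for `None` rather than truthiness lets a request explicitly pass a falsy value such as `0` or an empty string without falling back to the environment.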

3 participants