how to save, load and update vectorstoreindex locally? #4188

IamExperimenting · 2023-05-05T17:43:18Z

Hi team,

I'm creating an index using VectorstoreIndexCreator. Can anyone tell me how to save and load it locally? Rebuilding the index on every run is time-consuming.

from langchain.indexes import VectorstoreIndexCreator
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import CharacterTextSplitter

# `loaders` is a list of document loaders defined elsewhere
index = VectorstoreIndexCreator(
    embedding=HuggingFaceEmbeddings(),
    text_splitter=CharacterTextSplitter(chunk_size=1000, chunk_overlap=0),
).from_loaders(loaders)

Also, how do I update the index when I get new PDF files? Do I need to rebuild it from the beginning, or is there an option to update or merge?

@oddrationale do you have an answer for this?

oddrationale · 2023-05-05T19:50:11Z

It depends on what backend vectorstore you are using. FAISS, for example, allows you to save to disk and also merge two vectorstores together.

But you would need to check with the documentation of your specific vectorstore to know whether something similar is supported. I don't have a lot of experience with the other vectorstores.

0 replies

tancs711 · 2023-06-09T14:59:34Z

For FAISS, I saw documentation along these lines, but I have not tried it myself:

from langchain.vectorstores import FAISS

docsearch = FAISS.from_documents(docs, embeddings)
docsearch.save_local("faiss_index")  # writes the index into the folder "faiss_index"
docsearch = FAISS.load_local("faiss_index", embeddings)

2 replies

kakarottoxue Feb 24, 2024

Hi, what does "local" refer to here? Is there a way to save it to MongoDB?

rutwik777 Jul 28, 2024

faiss_index is the folder name. Just make sure to pass allow_dangerous_deserialization=True when loading it.
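
A minimal sketch of the loading step with the flag mentioned above, under stated assumptions. As far as I know, newer LangChain releases refuse to load a FAISS index from disk unless allow_dangerous_deserialization=True is passed, because part of the saved index is a pickle file, while older releases do not have the argument. The helper name load_index and its store_cls parameter are hypothetical; store_cls stands for whatever vector store class you imported (for example FAISS).

```python
import inspect


def load_index(store_cls, folder, embeddings):
    """Load a saved index, opting in to pickle loading only if the class asks for it.

    Hypothetical helper: store_cls is any class with a load_local classmethod,
    e.g. LangChain's FAISS wrapper. Only load folders you created yourself.
    """
    params = inspect.signature(store_cls.load_local).parameters
    kwargs = {}
    if "allow_dangerous_deserialization" in params:
        # Newer releases require this explicit opt-in; older ones lack the argument.
        kwargs["allow_dangerous_deserialization"] = True
    return store_cls.load_local(folder, embeddings, **kwargs)
```

With LangChain installed this would be called as load_index(FAISS, "faiss_index", embeddings). The flag exists because unpickling a file can run arbitrary code, so it should only be set for index folders you trust.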

sgowdaks · 2023-06-19T23:22:07Z

@IamExperimenting Hi, did you find a solution for updating the index with new PDF files? Thanks!

2 replies

tancs711 Jun 22, 2023

@sgowdaks

Yes, I found this to work. It is based on the idea shown here: https://www.youtube.com/watch?v=BBp8biou3V4 (at about 36 min).

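import os
from langchain.vectorstores import FAISS

# Build an index from the new documents only
faiss_db = FAISS.from_documents(docs, embeddings)

if os.path.exists(FAISS_USERGUIDE_INDEX):
    # Merge the new vectors into the index already on disk
    local_index = FAISS.load_local(FAISS_USERGUIDE_INDEX, embeddings)
    local_index.merge_from(faiss_db)
    local_index.save_local(FAISS_USERGUIDE_INDEX)
else:
    # First run: save the new index as-is
    faiss_db.save_local(folder_path=FAISS_USERGUIDE_INDEX)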

sgowdaks Jun 24, 2023

@tancs711 this really helps, thanks!

catbears · 2023-06-25T18:22:45Z

I want to append to this question, because it's not clear to me yet.
Here is what I did:

Got a few websites as LangChain documents and pickled them as temporary storage in this format:
Document(lc_kwargs={'page_content': 'Here is a lot of text from the website', 'metadata': {'title': 'Title of the page', 'id': '342395246', 'source': 'https://my_source.com'}}, page_content='The whole page content, same as above but more', metadata={'title': 'Title of the page', 'id': '342395246', 'source': 'https://my_source.com'})
Ran them all through a splitter, because some are really big and some not so much.
This is the output of the splitter:
Document(lc_kwargs={'page_content': 'Here is a lot of text from the website', 'metadata': {'title': 'Title of the page', 'id': '342395246', 'source': 'https://my_source.com'}}, page_content='The whole page content, I think, same as above but more', metadata={'title': 'Title of the page', 'id': '342395246', 'source': 'https://my_source.com'})
Then I looped through those with vector_db.add_documents(), adding 100 at a time:
vector_db = Chroma(persist_directory="db", collection_name="my_source", embedding_function=embeddings_model)
def process_batch(docs, embeddings_model, vector_db):
    vector_db.add_documents(documents=docs, embedding=embeddings_model)

It took an awfully long time (I had 110,000 documents), but afterwards retrieval worked. It stopped working after I tried to load the vector store from disk: collection.count() reports only 200 documents, and similarity search returns nothing.

Does anybody have a guess or idea what went wrong? There doesn't seem to be a tutorial (or documentation) that covers a vector store with more than one document.
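
The batching step described above can be sketched as follows. This is only an illustration, not a confirmed fix for the missing documents: the helper name add_in_batches is hypothetical, and the call to persist() after each batch assumes a LangChain Chroma wrapper that still exposes that method (the sketch skips it when the method is absent).

```python
def add_in_batches(vector_db, docs, batch_size=100):
    """Add docs to vector_db in fixed-size batches and return how many were added.

    Hypothetical helper: vector_db is any object with add_documents(), such as
    a LangChain vector store wrapper.
    """
    added = 0
    for start in range(0, len(docs), batch_size):
        batch = docs[start:start + batch_size]
        vector_db.add_documents(batch)
        # Write to disk after every batch if the wrapper offers persist().
        persist = getattr(vector_db, "persist", None)
        if callable(persist):
            persist()
        added += len(batch)
    return added
```

Comparing the returned count with collection.count() after a reload is one way to notice early that documents are not ending up where expected.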

1 reply

catbears Jun 29, 2023

Link to an issue: #6657
Link to a discussion: #5341

I think this is an issue. If I create a db with Chroma methods and add to the collection (see discussion, I created the embeddings separately now), then my documents are there. But I can't load and retrieve them with Langchain - which I'd like to do because of QA with sources.

bondarchukb · 2023-06-29T08:04:29Z

This is probably very vectorstore-specific; it may be better to use the vectorstore's native methods instead of LangChain's .from_documents.

0 replies

tancs711 · 2023-06-29T10:16:19Z

Now, as I add more embeddings, I have run into this problem with FAISS:

--> 379     index = faiss.IndexFlatL2(len(embeddings[0]))
    380     vector = np.array(embeddings, dtype=np.float32)
    381     if normalize_L2:

IndexError: list index out of range

Does anyone have an idea how to resolve this?

I am considering pinecone or another vector store.

@catbears, is there any particular reason you decided on Chroma?

0 replies

catbears · 2023-06-29T10:31:12Z

> @catbears, is there any particular reason you decided on Chroma?

@tancs711 Of the local vector stores supported by LangChain, Chroma came first alphabetically. Also, I found a tutorial that worked 😄 Is FAISS easier to use?

1 reply

tancs711 Jun 29, 2023

I have not tried Chroma. So I am not sure if FAISS is easier.

But so far, FAISS has been pretty simple. The problem I just posted was caused by my own bad error handling.

I managed to load about 720 user guides, split into about 4780 items, and persist the index in a local folder. It took about 13 minutes, and I could reload and read them afterwards.
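
For anyone hitting the same IndexError: in the traceback quoted earlier in the thread, the failing line reads embeddings[0], which raises exactly this error when the list of embeddings is empty, for example when no documents were passed in. A minimal sketch of a guard around the merge-and-save pattern from this thread, assuming that is the cause. The helper name update_local_index and its store_cls parameter are hypothetical; store_cls stands for the imported vector store class (for example FAISS), and load_local is called the same way as in the snippets above.

```python
import os


def update_local_index(store_cls, folder, docs, embeddings):
    """Merge docs into the index saved at folder, creating it on first use.

    Returns True if an index was written, False if there was nothing to add.
    Hypothetical helper; store_cls is e.g. LangChain's FAISS wrapper.
    """
    if not docs:
        # Nothing to embed: skip instead of building an index from an empty list.
        return False
    new_index = store_cls.from_documents(docs, embeddings)
    if os.path.exists(folder):
        # Merge the new vectors into the index already on disk.
        index = store_cls.load_local(folder, embeddings)
        index.merge_from(new_index)
    else:
        index = new_index
    index.save_local(folder)
    return True
```

Returning a boolean rather than raising keeps a batch job running when one input folder happens to contain no new files.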

catbears · 2023-06-29T12:46:37Z

Thanks for the feedback

I think my issue stems from this 'totally not a bug' feature, which I must have overlooked in the documentation: chroma-core/chroma#683
It seems the code that obtains chroma_client can only be called once; otherwise, it creates a new database. This is confusing.

Will try FAISS as well

0 replies

ivanoikon · 2023-07-26T10:47:40Z

Same problem for me using Chroma.

I create an index with

index = VectorstoreIndexCreator(
    vectorstore_kwargs={"persist_directory": "vector_store"},
    embedding=HuggingFaceEmbeddings(model_name='paraphrase-multilingual-MiniLM-L12-v2'),
).from_documents(documents)
index.vectorstore.persist()

Query: "Who is John Doe?". Response: "John Doe is a user"
Then I try to update it by adding documents from a file loader:

vectorstore = Chroma(
    persist_directory="vector_store",
    embedding_function=HuggingFaceEmbeddings(model_name='paraphrase-multilingual-MiniLM-L12-v2'),
)
vectorstore.add_documents(documents)
vectorstore.persist()

Query: "Who is John Doe?". Response: "John Doe is not mentioned in text"
It seems that the index is recreated rather than updated, because the documents added earlier are no longer used.
Any ideas?

0 replies

GenerativeAI4Finance · 2023-09-03T16:30:58Z

It worked for me with Chroma DB after a few corrections, including saving to and loading from disk.

I loaded our website, https://www.vaayushop.com/:

all_splits = text_splitter.split_documents(data)
self.chromadb = Chroma.from_documents(
    documents=all_splits,
    persist_directory=".//embeddings//",
    embedding=OpenAIEmbeddings(),
)
self.qna_chain = RetrievalQA.from_chain_type(
    OpenAI(), retriever=self.chromadb.as_retriever()
)
time.sleep(5)
self.chromadb.persist()

The next time, load from disk:

if os.path.exists(".//embeddings//"):
    print("Embeddings already exist")
    client = chromadb.PersistentClient(path=".//embeddings")
    self.chromadb = Chroma(embedding_function=OpenAIEmbeddings(), client=client)
    self.qna_chain = RetrievalQA.from_chain_type(
        OpenAI(), retriever=self.chromadb.as_retriever()
    )

0 replies

Dandelionym · 2024-01-03T13:19:41Z

Yes, chromadb is without doubt a good choice for saving and loading. For more, please refer to this document: Chroma and Langchain.

1 reply

ALIYoussef May 4, 2024

@Dandelionym chromadb keeps deleted elements in its cache as None!
This leads to errors when you retrieve data from it at runtime!
Did you face a similar issue?

Replies: 11 comments · 7 replies
