Chromadb Cleanup #15

Open
shortcipher3 opened this issue Apr 22, 2024 · 1 comment

Comments

@shortcipher3

It would be nice to have a cleanup command that fully cleaned up the sqlite database.

Some background:

I was testing latency on a database with ~1 million records and noticed that queries filtered on metadata were slow, at ~8 seconds/query. I wanted to see how database size affected this, so I removed records until ~25k were left.

Inspecting the disk size of the database revealed that the size hadn't changed after removing the records.

Then I ran:

chops clean-wal /path/to/persist_dir

This reduced my sqlite3 database from 7.7 GB to 2.7 GB and sped up my query from ~8 seconds to ~2.5 seconds.
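
For context on why deleting rows alone did not shrink the file: SQLite keeps freed pages on an internal freelist and reuses them later, and the file only gets smaller when it is rebuilt with `VACUUM` (unless auto-vacuum is enabled). Below is a minimal sketch of that mechanism using Python's standard `sqlite3` module on a throwaway database. The helper names `vacuum_sqlite` and `demo` are made up for illustration, and whether it is safe to run this directly against Chroma's sqlite file is an assumption that has not been verified here; if tried, it should be on a copy of the persist directory with no Chroma process attached.

```python
import os
import sqlite3
import tempfile


def vacuum_sqlite(db_path: str) -> tuple[int, int]:
    """Run VACUUM on a SQLite file and return (size_before, size_after) in bytes."""
    size_before = os.path.getsize(db_path)
    # isolation_level=None puts the connection in autocommit mode;
    # VACUUM cannot run inside an open transaction.
    conn = sqlite3.connect(db_path, isolation_level=None)
    try:
        # Pages freed by DELETE sit on the freelist until the file is rebuilt.
        free_pages = conn.execute("PRAGMA freelist_count").fetchone()[0]
        print(f"free pages before VACUUM: {free_pages}")
        conn.execute("VACUUM")
    finally:
        conn.close()
    return size_before, os.path.getsize(db_path)


def demo() -> tuple[int, int]:
    """Build a throwaway database, delete most rows, then vacuum it."""
    db_path = os.path.join(tempfile.mkdtemp(), "demo.sqlite3")
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.executemany(
        "INSERT INTO records (payload) VALUES (?)",
        (("x" * 1000,) for _ in range(5000)),
    )
    conn.commit()
    # Deleting rows frees pages inside the file but does not shrink it.
    conn.execute("DELETE FROM records WHERE id > 100")
    conn.commit()
    conn.close()
    return vacuum_sqlite(db_path)


if __name__ == "__main__":
    before, after = demo()
    print(f"size before: {before} bytes, after: {after} bytes")
```

This only covers the sqlite side; it does nothing for the vector index.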

Then I thought, what if I had a fresh database, so I ran:

# Read the remaining records (ids are always returned by get) and add them to a fresh collection.
results = old_collection.get(limit=30_000, offset=0, include=["metadatas", "embeddings"])
new_collection.add(ids=results["ids"], embeddings=results["embeddings"], metadatas=results["metadatas"])

This reduced my sqlite3 database down to 187 MB and also reduced my vector index to several MB from a few GB. It also sped up my query to <0.2 seconds.
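
For a collection too large to read in one call, the same copy can be done in batches. This is a rough sketch rather than a tested migration tool: `copy_collection` and `batch_size` are names invented here, it relies only on the `count`, `get` and `add` methods of the collection objects, it assumes every record has metadata, and like the snippet above it copies ids, embeddings and metadatas but not documents.

```python
def copy_collection(old_collection, new_collection, batch_size=5000):
    """Copy ids, embeddings and metadatas between collections in batches.

    Returns the number of records copied.
    """
    total = old_collection.count()
    copied = 0
    for offset in range(0, total, batch_size):
        # Page through the old collection using limit/offset.
        batch = old_collection.get(
            limit=batch_size,
            offset=offset,
            include=["metadatas", "embeddings"],
        )
        if not batch["ids"]:
            break  # nothing left to copy
        new_collection.add(
            ids=batch["ids"],
            embeddings=batch["embeddings"],
            metadatas=batch["metadatas"],
        )
        copied += len(batch["ids"])
    return copied
```

With the collections from the snippet above this would be called as `copy_collection(old_collection, new_collection)`. The old collection should not be written to while the copy runs, since offset-based paging could then skip or repeat records.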

It would be nice to be able to do the same type of cleanup in production without starting over completely. I'm thinking of applications with a rolling window of data.

Maybe this isn't realistic given how the vector indexing works. Please advise; I would love to understand more.

@tazarov
Contributor

tazarov commented Jul 20, 2024

Hey @shortcipher3, thanks for this. You raise a valid point about HNSW: the index needs to be rebuilt to optimize it. That is on my backlog of things to add to Chroma and, for now, to chops.

Regarding the sqlite3 situation, I'll investigate. The clean-wal command is intended only for the WAL, which in Chroma is unbounded.
