Important to know before running anything

./query_centroid.npy contains the centroid of the query embeddings
./arize_1000_centroid.npy contains the centroid of the document chunk embeddings for a chunk size of 1000. The chunk size was an arbitrary choice; you can compute the centroid for any chunk size. I chose 1000 as a trade-off between run time and the amount of information a chunk can hold
./raw_arize_docs.pkl contains the raw arize docs as a list of strings
./indices contains the document embeddings sorted by chunk size. If you want the raw embeddings, they are in here
./experiment_data contains dataframes and graphs
- arize contains a pkl file for the data and graphs for the following experiments: chunk size, k = 4, original/original reranked
- arize_1000 contains only the debiased data (data and graphs). You'll need to use an edited version of llama_index for this to run
- arize_debiased contains all of the experiments comparing the original, original_rerank, and original_debias retrieval methods
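
As a rough illustration of how the files listed above might be read, here is a minimal sketch. It assumes the .npy files are plain NumPy arrays and that the .pkl file is a pickled list of strings, as described above. The helper names `load_artifacts` and `compute_centroid` are hypothetical and are not part of this repo; `compute_centroid` also assumes that a centroid here means the mean embedding vector.

```python
import pickle
from pathlib import Path

import numpy as np


def load_artifacts(root="."):
    """Load both centroids and the raw docs (hypothetical helper, not in the repo)."""
    root = Path(root)
    # Assumption: both .npy files hold plain float arrays of shape (dim,).
    query_centroid = np.load(root / "query_centroid.npy")
    doc_centroid = np.load(root / "arize_1000_centroid.npy")
    with open(root / "raw_arize_docs.pkl", "rb") as f:
        raw_docs = pickle.load(f)  # described above as a list of strings
    return query_centroid, doc_centroid, raw_docs


def compute_centroid(embeddings):
    """Assumed definition: the mean over rows of an (n_chunks, dim) matrix."""
    embeddings = np.asarray(embeddings, dtype=float)
    return embeddings.mean(axis=0)
```

For a different chunk size, the same `compute_centroid` call could be applied to that chunk size's embedding matrix and the result saved with `np.save`.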

How to run experiments

No debias

1. Set up your favorite virtual env with Python 3.10 (e.g. conda)
2. Run pip install -r requirements.txt
3. Make changes to the experiment's main function (e.g. chunk sizes)
   a. Allowable query transformations: 'original', 'original_rerank', 'hyde', 'hyde_rerank'. This list can be empty if you do not want to apply any transformation
4. Run the experiment script:
```
python llama_index_qa.py
```
5. Expect this to take many hours (roughly 2-14 hours, depending on how many configurations you choose). Blame OpenAI for the rate limits :D
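
Since a run takes hours, it may be worth checking the transformation list before starting. The sketch below is only an illustration: the four allowed names come from the list above, but `ALLOWED_TRANSFORMATIONS` and `validate_transformations` are hypothetical names and do not exist in llama_index_qa.py.

```python
# The four names below are the allowable query transformations listed above.
ALLOWED_TRANSFORMATIONS = ("original", "original_rerank", "hyde", "hyde_rerank")


def validate_transformations(transformations):
    """Return the list unchanged if every entry is allowed, else raise ValueError.

    Hypothetical helper: an empty list is valid and means "apply no transformation".
    """
    unknown = [t for t in transformations if t not in ALLOWED_TRANSFORMATIONS]
    if unknown:
        raise ValueError(f"Unknown query transformations: {unknown}")
    return list(transformations)
```

Calling `validate_transformations(["original", "hyde_rerank"])` before the main loop would catch a misspelled name up front instead of hours into a run.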

With debias

Honestly, just message me so I can tell you which llama_index library file you need to edit to subtract the centroids. LlamaIndex may have a way to insert a custom similarity function, but it's really messy to do, so it's easier to just edit the codebase mounted on your workspace in something like VS Code.
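
To give a sense of what subtracting the centroids could look like, here is a standalone NumPy sketch. It is not the actual llama_index edit. It assumes the query centroid is subtracted from the query embedding, the document centroid is subtracted from each chunk embedding, and the results are compared with cosine similarity. The function names are hypothetical, and the default k = 4 mirrors the k used in the experiments above.

```python
import numpy as np


def debiased_cosine_similarity(query_emb, doc_embs, query_centroid, doc_centroid):
    """Cosine similarity after centering (assumed debias scheme, see lead-in).

    query_emb: shape (dim,); doc_embs: shape (n_docs, dim).
    """
    q = np.asarray(query_emb, dtype=float) - np.asarray(query_centroid, dtype=float)
    d = np.asarray(doc_embs, dtype=float) - np.asarray(doc_centroid, dtype=float)
    # Guard against division by zero if a centered vector has zero length.
    denom = np.maximum(np.linalg.norm(d, axis=1) * np.linalg.norm(q), 1e-12)
    return (d @ q) / denom


def top_k(scores, k=4):
    """Indices of the k highest scores, best first."""
    return np.argsort(scores)[::-1][:k]
```

With the centroids loaded from query_centroid.npy and arize_1000_centroid.npy, `top_k(debiased_cosine_similarity(...))` would give the indices of the retrieved chunks under this assumed scheme.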

ericksiavichay/databot

Folders and files

Latest commit

History

Repository files navigation

Important to know before running anything

How to run experiments

No debias

With debias

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages