This lab uses the `llama-3-1-70b` model as an evaluator for a RAG application.
Before getting started, please follow these steps:
- Install all the required packages: `pip install -r requirements.txt`
- Install `utils` by running this command: `pip install -e .`
- Add a `.env` file with your IBM watsonx credentials (following `env_template`).
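As a quick sanity check that the credentials load, here is a minimal sketch assuming `python-dotenv` is installed; the variable names shown are hypothetical placeholders, so use the ones actually listed in `env_template`:

```python
# Sketch only: the real variable names are defined in env_template.
# WATSONX_APIKEY / WATSONX_URL / WATSONX_PROJECT_ID below are placeholders.
import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the current working directory

for var in ("WATSONX_APIKEY", "WATSONX_URL", "WATSONX_PROJECT_ID"):
    print(var, "is set" if os.getenv(var) else "is MISSING")
```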
In the `configs` folder, you will find a `setting.yaml` file that contains all the model specifications for running the evaluation. The details of this file should be changed to match the folder and file names that contain the data you want to evaluate.
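To get a quick look at what the file defines, here is a minimal sketch assuming PyYAML is available (check `requirements.txt`); it only prints the top-level sections, whatever they happen to be:

```python
# Sketch: inspect configs/setting.yaml to see which sections it defines.
import yaml

with open("configs/setting.yaml", "r", encoding="utf-8") as f:
    settings = yaml.safe_load(f)

# The sections referenced later in this README include llm_generate,
# faithfulness, answer_relevancy, customized_metrics, and compare.
for section, value in settings.items():
    print(section, "->", type(value).__name__)
```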
Input ➡️ a `.csv` file containing a DataFrame with columns `["question", "contexts"]`
- "question": the list of questions used by the LLM to generate answers
- "contexts": the contexts retrieved from the RAG application, used as knowledge to answer the questions above

Output ➡️ a `content.csv` file containing a DataFrame with columns `["question", "contexts", "answer"]`
- "answer": the answer generated by the LLM (from the question and contexts above) that will be evaluated
To run this function:
- Create a folder. Its name should relate to the model you will use to generate answers (e.g. `csv_hr_gpt4`).
- Store your input DataFrame in that folder (the `.csv` file containing `["question", "contexts"]`).
- In the `setting.yaml` file, set your `question_csv_location` and `question_csv_name`.
- In the `setting.yaml` file, choose the language model (LLM) used to generate the answers and its source in `llm_generate.source` and `llm_generate.name`. This model will later be evaluated by Llama 3.1 70B.
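A hedged sketch of how these keys might fit together in `setting.yaml`; the key names come from the steps above, but the nesting and the example values are assumptions, so treat the shipped `setting.yaml` as authoritative:

```python
# Sketch only: key names come from this README; nesting and values are assumptions.
import yaml

generate_settings = {
    "question_csv_location": "csv_hr_gpt4",      # folder created above
    "question_csv_name": "questions.csv",        # placeholder file name
    "llm_generate": {
        "source": "watsonx",                     # example source, adjust to your setup
        "name": "your-answer-generation-model",  # the LLM whose answers will be evaluated
    },
}

# Print the YAML so you can compare it against the shipped setting.yaml.
print(yaml.safe_dump(generate_settings, sort_keys=False))
```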
Input ➡️ a `content.csv` file containing a DataFrame with columns `["question", "contexts", "answer"]` (generated by `generate_answer.py`)

Output ➡️ an `eval_faithfulness_xx-xx-xxxx_xxxx.csv` file and an `eval_faithfulness_detail_xx-xx-xxxx_xxxx.csv` file
- `eval_faithfulness_xx-xx-xxxx_xxxx.csv`: DataFrame with columns `["question", "contexts", "answer", "faithfuln"]`, where "faithfuln" is the 0-1 faithfulness score for each row of content.csv
- `eval_faithfulness_detail_xx-xx-xxxx_xxxx.csv`: DataFrame with columns `["question", "contexts", "answer", "divided_answer", "faithfuln"]`
`faithfulness.py`

To run this function:
- In the `setting.yaml` file, under the `faithfulness` section:
  1.1 Set the `content_csv_location` and `content_csv_name` of your content.csv file.
  1.2 In the `llm_divide` and `llm_eval` sections, you can choose the model `source` and `name` you want to use for evaluating content.csv (the default model is Llama 3.1 70B).
- Go to your terminal.
- `cd` into the `llm-as-evaluator-main` folder.
- Run the following command to evaluate faithfulness:
  `python __faithfulness/faithfulness.py`
After running faithfulness.py, `eval_faithfulness_xx-xx-xxxx_xxxx.csv` and `eval_faithfulness_detail_xx-xx-xxxx_xxxx.csv` will be generated in the specified folder.
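For a quick sanity check of the results, here is a minimal pandas sketch; the timestamped file name is a placeholder, and the "faithfuln" column name is taken from the output description above:

```python
# Sketch: summarize a faithfulness evaluation run.
# Replace the file name with your actual timestamped output.
import pandas as pd

eval_df = pd.read_csv("csv_hr_gpt4/eval_faithfulness_01-01-2025_0000.csv")

# "faithfuln" is the 0-1 score column listed in the output description above.
print("rows evaluated:", len(eval_df))
print("mean faithfulness:", eval_df["faithfuln"].mean())
```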
Input ➡️ a `content.csv` file containing a DataFrame with columns `["question", "contexts", "answer"]` (generated by `generate_answer.py`)

Output ➡️ an `eval_ansrelevancy_xx-xx-xxxx_xxxx.csv` file and an `eval_ansrelevancy_detail_xx-xx-xxxx_xxxx.csv` file
- `eval_ansrelevancy_xx-xx-xxxx_xxxx.csv`: DataFrame with columns `["question", "contexts", "answer", "answer_relevancy"]`, where "answer_relevancy" is the 0-1 relevancy score for each row of content.csv
- `eval_ansrelevancy_detail_xx-xx-xxxx_xxxx.csv`: DataFrame with columns `["question", "contexts", "answer", "predicted_answer", "answer_relevancy"]`
`answer_relevancy.py`

To run this function:
- In the `setting.yaml` file, under the `answer_relevancy` section:
  1.1 Set the `content_csv_location` and `content_csv_name` of your content.csv file.
  1.2 In the `embedder_model` section, you can choose the model `source` and `name` you want to use for embedding (the default model is kornwtp/simcse-model-phayathaibert from Hugging Face).
  1.3 In the `llm_eval` section, you can choose the model `source` and `name` you want to use to evaluate content.csv (the default model is Llama 3.1 70B). See the sketch at the end of this section.
- Go to your terminal.
- `cd` into the `llm-as-evaluator-main` folder.
- Run the following command to evaluate answer relevancy:
  `python __answer_relevancy/answer_relevancy.py`
After running answer_relevancy.py, `eval_ansrelevancy_xx-xx-xxxx_xxxx.csv` and `eval_ansrelevancy_detail_xx-xx-xxxx_xxxx.csv` will be generated in the specified folder.
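A hedged sketch of how the `answer_relevancy` block of `setting.yaml` might look, based only on the key names listed above; the nesting, the source values, and the model id are assumptions, so check the shipped `setting.yaml` for the real defaults:

```python
# Sketch only: key names come from this README; nesting and values are assumptions.
import yaml

answer_relevancy_settings = {
    "answer_relevancy": {
        "content_csv_location": "csv_hr_gpt4",            # folder holding content.csv
        "content_csv_name": "content.csv",
        "embedder_model": {
            "source": "huggingface",
            "name": "kornwtp/simcse-model-phayathaibert",  # default embedder per this README
        },
        "llm_eval": {
            "source": "watsonx",                           # example source
            "name": "llama-3-1-70b",                       # example id; check setting.yaml for the exact default
        },
    },
}

print(yaml.safe_dump(answer_relevancy_settings, sort_keys=False))
```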
Input ➡️ a `content.csv` file containing a DataFrame with columns `["question", "contexts", "answer"]` (generated by `generate_answer.py`), plus a `groundtruth.csv` file containing a DataFrame with column `["groundtruth"]`, which holds the actual answers to the questions in content.csv
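For example, a minimal pandas sketch of preparing the ground-truth file; the folder and file names are placeholders, and the rows are assumed to be in the same order as content.csv:

```python
# Sketch: build the groundtruth.csv used by the customized metrics evaluation.
# One "groundtruth" value per row, assumed to line up with the rows of content.csv.
import pandas as pd

groundtruth_df = pd.DataFrame(
    {"groundtruth": ["Employees are entitled to 15 days of annual leave per year."]}
)
groundtruth_df.to_csv("csv_hr_gpt4/groundtruth.csv", index=False)
```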
Output ➡️ an `eval_customized_metrics_xx-xx-xxxx_xxxx.csv` file and an `eval_customized_metrics_detail_xx-xx-xxxx_xxxx.csv` file
- `eval_customized_metrics_xx-xx-xxxx_xxxx.csv`: DataFrame with columns `["question", "contexts", "answer", "customized_matrics"]`, where "customized_matrics" is the 0-10 score for each row of content.csv
- `eval_customized_metrics_detail_xx-xx-xxxx_xxxx.csv`: DataFrame with columns `["question", "contexts", "answer", "predicted_answer", "answer_relevancy"]`
`customized_metrics.py`

To run this function:
- In the `setting.yaml` file, under the `customized_metrics` section:
  1.1 Set the `content_csv_location` and `content_csv_name` of your content.csv file.
  1.2 Set the `ground_truth_csv_location` of your groundtruth.csv file.
  1.3 In the `llm_eval` section, you can choose the model `source` and `name` you want to use to evaluate content.csv (the default model is Llama 3.1 70B).
- Go to your terminal.
- `cd` into the `llm-as-evaluator-main` folder.
- Run the following command to evaluate the customized metrics:
  `python __customized_matrics/customized_matrics.py`
After running customized_metrics.py, `eval_customized_metrics_xx-xx-xxxx_xxxx.csv` and `eval_customized_metrics_detail_xx-xx-xxxx_xxxx.csv` will be generated in the specified folder.
Input ➡️ `eval_xxxxxxxxxxx.csv` files containing the results of the evaluation runs from STEP 2: EVALUATION

Output ➡️ a `compare_xxxx.txt` file containing the average scores of the metrics you want to compare
`compare.py`

To run this function:
- In the `setting.yaml` file, under the `compare` section:
  1.1 Set your `file_name_list`, the list of eval files you want to compare.
  1.2 In the `metric` section, choose the evaluation result you want to compare from `[faitnfulness, answer_relevancy, customized_matrics]`. The metric should match the file names.
- Go to your terminal.
- `cd` into the `llm-as-evaluator-main` folder.
- Run the following command to compare the results:
  `python __compare/compare.py`
After running compare.py, `compare_xxxx.txt` will be generated in the specified folder.
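For orientation, here is a standalone sketch of the same kind of averaging, computed by hand with pandas; this is not compare.py's implementation, and the file names (including the second model folder) are hypothetical:

```python
# Sketch: average one metric column across several eval result files.
# File names are placeholders; pick the score column that matches the metric
# you chose in the compare section of setting.yaml.
import pandas as pd

file_name_list = [
    "csv_hr_gpt4/eval_ansrelevancy_01-01-2025_0000.csv",
    "csv_hr_llama/eval_ansrelevancy_01-01-2025_0000.csv",  # hypothetical second model folder
]

for file_name in file_name_list:
    scores = pd.read_csv(file_name)["answer_relevancy"]
    print(f"{file_name}: mean answer_relevancy = {scores.mean():.3f}")
```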