This lab uses the `llama-3-1-70b` model as an evaluator for a RAG application.
Before getting started, please follow these steps:
- Install all the required packages: `pip install -r requirements.txt`
- Install `utils` by running this command: `pip install -e .`
- Add a `.env` file with your IBM watsonx credentials (following `env_template`).
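As a quick sanity check that the credentials load, here is a minimal sketch assuming `python-dotenv` is installed; the variable names shown are hypothetical placeholders, so use the ones actually listed in `env_template`:

```python
# Sketch only: the real variable names are defined in env_template.
# WATSONX_APIKEY / WATSONX_URL / WATSONX_PROJECT_ID below are placeholders.
import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the current working directory

for var in ("WATSONX_APIKEY", "WATSONX_URL", "WATSONX_PROJECT_ID"):
    print(var, "is set" if os.getenv(var) else "is MISSING")
```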
In the `configs` folder, you will find a `setting.yaml` file that contains all the model specifications for running the evaluation. The details of this file should be changed to match the folder and file names that contain the data you want to evaluate.
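To get a quick look at what the file defines, here is a minimal sketch assuming PyYAML is available (check `requirements.txt`); it only prints the top-level sections, whatever they happen to be:

```python
# Sketch: inspect configs/setting.yaml to see which sections it defines.
import yaml

with open("configs/setting.yaml", "r", encoding="utf-8") as f:
    settings = yaml.safe_load(f)

# The sections referenced later in this README include llm_generate,
# faithfulness, answer_relevancy, customized_metrics, and compare.
for section, value in settings.items():
    print(section, "->", type(value).__name__)
```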
Input ➡️ a `.csv` file containing a DataFrame with columns `["question", "contexts"]`
- "question": the list of questions used by the LLM to generate answers
- "contexts": the contexts retrieved from the RAG application, used as knowledge to answer the questions above

Output ➡️ a `content.csv` file containing a DataFrame with columns `["question", "contexts", "answer"]`
- "answer": the answer generated by the LLM (from the question and contexts above) that will be evaluated
To run this function:
- Create a folder. Its name should relate to the model you will use to generate answers (e.g. `csv_hr_gpt4`).
- Store your input DataFrame in that folder (the `.csv` file containing `["question", "contexts"]`).
- In the `setting.yaml` file, set your `question_csv_location` and `question_csv_name`.
- In the `setting.yaml` file, choose the language model (LLM) used to generate the answers and its source in `llm_generate.source` and `llm_generate.name`. This model will later be evaluated by Llama 3.1 70B.
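A hedged sketch of how these keys might fit together in `setting.yaml`; the key names come from the steps above, but the nesting and the example values are assumptions, so treat the shipped `setting.yaml` as authoritative:

```python
# Sketch only: key names come from this README; nesting and values are assumptions.
import yaml

generate_settings = {
    "question_csv_location": "csv_hr_gpt4",      # folder created above
    "question_csv_name": "questions.csv",        # placeholder file name
    "llm_generate": {
        "source": "watsonx",                     # example source, adjust to your setup
        "name": "your-answer-generation-model",  # the LLM whose answers will be evaluated
    },
}

# Print the YAML so you can compare it against the shipped setting.yaml.
print(yaml.safe_dump(generate_settings, sort_keys=False))
```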
Input ➡️ a `content.csv` file containing a DataFrame with columns `["question", "contexts", "answer"]` (generated by `generate_answer.py`)

Output ➡️ an `eval_faithfulness_xx-xx-xxxx_xxxx.csv` file and an `eval_faithfulness_detail_xx-xx-xxxx_xxxx.csv` file
- `eval_faithfulness_xx-xx-xxxx_xxxx.csv`: DataFrame with columns `["question", "contexts", "answer", "faithfuln"]`, where "faithfuln" is the 0-1 faithfulness score for each row of content.csv
- `eval_faithfulness_detail_xx-xx-xxxx_xxxx.csv`: DataFrame with columns `["question", "contexts", "answer", "divided_answer", "faithfuln"]`
`faithfulness.py`

To run this function:
- In the `setting.yaml` file, under the `faithfulness` section:
  1.1 Set the `content_csv_location` and `content_csv_name` of your content.csv file.
  1.2 In the `llm_divide` and `llm_eval` sections, you can choose the model `source` and `name` you want to use for evaluating content.csv (the default model is Llama 3.1 70B).
- Go to your terminal.
- `cd` into the `llm-as-evaluator-main` folder.
- Run the following command to evaluate faithfulness:
  `python __faithfulness/faithfulness.py`
After running faithfulness.py, `eval_faithfulness_xx-xx-xxxx_xxxx.csv` and `eval_faithfulness_detail_xx-xx-xxxx_xxxx.csv` will be generated in the specified folder.
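For a quick sanity check of the results, here is a minimal pandas sketch; the timestamped file name is a placeholder, and the "faithfuln" column name is taken from the output description above:

```python
# Sketch: summarize a faithfulness evaluation run.
# Replace the file name with your actual timestamped output.
import pandas as pd

eval_df = pd.read_csv("csv_hr_gpt4/eval_faithfulness_01-01-2025_0000.csv")

# "faithfuln" is the 0-1 score column listed in the output description above.
print("rows evaluated:", len(eval_df))
print("mean faithfulness:", eval_df["faithfuln"].mean())
```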
Input ➡️ a `content.csv` file containing a DataFrame with columns `["question", "contexts", "answer"]` (generated by `generate_answer.py`)

Output ➡️ an `eval_ansrelevancy_xx-xx-xxxx_xxxx.csv` file and an `eval_ansrelevancy_detail_xx-xx-xxxx_xxxx.csv` file
- `eval_ansrelevancy_xx-xx-xxxx_xxxx.csv`: DataFrame with columns `["question", "contexts", "answer", "answer_relevancy"]`, where "answer_relevancy" is the 0-1 relevancy score for each row of content.csv
- `eval_ansrelevancy_detail_xx-xx-xxxx_xxxx.csv`: DataFrame with columns `["question", "contexts", "answer", "predicted_answer", "answer_relevancy"]`
`answer_relevancy.py`

To run this function:
- In the `setting.yaml` file, under the `answer_relevancy` section:
  1.1 Set the `content_csv_location` and `content_csv_name` of your content.csv file.
  1.2 In the `embedder_model` section, you can choose the model `source` and `name` you want to use for embedding (the default model is kornwtp/simcse-model-phayathaibert from Hugging Face).
  1.3 In the `llm_eval` section, you can choose the model `source` and `name` you want to use to evaluate content.csv (the default model is Llama 3.1 70B). See the sketch at the end of this section.
- Go to your terminal.
- `cd` into the `llm-as-evaluator-main` folder.
- Run the following command to evaluate answer relevancy:
  `python __answer_relevancy/answer_relevancy.py`
After running answer_relevancy.py, `eval_ansrelevancy_xx-xx-xxxx_xxxx.csv` and `eval_ansrelevancy_detail_xx-xx-xxxx_xxxx.csv` will be generated in the specified folder.
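A hedged sketch of how the `answer_relevancy` block of `setting.yaml` might look, based only on the key names listed above; the nesting, the source values, and the model id are assumptions, so check the shipped `setting.yaml` for the real defaults:

```python
# Sketch only: key names come from this README; nesting and values are assumptions.
import yaml

answer_relevancy_settings = {
    "answer_relevancy": {
        "content_csv_location": "csv_hr_gpt4",            # folder holding content.csv
        "content_csv_name": "content.csv",
        "embedder_model": {
            "source": "huggingface",
            "name": "kornwtp/simcse-model-phayathaibert",  # default embedder per this README
        },
        "llm_eval": {
            "source": "watsonx",                           # example source
            "name": "llama-3-1-70b",                       # example id; check setting.yaml for the exact default
        },
    },
}

print(yaml.safe_dump(answer_relevancy_settings, sort_keys=False))
```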
Input ➡️ a `content.csv` file containing a DataFrame with columns `["question", "contexts", "answer"]` (generated by `generate_answer.py`), plus a `groundtruth.csv` file containing a DataFrame with column `["groundtruth"]`, which holds the actual answers to the questions in content.csv
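For example, a minimal pandas sketch of preparing the ground-truth file; the folder and file names are placeholders, and the rows are assumed to be in the same order as content.csv:

```python
# Sketch: build the groundtruth.csv used by the customized metrics evaluation.
# One "groundtruth" value per row, assumed to line up with the rows of content.csv.
import pandas as pd

groundtruth_df = pd.DataFrame(
    {"groundtruth": ["Employees are entitled to 15 days of annual leave per year."]}
)
groundtruth_df.to_csv("csv_hr_gpt4/groundtruth.csv", index=False)
```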
Output ➡️ an `eval_customized_metrics_xx-xx-xxxx_xxxx.csv` file and an `eval_customized_metrics_detail_xx-xx-xxxx_xxxx.csv` file
- `eval_customized_metrics_xx-xx-xxxx_xxxx.csv`: DataFrame with columns `["question", "contexts", "answer", "customized_matrics"]`, where "customized_matrics" is the 0-10 score for each row of content.csv
- `eval_customized_metrics_detail_xx-xx-xxxx_xxxx.csv`: DataFrame with columns `["question", "contexts", "answer", "predicted_answer", "answer_relevancy"]`
`customized_metrics.py`

To run this function:
- In the `setting.yaml` file, under the `customized_metrics` section:
  1.1 Set the `content_csv_location` and `content_csv_name` of your content.csv file.
  1.2 Set the `ground_truth_csv_location` of your groundtruth.csv file.
  1.3 In the `llm_eval` section, you can choose the model `source` and `name` you want to use to evaluate content.csv (the default model is Llama 3.1 70B).
- Go to your terminal.
- `cd` into the `llm-as-evaluator-main` folder.
- Run the following command to evaluate the customized metrics:
  `python __customized_matrics/customized_matrics.py`
After running customized_metrics.py, `eval_customized_metrics_xx-xx-xxxx_xxxx.csv` and `eval_customized_metrics_detail_xx-xx-xxxx_xxxx.csv` will be generated in the specified folder.
Input ➡️ `eval_xxxxxxxxxxx.csv` files containing the results of the evaluation runs from STEP 2: EVALUATION

Output ➡️ a `compare_xxxx.txt` file containing the average scores of the metrics you want to compare
`compare.py`

To run this function:
- In the `setting.yaml` file, under the `compare` section:
  1.1 Set your `file_name_list`, the list of eval files you want to compare.
  1.2 In the `metric` section, choose the evaluation result you want to compare from `[faitnfulness, answer_relevancy, customized_matrics]`. The metric should match the file names.
- Go to your terminal.
- `cd` into the `llm-as-evaluator-main` folder.
- Run the following command to compare the results:
  `python __compare/compare.py`
After running compare.py, `compare_xxxx.txt` will be generated in the specified folder.
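For orientation, here is a standalone sketch of the same kind of averaging, computed by hand with pandas; this is not compare.py's implementation, and the file names (including the second model folder) are hypothetical:

```python
# Sketch: average one metric column across several eval result files.
# File names are placeholders; pick the score column that matches the metric
# you chose in the compare section of setting.yaml.
import pandas as pd

file_name_list = [
    "csv_hr_gpt4/eval_ansrelevancy_01-01-2025_0000.csv",
    "csv_hr_llama/eval_ansrelevancy_01-01-2025_0000.csv",  # hypothetical second model folder
]

for file_name in file_name_list:
    scores = pd.read_csv(file_name)["answer_relevancy"]
    print(f"{file_name}: mean answer_relevancy = {scores.mean():.3f}")
```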