# LLM Response Evaluator

This project provides a Flask API that evaluates Large Language Model (LLM) responses based on:
- Faithfulness: Accuracy of factual information.
- Answer Relevance: Semantic relevance to the user's prompt.
- Context Utilization: Effective use of provided context.
The evaluation helps ensure that the LLM's outputs are accurate, relevant, and contextually appropriate, which is crucial for applications in business environments where incorrect information can have significant consequences.
## Table of Contents

- Overview
- Features
- Getting Started
- Usage
- Understanding the Output
- Improving LLM Operations
- Additional Notes
- License
## Overview

The LLM Response Evaluator is designed to assess the quality of responses generated by language models when interacting with a system of record or business data. It focuses on:
- Faithfulness: Ensuring the response contains accurate facts.
- Answer Relevance: Confirming the response addresses the user's prompt.
- Context Utilization: Checking how well the response uses the provided context.
## Features

- Entity-Level Faithfulness Evaluation: Uses Named Entity Recognition (NER) to compare entities (e.g., numbers, dates, names) in the data input and the LLM response (see the sketch after this list).
- Semantic Relevance Scoring: Employs Sentence-BERT to compute semantic similarity between the user prompt and the LLM response.
- Context Utilization Measurement: Utilizes ROUGE scores to evaluate the overlap between the data input and the LLM response.
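As a rough sketch of how the entity-level check could be implemented (illustrative only; the project's actual logic lives in `app.py` and may differ), spaCy's NER output for the data input and the LLM response can be compared as sets:

```python
# Rough sketch of the entity-level faithfulness check (illustrative only).
# Assumes `pip install spacy` plus the en_core_web_sm model.
import spacy

nlp = spacy.load("en_core_web_sm")


def extract_entities(text: str) -> set[tuple[str, str]]:
    """Return the set of (entity text, entity label) pairs found by spaCy NER."""
    return {(ent.text, ent.label_) for ent in nlp(text).ents}


def faithfulness(data_input: str, llm_response: str) -> dict:
    source = extract_entities(data_input)
    response = extract_entities(llm_response)

    true_positives = response & source    # entities correctly reproduced
    false_positives = response - source   # entities not backed by the data
    false_negatives = source - response   # entities the response omitted

    precision = len(true_positives) / len(response) if response else 0.0
    recall = len(true_positives) / len(source) if source else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "f1_score": f1,
        "true_positives": sorted(true_positives),
        "false_positives": sorted(false_positives),
        "false_negatives": sorted(false_negatives),
    }
```

The other two metrics follow the same pattern: Sentence-BERT supplies the relevance score and ROUGE the context-utilization scores, as shown in the snippets further down this README.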
## Getting Started

### Prerequisites

- Docker: Ensure Docker is installed on your system.
- Docker Compose: Required to build and run the application using `docker-compose`.
### Installation

1. **Clone the Repository**

   ```bash
   git clone https://github.com/yourusername/llm-response-evaluator.git
   cd llm-response-evaluator
   ```

2. **Directory Structure**

   Ensure the following files are present in the project directory:

   - `app.py`
   - `requirements.txt`
   - `Dockerfile`
   - `docker-compose.yml`

3. **Build and Run the Application**

   Use Docker Compose to build and run the application:

   ```bash
   docker-compose up --build
   ```

   This command builds the Docker image and starts the Flask API on port `5000`.
## Usage

- URL: `http://localhost:5000/evaluate`
- Method: `POST`
- Content-Type: `application/json`
- Required Fields:
  - `user_prompt`: The user's input or question to the LLM.
  - `data_input`: The factual data or context provided to the LLM.
  - `llm_response`: The response generated by the LLM.
You can test the API using `curl`:

```bash
curl -X POST -H "Content-Type: application/json" -d '{
  "user_prompt": "Can you provide the latest sales figures for product X in the last quarter?",
  "data_input": "The sales figures for product X in the last quarter were as follows:\n- January: 1,000 units\n- February: 1,200 units\n- March: 1,500 units\nTotal units sold in the last quarter: 3,700 units.",
  "llm_response": "Certainly! In the last quarter, product X sold a total of 3,700 units.\nThe monthly breakdown is:\n- January: 1,000 units\n- February: 1,200 units\n- March: 1,500 units"
}' http://localhost:5000/evaluate
```
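Equivalently, you can call the endpoint from Python with the `requests` library (a convenience example, not part of the project itself):

```python
# Example client call using the requests library (`pip install requests`).
import requests

payload = {
    "user_prompt": "Can you provide the latest sales figures for product X in the last quarter?",
    "data_input": (
        "The sales figures for product X in the last quarter were as follows:\n"
        "- January: 1,000 units\n- February: 1,200 units\n- March: 1,500 units\n"
        "Total units sold in the last quarter: 3,700 units."
    ),
    "llm_response": (
        "Certainly! In the last quarter, product X sold a total of 3,700 units.\n"
        "The monthly breakdown is:\n- January: 1,000 units\n"
        "- February: 1,200 units\n- March: 1,500 units"
    ),
}

response = requests.post("http://localhost:5000/evaluate", json=payload, timeout=30)
print(response.json())
```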
A successful request returns a JSON payload like the following:

```json
{
  "faithfulness": {
    "precision": 1.0,
    "recall": 1.0,
    "f1_score": 1.0,
    "true_positives": [
      ["January", "DATE"],
      ["February", "DATE"],
      ["March", "DATE"],
      ["3,700 units", "QUANTITY"],
      ["1,500 units", "QUANTITY"],
      ["1,000 units", "QUANTITY"],
      ["1,200 units", "QUANTITY"],
      ["product X", "PRODUCT"],
      ["last quarter", "DATE"]
    ],
    "false_positives": [],
    "false_negatives": []
  },
  "relevance": {
    "semantic_similarity": 0.79
  },
  "context_utilization": {
    "rouge1_f1_score": 0.88,
    "rougeL_f1_score": 0.88
  }
}
```
## Understanding the Output

### Faithfulness

- Precision: The proportion of correctly identified entities in the LLM response out of all entities it provided.
- Recall: The proportion of entities from the data input that are present in the LLM response.
- F1 Score: The harmonic mean of precision and recall, providing a balance between the two.
- True Positives: Entities correctly included in the LLM response.
- False Positives: Entities incorrectly included in the LLM response.
- False Negatives: Entities from the data input missing in the LLM response.
In the example, the LLM response perfectly matches the entities from the data input, resulting in scores of 1.0.
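By contrast, suppose a hypothetical response reproduced only 8 of the 9 entities from the data input and also introduced one entity that does not appear in the data. Precision would be 8 / (8 + 1) ≈ 0.89, recall would be 8 / (8 + 1) ≈ 0.89, and the F1 score (their harmonic mean) would also be ≈ 0.89, with the spurious entity listed under `false_positives` and the omitted entity under `false_negatives`.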
### Answer Relevance

- Semantic Similarity: A score between -1 and 1 indicating how semantically similar the LLM response is to the user prompt. Higher values indicate greater relevance.
A score of 0.79 suggests that the LLM response is highly relevant to the user's prompt.
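If you want to reproduce a similar score outside the API, a minimal sketch using the `sentence-transformers` package and the same `all-MiniLM-L6-v2` model might look like this (the exact value can vary with library and model versions and may not match `app.py` exactly):

```python
# Minimal sketch: cosine similarity between prompt and response embeddings.
# Assumes `pip install sentence-transformers`; strings here are illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

user_prompt = "Can you provide the latest sales figures for product X in the last quarter?"
llm_response = "Certainly! In the last quarter, product X sold a total of 3,700 units."

embeddings = model.encode([user_prompt, llm_response], convert_to_tensor=True)
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
print(round(similarity, 2))
```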
### Context Utilization

- ROUGE-1 F1 Score: Measures unigram (word-level) overlap between the data input and the LLM response.
- ROUGE-L F1 Score: Considers the longest common subsequence, reflecting fluency and structural similarity.
Scores of 0.88 indicate strong utilization of the provided context.
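For reference, scores of this kind can be computed with the `rouge-score` package; the snippet below uses toy strings and is illustrative rather than identical to the project's implementation:

```python
# Illustrative ROUGE computation with the rouge-score package
# (`pip install rouge-score`); not necessarily what app.py does internally.
from rouge_score import rouge_scorer

data_input = "January: 1,000 units. February: 1,200 units. March: 1,500 units."
llm_response = "Product X sold 1,000 units in January, 1,200 in February, and 1,500 in March."

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(data_input, llm_response)  # score(target, prediction)

print(round(scores["rouge1"].fmeasure, 2))  # unigram-overlap F1
print(round(scores["rougeL"].fmeasure, 2))  # longest-common-subsequence F1
```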
## Improving LLM Operations

The evaluation metrics provided by this tool help in:
- Identifying Inaccuracies: By examining false positives and negatives, you can pinpoint where the LLM may be introducing errors or omissions.
- Enhancing Relevance: Semantic similarity scores highlight how well the LLM understands and responds to user prompts.
- Optimizing Context Usage: ROUGE scores reveal how effectively the LLM incorporates the provided context, which is crucial for generating coherent and contextually appropriate responses.
By analyzing these metrics, developers and data scientists can fine-tune LLMs to produce more accurate, relevant, and context-aware outputs, leading to better performance in real-world applications.
## Additional Notes

- Dependencies: All necessary Python packages are listed in `requirements.txt`. The Dockerfile ensures that these are installed in the container.
- Model Downloads:
  - The spaCy English model (`en_core_web_sm`) is downloaded during the Docker build process.
  - The Sentence-BERT model (`all-MiniLM-L6-v2`) is automatically downloaded when the application first runs.
- Port Configuration: The Flask app runs on port `5000`. If this port is in use, you can modify the port mapping in `docker-compose.yml`.
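If you prefer to cache the Sentence-BERT model at build time rather than on first run, one option is a small pre-download script executed during the Docker build; the filename and build step below are suggestions, not part of the current setup:

```python
# download_models.py -- optional helper (hypothetical filename), run during
# `docker build` to cache the model in the image instead of at first request.
from sentence_transformers import SentenceTransformer

# Downloads and caches all-MiniLM-L6-v2 in the default cache directory.
SentenceTransformer("all-MiniLM-L6-v2")
```

A `RUN python download_models.py` line after the dependency installation step in the Dockerfile would then bake the model into the image.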
## License

This project is licensed under the MIT License. See the LICENSE file for details.
Feel free to customize and extend this tool to better suit your specific requirements!