# LLM Response Evaluator

This project provides a Flask API that evaluates Large Language Model (LLM) responses based on:
- Faithfulness: Accuracy of factual information.
- Answer Relevance: Semantic relevance to the user's prompt.
- Context Utilization: Effective use of provided context.
The evaluation helps ensure that the LLM's outputs are accurate, relevant, and contextually appropriate, which is crucial for applications in business environments where incorrect information can have significant consequences.
## Table of Contents

- Overview
- Features
- Getting Started
- Usage
- Understanding the Output
- Improving LLM Operations
- Additional Notes
- License
## Overview

The LLM Response Evaluator is designed to assess the quality of responses generated by language models when interacting with a system of record or business data. It focuses on:
- Faithfulness: Ensuring the response contains accurate facts.
- Answer Relevance: Confirming the response addresses the user's prompt.
- Context Utilization: Checking how well the response uses the provided context.
## Features

- Entity-Level Faithfulness Evaluation: Uses Named Entity Recognition (NER) to compare entities (e.g., numbers, dates, names) in the data input and the LLM response (see the sketch after this list).
- Semantic Relevance Scoring: Employs Sentence-BERT to compute semantic similarity between the user prompt and the LLM response.
- Context Utilization Measurement: Utilizes ROUGE scores to evaluate the overlap between the data input and the LLM response.
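As a rough sketch of how the entity-level check could be implemented (illustrative only; the project's actual logic lives in `app.py` and may differ), spaCy's NER output for the data input and the LLM response can be compared as sets:

```python
# Rough sketch of the entity-level faithfulness check (illustrative only).
# Assumes `pip install spacy` plus the en_core_web_sm model.
import spacy

nlp = spacy.load("en_core_web_sm")


def extract_entities(text: str) -> set[tuple[str, str]]:
    """Return the set of (entity text, entity label) pairs found by spaCy NER."""
    return {(ent.text, ent.label_) for ent in nlp(text).ents}


def faithfulness(data_input: str, llm_response: str) -> dict:
    source = extract_entities(data_input)
    response = extract_entities(llm_response)

    true_positives = response & source    # entities correctly reproduced
    false_positives = response - source   # entities not backed by the data
    false_negatives = source - response   # entities the response omitted

    precision = len(true_positives) / len(response) if response else 0.0
    recall = len(true_positives) / len(source) if source else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "f1_score": f1,
        "true_positives": sorted(true_positives),
        "false_positives": sorted(false_positives),
        "false_negatives": sorted(false_negatives),
    }
```

The other two metrics follow the same pattern: Sentence-BERT supplies the relevance score and ROUGE the context-utilization scores, as shown in the snippets further down this README.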
## Getting Started

### Prerequisites

- Docker: Ensure Docker is installed on your system.
- Docker Compose: Required to build and run the application using `docker-compose`.
### Installation

1. **Clone the Repository**

   ```bash
   git clone https://github.com/yourusername/llm-response-evaluator.git
   cd llm-response-evaluator
   ```

2. **Directory Structure**

   Ensure the following files are present in the project directory:

   - `app.py`
   - `requirements.txt`
   - `Dockerfile`
   - `docker-compose.yml`

3. **Build and Run the Application**

   Use Docker Compose to build and run the application:

   ```bash
   docker-compose up --build
   ```

   This command builds the Docker image and starts the Flask API on port `5000`.
## Usage

- URL: `http://localhost:5000/evaluate`
- Method: `POST`
- Content-Type: `application/json`
- Required Fields:
  - `user_prompt`: The user's input or question to the LLM.
  - `data_input`: The factual data or context provided to the LLM.
  - `llm_response`: The response generated by the LLM.
You can test the API using `curl`:

```bash
curl -X POST -H "Content-Type: application/json" -d '{
  "user_prompt": "Can you provide the latest sales figures for product X in the last quarter?",
  "data_input": "The sales figures for product X in the last quarter were as follows:\n- January: 1,000 units\n- February: 1,200 units\n- March: 1,500 units\nTotal units sold in the last quarter: 3,700 units.",
  "llm_response": "Certainly! In the last quarter, product X sold a total of 3,700 units.\nThe monthly breakdown is:\n- January: 1,000 units\n- February: 1,200 units\n- March: 1,500 units"
}' http://localhost:5000/evaluate
```
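Equivalently, you can call the endpoint from Python with the `requests` library (a convenience example, not part of the project itself):

```python
# Example client call using the requests library (`pip install requests`).
import requests

payload = {
    "user_prompt": "Can you provide the latest sales figures for product X in the last quarter?",
    "data_input": (
        "The sales figures for product X in the last quarter were as follows:\n"
        "- January: 1,000 units\n- February: 1,200 units\n- March: 1,500 units\n"
        "Total units sold in the last quarter: 3,700 units."
    ),
    "llm_response": (
        "Certainly! In the last quarter, product X sold a total of 3,700 units.\n"
        "The monthly breakdown is:\n- January: 1,000 units\n"
        "- February: 1,200 units\n- March: 1,500 units"
    ),
}

response = requests.post("http://localhost:5000/evaluate", json=payload, timeout=30)
print(response.json())
```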
A successful request returns a JSON payload like the following:

```json
{
  "faithfulness": {
    "precision": 1.0,
    "recall": 1.0,
    "f1_score": 1.0,
    "true_positives": [
      ["January", "DATE"],
      ["February", "DATE"],
      ["March", "DATE"],
      ["3,700 units", "QUANTITY"],
      ["1,500 units", "QUANTITY"],
      ["1,000 units", "QUANTITY"],
      ["1,200 units", "QUANTITY"],
      ["product X", "PRODUCT"],
      ["last quarter", "DATE"]
    ],
    "false_positives": [],
    "false_negatives": []
  },
  "relevance": {
    "semantic_similarity": 0.79
  },
  "context_utilization": {
    "rouge1_f1_score": 0.88,
    "rougeL_f1_score": 0.88
  }
}
```
## Understanding the Output

### Faithfulness

- Precision: The proportion of correctly identified entities in the LLM response out of all entities it provided.
- Recall: The proportion of entities from the data input that are present in the LLM response.
- F1 Score: The harmonic mean of precision and recall, providing a balance between the two.
- True Positives: Entities correctly included in the LLM response.
- False Positives: Entities incorrectly included in the LLM response.
- False Negatives: Entities from the data input missing in the LLM response.
In the example, the LLM response perfectly matches the entities from the data input, resulting in scores of 1.0.
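By contrast, suppose a hypothetical response reproduced only 8 of the 9 entities from the data input and also introduced one entity that does not appear in the data. Precision would be 8 / (8 + 1) ≈ 0.89, recall would be 8 / (8 + 1) ≈ 0.89, and the F1 score (their harmonic mean) would also be ≈ 0.89, with the spurious entity listed under `false_positives` and the omitted entity under `false_negatives`.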
### Answer Relevance

- Semantic Similarity: A score between -1 and 1 indicating how semantically similar the LLM response is to the user prompt. Higher values indicate greater relevance.
A score of 0.79 suggests that the LLM response is highly relevant to the user's prompt.
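If you want to reproduce a similar score outside the API, a minimal sketch using the `sentence-transformers` package and the same `all-MiniLM-L6-v2` model might look like this (the exact value can vary with library and model versions and may not match `app.py` exactly):

```python
# Minimal sketch: cosine similarity between prompt and response embeddings.
# Assumes `pip install sentence-transformers`; strings here are illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

user_prompt = "Can you provide the latest sales figures for product X in the last quarter?"
llm_response = "Certainly! In the last quarter, product X sold a total of 3,700 units."

embeddings = model.encode([user_prompt, llm_response], convert_to_tensor=True)
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
print(round(similarity, 2))
```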
### Context Utilization

- ROUGE-1 F1 Score: Measures unigram (word-level) overlap between the data input and the LLM response.
- ROUGE-L F1 Score: Considers the longest common subsequence, reflecting fluency and structural similarity.
Scores of 0.88 indicate strong utilization of the provided context.
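For reference, scores of this kind can be computed with the `rouge-score` package; the snippet below uses toy strings and is illustrative rather than identical to the project's implementation:

```python
# Illustrative ROUGE computation with the rouge-score package
# (`pip install rouge-score`); not necessarily what app.py does internally.
from rouge_score import rouge_scorer

data_input = "January: 1,000 units. February: 1,200 units. March: 1,500 units."
llm_response = "Product X sold 1,000 units in January, 1,200 in February, and 1,500 in March."

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(data_input, llm_response)  # score(target, prediction)

print(round(scores["rouge1"].fmeasure, 2))  # unigram-overlap F1
print(round(scores["rougeL"].fmeasure, 2))  # longest-common-subsequence F1
```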
## Improving LLM Operations

The evaluation metrics provided by this tool help in:
- Identifying Inaccuracies: By examining false positives and negatives, you can pinpoint where the LLM may be introducing errors or omissions.
- Enhancing Relevance: Semantic similarity scores highlight how well the LLM understands and responds to user prompts.
- Optimizing Context Usage: ROUGE scores reveal how effectively the LLM incorporates the provided context, which is crucial for generating coherent and contextually appropriate responses.
By analyzing these metrics, developers and data scientists can fine-tune LLMs to produce more accurate, relevant, and context-aware outputs, leading to better performance in real-world applications.
## Additional Notes

- Dependencies: All necessary Python packages are listed in `requirements.txt`. The Dockerfile ensures that these are installed in the container.
- Model Downloads:
  - The spaCy English model (`en_core_web_sm`) is downloaded during the Docker build process.
  - The Sentence-BERT model (`all-MiniLM-L6-v2`) is automatically downloaded when the application first runs.
- Port Configuration: The Flask app runs on port `5000`. If this port is in use, you can modify the port mapping in `docker-compose.yml`.
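If you prefer to cache the Sentence-BERT model at build time rather than on first run, one option is a small pre-download script executed during the Docker build; the filename and build step below are suggestions, not part of the current setup:

```python
# download_models.py -- optional helper (hypothetical filename), run during
# `docker build` to cache the model in the image instead of at first request.
from sentence_transformers import SentenceTransformer

# Downloads and caches all-MiniLM-L6-v2 in the default cache directory.
SentenceTransformer("all-MiniLM-L6-v2")
```

A `RUN python download_models.py` line after the dependency installation step in the Dockerfile would then bake the model into the image.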
## License

This project is licensed under the MIT License. See the LICENSE file for details.
Feel free to customize and extend this tool to better suit your specific requirements!