LLM Response Evaluator

This project provides a Flask API that evaluates Large Language Model (LLM) responses based on:

  • Faithfulness: Accuracy of factual information.
  • Answer Relevance: Semantic relevance to the user's prompt.
  • Context Utilization: Effective use of provided context.

The evaluation helps ensure that the LLM's outputs are accurate, relevant, and contextually appropriate, which is crucial for applications in business environments where incorrect information can have significant consequences.


Table of Contents

  • Overview
  • Features
  • Getting Started
  • Usage
  • Understanding the Output
  • Improving LLM Operations
  • Additional Notes
  • License

Overview

The LLM Response Evaluator is designed to assess the quality of responses generated by language models when interacting with a system of record or business data. It focuses on:

  • Faithfulness: Ensuring the response contains accurate facts.
  • Answer Relevance: Confirming the response addresses the user's prompt.
  • Context Utilization: Checking how well the response uses the provided context.

Features

  • Entity-Level Faithfulness Evaluation: Uses Named Entity Recognition (NER) to compare entities (e.g., numbers, dates, names) in the data input and LLM response.
  • Semantic Relevance Scoring: Employs Sentence-BERT to compute semantic similarity between the user prompt and the LLM response.
  • Context Utilization Measurement: Utilizes ROUGE scores to evaluate the overlap between the data input and the LLM response.

Getting Started

Prerequisites

  • Docker: Ensure Docker is installed on your system.
  • Docker Compose: Required to build and run the application using docker-compose.

Installation

  1. Clone the Repository

    git clone https://github.com/Backland-Labs/llm-security.git
    cd llm-security
  2. Directory Structure

    Ensure the following files are present in the project directory:

    - app.py
    - requirements.txt
    - Dockerfile
    - docker-compose.yml
    
  3. Build and Run the Application

    Use Docker Compose to build and run the application:

    docker-compose up --build

    This command builds the Docker image and starts the Flask API on port 5000.
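
Under the hood, the container simply runs app.py. The sketch below shows roughly how the /evaluate endpoint can be wired up; the three helper functions are illustrative names (their bodies are sketched under Understanding the Output), not necessarily the project's actual implementation.

from flask import Flask, jsonify, request

app = Flask(__name__)

REQUIRED_FIELDS = ("user_prompt", "data_input", "llm_response")

@app.route("/evaluate", methods=["POST"])
def evaluate():
    payload = request.get_json(force=True)
    # Reject requests that are missing any of the three required fields.
    missing = [field for field in REQUIRED_FIELDS if field not in payload]
    if missing:
        return jsonify({"error": f"Missing fields: {missing}"}), 400
    # Each helper (defined in the sketches under "Understanding the Output")
    # returns one section of the example response shown below.
    return jsonify({
        "faithfulness": evaluate_faithfulness(payload["data_input"], payload["llm_response"]),
        "relevance": evaluate_relevance(payload["user_prompt"], payload["llm_response"]),
        "context_utilization": evaluate_context_utilization(payload["data_input"], payload["llm_response"]),
    })

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)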


Usage

API Endpoint

  • URL: http://localhost:5000/evaluate
  • Method: POST
  • Content-Type: application/json
  • Required Fields:
    • user_prompt: The user's input or question to the LLM.
    • data_input: The factual data or context provided to the LLM.
    • llm_response: The response generated by the LLM.

Example Request

You can test the API using curl:

curl -X POST -H "Content-Type: application/json" -d '{
  "user_prompt": "Can you provide the latest sales figures for product X in the last quarter?",
  "data_input": "The sales figures for product X in the last quarter were as follows:\n- January: 1,000 units\n- February: 1,200 units\n- March: 1,500 units\nTotal units sold in the last quarter: 3,700 units.",
  "llm_response": "Certainly! In the last quarter, product X sold a total of 3,700 units.\nThe monthly breakdown is:\n- January: 1,000 units\n- February: 1,200 units\n- March: 1,500 units"
}' http://localhost:5000/evaluate
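
The same request can be sent from Python with the requests library (not part of the evaluator itself; install it with pip install requests):

import requests

payload = {
    "user_prompt": "Can you provide the latest sales figures for product X in the last quarter?",
    "data_input": "The sales figures for product X in the last quarter were as follows:\n- January: 1,000 units\n- February: 1,200 units\n- March: 1,500 units\nTotal units sold in the last quarter: 3,700 units.",
    "llm_response": "Certainly! In the last quarter, product X sold a total of 3,700 units.\nThe monthly breakdown is:\n- January: 1,000 units\n- February: 1,200 units\n- March: 1,500 units",
}

# POST to the locally running evaluator and print the metric breakdown.
response = requests.post("http://localhost:5000/evaluate", json=payload, timeout=30)
response.raise_for_status()
print(response.json())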

Example Response

{
	"faithfulness": {
		"precision": 1.0,
		"recall": 1.0,
		"f1_score": 1.0,
		"true_positives": [
			["January", "DATE"],
			["February", "DATE"],
			["March", "DATE"],
			["3,700 units", "QUANTITY"],
			["1,500 units", "QUANTITY"],
			["1,000 units", "QUANTITY"],
			["1,200 units", "QUANTITY"],
			["product X", "PRODUCT"],
			["last quarter", "DATE"]
		],
		"false_positives": [],
		"false_negatives": []
	},
	"relevance": {
		"semantic_similarity": 0.79
	},
	"context_utilization": {
		"rouge1_f1_score": 0.88,
		"rougeL_f1_score": 0.88
	}
}

Understanding the Output

Faithfulness Evaluation

  • Precision: The proportion of correctly identified entities in the LLM response out of all entities it provided.
  • Recall: The proportion of entities from the data input that are present in the LLM response.
  • F1 Score: The harmonic mean of precision and recall, providing a balance between the two.
  • True Positives: Entities correctly included in the LLM response.
  • False Positives: Entities incorrectly included in the LLM response.
  • False Negatives: Entities from the data input missing in the LLM response.

In the example, the LLM response perfectly matches the entities from the data input, resulting in scores of 1.0.
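
A minimal sketch of this entity comparison using spaCy (with the en_core_web_sm model noted under Additional Notes; the exact matching rules in app.py may differ):

import spacy

nlp = spacy.load("en_core_web_sm")

def evaluate_faithfulness(data_input: str, llm_response: str) -> dict:
    # Extract (text, label) entity pairs from both texts.
    source_ents = {(ent.text, ent.label_) for ent in nlp(data_input).ents}
    response_ents = {(ent.text, ent.label_) for ent in nlp(llm_response).ents}

    true_positives = response_ents & source_ents   # entities correctly reproduced
    false_positives = response_ents - source_ents  # entities not backed by the data
    false_negatives = source_ents - response_ents  # entities the response omitted

    precision = len(true_positives) / len(response_ents) if response_ents else 0.0
    recall = len(true_positives) / len(source_ents) if source_ents else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "f1_score": f1,
        "true_positives": sorted(true_positives),
        "false_positives": sorted(false_positives),
        "false_negatives": sorted(false_negatives),
    }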

Answer Relevance

  • Semantic Similarity: A score between -1 and 1 indicating how semantically similar the LLM response is to the user prompt. Higher values indicate greater relevance.

A score of 0.79 suggests that the LLM response is highly relevant to the user's prompt.
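
A sketch of how this similarity can be computed with Sentence-BERT (the all-MiniLM-L6-v2 model noted under Additional Notes):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def evaluate_relevance(user_prompt: str, llm_response: str) -> dict:
    # Embed both texts and take the cosine similarity of the embeddings.
    embeddings = model.encode([user_prompt, llm_response], convert_to_tensor=True)
    similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
    return {"semantic_similarity": round(similarity, 2)}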

Context Utilization

  • ROUGE-1 F1 Score: Measures unigram (word-level) overlap between the data input and the LLM response.
  • ROUGE-L F1 Score: Considers the longest common subsequence, reflecting fluency and structural similarity.

Scores of 0.88 indicate strong utilization of the provided context.
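
A sketch using the rouge-score package (an assumption about the dependency; options such as stemming may differ in app.py):

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)

def evaluate_context_utilization(data_input: str, llm_response: str) -> dict:
    # score(reference, candidate): the data input serves as the reference text.
    scores = scorer.score(data_input, llm_response)
    return {
        "rouge1_f1_score": round(scores["rouge1"].fmeasure, 2),
        "rougeL_f1_score": round(scores["rougeL"].fmeasure, 2),
    }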


Improving LLM Operations

The evaluation metrics provided by this tool help in:

  • Identifying Inaccuracies: By examining false positives and negatives, you can pinpoint where the LLM may be introducing errors or omissions.
  • Enhancing Relevance: Semantic similarity scores highlight how well the LLM understands and responds to user prompts.
  • Optimizing Context Usage: ROUGE scores reveal how effectively the LLM incorporates the provided context, which is crucial for generating coherent and contextually appropriate responses.

By analyzing these metrics, developers and data scientists can fine-tune LLMs to produce more accurate, relevant, and context-aware outputs, leading to better performance in real-world applications.


Additional Notes

  • Dependencies: All necessary Python packages are listed in requirements.txt. The Dockerfile ensures that these are installed in the container.
  • Model Downloads:
    • The spaCy English model (en_core_web_sm) is downloaded during the Docker build process.
    • The Sentence-BERT model (all-MiniLM-L6-v2) is automatically downloaded when the application first runs.
  • Port Configuration: The Flask app runs on port 5000. If this port is in use, you can modify the port mapping in docker-compose.yml.
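
For example, to serve the API on host port 8080 instead, change the ports entry in docker-compose.yml as below (the service name web is illustrative; use whatever name your file defines):

services:
  web:
    ports:
      - "8080:5000"   # host port 8080 -> container port 5000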

License

This project is licensed under the MIT License. See the LICENSE file for details.


Feel free to customize and extend this tool to better suit your specific requirements!
