
Commit

Merge pull request #79 from icecream-and-tea/main
add prompteval
jindongwang authored Aug 20, 2024
2 parents 909d3f0 + dfd6a8c commit f40df6a
Showing 5 changed files with 795 additions and 3 deletions.
7 changes: 4 additions & 3 deletions README.md
@@ -69,14 +69,14 @@
<!-- News and Updates -->

## News and Updates
- [19/08/2024] Merge [PromptEval](https://github.com/felipemaiapolo/prompteval), an efficient multi-prompt evaluation method, into this repository.
- [26/05/2024] Add support for GPT-4o.
- [13/03/2024] Add support for multi-modal models and datasets.
- [05/01/2024] Add support for BigBench Hard, DROP, ARC datasets.
- [16/12/2023] Add support for Gemini, Mistral, Mixtral, Baichuan, Yi models.
- [15/12/2023] Add detailed instructions for users to add new modules (models, datasets, etc.) [examples/add_new_modules.md](examples/add_new_modules.md).
- [05/12/2023] Published promptbench 0.0.1.


<!-- Introduction -->

## Introduction
@@ -92,7 +92,7 @@
2. **Prompt Engineering:** We implemented several prompt engineering methods. For example: [Few-shot Chain-of-Thought](https://arxiv.org/abs/2201.11903) [1], [Emotion Prompt](https://arxiv.org/abs/2307.11760) [2], [Expert Prompting](https://arxiv.org/abs/2305.14688) [3] and so on.
3. **Evaluating adversarial prompts:** promptbench integrated [prompt attacks](https://arxiv.org/abs/2306.04528) [4], enabling researchers to simulate black-box adversarial prompt attacks on models and evaluate their robustness (see details [here](promptbench/prompt_attack/README.md)).
4. **Dynamic evaluation to mitigate potential test data contamination:** we integrated the dynamic evaluation framework [DyVal](https://arxiv.org/pdf/2309.17167) [5], which generates evaluation samples on-the-fly with controlled complexity.

5. **Efficient multi-prompt evaluation:** We integrated the efficient multi-prompt evaluation method [PromptEval](https://arxiv.org/abs/2405.17202) [8]. PromptEval evaluates the LLM on only a small, stratified sample of (prompt, example) pairs, fits an IRT-style (extended Rasch) model to those observations, and uses the fitted model to predict performance on the unseen examples. On MMLU, BBH, and LMentry, sampling only about 5% of the data keeps the gap between estimated and true performance around 2%. A minimal sketch of the underlying idea is shown below.
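To make the idea concrete, here is a minimal sketch of a Rasch-style estimator (an illustration only, not the integrated implementation; PromptEval's extended Rasch model additionally uses prompt-embedding covariates, and the sign convention and toy values below are illustrative assumptions):

```python
import numpy as np

def rasch_prob(theta, beta):
    # Probability that a prompt answers an example correctly under a Rasch-style model:
    # theta is the prompt "ability", beta the example "easiness".
    return 1.0 / (1.0 + np.exp(-(theta + beta)))

# Fit theta/beta on a small observed subset of (prompt, example) results, then estimate each
# prompt's full-set accuracy as the mean predicted probability over *all* examples.
theta = np.array([0.3, -0.1, 0.8])   # one ability per prompt (illustrative values)
beta = np.random.normal(size=500)    # one easiness per example (illustrative values)
estimated_accuracy = rasch_prob(theta[:, None], beta[None, :]).mean(axis=1)
print(estimated_accuracy)            # one estimated score per prompt
```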


<!-- GETTING STARTED -->
@@ -168,7 +168,7 @@ We provide tutorials for:
2. **test the effects of different prompting techniques:**
3. **examine the robustness for prompt attacks**, please refer to [examples/prompt_attack.ipynb](examples/prompt_attack.ipynb) to construct the attacks.
4. **use DyVal for evaluation:** please refer to [examples/dyval.ipynb](examples/dyval.ipynb) to construct DyVal datasets.

5. **efficient multi-prompt evaluation using PromptEval:** please refer to [examples/efficient_multi_prompt_eval.ipynb](examples/efficient_multi_prompt_eval.ipynb); a minimal usage sketch follows below.
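For orientation, the call might look roughly like this (a minimal sketch; the model name, dataset, prompt templates, and label mapping are illustrative assumptions, and the notebook above is the authoritative walkthrough):

```python
import promptbench as pb
from promptbench.prompteval import efficient_eval

model = pb.LLMModel(model="google/flan-t5-large", max_new_tokens=10)  # illustrative model choice
dataset = pb.DatasetLoader.load_dataset("sst2")                       # illustrative dataset
prompts = [
    "Classify the sentence as positive or negative: {content}",
    "Is the sentiment of this sentence positive or negative? {content}",
]

def proj_func(pred):
    # map the model's raw text output onto the dataset's integer labels
    return {"positive": 1, "negative": 0}.get(pred, -1)

results = efficient_eval(model, prompts, dataset, proj_func, budget=500, visualize=True)
```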

## Implemented Components

@@ -287,6 +287,7 @@ Please refer to our [benchmark website](https://llm-eval.github.io/) for benchma

[7] Zhou D, Schärli N, Hou L, et al. Least-to-most prompting enables complex reasoning in large language models[J]. arXiv preprint arXiv:2205.10625, 2022.

[8] Polo F M, et al. PromptEval: efficient multi-prompt evaluation of language models[J]. arXiv preprint arXiv:2405.17202, 2024.

<!-- CITE -->

## Citing promptbench and other research papers
286 changes: 286 additions & 0 deletions examples/efficient_multi_prompt_eval.ipynb

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions promptbench/prompteval/__init__.py
@@ -0,0 +1 @@
from .efficient_eval import *
213 changes: 213 additions & 0 deletions promptbench/prompteval/efficient_eval.py
@@ -0,0 +1,213 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
#
# Source Attribution:
# The majority of this code is derived from the following sources:
# - PromptEval GitHub Repository: https://github.com/felipemaiapolo/prompteval

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from tqdm import tqdm
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA

from ..utils import InputProcess, OutputProcess
from .methods import StratSample, ExtendedRaschModel

def get_prompt_embedding(prompt_list, pca_dim):
"""
Generates prompt embeddings using a pre-trained sentence transformer model and reduces
their dimensionality using PCA (Principal Component Analysis).
Parameters:
prompt_list (list of str): A list of text prompts for which embeddings are to be generated.
pca_dim (int): The number of principal components to retain during dimensionality reduction.
Returns:
np.ndarray: A matrix where each row corresponds to the reduced-dimensionality embedding
of a prompt.
"""

embedder = SentenceTransformer('sentence-transformers/facebook-dpr-question_encoder-multiset-base')
pca = PCA(n_components=pca_dim)
X = pca.fit_transform(embedder.encode(prompt_list))

return X

def get_Y_seen(model, prompt_list, example_list, proj_func, budget=1000):
"""
Generates a matrix of observed (seen) examples and their corresponding labels based on
model predictions. The function randomly samples examples up to the given budget and
evaluates the model's performance on those examples.
Parameters:
model (promptbench.LLMModel): The model to evaluate.
prompt_list (list of str): A list of prompts used to generate input for the model.
example_list (list): A list of labeled examples used for evaluation.
proj_func (function): A function that maps the model's raw text output to a label that can be
compared with each example's ground-truth label (used via OutputProcess.cls).
budget (int, optional): The total number of (prompt, example) pairs to evaluate, i.e. the number
of model calls. Defaults to 1000.
Returns:
tuple:
seen_examples (np.ndarray): A boolean matrix indicating which examples were observed
(True) and which were not (False).
Y_seen (np.ndarray): A matrix where each element is 1 if the model's prediction matches
the true label, 0 otherwise, and -99 for unseen examples.
"""

# create a zero matrix Y_seen with one row per prompt (template) and one column per example
example_num = len(example_list)
prompt_num = len(prompt_list)
Y_seen = np.zeros((prompt_num, example_num))

# randomly sample which (prompt, example) pairs to observe, up to the given budget
seen_examples = StratSample(np.zeros(Y_seen.shape).astype(bool), budget, random_seed=0)

# using np.where to find the indices of all True elements
true_indices = np.where(seen_examples)

# iterate over all True indices and fill in the corresponding values in Y_seen
for row, col in tqdm(zip(true_indices[0], true_indices[1]), total=len(true_indices[0])):
prompt = prompt_list[row]
data = example_list[col]
# build the input, query the model, and score the prediction
input_text = InputProcess.basic_format(prompt, data)
label = data['label']
raw_pred = model(input_text)
# process output
pred = OutputProcess.cls(raw_pred, proj_func)
Y_seen[row, col] = 1 if pred == label else 0

# mark the unseen examples
Y_seen[~seen_examples] = -99  # placeholder value for unobserved (prompt, example) pairs

return seen_examples, Y_seen

def fit_Y(X, Y_seen, seen_examples):
"""
Fits a model to the seen examples using the Extended Rasch Model and calculates the
predicted scores for each prompt.
Parameters:
X (np.ndarray): The matrix of prompt embeddings.
Y_seen (np.ndarray): The matrix of observed example results (1 for correct, 0 for incorrect,
-99 for unseen).
seen_examples (np.ndarray): A boolean matrix indicating which examples were observed.
Returns:
np.ndarray: A vector of estimated scores, one per prompt, computed as the mean of the
model-predicted correctness over all examples.
"""

extended_rasch_cov = ExtendedRaschModel()
extended_rasch_cov.fit(seen_examples, Y_seen, X)
S_hat_cov = extended_rasch_cov.get_Y_hat().mean(1)

return S_hat_cov

def visualize_result(data):
"""
Visualizes the distribution of model performance using a histogram, boxplot, and
cumulative distribution function (CDF).
Parameters:
data (np.ndarray): A vector of performance scores to be visualized.
Returns:
None: The function displays and saves the plots as 'combined_result.png'.
"""

fig, axes = plt.subplots(1, 3, figsize=(18, 6))

# first subplot - Histogram
axes[0].hist(data, alpha=0.75, density=True, label='PromptEval')
# axes[0].hist(groundtruth, alpha=0.75, density=True, label='Ground Truth')
axes[0].set_xlabel("Performance")
axes[0].set_ylabel("Density")

# second subplot- Boxplot
axes[1].boxplot([data], labels=['PromptEval (cov)'])
axes[1].set_ylabel("Performance Distribution")

# third subplot - CDF
bins = np.linspace(0, 1.1, 100)
axes[2].hist(data, density=True, cumulative=True, bins=bins, histtype='step', linewidth=1.5, label='PromptEval')
# axes[2].hist(groundtruth, density=True, cumulative=True, bins=bins, histtype='step', linewidth=1.5, label='Ground Truth')
axes[2].set_xlim(0.0, 1.0)
axes[2].legend(fontsize=10)
axes[2].set_xlabel("Performance")
axes[2].set_ylabel("CDF")

plt.tight_layout()

plt.savefig('combined_result.png')
plt.show()


def efficient_eval(model, prompt_list, example_list, proj_func, budget=1000, visualize=True, pca_dim=25, method='EmbPT'):
"""
Efficient evaluation of a model on a list of prompts and examples.
Parameters:
model (promptbench.LLMModel): The model to evaluate. This is typically a large language model that
will generate responses based on the provided prompts.
prompt_list (list of str): A list of prompts for which the model's performance will be evaluated.
example_list (list): A list of examples used for evaluation purposes. These examples are used
in conjunction with the prompts to generate model responses.
proj_func (function): A projection function that maps the model's raw text output to a label
comparable with each example's ground-truth label.
budget (int, optional): The total number of (prompt, example) pairs to evaluate, i.e. the
number of model calls. Defaults to 1000.
visualize (bool, optional): Whether to visualize the results. If True, a visualization of
the model's performance will be generated. Defaults to True.
pca_dim (int, optional): The number of principal components to retain when using PCA
for dimensionality reduction in the EmbPT method. Defaults to 25.
method (str, optional): The evaluation method to be used. 'EmbPT' uses PCA-reduced prompt
embeddings as covariates for the extended Rasch model; 'Rasch' fits the model without
prompt-embedding covariates. Defaults to 'EmbPT'.
Returns:
dict: A dictionary containing the following keys:
'full_performances' (np.ndarray): The estimated performance score for each prompt,
computed from the fitted model.
'quantiles' (dict): A dictionary containing the 5th, 25th, 50th, 75th, and 95th
percentiles of the performance scores.
'average' (float): The average performance score across all prompts.
'std_dev' (float): The standard deviation of the performance scores.
Note: if visualize=True, the function additionally saves the performance plots to 'combined_result.png'.
"""

# get prompt embedding
if method == 'EmbPT':
X = get_prompt_embedding(prompt_list, pca_dim)
elif method == 'Rasch':
X = None
else:
raise ValueError("Invalid method specified")

# get Y_seen
seen_examples, Y_seen = get_Y_seen(model, prompt_list, example_list, proj_func, budget)
# fit Y
S_hat_cov = fit_Y(X, Y_seen, seen_examples)  # final estimated score for each prompt

# Calculate quantiles (5th, 25th, 50th, 75th, 95th)
percentile_list = [5, 25, 50, 75, 95]
quantiles = np.percentile(S_hat_cov, percentile_list)
quantiles_dict = {str(k): v for k, v in zip(percentile_list, quantiles)}

# Calculate the average
average = np.mean(S_hat_cov)
# Calculate the standard deviation
std_dev = np.std(S_hat_cov)

if visualize:
visualize_result(S_hat_cov)

# Return the calculated statistics
return {
'full_performances': S_hat_cov,
'quantiles': quantiles_dict,
'average': average,
'std_dev': std_dev
}
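A usage note on this entry point (a sketch under assumptions, not part of this commit): `model`, `prompt_list`, `example_list`, and `proj_func` are assumed to be set up as in examples/efficient_multi_prompt_eval.ipynb, and the quantile keys of the returned dictionary are the strings '5', '25', '50', '75', '95'. Passing method='Rasch' skips the sentence-transformer embedding step and fits the model without prompt-embedding covariates:

```python
from promptbench.prompteval import efficient_eval

results = efficient_eval(model, prompt_list, example_list, proj_func,
                         budget=500, visualize=False, method='Rasch')

print("per-prompt estimated accuracy:", results['full_performances'])
print("mean and std over prompts:", results['average'], results['std_dev'])
print("median (50th percentile):", results['quantiles']['50'])
```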
