diff --git a/README.md b/README.md index 1c58a51..5d91e01 100644 --- a/README.md +++ b/README.md @@ -69,6 +69,7 @@ ## News and Updates +- [19/08/2024] Merge [PromptEval](https://github.com/felipemaiapolo/prompteval), an efficient multi-prompt evaluation method, into this repository. - [26/05/2024] Add support for GPT-4o. - [13/03/2024] Add support for multi-modal models and datasets. - [05/01/2024] Add support for BigBench Hard, DROP, ARC datasets. @@ -76,7 +77,6 @@ - [15/12/2023] Add detailed instructions for users to add new modules (models, datasets, etc.) [examples/add_new_modules.md](examples/add_new_modules.md). - [05/12/2023] Published promptbench 0.0.1. - ## Introduction @@ -92,7 +92,7 @@ 2. **Prompt Engineering:** We implemented several prompt engineering methods. For example: [Few-shot Chain-of-Thought](https://arxiv.org/abs/2201.11903) [1], [Emotion Prompt](https://arxiv.org/abs/2307.11760) [2], [Expert Prompting](https://arxiv.org/abs/2305.14688) [3] and so on. 3. **Evaluating adversarial prompts:** promptbench integrated [prompt attacks](https://arxiv.org/abs/2306.04528) [4], enabling researchers to simulate black-box adversarial prompt attacks on models and evaluate their robustness (see details [here](promptbench/prompt_attack/README.md)). 4. **Dynamic evaluation to mitigate potential test data contamination:** we integrated the dynamic evaluation framework [DyVal](https://arxiv.org/pdf/2309.17167) [5], which generates evaluation samples on-the-fly with controlled complexity. - +5. **Efficient multi-prompt evaluation**: We integrated the efficient multi-prompt evaluation method [PromptEval](https://arxiv.org/abs/2405.17202) [8]. This method uses the performance of LLMs on a small amount of data to build an IRT-like model. This model is then used to predict the performance of LLMs on unseen data. 
Tests on MMLU, BBH, and LMentry show that this method requires sampling only 5% of the data to reduce the error between estimated and actual performance to around 2%. @@ -168,7 +168,7 @@ We provide tutorials for: 2. **test the effects of different prompting techniques:** 3. **examine the robustness for prompt attacks**, please refer to [examples/prompt_attack.ipynb](examples/prompt_attack.ipynb) to construct the attacks. 4. **use DyVal for evaluation:** please refer to [examples/dyval.ipynb](examples/dyval.ipynb) to construct DyVal datasets. - +5. **efficient multi-prompt evaluation using PromptEval**: please refer to [examples/efficient_multi_prompt_eval.ipynb](examples/efficient_multi_prompt_eval.ipynb) ## Implemented Components @@ -287,6 +287,7 @@ Please refer to our [benchmark website](https://llm-eval.github.io/) for benchma [7] Zhou D, Schärli N, Hou L, et al. Least-to-most prompting enables complex reasoning in large language models[J]. arXiv preprint arXiv:2205.10625, 2022. +[8] Felipe Maia Polo, et al. "Prompteval: Efficient Multi-prompt Evaluation of Language Models." arXiv preprint arXiv:2405.17202. ## Citing promptbench and other research papers diff --git a/examples/efficient_multi_prompt_eval.ipynb b/examples/efficient_multi_prompt_eval.ipynb new file mode 100644 index 0000000..1a97ee0 --- /dev/null +++ b/examples/efficient_multi_prompt_eval.ipynb @@ -0,0 +1,286 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This example will walk you through the basic usage of PromptBench. We hope that you can get familiar with the APIs and use them in your own projects later." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "First, a single `import promptbench as pb` imports the whole package."
+ ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/home/ubuntu/miniconda3/envs/am/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", + " from .autonotebook import tqdm as notebook_tqdm\n" + ] + } + ], + "source": [ + "import promptbench as pb" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Load dataset\n", + "\n", + "First, PromptBench supports easy load of datasets." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "All supported datasets: \n", + "['sst2', 'cola', 'qqp', 'mnli', 'mnli_matched', 'mnli_mismatched', 'qnli', 'wnli', 'rte', 'mrpc', 'mmlu', 'squad_v2', 'un_multi', 'iwslt2017', 'math', 'bool_logic', 'valid_parentheses', 'gsm8k', 'csqa', 'bigbench_date', 'bigbench_object_tracking', 'last_letter_concat', 'numersense', 'qasc', 'bbh', 'drop', 'arc-easy', 'arc-challenge']\n" + ] + }, + { + "data": { + "text/plain": [ + "[{'content': \"it 's a charming and often affecting journey . \", 'label': 1},\n", + " {'content': 'unflinchingly bleak and desperate ', 'label': 0},\n", + " {'content': 'allows us to hope that nolan is poised to embark a major career as a commercial yet inventive filmmaker . ',\n", + " 'label': 1},\n", + " {'content': \"the acting , costumes , music , cinematography and sound are all astounding given the production 's austere locales . \",\n", + " 'label': 1},\n", + " {'content': \"it 's slow -- very , very slow . 
\", 'label': 0}]" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# print all supported datasets in promptbench\n", + "print('All supported datasets: ')\n", + "print(pb.SUPPORTED_DATASETS)\n", + "\n", + "# load a dataset, sst2, for instance.\n", + "# if the dataset is not available locally, it will be downloaded automatically.\n", + "dataset = pb.DatasetLoader.load_dataset(\"sst2\")\n", + "# dataset = pb.DatasetLoader.load_dataset(\"mmlu\")\n", + "# dataset = pb.DatasetLoader.load_dataset(\"un_multi\")\n", + "# dataset = pb.DatasetLoader.load_dataset(\"iwslt2017\", [\"ar-en\", \"de-en\", \"en-ar\"])\n", + "# dataset = pb.DatasetLoader.load_dataset(\"math\", \"algebra__linear_1d\")\n", + "# dataset = pb.DatasetLoader.load_dataset(\"bool_logic\")\n", + "# dataset = pb.DatasetLoader.load_dataset(\"valid_parenthesesss\")\n", + "\n", + "# print the first 5 examples\n", + "dataset[:5]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Load models\n", + "\n", + "Then, you can easily load LLM models via promptbench." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "All supported models: \n", + "['google/flan-t5-large', 'llama2-7b', 'llama2-7b-chat', 'llama2-13b', 'llama2-13b-chat', 'llama2-70b', 'llama2-70b-chat', 'phi-1.5', 'phi-2', 'palm', 'gpt-3.5-turbo', 'gpt-4', 'gpt-4-1106-preview', 'gpt-3.5-turbo-1106', 'gpt-4-0125-preview', 'gpt-3.5-turbo-0125', 'gpt-4-turbo', 'gpt-4o', 'vicuna-7b', 'vicuna-13b', 'vicuna-13b-v1.3', 'google/flan-ul2', 'gemini-pro', 'mistralai/Mistral-7B-v0.1', 'mistralai/Mistral-7B-Instruct-v0.1', 'mistralai/Mixtral-8x7B-v0.1', 'mistralai/Mixtral-8x7B-Instruct-v0.1', '01-ai/Yi-6B', '01-ai/Yi-34B', '01-ai/Yi-6B-Chat', '01-ai/Yi-34B-Chat', 'baichuan-inc/Baichuan2-7B-Base', 'baichuan-inc/Baichuan2-13B-Base', 'baichuan-inc/Baichuan2-7B-Chat', 'baichuan-inc/Baichuan2-13B-Chat']\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "You are using the default legacy behaviour of the . This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. 
This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565\n", + "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.\n" + ] + } + ], + "source": [ + "# print all supported models in promptbench\n", + "print('All supported models: ')\n", + "print(pb.SUPPORTED_MODELS)\n", + "\n", + "# load a model, flan-t5-large, for instance.\n", + "model = pb.LLMModel(model='google/flan-t5-large', max_new_tokens=10, temperature=0.0001, device='cuda')\n", + "# model = pb.LLMModel(model='llama2-13b-chat', max_new_tokens=10, temperature=0.0001)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Construct prompt list\n", + "\n", + "Prompts are the key interaction interface to LLMs. Some studies find that evaluating a model with a single prompt is unstable, so you can test the model with multiple prompts." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "# using different prompts to evaluate models\n", + "prompt_list = [\n", + " \"Classify the sentence as positive or negative: {content}\",\n", + " \"Determine the emotion of the following sentence as positive or negative: {content}\",\n", + " \"Is the sentiment of this sentence positive or negative? 
{content}\",\n", + " \"Identify whether the sentiment in the following sentence is positive or negative: {content}\",\n", + " \"Assess the sentiment of this statement as either positive or negative: {content}\",\n", + " \"Evaluate the following sentence and indicate if it is positive or negative: {content}\",\n", + " \"Judge the emotional tone of this sentence as positive or negative: {content}\",\n", + " \"Label the sentiment expressed in the sentence as positive or negative: {content}\",\n", + " \"Decide if the sentiment in this statement is positive or negative: {content}\",\n", + " \"Analyze the following sentence and determine if it is positive or negative: {content}\",\n", + " \"Categorize the sentiment of the given sentence as positive or negative: {content}\",\n", + " \"Tell if the following sentence conveys a positive or negative sentiment: {content}\",\n", + " \"Discern whether the emotion in the sentence is positive or negative: {content}\",\n", + " \"Determine if the given sentence expresses a positive or negative sentiment: {content}\",\n", + " \"Conclude if the emotional tone of this sentence is positive or negative: {content}\",\n", + " \"Recognize whether the sentiment of the following statement is positive or negative: {content}\",\n", + " \"Rate the sentiment in this sentence as positive or negative: {content}\",\n", + " \"Classify the emotional tone of the given sentence as positive or negative: {content}\",\n", + " \"Identify the sentiment in this sentence and classify it as positive or negative: {content}\",\n", + " \"Assess if the sentiment of the following statement is positive or negative: {content}\",\n", + " \"Indicate whether the sentiment of this sentence is positive or negative: {content}\",\n", + " \"Determine if the sentiment in this sentence is positive or negative: {content}\",\n", + " \"Judge whether the following sentence has a positive or negative sentiment: {content}\",\n", + " \"Analyze the emotional tone of the sentence and 
classify it as positive or negative: {content}\",\n", + " \"Label the given sentence as having a positive or negative sentiment: {content}\",\n", + " \"Evaluate whether the sentiment in the following sentence is positive or negative: {content}\",\n", + " \"Categorize the given sentence based on whether its sentiment is positive or negative: {content}\",\n", + " \"Determine the emotional quality of the sentence as positive or negative: {content}\",\n", + " \"Is the emotional tone of this sentence positive or negative? {content}\",\n", + " \"Discern the sentiment of the given sentence and label it as positive or negative: {content}\"\n", + "]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You may need to define a projection function for the model output,\n", + "since the labels in the dataset may be formatted differently from the text the model generates.\n", + "For example, the sst2 dataset uses the labels 0 and 1 to represent 'negative' and 'positive',\n", + "but the model outputs the words 'negative' and 'positive'.\n", + "So we define a projection function to map the model output to the label." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "def proj_func(pred):\n", + " mapping = {\n", + " \"positive\": 1,\n", + " \"negative\": 0\n", + " }\n", + " return mapping.get(pred, -1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Efficient evaluation using PromptEval\n", + "PromptEval provides an efficient evaluation method: by observing a small fraction of the samples (about 5% of the total), we can predict the model's performance on the unseen samples. The experiments show that the prediction error is usually within 2% of the true value.
" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|██████████| 1200/1200 [01:31<00:00, 13.10it/s]\n" + ] + }, + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'full_performances': array([0.92496036, 0.9296717 , 0.92268393, 0.91085103, 0.80544397,\n", + " 0.92159231, 0.92195513, 0.92742979, 0.94270547, 0.89704149,\n", + " 0.91374033, 0.91845134, 0.91761411, 0.91542801, 0.90279132,\n", + " 0.88043954, 0.90439775, 0.92869027, 0.89636597, 0.92093057,\n", + " 0.92079846, 0.92634561, 0.93064771, 0.9216495 , 0.93057201,\n", + " 0.92313046, 0.92961422, 0.93589523, 0.93857416, 0.94273073]), 'quantiles': {'5': 0.8876064332443163, '25': 0.9141622483781227, '50': 0.9218023145499467, '75': 0.9293832304634039, '95': 0.9408463788484324}, 'average': 0.9167714158726751, 'std_dev': 0.024776796648712244}\n" + ] + } + ], + "source": [ + "from promptbench.prompteval import efficient_eval\n", + "\n", + "result = efficient_eval(model, prompt_list, dataset, proj_func, \n", + " budget=1200, # The maximum number of examples that can be evaluated during the process. Increasing this value covers more data points, while decreasing it reduces computation.\n", + " visualize=True, # If set to True, the function will generate and display visualizations of the model's performance (combined_result.png), including histograms, boxplots, and cumulative distribution functions (CDFs).\n", + " pca_dim=25, # The number of dimensions retained during PCA on the prompt embeddings. Higher values retain more dimensional information, while lower values reduce dimensionality.\n", + " method='EmbPT') # The evaluation method to use. 'EmbPT' involves embedding the prompts and using these embeddings in model fitting. 
'Rasch' does not obtain prompt embeddings; instead, prompts are one-hot encoded in this method.\n", + "\n", + "print(result)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "promptbench", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.9" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/promptbench/prompteval/__init__.py b/promptbench/prompteval/__init__.py new file mode 100644 index 0000000..cfdb5a7 --- /dev/null +++ b/promptbench/prompteval/__init__.py @@ -0,0 +1 @@ +from .efficient_eval import * \ No newline at end of file diff --git a/promptbench/prompteval/efficient_eval.py b/promptbench/prompteval/efficient_eval.py new file mode 100644 index 0000000..1b375d0 --- /dev/null +++ b/promptbench/prompteval/efficient_eval.py @@ -0,0 +1,213 @@ +# Copyright (c) Microsoft Corporation. +# Licensed under the MIT License. +# +# Source Attribution: +# The majority of this code is derived from the following sources: +# - PromptEval GitHub Repository: https://github.com/felipemaiapolo/prompteval + +import numpy as np +import matplotlib.pyplot as plt +import pandas as pd +from tqdm import tqdm +from sentence_transformers import SentenceTransformer +from sklearn.decomposition import PCA + +from ..utils import InputProcess, OutputProcess +from .methods import StratSample, ExtendedRaschModel + +def get_prompt_embedding(prompt_list, pca_dim): + """ + Generates prompt embeddings using a pre-trained sentence transformer model and reduces + their dimensionality using PCA (Principal Component Analysis). + + Parameters: + prompt_list (list of str): A list of text prompts for which embeddings are to be generated. + pca_dim (int): The number of principal components to retain during dimensionality reduction. 
+ + Returns: + np.ndarray: A matrix where each row corresponds to the reduced-dimensionality embedding + of a prompt. + """ + + embedder = SentenceTransformer('sentence-transformers/facebook-dpr-question_encoder-multiset-base') + pca = PCA(n_components=pca_dim) + X = pca.fit_transform(embedder.encode(prompt_list)) + + return X + +def get_Y_seen(model, prompt_list, example_list, proj_func, budget=1000): + """ + Generates a matrix of observed (seen) examples and their corresponding labels based on + model predictions. The function randomly samples examples up to the given budget and + evaluates the model's performance on those examples. + + Parameters: + model (promptbench.LLMModel): The model to evaluate. + prompt_list (list of str): A list of prompts used to generate input for the model. + example_list (list): A list of labeled examples used for evaluation. + proj_func (function): A function used to project model outputs into a classification + space or other relevant space. + budget (int, optional): The maximum number of examples to be evaluated. Defaults to 1000. + + Returns: + tuple: + seen_examples (np.ndarray): A boolean matrix indicating which examples were observed + (True) and which were not (False). + Y_seen (np.ndarray): A matrix where each element is 1 if the model's prediction matches + the true label, 0 otherwise, and -99 for unseen examples. 
+ """ + + # create an empty matrix Y_seen with one row per prompt and one column per example + example_num = len(example_list) + prompt_num = len(prompt_list) + Y_seen = np.zeros((prompt_num, example_num)) + + # random (stratified) sampling of the (prompt, example) cells to evaluate + seen_examples = StratSample(np.zeros(Y_seen.shape).astype(bool), budget, random_seed=0) + + # using np.where to find the indices of all True elements + true_indices = np.where(seen_examples) + + # iterate over all True indices and fill in the corresponding values in Y_seen + for row, col in tqdm(zip(true_indices[0], true_indices[1]), total=len(true_indices[0])): + prompt = prompt_list[row] + data = example_list[col] + # query the model with the formatted prompt + input_text = InputProcess.basic_format(prompt, data) + label = data['label'] + raw_pred = model(input_text) + # process output + pred = OutputProcess.cls(raw_pred, proj_func) + Y_seen[row, col] = 1 if pred == label else 0 + + # mark the unseen examples + Y_seen[~seen_examples] = -99 # just a placeholder for non-observed cells + + return seen_examples, Y_seen + +def fit_Y(X, Y_seen, seen_examples): + """ + Fits a model to the seen examples using the Extended Rasch Model and calculates the + predicted scores for each prompt. + + Parameters: + X (np.ndarray or None): The matrix of prompt embeddings (None for the basic Rasch model). + Y_seen (np.ndarray): The matrix of observed example results (1 for correct, 0 for incorrect, + -99 for unseen). + seen_examples (np.ndarray): A boolean matrix indicating which examples were observed. + + Returns: + np.ndarray: A vector of predicted scores for each prompt, calculated as the mean over all + examples (observed outcomes for seen cells, predicted probabilities for unseen ones). + """ + + extended_rasch_cov = ExtendedRaschModel() + extended_rasch_cov.fit(seen_examples, Y_seen, X) + S_hat_cov = extended_rasch_cov.get_Y_hat().mean(1) + + return S_hat_cov + +def visualize_result(data): + """ + Visualizes the distribution of model performance using a histogram, boxplot, and + cumulative distribution function (CDF). + + Parameters: + data (np.ndarray): A vector of performance scores to be visualized. 
+ + Returns: + None: The function displays and saves the plots as 'combined_result.png'. + """ + + fig, axes = plt.subplots(1, 3, figsize=(18, 6)) + + # first subplot - Histogram + axes[0].hist(data, alpha=0.75, density=True, label='PromptEval') + # axes[0].hist(groundtruth, alpha=0.75, density=True, label='Ground Truth') + axes[0].set_xlabel("Performance") + axes[0].set_ylabel("Density") + + # second subplot- Boxplot + axes[1].boxplot([data], labels=['PromptEval (cov)']) + axes[1].set_ylabel("Performance Distribution") + + # third subplot - CDF + bins = np.linspace(0, 1.1, 100) + axes[2].hist(data, density=True, cumulative=True, bins=bins, histtype='step', linewidth=1.5, label='PromptEval') + # axes[2].hist(groundtruth, density=True, cumulative=True, bins=bins, histtype='step', linewidth=1.5, label='Ground Truth') + axes[2].set_xlim(0.0, 1.0) + axes[2].legend(fontsize=10) + axes[2].set_xlabel(f"Performance") + axes[2].set_ylabel("CDF") + + plt.tight_layout() + + plt.savefig('combined_result.png') + plt.show() + + +def efficient_eval(model, prompt_list, example_list, proj_func, budget=1000, visualize=True, pca_dim=25, method='EmbPT'): + """ + Efficient evaluation of a model on a list of prompts and examples. + + Parameters: + model (promptbench.LLMModel): The model to evaluate. This is typically a large language model that + will generate responses based on the provided prompts. + prompt_list (list of str): A list of prompts for which the model's performance will be evaluated. + example_list (list): A list of examples used for evaluation purposes. These examples are used + in conjunction with the prompts to generate model responses. + proj_func (function): A projection function used to map the model's output to a desired space + (e.g., embedding space or scoring space). + budget (int, optional): The maximum number of examples to be used for evaluation. + Defaults to 1000. + visualize (bool, optional): Whether to visualize the results. 
If True, a visualization of + the model's performance will be generated. Defaults to True. + pca_dim (int, optional): The number of principal components to retain when using PCA + for dimensionality reduction in the EmbPT method. Defaults to 25. + method (str, optional): The evaluation method to be used. Can be 'EmbPT', which uses prompt + embeddings as covariates, or 'Rasch', which one-hot encodes the prompts. Defaults to 'EmbPT'. + + Returns: + dict: A dictionary containing the following keys: + 'full_performances' (np.ndarray): The complete list of model performance scores + for each prompt after fitting the examples. + 'quantiles' (dict): A dictionary containing the 5th, 25th, 50th, 75th, and 95th + percentiles of the performance scores. + 'average' (float): The average performance score across all prompts. + 'std_dev' (float): The standard deviation of the performance scores. + Note: if visualize=True, the function also displays the plots and saves them to combined_result.png. + """ + + # get prompt embedding + if method == 'EmbPT': + X = get_prompt_embedding(prompt_list, pca_dim) + elif method == 'Rasch': + X = None + else: + raise ValueError("Invalid method specified") + + # get Y_seen + seen_examples, Y_seen = get_Y_seen(model, prompt_list, example_list, proj_func, budget) + # fit Y + S_hat_cov = fit_Y(X, Y_seen, seen_examples) # final scores for the n prompts + + # Calculate quantiles (5th, 25th, 50th, 75th, 95th) + percentile_list = [5, 25, 50, 75, 95] + quantiles = np.percentile(S_hat_cov, percentile_list) + quantiles_dict = {str(k): v for k, v in zip(percentile_list, quantiles)} + + # Calculate the average + average = np.mean(S_hat_cov) + # Calculate the standard deviation + std_dev = np.std(S_hat_cov) + + if visualize: + visualize_result(S_hat_cov) + + # Return the calculated statistics + return { + 'full_performances': S_hat_cov, + 'quantiles': quantiles_dict, + 'average': average, + 'std_dev': std_dev + } \ No newline at end of file diff --git 
a/promptbench/prompteval/methods.py b/promptbench/prompteval/methods.py new file mode 100644 index 0000000..6c8412e --- /dev/null +++ b/promptbench/prompteval/methods.py @@ -0,0 +1,291 @@ +# Copyright (c) Microsoft Corporation. +# Licensed under the MIT License. +# +# Source Attribution: +# The majority of this code is derived from the following sources: +# - PromptEval GitHub Repository: https://github.com/felipemaiapolo/prompteval + +import copy +import numpy as np +from sklearn.linear_model import LogisticRegression as LR # type: ignore +from tqdm import tqdm # type: ignore + +class LogisticRegression: + """ + Logistic regression model. + + Attributes: + reg (float): The regularization parameter for the logistic regression model. This is equivalent to the prior Gaussian covariance scaling in the Bayesian setup with gaussian, ie, prior cov = reg*identity. + """ + + def __init__(self, reg=1e2): + """ + Initializes the logistic regression model with a regularization parameter. + + Parameters: + reg (float): Regularization parameter (default is 100). + """ + self.reg = reg + + def fit(self, X, y): + """ + Fits the logistic regression model to the data. + + Parameters: + X (array-like): Feature matrix. + y (array-like): Target vector of 0s and 1s. + """ + # This block of code is just a trick to run the Scikit-Learn implementation for logistic regression + if np.var(y) == 0: + y_copy = copy.deepcopy(y) + local_state = np.random.RandomState(0) + ind = local_state.choice(len(y_copy)) + y_copy[ind] = 1 - np.median(y_copy) + else: + y_copy = copy.deepcopy(y) + + # Fitting the model + logreg = LR(C=self.reg, random_state=0, solver="liblinear", fit_intercept=False).fit(X, y_copy) + self.mu = logreg.coef_.squeeze() + + +class ExtendedRaschModel: + """ + An extended Rasch model incorporating covariates for both formats and examples. + + Attributes: + seen_examples (array-like): Boolean array indicating seen examples. + Y (array-like): Target matrix of 0s and 1s. 
+ X (array-like): Covariates for formats. + Z (array-like): Covariates for examples. + x_dim (int): Dimension of X. + z_dim (int): Dimension of Z. + n_formats (int): Number of formats. + n_examples (int): Number of examples. + rasch_model (LogisticRegression): The fitted logistic regression model. + gammas (array-like): Coefficients for the format covariates. + thetas (array-like): Format parameters. + psi (array-like): Coefficients for the example covariates. + betas (array-like): Example parameters. + logits (array-like): Logits of the fitted model. + """ + + def __init__(self): + """ + Initializes the extended Rasch model. + """ + pass + + def fit(self, seen_examples, Y, X=None, Z=None): + """ + Fits the extended Rasch model to the data. + + Parameters: + seen_examples (array-like): Boolean array indicating seen examples. + Y (array-like): Target matrix. + X (array-like): Covariates for formats (default is identity matrix). + Z (array-like): Covariates for examples (default is identity matrix). 
+ """ + self.seen_examples = seen_examples + self.Y = Y + + # X (formats covariates) + if type(X) != np.ndarray: + self.X = np.eye(Y.shape[0]) + else: + self.X = X + self.x_dim = self.X.shape[1] + + # Z (examples covariates) + if type(Z) != np.ndarray: + self.Z = np.eye(Y.shape[1]) + else: + self.Z = Z + self.z_dim = self.Z.shape[1] + + # Formatting the data + self.n_formats, self.n_examples = seen_examples.shape + features, labels = GenXY(seen_examples, Y, self.X, self.Z) + + if type(X) != np.ndarray and type(Z) != np.ndarray: # basic Rasch model (no need to include intercept) + features = features[:, :-1] + elif ( + type(X) != np.ndarray or type(Z) != np.ndarray + ): # just one set of covariates (no need to include intercept) + pass + else: # two sets of covariates (need to include intercept) + features = np.hstack((features, np.ones((features.shape[0], 1)))) + + # Fitting the model + self.rasch_model = LogisticRegression() + self.rasch_model.fit(features, labels) + + # Predicted probs + self.gammas = self.rasch_model.mu[: self.x_dim] + self.thetas = self.X @ self.gammas + self.psi = self.rasch_model.mu[self.x_dim :] + + if type(X) != np.ndarray and type(Z) != np.ndarray: # basic Rasch model (no intercept) + self.betas = np.hstack((self.psi, np.array([0]))) + self.logits = self.thetas[:, None] + self.betas[None, :] + elif type(X) != np.ndarray or type(Z) != np.ndarray: # just one set of covariates (no intercept) + self.betas = self.Z @ self.psi + self.logits = self.thetas[:, None] + self.betas[None, :] + else: # two sets of covariates (intercept included) + self.betas = self.Z @ self.psi[:-1] + self.logits = self.thetas[:, None] + self.betas[None, :] + self.psi[-1] + + def get_Y_hat(self): + """ + Computes the predicted probabilities. + + Returns: + array-like: Predicted probabilities. 
+ """ + P_hat = sigmoid(self.logits) + Y_hat = np.zeros(self.seen_examples.shape) + Y_hat[self.seen_examples] = self.Y[self.seen_examples] + Y_hat[~self.seen_examples] = P_hat[~self.seen_examples] + return Y_hat + + +class Baseline: + """ + A baseline model for evaluating prompts. + + Attributes: + seen_examples (array-like): Boolean array indicating seen examples. + quantiles (list): List of quantiles for evaluation. + estimates (dict): Dictionary to store evaluation metrics. + """ + + def __init__(self): + """ + Initializes the baseline model. + """ + pass + + def fit(self, Y, quantiles, rounds_eval, random_seed=None): + """ + Fits the baseline model and evaluates the prompts. + + Parameters: + Y (array-like): Target matrix. + quantiles (list): List of quantiles for evaluation. + rounds_eval (list): List of evaluation rounds. + random_seed (int): Random seed for reproducibility (default is None). + """ + n_formats, n_examples = Y.shape + self.seen_examples = np.zeros(Y.shape).astype(bool) + self.quantiles = quantiles + self.estimates = {"n_seen": [], "estimates": [], "accs_hat": []} + + for num_seen_examples in rounds_eval: + self.seen_examples = StratSample(self.seen_examples, num_seen_examples, random_seed) + eps = 1e-10 + accs = np.array([(Y[i, s].sum() + eps) / (s.sum() + eps) for i, s in enumerate(self.seen_examples)]) + self.estimates["n_seen"].append(self.seen_examples.sum()) + self.estimates["estimates"].append(np.percentile(accs, quantiles).tolist()) + self.estimates["accs_hat"].append(accs.tolist()) + + +def StratSample(seen_examples, max_seen, random_seed, active_arms=None, random_column=False): + """ + Generates a stratified sample from the seen examples matrix until the maximum number of seen examples is reached. + + Parameters: + seen_examples (array-like): The matrix of seen examples. + max_seen (int): The maximum number of seen examples. + random_seed (int): The random seed for reproducibility. 
+ active_arms (list or ndarray, optional): List of active arms. Defaults to None. + random_column (bool, optional): If True, selects a column randomly. Defaults to False. + + Returns: + array-like: The updated matrix of seen examples. + """ + matrix = seen_examples + rows, columns = matrix.shape + + if type(active_arms) == list or type(active_arms) == np.ndarray: + pass + else: + active_arms = list(range(rows)) + + local_state = np.random.RandomState(random_seed) + + # initialize the sums of each row and column + row_sums = matrix.sum(1) + col_sums = matrix.sum(0) + + while True: + + if row_sums.sum() >= max_seen: + return matrix + + min_row_sum = row_sums[active_arms].min() + min_sum_rows = [i for i in active_arms if row_sums[i] == min_row_sum] + next_row = local_state.choice(min_sum_rows) + + avail_columns = [i for i in range(columns) if not matrix[next_row, i]] + + if not avail_columns: # nothing else to see + return matrix + + if random_column: + next_column = local_state.choice(avail_columns) + else: + # the most time-consuming step, so we speed it up by reusing the running column sums instead of recomputing them + min_column_sum = col_sums[avail_columns].min() + min_sum_columns = [i for i in avail_columns if col_sums[i] == min_column_sum] + next_column = local_state.choice(min_sum_columns) + + matrix[next_row, next_column] = True + + row_sums[next_row] += 1 + col_sums[next_column] += 1 + + +def GenXY(seen_items, Y, X, Z): + """ + Generates combined feature and label matrices. + + Parameters: + seen_items (array-like): Matrix indicating which items have been seen. + Y (array-like): Target values matrix. + X (array-like): Feature matrix for the first set of features. + Z (array-like): Feature matrix for the second set of features. + + Returns: + tuple: Combined feature matrix and corresponding labels. 
+ """ + Y_seen = -np.ones(Y.shape) + Y_seen[seen_items] = Y[seen_items] + + W_x = [] + W_z = [] + + labels = [] + for i in range(seen_items.shape[0]): + for j in range(seen_items.shape[1]): + if seen_items[i, j] == True: + W_x.append(X[i]) + W_z.append(Z[j]) + labels.append(Y_seen[i, j]) + + W = np.hstack((np.vstack(W_x), np.vstack(W_z))) + labels = np.array(labels) + return W, labels + + +def sigmoid(x): + """ + Applies the sigmoid function to the input. + + Parameters: + x (array-like): The input data. + + Returns: + array-like: The output of the sigmoid function. + """ + x_clipped = np.clip(x, -30, 30) + return 1 / (1 + np.exp(-x_clipped))
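
Reviewer note: the way `get_Y_hat` and `fit_Y` turn fitted Rasch parameters into per-prompt scores can be sketched in isolation. The block below is a minimal illustration, not the code in this diff: `fill_unseen` is a hypothetical helper, and the toy `thetas` and `betas` values are assumed rather than fitted. It keeps the observed 0/1 outcomes, fills unobserved cells with `sigmoid(theta_i + beta_j)`, and averages each row.

```python
import numpy as np


def sigmoid(x):
    # Clip the logits (as the diff's sigmoid does) to avoid overflow in exp.
    return 1.0 / (1.0 + np.exp(-np.clip(x, -30, 30)))


def fill_unseen(seen, Y, thetas, betas):
    """Hypothetical helper: keep observed outcomes, predict the rest."""
    # Basic Rasch logit for prompt i and example j: theta_i + beta_j.
    logits = thetas[:, None] + betas[None, :]
    P_hat = sigmoid(logits)
    # Observed cells keep their 0/1 value; unseen cells get a probability.
    return np.where(seen, Y, P_hat)


# Toy inputs (assumed for illustration): 2 prompts, 3 examples.
seen = np.array([[True, False, False],
                 [False, True, False]])
Y = np.array([[1.0, -99.0, -99.0],
              [-99.0, 0.0, -99.0]])  # -99 marks unobserved cells
thetas = np.array([0.0, 1.0])         # assumed prompt parameters
betas = np.array([0.0, 0.0, 0.0])     # assumed example parameters

Y_hat = fill_unseen(seen, Y, thetas, betas)
scores = Y_hat.mean(axis=1)           # estimated accuracy per prompt
print(scores)
```

Because observed cells are kept as-is, the `-99` placeholder never reaches the average; only cells with `seen == False` are replaced by model probabilities.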