This project aims to provide a generalized framework for evaluating the performance of existing LLMs on several medical QA tasks and hallucination detection.
The framework now supports:
- Models: ChatGLM, Internist, Llama3, Med42, Meditron
- Datasets: MMLU_biology, MMLU_anatomy, MMLU_medicine, MMLU_clinical, PubMedQA, MedMCQA, HaluEval
We would love to include more models and datasets in the future.
Install all the required dependencies before getting started.
pip install -r requirements.txt
The framework is divided into three steps.
First, get the raw model responses.
Please try to run the following script in an environment with a CUDA-compatible GPU for faster inference.
python3 inference.py -c config.yaml
The following parameters are provided:
- model: select one model for generation each time
- dataset: inference on more than one dataset is supported
- few_shot: number of shots prompted into the model
- CoT: true if you would like to perform Chain of Thought
- k_self_consistency: number of Self-Consistency samples prompted into the model
- top_p: top_p of the model
- temperature: temperature of the model
- batch_size: batch size of the evaluation sets
Please refer to config.yaml for modifications
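As a point of reference, a config for the inference step might look like the sketch below. The parameter names come from the list above; the exact layout and the sample values are illustrative assumptions, so treat config.yaml in the repository as the authoritative structure.

```yaml
# Illustrative inference config -- layout and values are assumptions;
# see the repository's config.yaml for the authoritative structure.
model: Llama3              # one model per run
dataset:                   # inference on more than one dataset is supported
  - MMLU_anatomy
  - PubMedQA
few_shot: 0                # number of shots prompted into the model
CoT: false                 # set to true to perform Chain of Thought
k_self_consistency: 1      # number of Self-Consistency samples
top_p: 0.9
temperature: 0.7
batch_size: 8
```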
Then, the raw responses are shortened to the designated answer space for easier evaluation by calling llama-3.1-70b-versatile through the Groq API.
export GROQ_API_KEY=your_key_here
python3 process_response.py -c config.yaml
Remember to specify the dataset name and corresponding path with chosen_dataset, as shown in config.yaml. You may also want to specify the output path with shortened_save_path (if not specified, results are saved automatically under shortened/<model>/<prompting_techniques>/<dataset>).
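For example, the relevant part of config.yaml might look like the following sketch. The key names chosen_dataset and shortened_save_path come from above, but the mapping shape and the example paths are assumptions made for illustration.

```yaml
# Illustrative shortening config -- the mapping shape and paths are assumptions.
chosen_dataset:
  PubMedQA: responses/llama3/0-shot/PubMedQA.json  # dataset name -> path to its raw responses
shortened_save_path: shortened/llama3/0-shot/PubMedQA  # optional; defaults to shortened/<model>/<prompting_techniques>/<dataset>
```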
Alternatively, for automatic API key deployment, run:
./autoshort_3.sh
In the shell script, replace the following:
- groq.txt -> the path to a txt file that contains your own API keys
- log_file= -> a log file path
- config.yaml -> the yaml file you wish to use with process_response.py
Finally, get the accuracy of the processed responses:
python3 eval.py -c config.yaml
Specify the directory of processed responses with the eval key in config.yaml, e.g. shortened/llama3/0-shot/ to calculate all 0-shot accuracies of the datasets from llama3.
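In config.yaml this could look like the sketch below. The eval key and the example directory come from above; whether the key takes a single path or a list of paths is an assumption.

```yaml
# Illustrative evaluation config -- single-path form is an assumption.
eval: shortened/llama3/0-shot/  # directory of processed responses to score
```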
Check out https://drive.google.com/drive/folders/1H37kkPxt082KgpfQraPAjrR5PbklFaLQ?usp=drive_link for pre-generated raw and processed responses.
Install Gradio
pip install gradio
Run the UI
python3 BenchUI.py