This project aims to provide a generalized framework for evaluating the performance of existing LLMs on several medical QA tasks and hallucination detection.
The framework now supports:
- Models: ChatGLM, Internist, Llama3, Med42, Meditron
- Datasets: MMLU_biology, MMLU_anatomy, MMLU_medicine, MMLU_clinical, PubMedQA, MedMCQA, HaluEval
We would love to include more models and datasets in the future.
Install all the required dependencies before getting started.
pip install -r requirements.txt
The framework is divided into three steps.
First, get the raw model responses.
Please try to run the following script in an environment with a CUDA-compatible GPU for faster inference.
python3 inference.py -c config.yaml
The following parameters are provided:
- model: select one model for generation each time
- dataset: inference on more than one dataset is supported
- few_shot: number of shots prompted into the model
- CoT: true if you would like to perform Chain of Thought
- k_self_consistency: number of Self-Consistency samples prompted into the model
- top_p: top_p of the model
- temperature: temperature of the model
- batch_size: batch size of the evaluation sets
Please refer to config.yaml for modifications
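As a point of reference, a config for the inference step might look like the sketch below. The parameter names come from the list above; the exact layout and the sample values are illustrative assumptions, so treat config.yaml in the repository as the authoritative structure.

```yaml
# Illustrative inference config -- layout and values are assumptions;
# see the repository's config.yaml for the authoritative structure.
model: Llama3              # one model per run
dataset:                   # inference on more than one dataset is supported
  - MMLU_anatomy
  - PubMedQA
few_shot: 0                # number of shots prompted into the model
CoT: false                 # set to true to perform Chain of Thought
k_self_consistency: 1      # number of Self-Consistency samples
top_p: 0.9
temperature: 0.7
batch_size: 8
```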
Then, the raw responses are shortened to the designated answer space for easier evaluation by calling llama-3.1-70b-versatile through the Groq API.
export GROQ_API_KEY=your_key_here
python3 process_response.py -c config.yaml
Remember to specify the dataset name and corresponding path with chosen_dataset, as shown in config.yaml. You may also want to specify the output path with shortened_save_path (if not specified, results are saved automatically under shortened/<model>/<prompting_techniques>/<dataset>).
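For example, the relevant part of config.yaml might look like the following sketch. The key names chosen_dataset and shortened_save_path come from above, but the mapping shape and the example paths are assumptions made for illustration.

```yaml
# Illustrative shortening config -- the mapping shape and paths are assumptions.
chosen_dataset:
  PubMedQA: responses/llama3/0-shot/PubMedQA.json  # dataset name -> path to its raw responses
shortened_save_path: shortened/llama3/0-shot/PubMedQA  # optional; defaults to shortened/<model>/<prompting_techniques>/<dataset>
```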
Alternatively, for automatic API key deployment, run:
./autoshort_3.sh
In the shell script, replace the following:
- groq.txt -> the path to a txt file that contains your own API keys
- log_file= -> a log file path
- config.yaml -> the yaml file you wish to use with process_response.py
Finally, get the accuracy of the processed responses:
python3 eval.py -c config.yaml
Specify the directory of processed responses with the eval key in config.yaml, e.g. shortened/llama3/0-shot/ to calculate all 0-shot accuracies of the datasets from llama3.
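In config.yaml this could look like the sketch below. The eval key and the example directory come from above; whether the key takes a single path or a list of paths is an assumption.

```yaml
# Illustrative evaluation config -- single-path form is an assumption.
eval: shortened/llama3/0-shot/  # directory of processed responses to score
```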
Check out https://drive.google.com/drive/folders/1H37kkPxt082KgpfQraPAjrR5PbklFaLQ?usp=drive_link for pre-generated raw and processed responses.
Install Gradio
pip install gradio
Run the UI
python3 BenchUI.py