This is a list tracking good LLM benchmarks. Unfortunately, not all of them can be run against API endpoints. If there is one you'd like to see supported via API endpoints, please open an issue.
Library | ChatCompletions | Completions | Custom Proxy | Comments |
---|---|---|---|---|
EvalPlus | ✅ | ✅ | ✅ | Evaluates code generation |
LM Eval harness | ✅ | | | |
MT-Bench w/ LLM Judge | ✅ | | | Evaluates chat assistants. Asks turn-by-turn conversation questions, then uses another LLM to judge the results |
RAGAS | ✅ | ✅ | | |
HELM | ✅ | | | Link |
FLASK | | | | |
bigcode-project | | | | |
HumanEval | ✅ | ✅ | | |
BigBench | | | | |
Fiddler | ✅ | ✅ | | |
LLM Attacks | | | | |
[GPT Fathom](https://github.com/GPT-Fathom/GPT-Fathom) | ✅ | ✅ | | |
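For context on the ChatCompletions vs. Completions columns above, here is a minimal sketch of the two OpenAI-style request shapes a benchmark harness would send. The base URL and model name are placeholders, not real endpoints:

```python
import json

# Hypothetical OpenAI-compatible endpoint; substitute your provider's URL.
BASE_URL = "https://api.example.com/v1"


def chat_completions_payload(prompt: str, model: str = "my-model") -> dict:
    """Request shape for the /chat/completions endpoint:
    a list of role-tagged messages."""
    return {
        "url": f"{BASE_URL}/chat/completions",
        "body": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        },
    }


def completions_payload(prompt: str, model: str = "my-model") -> dict:
    """Request shape for the legacy /completions endpoint:
    a bare text prompt."""
    return {
        "url": f"{BASE_URL}/completions",
        "body": {"model": model, "prompt": prompt},
    }


print(json.dumps(chat_completions_payload("Hello")["body"], indent=2))
```

A benchmark marked in only one column emits only that request shape, so running it against a provider that exposes the other endpoint requires a proxy that translates between the two.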