
Commit

feedback by Vincent Carey
slobentanzer committed Feb 23, 2024
1 parent 0cdd377 commit 5eaecd1
Showing 1 changed file with 1 addition and 1 deletion: content/20.results.md
@@ -89,7 +89,7 @@ For transparent and reproducible evaluation of LLMs, we implement a benchmarking
The generic Pytest framework [@pytest] allows for the automated evaluation of a matrix of all possible combinations of components.
The results are stored and displayed on our website for simple comparison, and the benchmark is updated upon the release of new models and extensions to the datasets and BioChatter capabilities ([https://biochatter.org/benchmark/](https://biochatter.org/benchmark/)).

-Since the biomedical domain has its own tasks and requirements, we created a bespoke benchmark that allows us to be more precise in the evaluation of components [@biollmbench].
+Since the biomedical domain has its own tasks and requirements [@biollmbench], we created a bespoke benchmark that allows us to be more precise in the evaluation of components.
This is complementary to the existing, general-purpose benchmarks and leaderboards for LLMs [@doi:10.1038/s41586-023-06291-2;@{https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard};@{https://crfm.stanford.edu/helm/lite/latest/}].
Furthermore, to prevent leakage of the benchmark data into the training data of the models, a known issue in the general-purpose benchmarks [@doi:10.48550/arXiv.2310.18018], we implemented an encrypted pipeline that contains the benchmark datasets and is only accessible to the workflow that executes the benchmark (see Methods).
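
To illustrate the matrix evaluation mentioned in the context above: a minimal sketch of how stacked `pytest.mark.parametrize` decorators expand into the cross-product of components. The model names, tasks, and the `query_model()` helper are hypothetical placeholders for illustration, not BioChatter's actual benchmark code.

```python
import pytest

MODELS = ["gpt-3.5-turbo", "mixtral-8x7b"]  # hypothetical model subset
TASKS = [
    ("entity_extraction", "Which genes are mentioned: TP53 binds MDM2."),
    ("query_generation", "Find all interactions of TP53."),
]

def query_model(model: str, prompt: str) -> str:
    """Hypothetical stand-in for the component under test."""
    return f"{model}: {prompt}"

# Stacking parametrize decorators makes pytest generate one test case
# per (model, task) combination, i.e. the full evaluation matrix.
@pytest.mark.parametrize("model", MODELS)
@pytest.mark.parametrize("task_name,prompt", TASKS)
def test_benchmark_matrix(model, task_name, prompt):
    response = query_model(model, prompt)
    assert response  # real scoring compares against expected answers
```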
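And a sketch of one way the encrypted-pipeline idea can be realized, using symmetric Fernet encryption from the `cryptography` package; the file path and `BENCHMARK_KEY` environment variable are assumptions for illustration, and the actual pipeline is described in the Methods.

```python
import os
from cryptography.fernet import Fernet

def load_benchmark_data(path: str = "benchmark/data.enc") -> bytes:
    """Decrypt the benchmark dataset inside the CI workflow only.

    The Fernet key is provided as a workflow secret (hypothetical
    BENCHMARK_KEY variable), so the plaintext questions never appear
    in the public repository and cannot leak into training corpora.
    """
    key = os.environ["BENCHMARK_KEY"].encode()
    with open(path, "rb") as fh:
        return Fernet(key).decrypt(fh.read())
```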

