
Commit

feedback by Vincent Carey
slobentanzer committed Feb 23, 2024
1 parent 0cdd377 commit 5eaecd1
Showing 1 changed file with 1 addition and 1 deletion: content/20.results.md
@@ -89,7 +89,7 @@ For transparent and reproducible evaluation of LLMs, we implement a benchmarking
The generic Pytest framework [@pytest] allows for the automated evaluation of a matrix of all possible combinations of components.
The results are stored and displayed on our website for simple comparison, and the benchmark is updated upon the release of new models and extensions to the datasets and BioChatter capabilities ([https://biochatter.org/benchmark/](https://biochatter.org/benchmark/)).

-Since the biomedical domain has its own tasks and requirements, we created a bespoke benchmark that allows us to be more precise in the evaluation of components [@biollmbench].
+Since the biomedical domain has its own tasks and requirements [@biollmbench], we created a bespoke benchmark that allows us to be more precise in the evaluation of components.
This is complementary to the existing, general-purpose benchmarks and leaderboards for LLMs [@doi:10.1038/s41586-023-06291-2;@{https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard};@{https://crfm.stanford.edu/helm/lite/latest/}].
Furthermore, to prevent leakage of the benchmark data into the training data of the models, a known issue in the general-purpose benchmarks [@doi:10.48550/arXiv.2310.18018], we implemented an encrypted pipeline that contains the benchmark datasets and is only accessible to the workflow that executes the benchmark (see Methods).
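
To illustrate the matrix evaluation mentioned in the context above: a minimal sketch of how stacked `pytest.mark.parametrize` decorators expand into the cross-product of components. The model names, tasks, and the `query_model()` helper are hypothetical placeholders for illustration, not BioChatter's actual benchmark code.

```python
import pytest

MODELS = ["gpt-3.5-turbo", "mixtral-8x7b"]  # hypothetical model subset
TASKS = [
    ("entity_extraction", "Which genes are mentioned: TP53 binds MDM2."),
    ("query_generation", "Find all interactions of TP53."),
]

def query_model(model: str, prompt: str) -> str:
    """Hypothetical stand-in for the component under test."""
    return f"{model}: {prompt}"

# Stacking parametrize decorators makes pytest generate one test case
# per (model, task) combination, i.e. the full evaluation matrix.
@pytest.mark.parametrize("model", MODELS)
@pytest.mark.parametrize("task_name,prompt", TASKS)
def test_benchmark_matrix(model, task_name, prompt):
    response = query_model(model, prompt)
    assert response  # real scoring compares against expected answers
```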
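And a sketch of one way the encrypted-pipeline idea can be realized, using symmetric Fernet encryption from the `cryptography` package; the file path and `BENCHMARK_KEY` environment variable are assumptions for illustration, and the actual pipeline is described in the Methods.

```python
import os
from cryptography.fernet import Fernet

def load_benchmark_data(path: str = "benchmark/data.enc") -> bytes:
    """Decrypt the benchmark dataset inside the CI workflow only.

    The Fernet key is provided as a workflow secret (hypothetical
    BENCHMARK_KEY variable), so the plaintext questions never appear
    in the public repository and cannot leak into training corpora.
    """
    key = os.environ["BENCHMARK_KEY"].encode()
    with open(path, "rb") as fh:
        return Fernet(key).decrypt(fh.read())
```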

