
discussion update
slobentanzer committed Jan 30, 2024
1 parent dc102a9 commit 747ebc7
Showing 1 changed file with 6 additions and 6 deletions.
12 changes: 6 additions & 6 deletions content/30.discussion.md
@@ -8,7 +8,7 @@ To keep the framework effective and sustainable, we focus on reusing existing op
The transparency we emphasise at every step of the framework is essential to a sustainable application of LLMs in biomedical research and beyond [@doi:10.1038/d41586-024-00029-4].

To account for the requirements of biomedical research workflows, we take particular care to guarantee robustness and objective evaluation of LLM behaviour and their performance in interaction with other parts of the framework.
-We achieve this goal by implementing a living benchmarking framework that allows the automated evaluation of LLMs, prompts, and other components.
+We achieve this goal by implementing a living benchmarking framework that allows the automated evaluation of LLMs, prompts, and other components (https://biochatter.org/benchmark-overview/).
Even the most recent and biomedicine-specific benchmarking efforts are small-scale manual approaches that do not consider the full matrix of possible combinations of components, and many benchmarks are performed by accessing web interfaces of LLMs, which obfuscates important parameters, such as model version and temperature [@biollmbench].
As such, a framework is a necessary step towards the objective and reproducible evaluation of LLMs.
We prevent data leakage from the benchmark datasets into the training data of new models by encryption, which is essential for the sustainability of the benchmark as new models are released.
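
The hunk above describes the living benchmark: an automated sweep over the full matrix of models, prompts, and other components, with the reference data kept encrypted so it does not leak into the training corpora of future models. Below is a minimal sketch of that idea, assuming a Pytest-style setup; the file path, environment variable, and helper function are illustrative and not taken from the BioChatter codebase.

```python
# Hypothetical sketch (not the actual BioChatter benchmark code): Pytest
# parametrises the full matrix of models and prompt variants, and the
# reference answers live in an encrypted file that is only decrypted at
# test time, so the dataset is less likely to end up in scraped training data.
import json
import os

import pytest
from cryptography.fernet import Fernet

MODELS = ["gpt-3.5-turbo", "gpt-4"]            # example model identifiers
PROMPT_VARIANTS = ["baseline", "schema_hint"]  # example prompt variants


@pytest.fixture(scope="session")
def benchmark_cases():
    """Decrypt and parse the benchmark dataset; the key never enters the repository."""
    key = os.environ["BENCHMARK_KEY"].encode()  # assumed env variable holding a Fernet key
    with open("benchmark/cases.json.enc", "rb") as handle:  # assumed file location
        return json.loads(Fernet(key).decrypt(handle.read()))


def ask_model(model: str, prompt_variant: str, question: str) -> str:
    """Placeholder for the call into the framework component under test."""
    raise NotImplementedError("wire this up to the LLM framework being benchmarked")


@pytest.mark.parametrize("model", MODELS)
@pytest.mark.parametrize("prompt_variant", PROMPT_VARIANTS)
def test_answer_matrix(model, prompt_variant, benchmark_cases):
    """One test per cell of the model x prompt matrix; CI aggregates the results."""
    for case in benchmark_cases:
        answer = ask_model(model, prompt_variant, case["question"])
        assert case["expected"] in answer
```

Because every model, prompt, and component combination is just another parametrised test case, re-running the suite when a new model is released keeps the benchmark "living" without manual curation.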
@@ -21,15 +21,15 @@ We allow self-hosting of open-source models on any scale, from dedicated hardware

### Limitations

-Depending on generic open-source libraries such as LangChain [@langchain] and Pytest [@pytest] allows us to focus on the biomedical domain but also introduces technical dependencies on these libraries.
-While we support those upstream libraries via pull requests, we depend on their maintainers for future updates.
-In addition, keeping up with these rapid developments is demanding on developer time, which is only sustainable in a community-driven open-source effort.
-
-Most importantly, the current generation of LLMs is not yet ready for unsupervised use in biomedical research.
+The current generation of LLMs is not yet ready for unsupervised use in biomedical research.
While we have taken steps to mitigate the risks of using LLMs, such as independent benchmarks, fact-checking, and knowledge graph querying, we cannot guarantee that the models will not produce harmful outputs.
We see current LLMs, particularly in the scope of the BioCypher ecosystem, as helpful tools to assist human researchers, alleviating menial and repetitive tasks and helping with technical aspects such as query languages.
They are not meant to replace human ingenuity and expertise, but to augment it with their complementary strengths.

+Depending on generic open-source libraries such as LangChain [@langchain] and Pytest [@pytest] allows us to focus on the biomedical domain but also introduces technical dependencies on these libraries.
+While we support those upstream libraries via pull requests, we depend on their maintainers for future updates.
+In addition, keeping up with these rapid developments is demanding on developer time, which is only sustainable in a community-driven open-source effort.
+
### Future directions

Multitask learners that can synthesise, for instance, language, vision, and molecular measurements, are an emerging field of research [@doi:10.48550/arXiv.2306.04529;@doi:10.48550/arXiv.2211.01786;@doi:10.48550/arXiv.2310.09478].
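
On the mitigation strategies listed in the Limitations hunk above (independent benchmarks, fact-checking, knowledge graph querying), the sketch below illustrates what grounding an LLM statement in a knowledge graph can look like. The connection details, node labels, and relationship type are assumptions for illustration, not the BioCypher schema.

```python
# Hypothetical grounding check: before surfacing an LLM-generated claim such
# as "gene X is associated with disease Y", look for a supporting edge in the
# knowledge graph. URI, credentials, labels, and relationship are assumptions.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

CYPHER = (
    "MATCH (g:Gene {symbol: $gene})-[:ASSOCIATED_WITH]->(d:Disease {name: $disease}) "
    "RETURN count(*) AS n"
)


def claim_is_supported(gene: str, disease: str) -> bool:
    """Return True only if the graph contains an edge backing the claim."""
    with driver.session() as session:
        record = session.run(CYPHER, gene=gene, disease=disease).single()
        return record["n"] > 0
```

Only claims backed by an edge in the graph would be surfaced to the user, which keeps the model in the assistive role described in the paragraph above.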
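Regarding the dependency paragraph moved within the Limitations section: a minimal sketch of the kind of upstream LangChain call such a domain framework wraps. Import paths, package names, and the model identifier are examples only; they have changed repeatedly across LangChain releases, which is exactly the maintenance burden the paragraph describes.

```python
# Minimal example of the kind of upstream LangChain API a domain framework
# builds on; package layout and import paths have shifted across releases,
# so pinned versions and upstream pull requests become necessary.
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are an assistant for biomedical knowledge graph queries."),
        ("human", "{question}"),
    ]
)
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)  # example model, deterministic output
chain = prompt | llm  # LangChain Expression Language: pipe the prompt into the model

print(chain.invoke({"question": "Which resources does the knowledge graph integrate?"}).content)
```

Keeping such a thin wrapper stable is what the commit's Limitations text refers to: pinning versions and contributing fixes upstream rather than reimplementing generic functionality.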
