Commit

benchmark hashes
slobentanzer committed Feb 6, 2024
1 parent bb8150e commit 87085fa
Showing 1 changed file with 3 additions and 0 deletions.
3 changes: 3 additions & 0 deletions content/40.methods.md
@@ -52,6 +52,9 @@ For instance, we test the conversion of numbers (which LLMs are notoriously bad
- **sentiment and behaviour**: To assess whether the models exhibit the desired behaviour patterns for each of the personas, we let a second LLM evaluate the responses based on a set of criteria, including professionalism and politeness.

The Pytest framework is implemented at [https://github.com/biocypher/biochatter/blob/main/benchmark](https://github.com/biocypher/biochatter/blob/main/benchmark), and more information and results are available at [https://biocypher.github.io/biochatter/benchmarking](https://biocypher.github.io/biochatter/benchmarking).
The Pytest matrix uses a hash-based system to determine whether a model-dataset combination has already been run.
Briefly, the hash is calculated from the dictionary representation of the test parameters, and the test is skipped if the combination of hash and model name is already present in the database.
Because modifying a test changes its hash, all tests that have been newly added or modified are run automatically, while unchanged ones are skipped.
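
The skip logic described above can be illustrated with a minimal sketch. It is not the BioChatter implementation: the function names `params_hash` and `should_skip`, the choice of SHA-256 over a JSON serialisation, and the in-memory set standing in for the results database are all assumptions made for illustration.

```python
import hashlib
import json


def params_hash(test_params: dict) -> str:
    """Return a stable hex digest of a test-parameter dictionary (hypothetical helper)."""
    # Sorting keys makes the serialisation independent of insertion order.
    serialised = json.dumps(test_params, sort_keys=True)
    return hashlib.sha256(serialised.encode("utf-8")).hexdigest()


def should_skip(model_name: str, test_params: dict, completed: set) -> bool:
    """Return True if this (model name, hash) pair has already been recorded."""
    return (model_name, params_hash(test_params)) in completed


# Hypothetical in-memory stand-in for the results database.
completed_runs = set()

case = {"dataset": "example", "prompt": "Convert 5 mg to g."}
completed_runs.add(("model-a", params_hash(case)))

# Same model, same parameters: already recorded.
already_run = should_skip("model-a", case, completed_runs)
# Same model, modified test: the hash changes, so the test is due again.
modified = should_skip("model-a", {**case, "prompt": "Convert 5 g to mg."}, completed_runs)
# Different model, same test: the pair is new.
other_model = should_skip("model-b", case, completed_runs)
```

In a Pytest suite, a check like `should_skip` could guard a call to `pytest.skip(...)` at the start of each parametrised test, so that only new or changed combinations execute.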

To prevent leakage of benchmarking data (and subsequent contamination of future LLMs), we implement an encryption routine on the benchmark datasets.
The encryption is performed using a hybrid encryption scheme, where the data are encrypted with a symmetric key, which is in turn encrypted with an asymmetric key.
