benchmark hashes

biocypher · Feb 6, 2024 · 87085fa
1 parent bb8150e
Showing 1 changed file with 3 additions and 0 deletions.
diff --git a/content/40.methods.md b/content/40.methods.md
@@ -52,6 +52,9 @@ For instance, we test the conversion of numbers (which LLMs are notoriously bad
 - **sentiment and behaviour**: To assess whether the models exhibit the desired behaviour patterns for each of the personas, we let a second LLM evaluate the responses based on a set of criteria, including professionalism and politeness.
 
 The Pytest framework is implemented at [https://github.com/biocypher/biochatter/blob/main/benchmark](https://github.com/biocypher/biochatter/blob/main/benchmark), and more information and results are available at [https://biocypher.github.io/biochatter/benchmarking](https://biocypher.github.io/biochatter/benchmarking).
+The Pytest matrix uses a hash-based system to determine whether a given model-dataset combination has already been run.
+Briefly, the hash is calculated from the dictionary representation of the test parameters, and the test is skipped if the combination of hash and model name is already present in the database.
+Because any change to a test's parameters changes its hash, all newly added or modified tests are run automatically, while unchanged combinations are skipped.
 
 To prevent leakage of benchmarking data (and subsequent contamination of future LLMs), we implement an encryption routine on the benchmark datasets.
 The encryption is performed using a hybrid encryption scheme, where the data are encrypted with a symmetric key, which is in turn encrypted with an asymmetric key.
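
The hash-based skipping described in the added lines can be sketched as follows. This is a minimal illustration and not the BioChatter implementation: the function names (`params_hash`, `should_skip`, `record_run`), the use of SHA-256 over a key-sorted JSON serialisation, and the in-memory set standing in for the results database are all assumptions made for the example.

```python
import hashlib
import json


def params_hash(params: dict) -> str:
    """Return a stable hex digest of a test-parameter dictionary.

    Assumption: the dictionary is JSON-serialisable. Sorting the keys makes
    the digest independent of key insertion order.
    """
    encoded = json.dumps(params, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def should_skip(model_name: str, params: dict, seen: set) -> bool:
    """True if this (model name, parameter hash) pair was already recorded."""
    return (model_name, params_hash(params)) in seen


def record_run(model_name: str, params: dict, seen: set) -> None:
    """Store the pair so that later invocations skip this combination."""
    seen.add((model_name, params_hash(params)))


# Hypothetical test case and model name, for illustration only.
seen_runs = set()  # stands in for the results database
case = {"dataset": "example", "prompt": "Convert 5 mg to g"}

if not should_skip("model-a", case, seen_runs):
    record_run("model-a", case, seen_runs)  # first run: executed and recorded

# The same combination is now skipped ...
already_run = should_skip("model-a", case, seen_runs)

# ... but editing the test case changes the hash, so it is run again.
modified = dict(case, prompt="Convert 5 g to mg")
rerun_needed = not should_skip("model-a", modified, seen_runs)
```

Keying the lookup on both the model name and the hash means that adding a new model triggers a full run for that model only, while editing a single test case triggers a rerun of that case for every model.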