Merge branch 'eval-missing-splits' into feature/missing-splits-evaluation
isaac-chung authored Nov 26, 2024
2 parents 3987e3e + cde720e commit c5211c6
Showing 609 changed files with 66,822 additions and 19,536 deletions.
4 changes: 2 additions & 2 deletions .github/pull_request_template.md
@@ -34,6 +34,6 @@ see also https://github.com/embeddings-benchmark/mteb/blob/main/docs/reproducibl

- [ ] I have filled out the ModelMeta object to the extent possible
- [ ] I have ensured that my model can be loaded using
- [ ] `mteb.get_model(model_name, revision_id)` and
- [ ] `mteb.get_model_meta(model_name, revision_id)`
- [ ] `mteb.get_model(model_name, revision)` and
- [ ] `mteb.get_model_meta(model_name, revision)`
- [ ] I have tested the implementation works on a representative set of tasks.
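For context, the renamed `revision` argument in this checklist corresponds to calls of roughly the following shape; the model name and revision below are placeholders chosen purely for illustration:

```python
import mteb

# Placeholder identifiers for illustration only; use your own model and a pinned revision.
model_name = "sentence-transformers/all-MiniLM-L6-v2"
revision = "main"  # typically a specific commit hash

meta = mteb.get_model_meta(model_name, revision)  # metadata lookup, no weights loaded
model = mteb.get_model(model_name, revision)      # loads the model implementation/weights
```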
2 changes: 1 addition & 1 deletion .github/workflows/lint.yml
@@ -15,7 +15,7 @@ jobs:

- uses: actions/setup-python@v4
with:
python-version: "3.8"
python-version: "3.9"
cache: "pip"

- name: Install dependencies
4 changes: 2 additions & 2 deletions .github/workflows/mmteb.yml
@@ -16,7 +16,7 @@ jobs:

- uses: actions/setup-python@v4
with:
python-version: "3.8"
python-version: "3.9"
cache: "pip"

- name: Install dependencies
@@ -38,7 +38,7 @@ jobs:

- uses: actions/setup-python@v4
with:
python-version: "3.8"
python-version: "3.9"
cache: "pip"

- name: Install dependencies
4 changes: 2 additions & 2 deletions .github/workflows/test.yml
@@ -16,11 +16,11 @@ jobs:
fail-fast: false
matrix:
os: [ubuntu-latest] #, macos-latest, windows-latest]
python-version: ["3.8", "3.9", "3.10"]
python-version: ["3.9", "3.10", "3.11", "3.12"]
include:
# Add Windows with Python 3.8 only to avoid tests taking too long
- os: windows-latest
python-version: "3.8"
python-version: "3.9"

steps:
- uses: actions/checkout@v3
3 changes: 2 additions & 1 deletion .gitignore
@@ -143,4 +143,5 @@ sb.ipynb
tests/create_meta/model_card.md

# removed results from mteb repo they are now available at: https://github.com/embeddings-benchmark/results
results/
results/
uv.lock
121 changes: 76 additions & 45 deletions README.md
@@ -18,7 +18,7 @@
<h4 align="center">
<p>
<a href="#installation">Installation</a> |
<a href="#usage">Usage</a> |
<a href="#usage-documentation">Usage</a> |
<a href="https://huggingface.co/spaces/mteb/leaderboard">Leaderboard</a> |
<a href="#documentation">Documentation</a> |
<a href="#citing">Citing</a>
@@ -38,7 +38,7 @@ pip install mteb

## Example Usage

* Using a python script:
* Using a Python script:

```python
import mteb
@@ -55,6 +55,37 @@ evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder=f"results/{model_name}")
```

<details>
<summary> Running a SentenceTransformer model with prompts </summary>

Prompts can be passed to the SentenceTransformer model using the `prompts` parameter. The following code shows how to use prompts with SentenceTransformer:

```python
import mteb
from sentence_transformers import SentenceTransformer

# `tasks` is assumed to be defined as in the example above
model = SentenceTransformer("average_word_embeddings_komninos", prompts={"query": "Query:", "passage": "Passage:"})
evaluation = mteb.MTEB(tasks=tasks)
```

In `prompts`, the key can be any of the following (see the sketch after this list):
1. Prompt types (`passage`, `query`) - they will be used in reranking and retrieval tasks
2. Task type - these prompts will be used in all tasks of the given type
1. `BitextMining`
2. `Classification`
3. `MultilabelClassification`
4. `Clustering`
5. `PairClassification`
6. `Reranking`
7. `Retrieval`
8. `STS`
9. `Summarization`
10. `InstructionRetrieval`
3. Pair of task type and prompt type like `Retrieval-query` - these prompts will be used in all retrieval tasks
4. Task name - these prompts will be used in the specific task
5. Pair of task name and prompt type like `NFCorpus-query` - these prompts will be used in the specific task
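For illustration, a `prompts` dictionary mixing these key styles might look like the following sketch; the prompt strings and task names are placeholders, not recommendations:

```python
from sentence_transformers import SentenceTransformer

# Illustrative only: one key per style described above.
prompts = {
    "query": "Query: ",                   # prompt type
    "Classification": "Classify: ",       # task type
    "Retrieval-query": "Search query: ",  # task type + prompt type
    "NFCorpus": "Medical document: ",     # task name
    "NFCorpus-query": "Medical query: ",  # task name + prompt type
}
model = SentenceTransformer("average_word_embeddings_komninos", prompts=prompts)
```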
</details>

* Using CLI

```bash
@@ -133,18 +164,18 @@ For instance to select the 56 English datasets that form the "Overall MTEB Engli

```python
import mteb
benchmark = mteb.get_benchmark("MTEB(eng)")
benchmark = mteb.get_benchmark("MTEB(eng, classic)")
evaluation = mteb.MTEB(tasks=benchmark)
```

The benchmark specified not only a list of tasks, but also what splits and language to run on. To get an overview of all available benhcmarks simply run:
The benchmark specifies not only a list of tasks, but also which splits and languages to run on. To get an overview of all available benchmarks, simply run:

```python
import mteb
benchmarks = mteb.get_benchmarks()
```

Generally we use the naming scheme for benchmarks `MTEB(*)`, where the "*" denotes the target of the benchmark. In case of a language we use the three letter language code. For large groups of language we use the group notation, e.g. `MTEB(Scandinavian)` for Scandinavian languages. External benchmarks implemented in MTEB like `CoIR` use their original name. When using a benchmark from MTEB please cite `mteb` along with the citations of the benchmark which you can access using:
Generally we use the naming scheme for benchmarks `MTEB(*)`, where the "*" denotes the target of the benchmark. In the case of a language, we use the three-letter language code. For large groups of languages, we use the group notation, e.g., `MTEB(Scandinavian)` for Scandinavian languages. External benchmarks implemented in MTEB like `CoIR` use their original name. When using a benchmark from MTEB please cite `mteb` along with the citations of the benchmark which you can access using:

```python
benchmark.citation
@@ -161,7 +192,7 @@ benchmark.citation
To pass in arguments to the model's `encode` function, you can use the encode keyword arguments (`encode_kwargs`):

```python
evaluation.run(model, encode_kwargs={"batch_size": 32}
evaluation.run(model, encode_kwargs={"batch_size": 32})
```
</details>

@@ -189,55 +220,35 @@ Note that the public leaderboard uses the test splits for all datasets except MS
Models should implement the following interface: an `encode` function that takes a list of sentences as input and returns a list of embeddings (embeddings can be `np.array`, `torch.tensor`, etc.). For inspiration, you can look at the [mteb/mtebscripts repo](https://github.com/embeddings-benchmark/mtebscripts) used for running diverse models via SLURM scripts for the paper.

```python
class MyModel():
from mteb.encoder_interface import PromptType

class CustomModel:
def encode(
self, sentences: list[str], **kwargs: Any
) -> torch.Tensor | np.ndarray:
self,
sentences: list[str],
task_name: str,
prompt_type: PromptType | None = None,
**kwargs,
) -> np.ndarray:
"""Encodes the given sentences using the encoder.
Args:
sentences: The sentences to encode.
task_name: The name of the task.
prompt_type: The prompt type to use.
**kwargs: Additional arguments to pass to the encoder.
Returns:
The encoded sentences.
"""
pass

model = MyModel()
model = CustomModel()
tasks = mteb.get_tasks(tasks=["Banking77Classification"])
evaluation = MTEB(tasks=tasks)
evaluation.run(model)
```

If you'd like to use different encoding functions for query and corpus when evaluating on Retrieval or Reranking tasks, you can add separate methods for `encode_queries` and `encode_corpus`. If these methods exist, they will be automatically used for those tasks. You can refer to the `DRESModel` at `mteb/evaluation/evaluators/RetrievalEvaluator.py` for an example of these functions.

```python
class MyModel():
def encode_queries(self, queries: list[str], **kwargs) -> list[np.ndarray] | list[torch.Tensor]:
"""
Returns a list of embeddings for the given sentences.
Args:
queries: List of sentences to encode
Returns:
List of embeddings for the given sentences
"""
pass

def encode_corpus(self, corpus: list[str] | list[dict[str, str]], **kwargs) -> list[np.ndarray] | list[torch.Tensor]:
"""
Returns a list of embeddings for the given sentences.
Args:
corpus: List of sentences to encode
or list of dictionaries with keys "title" and "text"
Returns:
List of embeddings for the given sentences
"""
pass
```

</details>

<details>
@@ -319,7 +330,7 @@ from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

tasks = mteb.get_tasks( tasks=["NFCorpus"], languages=["eng"])
tasks = mteb.get_tasks(tasks=["NFCorpus"], languages=["eng"])

evaluation = MTEB(tasks=tasks)
evaluation.run(
@@ -331,7 +342,7 @@ evaluation.run(
```

CLI:
```
```bash
mteb run -t NFCorpus -m all-MiniLM-L6-v2 --output_folder results --save_predictions
```

@@ -340,11 +351,11 @@ mteb run -t NFCorpus -m all-MiniLM-L6-v2 --output_folder results --save_predicti
<details>
<summary> Fetching results from the results repository </summary>

### Fetching result from the results repository
### Fetching results from the results repository

Multiple models have already been run on tasks avaiable within MTEB. These results are available results [repository](https://github.com/embeddings-benchmark/results).
Multiple models have already been run on tasks available within MTEB. These results are available in the results [repository](https://github.com/embeddings-benchmark/results).

To make the results more easily accessible, we have designed custom functionality for retrieving from the repository. For instance, you are selecting the best model for your French and English retrieval task on legal documents you could fetch the relevant tasks and create a dataframe of the results using the following code:
To make the results more easily accessible, we have designed custom functionality for retrieving from the repository. For instance, if you are selecting the best model for your French and English retrieval task on legal documents you could fetch the relevant tasks and create a dataframe of the results using the following code:

```python
import mteb
@@ -369,6 +380,26 @@ df = results_to_dataframe(results)

</details>
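Because the diff viewer elides most of the snippet above, here is a rough sketch of such a query, assuming `mteb.load_results` and the `results_to_dataframe` helper visible at the end of the elided block; the filters, model names, and the helper's module path are illustrative assumptions:

```python
import mteb
from mteb.task_selection import results_to_dataframe  # module path assumed

# Illustrative filters: English/French legal retrieval tasks.
tasks = mteb.get_tasks(task_types=["Retrieval"], languages=["eng", "fra"], domains=["Legal"])

# Illustrative model selection by metadata.
model_names = ["intfloat/multilingual-e5-small", "intfloat/multilingual-e5-base"]
models = [mteb.get_model_meta(name) for name in model_names]

results = mteb.load_results(models=models, tasks=tasks)
df = results_to_dataframe(results)
```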

<details>
<summary> Caching Embeddings To Re-Use Them </summary>


### Caching Embeddings To Re-Use Them

There are times you may want to cache the embeddings so you can re-use them. This may be true if you have multiple query sets for the same corpus (e.g. Wikipedia) or are doing some optimization over the queries (e.g. prompting, other experiments). You can set up a cache by using a simple wrapper, which will save the cache per task in the `cache_embeddings/{task_name}` folder:

```python
# define your task and model above as normal
...
# wrap the model with the cache wrapper
from mteb.models.cache_wrapper import CachedEmbeddingWrapper
model_with_cached_emb = CachedEmbeddingWrapper(model, cache_path='path_to_cache_dir')
# run as normal
evaluation.run(model_with_cached_emb, ...)
```

</details>

<br />


7 changes: 3 additions & 4 deletions docs/adding_a_dataset.md
@@ -37,15 +37,14 @@ class SciDocsReranking(AbsTaskReranking):
dataset={
"path": "mteb/scidocs-reranking",
"revision": "d3c5e1fc0b855ab6097bf1cda04dd73947d7caab",
}
},
date=("2000-01-01", "2020-12-31"), # best guess
domains=["Academic", "Non-fiction", "Domains"],
task_subtypes=["Scientific Reranking"],
license="cc-by-4.0",
annotations_creators="derived",
dialect=[],
sample_creation="found",
descriptive_stats={"n_samples": {"test": 19599}, "avg_character_length": {"test": 69.0}},
bibtex_citation="""
@inproceedings{cohan-etal-2020-specter,
title = "{SPECTER}: Document-level Representation Learning using Citation-informed Transformers",
@@ -73,7 +72,7 @@ class SciDocsReranking(AbsTaskReranking):

# testing the task with a model:
model = SentenceTransformer("average_word_embeddings_komninos")
evaluation = MTEB(tasks=[MindSmallReranking()])
evaluation = MTEB(tasks=[SciDocsReranking()])
evaluation.run(model)
```

@@ -109,7 +108,7 @@ class VGClustering(AbsTaskClustering):
dialect=[],
text_creation="found",
bibtex_citation= ... # removed for brevity
)
)

def dataset_transform(self):
splits = self.description["eval_splits"]
37 changes: 28 additions & 9 deletions docs/adding_a_model.md
@@ -14,8 +14,7 @@ model = mteb.get_model("sentence-transformers/paraphrase-multilingual-MiniLM-L12

tasks = mteb.get_tasks(...) # get specific tasks
# or
from mteb.benchmarks import MTEB_MAIN_EN
tasks = MTEB_MAIN_EN # or use a specific benchmark
tasks = mteb.get_benchmark("MTEB(eng, classic)") # or use a specific benchmark

evaluation = mteb.MTEB(tasks=tasks)
evaluation.run(model, output_folder="results")
@@ -29,26 +28,46 @@ mteb run -m {model_name} -t {task_names}

These will save the results in a folder called `results/{model_name}/{model_revision}`.

1. **Format the results using the CLI:**
2. **Push Results to the Leaderboard**

To add results to the public leaderboard, you can push your results to the [results repository](https://github.com/embeddings-benchmark/results); they will then appear on the leaderboard after a day.


3. (Optional) **Add the results to the model card:**

`mteb` implements a CLI for adding results to the model card:

```bash
mteb create_meta --results_folder results/{model_name}/{model_revision} --output_path model_card.md
```

If readme of model exists:
To add the content to the public model, simply copy the content of the `model_card.md` file to the top of a `README.md` file of your model on the Hub. See [here](https://huggingface.co/Muennighoff/SGPT-5.8B-weightedmean-msmarco-specb-bitfit/blob/main/README.md) for an example.

If the readme already exists:

```bash
mteb create_meta --results_folder results/{model_name}/{model_revision} --output_path model_card.md --from_existing your_existing_readme.md
```

2. **Add the frontmatter to model repository:**

Copy the content of the `model_card.md` file to the top of a `README.md` file of your model on the Hub. See [here](https://huggingface.co/Muennighoff/SGPT-5.8B-weightedmean-msmarco-specb-bitfit/blob/main/README.md) for an example.
Note that if you run the model on many tasks, this can lead to an excessively large readme frontmatter.

3. **Wait for a refresh the leaderboard:**
4. **Wait for a refresh of the leaderboard:**

The leaderboard [automatically refreshes daily](https://github.com/embeddings-benchmark/leaderboard/commits/main/) so once submitted you only need to wait for the automatic refresh. You can find the workflows for the leaderboard refresh [here](https://github.com/embeddings-benchmark/leaderboard/tree/main/.github/workflows). If you experience issues with the leaderboard please create an [issue](https://github.com/embeddings-benchmark/mteb/issues).

**Notes:**
- We remove models with scores that cannot be reproduced, so please ensure that your model is accessible and scores can be reproduced.
- An alternative way of submitting to the leaderboard is by opening a PR with your results [here](https://github.com/embeddings-benchmark/results) & checking that they are displayed correctly by [locally running the leaderboard](https://github.com/embeddings-benchmark/leaderboard?tab=readme-ov-file#developer-setup)

##### Using Prompts with Sentence Transformers

If your model uses Sentence Transformers and requires different prompts for encoding the queries and corpus, you can take advantage of the `prompts` [parameter](https://sbert.net/docs/package_reference/sentence_transformer/SentenceTransformer.html#sentence_transformers.SentenceTransformer).

Internally, `mteb` uses the prompt named `query` for encoding the queries and `passage` as the prompt name for encoding the corpus. This is aligned with the default names used by Sentence Transformers.

###### Adding the prompts in the model configuration (Preferred)

You can directly add the prompts when saving and uploading your model to the Hub. For an example, refer to this [configuration file](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v1.5/blob/3b5a16eaf17e47bd997da998988dce5877a57092/config_sentence_transformers.json).

###### Instantiating the Model with Prompts

If you are unable to directly add the prompts in the model configuration, you can instantiate the model using the `sentence_transformers_loader` and pass `prompts` as an argument. For more details, see the `mteb/models/bge_models.py` file.
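As a minimal illustration of the `query`/`passage` prompt names described above, the prompts can also be set directly when constructing a Sentence Transformers model; this is a sketch with placeholder prompt strings, not the `sentence_transformers_loader` route itself (see `mteb/models/bge_models.py` for that):

```python
from sentence_transformers import SentenceTransformer

# mteb looks up the "query" and "passage" prompt names; the strings are placeholders.
model = SentenceTransformer(
    "intfloat/multilingual-e5-small",
    prompts={"query": "query: ", "passage": "passage: "},
)
```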