Merge branch 'eval-missing-splits' into feature/missing-splits-evaluation
isaac-chung authored Nov 26, 2024
2 parents 3987e3e + cde720e commit c5211c6
Showing 609 changed files with 66,822 additions and 19,536 deletions.
4 changes: 2 additions & 2 deletions .github/pull_request_template.md
@@ -34,6 +34,6 @@ see also https://github.com/embeddings-benchmark/mteb/blob/main/docs/reproducibl

- [ ] I have filled out the ModelMeta object to the extent possible
- [ ] I have ensured that my model can be loaded using
- [ ] `mteb.get_model(model_name, revision_id)` and
- [ ] `mteb.get_model_meta(model_name, revision_id)`
- [ ] `mteb.get_model(model_name, revision)` and
- [ ] `mteb.get_model_meta(model_name, revision)`
- [ ] I have tested the implementation works on a representative set of tasks.
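For context, the renamed `revision` argument in this checklist corresponds to calls of roughly the following shape; the model name and revision below are placeholders chosen purely for illustration:

```python
import mteb

# Placeholder identifiers for illustration only; use your own model and a pinned revision.
model_name = "sentence-transformers/all-MiniLM-L6-v2"
revision = "main"  # typically a specific commit hash

meta = mteb.get_model_meta(model_name, revision)  # metadata lookup, no weights loaded
model = mteb.get_model(model_name, revision)      # loads the model implementation/weights
```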
2 changes: 1 addition & 1 deletion .github/workflows/lint.yml
@@ -15,7 +15,7 @@ jobs:

- uses: actions/setup-python@v4
with:
python-version: "3.8"
python-version: "3.9"
cache: "pip"

- name: Install dependencies
4 changes: 2 additions & 2 deletions .github/workflows/mmteb.yml
@@ -16,7 +16,7 @@ jobs:

- uses: actions/setup-python@v4
with:
python-version: "3.8"
python-version: "3.9"
cache: "pip"

- name: Install dependencies
@@ -38,7 +38,7 @@ jobs:

- uses: actions/setup-python@v4
with:
python-version: "3.8"
python-version: "3.9"
cache: "pip"

- name: Install dependencies
4 changes: 2 additions & 2 deletions .github/workflows/test.yml
@@ -16,11 +16,11 @@ jobs:
fail-fast: false
matrix:
os: [ubuntu-latest] #, macos-latest, windows-latest]
python-version: ["3.8", "3.9", "3.10"]
python-version: ["3.9", "3.10", "3.11", "3.12"]
include:
# Add Windows with Python 3.8 only to avoid tests taking too long
- os: windows-latest
python-version: "3.8"
python-version: "3.9"

steps:
- uses: actions/checkout@v3
3 changes: 2 additions & 1 deletion .gitignore
@@ -143,4 +143,5 @@ sb.ipynb
tests/create_meta/model_card.md

# removed results from mteb repo they are now available at: https://github.com/embeddings-benchmark/results
results/
results/
uv.lock
121 changes: 76 additions & 45 deletions README.md
@@ -18,7 +18,7 @@
<h4 align="center">
<p>
<a href="#installation">Installation</a> |
<a href="#usage">Usage</a> |
<a href="#usage-documentation">Usage</a> |
<a href="https://huggingface.co/spaces/mteb/leaderboard">Leaderboard</a> |
<a href="#documentation">Documentation</a> |
<a href="#citing">Citing</a>
@@ -38,7 +38,7 @@ pip install mteb

## Example Usage

* Using a python script:
* Using a Python script:

```python
import mteb
@@ -55,6 +55,37 @@ evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder=f"results/{model_name}")
```

<details>
<summary> Running a SentenceTransformer model with prompts </summary>

Prompts can be passed to the SentenceTransformer model using the `prompts` parameter. The following code shows how to use prompts with SentenceTransformer:

```python
import mteb
from sentence_transformers import SentenceTransformer

# `tasks` is assumed to be defined as in the example above
model = SentenceTransformer("average_word_embeddings_komninos", prompts={"query": "Query:", "passage": "Passage:"})
evaluation = mteb.MTEB(tasks=tasks)
```

In `prompts`, the key can be any of the following (see the sketch after this list):
1. Prompt types (`passage`, `query`) - they will be used in reranking and retrieval tasks
2. Task type - these prompts will be used in all tasks of the given type
1. `BitextMining`
2. `Classification`
3. `MultilabelClassification`
4. `Clustering`
5. `PairClassification`
6. `Reranking`
7. `Retrieval`
8. `STS`
9. `Summarization`
10. `InstructionRetrieval`
3. Pair of task type and prompt type like `Retrieval-query` - these prompts will be used in all retrieval tasks
4. Task name - these prompts will be used in the specific task
5. Pair of task name and prompt type like `NFCorpus-query` - these prompts will be used in the specific task
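For illustration, a `prompts` dictionary mixing these key styles might look like the following sketch; the prompt strings and task names are placeholders, not recommendations:

```python
from sentence_transformers import SentenceTransformer

# Illustrative only: one key per style described above.
prompts = {
    "query": "Query: ",                   # prompt type
    "Classification": "Classify: ",       # task type
    "Retrieval-query": "Search query: ",  # task type + prompt type
    "NFCorpus": "Medical document: ",     # task name
    "NFCorpus-query": "Medical query: ",  # task name + prompt type
}
model = SentenceTransformer("average_word_embeddings_komninos", prompts=prompts)
```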
</details>

* Using CLI

```bash
@@ -133,18 +164,18 @@ For instance to select the 56 English datasets that form the "Overall MTEB Engli

```python
import mteb
benchmark = mteb.get_benchmark("MTEB(eng)")
benchmark = mteb.get_benchmark("MTEB(eng, classic)")
evaluation = mteb.MTEB(tasks=benchmark)
```

The benchmark specified not only a list of tasks, but also what splits and language to run on. To get an overview of all available benhcmarks simply run:
The benchmark specifies not only a list of tasks, but also which splits and languages to run on. To get an overview of all available benchmarks, simply run:

```python
import mteb
benchmarks = mteb.get_benchmarks()
```

Generally we use the naming scheme for benchmarks `MTEB(*)`, where the "*" denotes the target of the benchmark. In case of a language we use the three letter language code. For large groups of language we use the group notation, e.g. `MTEB(Scandinavian)` for Scandinavian languages. External benchmarks implemented in MTEB like `CoIR` use their original name. When using a benchmark from MTEB please cite `mteb` along with the citations of the benchmark which you can access using:
Generally we use the naming scheme for benchmarks `MTEB(*)`, where the "*" denotes the target of the benchmark. In the case of a language, we use the three-letter language code. For large groups of languages, we use the group notation, e.g., `MTEB(Scandinavian)` for Scandinavian languages. External benchmarks implemented in MTEB like `CoIR` use their original name. When using a benchmark from MTEB please cite `mteb` along with the citations of the benchmark which you can access using:

```python
benchmark.citation
@@ -161,7 +192,7 @@ benchmark.citation
To pass in arguments to the model's `encode` function, you can use the encode keyword arguments (`encode_kwargs`):

```python
evaluation.run(model, encode_kwargs={"batch_size": 32}
evaluation.run(model, encode_kwargs={"batch_size": 32})
```
</details>

@@ -189,55 +220,35 @@ Note that the public leaderboard uses the test splits for all datasets except MS
Models should implement the following interface: an `encode` function that takes a list of sentences as input and returns a list of embeddings (embeddings can be `np.array`, `torch.tensor`, etc.). For inspiration, you can look at the [mteb/mtebscripts repo](https://github.com/embeddings-benchmark/mtebscripts) used for running diverse models via SLURM scripts for the paper.

```python
class MyModel():
from mteb.encoder_interface import PromptType

class CustomModel:
def encode(
self, sentences: list[str], **kwargs: Any
) -> torch.Tensor | np.ndarray:
self,
sentences: list[str],
task_name: str,
prompt_type: PromptType | None = None,
**kwargs,
) -> np.ndarray:
"""Encodes the given sentences using the encoder.
Args:
sentences: The sentences to encode.
task_name: The name of the task.
prompt_type: The prompt type to use.
**kwargs: Additional arguments to pass to the encoder.
Returns:
The encoded sentences.
"""
pass

model = MyModel()
model = CustomModel()
tasks = mteb.get_tasks(tasks=["Banking77Classification"])
evaluation = MTEB(tasks=tasks)
evaluation.run(model)
```

If you'd like to use different encoding functions for query and corpus when evaluating on Retrieval or Reranking tasks, you can add separate methods for `encode_queries` and `encode_corpus`. If these methods exist, they will be automatically used for those tasks. You can refer to the `DRESModel` at `mteb/evaluation/evaluators/RetrievalEvaluator.py` for an example of these functions.

```python
class MyModel():
def encode_queries(self, queries: list[str], **kwargs) -> list[np.ndarray] | list[torch.Tensor]:
"""
Returns a list of embeddings for the given sentences.
Args:
queries: List of sentences to encode
Returns:
List of embeddings for the given sentences
"""
pass

def encode_corpus(self, corpus: list[str] | list[dict[str, str]], **kwargs) -> list[np.ndarray] | list[torch.Tensor]:
"""
Returns a list of embeddings for the given sentences.
Args:
corpus: List of sentences to encode
or list of dictionaries with keys "title" and "text"
Returns:
List of embeddings for the given sentences
"""
pass
```

</details>

<details>
@@ -319,7 +330,7 @@ from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

tasks = mteb.get_tasks( tasks=["NFCorpus"], languages=["eng"])
tasks = mteb.get_tasks(tasks=["NFCorpus"], languages=["eng"])

evaluation = MTEB(tasks=tasks)
evaluation.run(
@@ -331,7 +342,7 @@ evaluation.run(
```

CLI:
```
```bash
mteb run -t NFCorpus -m all-MiniLM-L6-v2 --output_folder results --save_predictions
```

@@ -340,11 +351,11 @@ mteb run -t NFCorpus -m all-MiniLM-L6-v2 --output_folder results --save_predicti
<details>
<summary> Fetching results from the results repository </summary>

### Fetching result from the results repository
### Fetching results from the results repository

Multiple models have already been run on tasks avaiable within MTEB. These results are available results [repository](https://github.com/embeddings-benchmark/results).
Multiple models have already been run on tasks available within MTEB. These results are available in the results [repository](https://github.com/embeddings-benchmark/results).

To make the results more easily accessible, we have designed custom functionality for retrieving from the repository. For instance, you are selecting the best model for your French and English retrieval task on legal documents you could fetch the relevant tasks and create a dataframe of the results using the following code:
To make the results more easily accessible, we have designed custom functionality for retrieving from the repository. For instance, if you are selecting the best model for your French and English retrieval task on legal documents you could fetch the relevant tasks and create a dataframe of the results using the following code:

```python
import mteb
@@ -369,6 +380,26 @@ df = results_to_dataframe(results)

</details>
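Because the diff viewer elides most of the snippet above, here is a rough sketch of such a query, assuming `mteb.load_results` and the `results_to_dataframe` helper visible at the end of the elided block; the filters, model names, and the helper's module path are illustrative assumptions:

```python
import mteb
from mteb.task_selection import results_to_dataframe  # module path assumed

# Illustrative filters: English/French legal retrieval tasks.
tasks = mteb.get_tasks(task_types=["Retrieval"], languages=["eng", "fra"], domains=["Legal"])

# Illustrative model selection by metadata.
model_names = ["intfloat/multilingual-e5-small", "intfloat/multilingual-e5-base"]
models = [mteb.get_model_meta(name) for name in model_names]

results = mteb.load_results(models=models, tasks=tasks)
df = results_to_dataframe(results)
```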

<details>
<summary> Caching Embeddings To Re-Use Them </summary>


### Caching Embeddings To Re-Use Them

There are times you may want to cache the embeddings so you can re-use them. This may be true if you have multiple query sets for the same corpus (e.g. Wikipedia) or are doing some optimization over the queries (e.g. prompting, other experiments). You can set up a cache by using a simple wrapper, which will save the cache per task in the `cache_embeddings/{task_name}` folder:

```python
# define your task and model above as normal
...
# wrap the model with the cache wrapper
from mteb.models.cache_wrapper import CachedEmbeddingWrapper
model_with_cached_emb = CachedEmbeddingWrapper(model, cache_path='path_to_cache_dir')
# run as normal
evaluation.run(model_with_cached_emb, ...)
```

</details>

<br />


7 changes: 3 additions & 4 deletions docs/adding_a_dataset.md
@@ -37,15 +37,14 @@ class SciDocsReranking(AbsTaskReranking):
dataset={
"path": "mteb/scidocs-reranking",
"revision": "d3c5e1fc0b855ab6097bf1cda04dd73947d7caab",
}
},
date=("2000-01-01", "2020-12-31"), # best guess
domains=["Academic", "Non-fiction", "Domains"],
task_subtypes=["Scientific Reranking"],
license="cc-by-4.0",
annotations_creators="derived",
dialect=[],
sample_creation="found",
descriptive_stats={"n_samples": {"test": 19599}, "avg_character_length": {"test": 69.0}},
bibtex_citation="""
@inproceedings{cohan-etal-2020-specter,
title = "{SPECTER}: Document-level Representation Learning using Citation-informed Transformers",
@@ -73,7 +72,7 @@ class SciDocsReranking(AbsTaskReranking):

# testing the task with a model:
model = SentenceTransformer("average_word_embeddings_komninos")
evaluation = MTEB(tasks=[MindSmallReranking()])
evaluation = MTEB(tasks=[SciDocsReranking()])
evaluation.run(model)
```

@@ -109,7 +108,7 @@ class VGClustering(AbsTaskClustering):
dialect=[],
text_creation="found",
bibtex_citation= ... # removed for brevity
)
)

def dataset_transform(self):
splits = self.description["eval_splits"]
37 changes: 28 additions & 9 deletions docs/adding_a_model.md
@@ -14,8 +14,7 @@ model = mteb.get_model("sentence-transformers/paraphrase-multilingual-MiniLM-L12

tasks = mteb.get_tasks(...) # get specific tasks
# or
from mteb.benchmarks import MTEB_MAIN_EN
tasks = MTEB_MAIN_EN # or use a specific benchmark
tasks = mteb.get_benchmark("MTEB(eng, classic)") # or use a specific benchmark

evaluation = mteb.MTEB(tasks=tasks)
evaluation.run(model, output_folder="results")
@@ -29,26 +28,46 @@ mteb run -m {model_name} -t {task_names}

These will save the results in a folder called `results/{model_name}/{model_revision}`.

1. **Format the results using the CLI:**
2. **Push Results to the Leaderboard**

To add results to the public leaderboard, you can push your results to the [results repository](https://github.com/embeddings-benchmark/results); they will then appear on the leaderboard after a day.


3. (Optional) **Add the results to the model card:**

`mteb` implements a CLI for adding results to the model card:

```bash
mteb create_meta --results_folder results/{model_name}/{model_revision} --output_path model_card.md
```

If readme of model exists:
To add the content to the public model, simply copy the content of the `model_card.md` file to the top of a `README.md` file of your model on the Hub. See [here](https://huggingface.co/Muennighoff/SGPT-5.8B-weightedmean-msmarco-specb-bitfit/blob/main/README.md) for an example.

If the readme already exists:

```bash
mteb create_meta --results_folder results/{model_name}/{model_revision} --output_path model_card.md --from_existing your_existing_readme.md
```

2. **Add the frontmatter to model repository:**

Copy the content of the `model_card.md` file to the top of a `README.md` file of your model on the Hub. See [here](https://huggingface.co/Muennighoff/SGPT-5.8B-weightedmean-msmarco-specb-bitfit/blob/main/README.md) for an example.
Note that if you run the model on many tasks, this can lead to an excessively large readme frontmatter.

3. **Wait for a refresh the leaderboard:**
4. **Wait for a refresh of the leaderboard:**

The leaderboard [automatically refreshes daily](https://github.com/embeddings-benchmark/leaderboard/commits/main/) so once submitted you only need to wait for the automatic refresh. You can find the workflows for the leaderboard refresh [here](https://github.com/embeddings-benchmark/leaderboard/tree/main/.github/workflows). If you experience issues with the leaderboard please create an [issue](https://github.com/embeddings-benchmark/mteb/issues).

**Notes:**
- We remove models with scores that cannot be reproduced, so please ensure that your model is accessible and scores can be reproduced.
- An alternative way of submitting to the leaderboard is by opening a PR with your results [here](https://github.com/embeddings-benchmark/results) & checking that they are displayed correctly by [locally running the leaderboard](https://github.com/embeddings-benchmark/leaderboard?tab=readme-ov-file#developer-setup)

##### Using Prompts with Sentence Transformers

If your model uses Sentence Transformers and requires different prompts for encoding the queries and corpus, you can take advantage of the `prompts` [parameter](https://sbert.net/docs/package_reference/sentence_transformer/SentenceTransformer.html#sentence_transformers.SentenceTransformer).

Internally, `mteb` uses the prompt named `query` for encoding the queries and `passage` as the prompt name for encoding the corpus. This is aligned with the default names used by Sentence Transformers.

###### Adding the prompts in the model configuration (Preferred)

You can directly add the prompts when saving and uploading your model to the Hub. For an example, refer to this [configuration file](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v1.5/blob/3b5a16eaf17e47bd997da998988dce5877a57092/config_sentence_transformers.json).

###### Instantiating the Model with Prompts

If you are unable to directly add the prompts in the model configuration, you can instantiate the model using the `sentence_transformers_loader` and pass `prompts` as an argument. For more details, see the `mteb/models/bge_models.py` file.
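As a minimal illustration of the `query`/`passage` prompt names described above, the prompts can also be set directly when constructing a Sentence Transformers model; this is a sketch with placeholder prompt strings, not the `sentence_transformers_loader` route itself (see `mteb/models/bge_models.py` for that):

```python
from sentence_transformers import SentenceTransformer

# mteb looks up the "query" and "passage" prompt names; the strings are placeholders.
model = SentenceTransformer(
    "intfloat/multilingual-e5-small",
    prompts={"query": "query: ", "passage": "passage: "},
)
```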