BEIR is a heterogeneous benchmark containing diverse IR tasks. It also provides a common and easy framework for evaluation of your NLP-based retrieval models within the benchmark.
For more information, check out our publications:
- BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models (NeurIPS 2021, Datasets and Benchmarks Track)
- Installation
- Features
- Leaderboard
- Course Material on IR
- Examples and Tutorials
- Quick Example
- Datasets
- Models
- Available Metrics
- Citing & Authors
Install via pip:
pip install beir
If you want to build from source, use:
$ git clone https://github.com/benchmarkir/beir.git
$ cd beir
$ pip install -e .
Tested with Python versions 3.6 and 3.7.
- Preprocess your own IR dataset or use one of the already-preprocessed 17 benchmark datasets
- Covers a wide range of settings, with diverse benchmarks useful for both academia and industry
- Includes well-known retrieval architectures (lexical, dense, sparse and reranking-based)
- Add and evaluate your own model in an easy framework using different state-of-the-art evaluation metrics
The BEIR leaderboard is available in the Google Sheets below, because the tables were not easy to read in Markdown.
Leaderboard | Link |
---|---|
Dense Retrieval | Google Sheet |
BM25 top-100 + CE Reranking | Google Sheet |
If you are new to Information Retrieval and wish to learn more about classical or neural IR, we suggest you look at the open-source courses below.
Course | University | Instructor | Link | Available |
---|---|---|---|---|
Training SOTA Neural Search Models | Hugging Face | Nils Reimers | Link | Video |
BEIR: Benchmarking IR | UKP Lab | Nandan Thakur | Link | Video + Slides |
Intro to Advanced IR | TU Wien'21 | Sebastian Hofstaetter | Link | Videos + Slides |
CS224U NLU + IR | Stanford'21 | Omar Khattab | Link | Slides |
Pretrained Transformers for Text Ranking: BERT and Beyond | MPI, Waterloo'21 | Andrew Yates, Rodrigo Nogueira, Jimmy Lin | Link | |
BoF Session on IR | NAACL'21 | Sean MacAvaney, Luca Soldaini | Link | Slides |
To easily understand and get your hands dirty with BEIR, we invite you to try our tutorials out 🚀 🚀
Name | Link |
---|---|
How to evaluate pre-trained models on BEIR datasets |
Name | Link |
---|---|
BM25 Retrieval with Elasticsearch | evaluate_bm25.py |
Anserini-BM25 (Pyserini) Retrieval with Docker | evaluate_anserini_bm25.py |
Multilingual BM25 Retrieval with Elasticsearch 🆕 | evaluate_multilingual_bm25.py |
Name | Link |
---|---|
Exact-search retrieval using (dense) Sentence-BERT | evaluate_sbert.py |
Exact-search retrieval using (dense) ANCE | evaluate_ance.py |
Exact-search retrieval using (dense) DPR | evaluate_dpr.py |
Exact-search retrieval using (dense) USE-QA | evaluate_useqa.py |
ANN and Exact-search using Faiss 🆕 | evaluate_faiss_dense.py |
Retrieval using Binary Passage Retriever (BPR) 🆕 | evaluate_bpr.py |
Dimension Reduction using PCA 🆕 | evaluate_dim_reduction.py |
Name | Link |
---|---|
Hybrid sparse retrieval using SPARTA | evaluate_sparta.py |
Sparse retrieval using docT5query and Pyserini | evaluate_anserini_docT5query.py |
Sparse retrieval using docT5query (MultiGPU) and Pyserini 🆕 | evaluate_anserini_docT5query_parallel.py |
Sparse retrieval using DeepCT and Pyserini 🆕 | evaluate_deepct.py |
Name | Link |
---|---|
Reranking top-100 BM25 results with SBERT CE | evaluate_bm25_ce_reranking.py |
Reranking top-100 BM25 results with Dense Retriever | evaluate_bm25_sbert_reranking.py |
Name | Link |
---|---|
Train SBERT with Inbatch negatives | train_sbert.py |
Train SBERT with BM25 hard negatives | train_sbert_BM25_hardnegs.py |
Train MSMARCO SBERT with BM25 Negatives | train_msmarco_v2.py |
Train (SOTA) MSMARCO SBERT with Mined Hard Negatives 🆕 | train_msmarco_v3.py |
Train (SOTA) MSMARCO BPR with Mined Hard Negatives 🆕 | train_msmarco_v3_bpr.py |
Train (SOTA) MSMARCO SBERT with Mined Hard Negatives (Margin-MSE) 🆕 | train_msmarco_v3_margin_MSE.py |
Name | Link |
---|---|
Synthetic Query Generation using T5-model | query_gen.py |
(GenQ) Synthetic QG using T5-model + fine-tuning SBERT | query_gen_and_train.py |
Synthetic Query Generation using Multiple GPU and T5 🆕 | query_gen_multi_gpu.py |
Name | Link |
---|---|
Benchmark BM25 (Inference speed) | benchmark_bm25.py |
Benchmark Cross-Encoder Reranking (Inference speed) | benchmark_bm25_ce_reranking.py |
Benchmark Dense Retriever (Inference speed) | benchmark_sbert.py |
from beir import util, LoggingHandler
from beir.retrieval import models
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES
import logging
import pathlib, os
#### Just some code to print debug information to stdout
logging.basicConfig(format='%(asctime)s - %(message)s',
datefmt='%Y-%m-%d %H:%M:%S',
level=logging.INFO,
handlers=[LoggingHandler()])
#### /print debug information to stdout
#### Download scifact.zip dataset and unzip the dataset
dataset = "scifact"
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip".format(dataset)
out_dir = os.path.join(pathlib.Path(__file__).parent.absolute(), "datasets")
data_path = util.download_and_unzip(url, out_dir)
#### Provide the data_path where scifact has been downloaded and unzipped
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")
#### Load the SBERT model and retrieve using cosine-similarity
model = DRES(models.SentenceBERT("msmarco-distilbert-base-v3"), batch_size=16)
retriever = EvaluateRetrieval(model, score_function="cos_sim") # or "dot" for dot-product
results = retriever.retrieve(corpus, queries)
#### Evaluate your model with NDCG@k, MAP@K, Recall@K and Precision@K where k = [1,3,5,10,100,1000]
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
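To inspect what the retriever returned, it helps to know how the output is shaped. The sketch below assumes `results` maps each query ID to a dictionary of document IDs and scores, mirroring the nested layout of `qrels`. `top_k_docs` is a hypothetical helper written for illustration and is not part of BEIR.

```python
from typing import Dict, List, Tuple


def top_k_docs(results: Dict[str, Dict[str, float]], query_id: str, k: int = 3) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (doc_id, score) pairs for one query."""
    scores = results.get(query_id, {})
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


# Toy data standing in for the output of retriever.retrieve(corpus, queries)
toy_results = {"q1": {"doc1": 0.91, "doc2": 0.15, "doc3": 0.42}}
print(top_k_docs(toy_results, "q1", k=2))
```

With real output, the same helper can be called with any query ID from `queries` to check whether the top-ranked documents look plausible.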
You can load one of the already preprocessed datasets into your current directory as follows:
from beir import util
from beir.datasets.data_loader import GenericDataLoader
dataset = "scifact"
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip".format(dataset)
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")
This will download the `scifact` dataset under the `datasets` directory.
For other datasets, just use one of the dataset names mentioned below.
Command to generate md5hash using Terminal: `md5hash filename.zip`.
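If no terminal tool is at hand, the checksum can also be computed in Python. This is a minimal sketch using only the standard library; `md5_of_file` is a hypothetical helper name, and the file path in the comment is an assumption about where the archive was saved.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Example (hypothetical path): compare against the md5 column of the dataset table
# assert md5_of_file("datasets/scifact.zip") == "5f7d1de60b170fc8027bb7898e2efca1"
```

Reading in chunks keeps memory use flat even for the multi-gigabyte archives.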
Dataset | Website | BEIR-Name | Type | Queries | Corpus | Rel D/Q | Download | md5 |
---|---|---|---|---|---|---|---|---|
MSMARCO | Homepage | msmarco | train dev test | 6,980 | 8.84M | 1.1 | Link | 444067daf65d982533ea17ebd59501e4 |
MSMARCO v2 | Homepage | msmarco-v2 | train dev1 dev2 | 4,552 4,702 | 138M | | Link | ba6238b403f0b345683885cc9390fff5 |
TREC-COVID | Homepage | trec-covid | test | 50 | 171K | 493.5 | Link | ce62140cb23feb9becf6270d0d1fe6d1 |
NFCorpus | Homepage | nfcorpus | train dev test | 323 | 3.6K | 38.2 | Link | a89dba18a62ef92f7d323ec890a0d38d |
BioASQ | Homepage | bioasq | train test | 500 | 14.91M | 8.05 | No | How to Reproduce? |
NQ | Homepage | nq | train test | 3,452 | 2.68M | 1.2 | Link | d4d3d2e48787a744b6f6e691ff534307 |
HotpotQA | Homepage | hotpotqa | train dev test | 7,405 | 5.23M | 2.0 | Link | f412724f78b0d91183a0e86805e16114 |
FiQA-2018 | Homepage | fiqa | train dev test | 648 | 57K | 2.6 | Link | 17918ed23cd04fb15047f73e6c3bd9d9 |
Signal-1M(RT) | Homepage | signal1m | test | 97 | 2.86M | 19.6 | No | How to Reproduce? |
TREC-NEWS | Homepage | trec-news | test | 57 | 595K | 19.6 | No | How to Reproduce? |
ArguAna | Homepage | arguana | test | 1,406 | 8.67K | 1.0 | Link | 8ad3e3c2a5867cdced806d6503f29b99 |
Touche-2020 | Homepage | webis-touche2020 | test | 49 | 382K | 19.0 | Link | 46f650ba5a527fc69e0a6521c5a23563 |
CQADupstack | Homepage | cqadupstack | test | 13,145 | 457K | 1.4 | Link | 4e41456d7df8ee7760a7f866133bda78 |
Quora | Homepage | quora | dev test | 10,000 | 523K | 1.6 | Link | 18fb154900ba42a600f84b839c173167 |
DBPedia | Homepage | dbpedia-entity | dev test | 400 | 4.63M | 38.2 | Link | c2a39eb420a3164af735795df012ac2c |
SCIDOCS | Homepage | scidocs | test | 1,000 | 25K | 4.9 | Link | 38121350fc3a4d2f48850f6aff52e4a9 |
FEVER | Homepage | fever | train dev test | 6,666 | 5.42M | 1.2 | Link | 5a818580227bfb4b35bb6fa46d9b6c03 |
Climate-FEVER | Homepage | climate-fever | test | 1,535 | 5.42M | 3.0 | Link | 8b66f0a9126c521bae2bde127b4dc99d |
SciFact | Homepage | scifact | train test | 300 | 5K | 1.1 | Link | 5f7d1de60b170fc8027bb7898e2efca1 |
Robust04 | Homepage | robust04 | test | 249 | 528K | 69.9 | No | How to Reproduce? |
Language | Dataset | Website | BEIR-Name | Type | Queries | Corpus | Rel D/Q | Download | md5 |
---|---|---|---|---|---|---|---|---|---|
German | GermanQuAD | Homepage | germanquad | test | 2,044 | 2.80M | 1.0 | Link | 95a581c3162d10915a418609bcce851b |
Arabic | Mr.TyDI | Homepage | mrtydi/arabic | train dev test | 1,081 | 2.1M | 1.2 | Link | 17072d0e1610bd8461d962b8ac560fc5 |
Bengali | Mr.TyDI | Homepage | mrtydi/bengali | train dev test | 111 | 304K | 1.2 | Link | 17072d0e1610bd8461d962b8ac560fc5 |
Finnish | Mr.TyDI | Homepage | mrtydi/finnish | train dev test | 1,254 | 1.9M | 1.2 | Link | 17072d0e1610bd8461d962b8ac560fc5 |
Indonesian | Mr.TyDI | Homepage | mrtydi/indonesian | train dev test | 829 | 1.47M | 1.2 | Link | 17072d0e1610bd8461d962b8ac560fc5 |
Japanese | Mr.TyDI | Homepage | mrtydi/japanese | train dev test | 720 | 7M | 1.3 | Link | 17072d0e1610bd8461d962b8ac560fc5 |
Korean | Mr.TyDI | Homepage | mrtydi/korean | train dev test | 421 | 1.5M | 1.2 | Link | 17072d0e1610bd8461d962b8ac560fc5 |
Russian | Mr.TyDI | Homepage | mrtydi/russian | train dev test | 995 | 9.6M | 1.2 | Link | 17072d0e1610bd8461d962b8ac560fc5 |
Swahili | Mr.TyDI | Homepage | mrtydi/swahili | train dev test | 670 | 136K | 1.1 | Link | 17072d0e1610bd8461d962b8ac560fc5 |
Telugu | Mr.TyDI | Homepage | mrtydi/telugu | train dev test | 646 | 548K | 1.0 | Link | 17072d0e1610bd8461d962b8ac560fc5 |
Thai | Mr.TyDI | Homepage | mrtydi/thai | train dev test | 1,190 | 568K | 1.1 | Link | 17072d0e1610bd8461d962b8ac560fc5 |
Language | Dataset | Website | BEIR-Name | Type | Queries | Corpus | Rel D/Q | Download | md5 |
---|---|---|---|---|---|---|---|---|---|
Spanish | mMARCO | Homepage | mmarco/spanish | train dev | 6,980 | 8.84M | 1.1 | Link | b727dbec65315a76bceaff56ad77d2c7 |
French | mMARCO | Homepage | mmarco/french | train dev | 6,980 | 8.84M | 1.1 | Link | b727dbec65315a76bceaff56ad77d2c7 |
Portuguese | mMARCO | Homepage | mmarco/portuguese | train dev | 6,980 | 8.84M | 1.1 | Link | b727dbec65315a76bceaff56ad77d2c7 |
Italian | mMARCO | Homepage | mmarco/italian | train dev | 6,980 | 8.84M | 1.1 | Link | b727dbec65315a76bceaff56ad77d2c7 |
Indonesian | mMARCO | Homepage | mmarco/indonesian | train dev | 6,980 | 8.84M | 1.1 | Link | b727dbec65315a76bceaff56ad77d2c7 |
German | mMARCO | Homepage | mmarco/german | train dev | 6,980 | 8.84M | 1.1 | Link | b727dbec65315a76bceaff56ad77d2c7 |
Russian | mMARCO | Homepage | mmarco/russian | train dev | 6,980 | 8.84M | 1.1 | Link | b727dbec65315a76bceaff56ad77d2c7 |
Chinese | mMARCO | Homepage | mmarco/chinese | train dev | 6,980 | 8.84M | 1.1 | Link | b727dbec65315a76bceaff56ad77d2c7 |
Otherwise, you can load a custom preprocessed dataset in the following way:
from beir.datasets.data_loader import GenericDataLoader
corpus_path = "your_corpus_file.jsonl"
query_path = "your_query_file.jsonl"
qrels_path = "your_qrels_file.tsv"
corpus, queries, qrels = GenericDataLoader(
corpus_file=corpus_path,
query_file=query_path,
qrels_file=qrels_path).load_custom()
Make sure that the dataset is in the following format:
- corpus file: a .jsonl file (jsonlines) that contains a list of dictionaries, each with three fields: `_id` with unique document identifier, `title` with document title (optional) and `text` with document paragraph or passage. For example: `{"_id": "doc1", "title": "Albert Einstein", "text": "Albert Einstein was a German-born...."}`
- queries file: a .jsonl file (jsonlines) that contains a list of dictionaries, each with two fields: `_id` with unique query identifier and `text` with query text. For example: `{"_id": "q1", "text": "Who developed the mass-energy equivalence formula?"}`
- qrels file: a .tsv file (tab-separated) that contains three columns, i.e. the query-id, corpus-id and score, in this order. Keep the first row as a header. For example:
q1 doc1 1
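The three files described above can be produced with the standard library alone. The following is a sketch rather than a BEIR utility: `write_beir_style_dataset` and the output file names are hypothetical, and the header labels simply reuse the column names given in the description.

```python
import csv
import json
import os
import tempfile


def write_beir_style_dataset(out_dir, corpus, queries, qrels):
    """Write corpus/queries as JSON lines and qrels as a tab-separated file with a header row."""
    os.makedirs(out_dir, exist_ok=True)
    corpus_path = os.path.join(out_dir, "corpus.jsonl")    # hypothetical file name
    query_path = os.path.join(out_dir, "queries.jsonl")    # hypothetical file name
    qrels_path = os.path.join(out_dir, "qrels.tsv")        # hypothetical file name

    with open(corpus_path, "w", encoding="utf-8") as f:
        for doc_id, doc in corpus.items():
            row = {"_id": doc_id, "title": doc.get("title", ""), "text": doc["text"]}
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

    with open(query_path, "w", encoding="utf-8") as f:
        for query_id, text in queries.items():
            f.write(json.dumps({"_id": query_id, "text": text}, ensure_ascii=False) + "\n")

    with open(qrels_path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerow(["query-id", "corpus-id", "score"])  # first row is the header
        for query_id, judged in qrels.items():
            for doc_id, score in judged.items():
                writer.writerow([query_id, doc_id, score])

    return corpus_path, query_path, qrels_path


# Usage with toy data, written to a throwaway directory
with tempfile.TemporaryDirectory() as tmp_dir:
    paths = write_beir_style_dataset(
        tmp_dir,
        corpus={"doc1": {"title": "Albert Einstein", "text": "Albert Einstein was a German-born...."}},
        queries={"q1": "Who developed the mass-energy equivalence formula?"},
        qrels={"q1": {"doc1": 1}},
    )
    print([os.path.basename(p) for p in paths])
```

The returned paths can then be passed as `corpus_file`, `query_file` and `qrels_file` to the loader call shown earlier.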
You can also skip the dataset loading part and provide directly corpus, queries and qrels in the following way:
corpus = {
"doc1" : {
"title": "Albert Einstein",
"text": "Albert Einstein was a German-born theoretical physicist who developed the theory of relativity, \
one of the two pillars of modern physics (alongside quantum mechanics). His work is also known for \
its influence on the philosophy of science. He is best known to the general public for his mass–energy \
equivalence formula E = mc2, which has been dubbed 'the world's most famous equation'. He received the 1921 \
Nobel Prize in Physics 'for his services to theoretical physics, and especially for his discovery of the law \
of the photoelectric effect', a pivotal step in the development of quantum theory."
},
"doc2" : {
"title": "", # Keep title an empty string if not present
"text": "Wheat beer is a top-fermented beer which is brewed with a large proportion of wheat relative to the amount of \
malted barley. The two main varieties are German Weißbier and Belgian witbier; other types include Lambic (made \
with wild yeast), Berliner Weisse (a cloudy, sour beer), and Gose (a sour, salty beer)."
},
}
queries = {
"q1" : "Who developed the mass-energy equivalence formula?",
"q2" : "Which beer is brewed with a large proportion of wheat?"
}
qrels = {
"q1" : {"doc1": 1},
"q2" : {"doc2": 1},
}
Similar to TensorFlow Datasets or Hugging Face's datasets library, we just downloaded and prepared public datasets. We only distribute these datasets in a specific format, but we do not vouch for their quality or fairness, or claim that you have license to use the dataset. It remains the user's responsibility to determine whether they have permission to use the dataset under the dataset's license and to cite the right owner of the dataset.
If you're a dataset owner and wish to update any part of it, or do not want your dataset to be included in this library, feel free to post an issue here or make a pull request!
If you're a dataset owner and wish to include your dataset or model in this library, feel free to post an issue here or make a pull request!
We include different retrieval architectures and evaluate them all in a zero-shot setup.
from beir.retrieval.search.lexical import BM25Search as BM25
hostname = "your-hostname" #localhost
index_name = "your-index-name" # scifact
initialize = True # True, will delete existing index with same name and reindex all documents
model = BM25(index_name=index_name, hostname=hostname, initialize=initialize)
from beir.retrieval.search.sparse import SparseSearch
from beir.retrieval import models
model_path = "BeIR/sparta-msmarco-distilbert-base-v1"
sparse_model = SparseSearch(models.SPARTA(model_path), batch_size=128)
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES
model = DRES(models.SentenceBERT("msmarco-distilbert-base-v3"), batch_size=16)
retriever = EvaluateRetrieval(model, score_function="cos_sim") # or "dot" for dot-product
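The choice between `"cos_sim"` and `"dot"` can change the ranking, because the dot product rewards long vectors while cosine similarity only looks at direction. The pure-Python sketch below illustrates that difference on two-dimensional toy vectors. `exact_search` is a hypothetical function written for this example and does not reflect how BEIR implements its search internally.

```python
import math
from typing import Dict, List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cos_sim(a: List[float], b: List[float]) -> float:
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0


def exact_search(query_vec, doc_vecs: Dict[str, List[float]], score_function="cos_sim", top_k=2):
    """Score every document against the query (no index) and keep the top_k."""
    if score_function not in ("cos_sim", "dot"):
        raise ValueError("score_function must be 'cos_sim' or 'dot'")
    score = cos_sim if score_function == "cos_sim" else dot
    scored = {doc_id: score(query_vec, vec) for doc_id, vec in doc_vecs.items()}
    return dict(sorted(scored.items(), key=lambda kv: kv[1], reverse=True)[:top_k])


doc_vecs = {"doc1": [1.0, 0.0], "doc2": [3.0, 3.0]}
query_vec = [1.0, 0.0]
print(list(exact_search(query_vec, doc_vecs, "cos_sim")))  # doc1 ranks first
print(list(exact_search(query_vec, doc_vecs, "dot")))      # doc2 ranks first
```

Which function is appropriate generally depends on how the embedding model was trained, so it is worth checking the model's documentation before picking one.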
from beir.reranking.models import CrossEncoder
from beir.reranking import Rerank
cross_encoder_model = CrossEncoder('cross-encoder/ms-marco-electra-base')
reranker = Rerank(cross_encoder_model, batch_size=128)
# Rerank top-100 results retrieved by BM25
rerank_results = reranker.rerank(corpus, queries, bm25_results, top_k=100)
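Conceptually, reranking keeps only the best `top_k` candidates from the first stage and replaces their scores with those of a more expensive pairwise scorer. The sketch below shows that flow in plain Python. `rerank_top_k` and the term-counting scorer are hypothetical stand-ins chosen for illustration; a real setup would use a cross-encoder for scoring.

```python
from typing import Callable, Dict


def rerank_top_k(
    corpus: Dict[str, Dict[str, str]],
    queries: Dict[str, str],
    first_stage: Dict[str, Dict[str, float]],
    score_pair: Callable[[str, str], float],
    top_k: int = 100,
) -> Dict[str, Dict[str, float]]:
    """Rescore the top_k first-stage candidates of every query with score_pair."""
    reranked = {}
    for query_id, doc_scores in first_stage.items():
        candidates = sorted(doc_scores, key=doc_scores.get, reverse=True)[:top_k]
        reranked[query_id] = {
            doc_id: score_pair(queries[query_id], corpus[doc_id]["text"])
            for doc_id in candidates
        }
    return reranked


def count_query_terms(query: str, passage: str) -> float:
    """Toy scorer: how many passage tokens are query terms."""
    terms = set(query.lower().split())
    return float(sum(1 for token in passage.lower().split() if token in terms))


toy_corpus = {
    "d1": {"title": "", "text": "wheat beer is brewed with wheat"},
    "d2": {"title": "", "text": "theory of relativity"},
    "d3": {"title": "", "text": "sour beer"},
}
toy_queries = {"q1": "wheat beer"}
toy_first_stage = {"q1": {"d1": 1.0, "d2": 5.0, "d3": 3.0}}
print(rerank_top_k(toy_corpus, toy_queries, toy_first_stage, count_query_terms, top_k=2))
```

Note that `d1` is never rescored here because it falls outside the first-stage top 2, which is why the quality of the first-stage retriever bounds what a reranker can recover.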
Name | Implementation |
---|---|
BM25 (Robertson and Zaragoza, 2009) | https://www.elastic.co/ |
Anserini (Yang et al., 2017) | https://github.com/castorini/anserini |
SBERT (Reimers and Gurevych, 2019) | https://www.sbert.net/ |
ANCE (Xiong et al., 2020) | https://github.com/microsoft/ANCE |
DPR (Karpukhin et al., 2020) | https://github.com/facebookresearch/DPR |
USE-QA (Yang et al., 2020) | https://tfhub.dev/google/universal-sentence-encoder-qa/3 |
SPARTA (Zhao et al., 2020) | https://huggingface.co/BeIR |
ColBERT (Khattab and Zaharia, 2020) | https://github.com/stanford-futuredata/ColBERT |
If you use any one of the implementations, please make sure to include the correct citation.
If you implemented a model and wish to update any part of it, or do not want the model to be included, feel free to post an issue here or make a pull request!
If you implemented a model and wish to include your model in this library, feel free to post an issue here or make a pull request. Otherwise, if you want to evaluate the model on your own, see the following section.
Wrap your dual-encoder model in a class that implements two functions: `encode_queries` and `encode_corpus`.
from typing import Dict, List

import numpy as np
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

class YourCustomDEModel:
    def __init__(self, model_path=None, **kwargs):
        self.model = None # ---> HERE Load your custom model

    # Write your own encoding query function (Returns: Query embeddings as numpy array)
    def encode_queries(self, queries: List[str], batch_size: int, **kwargs) -> np.ndarray:
        pass

    # Write your own encoding corpus function (Returns: Document embeddings as numpy array)
    def encode_corpus(self, corpus: List[Dict[str, str]], batch_size: int, **kwargs) -> np.ndarray:
        pass

custom_model = DRES(YourCustomDEModel(model_path="your-custom-model-path"))
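To make the interface concrete, here is a toy class that fills in both methods with a hashed bag-of-words embedding. It is only a sketch of the expected shapes, assuming NumPy is installed. `HashingBagOfWordsModel` is a hypothetical name, and the embedding is far too crude for real retrieval.

```python
import hashlib
from typing import Dict, List

import numpy as np


class HashingBagOfWordsModel:
    """Toy dual encoder: hashes each token into one of `dim` buckets and counts occurrences."""

    def __init__(self, dim: int = 64, **kwargs):
        self.dim = dim

    def _embed(self, text: str) -> np.ndarray:
        vec = np.zeros(self.dim, dtype=np.float32)
        for token in text.lower().split():
            # md5 gives a hash that is stable across Python processes
            bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % self.dim
            vec[bucket] += 1.0
        return vec

    def encode_queries(self, queries: List[str], batch_size: int = 16, **kwargs) -> np.ndarray:
        return np.stack([self._embed(query) for query in queries])

    def encode_corpus(self, corpus: List[Dict[str, str]], batch_size: int = 16, **kwargs) -> np.ndarray:
        # Concatenate the optional title with the passage text before embedding
        texts = [(doc.get("title", "") + " " + doc["text"]).strip() for doc in corpus]
        return np.stack([self._embed(text) for text in texts])


toy_model = HashingBagOfWordsModel(dim=32)
query_embeddings = toy_model.encode_queries(["wheat beer", "mass-energy equivalence"])
doc_embeddings = toy_model.encode_corpus([{"title": "", "text": "wheat beer"}])
print(query_embeddings.shape, doc_embeddings.shape)
```

Both methods return one row per input, so queries and documents end up in the same vector space and can be compared with cosine similarity or a dot product.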
Wrap your cross-encoder model in a class that implements a single function: `predict`.
from typing import List, Tuple

from beir.reranking import Rerank

class YourCustomCEModel:
    def __init__(self, model_path=None, **kwargs):
        self.model = None # ---> HERE Load your custom model

    # Write your own score function, which takes in query-document text pairs and returns the similarity scores
    def predict(self, sentences: List[Tuple[str, str]], batch_size: int, **kwargs) -> List[float]:
        pass # return only the list of float scores

reranker = Rerank(YourCustomCEModel(model_path="your-custom-model-path"), batch_size=128)
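As a minimal illustration of the `predict` contract, the class below scores each query-document pair with Jaccard similarity over lowercase tokens and returns one float per pair. `JaccardCEModel` is a hypothetical stand-in for a trained cross-encoder, used here only to show the input and output shapes.

```python
from typing import List, Tuple


class JaccardCEModel:
    """Toy 'cross-encoder': Jaccard similarity between query and passage token sets."""

    def __init__(self, model_path=None, **kwargs):
        self.model = None  # nothing to load for this toy scorer

    def predict(self, sentences: List[Tuple[str, str]], batch_size: int = 32, **kwargs) -> List[float]:
        scores = []
        for query, passage in sentences:
            query_tokens = set(query.lower().split())
            passage_tokens = set(passage.lower().split())
            union = query_tokens | passage_tokens
            scores.append(len(query_tokens & passage_tokens) / len(union) if union else 0.0)
        return scores


pairs = [("wheat beer", "wheat beer"), ("wheat beer", "quantum theory")]
print(JaccardCEModel().predict(pairs))
```

The returned list has the same length and order as the input pairs, which is what lets a reranker map scores back to documents.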
We evaluate our models using pytrec_eval, and in the future we can extend this to include more retrieval-based metrics:
- NDCG (`NDCG@k`)
- MAP (`MAP@k`)
- Recall (`Recall@k`)
- Precision (`P@k`)
We also include custom metrics that can be used for evaluation; please refer to evaluate_custom_metrics.py:
- MRR (`MRR@k`)
- Capped Recall (`R_cap@k`)
- Hole (`Hole@k`): % of top-k docs retrieved unseen by annotators
- Top-K Accuracy (`Accuracy@k`): % of relevant docs present in top-k results
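To make two of these metrics concrete, the sketch below computes MRR@k and Hole@k on nested dictionaries shaped like `qrels` and the retrieval output. It is an illustration under stated assumptions, not the library's implementation: a document counts as relevant when its qrels score is above 0, a document counts as seen by annotators when it appears anywhere in `qrels`, and both metrics are averaged over the queries in `results`. The function names are hypothetical.

```python
from typing import Dict


def mrr_at_k(qrels: Dict[str, Dict[str, int]], results: Dict[str, Dict[str, float]], k: int) -> float:
    """Mean reciprocal rank of the first relevant document within the top k."""
    total = 0.0
    for query_id, doc_scores in results.items():
        relevant = {doc_id for doc_id, score in qrels.get(query_id, {}).items() if score > 0}
        ranked = sorted(doc_scores, key=doc_scores.get, reverse=True)[:k]
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(results) if results else 0.0


def hole_at_k(qrels: Dict[str, Dict[str, int]], results: Dict[str, Dict[str, float]], k: int) -> float:
    """Average share of the top-k documents that have no judgement at all."""
    judged = {doc_id for docs in qrels.values() for doc_id in docs}
    total = 0.0
    for doc_scores in results.values():
        ranked = sorted(doc_scores, key=doc_scores.get, reverse=True)[:k]
        total += sum(1 for doc_id in ranked if doc_id not in judged) / k
    return total / len(results) if results else 0.0


toy_qrels = {"q1": {"doc1": 1}, "q2": {"doc2": 1}}
toy_run = {
    "q1": {"doc3": 0.9, "doc1": 0.8},  # relevant doc at rank 2, doc3 unjudged
    "q2": {"doc2": 0.7, "doc4": 0.1},  # relevant doc at rank 1, doc4 unjudged
}
print(mrr_at_k(toy_qrels, toy_run, k=2))   # 0.75
print(hole_at_k(toy_qrels, toy_run, k=2))  # 0.5
```

A high Hole@k suggests that many retrieved documents were never judged, so the other metrics may understate a model's true quality on that dataset.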
If you find this repository helpful, feel free to cite our publication BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models:
@inproceedings{
thakur2021beir,
title={{BEIR}: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models},
author={Nandan Thakur and Nils Reimers and Andreas R{\"u}ckl{\'e} and Abhishek Srivastava and Iryna Gurevych},
booktitle={Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2)},
year={2021},
url={https://openreview.net/forum?id=wCu6T5xFjeJ}
}
The main contributors of this repository are:
- Nandan Thakur, Personal Website: nandan-thakur.com
Contact person: Nandan Thakur, [email protected]
https://www.ukp.tu-darmstadt.de/
Don't hesitate to send us an e-mail or report an issue, if something is broken (and it shouldn't be) or if you have further questions.
This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.