# Anserini: BM25 Baselines on MS MARCO Doc Retrieval Task

This page contains instructions for running BM25 baselines on the MS MARCO document ranking task. Note that there is a separate MS MARCO passage ranking task.

## Data Prep

We're going to use msmarco-doc/ as the working directory. First, we need to download and extract the MS MARCO document dataset:

```
mkdir collections/msmarco-doc
mkdir indexes/msmarco-doc

wget https://msmarco.blob.core.windows.net/msmarcoranking/msmarco-docs.trec.gz -P collections/msmarco-doc
```

To confirm, `msmarco-docs.trec.gz` should have an MD5 checksum of `d4863e4f342982b51b9a8fc668b2d0c0`.
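
On Linux, the checksum can be verified with `md5sum` (on macOS, `md5 -q` prints the same digest):

```
md5sum collections/msmarco-doc/msmarco-docs.trec.gz
# Expected output:
# d4863e4f342982b51b9a8fc668b2d0c0  collections/msmarco-doc/msmarco-docs.trec.gz
```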

There's no need to uncompress the file, as Anserini can directly index gzipped files. Build the index with the following command:

```
nohup sh target/appassembler/bin/IndexCollection -collection CleanTrecCollection \
 -generator DefaultLuceneDocumentGenerator -threads 1 -input collections/msmarco-doc \
 -index indexes/msmarco-doc/lucene-index.msmarco-doc.pos+docvectors+rawdocs -storePositions -storeDocvectors -storeRaw \
 >& logs/log.msmarco-doc.pos+docvectors+rawdocs &
```

On a modern desktop with an SSD, indexing takes around 45 minutes. The final log lines should look something like this:

```
2020-01-14 16:36:30,954 INFO  [main] index.IndexCollection (IndexCollection.java:851) - ============ Final Counter Values ============
2020-01-14 16:36:30,955 INFO  [main] index.IndexCollection (IndexCollection.java:852) - indexed:        3,213,835
2020-01-14 16:36:30,955 INFO  [main] index.IndexCollection (IndexCollection.java:853) - unindexable:            0
2020-01-14 16:36:30,955 INFO  [main] index.IndexCollection (IndexCollection.java:854) - empty:                  0
2020-01-14 16:36:30,955 INFO  [main] index.IndexCollection (IndexCollection.java:855) - skipped:                0
2020-01-14 16:36:30,955 INFO  [main] index.IndexCollection (IndexCollection.java:856) - errors:                 0
2020-01-14 16:36:30,961 INFO  [main] index.IndexCollection (IndexCollection.java:859) - Total 3,213,835 documents indexed in 00:45:32
```

## Retrieving and Evaluating the Dev Set

Let's download the queries and qrels:

```
mkdir collections/msmarco-doc/queries-and-qrels
wget https://msmarco.blob.core.windows.net/msmarcoranking/msmarco-doctrain-queries.tsv.gz -P collections/msmarco-doc/queries-and-qrels
wget https://msmarco.blob.core.windows.net/msmarcoranking/msmarco-doctrain-top100.gz -P collections/msmarco-doc/queries-and-qrels
wget https://msmarco.blob.core.windows.net/msmarcoranking/msmarco-doctrain-qrels.tsv.gz -P collections/msmarco-doc/queries-and-qrels

wget https://msmarco.blob.core.windows.net/msmarcoranking/msmarco-docdev-queries.tsv.gz -P collections/msmarco-doc/queries-and-qrels
wget https://msmarco.blob.core.windows.net/msmarcoranking/msmarco-docdev-top100.gz -P collections/msmarco-doc/queries-and-qrels
wget https://msmarco.blob.core.windows.net/msmarcoranking/msmarco-docdev-qrels.tsv.gz -P collections/msmarco-doc/queries-and-qrels

gunzip collections/msmarco-doc/queries-and-qrels/*.gz
```

Here are the sizes:

```
$ wc collections/msmarco-doc/queries-and-qrels/*.tsv
    5193   20772  108276 collections/msmarco-doc/queries-and-qrels/msmarco-docdev-qrels.tsv
    5193   35787  220304 collections/msmarco-doc/queries-and-qrels/msmarco-docdev-queries.tsv
  367013 1468052 7539008 collections/msmarco-doc/queries-and-qrels/msmarco-doctrain-qrels.tsv
  367013 2551279 15480364 collections/msmarco-doc/queries-and-qrels/msmarco-doctrain-queries.tsv
  744412 4075890 23347952 total
```

There are indeed lots of training queries! In this guide, to save time, we are only going to perform retrieval on the dev queries. This can be accomplished as follows:

```
target/appassembler/bin/SearchCollection -topicreader TsvInt -index indexes/msmarco-doc/lucene-index.msmarco-doc.pos+docvectors+rawdocs \
 -topics collections/msmarco-doc/queries-and-qrels/msmarco-docdev-queries.tsv -output runs/run.msmarco-doc.dev.bm25.txt -bm25
```

On a modern desktop with an SSD, the run takes around 12 minutes. After the run completes, we can evaluate with trec_eval:

```
$ eval/trec_eval.9.0.4/trec_eval -c -mmap -mrecall.1000 collections/msmarco-doc/queries-and-qrels/msmarco-docdev-qrels.tsv runs/run.msmarco-doc.dev.bm25.txt
map                   	all	0.2310
recall_1000           	all	0.8856
```

Let's compare to the baselines provided by Microsoft. For a fair comparison, we restrict evaluation to the top 100 hits per topic, since the Microsoft baseline run contains only 100 hits per query:

```
$ eval/trec_eval.9.0.4/trec_eval -c -mmap -M 100 collections/msmarco-doc/queries-and-qrels/msmarco-docdev-qrels.tsv collections/msmarco-doc/queries-and-qrels/msmarco-docdev-top100
map                   	all	0.2219

$ eval/trec_eval.9.0.4/trec_eval -c -mmap -M 100 collections/msmarco-doc/queries-and-qrels/msmarco-docdev-qrels.tsv runs/run.msmarco-doc.dev.bm25.txt
map                   	all	0.2303
```

We see that "out of the box" Anserini is already better!

## BM25 Tuning

It is well known that BM25 parameter tuning is important. The above instructions use Anserini's (system-wide) defaults of `k1=0.9`, `b=0.4`.

Let's try to do better! We tuned BM25 on five different sets of 10k queries sampled from the training queries (drawn with the `shuf` command). Tuning was performed on each individual set via grid search over `k1` and `b` in tenth increments, optimizing for average precision (AP); the best parameter values were then averaged across all five sets (which has a regularizing effect). The tuned parameters from this approach are `k1=3.44`, `b=0.87`.
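
The tuning scripts themselves aren't reproduced here, but for each sample the procedure looks roughly like the sketch below. The sampled query/qrels filenames and the exact grid bounds are placeholders and assumptions, not artifacts from the actual tuning run:

```
# Draw one hypothetical 10k-query sample from the training queries:
shuf collections/msmarco-doc/queries-and-qrels/msmarco-doctrain-queries.tsv | head -10000 > queries.sample.tsv
# (qrels.sample.tsv would hold the matching subset of the training qrels.)

# Grid search over k1 and b in tenth increments, evaluating AP for each pair:
for k1 in $(seq 0.1 0.1 4.0); do
  for b in $(seq 0.1 0.1 1.0); do
    target/appassembler/bin/SearchCollection -topicreader TsvInt \
      -index indexes/msmarco-doc/lucene-index.msmarco-doc.pos+docvectors+rawdocs \
      -topics queries.sample.tsv \
      -output runs/run.sample.k1_${k1}.b_${b}.txt \
      -bm25 -bm25.k1 ${k1} -bm25.b ${b}
    echo -n "k1=${k1} b=${b} map="
    eval/trec_eval.9.0.4/trec_eval -c -mmap qrels.sample.tsv \
      runs/run.sample.k1_${k1}.b_${b}.txt | awk '{print $3}'
  done
done
```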

To perform a run with these parameters, issue the following command:

```
target/appassembler/bin/SearchCollection -topicreader TsvInt -index indexes/msmarco-doc/lucene-index.msmarco-doc.pos+docvectors+rawdocs \
 -topics collections/msmarco-doc/queries-and-qrels/msmarco-docdev-queries.tsv -output runs/run.msmarco-doc.dev.bm25.tuned.txt -bm25 -bm25.k1 3.44 -bm25.b 0.87
```
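
After the run completes, evaluate with the same trec_eval invocation as before, now pointing at the tuned run:

```
$ eval/trec_eval.9.0.4/trec_eval -c -mmap -mrecall.1000 collections/msmarco-doc/queries-and-qrels/msmarco-docdev-qrels.tsv runs/run.msmarco-doc.dev.bm25.tuned.txt
map                   	all	0.2788
recall_1000           	all	0.9326
```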

Here's the comparison between the Anserini default and tuned parameters:

| Setting                     |   AP   | Recall@1000 |
|:----------------------------|:------:|:-----------:|
| Default (`k1=0.9`, `b=0.4`) | 0.2310 | 0.8856      |
| Tuned (`k1=3.44`, `b=0.87`) | 0.2788 | 0.9326      |

As expected, BM25 tuning makes a big difference!

## Replication Log