Skip to content

Latest commit

 

History

History
184 lines (142 loc) · 14.1 KB

experiments-cord19.md

File metadata and controls

184 lines (142 loc) · 14.1 KB

This document describes various tools for working with the COVID-19 Open Research Dataset (CORD-19) from the Allen Institute for AI. For an easy way to get started, check out our Colab demos, also available here:

We provide instructions on how to build Lucene indexes for the collection using Anserini below, but if you don't want to bother building the indexes yourself, we have pre-built indexes that you can directly download:

If you don't want to build the index yourself, you can download the latest pre-built copies here:

Version Type Size Link Checksum
2020-05-26 Abstract 1.7G [Dropbox] 2dc054f4ca7db281e9f5e0d4836df14c
2020-05-26 Full-Text 3.3G [Dropbox] 9b9fd4b97f75fa295e3345d0cf7914e3
2020-05-26 Paragraph 4.7G [Dropbox] 72eb265c1c9983f02f1e79a2ba19befb

"Size" refers to the output of ls -lh, "Version" refers to the dataset release date from AI2. For our answer to the question, "which one should I use?" see below.

We've kept around older versions of the index for archival purposes — scroll all the way down to the bottom of the page to see those.

Data Prep

The latest distribution available is from 2020/05/26. First, download the data:

DATE=2020-05-26
DATA_DIR=./collections/cord19-"${DATE}"
mkdir "${DATA_DIR}"

wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/latest/document_parses.tar.gz -P "${DATA_DIR}"
wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/latest/metadata.csv -P "${DATA_DIR}"

ls "${DATA_DIR}"/document_parses.tar.gz | xargs -I {} tar -zxvf {} -C "${DATA_DIR}"
rm "${DATA_DIR}"/document_parses.tar.gz

Building Local Lucene Indexes

We can now index this corpus using Anserini. Currently, we have implemented three different variants, described below. For a sense of how these different methods stack up, refer to the following paper:

The tl;dr — we'd recommend getting started with abstract index since it's the smallest in size and easiest to manipulate. Paragraph indexing is likely to be more effective (i.e., better search results), but a bit more difficult to manipulate since some deduping is required to post-process the raw hits (since multiple paragraphs from the same article might be retrieved). The full-text index overly biases long documents and isn't really effective; this condition is included here only for completeness.

Note that as of TREC-COVID Round 1, there is some evidence that the abstract index is more effective for search, see results of experiments here.

Abstract

We can index abstracts (and titles, of course) with Cord19AbstractCollection, as follows:

sh target/appassembler/bin/IndexCollection \
  -collection Cord19AbstractCollection -generator Cord19Generator \
  -threads 8 -input "${DATA_DIR}" \
  -index indexes/lucene-index-cord19-abstract-"${DATE}" \
  -storePositions -storeDocvectors -storeContents -storeRaw -optimize > logs/log.cord19-abstract.${DATE}.txt

The log should end with something like this:

2020-05-27 11:17:25,530 INFO  [main] index.IndexCollection (IndexCollection.java:874) - Indexing Complete! 134,176 documents indexed
2020-05-27 11:17:25,531 INFO  [main] index.IndexCollection (IndexCollection.java:875) - ============ Final Counter Values ============
2020-05-27 11:17:25,531 INFO  [main] index.IndexCollection (IndexCollection.java:876) - indexed:          134,176
2020-05-27 11:17:25,531 INFO  [main] index.IndexCollection (IndexCollection.java:877) - unindexable:            0
2020-05-27 11:17:25,531 INFO  [main] index.IndexCollection (IndexCollection.java:878) - empty:                 24
2020-05-27 11:17:25,531 INFO  [main] index.IndexCollection (IndexCollection.java:879) - skipped:                6
2020-05-27 11:17:25,531 INFO  [main] index.IndexCollection (IndexCollection.java:880) - errors:                 0
2020-05-27 11:17:25,535 INFO  [main] index.IndexCollection (IndexCollection.java:883) - Total 134,176 documents indexed in 00:02:14

The contents field of each Lucene document is a concatenation of the article's title and abstract.

Full-Text

We can index the full text, with Cord19FullTextCollection, as follows:

sh target/appassembler/bin/IndexCollection \
  -collection Cord19FullTextCollection -generator Cord19Generator \
  -threads 8 -input "${DATA_DIR}" \
  -index indexes/lucene-index-cord19-full-text-"${DATE}" \
  -storePositions -storeDocvectors -storeContents -storeRaw -optimize > logs/log.cord19-full-text.${DATE}.txt

The log should end with something like this:

2020-05-27 11:24:30,636 INFO  [main] index.IndexCollection (IndexCollection.java:874) - Indexing Complete! 134,176 documents indexed
2020-05-27 11:24:30,637 INFO  [main] index.IndexCollection (IndexCollection.java:875) - ============ Final Counter Values ============
2020-05-27 11:24:30,637 INFO  [main] index.IndexCollection (IndexCollection.java:876) - indexed:          134,176
2020-05-27 11:24:30,637 INFO  [main] index.IndexCollection (IndexCollection.java:877) - unindexable:            0
2020-05-27 11:24:30,637 INFO  [main] index.IndexCollection (IndexCollection.java:878) - empty:                 24
2020-05-27 11:24:30,637 INFO  [main] index.IndexCollection (IndexCollection.java:879) - skipped:                6
2020-05-27 11:24:30,638 INFO  [main] index.IndexCollection (IndexCollection.java:880) - errors:                 0
2020-05-27 11:24:30,642 INFO  [main] index.IndexCollection (IndexCollection.java:883) - Total 134,176 documents indexed in 00:06:42

The contents field of each Lucene document is a concatenation of the article's title and abstract, and the full text JSON (if available).

Paragraph

We can build a paragraph index with Cord19ParagraphCollection, as follows:

sh target/appassembler/bin/IndexCollection \
  -collection Cord19ParagraphCollection -generator Cord19Generator \
  -threads 8 -input "${DATA_DIR}" \
  -index indexes/lucene-index-cord19-paragraph-"${DATE}" \
  -storePositions -storeDocvectors -storeContents -storeRaw -optimize > logs/log.cord19-paragraph.${DATE}.txt

The log should end with something like this:

2020-05-27 11:44:03,071 INFO  [main] index.IndexCollection (IndexCollection.java:874) - Indexing Complete! 2,353,190 documents indexed
2020-05-27 11:44:03,071 INFO  [main] index.IndexCollection (IndexCollection.java:875) - ============ Final Counter Values ============
2020-05-27 11:44:03,071 INFO  [main] index.IndexCollection (IndexCollection.java:876) - indexed:        2,353,190
2020-05-27 11:44:03,071 INFO  [main] index.IndexCollection (IndexCollection.java:877) - unindexable:            0
2020-05-27 11:44:03,071 INFO  [main] index.IndexCollection (IndexCollection.java:878) - empty:                 24
2020-05-27 11:44:03,071 INFO  [main] index.IndexCollection (IndexCollection.java:879) - skipped:            2,660
2020-05-27 11:44:03,072 INFO  [main] index.IndexCollection (IndexCollection.java:880) - errors:                 0
2020-05-27 11:44:03,076 INFO  [main] index.IndexCollection (IndexCollection.java:883) - Total 2,353,190 documents indexed in 00:18:05

In this configuration, the indexer creates multiple Lucene Documents for each source article:

  • docid: title + abstract
  • docid.00001: title + abstract + 1st paragraph
  • docid.00002: title + abstract + 2nd paragraph
  • docid.00003: title + abstract + 3rd paragraph
  • ...

The suffix of the docid, .XXXXX identifies which paragraph is being indexed. The original raw JSON full text is stored in the raw field of docid (without the suffix).

Pre-Built Indexes (All Versions)

All versions of pre-built indexes:

Version Type Size Link Checksum
2020-05-26 Abstract 1.7G [Dropbox] 2dc054f4ca7db281e9f5e0d4836df14c
2020-05-26 Full-Text 3.3G [Dropbox] 9b9fd4b97f75fa295e3345d0cf7914e3
2020-05-26 Paragraph 4.7G [Dropbox] 72eb265c1c9983f02f1e79a2ba19befb
2020-05-19 Abstract 1.7G [Dropbox] 37bb97d0c41d650ba8e135fd75ae8fd8
2020-05-19 Full-Text 3.3G [Dropbox] f5711915a66cd2b511e0fb8d03e4c325
2020-05-19 Paragraph 4.9G [Dropbox] 012ab1f804382b2275c433a74d7d31f2
2020-05-12 Abstract 1.3G [Dropbox] dfd09e70cd672bbe15a63437351e1f74
2020-05-12 Full-Text 2.5G [Dropbox] 5b914e8ae579195185cf28a60051236d
2020-05-12 Paragraph 3.6G [Dropbox] a2cb36762078ef9373f0ddaf52618e7f
2020-05-01 Abstract 1.2G [Dropbox] a06e71a98a68d31148cb0e97e70a2ee1
2020-05-01 Full-Text 2.4G [Dropbox] e7eca1b976cdf2cd80e908c9ac2263cb
2020-05-01 Paragraph 3.6G [Dropbox] 8f9321757a03985ac1c1952b2fff2c7d
2020-04-24 Abstract 1.3G [Dropbox] 93540ae00e166ee433db7531e1bb51c8
2020-04-24 Full-Text 2.4G [Dropbox] fa927b0fc9cf1cd382413039cdc7b736
2020-04-24 Paragraph 5.0G [Dropbox] 7c6de6298e0430b8adb3e03310db32d8
2020-04-17 Abstract 1.2G [Dropbox] d57b17eadb1b44fc336b4121c139a598
2020-04-17 Full-Text 2.2G [Dropbox] 677546e0a1b7855a48eee8b6fbd7d7af
2020-04-17 Paragraph 4.7G [Dropbox] c11e46230b744a46747f84e49acc9c2b
2020-04-10 Abstract 1.2G [Dropbox] ec239d56498c0e7b74e3b41e1ce5d42a
2020-04-10 Full-Text 3.3G [Dropbox] 401a6f5583b0f05340c73fbbeb3279c8
2020-04-10 Paragraph 3.4G [Dropbox] 8b87a2c55bc0a15b87f11e796860216a
2020-04-03 Abstract 1.1G [Dropbox] 5d0d222e746d522a75f94240f5ab9f23
2020-04-03 Full-Text 3.0G [Dropbox] 9aafb86fec39e0882bd9ef0688d7a9cc
2020-04-03 Paragraph 3.1G [Dropbox] 523894cfb52fc51c4202e76af79e1b10
2020-03-27 Abstract 1.1G [Dropbox] c5f7247e921c80f41ac6b54ff38eb229
2020-03-27 Full-Text 2.9G [Dropbox] 3c126344f9711720e6cf627c9bc415eb
2020-03-27 Paragraph 3.1G [Dropbox] 8e02de859317918af4829c6188a89086
2020-03-20 Abstract 1.0G [Dropbox] 281c632034643665d52a544fed23807a
2020-03-20 Full-Text 2.6G [Dropbox] 30cae90b85fa8f1b53acaa62413756e3
2020-03-20 Paragraph 2.9G [Dropbox] 4c78e9ede690dbfac13e25e634c70ae4

Known Issues

  • Release of 2020/05/19: Missing URLs for several articles due to a known issue with the CORD-19 dataset release.