ML and NLP paper data

This repository contains the data crawled and processed for the post series on ML and NLP publications.

The project was created by Marek Rei (@MarekRei). The country annotation was contributed by Jonas Pfeiffer (@PfeiffJo) and Andrew Caines (@cainesap).

Conference proceedings

The papers directory contains JSON files for each of the crawled conferences. Open any of them to see which metadata fields are available.
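As a starting point for that inspection, here is a minimal Python sketch that reports the top-level shape of each file without assuming any particular schema. It assumes the files end in .json and that each one holds a single JSON document; the function names summarize_json and summarize_dir are illustrative and not part of this repository.

```python
import json
from pathlib import Path


def summarize_json(path):
    """Return (top-level type name, number of entries) for one JSON file."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    # Lists and dicts have a meaningful length; treat anything else as one value.
    size = len(data) if isinstance(data, (list, dict)) else 1
    return type(data).__name__, size


def summarize_dir(directory):
    """Map each *.json file name in `directory` to its summary tuple."""
    return {
        p.name: summarize_json(p)
        for p in sorted(Path(directory).glob("*.json"))
    }
```

Calling summarize_dir("papers") from the repository root should give a quick overview of how each file is organized before you write code against specific fields.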

Country annotation

annotated_orgs.tsv contains the following columns in tab-separated format:

  • id
  • org_name - the name of the organization, as crawled
  • paper_count - the number of papers that matched that name, after initial processing
  • is_org - manually annotated field, indicating whether this is an actual organization or crawling noise
  • canonical_org_name - a canonical name for this organization, to match together different versions
  • country - manually annotated country name for each organization
  • example1 - an example paper from which this organization name was crawled
  • example2 - another example
  • example3 - another example
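
The columns above can be read with Python's standard csv module. The sketch below totals paper_count per country. It assumes the file starts with a header row that uses exactly these column names and that paper_count holds plain integers; neither is confirmed here, so check the file first. The sample rows are made up for illustration and do not come from the dataset, and load_orgs and papers_per_country are hypothetical helper names.

```python
import csv
import io
from collections import Counter

# Column names as listed in this README (assumed to match the file's header row).
HEADER = ["id", "org_name", "paper_count", "is_org", "canonical_org_name",
          "country", "example1", "example2", "example3"]

# Made-up rows for illustration only; not taken from annotated_orgs.tsv.
SAMPLE_ROWS = [
    ["1", "Example Univ", "10", "1", "Example University", "CountryA", "", "", ""],
    ["2", "Example University", "5", "1", "Example University", "CountryA", "", "", ""],
    ["3", "Sample Institute", "7", "1", "Sample Institute", "CountryB", "", "", ""],
    ["4", "crawl noise", "2", "0", "", "", "", "", ""],
]
sample_tsv = "\n".join("\t".join(row) for row in [HEADER] + SAMPLE_ROWS) + "\n"


def load_orgs(handle):
    """Read tab-separated rows into dicts keyed by the header names."""
    return list(csv.DictReader(handle, delimiter="\t"))


def papers_per_country(rows):
    """Sum paper_count per country, skipping rows without a country or a numeric count."""
    totals = Counter()
    for row in rows:
        country = (row.get("country") or "").strip()
        count = (row.get("paper_count") or "").strip()
        if country and count.isdigit():
            totals[country] += int(count)
    return totals


rows = load_orgs(io.StringIO(sample_tsv))
totals = papers_per_country(rows)
```

To run this on the real data, pass a file handle opened with open("annotated_orgs.tsv", newline="", encoding="utf-8") to load_orgs instead of the in-memory sample. Grouping by canonical_org_name rather than country works the same way.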

License

This dataset is made available under the CC BY-NC 4.0 license.