This repository contains the data crawled and processed for the post series on ML and NLP publications.
The project was created by Marek Rei (@MarekRei). The country annotation was contributed by Jonas Pfeiffer (@PfeiffJo) and Andrew Caines (@cainesap).
The papers directory contains JSON files for each of the crawled conferences. Open any of them to see which metadata fields are available.
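As a quick way to look inside, here is a minimal Python sketch that parses every JSON file in the papers directory and prints the top-level shape of each one. It makes no assumption about the schema; the function names `load_conference_files` and `describe` are illustrative and not part of this repository.

```python
import json
from pathlib import Path


def load_conference_files(papers_dir="papers"):
    """Parse every *.json file in papers_dir, keyed by file name without extension."""
    loaded = {}
    for path in sorted(Path(papers_dir).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            loaded[path.stem] = json.load(handle)
    return loaded


def describe(data):
    """Summarise the top-level shape of a parsed JSON value without assuming a schema."""
    if isinstance(data, dict):
        # Show at most ten keys so large objects stay readable.
        return "object with keys: " + ", ".join(sorted(map(str, data))[:10])
    if isinstance(data, list):
        return f"list of {len(data)} items"
    return type(data).__name__


# Only runs when executed from the repository root, where papers/ exists.
if Path("papers").is_dir():
    for name, data in load_conference_files().items():
        print(name, "->", describe(data))
```

Run from the repository root, this prints one line per conference file, which should be enough to decide which fields to read next.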
annotated_orgs.tsv contains the following columns in tab-separated format:
- id
- org_name - the name of the organization, as crawled
- paper_count - the number of papers that matched that name, after initial processing
- is_org - manually annotated field, indicating whether this is an actual organization or crawling noise
- canonical_org_name - a canonical name for this organization, used to group different variants of the same name
- country - manually annotated country name for each organization
- example1 - an example paper where this organization was crawled from
- example2 - a second example paper
- example3 - a third example paper
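The columns above can be read with Python's standard csv module. The sketch below sums paper_count per country. It assumes the file starts with a header row using exactly the column names listed above and that paper_count parses as an integer; neither is confirmed here, so check the file first. The function name `papers_per_country` is illustrative.

```python
import csv
from collections import Counter


def papers_per_country(tsv_path="annotated_orgs.tsv"):
    """Sum paper_count per country, skipping rows without a country or a numeric count."""
    totals = Counter()
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        # Assumption: the first line is a header with the column names listed above.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            country = (row.get("country") or "").strip()
            if not country:
                continue  # no country annotation for this row
            try:
                totals[country] += int(row["paper_count"])
            except (KeyError, TypeError, ValueError):
                continue  # missing or non-numeric paper_count
    return totals
```

Calling `papers_per_country().most_common(10)` from the repository root would list the ten countries with the largest totals. These totals add up per-organization counts, so a paper with several affiliated organizations may be counted more than once.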
This dataset is made available under the CC BY-NC 4.0 license.