Publicly available scholarly article collections for NLP/IR applications
Name and Link | Description | Contents | Notes | Data |
---|---|---|---|---|
arXiv Dataset | A metadata dataset covering 1.7 million arXiv articles. | Provides only a metadata file in JSON format, with one entry per paper containing: * id: arXiv ID (can be used to access the paper) * submitter: who submitted the paper * authors: authors of the paper * title: title of the paper * comments: additional information, such as the number of pages and figures * journal-ref: information about the journal the paper was published in * doi: Digital Object Identifier * abstract: the abstract of the paper * categories: categories / tags in the arXiv system * versions: the version history | | |
TREC 2019 Fair Ranking Track | Academic search task. The corpus for this project is the Semantic Scholar (S2) Open Corpus from the Allen Institute for Artificial Intelligence. | A large corpus of 81.1M English-language academic papers spanning many academic disciplines. It consists of rich metadata, paper abstracts, and resolved bibliographic references, as well as structured full text for 8.1M open-access papers. Full text is annotated with automatically detected inline mentions of citations, figures, and tables, each linked to its corresponding paper object. | Requires permission to access the data | |
Explicit Semantic Ranking Dataset | Dataset for the paper "Explicit Semantic Ranking for Academic Search via Knowledge Graph Embedding". | It includes: * the query log used in the paper * relevance judgements for the queries * ranking lists from Semantic Scholar * candidate documents * entity embeddings trained using the knowledge graph, plus baselines, development methods, and alternative methods from the experiments | Publicly Available | |
AMiner Citation Network Dataset: DBLP+Citation, ACM Citation network | Citation data extracted from DBLP, ACM, MAG (Microsoft Academic Graph), and other sources. The first version contains 629,814 papers and 632,752 citations. | Each paper is associated with an abstract, authors, year, venue, and title. | Publicly Available | |
CORE | Provides access to 89M free-to-read full-text research papers, with 29M full texts hosted directly | Metadata and full-text research articles from thousands of data providers. On top of this continuously growing corpus | API | |
Semantic Scholar Academic Graph | A multidisciplinary knowledge graph in which scientific papers and their authors are linked through paper-to-paper citations | Metadata from scholarly articles | API | |
CiteSeer | An evolving scientific literature digital library and search engine that has focused primarily on the literature in computer and information science | Metadata, databases, and datasets of PDF files and of the text of those PDF files. | Requires permission to access the shared folders on Google Drive | |
Isearch | An information retrieval (IR) test collection to facilitate the evaluation of integrated search, i.e. search across a range of different sources but with one search box and one ranked result list | Approx. 18,000 monographic records, 160,000 papers and journal articles in PDF, and 275,000 abstracts with a varied set of metadata and vocabularies from the physics domain; 65 topics based on real work tasks, with corresponding graded relevance assessments. | | |
English Wikipedia | Wikipedia offers free copies of all available content to interested users. These databases can be used for mirroring, personal use, informal backups, offline use or database queries (such as for Wikipedia:Maintenance). | The dataset has abstracts of articles from the English Wikipedia (202,383 documents) as well as titles and URLs. | All text content is multi-licensed under the Creative Commons Attribution-ShareAlike 3.0 License (CC-BY-SA) and the GNU Free Documentation License (GFDL) | data 795 MB |
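
The arXiv row above lists the fields of each metadata entry. As a minimal sketch of how such entries might be read, the Python snippet below assumes the metadata file stores one JSON object per line; the table only says "json format", so this layout is an assumption. The helper name `iter_records` and the sample record are hypothetical and invented for illustration; they are not part of the dataset or its tooling.

```python
import io
import json
from typing import Iterable, Iterator

# Field names taken from the arXiv row of the table.
FIELDS = (
    "id", "submitter", "authors", "title", "comments",
    "journal-ref", "doi", "abstract", "categories", "versions",
)


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty line, keeping only the listed fields.

    Assumes one JSON object per line (an assumption about the file layout).
    Fields missing from an entry are returned as None.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        raw = json.loads(line)
        yield {name: raw.get(name) for name in FIELDS}


# Hypothetical record, made up for this example; "doi" is left out on
# purpose to show how a missing field is handled.
SAMPLE = io.StringIO(
    json.dumps({
        "id": "0000.00000",
        "submitter": "A. Submitter",
        "authors": "A. Submitter and B. Coauthor",
        "title": "An Example Title",
        "comments": "10 pages, 2 figures",
        "journal-ref": None,
        "abstract": "An example abstract.",
        "categories": "cs.IR cs.CL",
        "versions": [{"version": "v1"}],
    }) + "\n"
)

records = list(iter_records(SAMPLE))
print(records[0]["id"], records[0]["title"], records[0]["doi"])
```

Reading line by line keeps memory use flat for a file with this many entries. With the real file, an open file handle would take the place of `SAMPLE`, provided the one-object-per-line assumption holds.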