arXiv data and code

This repo contains arXiv source data and the associated code for preprocessing, labeling, and partitioning it. The source data are stored under data/source as gzipped JSONL files.
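
As a minimal sketch of how the gzipped JSONL files can be inspected, assuming each line holds one JSON object (the record fields are not documented here, and `read_jsonl_gz` is a hypothetical helper, not part of this repo's code):

```python
import gzip
import json
from pathlib import Path
from typing import Iterator


def read_jsonl_gz(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a gzipped JSONL file."""
    # "rt" opens the gzip stream in text mode so we can iterate line by line.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


if __name__ == "__main__":
    # Print the field names of the first record in each source file.
    for source_file in sorted(Path("data/source").glob("*.jsonl.gz")):
        first = next(read_jsonl_gz(str(source_file)), None)
        print(source_file.name, sorted(first) if first else "(empty)")
```

Reading lazily with a generator avoids loading a whole split into memory, which may matter if the files are large.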

After setting up a Python environment, run:

python runner.py 'data/source/arxiv-data-20200125-split*.jsonl.gz'

Running this produces a preprocessed corpus under data/processed, plus various partitions and samples for training under data/train.
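
Note that the pattern in the command above is quoted, so the shell passes it to the script unexpanded. How runner.py handles the argument is not shown here. As a sketch only, a script that accepts such a pattern might expand it along these lines (`expand_sources` is a hypothetical name, not taken from this repo):

```python
import glob


def expand_sources(pattern: str) -> list[str]:
    """Return the sorted list of paths matching a shell-style glob pattern."""
    # Sorting makes the processing order deterministic across runs.
    return sorted(glob.glob(pattern))


if __name__ == "__main__":
    # The pattern below is the one from the command in this README.
    for path in expand_sources("data/source/arxiv-data-20200125-split*.jsonl.gz"):
        print(path)
```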
