arXiv data and code

This repo contains arXiv source data and the associated code for preprocessing, labeling, and partitioning it. The source data are stored under data/source as gzipped JSONL files.

After setting up a Python environment, run:

python runner.py 'data/source/arxiv-data-20200125-split*.jsonl.gz'

This writes a preprocessed corpus to data/processed and various partitions and samples for training to data/train.
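
To inspect the gzipped JSONL files directly, a minimal sketch like the following may help. It is not part of this repo's code: the helper name iter_records is hypothetical, and it assumes each non-empty line holds one JSON object. No particular record schema is assumed, so the example only prints field names.

```python
import glob
import gzip
import json


def iter_records(pattern):
    """Yield one parsed JSON value per non-empty line from every
    gzipped JSONL file matching the glob pattern (hypothetical helper)."""
    for path in sorted(glob.glob(pattern)):
        # "rt" opens the gzip stream in text mode, so each line is a str.
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:
                    yield json.loads(line)


if __name__ == "__main__":
    # Same pattern as the runner.py command above; expanded here by glob,
    # not by the shell.
    pattern = "data/source/arxiv-data-20200125-split*.jsonl.gz"
    for i, record in enumerate(iter_records(pattern)):
        # Print field names only, since the schema is not assumed here.
        print(sorted(record))
        if i >= 2:
            break
```

Because the function is a generator, files are decompressed and parsed one line at a time rather than loaded into memory at once.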