nLab corpus

This repository contains a "cleaned" version of the contents of the nLab (as of c. December 2020), intended for use as a training corpus for various machine learning projects. The cleaning process strips out any non-textual elements (such as bullet points) and converts the LaTeX mathematics into Unicode wherever possible. (A different, more recent version of the nLab corpus can be found at https://github.com/ToposInstitute/nLab2024-corpus.)

  • nlab_plain_normalized.txt is the concatenation of all the pages into one large text file.
  • nlab_plain.json has the same content as the plaintext file, but is organised into key-value pairs, with the key being the title of the page, and the value being its contents.
  • nlab_stats.json contains some basic statistics about the corpus, generated by spaCy.
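
As a minimal sketch of how the JSON file might be read, the snippet below loads nlab_plain.json into a Python dictionary and lists the longest pages. It assumes only what is described above (a single JSON object mapping page titles to page contents); the helper names `load_pages` and `longest_pages` are hypothetical and not part of this repository.

```python
import json
from pathlib import Path


def load_pages(path):
    """Return the corpus as a dict mapping page title to page text.

    Assumes the file is one JSON object of title -> contents, as
    described in the file list above.
    """
    with Path(path).open(encoding="utf-8") as f:
        pages = json.load(f)
    if not isinstance(pages, dict):
        raise ValueError("expected a JSON object of title -> contents")
    return pages


def longest_pages(pages, n=5):
    """Return the n (title, character count) pairs with the most text."""
    sizes = [(title, len(str(text))) for title, text in pages.items()]
    return sorted(sizes, key=lambda item: item[1], reverse=True)[:n]
```

Calling `longest_pages(load_pages("nlab_plain.json"))` from the repository root should then give a quick overview of the largest pages. Note that `json.load` reads the entire file into memory.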

The original version of the corpus has been used in two prototypes. The first, Parmesan 0.1 (extracted using Collard et al.'s root- and rule-based method), could be explored at http://18.222.108.184:8080/0. It was superseded by Parmesan 0.2, available at http://www.jacobcollard.com/parmesan2/. See the description at http://www.jacobcollard.com/parmesan2/about.

For licensing information for the nLab, see the nLab licence.

Corpus Statistics

There are two types of part-of-speech tags in the corpus statistics, both generated by spaCy. The first tagset, labeled "pos" in nlab_stats.json, represents coarse-grained parts of speech and is taken from the Universal POS tag set. The second tagset, "tag", is specific to spaCy's pretrained English model.

Details about the different tagsets, as well as other label schemes for this model, can be found on spaCy's website.