mbingenheimer/cbetaCorpusSorted

CBETA Corpus Sorted

These are files from the CBETA Corpus v. 2021 (available at https://github.com/DILA-edu/CBETA_TAFxml), sorted into Indian-Chinese and Chinese-Chinese texts in preparation for various NLP tasks that compare texts translated or produced in China between 500 and 800 CE.

2021-07: Sorted c. 660 Indian-Chinese and c. 290 Chinese-Chinese texts translated or written between 500 and 800 CE. The Indian-Chinese collection contains more files but less text overall; the files in the Chinese-Chinese collection are longer on average.

2022-05: Transformed the XML files from the CBETA GitHub repository into clean text files for training and analysis (see the two tar.bz2 archives).
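
For readers who want to reproduce a similar conversion, the following is a minimal sketch of how an XML file could be reduced to clean text. It is not the script used for this repository. It assumes the input is TEI XML in the namespace `http://www.tei-c.org/ns/1.0`, that only the content of the `body` element is wanted, and that all whitespace can be dropped because the text is Chinese. The function name `tei_to_clean_text` is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: the source files are TEI XML using the standard TEI namespace.
TEI_NS = "{http://www.tei-c.org/ns/1.0}"


def tei_to_clean_text(xml_string):
    """Return the text of the TEI <body>, with markup and whitespace removed.

    Hypothetical helper for illustration; the actual cleaning rules used
    for the archives in this repository may differ.
    """
    root = ET.fromstring(xml_string)
    # Skip the header by looking only at <body>; fall back to the whole
    # document if no <body> element is found.
    body = root.find(f".//{TEI_NS}body")
    if body is None:
        body = root
    # itertext() yields the text and tail of every nested element,
    # so inline tags such as <lb/> are dropped but their context is kept.
    text = "".join(body.itertext())
    # Assumption: whitespace carries no meaning in the Chinese text.
    return re.sub(r"\s+", "", text)
```

To apply it to a file, read the file with `Path(name).read_bytes()` and pass the result to `tei_to_clean_text`. Note that this sketch keeps the text of every element inside `body`, including notes and apparatus entries, which a real cleaning step may want to filter out.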
