ChtbSplit

The Penn Chinese Treebank Project provides annotated word segmentation, part-of-speech tagging, and constituency trees for Chinese.

Penn Chinese Treebank 6.0

This corpus (LDC2007T36) contains a file named "list-of-files.pdf", which includes a recommended split of the data into training, development, and test sets. Because the list is a PDF, however, extracting the file IDs from it is not easy. I manually copied the file IDs into separate plain-text files, which may make further data preparation easier.

chtb_v6.split.txt: the data split information
chtb_v6.trn.idx: the file indices for the training set
chtb_v6.dev.idx: the file indices for the development set
chtb_v6.tst.idx: the file indices for the test set
chtb_v6.mis.idx: the indices of the missing files, which belong to the training set
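
As an illustration of how these index files might be consumed, here is a minimal Python sketch. It assumes each .idx file holds one file ID per line; that layout is an assumption and is not documented above, so adjust `read_ids` if the files are laid out differently. The function names `read_ids`, `load_splits`, and `check_disjoint` are hypothetical helpers, not part of this repository.

```python
from pathlib import Path

# File names as listed in this README; the split labels are our own choice.
SPLIT_FILES = {
    "train": "chtb_v6.trn.idx",
    "dev": "chtb_v6.dev.idx",
    "test": "chtb_v6.tst.idx",
}


def read_ids(path):
    """Return the non-empty lines of an index file, stripped of whitespace.

    Assumption: one file ID per line.
    """
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def load_splits(directory="."):
    """Load the three index files from `directory` into a dict of ID lists."""
    base = Path(directory)
    return {split: read_ids(base / name) for split, name in SPLIT_FILES.items()}


def check_disjoint(splits):
    """Raise ValueError if any file ID appears in more than one split."""
    seen = {}
    for split, ids in splits.items():
        for file_id in ids:
            if file_id in seen and seen[file_id] != split:
                raise ValueError(
                    f"ID {file_id} appears in both {seen[file_id]} and {split}"
                )
            seen[file_id] = split
```

Calling `load_splits()` in a checkout of this repository and passing the result to `check_disjoint` gives a quick sanity check before the IDs are mapped to treebank files.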

Note: Files with IDs "0886--0899" are listed in the PDF file but are missing from the treebank. They belong to the training set.
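
Because the missing files are listed in the PDF but cannot be opened, it may be convenient to filter them out of the training IDs before reading the treebank. The sketch below expands the "0886--0899" range from the note and removes those IDs from a list. The helpers `expand_range` and `drop_missing` are hypothetical, and the assumption that training IDs are zero-padded four-digit strings is taken from the form of the range in the note rather than from the index files themselves.

```python
def expand_range(spec, sep="--"):
    """Expand a range such as "0886--0899" into a list of zero-padded IDs.

    Both endpoints are included; the padding width follows the start value.
    """
    start, end = spec.split(sep)
    width = len(start)
    return [str(i).zfill(width) for i in range(int(start), int(end) + 1)]


def drop_missing(train_ids, missing_ids):
    """Return train_ids without any ID found in missing_ids, keeping order."""
    missing = set(missing_ids)
    return [file_id for file_id in train_ids if file_id not in missing]


# The range quoted in the note above covers 14 file IDs.
missing = expand_range("0886--0899")
```

If chtb_v6.mis.idx is available, its contents could be passed to `drop_missing` directly instead of expanding the range by hand.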

huangyunict/ChtbSplit
