The Penn Chinese Treebank Project provides annotated word segmentation, part-of-speech tagging, and constituency trees for Chinese.
This corpus (LDC2007T36) contains a file named "list-of-files.pdf", which includes a recommended split of the data into training, development, and test sets. Because it is a PDF file, however, the file IDs are not easy to extract from it. I manually copied the file IDs into separate plain-text files, which may simplify further data preparation.
- chtb_v6.split.txt: the data split information
- chtb_v6.trn.idx: the file indices for training set
- chtb_v6.dev.idx: the file indices for development set
- chtb_v6.tst.idx: the file indices for test set
- chtb_v6.mis.idx: the indices of the missing files, which belong to the training set
Note: Files with IDs "0886--0899" are listed in the PDF file but are missing from the treebank. They belong to the training set.
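
As a rough illustration of how the index files might be consumed, the following Python sketch loads the three splits and drops any missing IDs from the training set. It assumes each .idx file holds whitespace-separated file IDs (for example, one per line), which is an assumption about the layout rather than something stated above. The helper names `read_ids` and `load_split` are hypothetical, and filtering is harmless if chtb_v6.trn.idx already excludes the missing IDs.

```python
from pathlib import Path


def read_ids(path):
    """Return the whitespace-separated tokens of an index file as a list.

    Assumption: the file contains nothing but file IDs separated by
    whitespace (spaces or newlines).
    """
    return Path(path).read_text(encoding="utf-8").split()


def load_split(directory="."):
    """Load the trn/dev/tst ID lists, removing missing IDs from training."""
    directory = Path(directory)
    missing = set(read_ids(directory / "chtb_v6.mis.idx"))
    split = {
        name: read_ids(directory / f"chtb_v6.{name}.idx")
        for name in ("trn", "dev", "tst")
    }
    # The missing files belong to the training set, so only that list
    # needs filtering.
    split["trn"] = [i for i in split["trn"] if i not in missing]
    return split
```

Calling `load_split()` in the directory that holds the index files would return a dictionary keyed by "trn", "dev", and "tst". Mapping each ID to a treebank file name is left out on purpose, since that depends on how the corpus is unpacked locally.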