title: "Parsing the _Jingdian Shiwen_"
description: |
[![Open in Streamlit](https://static.streamlit.io/badges/streamlit_badge_black_white.svg)](https://direct-phonology-jdsw-scriptsvisualize-0px83h.streamlit.app/)
This project is an attempt to convert the annotations compiled by the Tang dynasty scholar [Lu Deming (陸德明)](https://en.wikipedia.org/wiki/Lu_Deming) in the [_Jingdian Shiwen_ (經典釋文)](https://en.wikipedia.org/wiki/Jingdian_Shiwen) into a structured form that separates phonology, glosses, and references to secondary sources. A [spaCy](https://spacy.io/) pipeline is configured to parse and tag the annotations, and [Prodigy](https://prodi.gy/) is used for guided annotation of the training data. The project is part of a broader effort to build a linguistic model of [Old Chinese (上古漢語)](https://en.wikipedia.org/wiki/Old_Chinese) that incorporates phonology.
## Data
The _Jingdian Shiwen_ comprises Lu's annotations on most of the ["Thirteen Classics" (十三經)](https://en.wikipedia.org/wiki/Thirteen_Classics) of the Confucian tradition, as well as some Daoist texts. We use the edition of the _Jingdian Shiwen_ found in the [_Collectanea of the Four Categories_ (四部叢刊)](http://www.chinaknowledge.de/Literature/Poetry/sibucongkan.html), which includes high-quality lithographic reproductions of many ancient texts. The annotations given in the _Jingdian Shiwen_ are paired with the source texts to which they apply; for this we predominantly use the definitive (正文) editions published by the [Kanseki Repository](https://www.kanripo.org/).
|work|title|source|_Jingdian Shiwen_ chapters (卷)|
|-|-|-|-|
|周易|[_Book of Changes_](https://en.wikipedia.org/wiki/I_Ching)|[KR1a0001](https://github.com/kanripo/KR1a0001)|2|
|尚書|[_Book of Documents_](https://en.wikipedia.org/wiki/Book_of_Documents)|[KR1b0001](https://github.com/kanripo/KR1b0001)|3-4|
|毛詩|[_Mao Commentary_](https://en.wikipedia.org/wiki/Mao_Commentary) on the [_Book of Odes_](https://en.wikipedia.org/wiki/Classic_of_Poetry)|[KR1c0001](https://github.com/kanripo/KR1c0001)|5-7|
|周禮|[_Rites of Zhou_](https://en.wikipedia.org/wiki/Rites_of_Zhou)|[KR1d0001](https://github.com/kanripo/KR1d0001)|8-9|
|儀禮|[_Etiquette and Ceremonial_](https://en.wikipedia.org/wiki/Etiquette_and_Ceremonial)|CH1e0873*|10|
|禮記|[_Book of Rites_](https://en.wikipedia.org/wiki/Book_of_Rites)|[KR1d0052](https://github.com/kanripo/KR1d0052)|11-14|
|春秋左傳|[_Commentary of Zuo_](https://en.wikipedia.org/wiki/Zuo_Zhuan) on the [_Spring and Autumn Annals_](https://en.wikipedia.org/wiki/Spring_and_Autumn_Annals)|[KR1e0001](https://github.com/kanripo/KR1e0001)|15-20|
|春秋公羊傳|[_Commentary of Gongyang_](https://en.wikipedia.org/wiki/Gongyang_Zhuan) on the [_Spring and Autumn Annals_](https://en.wikipedia.org/wiki/Spring_and_Autumn_Annals)|CH1e0877*|21|
|春秋穀梁傳|[_Commentary of Guliang_](https://en.wikipedia.org/wiki/Guliang_Zhuan) on the [_Spring and Autumn Annals_](https://en.wikipedia.org/wiki/Spring_and_Autumn_Annals)|[KR1e0008](https://github.com/kanripo/KR1e0008)|22|
|孝經|[_Classic of Filial Piety_](https://en.wikipedia.org/wiki/Classic_of_Filial_Piety)|[KR1f0001](https://github.com/kanripo/KR1f0001)|23|
|論語|[_Analects of Confucius_](https://en.wikipedia.org/wiki/Analects)|[KR1h0004](https://github.com/kanripo/KR1h0004)|24|
|老子|[_Laozi_](https://en.wikipedia.org/wiki/Tao_Te_Ching)|[KR5c0057](https://github.com/kanripo/KR5c0057)|25|
|莊子|[_Zhuangzi_](https://en.wikipedia.org/wiki/Zhuangzi_(book))|[KR5c0126](https://github.com/kanripo/KR5c0126)|26-28|
*This data is sourced with permission from the [China Ancient Texts (CHANT) database](https://www.cuhk.edu.hk/ics/rccat/en/database.html).
We omit chapter 1 of the _Jingdian Shiwen_, corresponding to the [_Erya_ (爾雅)](https://en.wikipedia.org/wiki/Erya). All digital sources have been preprocessed to remove punctuation, whitespace, and non-Chinese characters. Kanseki Repository data is generously licensed CC-BY.
After processing, the labeled output data is saved in JSON-lines (`.jsonl`) format, to be used for machine learning, natural language processing, and other computational applications.
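For example, once the `export` command defined below has been run, you can inspect the first exported record with:
```sh
# pretty-print the first record of the exported annotations
head -n 1 data/spans.jsonl | python -m json.tool
```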
## Annotating
To annotate training data, you need spaCy installed in your Python environment:
```sh
pip install spacy
```
You also need a copy of [Prodigy](https://prodi.gy/). Once you have downloaded the appropriate wheel for your platform, install it with:
```sh
# example: prodigy version 1.11.8 for python 3.10 on windows
pip install prodigy-1.11.8-cp310-cp310-win_amd64.whl
```
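To check that the install worked, you can print Prodigy's version and configuration:
```sh
python -m prodigy stats
```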
Then, download the project assets:
```sh
spacy project assets
```
Install the Python dependencies needed for annotation:
```sh
spacy project run install
```
Then, choose a task (see `commands` below) and invoke it, e.g.:
```sh
# annotate spans by correcting predictions
spacy project run annotate-spans
```
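Once you have annotated enough data, the remaining commands defined below export it, train the spancat pipeline, and evaluate it:
```sh
# export annotations from prodigy, train the pipeline, and evaluate it
spacy project run export
spacy project run train
spacy project run eval
```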
# Variables can be referenced across the project.yml using ${vars.var_name}
vars:
corpus: "annotations-large"
embedding: "tok2vec" # tok2vec, trf
suggester: "ngram" # ngram, span_finder
transformer_model_name: "KoichiYasuoka/roberta-classical-chinese-base-char"
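# -1 trains on the CPU; set to a CUDA device index (e.g. 0) to train on a GPU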
gpu_id: -1
config_file: "configs/${vars.suggester}/config_${vars.embedding}.cfg"
spancat_model: "training/${vars.embedding}_${vars.suggester}/model-best"
# These are the directories that the project needs. The project CLI will make
# sure that they always exist.
directories: ["assets", "configs", "data", "metrics", "packages", "scripts", "training"]
# Assets that should be downloaded or available in the directory. Remote assets
# will be downloaded if not present locally.
assets:
- dest: "assets/docs.csv"
description: "Table mapping each chapter in a source text to its location in the _Jingdian Shiwen_"
url:
- dest: "assets/variants.json"
description: "Equivalency table for graphic variants of characters"
url:
- dest: "assets/treebank"
description: "Universal Dependencies treebank for Classical Chinese"
git:
repo: "https://github.com/UniversalDependencies/UD_Classical_Chinese-Kyoto"
branch: master
path: ""
# Workflows are series of commands that are run in order and often depend on
# each other.
# workflows: []
# Project commands, specified in a style similar to CI config files (e.g. Azure
# pipelines). The name is the command name that lets you trigger the command
# via "spacy project run [command] [path]". The help message is optional and
# shown when executing "spacy project run [optional command] [path] --help".
commands:
- name: "install"
help: "Install dependencies"
script:
- "python -m pip install -r requirements.txt"
- name: "annotate-spans"
help: "Annotate spans by correcting predictions based on heuristics"
script:
- "python -m prodigy jdsw.spans.correct spans assets/${vars.corpus}.jsonl -F scripts/recipes/spancat.py"
- name: "export"
help: "Export training data from prodigy's database for use with spaCy"
script:
- "python -m prodigy db-out spans data"
- "python -m prodigy data-to-spacy data --lang zh --config ${vars.config_file} --spancat spans"
deps:
- "${vars.config_file}"
outputs:
- "data/spans.jsonl"
- "data/dev.spacy"
- "data/train.spacy"
- name: "train"
help: "Train a spaCy pipeline"
script:
- "python -m spacy train ${vars.config_file} --output training/${vars.embedding}_${vars.suggester}/ --paths.train data/train.spacy --paths.dev data/dev.spacy --gpu-id ${vars.gpu_id} --vars.transformer_model_name ${vars.transformer_model_name}"
deps:
- "${vars.config_file}"
- "data/train.spacy"
- "data/dev.spacy"
outputs:
- "${vars.spancat_model}"
- name: "eval"
help: "Evaluate the trained spaCy pipeline's accuracy and speed using test data"
script:
- "python -m spacy benchmark accuracy ${vars.spancat_model} data --gpu-id ${vars.gpu_id} --output metrics/accuracy.json"
- "python -m spacy benchmark speed ${vars.spancat_model} data --gpu-id ${vars.gpu_id}"