1 parent 7acc6a8, commit 1457e70
Showing 2 changed files with 58 additions and 0 deletions.
@@ -0,0 +1,56 @@
# Text Analysis with HGCT

This repository showcases the [HGCT (Hanzi Glyph Corpus Toolkit)](https://yongfu.name/hgct/index.html), a tool for analyzing and querying Chinese text data, using a corpus built from textbook texts as the demonstration.

## Installation
To use this package, clone the repository and install the required dependencies.

1. Python version
```
Python >= 3.0, < 3.11
```

2. Clone repository

```bash
git clone git@github.com:lopentu/HanziAnalysisKit.git
```

3. Install requirements

```bash
cd HanziAnalysisKit && pip install -r requirements.txt
```
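
The version constraint in step 1 can also be checked from Python before installing. This is a minimal sketch, assuming the bounds stated above (at least 3.0, below 3.11). The helper `python_version_supported` is hypothetical and not part of this repository.

```python
import sys

# Bounds taken from step 1 above: 3.0 <= Python < 3.11.
MIN_VERSION = (3, 0)
MAX_VERSION_EXCLUSIVE = (3, 11)


def python_version_supported(version=None):
    """Return True if `version` (major, minor) is inside the supported range.

    Hypothetical helper for illustration; defaults to the running interpreter.
    """
    if version is None:
        version = sys.version_info[:2]
    return MIN_VERSION <= tuple(version[:2]) < MAX_VERSION_EXCLUSIVE


if __name__ == "__main__":
    if not python_version_supported():
        found = ".".join(map(str, sys.version_info[:2]))
        print(f"Python {found} is outside the supported range")
```

Comparing tuples rather than version strings avoids the usual pitfall where `"3.9"` sorts after `"3.11"` lexicographically.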

## Quick Start

### Building the Corpus
To prepare your data for use with the HGCT tool, build your corpus from the provided textbook data:

```python
from textbook import build_corpus

# Specify your CSV data file (textbook lesson texts) and desired output folder
csv_file = "./data/教科書課文.csv"
folder = "textbook_corpus"

# Build the corpus
build_corpus(csv_file, folder)
```

This will create a `textbook_corpus` folder in your project directory, containing the processed data ready for analysis with HGCT.
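
Before building, it can help to confirm that the CSV file exists and to look at its header row. The sketch below uses only the standard library. `preview_csv` is a hypothetical helper, not part of this repository, and the UTF-8 encoding is an assumption. Because the column layout of the CSV is not documented here, it returns whatever header and rows it finds.

```python
import csv
from pathlib import Path


def preview_csv(csv_file, n_rows=3, encoding="utf-8"):
    """Return the header and the first `n_rows` data rows of a CSV file.

    Hypothetical helper for illustration; the encoding is an assumption.
    """
    path = Path(csv_file)
    if not path.is_file():
        raise FileNotFoundError(f"CSV file not found: {path}")
    with path.open(newline="", encoding=encoding) as f:
        reader = csv.reader(f)
        header = next(reader, [])  # empty list if the file has no rows
        rows = [row for _, row in zip(range(n_rows), reader)]
    return header, rows


if __name__ == "__main__":
    try:
        header, rows = preview_csv("./data/教科書課文.csv")
    except FileNotFoundError as err:
        print(err)
    else:
        print("Columns:", header)
        for row in rows:
            print(row)
```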

### Directory Structure
After building the corpus, your project directory will include:

```
|--- data/
|    |--- 教科書課文.csv
|--- textbook/
|--- textbook_corpus/    # Newly generated
|--- ...
```
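
The layout above can be checked with a few lines of standard-library Python. This is a sketch only: `missing_entries` is a hypothetical helper, and it checks just the three directories named in the tree, not the contents of `textbook_corpus/`.

```python
from pathlib import Path

# Directories named in the tree above.
EXPECTED_ENTRIES = ("data", "textbook", "textbook_corpus")


def missing_entries(project_root=".", expected=EXPECTED_ENTRIES):
    """Return the expected directories that are absent under `project_root`."""
    root = Path(project_root)
    return [name for name in expected if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_entries()
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("Project layout matches the expected structure.")
```

If `textbook_corpus` is reported as missing, the build step has most likely not been run yet or wrote to a different output folder.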

### Advanced HGCT Features
Once the textbook corpus is built, you can run HGCT queries against it. For examples and detailed usage instructions, see the [GitHub project page](https://lopentu.github.io/HanziAnalysisKit/).
@@ -0,0 +1,2 @@
PyYAML==6.0.1
hgct==0.0.1