🌳 CED: Catalog Extraction from Documents

Rebuild document catalog tree structures from plain text.

📃 This is the official implementation of the ICDAR'23 paper: CED: Catalog Extraction from Documents
📂 Data files are available at releases/tag/data-v1 and releases/tag/data-v1-patch.
📦 Model weights and logs are available at releases/tag/model-v1.

✈️ Abilities

Concatenate OCR text pieces
Compose paragraphs from sequences
Extract document catalogs from plain text
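
To make the target output concrete, here is a minimal sketch of a catalog tree assembled from headings whose depths are already known. It is not the model in this repository. `Node`, `build_tree`, and `render` are hypothetical names used only for illustration, and the `(depth, text)` input format is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """One catalog entry (hypothetical structure, for illustration only)."""
    text: str
    children: List["Node"] = field(default_factory=list)


def build_tree(items: List[Tuple[int, str]]) -> Node:
    """Assemble a tree from (depth, text) pairs given in reading order.

    Depth 1 is a top-level heading; deeper headings attach to the most
    recent heading with a smaller depth.
    """
    root = Node("ROOT")
    stack = [(0, root)]  # (depth, node) path from the root to the last node
    for depth, text in items:
        if depth < 1:
            raise ValueError("depth must be >= 1")
        # Pop until the top of the stack is a valid parent (smaller depth).
        while stack[-1][0] >= depth:
            stack.pop()
        node = Node(text)
        stack[-1][1].children.append(node)
        stack.append((depth, node))
    return root


def render(node: Node, indent: int = 0) -> List[str]:
    """Return the tree as indented lines, two spaces per level."""
    lines = ["  " * indent + node.text]
    for child in node.children:
        lines.extend(render(child, indent + 1))
    return lines


if __name__ == "__main__":
    headings = [(1, "Chapter 1"), (2, "Section 1.1"), (1, "Chapter 2")]
    print("\n".join(render(build_tree(headings))))
```

In the actual task the depths are not given and must be predicted from plain text, which is the part the models in this repository address.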

⚙️ Installation

Make sure you have torch>=1.9.1 installed.

Python>=3.7
torch>=1.9.1

# create a fresh environment to avoid package version mismatches
$ conda create -n doctree python=3.7
$ conda activate doctree
# install pytorch
$ echo 'install pytorch from https://pytorch.org/'

# install basics
$ pip install -e .
# to run tests and do development, use this command
$ pip install -e ".[dev]"
# to deploy the demo on your local machine, use this
$ pip install -e ".[demo]"
# for both development and demo deployment, use this
$ pip install -e ".[all]"

🚀 Quick Start

All task examples are in examples/.

Concatenate text segments

Train

# edit the settings in `examples/text_concat/train/config.yaml`
$ bash examples/text_concat/train/run.sh

Inference

# check `examples/text_concat/inference/run.sh` and change the task directory
$ bash examples/text_concat/inference/run.sh

Concat & split paragraphs

We use the same task class to train both the paragraph splitting and the text concatenation models, so the scripts below are the same as those in the previous section.

Train

# edit the settings in `examples/text_concat/train/config.yaml`
$ bash examples/text_concat/train/run.sh

Inference

# check `examples/text_concat/inference/run.sh` and change the task directory
$ bash examples/text_concat/inference/run.sh

Extract catalog structures

We use the hierarchical config mechanism in REx: settings in credit_eval.yaml override those in base_config.yaml. Add or change settings in credit_eval.yaml to fit your setup.
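
The override behavior can be pictured with a small sketch. This is not REx's implementation or API. `merge_configs` is a hypothetical helper, the keys and values are invented for illustration, and whether REx merges nested sections recursively (as here) is an assumption.

```python
from copy import deepcopy
from typing import Any, Dict


def merge_configs(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new config where values in `override` take precedence.

    Nested dictionaries are merged key by key; any other value in
    `override` replaces the base value. Neither input is modified.
    """
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_configs(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


if __name__ == "__main__":
    # Stand-ins for the parsed contents of base_config.yaml and
    # credit_eval.yaml (hypothetical keys and values).
    base_config = {"task_name": "base", "train": {"lr": 1e-3, "epochs": 10}}
    credit_eval = {"task_name": "credit_eval", "train": {"lr": 5e-5}}
    print(merge_configs(base_config, credit_eval))
```

Under this reading, anything not set in credit_eval.yaml keeps its value from base_config.yaml.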

Train

# edit the settings in `examples/doc_tree_construction/train/credit_eval.yaml`
$ bash examples/doc_tree_construction/train/run.sh

Inference

# change `task_dir` in `examples/doc_tree_construction/inference/run_simple.py`
$ python examples/doc_tree_construction/inference/run_simple.py

📃 Documentation

Check docs/ for more detailed explanations.

📜 Citation

If you find this paper or repo useful, please cite our paper:

@article{zhu2023ced,
  title={CED: Catalog Extraction from Documents},
  author={Zhu, Tong and Zhang, Guoliang and Li, Zechang and Yu, Zijian and Ren, Junfei and Wu, Mengsong and Wang, Zhefeng and Huai, Baoxing and Chao, Pingfu and Chen, Wenliang},
  journal={arXiv preprint arXiv:2304.14662},
  year={2023}
}
