GitHub - funstory-ai/BabelDOC: Yet Another Document Translator

PDF scientific paper translation and bilingual comparison library.

Beta version launched: Immersive Translate - BabelDOC, with 1000 free pages per month.
Provides a simple command line interface.
Provides a Python API.
Mainly designed to be embedded into other programs, but can also be used directly for simple translation tasks.

Getting Started

Install from PyPI

We recommend using the Tool feature of uv to install BabelDOC.

First, follow the uv installation instructions to install uv and set up the PATH environment variable as prompted.
Then use the following command to install BabelDOC:

uv tool install --python 3.12 BabelDOC

babeldoc --help

Use the babeldoc command. For example:

babeldoc --bing --files example.pdf

# multiple files
babeldoc --bing --files example1.pdf --files example2.pdf

Install from Source

We still recommend using uv to manage virtual environments.

First, follow the uv installation instructions to install uv and set up the PATH environment variable as prompted.
Then use the following commands to clone and run BabelDOC:

# clone the project
git clone https://github.com/funstory-ai/BabelDOC

# enter the project directory
cd BabelDOC

# install dependencies and run babeldoc
uv run babeldoc --help

Use the uv run babeldoc command. For example:

uv run babeldoc --bing --files example.pdf

# multiple files
uv run babeldoc --bing --files example.pdf --files example2.pdf

Tip

Using absolute paths for input files is recommended.
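Since BabelDOC is mainly meant to be embedded into other programs, the command-line calls above can also be driven from Python. The following is a minimal sketch, assuming the babeldoc command is on PATH; the helper names build_babeldoc_command and translate are hypothetical and not part of BabelDOC.

```python
import subprocess
from pathlib import Path


def build_babeldoc_command(pdf_paths, service_flag="--bing", output_dir=None):
    """Build an argv list for the babeldoc CLI (hypothetical helper)."""
    command = ["babeldoc", service_flag]
    for pdf in pdf_paths:
        # Resolve each input to an absolute path, as the tip recommends.
        command += ["--files", str(Path(pdf).resolve())]
    if output_dir is not None:
        # --output is the documented output directory option.
        command += ["--output", str(Path(output_dir).resolve())]
    return command


def translate(pdf_paths, **kwargs):
    """Run babeldoc; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_babeldoc_command(pdf_paths, **kwargs), check=True)
```

Calling translate(["example1.pdf", "example2.pdf"]) would then run the same multi-file command shown above, with each path made absolute.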

Advanced Options

Language Options

--lang-in, -li: Source language code (default: en)
--lang-out, -lo: Target language code (default: zh)

Tip

Currently, this project mainly focuses on English-to-Chinese translation, and other scenarios have not been tested yet.

(2025.3.1 update): Basic support for English as a target language has been added, primarily to minimize line breaks within words ([0-9A-Za-z]+).

HELP WANTED: Collecting word regular expressions for more languages

PDF Processing Options

--files: One or more file paths to input PDF documents.
--pages, -p: Specify pages to translate (e.g., "1,2,1-,-3,3-5"). If not set, translate all pages
--split-short-lines: Force split short lines into different paragraphs (may cause poor typesetting & bugs)
--short-line-split-factor: Split threshold factor (default: 0.8). The actual threshold is the median length of all lines on the current page * this factor
--skip-clean: Skip PDF cleaning step
--dual-translate-first: Put translated pages first in dual PDF mode (default: original pages first)
--disable-rich-text-translate: Disable rich text translation (may help improve compatibility with some PDFs)
--enhance-compatibility: Enable all compatibility enhancement options (equivalent to --skip-clean --dual-translate-first --disable-rich-text-translate)
--use-alternating-pages-dual: Use alternating pages mode for dual PDF. When enabled, original and translated pages are arranged in alternate order. When disabled (default), original and translated pages are shown side by side on the same page.
--no-watermark: Do not add watermark to the translated PDF.
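To illustrate the --pages syntax, here is a rough sketch of how a spec such as "1,2,1-,-3,3-5" could be expanded into page numbers. It assumes that an open-ended range like "1-" runs to the last page and "-3" starts at the first page; the function parse_page_spec is hypothetical and is not BabelDOC's own parser.

```python
def parse_page_spec(spec, total_pages):
    """Expand a page spec into a sorted list of 1-based page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            # A missing bound means "from the first page" or "to the last page".
            start = int(start_text) if start_text else 1
            end = int(end_text) if end_text else total_pages
        else:
            start = end = int(part)
        # Clamp the range to pages that exist in the document.
        for page in range(max(start, 1), min(end, total_pages) + 1):
            pages.add(page)
    return sorted(pages)
```

Overlapping entries are merged, so a spec that mentions the same page several times still yields each page once.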

Tip

Both --skip-clean and --dual-translate-first may help improve compatibility with some PDF readers.
--disable-rich-text-translate can also help with compatibility by simplifying translation input.
However, using --skip-clean will result in larger file sizes.
If you encounter any compatibility issues, try using --enhance-compatibility first.
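The threshold described for --short-line-split-factor (median line length on the page multiplied by the factor) can be written out as a small calculation. This is only a sketch of the formula as documented: how BabelDOC measures line length, and whether it compares with a strict "less than", are assumptions here, and both function names are hypothetical.

```python
from statistics import median


def short_line_threshold(line_lengths, factor=0.8):
    """Threshold = median length of all lines on the page * factor."""
    return median(line_lengths) * factor


def is_short_line(length, line_lengths, factor=0.8):
    """Assumed rule: a line counts as short when it is below the threshold."""
    return length < short_line_threshold(line_lengths, factor)
```

For a page whose lines have lengths 10, 20 and 30, the median is 20, so the default factor of 0.8 gives a threshold of 16.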

Translation Service Options

--qps: QPS (Queries Per Second) limit for translation service (default: 4)
--ignore-cache: Ignore translation cache and force retranslation
--no-dual: Do not output bilingual PDF files
--no-mono: Do not output monolingual PDF files
--min-text-length: Minimum text length to translate (default: 5)
--openai: Use OpenAI for translation (default: False)
--bing: Use Bing for translation (default: False)
--google: Use Google Translate for translation (default: False)

Tip

You must specify one translation service among --openai, --bing, --google.
It is recommended to use models with strong compatibility with OpenAI, such as glm-4-flash, deepseek-chat, etc.
The tool has not yet been optimized for traditional translation engines like Bing/Google, so using LLMs is recommended.
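As an illustration of what a QPS limit such as --qps means in practice, the sketch below spaces requests at least 1/qps seconds apart. It is a generic, single-threaded rate limiter written for this explanation, not BabelDOC's implementation; the class name QpsLimiter and its injectable clock and sleep parameters are assumptions.

```python
import time


class QpsLimiter:
    """Space calls at least 1/qps seconds apart (illustrative only)."""

    def __init__(self, qps=4, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / qps
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0

    def wait(self):
        """Block until the next request may be sent; return seconds waited."""
        now = self._clock()
        delay = max(0.0, self._next_allowed - now)
        if delay > 0:
            self._sleep(delay)
        # Schedule the earliest time the following request may start.
        self._next_allowed = max(now, self._next_allowed) + self.min_interval
        return delay
```

With the default of 4 queries per second, calling wait() before each translation request keeps consecutive requests at least 0.25 seconds apart.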

OpenAI Specific Options

--openai-model: OpenAI model to use (default: gpt-4o-mini)
--openai-base-url: Base URL for OpenAI API
--openai-api-key: API key for OpenAI service

Tip

This tool supports any OpenAI-compatible API endpoint. Just set the correct base URL (e.g. https://xxx.custom.xxx/v1) and API key.
For local models like Ollama, you can use any value as the API key (e.g. --openai-api-key a).
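The OpenAI-specific options can be assembled programmatically when pointing the tool at a compatible endpoint. The sketch below only builds an argument list from the flags documented above; the function openai_args is hypothetical, and the Ollama URL and model name in the usage line are assumptions about a typical local setup rather than values taken from this project.

```python
def openai_args(model="gpt-4o-mini", base_url=None, api_key=None):
    """Build the documented --openai* flags as an argv fragment (hypothetical helper)."""
    args = ["--openai", "--openai-model", model]
    if base_url:
        args += ["--openai-base-url", base_url]
    if api_key:
        args += ["--openai-api-key", api_key]
    return args


# Assumed local setup: an Ollama server exposing an OpenAI-compatible API.
# The model name is a placeholder; any non-empty API key value is accepted.
ollama_args = openai_args(
    model="some-local-model",
    base_url="http://localhost:11434/v1",
    api_key="a",
)
```

The resulting list can be appended to a babeldoc command line together with --files.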

Output Control

--output, -o: Output directory for translated files. If not set, use current working directory.
--debug, -d: Enable debug logging level and export detailed intermediate results in ~/.cache/yadt/working.
--report-interval: Progress report interval in seconds (default: 0.1).

Offline Assets Management

--generate-offline-assets: Generate an offline assets package in the specified directory. This creates a zip file containing all required models and fonts.
--restore-offline-assets: Restore an offline assets package from the specified file. This extracts models and fonts from a previously generated package.

Tip

Offline assets packages are useful for environments without internet access or to speed up installation on multiple machines.
Generate a package once with babeldoc --generate-offline-assets /path/to/output/dir and then distribute it.
Restore the package on target machines with babeldoc --restore-offline-assets /path/to/offline_assets_*.zip.
The offline assets package name cannot be modified because the file list hash is encoded in the name.
If you provide a directory path to --restore-offline-assets, the tool will automatically look for the correct offline assets package file in that directory.
The package contains all necessary fonts and models required for document processing, ensuring consistent results across different environments.
The integrity of all assets is verified using SHA3-256 hashes during both packaging and restoration.
If you're deploying in an air-gapped environment, make sure to generate the package on a machine with internet access first.
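To make the integrity check and the directory lookup more concrete, the following sketch computes a SHA3-256 digest with Python's hashlib and searches a directory for files matching the offline_assets_*.zip pattern mentioned above. It does not reproduce BabelDOC's own verification logic; the function names, and the idea of comparing against a separately supplied expected digest, are assumptions.

```python
import hashlib
from pathlib import Path


def sha3_256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA3-256 digest of a file, read in chunks."""
    digest = hashlib.sha3_256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_asset(path, expected_hex):
    """True when the file's digest matches the expected hex string."""
    return sha3_256_of_file(path) == expected_hex.lower()


def find_offline_packages(directory):
    """List candidate package files in a directory, sorted by name."""
    return sorted(Path(directory).glob("offline_assets_*.zip"))
```

A deployment script could use such a check to confirm that a package copied to an air-gapped machine arrived intact before restoring it.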

Configuration File

--config, -c: Configuration file path. Use the TOML format.

Example Configuration:

[babeldoc]
debug = true
lang-in = "en-US"
lang-out = "zh-CN"
qps = 20
# this is a comment
# pages = 4
openai = true
openai-model = "SOME_ALSOME_MODEL"
openai-base-url = "https://example.example/v1"
openai-api-key = "[KEY]"
# Offline assets management
# generate-offline-assets = "/path/to/output/dir"
# restore-offline-assets = "/path/to/offline_assets_package.zip"
# All other options can also be set in the configuration file.

For a more comprehensive configuration example with offline assets management:

[babeldoc]
# Basic settings
debug = true
lang-in = "en-US"
lang-out = "zh-CN"
qps = 10
output = "/path/to/output/dir"

# Translation service
openai = true
openai-model = "gpt-4o-mini"
openai-base-url = "https://api.openai.com/v1"
openai-api-key = "your-api-key-here"

# PDF processing options
split-short-lines = false
short-line-split-factor = 0.8
skip-clean = false
dual-translate-first = false
disable-rich-text-translate = false
use-alternating-pages-dual = false
no-watermark = false

# Output control
no-dual = false
no-mono = false
min-text-length = 5
report-interval = 0.5

# Offline assets management
# Uncomment one of these options as needed:
# generate-offline-assets = "/path/to/output/dir"
# restore-offline-assets = "/path/to/offline_assets_package.zip"

Python API

You can refer to the example in main.py to use BabelDOC's Python API.

Please note:

Make sure to call babeldoc.high_level.init() before using the API.
The current TranslationConfig does not fully validate input parameters, so you need to ensure that the input parameters are valid.

For offline assets management, you can use the following functions:

# Generate an offline assets package
from pathlib import Path
import babeldoc.assets.assets

# Generate package to a specific directory
# path is optional, default is ~/.cache/babeldoc/assets/offline_assets_{hash}.zip
babeldoc.assets.assets.generate_offline_assets_package(Path("/path/to/output/dir"))

# Restore from a package file
# path is optional, default is ~/.cache/babeldoc/assets/offline_assets_{hash}.zip
babeldoc.assets.assets.restore_offline_assets_package(Path("/path/to/offline_assets_package.zip"))

# You can also restore from a directory containing the offline assets package
# The tool will automatically find the correct package file based on the hash
babeldoc.assets.assets.restore_offline_assets_package(Path("/path/to/directory"))

Tip

The offline assets package name cannot be modified because the file list hash is encoded in the name.
When using in production environments, it's recommended to pre-generate the assets package and include it with your application distribution.
The package verification ensures that all required assets are intact and match their expected checksums.

Background

Many projects and teams are working to make document editing and translation easier.

There are also solutions that address specific parts of the problem, such as:

layoutreader: the reading order of the text blocks in a PDF
Surya: the structure of the PDF

This project hopes to promote a standard pipeline and interface to solve the problem.

In fact, a PDF parser or translator has two main stages:

Parsing: extracting the structure of the PDF, such as text blocks, images, tables, etc.
Rendering: rendering that structure into a new PDF or another format.

A service like mathpix parses the PDF into a structure, possibly in an XML format, and then renders it in a single-column reading order, as layoutreader does. The bad news is that the original structure is lost.

Some people use Adobe PDF Parser because it generates a Word document that keeps the original structure. But it is somewhat expensive, and a PDF or Word document is not a good format for reading on mobile devices.

We offer an intermediate representation of the parser's results that can be rendered into a new PDF or another format. The pipeline is also a plugin-based system to which anyone can add their own model, OCR, renderer, etc.
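The two-stage idea with an intermediate representation and pluggable renderers can be sketched in a few lines. Everything in this example (TextBlock, DocumentIR, the RENDERERS registry and the register_renderer decorator) is hypothetical and far simpler than BabelDOC's real data model; it only shows how a parsing stage and interchangeable rendering plugins can meet at a shared structure.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class TextBlock:
    page: int
    text: str


@dataclass
class DocumentIR:
    """Intermediate representation shared by the parser and the renderers."""
    blocks: List[TextBlock] = field(default_factory=list)


# Registry of renderer plugins, keyed by output format name.
RENDERERS: Dict[str, Callable[[DocumentIR], str]] = {}


def register_renderer(name):
    """Decorator that adds a renderer plugin to the registry."""
    def decorator(func):
        RENDERERS[name] = func
        return func
    return decorator


def parse(pages):
    """Parsing stage: turn raw page texts into the intermediate representation."""
    return DocumentIR([TextBlock(page=i + 1, text=t) for i, t in enumerate(pages)])


@register_renderer("plain")
def render_plain(ir):
    """Rendering stage: one simple plugin that emits tagged plain text."""
    return "\n".join(f"[p{b.page}] {b.text}" for b in ir.blocks)


def render(ir, fmt):
    """Dispatch to whichever renderer plugin is registered for fmt."""
    return RENDERERS[fmt](ir)
```

Adding another output format then only requires registering one more function, without touching the parsing stage.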

Roadmap

Our goal for the first 1.0 version is to complete a translation of the PDF Reference, Version 1.7 into the following languages:

Simplified Chinese
Traditional Chinese
Japanese
Spanish

And meet the following requirements:

Layout error less than 1%
Content loss less than 1%

Known Issues

Parsing errors in the author and reference sections; they get merged into one paragraph after translation.
Lines are not supported.
Does not support drop caps.

How to Contribute

We encourage you to contribute to YADT! Please check out the CONTRIBUTING guide.

Everyone interacting in YADT and its sub-projects' codebases, issue trackers, chat rooms, and mailing lists is expected to follow the YADT Code of Conduct.

Immersive Translation sponsors monthly Pro membership redemption codes for active contributors to this project. See CONTRIBUTOR_REWARD.md for details.
