DFC

Document Format Converter

这是一个可以将某一种办公文档的格式转换为其他格式的文档的工具，使用Tesseract作为OCR识别工具，提高识别准确率，尽量做到完美转换。

在开始前，请您安装最新版的Tesseract，然后运行本项目中的自动运行批处理脚本 setup_and_run.bat。

如果脚本不可用，您可以尝试手动部署。

安装Python 3 (版本 >= 3.9)，将其添加至环境变量中。
使用Git clone本项目，或者直接在Releases中下载项目最新版压缩包。
解压项目，进入项目根目录，使用 pip3 install 安装依赖：
```
pip3 install -r requirements.txt
```
之后，使用以下命令来运行脚本：
```
python main.py
```
如果此项目出现任何问题，请您提交issues，或者发送反馈邮件至 [email protected]。

邮件格式：

标题：【Feedback】[您发现的问题或建议]

内容：[问题正文内容] + [您的联系方式]

自定义设置

在运行转换时，您可以自定义以下设置以优化图像处理和OCR识别：

图像处理参数：在代码中，您可以调整以下参数以优化图像质量：
- 对比度增强：可以通过更改 enhancer.enhance(2.0) 中的值来提高或降低对比度。
- 去噪声设置：您可以更改 ImageFilter.MedianFilter(size=3) 中的 size 参数，以调整去噪声的强度。
OCR语言选择：在识别图像文本时，您可以选择语言类型。以下是可用语言示例：
- 英文：lang='eng'
- 中文：lang='chi_sim'
- 日语：lang='jpn'
- 俄语：lang='rus'

请确保在调用 pytesseract.image_to_string() 方法时指定适当的语言参数。例如：

ocr_text = pytesseract.image_to_string(image, lang='chi_sim')  # 对于简体中文

This is a tool for converting one type of office document format to another, using Tesseract as the OCR recognition tool to improve accuracy and strive for perfect conversion.

Before you begin, please install the latest version of Tesseract, and then run the automated batch script setup_and_run.bat in this project.

If the script is not available, you can try manual deployment.

Install Python 3 (version >= 3.9) and add it to the environment variables.
Clone this project using Git or download the latest release from the Releases section.
Extract the project, navigate to the project root directory, and install the dependencies using pip:
```
pip3 install -r requirements.txt
```
After that, run the script using:
```
python main.py
```
If you encounter any issues with this project, please submit an issue or send feedback via email to [email protected].

Email format:

Subject: 【Feedback】[The issue or suggestion you discovered]

Content: [The detailed content of the issue] + [Your contact information]

Custom Settings

When running the conversion, you can customize the following settings to optimize image processing and OCR recognition:

Image Processing Parameters: In the code, you can adjust the following parameters to improve image quality:
- Contrast Enhancement: You can change the value in enhancer.enhance(2.0) to increase or decrease contrast.
- Denoising Settings: You can adjust the size parameter in ImageFilter.MedianFilter(size=3) to modify the strength of denoising.
OCR Language Selection: When recognizing text in images, you can choose the type of language. Here are examples of available languages:
- English: lang='eng'
- Chinese: lang='chi_sim'
- Japanese: lang='jpn'
- Russian: lang='rus'

Make sure to specify the appropriate language parameter when calling pytesseract.image_to_string(). For example:

ocr_text = pytesseract.image_to_string(image, lang='chi_sim')  # For Simplified Chinese

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
LICENSE		LICENSE
README.md		README.md
main.py		main.py
requirements.txt		requirements.txt
setup_and_run.bat		setup_and_run.bat

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

DFC

Document Format Converter

自定义设置

Custom Settings

About

Releases 1

Packages

Languages

License

DaZiDian/DFC

Folders and files

Latest commit

History

Repository files navigation

DFC

Document Format Converter

自定义设置

Custom Settings

About

Resources

License

Stars

Watchers

Forks

Releases 1

Packages 0

Languages

Packages