diff --git a/README.md b/README.md
index d0cb535..b8ab52a 100644
--- a/README.md
+++ b/README.md
@@ -1,7 +1,6 @@
-# OCR
-OCR demo
+# OCR(cptn + tesseract)
 
-####
+Setting up tesseract on macOS
 ```bash
 # install the latest version of tesseract for MacOS
 brew install tesseract --head
@@ -13,9 +12,10 @@ pip install pytesseract
 # Then move these models to /usr/local/share/tessdata/
 chi_sim.traineddata (Simplified Chinese)
 chi_tra.traineddata (Traditional Chinese)
+```
 
-# configure training env for MacOS
-# other OS please reffer to https://tesseract-ocr.github.io/tessdoc/Compiling.html#macos
+Setting up the tesseract training environment on macOS; for other systems see [here](https://tesseract-ocr.github.io/tessdoc/Compiling.html#macos)
+```markdown
 export HOMEBREW_NO_AUTO_UPDATE=true
 brew install libtool automake
 git clone https://gitee.com/vance-coder/tesseract.git
@@ -26,38 +26,77 @@ make training
 sudo make install training-install
 ```
 
+**The following describes fine-tuning an LSTM model with tesseract 4.1+, using eng (English) as the example:**
+
+1. Download the materials needed for training in advance
+- Create a folder to hold the materials, e.g. tesseract-train.
+If a folder mentioned in the steps below does not exist, create it yourself; this is not repeated at each step.
+- Download the eng.traineddata file from [here](https://github.com/tesseract-ocr/tessdata_best)
+and put it under tesseract-train/models/ (this is the officially recommended model file: Best (most accurate) trained LSTM models.)
+- Download the lstm.train file from [here](https://github.com/tesseract-ocr/tessconfigs/tree/3decf1c8252ba6dbeef0bf908f4b0aab7f18d113/configs)
+and put it under tesseract-train/models/configs/ (this is the LSTM config file)
+- Download everything in the eng folder, including eng.training_text and all other files, from [here](https://github.com/tesseract-ocr/langdata_lstm)
+and put it under tesseract-train/langdata_lstm/eng/ (purpose of these files: Data used for LSTM model training)
+- Find and download the radical-stroke.txt file from [here](https://github.com/tesseract-ocr/langdata_lstm)
+and put it under tesseract-train/langdata_lstm/
+
+2. Extract the lstm file from the original eng model (eng.traineddata); the lstm file appears to be the trained weights
 ```markdown
-# extract the lstm file from the original eng model (taken from the tessdata_best repository)
+# eng.lstm is the extracted lstm file; an lstm folder was created here to hold it
 combine_tessdata -e ./models/eng.traineddata ./lstm/eng.lstm
+```
+After this runs, the tesseract-train directory should contain the following files
+![tree1](./static/imgs/tree1.png)
 
-# generate the training set
-tesstrain.sh --fonts_dir /System/Library/Fonts --lang eng --linedata_only --fontlist "Heiti SC" --save_box_tiff --noextract_font_properties --langdata_dir ./langdata --maxpages 100 --tessdata_dir ./models --output_dir ./
+3. Generate the training set with the tesstrain tool (this should work the same way on Linux; text2image can be used instead)
+- --fonts_dir /System/Library/Fonts the font directory (it differs between systems; look up the one for yours)
+- --fontlist "Heiti SC" the font used to generate the training set
+- --lang eng the language
+- --linedata_only generate a training set in the form suited to LSTM training
+- --save_box_tiff also save the box and tiff files
+- --noextract_font_properties
+- --langdata_dir ./langdata_lstm the language data downloaded above
+- --maxpages 100 how many pages of training data to generate
+- --tessdata_dir ./models the model directory
+- --output_dir ./ the output directory for the generated training set
+(for more parameters see the documentation or run: tesstrain.sh --help)
+```markdown
+tesstrain.sh --fonts_dir /System/Library/Fonts --fontlist "Heiti SC" --lang eng --linedata_only --save_box_tiff --noextract_font_properties --langdata_dir ./langdata_lstm --maxpages 100 --tessdata_dir ./models --output_dir ./
+```
+After this runs, the following files should be generated in the directory
+![tree2](./static/imgs/tree2.png)
+
+4. Start training (there are too many parameters to explain one by one, so only the key ones are covered; the rest are in the documentation, and links to the official docs and a translated version are at the bottom)
+- --target_error_rate 0.05 the target error rate; as I understand it, this is the recognition error rate on this training set
+- --learning_rate 0.002 the learning rate; this should be the gradient-descent learning rate, and the default is only 0.001
+- --model_output ./checkpoint/ where the model is written when training finishes; the output is checkpoint files
+- --continue_from ./lstm/eng.lstm the weights to start training from
+- --traineddata ./models/eng.traineddata the existing model
+- --train_listfile ./eng.training_files.txt the training set (one of the files produced when the training set was generated above)
+- --max_iterations 10000 the maximum number of iterations; presumably one pass over the training set counts as one iteration? (training stops early if target_error_rate is reached first)
+```markdown
 # start training from the existing model (fine tuning)
 lstmtraining --debug_interval 100 --max_image_MB 2000 --target_error_rate 0.05 --learning_rate 0.002 --model_output ./checkpoint/ --continue_from ./lstm/eng.lstm --traineddata ./models/eng.traineddata --train_listfile ./eng.training_files.txt --max_iterations 5000 > basetrain.log
 # continue training from the previous run (the previous run wrote a checkpoint, so just point at it)
 lstmtraining --debug_interval 100 --max_image_MB 2000 --target_error_rate 0.02 --learning_rate 0.002 --model_output ./checkpoint/ --continue_from ./checkpoint/_0.091_244_3200.checkpoint --traineddata ./models/eng.traineddata --train_listfile ./eng.training_files.txt --max_iterations 8000 > basetrain.log
-
-# combine the model
-lstmtraining --stop_training --continue_from ./checkpoint/_0.091_244_3200.checkpoint --traineddata ./models/eng.traineddata --model_output ./eng.traineddata
 ```
+If everything has gone well, training has now produced its result (checkpoint files). The next step is to convert a checkpoint file into a model file that tesseract can use.
+Note that training may write many checkpoint files; in general you want to build the model from the checkpoint with the lowest error rate.
+For example, in _0.003_244_3200.checkpoint, 0.003 is the error rate.
 
+5. Combine the model
+- --continue_from ./checkpoint/_0.003_244_3200.checkpoint the checkpoint file to combine into the model
+- --traineddata ./models/eng.traineddata the existing model
 ```markdown
-OCR language: the language of the text in the image; set with the -l option on the command line and in pytesseract
-OCR Engine Mode(oem): tesseract 4 has two OCR engines (legacy, lstm); select one with the --oem option
-0 Legacy engine only.
-1 Neural nets LSTM engine only.
-2 Legacy + LSTM engines.
-3 Default, based on what is available.
-Page Segmentation Mode(psm): psm can be very useful when you have extra information about the structure of the text; the default for both Python and the command-line tool is 3.
-0 Orientation and script detection (OSD) only.
-1 Automatic page segmentation with OSD.
-2 Automatic page segmentation, but no OSD or OCR.
-3 Fully automatic page segmentation, but no OSD. (Default)
-4 Assume a single column of text of variable sizes.
-5 Assume a single uniform block of vertically aligned text.
-6 Assume a single uniform block of text.
-7 Treat the image as a single text line.
-8 Treat the image as a single word.
-9 Treat the image as a single word in a circle.
-10 Treat the image as a single character.
+# combine the model
+lstmtraining --stop_training --continue_from ./checkpoint/_0.003_244_3200.checkpoint --traineddata ./models/eng.traineddata --model_output ./eng.traineddata
 ```
+
+
+References:
+
+[Tesseract 4.0 LSTM training, a very detailed tutorial - Zhihu](https://zhuanlan.zhihu.com/p/58366201)
+
+[tesseract training tutorial - official](https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00#training-text-requirements)
+
+[tesseract training tutorial, translated version - CSDN](https://blog.csdn.net/panbiao1999/article/details/74638749)
\ No newline at end of file
diff --git a/images/0.png b/images/0.png
new file mode 100644
index 0000000..af34b33
Binary files /dev/null and b/images/0.png differ
diff --git a/images/0_.png b/images/0_.png
new file mode 100644
index 0000000..ebe93e4
Binary files /dev/null and b/images/0_.png differ
diff --git a/images/1.png b/images/1.png
new file mode 100644
index 0000000..536049c
Binary files /dev/null and b/images/1.png differ
diff --git a/images/10.png b/images/10.png
new file mode 100644
index 0000000..1802f79
Binary files /dev/null and b/images/10.png differ
diff --git a/images/10_.png b/images/10_.png
new file mode 100644
index 0000000..e622e85
Binary files /dev/null and b/images/10_.png differ
diff --git a/images/11.png b/images/11.png
new file mode 100644
index 0000000..c25f07a
Binary files /dev/null and b/images/11.png differ
diff --git a/images/11_.png b/images/11_.png
new file mode 100644
index 0000000..3e62339
Binary files /dev/null and b/images/11_.png differ
diff --git a/images/12.png b/images/12.png
new file mode 100644
index 0000000..d7dd2db
Binary files /dev/null and b/images/12.png differ
diff --git a/images/12_.png b/images/12_.png
new file mode 100644
index 0000000..9238d7b
Binary files /dev/null and b/images/12_.png differ
diff --git a/images/13.png b/images/13.png
new file mode 100644
index 0000000..3028aed
Binary files /dev/null and b/images/13.png differ
diff --git a/images/13_.png b/images/13_.png
new file mode 100644
index 0000000..7de5200
Binary files /dev/null and b/images/13_.png differ
diff --git a/images/14.png b/images/14.png
new file mode 100644
index 0000000..81dc70e
Binary files /dev/null and b/images/14.png differ
diff --git a/images/14_.png b/images/14_.png
new file mode 100644
index 0000000..c21e710
Binary files /dev/null and b/images/14_.png differ
diff --git a/images/15.png b/images/15.png
new file mode 100644
index 0000000..a3048e1
Binary files /dev/null and b/images/15.png differ
diff --git a/images/15_.png b/images/15_.png
new file mode 100644
index 0000000..1a4d672
Binary files /dev/null and b/images/15_.png differ
diff --git a/images/16.png b/images/16.png
new file mode 100644
index 0000000..5ab0955
Binary files /dev/null and b/images/16.png differ
diff --git a/images/16_.png b/images/16_.png
new file mode 100644
index 0000000..4a381c2
Binary files /dev/null and b/images/16_.png differ
diff --git a/images/17.png b/images/17.png
new file mode 100644
index 0000000..fbbaf69
Binary files /dev/null and b/images/17.png differ
diff --git a/images/17_.png b/images/17_.png
new file mode 100644
index 0000000..acbb15b
Binary files /dev/null and b/images/17_.png differ
diff --git a/images/18.png b/images/18.png
new file mode 100644
index 0000000..597e459
Binary files /dev/null and b/images/18.png differ
diff --git a/images/18_.png b/images/18_.png
new file mode 100644
index 0000000..3ea5afa
Binary files /dev/null and b/images/18_.png differ
diff --git a/images/19.png b/images/19.png
new file mode 100644
index 0000000..3cd6a2c
Binary files /dev/null and b/images/19.png differ
diff --git a/images/19_.png b/images/19_.png
new file mode 100644
index 0000000..f5e5382
Binary files /dev/null and b/images/19_.png differ
diff --git a/images/1_.png b/images/1_.png
new file mode 100644
index 0000000..4ca6b11
Binary files /dev/null and b/images/1_.png differ
diff --git a/images/2.png b/images/2.png
new file mode 100644
index 0000000..e0e405d
Binary files /dev/null and b/images/2.png differ
diff --git a/images/20.png b/images/20.png
new file mode 100644
index 0000000..bbfab8e
Binary files /dev/null and b/images/20.png differ
diff --git a/images/20_.png b/images/20_.png
new file mode 100644
index 0000000..cb59984
Binary files /dev/null and b/images/20_.png differ
diff --git a/images/21.png b/images/21.png
new file mode 100644
index 0000000..42c8753
Binary files /dev/null and b/images/21.png differ
diff --git a/images/21_.png b/images/21_.png
new file mode 100644
index 0000000..8ae3b22
Binary files /dev/null and b/images/21_.png differ
diff --git a/images/22.png b/images/22.png
new file mode 100644
index 0000000..c904a28
Binary files /dev/null and b/images/22.png differ
diff --git a/images/22_.png b/images/22_.png
new file mode 100644
index 0000000..a1e5b26
Binary files /dev/null and b/images/22_.png differ
diff --git a/images/23.png b/images/23.png
new file mode 100644
index 0000000..b42940b
Binary files /dev/null and b/images/23.png differ
diff --git a/images/23_.png b/images/23_.png
new file mode 100644
index 0000000..dc87fed
Binary files /dev/null and b/images/23_.png differ
diff --git a/images/24.png b/images/24.png
new file mode 100644
index 0000000..a711842
Binary files /dev/null and b/images/24.png differ
diff --git a/images/24_.png b/images/24_.png
new file mode 100644
index 0000000..cd67dea
Binary files /dev/null and b/images/24_.png differ
diff --git a/images/25.png b/images/25.png
new file mode 100644
index 0000000..7f465e9
Binary files /dev/null and b/images/25.png differ
diff --git a/images/25_.png b/images/25_.png
new file mode 100644
index 0000000..cf2e468
Binary files /dev/null and b/images/25_.png differ
diff --git a/images/26.png b/images/26.png
new file mode 100644
index 0000000..4b04c7f
Binary files /dev/null and b/images/26.png differ
diff --git a/images/26_.png b/images/26_.png
new file mode 100644
index 0000000..6f40b4e
Binary files /dev/null and b/images/26_.png differ
diff --git a/images/27.png b/images/27.png
new file mode 100644
index 0000000..ad15e3c
Binary files /dev/null and b/images/27.png differ
diff --git a/images/27_.png b/images/27_.png
new file mode 100644
index 0000000..631a47b
Binary files /dev/null and b/images/27_.png differ
diff --git a/images/28.png b/images/28.png
new file mode 100644
index 0000000..5cd8e1b
Binary files /dev/null and b/images/28.png differ
diff --git a/images/28_.png b/images/28_.png
new file mode 100644
index 0000000..a04f903
Binary files /dev/null and b/images/28_.png differ
diff --git a/images/29.png b/images/29.png
new file mode 100644
index 0000000..0517470
Binary files /dev/null and b/images/29.png differ
diff --git a/images/29_.png b/images/29_.png
new file mode 100644
index 0000000..fbefd92
Binary files /dev/null and b/images/29_.png differ
diff --git a/images/2_.png b/images/2_.png
new file mode 100644
index 0000000..959322e
Binary files /dev/null and b/images/2_.png differ
diff --git a/images/3.png b/images/3.png
new file mode 100644
index 0000000..972b682
Binary files /dev/null and b/images/3.png differ
diff --git a/images/3_.png b/images/3_.png
new file mode 100644
index 0000000..72c875a
Binary files /dev/null and b/images/3_.png differ
diff --git a/images/4.png b/images/4.png
new file mode 100644
index 0000000..cfcd2d0
Binary files /dev/null and b/images/4.png differ
diff --git a/images/4_.png b/images/4_.png
new file mode 100644
index 0000000..314e5f1
Binary files /dev/null and b/images/4_.png differ
diff --git a/images/5.png b/images/5.png
new file mode 100644
index 0000000..b1c1dc5
Binary files /dev/null and b/images/5.png differ
diff --git a/images/5_.png b/images/5_.png
new file mode 100644
index 0000000..ba1e03b
Binary files /dev/null and b/images/5_.png differ
diff --git a/images/6.png b/images/6.png
new file mode 100644
index 0000000..2335b96
Binary files /dev/null and b/images/6.png differ
diff --git a/images/6_.png b/images/6_.png
new file mode 100644
index 0000000..03fda7b
Binary files /dev/null and b/images/6_.png differ
diff --git a/images/7.png b/images/7.png
new file mode 100644
index 0000000..8b640ec
Binary files /dev/null and b/images/7.png differ
diff --git a/images/7_.png b/images/7_.png
new file mode 100644
index 0000000..2a290e4
Binary files /dev/null and b/images/7_.png differ
diff --git a/images/8.png b/images/8.png
new file mode 100644
index 0000000..2ce4732
Binary files /dev/null and b/images/8.png differ
diff --git a/images/8_.png b/images/8_.png
new file mode 100644
index 0000000..a32dce8
Binary files /dev/null and b/images/8_.png differ
diff --git a/images/9.png b/images/9.png
new file mode 100644
index 0000000..2841715
Binary files /dev/null and b/images/9.png differ
diff --git a/images/9_.png b/images/9_.png
new file mode 100644
index 0000000..a8b68e5
Binary files /dev/null and b/images/9_.png differ
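
Note on the README change in this diff: it advises building the final model from the checkpoint with the lowest error rate, and says the error rate is the first number in a file name such as `_0.003_244_3200.checkpoint`. That choice could be scripted. The following is only a minimal sketch, assuming every checkpoint name follows that layout; the function name `best_checkpoint` and the regular expression are illustrative and are not part of this repository or of tesseract.

```python
import re
from pathlib import Path

# Assumed name layout, taken from the README's example
# "_0.003_244_3200.checkpoint": <prefix>_<error rate>_<n>_<n>.checkpoint
_CHECKPOINT_RE = re.compile(r"_(\d+(?:\.\d+)?)_\d+_\d+\.checkpoint$")


def best_checkpoint(names):
    """Return the name with the lowest error rate, or None if none match."""
    best_name, best_rate = None, None
    for name in names:
        match = _CHECKPOINT_RE.search(Path(name).name)
        if match is None:
            continue  # skip files that do not follow the assumed layout
        rate = float(match.group(1))
        if best_rate is None or rate < best_rate:
            best_name, best_rate = name, rate
    return best_name


if __name__ == "__main__":
    # Hypothetical usage: scan the --model_output directory used in the README.
    candidates = [str(p) for p in Path("./checkpoint").glob("*.checkpoint")]
    print(best_checkpoint(candidates))
```

The returned path is what would be passed to `--continue_from` in the `lstmtraining --stop_training` command shown in the README.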