Note: This documentation expects you to be familiar with compiling software on your operating system.
Use the same tools for building tesseract as you used for building leptonica.
A C++ compiler with good C++17 support is required for building Tesseract from source. Several toolchains can help you build Tesseract: GNU Autotools, CMake, Software Network (a.k.a. sw) and vcpkg. Please have a look at the Tesseract GitHub Actions workflows if the following instructions are not clear to you.
To install Tesseract 4.x on Ubuntu 18.04 (Bionic), you can simply run the following command:
sudo apt install tesseract-ocr
If you wish to install the Developer Tools which can be used for training, run the following command:
sudo apt install libtesseract-dev
The following instructions are for building on Linux and can also be applied to other UNIX-like operating systems.
- A compiler for C and C++: GCC or Clang
- GNU Autotools: autoconf, automake, libtool
- pkg-config
- Leptonica
- (optional) zlib, libpng, libjpeg, libtiff, giflib, openjpeg, webp, archive, curl
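Before going further, it can help to confirm that the build tools listed above are on the PATH. The following is a minimal sketch and not part of the official instructions. The helper name find_missing is made up for illustration, and the command names checked are assumptions (for example, libtoolize is the command usually shipped by the libtool package on Debian-like systems) that may differ on your distribution.

```shell
# Print the names of any commands from the argument list that are not on PATH.
# find_missing is a hypothetical helper, not part of Tesseract.
find_missing() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# The tool names below are assumptions; adjust them for your distribution.
missing="$(find_missing autoconf automake libtoolize pkg-config)"
if [ -n "$missing" ]; then
    echo "Missing build tools:" $missing
else
    echo "All listed build tools were found."
fi
```

The check only looks for commands, so libraries such as Leptonica still have to be verified separately.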
If they are not already installed, you need the following libraries (Ubuntu 16.04/14.04):
sudo apt-get install g++ # or clang++ (presumably)
sudo apt-get install autoconf automake libtool
sudo apt-get install pkg-config
sudo apt-get install libpng-dev
sudo apt-get install libjpeg8-dev
sudo apt-get install libtiff5-dev
sudo apt-get install zlib1g-dev
sudo apt-get install libwebpdemux2 libwebp-dev
sudo apt-get install libopenjp2-7-dev
sudo apt-get install libgif-dev
sudo apt-get install libarchive-dev libcurl4-openssl-dev
If you plan to install the training tools, you also need the following libraries:
sudo apt-get install libicu-dev
sudo apt-get install libpango1.0-dev
sudo apt-get install libcairo2-dev
You also need to install Leptonica. Ensure that the development headers for Leptonica are installed before compiling Tesseract.
Tesseract versions and the minimum version of Leptonica required:
Tesseract | Leptonica | Ubuntu |
---|---|---|
4.00 | 1.74.2 | Ubuntu 18.04 |
3.05 | 1.74.0 | Must build from source |
3.04 | 1.71 | Ubuntu 16.04 |
3.03 | 1.70 | Ubuntu 14.04 |
3.02 | 1.69 | Ubuntu 12.04 |
3.01 | 1.67 | |
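To compare an installed Leptonica against the minimum from the table, a version check along these lines may be useful. This is a sketch under stated assumptions: version_ge is a hypothetical helper that only understands dotted numeric versions, and the pkg-config module name lept is assumed to be the one Leptonica registers on your system.

```shell
# Return 0 if dotted numeric version $1 is greater than or equal to $2.
# version_ge is a hypothetical helper; it does not handle suffixes like "-rc1".
version_ge() {
    v1="$1"; v2="$2"
    while [ -n "$v1" ] || [ -n "$v2" ]; do
        p1="${v1%%.*}"; p2="${v2%%.*}"          # leading components
        [ "${p1:-0}" -gt "${p2:-0}" ] && return 0
        [ "${p1:-0}" -lt "${p2:-0}" ] && return 1
        # Drop the leading component from each version.
        case "$v1" in *.*) v1="${v1#*.}" ;; *) v1="" ;; esac
        case "$v2" in *.*) v2="${v2#*.}" ;; *) v2="" ;; esac
    done
    return 0
}

required="1.74.2"   # minimum Leptonica for Tesseract 4.00, from the table above
# Assumption: Leptonica registers itself with pkg-config under the name "lept".
if installed="$(pkg-config --modversion lept 2>/dev/null)"; then
    if version_ge "$installed" "$required"; then
        echo "Leptonica $installed satisfies the minimum $required"
    else
        echo "Leptonica $installed is older than the required $required"
    fi
else
    echo "Leptonica was not found through pkg-config"
fi
```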
One option is to install the distro's Leptonica package:
sudo apt-get install libleptonica-dev
However, if you are using an older Linux distribution, the packaged Leptonica version may be too old, in which case you will need to build it from source.
The sources are at https://github.com/DanBloomberg/leptonica . The instructions for building are given in the Leptonica README.
Note that if you build Leptonica from source, you may need to ensure that /usr/local/lib is in your library path. This is a common Linux issue, and the information at Stack Overflow is very helpful.
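One way to handle the library path for the current shell session is to add /usr/local/lib to LD_LIBRARY_PATH (refreshing the linker cache with sudo ldconfig after installation is another common approach). The sketch below is only an illustration; add_lib_dir is a hypothetical helper name, and it avoids adding the same directory twice.

```shell
# Prepend a directory to LD_LIBRARY_PATH unless it is already listed.
# add_lib_dir is a hypothetical helper for illustration.
add_lib_dir() {
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$1:"*) ;;    # already present, nothing to do
        *) LD_LIBRARY_PATH="$1${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
    esac
    export LD_LIBRARY_PATH
}

add_lib_dir /usr/local/lib
echo "LD_LIBRARY_PATH is now: $LD_LIBRARY_PATH"
```

The setting lasts only for the current shell; add it to your shell profile if you want it to persist.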
Please follow instructions in Compiling--GitInstallation
Also read Install Instructions
Tesseract can be configured to install anywhere, which makes it possible to install it without root access.
To install it in $HOME/local:
./autogen.sh
./configure --prefix=$HOME/local/
make
make install
To install it in $HOME/local using Leptonica libraries also installed in $HOME/local:
./autogen.sh
LIBLEPT_HEADERSDIR=$HOME/local/include ./configure \
--prefix=$HOME/local/ --with-extra-libraries=$HOME/local/lib
make
make install
In some systems, you might also need to specify the path to pkg-config before running the configure script:
export PKG_CONFIG_PATH=$HOME/local/lib/pkgconfig
- Download the data file(s) for the language(s) you are interested in.
- Move it to the tessdata directory (e.g. mv tessdata $TESSDATA_PREFIX if TESSDATA_PREFIX is defined).
You can also use:
export TESSDATA_PREFIX=/some/path/to/tessdata
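to point to your tessdata directory. Which directory to name depends on the Tesseract version: Tesseract 3.x expects the parent of the tessdata directory (for example, if your tessdata path is '/usr/local/share/tessdata', use export TESSDATA_PREFIX='/usr/local/share/'), whereas Tesseract 4.0 and later expect the tessdata directory itself, as in the command above.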
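To confirm that a language file actually sits in the directory you intend to use, a small check like the following can help. It is a sketch under stated assumptions: has_language is a hypothetical helper, and the demonstration uses a throwaway directory with an empty placeholder file instead of a real tessdata directory.

```shell
# Return 0 if the traineddata file for language $2 exists in directory $1.
# has_language is a hypothetical helper, not a Tesseract command.
has_language() {
    [ -f "$1/$2.traineddata" ]
}

# Demonstration with a throwaway directory standing in for a real tessdata directory.
demo_dir="$(mktemp -d)"
: > "$demo_dir/eng.traineddata"      # empty placeholder, not a real model

for lang in eng deu; do
    if has_language "$demo_dir" "$lang"; then
        echo "$lang: found"
    else
        echo "$lang: missing from $demo_dir"
    fi
done
# rm -r "$demo_dir"    # clean up the demonstration directory when done
```

For real use, pass your own tessdata directory in place of demo_dir.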
!!! IMPORTANT !!! To use Tesseract in your application (to include Tesseract or to link it into your app) see this very simple example.
- Download the latest SW (Software Network, https://software-network.org/) client from https://software-network.org/client/.
- Run sw setup (may require administrator access).
- Run sw build org.sw.demo.google.tesseract.tesseract.
Today it is possible to build a full set of Tesseract training tools on Windows with Visual Studio. You need the latest VS compiler installed (VS 2019/2022, or the light VS 2019/2022 Build Tools distribution).
To do this:
- Download the latest SW (Software Network, https://software-network.org/client/) client from https://software-network.org/client/.
- Check out the Tesseract sources:
git clone https://github.com/tesseract-ocr/tesseract tesseract && cd tesseract
- Run sw build.
- Binaries will be available under .sw\out\some hash dir...
- Set up Vcpkg, the Visual C++ Package Manager.
- Run vcpkg install tesseract:x64-windows for 64-bit. Use --head for the main branch.
To build a self-contained tesseract.exe executable (without any DLLs or runtime dependencies), use Vcpkg as above with one of the following commands:
- vcpkg install tesseract:x64-windows-static for 64-bit
- vcpkg install tesseract:x86-windows-static for 32-bit
Use --head for the main branch. It may still require one DLL for the OpenMP runtime, vcomp140.dll (which you can find in the Visual C++ Redistributable 2015).
- Build and install Leptonica as described on its wiki
- Install ICU library for Visual Studio
chdir tesseract
cmake -Bbuild -DCMAKE_BUILD_TYPE=Release -DCMAKE_PREFIX_PATH=%INSTALL_DIR% -DCMAKE_INSTALL_PREFIX=%INSTALL_DIR% -DSW_BUILD=OFF -DBUILD_SHARED_LIBS=ON -DENABLE_LTO=ON -DBUILD_TRAINING_TOOLS=ON -DFAST_FLOAT=ON -DGRAPHICS_DISABLED=ON -DOPENMP_BUILD=OFF
cmake --build build --config Release --target install
This will create most of the training tools (excluding text2image, because its requirement, the Pango library, is not easy to build and install on Windows). For more details have a look at https://github.com/tesseract-ocr/tesseract/blob/main/.github/workflows/cmake-win64.yml
For development of Tesseract itself, follow these steps:
- Download and install Git and CMake, and put them in PATH.
- Download the latest SW (Software Network, https://software-network.org/) client from https://software-network.org/client/. SW is a source package distribution system.
- Add the SW client to PATH.
- Run sw setup (may require administrator access).
- If you have a release archive, unpack it to the tesseract dir. If you're using the main branch, run
git clone https://github.com/tesseract-ocr/tesseract tesseract
- Run
cd tesseract
mkdir build && cd build
cmake ..
- Build the solution (tesseract.sln) in your Visual Studio version. If you want to build and install from the command line (e.g. Release build), you can use this command:
cmake --build . --config Release --target install
If you want to install to a directory other than C:\Program Files (which requires admin rights), you need to specify the install path during configuration:
cmake .. -G "Visual Studio 15 2017 Win64" -DCMAKE_INSTALL_PREFIX=inst
To develop the training tools, after cloning the repository as described in the previous paragraph, run
sw build
You'll see a solution link appearing in the root directory of Tesseract.
If you're building with sw+cmake, run cmake as follows:
mkdir win64 && cd win64
cmake .. -G "Visual Studio 14 2015 Win64"
If you're building with sw, run sw generate; it will create a solution link for you (not yet implemented!).
If you have Visual Studio 2015, check out the https://github.com/peirick/VS2015_Tesseract repository, which provides Visual Studio 2015 projects for Tesseract and its dependencies, and click on build_tesseract.bat. After that you still need to download the language packs.
Have a look at the blog post How to build Tesseract 3.03 with Visual Studio 2013.
For tesseract-ocr 3.02, please follow the instructions in Visual Studio 2008 Developer Notes for Tesseract-OCR.
Download these packages from the Downloads Archive on SourceForge page:
- tesseract-3.01.tar.gz - Tesseract source
- tesseract-3.01-win_vs.zip - Visual studio (2008 & 2010) solution with necessary libraries
- tesseract-ocr-3.01.eng.tar.gz - English language file for Tesseract (or download other language training file)
Unpack them to one directory (e.g. tesseract-3.01). Note that tesseract-ocr-3.01.eng.tar.gz names the root directory 'tesseract-ocr' instead of 'tesseract-3.01'.
Windows relevant files are located in the vs2008 directory (e.g. tesseract-3.01\vs2008). The same build process as usual applies: Open tesseract.sln with VC++Express 2008 and build all (or just Tesseract). It should compile (in at least release mode) without having to install anything further. The dll dependencies and Leptonica are included. Output will be in tesseract-3.01\vs2008\bin (or tesseract-3.01\vs2008\bin.rd or tesseract-3.01\vs2008\bin.dbg based on configuration build).
For Mingw+Msys have a look at blog Compiling Leptonica and Tesseract-ocr with Mingw+Msys.
Download and install MSYS2 Installer from https://msys2.github.io/
The core packages groups you need to install if you wish to build from PKGBUILDs are:
- base-devel for any building
- msys2-devel for building msys2 packages
- mingw-w64-i686-toolchain for building mingw32 packages
- mingw-w64-x86_64-toolchain for building mingw64 packages
To build the tesseract-ocr release package, use PKGBUILD from https://github.com/Alexpux/MINGW-packages/tree/master/mingw-w64-tesseract-ocr
To build on Cygwin have a look at blog How to build Tesseract on Cygwin.
Tesseract as well as the training utilities for 3.04.00 onwards are available as Cygwin packages.
Tesseract specific packages to be installed:
tesseract-ocr 3.04.01-1
tesseract-ocr-eng 3.04-1
tesseract-training-core 3.04-1
tesseract-training-eng 3.04-1
tesseract-training-util 3.04.01-1
Mingw-w64 allows building 32- or 64-bit executables for Windows. It can be used for native compilations on Windows, but also for cross compilations on Linux (which are easier and faster than native compilations). Most large Linux distributions already contain packages with the tools needed for a cross build. Before building Tesseract, it is necessary to build some prerequisites.
For Debian and similar distributions (e.g. Ubuntu), the cross tools can be installed like this:
# Development environment targeting 32- and 64-bit Windows (required)
apt-get install mingw-w64
# Development tools for 32- and 64-bit Windows (optional)
apt-get install mingw-w64-tools
These prerequisites will be needed:
- libpng, libtiff, zlib (binaries for Mingw-w64 available as part of the GTK+ bundles)
- libicu
- liblcms2
- openjpeg
- leptonica
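When cross compiling with Autotools, the target is normally selected by passing a host triplet to configure. The sketch below is an illustration only: mingw_host is a hypothetical helper, and the compiler names it prints assume the usual naming used by the mingw-w64 packages, where the tools are prefixed with the triplet.

```shell
# Map a word size (32 or 64) to the Mingw-w64 host triplet.
# mingw_host is a hypothetical helper for illustration.
mingw_host() {
    case "$1" in
        32) echo "i686-w64-mingw32" ;;
        64) echo "x86_64-w64-mingw32" ;;
        *)  echo "unsupported word size: $1" >&2; return 1 ;;
    esac
}

host="$(mingw_host 64)"
# Assumption: the cross compiler is named after the triplet.
echo "configure option: --host=$host"
echo "expected C++ compiler: $host-g++"
```

The same triplet would have to be used for every prerequisite library, so that all of them are built for the same Windows target.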
Typically a package manager like Fink, Homebrew or MacPorts is needed in addition to Apple's Xcode.
Xcode and the related command line tools provide the compiler (llvm-gcc) and linker, as well as libraries like zlib. The package manager provides free software packages which are not part of Xcode.
The Xcode Command Line Tools can be installed by running xcode-select --install.
Note that Tesseract 4 can be built with OpenMP support, but that requires additional installations.
Fink (as of 2017-04) provides neither Leptonica nor the packages needed for the Tesseract training tools, so it cannot be recommended for building Tesseract.
Install OpenMP:
sudo port install libomp
The following method which gets, compiles and installs OpenMP manually should no longer be needed:
# Install cmake if it is not available.
sudo port install cmake
git clone https://github.com/llvm-mirror/openmp.git
cd openmp
mkdir build
cd build
cmake ..
make
sudo make install
sudo port install autoconf \
automake \
libtool \
pkgconfig \
leptonica
Compilation itself relies on the Autotools suite:
git clone https://github.com/tesseract-ocr/tesseract.git
cd tesseract
./autogen.sh
./configure
make
sudo make install
If you want support for multithreading, you have to install OpenMP first (see above) and tell the compiler and linker how to activate OpenMP support. This is done by adding that information to the options for configure:
./configure CXXFLAGS="-Xpreprocessor -fopenmp -I/opt/local/include/libomp -Wall -O2" LDFLAGS=-L/opt/local/lib/libomp LIBS=-lomp
If compilation fails at the make command, with libtool erring on missing instructions, you may be building with MacPorts' g++ compiler, which has known issues. The community recommends using clang, but a workaround for g++ is to re-configure the build:
./configure CXXFLAGS=-Wa,-q
And then proceed with make.
The steps above do not install the training tools. To install the training tools along with Tesseract, proceed as follows:
sudo port install cairo pango
sudo port install icu +devel
git clone https://github.com/tesseract-ocr/tesseract/
cd tesseract
./autogen.sh
./configure
make training
sudo make install training-install
# Packages which are always needed.
brew install automake autoconf libtool
brew install pkgconfig
brew install icu4c
brew install leptonica
# Packages required for training tools.
brew install pango
# Optional packages for extra features.
brew install libarchive
# Optional package for builds using g++.
brew install gcc
As of January 2017, Tesseract builds with clang, but OpenMP will only use a single thread, potentially reducing performance. If you really need OpenMP, install and use gcc.
git clone https://github.com/tesseract-ocr/tesseract/
cd tesseract
./autogen.sh
mkdir build
cd build
# Optionally add CXX=g++-8 to the configure command if you really want to use a different compiler.
../configure PKG_CONFIG_PATH=/usr/local/opt/icu4c/lib/pkgconfig:/usr/local/opt/libarchive/lib/pkgconfig:/usr/local/opt/libffi/lib/pkgconfig
make -j
# Optionally install Tesseract.
sudo make install
# Optionally build and install training tools.
make training
sudo make training-install
For cross-compiling, see the discussion in issue 2334. You need to specify the target this way:
./configure CXX="g++ --target=arm-apple-darwin64"
Tesseract can be built for Android as a static command-line executable tesseract, or you can use the Java binding to work with libtess from your Android app.
Currently, the easiest build method can be found in a tess-two fork. This fork contains both the tesseract and leptonica sources, so downloading the repository is enough. To build the command-line executable, you don't need the Android SDK or Android Studio; just install the Android NDK (r.20 has been tested) and run the ndk-build command, e.g.:
ndk-build -C tess-two-git/tess-two tesseract APP_ABI=arm64-v8a
The 4.1 branch is available, too. Note that performance may be significantly different:
> adb shell time tess3 --tessdata-dir tessdata3 eurotext.png txt3
Tesseract Open Source OCR Engine v3.05.00 with Leptonica
0m05.95s real 0m05.77s user 0m00.17s system
> adb shell time tess4 --tessdata-dir tessdata4 eurotext.png txt4
Tesseract Open Source OCR Engine v4.1.0 with Leptonica
0m59.07s real 0m58.56s user 0m00.45s system
> adb shell time tess4 --tessdata-dir tessdata3 eurotext.png txt42
Tesseract Open Source OCR Engine v4.1.0 with Leptonica
0m05.61s real 0m05.37s user 0m00.23s system
Another method of compiling is using the project Building for Android with Docker, which at the time of writing can produce shared libraries for the following versions and architectures:
Arch \ Version | 3.02.02 | 3.05.02 | 4.0.0 | 4.1.0 |
---|---|---|---|---|
armv7-a | ✔ | ✔ | ✔ | ✔ |
arm64-v8a | ✖ | ✔ | ✔ | ✔ |
x86 | ✔ | ✔ | ✔ | ✔ |
Compilation of the dependent libraries, leptonica and tiff, is included and handled as well.
Another method is to compile on a Linux machine with Android NDK r22 (22.1.7171670). This method compiles for the following versions and architectures:
Arch \ Version | 4.1.0 |
---|---|
armv7-a | ✔ |
arm64-v8a | ✔ |
x86 | ✔ |
x86_64 | ✔ |
These prerequisites will be needed:
- libjpeg - GitHub branch 2.1.1 - https://github.com/libjpeg-turbo/libjpeg-turbo
- libpng - GitHub branch v1.6.37 - https://github.com/glennrp/libpng
- libtiff - version 4.0.10 downloaded - https://download.osgeo.org/libtiff/
- leptonica - version 1.74.4 downloaded - https://github.com/DanBloomberg/leptonica
Compile Leptonica with:
./autobuild
./configure \
--host=$TARGET \
--disable-programs \
--without-giflib \
--without-libwebp \
--without-zlib \
--without-libopenjpeg \
--prefix $ROOT/output/$OUTARCH/
make -j && make install
Compile Tesseract with:
export API=23
export TOOLCHAIN=$ANDROID_NDK_HOME_22/toolchains/llvm/prebuilt/linux-x86_64
export ABI_CONFIGURE_HOST=$NDKTARGET
export AR=$TOOLCHAIN/bin/$NDKTARGET-ar
export CC=$TOOLCHAIN/bin/$TARGET$API-clang
export CXX=$TOOLCHAIN/bin/$TARGET$API-clang++
export AS=$CC
export LD=$TOOLCHAIN/bin/$TARGET-ld
export RANLIB=$TOOLCHAIN/bin/$NDKTARGET-ranlib
export STRIP=$TOOLCHAIN/bin/$NDKTARGET-strip
export LEPTONICA_LIBS="-L$ROOT/output/$OUTARCH/lib -llept"
export LEPTONICA_CFLAGS="-I$ROOT/output/$OUTARCH/include/leptonica"
export PKG_CONFIG_PATH="$ROOT/output/$OUTARCH/lib/pkgconfig"
export LIBS="-L$ROOT/output/$OUTARCH/lib"
make clean
./autogen.sh
./configure \
--host=$TARGET \
--disable-doc \
--without-archive \
--disable-openmp \
--without-curl \
--prefix $ROOT/output/$OUTARCH/
make -j
make install
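The commands above rely on the variables TARGET, NDKTARGET and OUTARCH, which are not defined in this section. One plausible way to set them per architecture is sketched below. This is an assumption, not the original author's setup: set_ndk_targets is a hypothetical helper, and the triples are the usual NDK toolchain names, with armv7 being the case where the clang prefix and the binutils prefix differ.

```shell
# Set TARGET (clang prefix) and NDKTARGET (binutils prefix) for an Android ABI.
# set_ndk_targets is a hypothetical helper; the values are assumptions based on
# the usual NDK toolchain naming, not taken from the instructions above.
set_ndk_targets() {
    case "$1" in
        armv7-a)   TARGET=armv7a-linux-androideabi; NDKTARGET=arm-linux-androideabi ;;
        arm64-v8a) TARGET=aarch64-linux-android;    NDKTARGET=aarch64-linux-android ;;
        x86)       TARGET=i686-linux-android;       NDKTARGET=i686-linux-android ;;
        x86_64)    TARGET=x86_64-linux-android;     NDKTARGET=x86_64-linux-android ;;
        *) echo "unknown ABI: $1" >&2; return 1 ;;
    esac
    OUTARCH="$1"    # assumption: one output directory per ABI
}

set_ndk_targets arm64-v8a
API=23
echo "CC would be: \$TOOLCHAIN/bin/${TARGET}${API}-clang"
echo "AR would be: \$TOOLCHAIN/bin/${NDKTARGET}-ar"
echo "Install prefix would be: \$ROOT/output/${OUTARCH}/"
```

Calling the helper once per architecture before the export lines would let the same build commands be reused for every row of the table.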
- To fix this error
./configure: line 4237: syntax error near unexpected token `-mavx,'
./configure: line 4237: `AX_CHECK_COMPILE_FLAG(-mavx, avx=1, avx=0)'
ensure that autoconf-archive is installed. Don't forget to run ./autogen.sh after the installation of autoconf-archive. Note this error happens often under CentOS, where autoconf-archive is missing and no package is available. Some projects help with installing. The latest code from GitHub does not require autoconf-archive.
- If configure fails with an error such as "configure: error: Leptonica 1.74 or higher is required.", try installing the libleptonica-dev package.
- If you are sure you have installed Leptonica (for example in /usr/local), then pkg-config is probably not looking at your install folder (check with pkg-config --variable pc_path pkg-config). A solution is to set PKG_CONFIG_PATH, for example: PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
- On some systems autotools does not create the m4 directory automatically (giving the error: "configure: error: cannot find macro directory 'm4'"). In this case you must create the m4 directory (mkdir m4), and then rerun the above commands starting with ./configure.