Commit
Merge pull request #331 from ifad/feature/tesseract-5
Update to Tesseract 5
tagliala authored Dec 28, 2024
2 parents 3274702 + ab55d46 commit 40f83ac
Showing 2 changed files with 18 additions and 11 deletions.
10 changes: 5 additions & 5 deletions Gemfile.lock
@@ -5,7 +5,7 @@ GEM
  byebug (11.1.3)
  concurrent-ruby (1.3.4)
  connection_pool (2.4.1)
- csv (3.3.0)
+ csv (3.3.2)
  daemons (1.4.1)
  date (3.4.1)
  diff-lcs (1.5.1)
@@ -29,7 +29,7 @@ GEM
  csv
  json (2.7.6)
  kgio (2.11.4)
- logger (1.6.2)
+ logger (1.6.4)
  mail (2.8.1)
  mini_mime (>= 0.1.1)
  net-imap
@@ -62,7 +62,7 @@ GEM
  rack-protection (3.2.0)
  base64 (>= 0.1.0)
  rack (~> 2.2, >= 2.2.4)
- rack-test (2.1.0)
+ rack-test (2.2.0)
  rack (>= 1.3)
  raindrops (0.20.1)
  rake (13.2.1)
@@ -113,9 +113,9 @@ GEM
  eventmachine (~> 1.0, >= 1.0.4)
  rack (>= 1, < 3)
  thor (1.3.2)
- tilt (2.4.0)
+ tilt (2.5.0)
  timecop (0.9.10)
- timeout (0.4.2)
+ timeout (0.4.3)
  tzinfo (2.0.6)
  concurrent-ruby (~> 1.0)
  unf (0.2.0)
19 changes: 13 additions & 6 deletions docker/colore/Dockerfile
@@ -4,16 +4,23 @@ RUN apt-get update && apt-get -yq install --no-install-suggests --no-install-rec
  build-essential \
  imagemagick \
  libmagic-dev \
- tesseract-ocr \
- tesseract-ocr-ara \
- tesseract-ocr-fra \
- tesseract-ocr-spa \
  wkhtmltopdf

  # Needed to get the latest libreoffice
  # Ref: https://wiki.debian.org/LibreOffice#Using_Debian_backports
- RUN echo 'deb http://deb.debian.org/debian bullseye-backports main contrib non-free' >> /etc/apt/sources.list
- RUN apt-get update && apt-get -yq -t bullseye-backports install libreoffice
+ RUN echo 'deb https://deb.debian.org/debian bullseye-backports main contrib non-free' >> /etc/apt/sources.list
+
+ # Needed for Tesseract 5
+ # Ref: https://notesalexp.org/tesseract-ocr/html/
+ RUN echo 'deb https://notesalexp.org/tesseract-ocr5/bullseye bullseye main' >> /etc/apt/sources.list
+ RUN wget -qO /etc/apt/trusted.gpg.d/alexp_key.asc https://notesalexp.org/debian/alexp_key.asc
+
+ RUN apt-get update && apt-get -yq -t bullseye-backports install \
+ libreoffice \
+ tesseract-ocr \
+ tesseract-ocr-ara \
+ tesseract-ocr-fra \
+ tesseract-ocr-spa

ARG TIKA_VERSION=3.0.0
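
Since the Dockerfile change switches the package source for Tesseract, one way to confirm that the built image actually picked up version 5 is to inspect the first line of `tesseract --version`. The sketch below is not part of this commit. The helper names `tesseract_major` and `check_tesseract_5` are hypothetical, and it assumes the first line of the version output looks like `tesseract 5.x.y` (optionally with a leading `v` before the number).

```shell
#!/bin/sh
# Hypothetical helper (not in the repository): print the major version
# found in a line such as "tesseract 5.3.0". Prints nothing if the line
# does not match the assumed format.
tesseract_major() {
  printf '%s\n' "$1" | sed -n '1s/^tesseract v\{0,1\}\([0-9][0-9]*\)\..*$/\1/p'
}

# Hypothetical check: returns 0 when the installed tesseract reports
# major version 5 or newer, non-zero otherwise (including when the
# binary is missing or the output format is unexpected).
check_tesseract_5() {
  command -v tesseract >/dev/null 2>&1 || return 1
  version_line=$(tesseract --version 2>&1 | head -n 1)
  major=$(tesseract_major "$version_line")
  [ "${major:-0}" -ge 5 ]
}
```

Inside a container built from the image, `check_tesseract_5 && echo "Tesseract 5 or newer"` would report whether the new repository took effect. How the container is started depends on the project's own tooling and is left out here.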
