documentscanner allows you to transform (almost) any ADF scanner into a document scanner that produces OCRed PDFs. All you need is:
- a SANE-compatible ADF scanner
- a Raspberry Pi
- (optional) a more powerful host to run the OCR tasks
- Check out documentscanner onto a Raspberry Pi:
$ git clone https://github.com/BastianPoe/documentscanner.git ; cd documentscanner
- Install SANE and other dependencies:
$ apt-get install sane sane-utils bash unpaper tesseract-ocr tesseract-ocr-deu imagemagick bc poppler-utils findutils scanbd
- Install scanbd script:
$ mkdir -p /etc/scanbd/scripts ; cp scanbd/test.script /etc/scanbd/scripts/
- Enable scanbd:
$ systemctl enable scanbd
- Restart scanbd:
$ systemctl restart scanbd
- Create inbox and outbox:
$ mkdir -p /inbox /outbox
- Start document processor:
$ cd scripts ; ./process.sh /inbox /outbox
- Done
- Check if SANE recognizes your scanner:
$ scanimage -L
- Check the logs of scanbd. You should see log output whenever you press a button:
$ journalctl -f
- Modify the events that scanbd reacts to in /etc/scanbd/scripts/test.script (currently: scan and email)
- Check if scanned raw documents end up in /inbox
- Check the logfiles of the processor
- Check if PDFs end up in /outbox
documentscanner uses scanbd to wait for someone to press a button on the scanner. This triggers the script in /etc/scanbd/scripts/test.script, which determines which button has been pressed. The script calls /home/pi/documentscanner/scripts/scan.sh, which scans all available pages into a folder in /inbox. After the scan completes, a file called complete is placed in the scan directory.
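The "scan first, flag last" idea can be sketched as below. This is not the project's scan.sh; the function name scan_to_inbox and the scan_XXXXXX folder naming are assumptions made for illustration. The scan command is passed in as arguments, and the sketch assumes it exits non-zero when the scan fails.

```shell
#!/usr/bin/env bash
# Sketch only (hypothetical helper, not the project's scan.sh):
# run a scan command inside a fresh folder in the inbox and create the
# "complete" flag only if that command succeeded.

scan_to_inbox() {
    inbox="$1"
    shift
    mkdir -p "$inbox" || return 1
    # Fresh, uniquely named folder per scan (naming scheme is an assumption).
    target=$(mktemp -d "$inbox/scan_XXXXXX") || return 1
    # Run the remaining arguments as the scan command inside the folder.
    if ( cd "$target" && "$@" ); then
        touch "$target/complete"
        printf '%s\n' "$target"
    else
        # No flag is written, so the processor will ignore this folder.
        return 1
    fi
}
```

With SANE, the scan command could be something like scanimage --batch=page_%03d.pnm --format=pnm plus whatever device-specific options your scanner needs to select the ADF; those options differ between backends, so check the output of scanimage --help for your device.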
The processor checks /inbox every 10s, and if there is a new document with the complete flag, the document is processed. First, identify is used with a heuristic to detect and remove empty pages. Then, each page is processed using unpaper to remove the background, etc. Subsequently, the pages are OCRed using tesseract and converted to PDFs. Finally, the individual PDFs are joined into one using pdfunite and the scan directory is deleted.
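The polling step can be sketched roughly as follows. This is not the project's process.sh; the function name find_complete_scans is hypothetical, and the loop at the end is left commented out and only prints what it would do instead of running the actual unpaper, tesseract and pdfunite steps.

```shell
#!/usr/bin/env bash
# Sketch only (hypothetical helper, not the project's process.sh):
# print every scan folder in the inbox that carries the "complete" flag.

find_complete_scans() {
    inbox="$1"
    for dir in "$inbox"/*/; do
        # Skips folders without the flag, and the unexpanded glob
        # when the inbox contains no folders at all.
        [ -f "${dir}complete" ] || continue
        printf '%s\n' "${dir%/}"
    done
}

# Hypothetical polling loop, checking every 10 seconds as described above:
# while true; do
#     find_complete_scans /inbox | while read -r scan; do
#         echo "would process: $scan"
#     done
#     sleep 10
# done
```

Checking for the flag file rather than for the folder itself is what keeps the processor from picking up a scan that is still in progress.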
Incomplete scans (e.g. those where the ADF pulled multiple pages at once) are aborted, never receive the complete flag, and hence are not processed. Check /inbox from time to time to see which documents have ended up there and delete them.
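A small helper can make that periodic check easier. The sketch below is an assumption for illustration, not part of the repository; list_incomplete_scans is a hypothetical name. It only lists folders and leaves deletion to you.

```shell
#!/usr/bin/env bash
# Sketch only (hypothetical helper): print every scan folder in the inbox
# that does NOT carry the "complete" flag.

list_incomplete_scans() {
    inbox="$1"
    for dir in "$inbox"/*/; do
        # Skip the unexpanded glob when the inbox has no folders.
        [ -d "$dir" ] || continue
        # Folders with the flag are the processor's business, not ours.
        [ -f "${dir}complete" ] && continue
        printf '%s\n' "${dir%/}"
    done
}

# Example: list_incomplete_scans /inbox
```

Note that a scan which is still running also has no flag yet, so review the list before deleting anything.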
I run the processor in a Docker container on my Synology NAS. This is much faster than running it on the Raspberry Pi and does not slow down subsequent scans. The required setup steps are quite easy:
- Create a new shared directory on your NAS and expose it via NFS to your Raspberry Pi
- Install autofs:
$ apt-get install autofs
- Add NFS mounting to /etc/auto.misc:
documentarchive -rw,soft,intr,rsize=8192,wsize=8192 192.168.1.26:/volume1/documentarchive
- Enable auto.misc by adding the following line to /etc/auto.master:
/misc /etc/auto.misc
- Edit your /etc/scanbd/scripts/test.script to place scans into your output folder, e.g.:
FOLDER="/misc/documentarchive/scans_raw"
- Pull bastianpoe/document_archive into the Docker Station on your NAS
- Map /inbox onto the NFS share created above and /outbox onto where the PDFs shall be stored
- Start the Docker container
- Done