Sean DeNigris edited this page Apr 22, 2018 · 1 revision

Ghostscript [1]

  • More efficient than ImageMagick, which can make a big difference for bulk jobs or large PDFs [3]
# GS option explanations:
# -dSAFER \ # prevent unwanted file writing
# -dQUIET \ # suppress some console output
# -sDEVICE=png16m \ # Save in 24-bit color; -sDEVICE=pnggray -dTextAlphaBits=4 worked equally well. tiffg4 (from [GH](https://gist.github.com/henrik/1967035)) *is for B&W* and was awful on grayscale input; tiff12nc caused an error, apparently because it's not compressed
# -dINTERPOLATE \ # Designed to improve the quality of images that have been upscaled from a smaller size
# -dNumRenderingThreads=8 \ # Ghostscript recommends setting this option to the maximum number of available cores
# -r300 \ # Ghostscript's default resolution of 72dpi is much too low for good Tesseract results; 300dpi seems to be the consensus
# -c 30000000 setvmthreshold -f \ # Boost performance with 30MB extra RAM for complex PDFs

gs -o ./output_image.png -dSAFER -dQUIET -sDEVICE=png16m -dINTERPOLATE -dNumRenderingThreads=8 -r300 -c 30000000 setvmthreshold -f ./Orders\ 1.pdf
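
The command above names a single output file, which is fine for a one-page PDF. For multi-page input, Ghostscript expands a printf-style `%d` in the output file name into the page number, so each page gets its own image. A minimal sketch, assuming bash; `page_pattern` and `pdf_to_pngs` are hypothetical helper names, and the `NAME-001.png` naming scheme is just one possible choice:

```shell
#!/usr/bin/env bash
# Sketch: render every page of a PDF to its own PNG for OCR.
# page_pattern and pdf_to_pngs are hypothetical names, not part of Ghostscript.

# Build the output pattern for a PDF, e.g. "./Orders 1.pdf" -> "Orders 1-%03d.png".
page_pattern() {
  local base
  base="$(basename "$1" .pdf)"
  printf '%s-%%03d.png\n' "$base"
}

# Render all pages of PDF ($1) into OUT_DIR ($2, defaults to the current directory).
pdf_to_pngs() {
  local pdf="$1"
  local out_dir="${2:-.}"
  # Same options as the command above; only the output name changes.
  gs -o "$out_dir/$(page_pattern "$pdf")" \
     -dSAFER -dQUIET -sDEVICE=png16m -dINTERPOLATE \
     -dNumRenderingThreads=8 -r300 \
     -c 30000000 setvmthreshold -f "$pdf"
}
```

Usage would look like `mkdir -p pages && pdf_to_pngs "./Orders 1.pdf" pages`. Zero-padding the page number keeps the files in page order when they are later listed with a glob.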

ImageMagick Convert [2]

# "-strip -background white -alpha off" removes any alpha channels, and makes the background white. Tesseract is rather picky about this kind of thing.
# -density 300 # ImageMagick's default density of 72dpi is much too low for good Tesseract results; 300dpi seems to be the consensus
# -depth 8 # Tesseract can't handle 32 bit depth (default)
convert -density 300 Orders\ 1.pdf -depth 8 -strip -background white -alpha off file.tiff
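
The conversion and the OCR step can be chained in one small script. A minimal sketch, assuming bash and that `convert` and `tesseract` are on the PATH; `tiff_name_for` and `ocr_pdf` are hypothetical helper names, and writing the text file next to the TIFF is an arbitrary choice:

```shell
#!/usr/bin/env bash
# Sketch: convert a PDF to a Tesseract-friendly TIFF, then OCR it.
# tiff_name_for and ocr_pdf are hypothetical helper names.

# Derive the TIFF name from the PDF name, e.g. "Orders 1.pdf" -> "Orders 1.tiff".
tiff_name_for() {
  printf '%s.tiff\n' "${1%.pdf}"
}

ocr_pdf() {
  local pdf="$1"
  local tiff
  tiff="$(tiff_name_for "$pdf")"
  # Same options as the convert command above.
  convert -density 300 "$pdf" -depth 8 -strip -background white -alpha off "$tiff"
  # Tesseract appends ".txt" to the output base name it is given.
  tesseract "$tiff" "${tiff%.tiff}"
}
```

Calling `ocr_pdf "Orders 1.pdf"` would leave `Orders 1.tiff` and `Orders 1.txt` in the same directory as the PDF.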

Stitching the PDF pages back together

See https://ryanfb.github.io/etc/2014/11/13/command_line_ocr_on_mac_os_x.html
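
One possible approach, not necessarily the one the linked article takes: have Tesseract write a searchable PDF per page image, then merge those PDFs with Ghostscript's `pdfwrite` device. A minimal sketch, assuming bash and a directory holding one PNG per page with zero-padded page numbers in the names; `ocr_base_for` and `stitch_pages` are hypothetical helper names:

```shell
#!/usr/bin/env bash
# Sketch: OCR each page image to a searchable PDF, then merge the pages.
# ocr_base_for and stitch_pages are hypothetical helper names.

# Output base for a page image, e.g. "pages/Orders 1-001.png" -> "pages/Orders 1-001".
# Tesseract appends ".pdf" to this base itself.
ocr_base_for() {
  printf '%s\n' "${1%.png}"
}

# OCR every PNG in PAGE_DIR ($1) and merge the results into COMBINED ($2).
stitch_pages() {
  local page_dir="$1"
  local combined="$2"
  local png
  for png in "$page_dir"/*.png; do
    # The trailing "pdf" asks Tesseract for a PDF with a text layer.
    tesseract "$png" "$(ocr_base_for "$png")" pdf
  done
  # Zero-padded page numbers keep the glob in page order.
  gs -o "$combined" -dQUIET -sDEVICE=pdfwrite "$page_dir"/*.pdf
}
```

Usage would look like `stitch_pages pages "Orders 1 (OCR).pdf"`. Writing the combined file outside the page directory keeps it out of the `*.pdf` glob on a second run.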

Footnotes

  1. https://mazira.com/blog/optimal-image-conversion-settings-tesseract-ocr
  2. https://diging.atlassian.net/wiki/spaces/DCH/pages/5275668/Tutorial+Text+Extraction+and+OCR+with+Tesseract+and+ImageMagick
  3. http://bertanguven.com/faster-conversions-from-pdf-to-pngjpeg-imagemagick-vs-ghostscript