PDFs

Ghostscript [1]

More efficient than ImageMagick, which can make a big difference for bulk or large PDFs [3]

# Ghostscript option explanations:
# -dSAFER \ # restrict file access to prevent unwanted file writing
# -dQUIET \ # suppress routine console output
# -sDEVICE=png16m \ # save in 24-bit color. -sDEVICE=pnggray -dTextAlphaBits=4 worked equally well; tiffg4 (from [GH](https://gist.github.com/henrik/1967035)) *is for B&W* and was awful on grayscale input; tiff12nc caused an error, apparently because it's not compressed
# -dINTERPOLATE \ # designed to improve the quality of images that have been upscaled from a smaller size
# -dNumRenderingThreads=8 \ # Ghostscript recommends setting this option to the maximum number of available cores
# -r300 \ # Ghostscript's default of 72 dpi is much too low for good Tesseract results; 300 dpi seems to be the consensus
# -c 30000000 setvmthreshold -f \ # boost performance on complex PDFs by allowing about 30 MB of extra RAM before garbage collection

gs -o ./output_image.png -dSAFER -dQUIET -sDEVICE=png16m -dINTERPOLATE -dNumRenderingThreads=8 -r300 -c 30000000 setvmthreshold -f ./Orders\ 1.pdf
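
For bulk work, the single command above can be wrapped in a loop. The following is only a sketch: `pdf_to_pngs` and the directory arguments are hypothetical names, and the option list is trimmed to the essentials. It relies on Ghostscript expanding `%03d` in the output name to the page number, so a multi-page PDF produces one PNG per page instead of a single file.

```shell
# Render every page of each PDF in a directory to its own 300 dpi PNG.
# pdf_to_pngs is a hypothetical helper name used for this sketch.
pdf_to_pngs() {
    src_dir=$1
    out_dir=$2
    mkdir -p "$out_dir" || return 1
    for pdf in "$src_dir"/*.pdf; do
        [ -e "$pdf" ] || continue   # skip when the glob matches nothing
        base=$(basename "$pdf" .pdf)
        # Ghostscript replaces %03d with the page number (001, 002, ...).
        gs -o "$out_dir/${base}_%03d.png" \
           -dSAFER -dQUIET -sDEVICE=png16m -r300 \
           "$pdf" || return 1
    done
}

# Usage: pdf_to_pngs ./pdfs ./pages
```

Quoting every expansion keeps file names with spaces, such as `Orders 1.pdf`, intact.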

ImageMagick Convert [2]

# -strip -background white -alpha off # -strip drops profiles and comments; -background white -alpha off removes any alpha channel and makes the background white. Tesseract is rather picky about this kind of thing.
# -density 300 # ImageMagick's default of 72 dpi is much too low for good Tesseract results; 300 dpi seems to be the consensus
# -depth 8 # Tesseract can't handle 32-bit depth (default)
convert -density 300 Orders\ 1.pdf -depth 8 -strip -background white -alpha off file.tiff
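
ImageMagick 7 installs its entry point as `magick`, while version 6 only provides `convert`. The sketch below picks whichever is present and then runs the same options as the command above. `im_cmd` and `pdf_to_tiff` are hypothetical helper names, not part of ImageMagick.

```shell
# Print the name of the installed ImageMagick entry point:
# `magick` (version 7) or the older `convert` (version 6).
im_cmd() {
    if command -v magick >/dev/null 2>&1; then
        echo magick
    elif command -v convert >/dev/null 2>&1; then
        echo convert
    else
        echo "ImageMagick not found" >&2
        return 1
    fi
}

# pdf_to_tiff INPUT.pdf OUTPUT.tiff (hypothetical helper)
pdf_to_tiff() {
    im=$(im_cmd) || return 1
    "$im" -density 300 "$1" -depth 8 -strip -background white -alpha off "$2"
}

# Usage: pdf_to_tiff "Orders 1.pdf" file.tiff
```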

Stitching the PDF pages back together

See https://ryanfb.github.io/etc/2014/11/13/command_line_ocr_on_mac_os_x.html
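
One possible way to do this, not necessarily the approach in the linked post: have Tesseract write a searchable one-page PDF for each page image, then concatenate those with Ghostscript's `pdfwrite` device. `ocr_and_merge` is a hypothetical helper name, and the sketch assumes the page images are PNGs with zero-padded page numbers in their names.

```shell
# ocr_and_merge PAGES_DIR MERGED.pdf (hypothetical helper)
# OCR each page PNG into a one-page PDF, then join them in filename order.
# Keep MERGED.pdf outside PAGES_DIR so a re-run does not merge it into itself.
ocr_and_merge() {
    pages_dir=$1
    merged=$2
    for png in "$pages_dir"/*.png; do
        [ -e "$png" ] || continue   # skip when the glob matches nothing
        # The trailing "pdf" selects PDF output; tesseract adds ".pdf"
        # to the output base name itself.
        tesseract "$png" "${png%.png}" pdf || return 1
    done
    # The shell expands the glob in sorted order, so zero-padded page
    # numbers (001, 002, ...) keep the pages in sequence.
    gs -o "$merged" -dQUIET -sDEVICE=pdfwrite "$pages_dir"/*.pdf
}

# Usage: ocr_and_merge ./pages ./merged.pdf
```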
