how to increase processing speed of tesseract OCR? #160
Comments
More CPU cores, more RAM, multi-threading? Keep one instance of the Tesseract engine to process several images instead of instantiating it anew for each image. Use GS (Ghostscript) to convert PDFs for speed. Other users/developers, please chime in.
New release 4.4.1 bundles |
@svijayakumar1 |
Quan |
@Yogeshmsharma-architect Yes, setup and shutdown of the OCR engine for each image could take significant amounts of time. If you can send in a list of images to be processed all at once, it could help. There's a Or you can extend or come up with an alternative implementation of If PDFBox is faster than GS for you, then, by all means, stick with it. Our own experience showed that GS has generally been faster. |
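To illustrate the point about avoiding repeated setup and shutdown, here is a minimal sketch of a fixed pool of engines that are created once and then reused for a whole list of images. It is not Tess4J code: `OcrEngine`, `FakeEngine`, `enginesCreated`, and `ocrAll` are hypothetical names made up for this example, and the fake engine only stands in for a real OCR engine so the sketch runs on its own. It also assumes that a single engine instance should not be shared between threads at the same time, which is why each task borrows one from a queue.

```kotlin
import java.util.concurrent.ArrayBlockingQueue
import java.util.concurrent.Callable
import java.util.concurrent.Executors
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical stand-in for an OCR engine; a real version would wrap the actual engine.
interface OcrEngine {
    fun recognize(imageName: String): String
}

// Counts how many engines are created, so the reuse is visible.
val enginesCreated = AtomicInteger(0)

class FakeEngine : OcrEngine {
    init { enginesCreated.incrementAndGet() } // stands in for the costly engine setup
    override fun recognize(imageName: String) = "text of $imageName"
}

/** Runs OCR over all images with a fixed pool of engines that are created once and reused. */
fun ocrAll(images: List<String>, poolSize: Int, newEngine: () -> OcrEngine): List<String> {
    val engines = ArrayBlockingQueue<OcrEngine>(poolSize)
    repeat(poolSize) { engines.put(newEngine()) }
    val executor = Executors.newFixedThreadPool(poolSize)
    try {
        val tasks = images.map { image ->
            Callable {
                val engine = engines.take()   // borrow an engine; blocks while all are busy
                try {
                    engine.recognize(image)
                } finally {
                    engines.put(engine)       // hand it back for the next image
                }
            }
        }
        // invokeAll returns the futures in the same order as the submitted tasks.
        return executor.invokeAll(tasks).map { it.get() }
    } finally {
        executor.shutdown()
    }
}

fun main() {
    val results = ocrAll((1..20).map { "page-$it.png" }, poolSize = 4) { FakeEngine() }
    println("${results.size} images processed with ${enginesCreated.get()} engines")
}
```

The pool size is a tuning knob. Matching it to the number of CPU cores is a reasonable starting point, but the best value depends on the machine and the images.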
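Here is what I did: extract the pages from the PDF in parallel, one page per core, then pass every page image to a callback for further processing. The following sample is written in Kotlin:

import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.rendering.ImageType
import org.apache.pdfbox.rendering.PDFRenderer
import java.awt.image.BufferedImage
import java.io.File
import java.io.IOException
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit

/**
 * Converts PDF pages to BufferedImages and hands each one to [onImageExtracted].
 * Assumes a `logger` with an SLF4J-style `error(String, Throwable)` method is defined elsewhere.
 */
@Throws(IOException::class)
fun convertPdfToBufferedImages(inputPdfFile: File, onImageExtracted: (BufferedImage, pageIndex: Int) -> Unit) {
    // Fixed pool of 8 worker threads; adjust to the number of available cores.
    val executor = Executors.newFixedThreadPool(8)
    PDDocument.load(inputPdfFile).use { document ->
        val pdfRenderer = PDFRenderer(document)
        val numberOfPages = document.numberOfPages
        val out = Array<BufferedImage?>(numberOfPages) { null }
        out.forEachIndexed { pageIndex, _ ->
            executor.submit {
                try {
                    // Render the page at 300 DPI in grayscale.
                    val pageImage = pdfRenderer.renderImageWithDPI(pageIndex, 300f, ImageType.GRAY)
                    out[pageIndex] = pageImage
                    onImageExtracted(pageImage, pageIndex)
                } catch (e: IOException) {
                    logger.error("Error extracting PDF Document pageIndex $pageIndex=> $e", e)
                }
            }
        }
        // Wait for all pages to finish before the document is closed.
        executor.shutdown()
        executor.awaitTermination(5, TimeUnit.HOURS)
    }
}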
@ChristianSchwarz Thanks for the example! Do you then call a single instance of Tess4J, or do you also keep 8 instances in a pool?
Hi Quan,
I hope you're doing well. I have developed a Tesseract OCR application in Spring Boot. The application must process 600,000 scanned PDF images. I am currently using Tess4J 4.4.0. It takes 1 hour to process 275 PDFs, which comes to 6,600 PDFs per day. Could you please suggest a way to increase the processing speed of Tesseract OCR so that the scanning can be completed sooner? I need to finish this task as early as possible. Please help.
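For a sense of scale, the figures in the question can be worked through in a few lines of Kotlin. This is only back-of-the-envelope arithmetic: `daysNeeded` is a helper name made up for this sketch, and the 8x figure assumes that throughput would scale perfectly across 8 parallel workers, which real workloads rarely achieve.

```kotlin
import kotlin.math.ceil

/** Days needed to process `total` files at `perHour` files per hour, running around the clock. */
fun daysNeeded(total: Int, perHour: Int): Double = total.toDouble() / (perHour * 24)

fun main() {
    val perDay = 275 * 24                   // 6,600 PDFs per day, as stated in the question
    val days = daysNeeded(600_000, 275)     // roughly 90.9 days at the current rate
    println("per day: $perDay, days at current rate: ${ceil(days).toInt()}")

    // Hypothetical best case: 8 workers with perfect scaling (an assumption, not a measurement).
    val daysWith8x = daysNeeded(600_000, 275 * 8)
    println("days with 8x throughput: ${ceil(daysWith8x).toInt()}")
}
```

At the current rate the whole batch would take about three months of continuous processing, which is why the per-image setup cost and the degree of parallelism matter so much here.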