ProcessPage() generates a corrupt file #271
Comments
IMO |
Can you submit a PR to fix this? If not I'll try to do it whenever I have some time |
I do not have time either, but IMO there should be some discussion about what the expected goal is (input and output)... Just adding renderer->BeginDocument(...) and renderer->EndDocument() could be easy. This could be used for case If somebody wants to OCR one multipage TIFF and receive one PDF then solution is to use If somebody wants to do everything in memory (PIL -> OCR -> PDF) then more development needs to be done (maybe on the Tesseract side too). Another interesting idea would be to play with HOCR/ALTO(?) output (e.g. create a PDF with Python, add the input image and the hOCR result, maybe highlight areas with low confidence, etc.) |
Indeed,
renderer = api.GetRenderer(path)
with renderer:  # calls renderer.BeginDocument()
    # do stuff...
# renderer.EndDocument() called
I'm not familiar with the API so this is just a rough draft |
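The "with renderer" idea above can be sketched without touching tesserocr at all. This is only an illustration under stated assumptions: open_document and RecordingRenderer are hypothetical names, and the stand-in class merely borrows the BeginDocument / AddImage / EndDocument method names from Tesseract's renderer interface. Whether tesserocr exposes a renderer object with these methods is an assumption here, not something this thread confirms.

```python
from contextlib import contextmanager


@contextmanager
def open_document(renderer, title):
    """Call BeginDocument on entry and EndDocument on exit, even on error."""
    if not renderer.BeginDocument(title):
        raise RuntimeError("renderer refused to start the document")
    try:
        yield renderer
    finally:
        # Always close the document so the output file gets its trailer.
        renderer.EndDocument()


class RecordingRenderer:
    """Hypothetical stand-in for a real renderer; it only records calls."""

    def __init__(self):
        self.calls = []

    def BeginDocument(self, title):
        self.calls.append(("BeginDocument", title))
        return True

    def AddImage(self, page):
        self.calls.append(("AddImage", page))
        return True

    def EndDocument(self):
        self.calls.append(("EndDocument",))
        return True


renderer = RecordingRenderer()
with open_document(renderer, "My title") as doc:
    for page in ("page-0", "page-1"):
        doc.AddImage(page)

print(renderer.calls)
```

The point of the try/finally is that a page failing halfway through still closes the document, which is the step whose absence would leave a PDF unfinished.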
With PR #277 this should work:
import tesserocr
from PIL import Image

image_filename = "5.png"
img = Image.open(image_filename)
# tessdata_path must point at your tessdata directory
with tesserocr.PyTessBaseAPI(path=tessdata_path) as api:
    api.SetVariable("tessedit_create_pdf", "true")
    api.SetVariable("tessedit_create_hocr", "true")
    api.SetVariable("tessedit_create_alto", "true")
    api.ProcessPage(outputbase="test1",
                    image=img,
                    page_index=0,
                    filename=image_filename,
                    title="this will be title") |
Of course, implementing something that can easily process a multipage TIFF (or a list of filenames) would take more work.
import tesserocr
from PIL import Image, ImageSequence

filename = "multipage.tif"
title = "My title"
outputbase = "ocr_result"
im = Image.open(filename)
# tessdata_path must point at your tessdata directory
with tesserocr.PyTessBaseAPI(path=tessdata_path) as api:
    api.SetVariable("tessedit_create_pdf", "true")
    renderer = api.GetRenderer()
    renderer.BeginDocument(title)
    # Feed every frame of the TIFF to the same renderer
    for page_index, img in enumerate(ImageSequence.Iterator(im)):
        api.ProcessPage(img,
                        page_index,
                        outputbase,
                        renderer,
                        retry_config=None,
                        timeout=0)
    renderer.EndDocument() |
Thanks! As soon as it is merged and updated on conda-forge (which is on 2.5.1 atm) I'll try it this way. |
I also do not think #277 is a good solution. It makes tesserocr deviate from Tesseract's API unexpectedly:
Since tesserocr already wraps IINM this is merely a documentation issue (but perhaps we should unexpose |
Yes, this is open for discussion, but for me none of the existing solutions is ideal, e.g. ProcessPages can be used only with files as input. If you have an in-memory object, bad luck: write it to disk first. Also I do not like that |
@zdenop thanks for your explanation, I had completely overlooked that aspect. Indeed, |
I'm trying to OCR a PIL image and create a searchable PDF from that image. According to the documentation, I should use ProcessPage() to generate the PDF files. However, every file that is created is corrupted or damaged.
The code is as follows:
with tesserocr.PyTessBaseAPI() as api:
    api.SetVariable("tessedit_create_pdf", "true")
    api.SetImage(img)
    api.ProcessPage(outputbase=img_name, image=img, page_index=0, filename=img_name)
The PDF is then created, but it says that the file is corrupted.
I've also tried to use ProcessPages() with an image file, but once again the generated PDF is corrupted.
I've found issue #167, but it doesn't explain what page_index should be or what filename should be set to. The documentation isn't clear on the correct order in which to call ProcessPage(): should I call GetUTF8Text first, or can ProcessPage() be called in any order? What is the correct usage?
Also, if I want to store the OCR result as a string variable and also create a searchable PDF, should I call GetUTF8Text and ProcessPage() individually, which will result in the OCR being processed twice, or is there a way to get it done without the extra processing?
Thanks for the help.