PDF Text scanner missing line breaks and space #39

Open
omerabbas01 opened this issue Jun 15, 2012 · 4 comments

Comments

@omerabbas01

Hello,

Thank you for providing such a beautiful framework for handling PDFs. It has saved me a lot of time and helped me a lot. I noticed a few things while building a custom text-highlighting feature that highlights text as it is read aloud by text-to-speech. I use an NSRange to determine which part of the string to highlight, and that works well so far. However, there are some issues with the text scanned from the PDF: spaces between some words are missing, and line breaks are missing as well.

I have never worked with PDFs before and don't know much about the format, but I am now looking into how things work. I found that you are using CGPDFScannerRef to scan text from the PDF, so there must be something I can do to get better text. Could you please point me to where I should look, and to any tutorial about CGPDFScannerRef?

Thank you!

@KurtCode
Owner

Spaces and line breaks may not (and most likely will not) be represented as characters in text objects. Instead, while drawing the document, you are instructed to move the current point of focus (the "cursor"), for example 12 points to the right, which produces the gap between two words.

As I recall, the width of a space is not included in the font, so you would have to listen for the operators that change the text matrix and decide whether the horizontal translation is large enough to count as a space character. There are separate operators for newlines, so that part is easy to implement.
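A rough sketch of that decision, written in Python so it stays independent of the framework. The `Run` record, the `join_runs` helper, and both threshold values are assumptions for illustration only; in practice the positions would come from whatever text-matrix changes the scanner callbacks report, and the thresholds would need tuning per font.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One string drawn by a text-showing operator (hypothetical record)."""
    text: str
    x: float      # horizontal start of the run, in points
    y: float      # baseline position, in points
    width: float  # horizontal advance of the run, in points


def join_runs(runs, font_size=12.0, space_ratio=0.25):
    """Rebuild plain text, inferring spaces and newlines from positions.

    space_ratio is an assumed heuristic: a horizontal gap wider than
    space_ratio * font_size is treated as a space, anything smaller as
    kerning. A baseline shift of more than half the font size is
    treated as a new line.
    """
    out = []
    prev = None
    for run in runs:
        if prev is not None:
            gap = run.x - (prev.x + prev.width)
            if abs(run.y - prev.y) > 0.5 * font_size:
                out.append("\n")
            elif gap > space_ratio * font_size:
                out.append(" ")
        out.append(run.text)
        prev = run
    return "".join(out)


runs = [
    Run("Hello", 72.0, 700.0, 27.0),
    Run("wor", 104.0, 700.0, 17.0),  # 5 pt gap: wide enough for a space
    Run("ld", 121.2, 700.0, 10.0),   # 0.2 pt gap: kerning, no space
    Run("Next", 72.0, 686.0, 22.0),  # baseline moved down: new line
]
print(join_runs(runs))
```

The fixed ratio is the weak point: a better estimate would scale with the current font and text matrix rather than a single constant.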

Hope this helps.

@omerabbas01
Author

Thank you for the reply. It seems like this is going to be a tough job; I haven't looked into fonts yet. I'll look into it and let you know if I succeed.

Thank you

@KurtCode
Owner

Sure, working with PDFs gets complicated sometimes.


@hugo53

hugo53 commented Nov 8, 2013

@omerabbas01 Have you resolved your problem? I am stuck on this issue and am using a temporary workaround: I split a multi-word keyword into separate words, search for each word, then run some fairly complex code to locate the right position for every word in the keyword. After that, I draw all the result frames.
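One possible reading of that workaround, sketched in Python rather than in the framework's own code. The `find_keyword` helper and its `max_gap` parameter are hypothetical names introduced here; the sketch only finds character ranges in the extracted string and does not cover mapping them to on-page frames.

```python
def find_keyword(text, keyword, max_gap=1):
    """Return (start, end) ranges where the keyword's words occur in order.

    Consecutive words may be separated by up to max_gap whitespace
    characters, or by nothing at all when the scanner dropped the space.
    """
    words = keyword.split()
    if not words:
        return []
    matches = []
    start = text.find(words[0])
    while start != -1:
        pos = start + len(words[0])
        ok = True
        for word in words[1:]:
            # The next word must begin within max_gap characters of pos.
            nxt = text.find(word, pos, pos + max_gap + len(word))
            # Reject the match if anything other than whitespace sits between.
            if nxt == -1 or text[pos:nxt].strip():
                ok = False
                break
            pos = nxt + len(word)
        if ok:
            matches.append((start, pos))
        start = text.find(words[0], start + 1)
    return matches


text = "PDFtext scanner and textscanner"
print(find_keyword(text, "text scanner"))
```

Each returned pair is a half-open character range, so it converts directly to a location and length of the kind an NSRange carries.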
