schema/pdfbox2 fails to extract text as well as pdftotext #20

Open
jmscott opened this issue May 7, 2016 · 0 comments

jmscott commented May 7, 2016

for a particular pdf

sha:3acd68c1cb7effbc9c2cf50fda6decd96d555d64

the title on the first line of the first page is not extracted correctly

sha:c64e0721c2d5ccdf48992d9a78dbe7d179bbf471

in particular, the venerable pdftotext appears to recognize the newline that
separates the title from the author name. here is the extracted pdftotext blob

sha:c64e0721c2d5ccdf48992d9a78dbe7d179bbf471

why?
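
one way to pin the difference down is to compare the first lines of the two extracted blobs. the sketch below is only an illustration and assumes the failure is that the title and author name end up on one line; the function names and the sample strings are hypothetical and are not taken from the blobs above.

```python
def nonblank_lines(text):
    """Return the non-blank lines of a blob with whitespace runs collapsed."""
    return [" ".join(line.split()) for line in text.splitlines() if line.strip()]


def title_merged_with_author(candidate, reference):
    """Report whether candidate's first line runs the reference's first two lines together.

    reference is the extraction that keeps the newline (assumed here to be
    the pdftotext output); candidate is the extraction under suspicion.
    """
    ref = nonblank_lines(reference)
    cand = nonblank_lines(candidate)
    if len(ref) < 2 or not cand:
        return False
    title, author = ref[0], ref[1]
    first = cand[0]
    # merged means: longer than the title alone, begins with it, and also
    # contains the author line
    return first != title and first.startswith(title) and author in first


# hypothetical sample text, not the real blobs from this issue
reference_text = "A Sample Title\nJane Doe\n\nAbstract text\n"
candidate_text = "A Sample Title Jane Doe\nAbstract text\n"

merged = title_merged_with_author(candidate_text, reference_text)
unchanged = title_merged_with_author(reference_text, reference_text)
```

running the same check over both blobs for a given pdf would show whether the missing newline is the only difference on the first line, or whether the text itself also differs.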
