schema/pdfbox2 fails to extract text as well as pdftotext #20

jmscott · 2016-05-07T13:27:50Z

for a particular pdf

sha:3acd68c1cb7effbc9c2cf50fda6decd96d555d64

the first line of the first page fails to extract the title correctly

sha:c64e0721c2d5ccdf48992d9a78dbe7d179bbf471

in particular, the venerable pdftotext appears to recognize the newline that
separates the title from the author name. here is the extracted pdftotext blob

sha:c64e0721c2d5ccdf48992d9a78dbe7d179bbf471

why?
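
a rough sketch for checking the two blobs against each other, assuming the failure is that pdfbox2 loses the newline after the title and runs the author name onto the same line. the helper names `first_line` and `title_merged` are hypothetical, and the strings in the usage lines are placeholders, not the contents of the blobs above.

```python
def first_line(text: str) -> str:
    """Return the first non-blank line of an extracted text blob, stripped."""
    for line in text.splitlines():
        if line.strip():
            return line.strip()
    return ""


def title_merged(pdfbox_text: str, pdftotext_text: str) -> bool:
    """Guess whether the pdfbox2 output lost the newline after the title.

    Assumption: the pdftotext first line is the correct title. If the
    pdfbox2 first line starts with that title but runs longer, the
    following line (e.g. the author name) was probably merged into it.
    """
    box = first_line(pdfbox_text)
    ref = first_line(pdftotext_text)
    return box != ref and box.startswith(ref)


# placeholder strings, not the real blobs
print(title_merged("Some Title Some Author\nbody", "Some Title\nSome Author\nbody"))
```

this only compares the first non-blank line of each blob, so it says nothing about differences further down the page.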
