JSTOR watermark #24

Open
rcallahan opened this issue Mar 28, 2013 · 5 comments
Comments

@rcallahan

JSTOR PDFs carry a "This content downloaded from X at T" watermark at the bottom of every page.

@kanzure
Owner

kanzure commented Mar 29, 2013

JSTOR has been working since 0.0.10, can you show me a sample that it fails on?

@rcallahan
Author

http://diyhpl.us/~bryan/papers2/paperbot/The%20New%20England%20Origins%20of%20Mormonism.pdf

On Thu, Mar 28, 2013 at 9:45 PM, Bryan Bishop wrote:

> JSTOR has been working since 0.0.10, can you show me a sample that it fails on?



@gffa

gffa commented May 4, 2013

I am still seeing the same issue as of this date. I tested several JSTOR PDFs, and pdfparanoia could not scrub the watermark from any of them.

@fmap fmap mentioned this issue Nov 10, 2013
fmap pushed a commit to fmapfmapfmap/pdfparanoia that referenced this issue Dec 7, 2013
fmap pushed a commit to fmapfmapfmap/pdfparanoia that referenced this issue Dec 7, 2013
fmap pushed a commit to fmapfmapfmap/pdfparanoia that referenced this issue Dec 8, 2013
@fmap

fmap commented Dec 10, 2013

The existing JSTOR scrubber stopped working because JSTOR now adds its
watermarks with a different program, which includes more information and
embeds it in a way that is harder to expunge.

The above patches remove watermark strings as before, but in the process, we're
corrupting the file. mupdf reports:

error: cannot recognize xref format
error: cannot read xref (ofs=2290213)
error: cannot read xref at offset 2290213

Here's what I think's happening:

A PDF file can be thought of as a hierarchy of objects; the most important of
these is the Root entry, which "contains references to other objects defining
the document’s contents, outline, article threads, named destinations, and
other attributes". With the old-style generator, the index of the Root entry
was found by consulting the file trailer, which was guaranteed to be at a
particular position near the end of the file. With the new generator, this
index is instead contained in the dictionary of a cross-reference stream, whose
position is recorded as a byte offset at the end of the file.
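
To make that layout concrete, here is a minimal Python sketch that reads the byte offset stored after the final startxref keyword and checks what actually sits at that position. The helper names read_startxref and xref_kind are made up for illustration and are not part of pdfparanoia, and the sketch assumes the marker lies within the last 1024 bytes of the file.

```python
import re


def read_startxref(pdf_bytes: bytes) -> int:
    """Return the byte offset written after the last 'startxref' keyword."""
    # Assumption: the marker sits within the final 1024 bytes of the file.
    tail = pdf_bytes[-1024:]
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not matches:
        raise ValueError("no startxref marker found near end of file")
    return int(matches[-1])


def xref_kind(pdf_bytes: bytes) -> str:
    """Classify what the startxref offset points at."""
    offset = read_startxref(pdf_bytes)
    target = pdf_bytes[offset:offset + 32]
    if target.startswith(b"xref"):
        return "table"   # classic cross-reference table followed by a trailer
    if re.match(rb"\d+\s+\d+\s+obj", target):
        return "stream"  # an object header, as expected for a cross-reference stream
    return "broken"      # the offset no longer lands on anything recognisable
```

If bytes are removed anywhere before the cross-reference section without updating the stored number, xref_kind would be expected to report "broken", which is the situation described next.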

When we remove watermarks, we're changing the length of objects within the
file, breaking that reference; the offset is no longer accurate. This stops the
root value from being retrieved, KABLAM!

We could solve this by determining the new location of the xref section once
pdfparanoia.eraser has finished manipulating objects, then updating the
recorded offset to match. I'll probably get around to this tomorrow.
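
A rough sketch of that idea in plain Python, working on the raw bytes of an already-scrubbed file. The function names are hypothetical and this is not the code from the referenced commits. It assumes the file uses a cross-reference stream whose dictionary contains a visible /Type /XRef entry, and it only repairs the trailing pointer; it does not rewrite the per-object offsets stored inside the cross-reference data itself.

```python
import re

OBJ_HEADER = re.compile(rb"\d+\s+\d+\s+obj\b")
XREF_TYPE = re.compile(rb"/Type\s*/XRef\b")
STARTXREF = re.compile(rb"startxref\s+(\d+)(\s+%%EOF\s*)$")


def find_xref_stream_offset(pdf_bytes: bytes) -> int:
    """Return the byte offset of the object holding the last /Type /XRef dictionary."""
    type_hits = list(XREF_TYPE.finditer(pdf_bytes))
    if not type_hits:
        raise ValueError("no cross-reference stream found")
    dict_pos = type_hits[-1].start()
    # The owning object starts at the nearest 'N G obj' header before the dictionary entry.
    headers = list(OBJ_HEADER.finditer(pdf_bytes, 0, dict_pos))
    if not headers:
        raise ValueError("no object header precedes the cross-reference dictionary")
    return headers[-1].start()


def fix_startxref(pdf_bytes: bytes) -> bytes:
    """Rewrite the number after the final 'startxref' to the current xref position."""
    new_offset = find_xref_stream_offset(pdf_bytes)
    match = STARTXREF.search(pdf_bytes)
    if match is None:
        raise ValueError("no startxref marker at end of file")
    # Splice the new number in place of the old one; everything else is untouched.
    return (
        pdf_bytes[:match.start(1)]
        + str(new_offset).encode("ascii")
        + pdf_bytes[match.end(1):]
    )
```

Because the startxref marker comes after the cross-reference section, changing the number of digits in the offset does not move the section it points to, so a single pass is enough for this particular pointer.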

fmap pushed a commit to fmapfmapfmap/pdfparanoia that referenced this issue Dec 10, 2013
fmap pushed a commit to fmapfmapfmap/pdfparanoia that referenced this issue Dec 11, 2013
fmap pushed a commit to fmapfmapfmap/pdfparanoia that referenced this issue Dec 11, 2013
fmap pushed a commit to fmapfmapfmap/pdfparanoia that referenced this issue Dec 11, 2013
fmap pushed a commit to fmapfmapfmap/pdfparanoia that referenced this issue Dec 11, 2013
@fmap

fmap commented Dec 11, 2013

Further errors, now. A sample:

error: expected 'obj' keyword (2198 0 ?)
error: cannot parse object (141 0 R)
warning: cannot load object (141 0 R) into cache
error: expected 'obj' keyword (2198 0 ?)
error: cannot parse object (141 0 R)
warning: cannot load object (141 0 R) into cache
error: expected 'obj' keyword (2198 0 ?)
error: cannot parse object (141 0 R)
warning: cannot load object (141 0 R) into cache
error: expected 'obj' keyword (2198 0 ?)
error: cannot parse object (141 0 R)
warning: cannot load object (141 0 R) into cache
error: cannot find page -1 in page tree
