JSTOR watermark #24

rcallahan · 2013-03-28T22:26:56Z

JSTOR PDFs carry "This content downloaded from X at T" at the bottom of every page.

kanzure · 2013-03-29T02:45:46Z

JSTOR has been working since 0.0.10, can you show me a sample that it fails on?

rcallahan · 2013-03-29T14:29:54Z

http://diyhpl.us/~bryan/papers2/paperbot/The%20New%20England%20Origins%20of%20Mormonism.pdf


gffa · 2013-05-04T16:28:39Z

I'm seeing the same issue as of this date. I have tested several JSTOR PDFs and cannot scrub the watermark from them with pdfparanoia.

fmap · 2013-12-10T08:43:00Z

The existing JSTOR scrubber stopped working because JSTOR is now adding
watermarks with a different program, which includes more information and
embeds it in a way that is harder to expunge.

The above patches remove watermark strings as before, but in the process, we're
corrupting the file. mupdf reports:

error: cannot recognize xref format
error: cannot read xref (ofs=2290213)
error: cannot read xref at offset 2290213

Here's what I think's happening:

A PDF document can be thought of as a hierarchy of objects; the most important of
these is the Root entry, which "contains references to other objects defining
the document’s contents, outline, article threads, named destinations, and
other attributes". With the old-style generator, the index of the Root entry was
found by consulting the file trailer, which was guaranteed to be at a particular
position near the end of the file. With the new generator, this index is
instead contained in the dictionary of a cross-reference stream, whose position
is recorded as a byte offset at the end of the file.
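
As a rough illustration of that lookup (this is not pdfparanoia code; `read_startxref` is a hypothetical helper and the sample bytes are a made-up toy file, not a valid PDF), the offset recorded at the end of the file could be read like this:

```python
import re

def read_startxref(pdf_bytes: bytes) -> int:
    """Return the byte offset written after the last 'startxref' keyword."""
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", pdf_bytes)
    if not matches:
        raise ValueError("no startxref entry found")
    # A file with incremental updates has several; the last one is current.
    return int(matches[-1])

# Toy file: the 9-byte header is followed directly by a stand-in xref object.
sample = (
    b"%PDF-1.5\n"
    b"1 0 obj\n<< /Type /XRef >>\nendobj\n"
    b"startxref\n9\n%%EOF\n"
)
offset = read_startxref(sample)
target = sample[offset:offset + 7]  # the bytes the offset points at
```

Here `target` holds the start of the object that a reader would parse to find the Root entry.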

When we remove watermarks, we're changing the length of objects within the
file, breaking that reference; the offset is no longer accurate. This stops the
root value from being retrieved, KABLAM!
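
A small sketch of that failure mode, under the assumption that the scrubber simply shortens a string and leaves the trailer alone. `build_toy_pdf` and the file contents are invented for illustration and are far simpler than a real JSTOR file:

```python
def build_toy_pdf(body: bytes) -> bytes:
    """Assemble header + body object + stand-in xref object + startxref."""
    head = b"%PDF-1.5\n" + body
    xref_obj = b"2 0 obj\n<< /Type /XRef >>\nendobj\n"
    # The recorded offset is where the xref object begins: right after head.
    tail = b"startxref\n" + str(len(head)).encode() + b"\n%%EOF\n"
    return head + xref_obj + tail

body = b"1 0 obj\n(This content downloaded from X at T)\nendobj\n"
original = build_toy_pdf(body)
recorded = int(original.split(b"startxref\n")[1].split(b"\n")[0])

# Before scrubbing, the recorded offset lands on the xref object.
points_at_xref_before = original[recorded:].startswith(b"2 0 obj")

# Naively empty the watermark string without touching the offset.
scrubbed = original.replace(b"(This content downloaded from X at T)", b"()")

# The file is now shorter, so the same offset lands somewhere else.
points_at_xref_after = scrubbed[recorded:].startswith(b"2 0 obj")
```

The second check fails because everything after the edited object has moved toward the start of the file while the recorded number stayed the same.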

We could solve this by, after manipulating objects within pdfparanoia.eraser,
determining the new location of the xref section, and updating the offset
description accordingly. I'll probably get around to this tomorrow.
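
A minimal sketch of that idea, not the code that ended up in pdfparanoia: `find_xref_offset` and `fix_startxref` are hypothetical names, and the regexes assume a simple layout (`/Type /XRef` appearing before any nested dictionary in the stream object, `\n` line endings for a classic `xref` table).

```python
import re

# "N G obj << ... /Type /XRef": a cross-reference stream object.
XREF_STREAM = re.compile(rb"\d+\s+\d+\s+obj\s*<<[^>]*/Type\s*/XRef\b")
# A classic table starts with "xref" on its own line ("startxref" won't match).
XREF_TABLE = re.compile(rb"^xref\b", re.MULTILINE)
STARTXREF = re.compile(rb"(startxref\s+)\d+(\s+%%EOF\s*)$")

def find_xref_offset(pdf_bytes: bytes):
    """Return the byte offset of the last xref section, or None if absent."""
    for pattern in (XREF_STREAM, XREF_TABLE):
        matches = list(pattern.finditer(pdf_bytes))
        if matches:
            return matches[-1].start()
    return None

def fix_startxref(pdf_bytes: bytes) -> bytes:
    """Rewrite the trailing startxref value to match the current layout."""
    offset = find_xref_offset(pdf_bytes)
    if offset is None:
        return pdf_bytes  # no xref found: leave the file untouched
    return STARTXREF.sub(
        lambda m: m.group(1) + str(offset).encode() + m.group(2), pdf_bytes
    )

toy = (
    b"%PDF-1.5\n1 0 obj\n(watermark)\nendobj\n"
    b"2 0 obj\n<< /Type /XRef >>\nendobj\n"
    b"startxref\n36\n%%EOF\n"
)
scrubbed = toy.replace(b"(watermark)", b"()")
fixed = fix_startxref(scrubbed)
new_offset = int(re.search(rb"startxref\s+(\d+)", fixed).group(1))
```

This only repairs the final offset. Any byte offsets stored inside the cross-reference data itself would still describe the old layout, which this sketch does not attempt to handle.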


fmap · 2013-12-11T09:15:07Z

Further errors, now. A sample:

error: expected 'obj' keyword (2198 0 ?)
error: cannot parse object (141 0 R)
warning: cannot load object (141 0 R) into cache
error: expected 'obj' keyword (2198 0 ?)
error: cannot parse object (141 0 R)
warning: cannot load object (141 0 R) into cache
error: expected 'obj' keyword (2198 0 ?)
error: cannot parse object (141 0 R)
warning: cannot load object (141 0 R) into cache
error: expected 'obj' keyword (2198 0 ?)
error: cannot parse object (141 0 R)
warning: cannot load object (141 0 R) into cache
error: cannot find page -1 in page tree

fmap mentioned this issue Nov 10, 2013

JSTOR scrubber #36

Closed

fmap pushed a commit to fmapfmapfmap/pdfparanoia that referenced this issue Dec 7, 2013

Recent JSTOR generator, remove access date (kanzure#24).

cfd0420

fmap pushed a commit to fmapfmapfmap/pdfparanoia that referenced this issue Dec 7, 2013

Replaced footer strings for more recent JSTOR generator (kanzure#24).

d3f5d61

fmap pushed a commit to fmapfmapfmap/pdfparanoia that referenced this issue Dec 8, 2013

Tests for new JSTOR generator (kanzure#24)?

72b0170

fmap pushed a commit to fmapfmapfmap/pdfparanoia that referenced this issue Dec 10, 2013

Test for corruption in the JSTOR scrubber (kanzure#24).

f4769ed

fmap pushed a commit to fmapfmapfmap/pdfparanoia that referenced this issue Dec 11, 2013

Stubbed eraser.find_xref_section_offset, with failing test for same (kanzure#24).

077b28b

fmap pushed a commit to fmapfmapfmap/pdfparanoia that referenced this issue Dec 11, 2013

working xref_section_offset (kanzure#24)

102b057

fmap pushed a commit to fmapfmapfmap/pdfparanoia that referenced this issue Dec 11, 2013

Update xref offset description after manipulating objects. (kanzure#24)

987b517

fmap pushed a commit to fmapfmapfmap/pdfparanoia that referenced this issue Dec 11, 2013

Only insert offset if an xref was found, this unbreaks tests on simplified documents (kanzure#24)

e3711d6

joepie91 added bug labels May 26, 2014
