
Local PDF support #770

Closed
dwhly opened this issue Sep 20, 2013 · 47 comments

Comments

@dwhly
Member

dwhly commented Sep 20, 2013

Online PDFs can be related to each other via URLs. Offline, local PDFs on one's personal drive represent another large opportunity for collaborative annotation.

Some services, notably Mendeley, have developed approaches to associate these PDFs with their equivalents elsewhere, either via hashes or other fingerprinting techniques. We might apply these techniques ourselves or partner with an external service (Mendeley offered the use of their dataset at one point) to achieve the same capability.
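(For illustration only, not Mendeley's actual method, which isn't documented here: one simple fingerprinting approach is to hash the file's raw bytes, so that byte-identical copies resolve to the same identifier regardless of where they live. A minimal sketch using the Web Crypto API; the choice of SHA-256 is an assumption.)

    // Illustrative sketch only: fingerprint a local PDF by hashing its bytes,
    // so byte-identical copies map to the same identifier.
    async function fingerprintFile(file) {
      const bytes = await file.arrayBuffer();
      const digest = await crypto.subtle.digest('SHA-256', bytes);
      // Render the digest as a lowercase hex string.
      return Array.from(new Uint8Array(digest))
        .map((b) => b.toString(16).padStart(2, '0'))
        .join('');
    }

    // Usage: fingerprintFile(input.files[0]).then(console.log);

Note that a whole-file hash only matches exact copies; two PDFs of the same paper with different metadata would fingerprint differently, which is why fingerprinting schemes based on content (rather than raw bytes) exist.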

@csillag
Contributor

csillag commented Sep 20, 2013

+1

@tilgovi
Contributor

tilgovi commented Sep 20, 2013

PDF.js has a fingerprint for the loaded PDF that's accessible from code. I don't know how they derive it, or whether it's standard for PDFs.
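(For reference, a minimal sketch of reading it from the viewer; the exact global has varied across PDF.js versions, so the names below are assumptions.)

    // Sketch: read the fingerprint of the currently loaded document.
    // The viewer global has varied across PDF.js versions (PDFView in
    // older builds, PDFViewerApplication in newer ones).
    var viewer = window.PDFViewerApplication || window.PDFView;
    if (viewer && viewer.pdfDocument) {
      console.log('PDF fingerprint:', viewer.pdfDocument.fingerprint);
    }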

@csillag
Contributor

csillag commented Sep 21, 2013

Some of the methods required to solve this might be similar to those needed for #772. (Although that is a broader, more open-ended problem.)

@THIS-Usr

I find it incredible that you are thinking of implementing this. Brilliant! (There are quite a lot of features you are now discussing where I feel the same.)

The general point I wish to make is one I haven't seen much discussed: there is a competitive context to H. I don't expect the idea is to implement every possible feature of every offering in this space. It is to produce something truly usable across a broad landscape, and to encourage a certain collaborative behaviour that, arguably, other solutions to this point do not.

I have a couple of other points to make here but don't want to be even more long-winded. I'll jot them down for elsewhere.

Adam


@tilgovi
Contributor

tilgovi commented Sep 22, 2013

Thanks, Adam. We generally try not to self-censor in issues, though. The idea is that the core team can't implement everything, but the ideas should not be lost and it should be possible for others to pick up more niche issues if they have a particular interest. You can see our proposed roadmap on the wiki and you're welcome to discuss it with us any time. That outlines our priorities, which is a narrower agenda than the whole of the issue tracker would suggest.

@THIS-Usr

OK. I just don't want to stray too far off topic. This is not about features but about the overall logic of a feature.

Thinking for a second about Mendeley, I can see why they or similar services would not want to open up some aspects of their system, for proprietary reasons. I imagine this would apply in particular to private PDFs; for instance, Mendeley caters to papers in process or in private circulation.

My thought is, and you can see it is off topic, that if the philosophy of H is to encourage different forms of public participation, isn't a line being crossed if private documents can also be annotated? What is the rationale here?

  • I have nothing against the idea myself, I should add. In fact, in the area I am looking at, I believe it would be very valuable. This crosses into Scribd territory and brings up the thorny issue of annotation of EPUB, and the general interoperability of systems. Does H aim to be universal in that way, or is the priority to be very good as the place where a large number of annotations are shared, sifted and so forth? I am thinking the latter.

Thanks for coming back to me and for inviting me to look at and discuss items on the wiki. I am interested and will do that. But I certainly have not got a systematic view of the field, or entirely what my own putative requirements might be, either.

Best,

Adam


@dwhly
Member Author

dwhly commented Sep 22, 2013

My thought is, and you can see it is off topic, that if the philosophy of H is to encourage different forms of public participation, isn't a line being crossed if private documents can also be annotated? What is the rationale here?

It's a really good question, and one that we discuss from time to time as a team. I'll give you my perspective:

The rationale for me to become involved was certainly an interest in a transformation of the public. Along the way I've come to appreciate that we live in layers. Personal, private (the interpersonal private) and public. Each of them is a vital piece of the whole. Annotation as a concept doesn't stop at the boundary between each of them. Neither do we. We move effortlessly between these contexts, and exist in each of them simultaneously.

If we limit the technology we're building artificially to only serve the public sphere, then we force our users to seek alternative solutions for the other 2/3 of their lives. By contrast, if we serve the entirety of their needs, then we provide a much more powerful holistic service and product as a consequence.

Take the continuum of scholarly communications as an example. Co-authors collaborate on a draft article; peers in their field privately, and often anonymously, review the finished product for a journal; and only then does the published final version enter the public sphere, where the community begins to engage in an open discussion about it, cite it and incorporate it into subsequent research. Why constrain annotation to only serve the last portion? That would preclude us from using a flexible annotation infrastructure to selectively expose (according to the practices of the journal and agreement of participants) some reviews, in order to give subsequent readers the perspective and the back-and-forth as to the strength of evidence or the rationale for various arguments or conclusions. In fact, the discussions between co-authors in drafting, as annotations to each other, might sometimes obviate the need for questions by reviewers altogether, and themselves be discoverable on the published paper as footnotes and additional context.

The ability to annotate personally or privately, and then subsequently modify the visibility of those annotations, is a powerful concept, and one whose disruptive potential I think will become evident.

@tilgovi
Contributor

tilgovi commented Sep 22, 2013

Adam, thanks again for your thoughts. After I reply, I would recommend moving this conversation to our development mailing list if you'd like to continue it further.

For my part, these issues are justified as part of this repository and project because a major aim of mine is to make universal interfaces for annotation and that is orthogonal to the requirements of storage and dissemination. I am even on record (on our mailing list and this issue tracker) opposing social features, such as groups and following/friending. I've opposed having private annotations altogether on more than one occasion. Once I even suggested that Hypothes.is not run a service at all, but only focus on a browser interface to marshal myriad providers to the cause of reading, writing and sharing annotations.

I think the world is better served by embracing a Web that has a multitude of providers and platforms. I am more interested in unifying the user experience than I am in the details of storage.

I would rather see private annotations stored in browser storage with persistence, especially cross-client persistence, achieved through existing means. Google and Mozilla both offer sync services for extension user data and services like Evernote, Dropbox, and Google Drive have rich APIs for document storage. These providers have already put the effort in to make things printable, shareable, and embeddable, often with support for fine grained visibility and collaboration settings. I've previously experimented, with reasonable success, with storing annotations on app.net channels. The possibilities are large in number.

Dan writes that the "ability to annotate personally or privately, and then subsequently modify the visibility of those annotations is a powerful concept". I would respond by saying that the ability to move annotations easily between publishers and platforms is an even more powerful concept and one which offloads visibility to services more focused on social features and respects the publisher+community as de facto mediators of visibility.

Most people seem unaccustomed to thinking about universal interfaces to multiple service providers. I find this ironic, considering that this is exactly the function of a browser.

I respect collaborative priority setting and have not attempted to prohibit the development of any service provider features. Nevertheless, I wish to assure you that your questions are most certainly reasonable. I would suggest that perhaps they don't go far enough.

@csillag
Contributor

csillag commented Sep 23, 2013

A brief summary of my thinking, for the record. (Please excuse me for the spelling and other errors; it's late here.)

Part 1: about building it

First, let me state that I am very much in favor of open technologies, protocols and standards. Actually, in the last 15 years, these have been the most important criteria I have used whenever I had to make a decision about a piece of technology. I suffered many inconveniences because of this, but I think it was worth it, because the world is better served by these technologies, so we should be promoting and supporting them, as opposed to closed alternatives.

I am also in favor of distributed, federated systems. (For example, the OS I generally use, Debian GNU/Linux, is maintained by 100s of individual maintainers, working together in a very loosely coordinated network.)

Taking a look at the Hypothes.is project, it's a great pleasure for me to know that it's 100% free software, and that we are based on open standards, and part of a larger economy of OA.

But despite all this, I don't think it would be a safe course of action to limit our activities to too narrow an area of the problem we want to solve (which would be: "improve the quality of information on the Internet and in the greater world around us"), hoping that others will automatically supply the missing parts, and that an emergent system-of-systems will solve the problem. It would certainly be nice, but realistically speaking, I would be very, very hesitant to bet the success of the project solely on such a doubtful coincidence of events.

I think that if we want to see this solved (and we sure do), then we have to supply a full-stack solution, catering to all relevant needs of our users. (At least on a basic level.) If we have a working product which picks up some momentum, then this momentum will most likely spread to the rest of the OA field, too, and help invigorate other players, who will hopefully be able to cooperate with us in many fruitful ways. We are doing everything we can to facilitate that: open code, open APIs, no vendor lock-in, free migration of data, even support for federation. That's all great, but I still think that ultimately, for core features, we should rely on our own work. (At least in the foreseeable future.) And we also need some level of control. Definitely not in the sense of building barriers that would hinder anybody else from following us, or building on our results, but in the sense of making sure that every necessary piece of the technology puzzle is there, when and where we need it, and is working together properly.

That's why I think we should plan for building and operating a fully self-sufficient service that is usable without relying on the services of any other party. (Except standard internet technology, like web servers and browsers, of course.)

Part 2: about running it

Let's assume that at some point in the future, all required pieces of technology are ready, available and integrated. Then what? How should it be operated, and by whom? There are several possibilities here, all with different outcomes. Evaluating the benefits or drawbacks of those outcomes depends on the goals and principles we choose for the project. I would say that our highest-level goal is helping humanity sort out (some of) its problems by augmenting the process of cooperative information gathering and analysis.

I think cooperative sense-making works better when more information is flowing into it; in this respect, 10 minds thinking together is better than 2 groups of 5 minds thinking together. This is why I think that the process of collaborative analysis is augmented most efficiently if we suppress any unwanted fragmentation. In practice, this would mean that whenever possible, we need to encourage uniting the discussion spaces, so that all thinking (about a given topic) can flow together and be synthesized. Of course this does not mean that we should rule out private discussions, or separated discussion spaces, but as a general rule, I would like to avoid breaking the general discussion space down into small, independent ghettos.

Therefore, I want to operate the service in a way that makes it possible for users to access all useful information simultaneously. Obviously, the easiest way to achieve this would be to simply store everything centrally, on a backend operated by us. But since we certainly don't want anything that smells like (or is) vendor lock-in, we want to support free data migration, reading simultaneously from multiple backends, federation, etc. These are all fine, but I think we should not give up on the goal of providing one main public space for discussion about every document.

It would be very nice if we could build a system that provides this united discussion space, implemented on top of many independent and diverse back-ends. In fact, I would be fascinated to explore this direction. But I think it would be a huge disaster if some user groups adopted Evernote, others adopted Dropbox, and yet others adopted Google Drive for data storage, all with their separate and incompatible access control, visibility and cooperation policies, and in the end, this ruled out bringing all those annotations together in the public discussion space, using common rules.

If we have data sources that use different access control, visibility and collaboration settings and policies, and we want to unite the data living in them, then I think we need to build a very significant amount of self-description, discovery and auto-negotiation into the protocol, so that the clients and backends have a chance to find out how they can interact with each other. That might be possible, but I have serious doubts about the feasibility of this, or whether it's worth the effort.

And so, since my No. 1 priority is to build a working service (and then start solving the problem using it), I am very much in favor of building and operating all the service provider features as part of this project, and also of adding any and all features that our users will need for their basic workflow. (Which includes private, group and offline annotations, too.)

@tilgovi
Contributor

tilgovi commented Sep 25, 2013

To get back to the issue at hand, I think PDF.js used to fail for annotations in FF because the file:/// scheme trips things up. That may have been an issue with easyXDM, and it may just work now that we use jschannel. Obviously, the URI of the submitted annotations is unhelpful, but it would be a quick and easy hack to grab the fingerprint out of PDF.js.

@csillag
Contributor

csillag commented Oct 2, 2013

@tilgovi wrote:

To get back to the issue at hand, I think PDF.js used to fail for annotations in FF
because the file:/// trips things up. That may have been an issue with easyXDM
and it may just work now that we use jschannel.

No, unfortunately it still does not work.
(At least with Chrome, using the PDF.js extension.)

@tilgovi
Contributor

tilgovi commented Oct 2, 2013

It may just be that jschannel is saying that file:// is invalid. You'll see that I recently had to add a hack for our extension because it didn't like chrome-extension://. I'll file something upstream.

@csillag
Contributor

csillag commented Oct 7, 2013

To me, it sounds more like it's postMessage() itself that is complaining. But let's hope you are right, because then we can fix it. (I have already patched JSChannel so that it works with PDF.js and FF, see here: mozilla/jschannel#21 )

@csillag
Contributor

csillag commented Oct 19, 2013

According to mozilla/pdf.js#3751, newer versions of the PDF.js extension place the PDF under the chrome-extension:// prefix, which is better than the file:// prefix, because postMessage() supports it. So this might be easier soon.
(The PDF.js version embedded in FF does not yet do this; the Chrome extension will probably be updated very soon.)

@csillag
Contributor

csillag commented Oct 19, 2013

@edsu, could we enhance the document plugin in a way that would allow us to recognize PDF document equivalence based on the PDF hash, which we can extract from PDF.js?

It would be great if all offline copies of the PDF document could automatically get all the annotations, even those which were made on other offline copies.

@tilgovi
Contributor

tilgovi commented Oct 21, 2013

I've been thinking a bit about some of the patterns we might pursue for Annotator plugins.

I like manual configuration OR discovery as options.

Rather than make the document plugin aware of PDF.js, let it keep its clean, HTML-centric view of the world, but allow options to be passed when the plugin is created to set metadata explicitly.

Then, we can load the document plugin a little bit later in the startup, after our main application code has checked whether we're on a PDF. If we are, it can pass the plugin a URN based on the PDF.js fingerprint field. Otherwise, it will load the plugin and let it do normal HTML metadata discovery.

Does that make sense?
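(A rough sketch of that flow; the link option on the Document plugin and the PDFView global are illustrative assumptions here, not a settled API.)

    // Rough sketch of the proposed startup flow. The `link` option on the
    // Document plugin is hypothetical, not an existing Annotator API.
    var pdf = window.PDFView && window.PDFView.pdfDocument;

    if (pdf) {
      // On a PDF: pass explicit metadata built from the fingerprint.
      annotator.addPlugin('Document', {
        link: [{href: 'urn:x-pdf:' + pdf.fingerprint}]
      });
    } else {
      // On ordinary HTML: let the plugin do normal metadata discovery.
      annotator.addPlugin('Document');
    }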

@csillag
Contributor

csillag commented Oct 21, 2013

On 2013-10-22 01:01, Randall Leeds wrote:

[...] Rather than make the document plugin aware of PDF.js, let it keep its clean, HTML-centric view of the world, but allow options to be passed when the plugin is created to set metadata explicitly. Then, we can load the document plugin a little bit later in the startup, after our main application code has checked whether we're on a PDF.

I think we already have that order. (At least in Annotator Standalone, where we only have one Annotator instance. For our two-framed approach, I am not sure.)

If we are, it can pass the plugin a URN based on the PDF.js fingerprint field.

And what would the document plugin do with this URN? (Would this replace the URL fetched from the browser location?)

Otherwise, it will load the plugin and let it do normal HTML metadata discovery. Does that make sense?

Yes, but we still need to parse the original URL, too. (Otherwise, we can't capture document sameness when the HTML version specifies the PDF form with a link.) And we also need recursive searches (see #833), so that we can find the HTML -> online PDF -> downloaded PDF links, too.

(Btw, I think we should discuss this at #835 instead, and keep this issue for problems related to accessing data with the file:// protocol.)

@tilgovi
Contributor

tilgovi commented Oct 26, 2013

Before I forget, I think it might make sense to default the first annotation to private on file:// URLs. I feel like, as a user, I would be surprised if I made an annotation on my local document and it defaulted to public. Should I open an issue?
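(A hypothetical sketch of that default; the helper name is made up, and the permissions shape mirrors the group:__world__ convention.)

    // Hypothetical sketch: private-by-default for file: documents.
    function defaultReadPermissions(uri, userid) {
      if (/^file:/.test(uri)) {
        return [userid];             // private: only the author can read
      }
      return ['group:__world__'];    // public
    }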

@csillag
Contributor

csillag commented Oct 26, 2013

I think we can wait on the issue until this gets close to working.
Which, as far as I understand, will only happen if we integrate the PDF.js plugin into our plugin...

@csillag
Contributor

csillag commented Nov 26, 2013

So, I gave this another quick round of testing. Findings:

  • By default, local PDF files are not opened by our (combined) extension.
  • If we enable local file access in about:plugins, our integrated PDF.js code can load and render local PDF documents.
  • However, if we try to actually activate the plugin, nothing happens.

We are doing this:

    chrome.tabs.executeScript(null, {
      file: 'public/js/embed.js'
    })

... but there is no error (or other) message, either in the normal console log or on the extension's background page.

Our embed code does not seem to be executing at all.

I wonder why this fails...
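(One way to surface a silent failure like this is to pass a callback and check chrome.runtime.lastError; a sketch of the same injection as above. Note that injecting into file:// pages also requires the user to enable "Allow access to file URLs" for the extension.)

    // Sketch: same injection, but with a callback that reports why it
    // failed. chrome.runtime.lastError is set when injection is blocked
    // (for example, missing file-URL access).
    chrome.tabs.executeScript(null, {
      file: 'public/js/embed.js'
    }, function () {
      if (chrome.runtime.lastError) {
        console.error('embed.js injection failed:',
                      chrome.runtime.lastError.message);
      }
    });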

@csillag
Contributor

csillag commented Aug 18, 2014

Just a quick note: on my latest test branch, this actually works, at least in FF.

You can open a local PDF, apply the bookmarklet, and annotate away.
Furthermore, we are using PDF fingerprints for identification, so if you open the same PDF file from another location (either online or offline), you can still see all the notes.

This also means that a group can now collaborate around an offline PDF document which they each have, and share notes, even in realtime. Sweet :)

I still need to get this going with Chrome[ium].

@dwhly
Member Author

dwhly commented Aug 18, 2014

Realtime collaborative annotation of offline PDFs? Now that is magic.

@csillag
Contributor

csillag commented Aug 18, 2014

Realtime collaborative annotation of offline PDFs? Now that is magic.

Well, a network connection is still needed, to transfer annotations to and from a common backend. But other than that, it's pretty self-sufficient :)

@dwhly
Member Author

dwhly commented Aug 18, 2014

Yes, right, sorry, "local" like the issue says. :)

@csillag
Contributor

csillag commented Aug 18, 2014

The OP speaks about "Offline, local PDFs on one's personal drive" - so I think we are reaching that point now. (At least with FF.)

@tilgovi
Contributor

tilgovi commented Sep 27, 2014

The piece described in the original issue is covered: identifiers.

What's separate is whether the Chromium extension will load from file:///.

@dwhly
Member Author

dwhly commented Sep 27, 2014

Open as a separate issue?

@tilgovi
Contributor

tilgovi commented Sep 27, 2014

Let's see what happens with the current work and what does and does not work.

@tilgovi
Contributor

tilgovi commented Sep 27, 2014

A quick test of my PR branch 1334-pdfjs-latest-redux shows that, even without much of the file:// handling the PDF.js extension has, it works with no problem. Maybe that code has more to do with needing to intercept requests for file:// than with loading files explicitly. Works, though!

@dwhly
Member Author

dwhly commented Sep 27, 2014

That would be incredibly awesome.

@tilgovi
Contributor

tilgovi commented Sep 27, 2014

    "uri": "urn:x-pdf:5362f7013b25bc6fa7a957d05dd4cb",
    "document": {
    "link": [
      {
        "href": "content/web/viewer.html?file=file%3A%2F%2F%2Fhome%2Ftilgovi%2Fspaa11-matchings.pdf"
      }, 
      {
        "href": "urn:x-pdf:5362f7013b25bc6fa7a957d05dd4cb"
      }
    ], 
    "title": "spaa11-matchings.pdf"
  },

@tilgovi
Contributor

tilgovi commented Sep 27, 2014

Looks like we need to strip/sanitize those URLs, though. I don't want to be storing directory hierarchies from people's drives.

@csillag
Contributor

csillag commented Sep 29, 2014

Looks like we need to strip/sanitize those URLs, though.

Indeed. IIRC the other PR has some of the functionality. I will look this up.

I don't want to be storing directory hierachies from people's drives.

Well, we must store the hierarchy if we want to be able to open the file when clicking the "source" link on the standalone page, mustn't we?

@csillag
Contributor

csillag commented Sep 29, 2014

A quick test of my PR 1334-pdfjs-latest-redux shows that even without a lot of the stuff the PDF.js extension has about file:// it works with no problem.

Previous testing has shown that this is not always reliable.
For me, the extension worked great while testing locally, but stopped working when installed via the Chrome web store. I have not (yet) found the reason why.

@csillag
Contributor

csillag commented Sep 29, 2014

I am reopening this. Let's close when we have a solution that is a) merged to master b) shipped and tested.

@csillag csillag reopened this Sep 29, 2014
@dwhly
Member Author

dwhly commented Sep 29, 2014

Well, we must store the hierarchy if we want to be able to open the file when clicking the "source" link on the standalone page, mustn't we?

I think @tilgovi's caution around storing personal disk hierarchies is probably a good one. Also, it bears observing that if two people are annotating a local PDF together using the fingerprint as the canonical document vs the URLs, then the "source" link functionality is going to be broken for anyone else's annotations anyway (the ones created by any counterparty who is storing the file in some other place in their filesystem).

I'm wondering if there might be a clever way to cache a link from the fingerprint to the local path that only the local client knows, but for now I wouldn't be too bothered by this getting stripped out. I definitely agree that leaking my directory information, which might have all kinds of sensitive things in it, publicly or to a small group, just because I'm annotating a document with them, is probably a bad idea.

@csillag
Contributor

csillag commented Sep 29, 2014

What about private annotations on local PDF files?
Should we drop the file URL in those cases, too?

@dwhly
Member Author

dwhly commented Sep 29, 2014

What about private annotations on local PDF files? Should we drop the file URL in those cases, too?

It's a reasonable question. On the one hand they're "private", on the other hand they're being stored in a remote cloud service. Simplicity might argue that we strip everything for now. The file name is there and will be a good indicator of which file is the target, though we can all think of examples where local files aren't well named, and where the path provides important detail. I can certainly see this being annoying for users (personally I'd probably opt to store mine). I'd be willing to let user feedback drive how we approach this one.
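(A minimal sketch of the "strip everything for now" behaviour; the helper name is hypothetical. Drop any file: links from the document metadata before saving, keeping the fingerprint URN and the title.)

    // Hypothetical sketch: remove file: hrefs from document links before
    // an annotation is saved, so directory paths never leave the client.
    function stripLocalLinks(links) {
      return links.filter(function (link) {
        return !/^file:/i.test(link.href || '');
      });
    }

    // e.g. [{href: 'file:///home/me/paper.pdf'}, {href: 'urn:x-pdf:...'}]
    //   -> [{href: 'urn:x-pdf:...'}]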

@tilgovi
Contributor

tilgovi commented Sep 29, 2014

It's a reasonable question. On the one hand they're "private", on the other hand they're being stored in a remote cloud service. Simplicity might argue that we strip everything for now.

👍

I'd be willing to let user feedback drive how we approach this one.

👍

@csillag
Contributor

csillag commented Oct 25, 2014

Some testing for local PDF:

FF34 beta (linux) - bookmarklet - works
FF31 (linux) - bookmarklet - works
Chromium 38 (linux) - staging extension (pdf.js 1.0.437) - broken (because cross-origin loading for file:/// is not supported)
Chromium 38 (linux) - staging extension (pdf.js 1.0.712) - broken (because cross-origin loading for file:/// is not supported)

I need to see if we can tune the manifest.json in a way that supplies the required permissions.
(It might or might not be possible - but it should be.)
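(For reference, the file scheme can be requested with a match pattern in manifest.json; an abridged fragment is below. Chrome still gates this behind the per-extension "Allow access to file URLs" toggle, though, so the manifest alone may not be enough.)

    {
      "permissions": [
        "tabs",
        "file:///*"
      ]
    }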

@csillag
Contributor

csillag commented Oct 25, 2014

The trick here is (again) that when the Chrome extension is built and installed locally, it works.

So the only way to experiment with this is to always upload to the web store, wait, download, test, modify, repeat...
... which is mind-bogglingly slow.

@csillag
Contributor

csillag commented Nov 10, 2014

According to some recent experiments by @dwhly, this is working now, both in FF and in Chrome. So, closing this.
