"IRI of the Package Document": what is this exactly? #1374
Comments
This issue is why we introduced the notion of "root directory" in the LPF format. We still used URLs (we decided not to use the term IRI) in "Contents within the Package MUST reference these resources [those in the package] via relative-URL strings [url]." |
Ah yes, indeed. We may want to get inspired and bring this over to the EPUB spec... |
Is there a difference from the OCF abstract container's root directory? Doesn't it already establish the virtual directory structure within the zip? |
I think the difference is not in the terms/concept but the surrounding explanation. I find the few extra terms in the LPF spec on "virtual in nature", etc, helpful. And a reference to this from the item element definition may also be helpful. |
The issue was discussed in a meeting on 2021-04-01 List of resolutions:
2. Clarify base IRI
See github pull request #1468.
Dave Cramer: this is a PR about base IRI in package documents
See github issue #1374, #1456.
Matt Garrish: basically all the PR does is define base IRI for package because it wasn't clear how that was to be calculated
Dave Cramer: i'm happy with PR
Matt Garrish: the PR seemed to make Ivan happy
Dave Cramer: i think we should accept the PR, and then if Laurent wants to raise the other question, maybe he can come back with a more detailed rationale
Matt Garrish: there was an issue about whether relative IRIs MUST be resolved, but it was only ever the intention that it be possible if you need to do it
|
Yes :-) |
Sorry to come late to the party. After looking at an issue in EPUBCheck (see the ref linked above), I'm thinking this is still underspecified. I think the intention is that:
But our issue is that since the URL of the Package Document is not defined, its path within the container can be described either in the path part of the URL (e.g. Consider the following container content:
Now, for the following manifest item: <item href="xhtml/content.xhtml" media-type="application/xhtml+xml" /> Examples A and B would resolve the item URL to But example C would resolve the item URL to In other words, all we have is Schrödinger’s URLs: we can't tell if they’re conforming or non-conforming (i.e. identifying container resources or not) until we open a reading system. 🤔 Shouldn't we reopen this issue? |
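To make the dependence on the base URL concrete, here is a minimal sketch that resolves the manifest `href` quoted above against several candidate package document URLs. The base URLs and their labels are hypothetical and are not the thread's examples A to D; the sketch also uses Python's `urllib.parse.urljoin`, which follows RFC 3986 rather than the WHATWG URL Standard, although the two agree for inputs this simple.

```python
from urllib.parse import urljoin

# Manifest href quoted in the comment above.
HREF = "xhtml/content.xhtml"

# Hypothetical URLs that a reading system might assign to the same
# package document (illustrative only, not taken from the thread).
CANDIDATE_BASES = {
    "container at host root": "https://example.org/EPUB/package.opf",
    "container under a book name": "https://example.org/nameOfEPUB/EPUB/package.opf",
    "only the package directory exposed": "https://example.org/package.opf",
}


def resolve_all(href, bases):
    """Resolve one relative href against every candidate base URL."""
    return {label: urljoin(base, href) for label, base in bases.items()}


resolved = resolve_all(HREF, CANDIDATE_BASES)
for label, url in resolved.items():
    print(f"{label}: {url}")
```

The same `href` yields three different absolute URLs, so whether it "identifies a container resource" cannot be decided until a base URL is fixed.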
Shouldn't this be:
You mean "cannot tell about D", right? (not that I have an answer).
Yes. I will do so (through this comment). |
The question I have: are C and D examples of real Reading Systems, or are they imaginary? In other words, can we allow ourselves to say that the RS must behave, conceptually, like A or B (and how they do that is up to them)? |
No, see the live URL viewer (JSDOM implementation of the URL standard, closely following the spec).
Right. Edited, thx 😊
Yes, I would really like RS folks to chime in! |
Isn't the problem the lack of clarity about the root of the container as in #1681? I opened that because root-relative paths are never going to be consistent, but it's the lack of a common root dir that's the big problem. The current definition of the OCF root directory explicitly says the root is optional, even if I don't believe this is formalized normatively (but then how to unpack isn't normatively described, either):
If there were consistency here, would it matter what scheme you use to reference/resolve the resources? How the content is served doesn't matter, in my understanding, since this is only a check of the abstract container's integrity. But this also seems to make this statement problematic:
On the one hand, we're talking about an "abstract" container, but then on the other we're parsing for physical URL records. The root directory may have disappeared in the meantime, but we're implicitly assuming it is kept for checking this requirement. We've run into this problem in the past with multiple renditions not working when shared content is not below the package document(s) in a common content directory, even though the epub is valid. Some reading systems don't create a root to match the zip, so you can't have sibling directories in the zip root, one for each rendition. All we appear to want to know is a) is the resource inside the zip file, and b) is it not a duplicate of another entry. Resolving to URLs doesn't appear to have any other function. You can't test a) unless you explicitly create a root that matches the zip root. And b) should sort itself out since the application will have internal consistency in however it then resolves the urls. So do reading systems verify the manifest entries, or is this only an epubcheck problem to fix? If it's the latter, can't you assume a root directory that is the root of the zip and work from there? If the content doesn't work on any given reading system... well, that's sort of the state of the world now. |
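The suggestion to assume a root directory equal to the root of the zip can be sketched as follows. This is only an illustration of checks (a) and (b) under stated assumptions, not how EPUBCheck is implemented: the function name `check_manifest`, the sample file names, and the restriction to plain relative paths (no scheme, query, or fragment) are all hypothetical.

```python
import io
import posixpath
import zipfile


def check_manifest(zip_bytes, package_path, hrefs):
    """Classify manifest hrefs as "found", "missing", or "duplicate".

    Assumes the container root is the root of the ZIP archive and that
    every href is a plain relative path.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        entries = set(archive.namelist())
    package_dir = posixpath.dirname(package_path)
    seen = set()
    report = {}
    for href in hrefs:
        # Resolve the href relative to the package document's directory.
        target = posixpath.normpath(posixpath.join(package_dir, href))
        if target in seen:
            report[href] = "duplicate"
        elif target in entries:
            report[href] = "found"
        else:
            report[href] = "missing"
        seen.add(target)
    return report


# Build a tiny in-memory container for the demonstration.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("mimetype", "application/epub+zip")
    archive.writestr("EPUB/package.opf", "<package/>")
    archive.writestr("EPUB/xhtml/content.xhtml", "<html/>")

report = check_manifest(
    buffer.getvalue(),
    "EPUB/package.opf",
    ["xhtml/content.xhtml", "./xhtml/content.xhtml", "../../outside.xhtml"],
)
print(report)
```

Working on zip entry names avoids URL records entirely, which matches the observation that resolving to URLs serves no other purpose for these two checks.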
Oops, you're right. Sorry for the noise. |
The problem is that we're making conformity statements based on an undefined object, and the nature of this object can very much change the interpretation of these statements. Specifically, in EPUB RS "3.1 Parsing Relative URLs":
This statement is problematic for a validator. And it could be problematic for reading systems too (in which case they would just ignore it).
For EPUBCheck, yes. For a reading system: what should happen if the resource is outside the ZIP file? What if it's a duplicate? Do we specify the mandated behavior? In any case, to specify how a UA assesses (a) and (b), the current wording doesn't work. We might work around the current issue by assigning an arbitrary URL to the container root. I'm not sure. |
Right, we let this slip under the radar in the past by not bringing in absolute URLs at all. All we said was that the relative paths had to resolve to resources in the container without explaining at all how that should happen. It intuitively makes sense until you get into the problem of an unzipped epub and an "abstract container" not having a common root dir.
Ya, I'm not saying what we have is helpful at all, but if we were to always assume for validation the root dir is the root of the ocf can't you parse the relative URLs and determine if they are below that directory? But it's worth questioning why we try to block access this way but if you put in an absolute url that references the file system that doesn't raise any concern. You just have to call it a remote resource. Maybe that's all we should require for resources outside the container and leave it to reading systems to similarly determine whether they want to use these?
Okay, you got me here! I wasn't thinking about the unwritten rules... 😄 But how do we solve this without a fixed path to the package document, and isn't it too late for that? Files inside the container may appear outside the container to reading systems depending on how they unpack them, too. That's why I'm not sure we can solve this outside of epubcheck, at least not beyond recommendations that accommodate both possibilities. There can only be internal consistency within any given application processing the epub.
We tend to tolerate mistakes like this. The caveat to authors is if you don't follow these kinds of rules then bad things will happen. In this case, the reading system may get conflicting information about the resource. It's possible it might break the spine, too, as if you navigate to the resource I imagine it will complicate reading systems looking up which spine item you're in if there's more than one entry that matches the resource. I guess we could recommend that reading systems ignore duplicate entries, though that wouldn't solve the problem of inaccuracies between the listings. |
I am still trying to get around the problem (also for #1681). Looking at #1374 (comment) from @rdeltour, and using <item href="/xhtml/content.xhtml" media-type="application/xhtml+xml" /> (ie, a path-absolute-URL) and looking at options A and B I get the URLs What this means is that the A and B cases, i.e.,
are consistent with the expectations as well as the URL spec. Isn't it possible then to say (in our spec) that a Reading System MUST treat these URLs as if its implementation followed one of these two models? This does not mean that it must implement it exactly this way, but it must, when using a different approach, emulate one of the two options. Wouldn't that be enough for the spec? After all, if we consider an EPUB instance as a 'frozen website', those two options do represent a perfectly consistent mental model, so I do not think that approach would feel too strange... |
How can you be sure this is what you will get when you don't know if the root directory of the epub will be preserved, though? Nothing says it is wrong to expand it to: https://example.org/nameOfEPUB/EPUB/xhtml/content.html because you might also have in the same container: https://example.org/nameOfEPUB/EPUB2/xhtml/content.html in which case the / no longer refers to where you think it does. |
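The objection can be illustrated with a short sketch. The package document URLs below are hypothetical (the `nameOfEPUB` segment follows the example in the comment above), and resolution uses Python's RFC 3986-style `urljoin` as a stand-in for the URL Standard's parser.

```python
from urllib.parse import urljoin, urlsplit

ROOT_RELATIVE_HREF = "/xhtml/content.xhtml"

# Hypothetical layouts: (URL path of the container root, package document URL).
LAYOUTS = {
    "flat": ("/", "https://example.org/EPUB/package.opf"),
    "nested": ("/nameOfEPUB/", "https://example.org/nameOfEPUB/EPUB/package.opf"),
}


def stays_in_container(container_root, package_url, href):
    """True if the resolved URL path is still below the container root."""
    path = urlsplit(urljoin(package_url, href)).path
    return path.startswith(container_root)


results = {
    name: stays_in_container(root, url, ROOT_RELATIVE_HREF)
    for name, (root, url) in LAYOUTS.items()
}
print(results)
```

A leading slash always resolves against the root of the origin, so the reference only lands inside the container when the container root happens to coincide with the origin root.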
Ouch, you are right: for root-relative paths #1374 (comment) does not work. And, having gone through that, I think I agree with the proposal in #1681 that those should be disallowed. But it does for path-relative-scheme-less-URLs, doesn't it? |
It all depends on how the EPUB is structured and then unpacked. If you put all your content in a single directory and have the package document at the root, there shouldn't be an issue. If you refer to files across sibling directories in the root, then you're back in trouble again. For example, if you had this:
If you have a path from EPUB1 like '../shared/img/photo1.jpg', the file may not be there after the reading system unpacks the EPUB. It may only extract the directory where the first package file is. (I'm using multiple renditions, but even a single-rendition epub doesn't have to be self-contained in a subdirectory. It's generally the norm, though, because of this problem.) And to make things even more fun, if you put the package documents in the root, there probably isn't an issue (but I haven't tested this):
But these are the problem of having a requirement that requires checking the zipped content using URL records when we're not completely clear what will be unpacked. Epubcheck can make easier assumptions to check the validity that reading systems cannot, at least as I understand @rdeltour's concern. What happens after extraction is technically a separate problem, but it's also one we're quiet on. As @rdeltour has noted, what the reading system determines is available is not going to always be the same as what epubcheck does. But that's true now, too. So we probably also have a few other issues to look at:
|
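The sibling-directory problem described above can be modelled with a small sketch. It assumes a reading system that extracts only the directory holding the package document; that behaviour, the function name, and the package file names are hypothetical, while `EPUB1` and `../shared/img/photo1.jpg` come from the comment.

```python
import posixpath


def visible_after_partial_unpack(package_path, href):
    """True if href stays inside the package document's directory.

    Models a reading system that (hypothetically) extracts only the
    directory containing the package document.
    """
    package_dir = posixpath.dirname(package_path)
    target = posixpath.normpath(posixpath.join(package_dir, href))
    if package_dir == "":
        # Package document at the ZIP root: anything that does not
        # climb out of the archive remains visible.
        return not (target == ".." or target.startswith("../"))
    return target.startswith(package_dir + "/")


shared_from_subdir = visible_after_partial_unpack(
    "EPUB1/package.opf", "../shared/img/photo1.jpg"
)
local_from_subdir = visible_after_partial_unpack(
    "EPUB1/package.opf", "xhtml/content.xhtml"
)
shared_from_root = visible_after_partial_unpack(
    "package1.opf", "shared/img/photo1.jpg"
)
print(shared_from_subdir, local_from_subdir, shared_from_root)
```

Under this model the shared image is lost when the package document sits in `EPUB1`, but stays reachable when the package document is at the root, consistent with the (untested) expectation stated in the comment.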
Exactly! And my bad for not having spotted that earlier (during the URL PR review or EPUBCheck dev) 😅 Some comments on @mattgarrish’s list of questions:
This question is specifically depending on this current issue, which will define how relative URLs must be parsed. Depending on the approach, it could result in URLs that can refer to something outside the container, or not. It all boils down to what base URL must be used to parse relative URLs. We have two high-level options:
With the second option, given we already say "as if", we may go further than what @iherman proposed and assign an arbitrary URL to the root of the container. For instance
I cannot see a use case where an author would intentionally do that, so it might not help much to require marking it as In any case, if we make it so that relative URLs are never parsed to something identifying a location outside the container, this point is moot.
Yeah, good question 😊 (it's typically the kind of issue that was underspecified in previous author-centric EPUB spec, but that is worth some well-specified RS guidance). That can be discussed in a separate issue.
I'm not sure I understand the issue there. Or your concern with the shared directory use case. I guess this is related to when you said earlier "Files inside the container may appear outside the container to reading systems depending on how they unpack them, too.", which I'm not sure I understood either 😅. For me, the spec is rather clear on this one. It explicitly says that any location descendant from the root directory can be used in the publication. So it's the RS responsibility to keep everything under the root when unpacking the container, no? Finally, while we're at it, we could add another question:
Would it be reasonable to say only absolute URLs with a special scheme that is not @mattgarrish I didn't create new issues for the separate questions we brought, I'll let you decide what you prefer as an editor. But if you'd like me to create these, let me know! |
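One consequence of assigning a URL to the container root can be sketched briefly. If that URL has an empty path, that is, the container root is the root of its own origin, then dot-segment removal (RFC 3986, section 5.2.4) drops any excess `..`, so a relative reference cannot resolve to a location above the root. The host name below is invented, and this assumes Python 3.5 or later, where `urljoin` follows RFC 3986.

```python
from urllib.parse import urljoin

# Hypothetical URL assigned to the container root; the host is made up.
CONTAINER_ROOT = "https://container.example/"
PACKAGE_URL = urljoin(CONTAINER_ROOT, "EPUB/package.opf")


def resolve_in_container(href):
    """Resolve href against the package document's assumed URL."""
    return urljoin(PACKAGE_URL, href)


# A reference to a sibling directory stays within the container.
inside = resolve_in_container("../shared/img/photo1.jpg")

# A reference with more ".." segments than directory levels is clamped
# at the root instead of escaping it.
clamped = resolve_in_container("../../../outside.txt")

print(inside)
print(clamped)
```

Such a reference still fails to identify an existing resource, but it no longer identifies anything outside the container, which is the situation in which the "outside the container" question becomes moot.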
I guess I do not understand what I am seeing here. Are we seeing the content of the ZIP file, after being unpacked onto the processor's file system? If so, if the file |
No, that's the packed EPUB. It seems some reading systems look up the location of the package document and only unpack the directory it's in. So in the above case, any content in /EPUB2 and /shared is not available to the publication in /EPUB1. That doesn't seem like it should be valid, as the definition only says the root dir is optional to create. But that's also only a definition and we don't say anything about what has to be unpacked or made accessible.
I only brought that in because you can't rely on checking if a file is "within the abstract container" after the reading system has unpacked the content. Files that were in the zip container are gone at that point. |
Yes, exactly. Again the idea is to use spec language to unambiguously define what the relative URLs identify. An RS is of course free to implement that as they please!
Sure.
Interesting use case 😊.
I wasn't aware of RS interop issues there. The EPUB container spec says ("File and directory" section):
which quite unambiguously says this is allowed? And that RS should theoretically handle that fine?
In the current spec, that's an RS bug in my book.
OK will do! (later today, kid ill at home 😅) |
Yes, that sounds absolutely wrong. We should say that the full ZIP package content should be available. I am actually surprised this is not the case... |
Ya, I don't know why we didn't log an issue when we were developing the MR spec. I guess it got forgotten after we wrote the note.
Definitely allowed, but it's the theoretical part that always does us in. The spec doesn't disallow extracting only the file where the package document is located, and it probably works fine for the vast majority of EPUBs. Wish I could remember which reading systems we got tripped up by, but in any case we need a proper requirement.
I've got a post-vaccine queasy adult at home today, and sadly it's me, so I'm in no better a boat... 🤢 |
The issue was discussed in a meeting on 2021-06-10 List of resolutions:
2. URLs and the package document
See github issue #1681, #1374, #1688, #1686.
Dave Cramer: this is a bunch of issues that revolve around how you interpret URLs in the package document, especially if they're absolute URLs
Matt Garrish: in epubcheck there was a root-relative URL that caused an error, and that spawned all of this
Dave Cramer: in issue 1688 Romain he suggests that manifest items should have one of the special schemes (except
Matt Garrish: there are edge cases where file scheme items make sense, but not generally for epub
Dave Cramer: it goes against epub as a portable format, and the file scheme ties the epub to a specific file system
Matt Garrish: never heard of one
Dave Cramer: okay, so what if we just say no file URLs in epub?
Matt Garrish: most RS probably won't do anything with file URL
Wendy Reid: depending on platform you might not even be able to access parts of the file system (e.g. iOS apps)
Dave Cramer: can we start by resolving on this point from 1688?
Dan Lazin: is there a use case for some of these other schemes? Why would you have an FTP in your epub?
Matt Garrish: if we go too far, do we prevent future stuff? will we have to come back and re-add this in the future?
Ben Schroeter: is the idea that if we disallow file scheme, then we also disallow "slash URLs"?
Matt Garrish: not sure those are the same
Dave Cramer: what would be the consequences of forbidding root-relative paths?
Matt Garrish: not sure there are any, because epubcheck had forbidden these until a recent update
Dave Cramer: and this is just for href on manifest?
Matt Garrish: no, this would be anywhere, e.g. in content docs too
Dan Lazin: do we support the base tag?
Dave Cramer: we've been phasing out
Dan Lazin: the base tag allows you to define what the relative path is relative to
Matt Garrish: base would force you to have all external resources, right? It exists, but I don't imagine anyone really going there
Marisa DeMeglio: there was a resolution a few weeks ago about dumping
Dave Cramer: and that's separate from the HTML base element
Dan Lazin: if you set base to some website, and then use root-relative URLs, your URLs would appear to be relative, when they are actually absolute
Dave Cramer: but can we really say anything about base because its part of HTML?
Matt Garrish: so you must not use root-relative URLs unless you use a base?
Dan Lazin: what was the harm in not banning root-relative?
Matt Garrish: because the RS might treat zip root as the root, but they could also treat location of package doc as root
Dan Lazin: maybe permit it, but use SHOULD NOT?
Dave Cramer: yes, e.g. with books that only work with iBooks because of scripting support
Matt Garrish: maybe just a note that root-relative could cause issues if authors use it?
Dave Cramer: so does that mean that there are epubs that could be built to work in some RS, but expose an interop issue if opened in another RS?
Matt Garrish: right
Dave Cramer: not sure what the right course of action is, but maybe we can continue this another time with Romain present
Wendy Reid: we need RS people here on next call that know exactly what RSes are doing right now
Marisa DeMeglio: one of the github threads has a sample, but I wasn't able to download it
Matt Garrish: also, there's not much hand authoring, and most tools will put all the content into one folder |
The issue was discussed in a meeting on 2021-06-18
2. What is the relationship between URLs and the package doc (what is home?)
See github issue #1681, #1374, #1687, #1686.
Wendy Reid: we started this discussion last week. Core question is: Where is home (given we allow both relative and absolute URLs) in the epub context
Romain Deltour: we have to keep in mind: 1) what things have to be put in epub core spec, and 2) what are the rules for epub RS spec
Wendy Reid: okay, so what is the IRI of the package document then?
Ivan Herman: we can't really answer what the IRI of the package is, and i'm not sure we should try
Matt Garrish: we have 2 issues, 1) are these resources within the container and how do we determine that? 2) what happens when you unpack, and where do these resources go?
Brady Duga: so absolute URIs are not allowed, and what relative IRI is interpreted by the language in question (e.g. HTML, or CSS, depending on what type of document it is)
Matt Garrish: i think the issue is root-relative is still a relative path, so do we have to say "all relative is allowed, except root relative"
Romain Deltour: even with regular relative URLs, the spec is silent on what happens if the relative URL tries to go below the container root?
Ivan Herman: i was surprised to find that some RS don't automatically unpack the whole zip
Matt Garrish: we have requirement in OCF that all relative resources must resolve to something in container
Gregorio Pellegrino: i know that Colibrio streams files out of zip without unzipping
Wendy Reid: yes, there are more examples of RS doing that beyond that
Ivan Herman: but conceptually an RS unpacks the whole zip file onto a domain (as if it were a file system). If we do that then all these concepts become clear
Hadrien Gardeur: streaming from zip is what Readium does by default
Romain Deltour: i'm surprised that resources that are not in the same directory tree as the OPF would not be accessible in the epub
Romain Deltour: this defines unambiguously how relative URLs are to be resolved
Wendy Reid: going back to romain's point about testing, there are a variety of ways that RSes handle these URLs
Ivan Herman: would some sort of conceptual model clash with how things are implemented?
Hadrien Gardeur: we treat OPF as base, and that seems to work in most cases. Seems to make more sense to us than treating zip as base
Matt Garrish: this originally came up in multiple renditions when we had issues referencing across sibling directories
Romain Deltour: drawback of conceptual solution is that sometimes adding this layer of abstraction makes spec harder to use
Wendy Reid: is the best way forward at this point for us to do some sort of testing? (e.g. OPF as base, zip as base, examples of files living outside when OPF is base)
Ivan Herman: i think we should also test environment where multiple renditions is implemented
Wendy Reid: do we know if a functioning implementation of multiple renditions?
Hadrien Gardeur: barnes and noble were using multiple renditions for newspapers and magazines
Wendy Reid: okay, so maybe we test on Nook app |
The issue was discussed in a meeting on 2021-07-02
2. Are root-relative paths valid?
See github issue #1681, #1374. See github pull request #1725.
Dave Cramer: What more needs to happen or can happen in the spec for root-relative paths?
Ivan Herman: one problem we need to address is that we have a problem with iBooks and others that rely on Adobe ADE, namely that they rely on a specific way of organizing the files, which is not in the standard.
Romain Deltour: the test was done with valid ePub with shared resources - there is still the issue of root-relative URL paths and paths that would go outside the container. I think we need the spec to address that.
John Foliot: Is an unintended consequence that a publisher would have to create two versions, one for iBooks and another for other reading systems?
Dave Cramer: I don't see huge problems around interoperability because EPUBs are consistent with folder structure, generally.
Ivan Herman: Whatever works for iBooks works for others - but there are perfectly valid ePubs that iBooks doesn't take.
Romain Deltour: these are edge cases, we don't see this problem often if ever.
Ivan Herman: it would be helpful to have a clearly-worded proposal for reading systems. Hoping Romain's help with this.
Dave Cramer: everyone seems to agree that having
Hadrien Gardeur: from a reading system perspective, they need to resolve URIs, and expose the HTML resource (or any resource) to web view.
Ivan Herman: What precisely should the recommendation in the reading system spec be to cover all implementations?
Hadrien Gardeur: we don't know how each RS works behind the scenes, we can only speculate.
Ivan Herman: If we put something in the spec, it's up to RS how to implement
Hadrien Gardeur: On the web, we don't think about files and root containers. For reading systems, we are deciding how an EPUB behaves. So wary of this conceptual approach.
Dave Cramer: we are really talking about edge cases here. Hoping that we can build some tests based on the write-up and what we are trying to achieve.
Hadrien Gardeur: difficult to test everywhere - gets tricky when you have to consider different CSS, etc
Dave Cramer: let's get some proposals down with Romain's help, and get Matt to take a look at them, and proceed from there.
Ivan Herman: Must have a clear statement somewhere whether we intend to restrict EPUB content and define organization of EPUB package. |
I reframed the issue in #1888, along with a (non-exhaustive) list of possible solutions. |
The issue was discussed in a meeting on 2021-10-29 List of resolutions:
2.3. "IRI of the Package Document": what is this exactly? (issue epub-specs#1374)
See github issue epub-specs#1374.
Romain Deltour: I may summarize.
Romain Deltour: to use this algorithm we have to know the base URL (https://example.org).
Romain Deltour: I'm going to show other examples.
Romain Deltour: in this case I'm going outside of the EPUB.
Romain Deltour: that's why I think we should define which is the base URL, also for security issues. Ivan Herman: I remember that one solution may be to consider an EPUB as a localhost (with a unique port). Romain Deltour: yes, there are different approaches. One is to use domains, another is to use a custom protocol scheme:.
Romain Deltour: I don't know which one is better. Ivan Herman: I think defining a URI scheme for that is not a good idea. Romain Deltour: I don't think we'll come with a solution that will be used by the end user. Brady Duga: I think there are 4 cases: local URLs, online URLs, jar URLs. Romain Deltour: yes, but also referencing to resources outside the package. Brady Duga: do we need to tell people how to display URLs inside on EPUBs (using fragments)?
Hadrien Gardeur: referencing everything outside the archive is problematic, especially for the content document. Romain Deltour: removing that paragraph about the URL of the package document won't work. Romain Deltour: at a minimum, we should base everything on the assumption that there is a url for the root of the container. See github issue epub-specs#1843. Dan Lazin: there is another issue: #1843. Romain Deltour: this might not answer entirely. Dan Lazin: is it a predictable url? Romain Deltour: this scripting mechanism is only about an origin--could be an opaque origin, doesn't have to be a url. Ivan Herman: where do we go from here? Dave Cramer: do we ask for help? Ivan Herman: we have tried and failed before. Romain Deltour: I was supposed to come up with a proposal. Ivan Herman: we can't go to CR with this stuff open. Dave Cramer: could we talk to ping? Romain Deltour: could we liaise with Anne at WHATWG? Ivan Herman: I worry about that. Tzviya Siegman: talking to Tess would be good. Ivan Herman: if we have a proposal that romain can put together. Romain Deltour: I can summarize the problem statement. Laurent Le Meur: tests will take time.
Laurence Zaysser: could we have a fifth objective, easy to move to web publication? Romain Deltour: it's about any relative urls. Just dealing with path-relative won't solve the issue. See github pull request epub-specs#1725. Matt Garrish: we have 1725 PR, which forbids path-absolute URLs. Is there any reason we shouldn't merge that? Wendy Reid: have we exhausted this? Ivan Herman: to answer matt, that one can go in.
Ivan Herman: using root-relative IRIs is a bad idea for something like epub, where the root url is unclear.
|
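For readers wondering what a check for path-absolute (root-relative) URLs might look like, here is a minimal sketch. It is not the text of PR #1725 or EPUBCheck's logic; the function name is hypothetical, and it relies on Python's `urlsplit` rather than the URL Standard's parser.

```python
from urllib.parse import urlsplit


def is_path_absolute(href):
    """True for root-relative references such as "/xhtml/content.xhtml".

    Absolute URLs and scheme-relative references ("//host/path") are
    excluded because they carry their own scheme or authority.
    """
    parts = urlsplit(href)
    return not parts.scheme and not parts.netloc and parts.path.startswith("/")


samples = [
    "/xhtml/content.xhtml",
    "xhtml/content.xhtml",
    "../shared/img/photo1.jpg",
    "//example.org/xhtml/content.xhtml",
    "https://example.org/xhtml/content.xhtml",
]
flagged = [href for href in samples if is_path_absolute(href)]
print(flagged)
```

Only the first sample is flagged; ordinary path-relative references, which the discussion notes are a separate problem, pass through untouched.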
works for me! |
We may not be in position of providing an absolutely clean definition (so maybe some editorial hand waving would be necessary) but…
§4.2.3.4.2 The item Element says:
The intention is clear but what is the "IRI of the Package Document"? After all, the package document is part of a ZIP file, it is not really on the Web, ie, it is not clear what its IRI is.
Can we say something more precise about this?