Refine the requirements on how RS must process the container directory structure #1687

Closed
rdeltour opened this issue May 27, 2021 · 16 comments · Fixed by #1724
Labels
EPUB33 Issues addressed in the EPUB 3.3 revision Spec-ReadingSystems The issue affects the EPUB Reading Systems 3.3 Recommendation Topic-OCF The issue affects the OCF section of the core EPUB 3 specification

Comments

@rdeltour
Member

The EPUB core spec says in the File and Directory Structure section:

EPUB Creators MAY locate all other files within the OCF Abstract Container in any location descendant from the Root Directory, provided they are not within the META-INF directory.

But @mattgarrish reports in #1374 that some reading systems do not handle that correctly. They don't allow content that is not in the directory of the Package Document (or any descendant directory).

There are at least two (exclusive) options:

  • add language in the RS spec to say more explicitly that RS must be able to handle all container content.
  • add restrictions to the Core spec that say authors should not rely on content that is not in the Package Document directory and descendants.
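To make the affected layout concrete, here is a minimal Python sketch that lists the files a "package directory only" reading system would not see. It assumes a single-rendition container and takes the first `rootfile` in `META-INF/container.xml`; the sample file names (`EPUB1/package.opf`, `shared/style.css`) are hypothetical and only echo the directory names used later in this thread.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

CONTAINER_NS = {"ocf": "urn:oasis:names:tc:opendocument:xmlns:container"}

CONTAINER_XML = """<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="EPUB1/package.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""


def package_document_path(epub):
    """Return the full-path of the first rootfile in META-INF/container.xml."""
    root = ET.fromstring(epub.read("META-INF/container.xml"))
    rootfile = root.find("ocf:rootfiles/ocf:rootfile", CONTAINER_NS)
    return rootfile.attrib["full-path"]


def entries_outside_package_dir(epub):
    """List files that are not at or below the package document directory.

    The mimetype file and everything under META-INF are ignored.
    """
    package_dir = posixpath.dirname(package_document_path(epub))
    prefix = package_dir + "/" if package_dir else ""
    outside = []
    for name in epub.namelist():
        if name.endswith("/") or name == "mimetype" or name.startswith("META-INF/"):
            continue
        if not name.startswith(prefix):
            outside.append(name)
    return outside


def build_sample_epub():
    """Build a hypothetical in-memory container with a sibling 'shared' directory."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("mimetype", "application/epub+zip")
        archive.writestr("META-INF/container.xml", CONTAINER_XML)
        archive.writestr("EPUB1/package.opf", "<package/>")
        archive.writestr("EPUB1/content.xhtml", "<html/>")
        archive.writestr("shared/style.css", "body { margin: 0 }")
    buffer.seek(0)
    return zipfile.ZipFile(buffer)


if __name__ == "__main__":
    print(entries_outside_package_dir(build_sample_epub()))
```

Any path this reports is content that the core spec currently allows but that some reading systems may never unpack.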
@rdeltour
Member Author

Copying previous comments made in #1374:

from @mattgarrish

There isn't a requirement to preserve the zip root directory or any descendant content that isn't in the same directory as the package document. You'd think reading systems would preserve it, but past experience in multiple renditions showed that couldn't be relied on. The result is that some reading systems won't let you reach across sibling folders in the zip root because they don't appear to preserve them.

That's why we put this note in the multiple renditions spec: https://www.w3.org/TR/epub-multi-rend-11/#h-note

I know most EPUBs have a single "EPUB" directory where the content is stored, but that's not a requirement. If you don't follow that pattern, and don't have the package document in the root, bad things can happen (i.e., reading systems won't display the content that never got unpacked).

So who's at fault in this scenario? Should the reading system be required to unpack all content and ensure that all content below the root directory is available, even if it doesn't create an extra folder for the root directory? Should authors be more strongly warned not to rely on being able to access across sibling folders that are not below the package document?

and:

It seems some reading systems look up the location of the package document and only unpack the directory it's in. So in the above case, any content in /EPUB2 and /shared are not available to the publication in /EPUB1.

That doesn't seem like it should be valid, as the definition only says the root dir is optional to create. But that's also only a definition and we don't say anything about what has to be unpacked or made accessible.

and @iherman (about Matt's last paragraphs):

Yes, that sounds absolutely wrong. We should say that the full ZIP package content should be available. I am actually surprised this is not the case…

also from @mattgarrish

The spec doesn't disallow extracting only the directory where the package document is located, and it probably works fine for the vast majority of EPUBs.

Wish I could remember which reading systems we got tripped up by, but in any case we need a proper requirement.

@iherman
Member

iherman commented May 28, 2021

Picking from @mattgarrish:

Should the reading system be required to unpack all content and ensure that all content below the root directory is available,

I do not think it would be particularly shocking to require this.

@mattgarrish mattgarrish added Spec-ReadingSystems The issue affects the EPUB Reading Systems 3.3 Recommendation Topic-OCF The issue affects the OCF section of the core EPUB 3 specification labels May 28, 2021
@mattgarrish
Member

Should the reading system be required to unpack all content and ensure that all content below the root directory is available,

I do not think it would be particularly shocking to require this.

If we require all the content be unpacked, then conceivably the reading system has to produce the "/EPUB" or "/OEBPS" directory that the vast majority of EPUBs use - so you'd always have this directory on top of whatever subdomain you serve the publication from (if that's how you serve the epub). For the majority of EPUBs, the only thing this accomplishes is making the generally useless meta-inf directory available.

That's why I wonder if the requirement should be changed to authoring - publication resources may be located anywhere in the abstract container provided they are at or below the directory that contains the package file(s).

Multiple renditions already suggests you do this, and I suspect most single-rendition publications don't encounter the problem, so it shouldn't be backwards-breaking.

But doing this also makes the rules on relative paths more complex, as there would be a rule that the references not resolve outside the container and, for content, that they not resolve to a directory above the package document.

@mattgarrish
Member

To finish up the thought, if we disallow root-relative paths, then with such an authoring restriction it doesn't matter if the reading system unpacks the ocf root or only the directory with the content. The resources will all be there and the relative paths to them in the content will always work.
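A rough sketch of what such an authoring check could look like, in Python. The function names and result labels are hypothetical, and the check is deliberately simplified: it treats references as plain POSIX-style container paths and ignores fragments, queries, and percent-encoding.

```python
import posixpath


def resolve_in_container(referrer, href):
    """Resolve href against the container path of the file that references it."""
    base_dir = posixpath.dirname(referrer)
    return posixpath.normpath(posixpath.join(base_dir, href))


def classify_reference(package_doc, referrer, href):
    """Classify a reference under the authoring restriction discussed above.

    Returns one of the hypothetical labels 'root-relative',
    'outside-container', 'not-under-package-dir' or 'ok'.
    """
    if href.startswith("/"):
        return "root-relative"
    resolved = resolve_in_container(referrer, href)
    # After normalization, a leading '..' means the path climbed above the root.
    if resolved == ".." or resolved.startswith("../"):
        return "outside-container"
    package_dir = posixpath.dirname(package_doc)
    if package_dir and not resolved.startswith(package_dir + "/"):
        return "not-under-package-dir"
    return "ok"


if __name__ == "__main__":
    package_doc = "EPUB1/package.opf"
    referrer = "EPUB1/xhtml/ch1.xhtml"
    for href in ("../css/style.css", "../../shared/style.css",
                 "../../../secret.txt", "/shared/style.css"):
        print(href, "->", classify_reference(package_doc, referrer, href))
```

Under this restriction only the first reference in the demo would be acceptable, whether the reading system unpacks the whole container or just the package document directory.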

Another option, anyway.

@iherman
Member

iherman commented Jun 2, 2021

Should the reading system be required to unpack all content and ensure that all content below the root directory is available,

If we require all the content be unpacked, then conceivably the reading system has to produce the "/EPUB" or "/OEBPS" directory that the vast majority of EPUBs use - so you'd always have this directory on top of whatever subdomain you serve the publication from (if that's how you serve the epub). For the majority of EPUBs, the only thing this accomplishes is making the generally useless meta-inf directory available.

The requirement on reading systems is conceptual and not a hard requirement on how they MUST implement things. If they want to optimize things, they are free to do so. But by making this conceptual requirement we have a clear framework to define things unambiguously and that, at the end of the day, is what counts.

That's why I wonder if the requirement should be changed to authoring - publication resources may be located anywhere in the abstract container provided they are at or below the directory that contains the package file(s).

I do not think we should introduce a structural requirement at this point. We do not know how EPUB files are structured in the wild, and we do not want to create backward incompatibilities...

@mattgarrish
Member

The requirement on reading system is conceptual and not a hard requirement on how they MUST implement things.

But that's essentially saying that what they do right now is fine, no? If there's no requirement, then unpacking everything and discarding all but the folder where the package document is located is as legitimate as unpacking nothing but the directory where the package document is located. You net access to the same set of files.

@iherman
Member

iherman commented Jun 2, 2021

Well... I may be lost. What I understood in this and the other thread was that, in some cases, content referenced with a relative URL is not found because it is not unpacked. My reaction was that everything should be (conceptually) unpacked, i.e., if that is what happens then it is an RS bug.

But maybe I need a hard reset to understand the problem :-(

@mattgarrish
Member

in some cases, content referenced with a relative URL is not found because it is not unpacked

Right, this is one of the problems of the current state of affairs. I think we're maybe talking across each other right now.

The other problem is why do we say that authors can locate their content anywhere below the root but then have no rules on unpacking or requirement that all the content in the container be available? It's a major gotcha for authoring that isn't explained anywhere.

I'm not proposing this as a resolution to the question of how do we determine what is in the container. I'm suggesting, given the known state of the world, that we may want to have a recommendation/requirement for authors to make sure their content is structured in a way that won't cause them unexpected grief in the reading systems that don't unpack everything. That's probably more realistic than expecting reading systems to change.

We'll still need to figure out how to consistently check what is in the container and/or what is below the package document, but that may be more of a conceptual problem to work out, as you say.

@iherman
Member

iherman commented Jun 3, 2021

Right, this is one of the problems of the current state of affairs. I think we're maybe talking across each other right now.

The other problem is why do we say that authors can locate their content anywhere below the root but then have no rules on unpacking or requirement that all the content in the container be available? It's a major gotcha for authoring that isn't explained anywhere.

Ah. So we should probably, beyond clarifying the full conceptual unpacking:

  • Specify that all documents referred to through a relative URL in the package document MUST be present in the container. epubcheck should check that (this should not be very complicated). RSes should check that, too (see also below).
  • There is also something to be said, I presume, on how RSs should react on an HTTP 404 error. This may be because there is an absolute URL reference that is not part of the container and whose HTTP request indeed gives a 404, but that is also the situation if the requirement in the previous point is not fulfilled. I do not know what RSs do at this moment; my preference is that they should display some custom page on 404 like browsers do...

Would that cover our issues?
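The first check could be sketched roughly as follows in Python. This is not how epubcheck implements it; it only looks at manifest `item` hrefs, skips absolute URLs, and resolves everything else against the package document directory. The sample manifest and file names are hypothetical.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET
from urllib.parse import unquote, urlsplit

OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}

SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <manifest>
    <item id="c1" href="content.xhtml" media-type="application/xhtml+xml"/>
    <item id="css" href="../shared/style.css" media-type="text/css"/>
    <item id="img" href="images/cover.png" media-type="image/png"/>
  </manifest>
</package>
"""


def missing_manifest_resources(epub, package_doc):
    """Return manifest hrefs that are relative but have no matching zip entry."""
    names = set(epub.namelist())
    package_dir = posixpath.dirname(package_doc)
    root = ET.fromstring(epub.read(package_doc))
    missing = []
    for item in root.findall("opf:manifest/opf:item", OPF_NS):
        href = item.attrib["href"]
        parts = urlsplit(href)
        if parts.scheme or parts.netloc:
            continue  # absolute URL: a remote resource, not expected in the container
        path = posixpath.normpath(posixpath.join(package_dir, unquote(parts.path)))
        if path not in names:
            missing.append(href)
    return missing


def build_sample_epub():
    """Hypothetical container in which images/cover.png is listed but absent."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("EPUB1/package.opf", SAMPLE_OPF)
        archive.writestr("EPUB1/content.xhtml", "<html/>")
        archive.writestr("shared/style.css", "body { margin: 0 }")
    buffer.seek(0)
    return zipfile.ZipFile(buffer)


if __name__ == "__main__":
    print(missing_manifest_resources(build_sample_epub(), "EPUB1/package.opf"))
```

Note that the sibling-directory reference (`../shared/style.css`) passes this check, since the file is in the container; whether a given reading system can actually reach it is the separate problem discussed in this issue.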

@dauwhe dauwhe added the Agenda+ Issues that should be discussed during the next working group call. label Jun 9, 2021
@iherman
Member

iherman commented Jun 18, 2021

The issue was discussed in a meeting on 2021-06-18

  • no resolutions were taken
View the transcript

2. What is the relationship between URLs and the package doc (what is home?)

See github issue #1681, #1374, #1687, #1686.

Wendy Reid: we started this discussion last week. Core question is: Where is home (given we allow both relative and absolute URLs) in the epub context

Romain Deltour: we have to keep in mind: 1) what things have to be put in epub core spec, and 2) what are the rules for epub RS spec
… the latter is more important because we can say whatever we want in core, but authors may deviate, and then it is up to RS to decide how to react
… also, i think we should look into question of what is home first, and that will inform what to do with root-relative URLs

Wendy Reid: okay, so what is the IRI of the package document then?

Ivan Herman: we can't really answer what the IRI of the package is, and i'm not sure we should try
… rather, what do we expect RS to do conceptually?
… the whole epub structure relies on the idea that epub is kind of a frozen website
… i think we say this is the conceptual model within which epub exists, and we should not say exactly how RS can do that
… just as long as the observable behavior is identical
… so as long as after epub is unpacked there is a root that we can refer to, it is fine
… and whether this root is the same IRI of the package or not is none of our business

Matt Garrish: we have 2 issues, 1) are these resources within the container and how do we determine that? 2) what happens when you unpack, and where do these resources go?
… so I don't think there can be a consistent root unless we start to enforce these things
… inside epub resources can be within the container, but that might not be true once the epub is unpacked
… e.g. do you have to unpack everything in the zip? Or just whatever is in the epub under the package?

Brady Duga: so absolute URIs are not allowed, and how a relative IRI is interpreted depends on the language in question (e.g. HTML, or CSS, depending on what type of document it is)
… so why do we have to define what root is if we don't allow absolute URIs?

Matt Garrish: i think the issue is root-relative is still a relative path, so do we have to say "all relative is allowed, except root relative"

Romain Deltour: even with regular relative URLs, the spec is silent on what happens if the relative URL tries to go below the container root?
… and is it possible to look at RSes today and test what they do?

Ivan Herman: i was surprised to find that some RS don't automatically unpack the whole zip
… i thought this was obvious
… but then what if there is a relative URL that is not on manifest, but also happens to be in zip?

Matt Garrish: we have requirement in OCF that all relative resources must resolve to something in container
… i don't think that was the issue

Gregorio Pellegrino: i know that Colibrio streams files out of zip without unzipping

Wendy Reid: yes, there are other RSes doing that as well

Ivan Herman: but conceptually an RS unpacks the whole zip file onto a domain (as if it were a file system). If we do that then all these concepts become clear
… but i'm not sure if a streaming based solution meets that conceptual model

Hadrien Gardeur: streaming from zip is what Readium does by default
… unzipping is a problem for DRM. Some expectation that you keep the epub zipped. And we've done some optimizations with this in mind
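As a point of reference, reading a resource straight out of the zip can be sketched in a few lines of Python. This is not how Readium or Colibrio implement streaming; it only illustrates that a lookup by full container path reaches sibling directories without anything being unpacked. The file names are hypothetical.

```python
import io
import zipfile


def read_resource(epub_bytes, container_path):
    """Read one resource from the zip, addressed by its full container path."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as epub:
        return epub.read(container_path)


def build_sample_bytes():
    """Hypothetical container with content in two sibling directories."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("EPUB1/content.xhtml", "<html/>")
        archive.writestr("shared/style.css", "body { margin: 0 }")
    return buffer.getvalue()


if __name__ == "__main__":
    print(read_resource(build_sample_bytes(), "shared/style.css"))
```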

Romain Deltour: i'm surprised that resources that are not in the same directory tree as the OPF would not be accessible in the epub
… going back to the point about defining what should happen conceptually, the spec could say that we define a URL that must be used as the base when resolving relative URLs (e.g., https://ocf.example.org)

Ivan Herman: +1 to romain

Romain Deltour: this defines unambiguously how relative URLs are to be resolved
… and we can say this URL is the root of the OCF
… this makes it so that relative URLs cannot go outside of the container
… and then RSes know what relative URLs point to
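A small Python sketch of this idea, using the placeholder origin mentioned above as the container root. The `resolve` helper and the sample paths are hypothetical; the point is only that standard URL resolution (RFC 3986 dot-segment removal, as implemented by `urllib.parse.urljoin`) discards any `..` segments that would climb above the root of the origin.

```python
from urllib.parse import urljoin

# Placeholder base URL standing in for the root of the OCF container.
CONTAINER_ROOT = "https://ocf.example.org/"


def resolve(container_path, href):
    """Resolve href as found in the file located at container_path."""
    base = urljoin(CONTAINER_ROOT, container_path)
    return urljoin(base, href)


if __name__ == "__main__":
    referrer = "EPUB1/xhtml/ch1.xhtml"
    for href in ("../css/style.css", "../../shared/style.css",
                 "../../../../outside.txt", "/shared/style.css"):
        print(href, "->", resolve(referrer, href))
```

With such a base, both the sibling-directory reference and the root-relative reference resolve to a well-defined location inside the container, and a reference with too many `..` segments is clamped at the root instead of escaping it.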

Wendy Reid: going back to romain's point about testing, there are a variety of ways that RSes handle these URLs
… we are especially unsure what happens when files are outside the container
… so this is good reason to do some testing

Ivan Herman: would some sort of conceptual model clash with how things are implemented?

Hadrien Gardeur: we treat OPF as base, and that seems to work in most cases. Seems to make more sense to us than treating zip as base
… but these two are most common implementations

Matt Garrish: this originally came up in multiple renditions when we had issues referencing across sibling directories
… not sure if this is still an obstacle, worth testing

Romain Deltour: drawback of conceptual solution is that sometimes adding this layer of abstraction makes spec harder to use
… so we want to respect people who are actually having to implement it

Wendy Reid: is the best way forward at this point for us to do some sort of testing? (e.g. OPF as base, zip as base, examples of files living outside when OPF is base)

Ivan Herman: i think we should also test environment where multiple renditions is implemented
… if we end up with something that makes multiple renditions impossible, then we should just remove the multiple rendition note

Wendy Reid: do we know of a functioning implementation of multiple renditions?

Hadrien Gardeur: barnes and noble were using multiple renditions for newspapers and magazines
… not sure if they still use it

Wendy Reid: okay, so maybe we test on Nook app
… okay, so for now we test. Will have to ask Dan and the rest of the testing folk to help
… for now we don't have consensus on any sort of language, right?

@mattgarrish
Member

mattgarrish commented Jun 20, 2021

I've hacked up one of the multiple rendition samples to very basically test whether resources in a sister directory to the opf file can be accessed (attaching with .zip to post, so just delete the extra extension).

First test in Apple Books and the images and css were not rendered (kind of a big problem).

Thorium displays the book fine, as did the dropbox viewer.

I'll try some other apps tomorrow, but anyone who wants to try it out feel free to post what results you get.

(Interestingly, I did open an issue about this after the AHL work finished but it got reassigned back to the MR spec to address in 3.1: #619. To show how shot my brain is when it comes to multiple renditions, the note about the problem was added earlier in this revision as part of clearing off the open issues.)

share-test.epub.zip

@iherman
Member

iherman commented Jun 21, 2021

Ouch. It works with colibrio (on my mac) and on the Firefox extension to read epub. On my iPad it does not run with the (old) bluefire reader, on aldiko, and, as you said, on Apple Books. It works with Marvin; I do not remember how to side-load books to the Kobo or Google player.

It is a mess.

@mattgarrish
Member

Ya, I just tried the Adobe Digital Editions app on Windows and it didn't render the css or images, but the Google Play chrome app did.

I was sort of hoping with time I'd be proven wrong and we wouldn't have to deal with this, but looks like the same state of affairs we discovered with multiple renditions.

@wareid
Contributor

wareid commented Jun 23, 2021

Tried this book on the Kobo iOS app and the desktop application, both worked just fine.

I also tested this on VoiceDream Reader, it also rendered fine.

@iherman
Member

iherman commented Jun 25, 2021

The issue was discussed in a meeting on 2021-06-24

List of resolutions:

  • Resolution No. 1: Provide a note in the core spec that this is a known issue, include non-normative advice about what to do, close issue 1687
  • Resolution No. 2: Declare root relative paths not recommended (should not be used), close 1681
View the transcript

1. Refine the requirements on how RS must process the container structure

See github issue #1687, #1681, #1686.

Wendy Reid: per discussion last week, mgarrish made us a test epub for this
… we've put it through various RS, Apple, Thorium, Colibrio, Kobo Desktop, Kobo iOS, ADE, more...
… aside from Apple and ADE, the test epub has worked
… it seems like most RS are flexible in their sourcing, but with our two fail cases, there is some variability in implementation

Brady Duga: and most of this was done via sideloading, and publisher pipelines are often different
… if publisher sent apple a book, we might have gotten a different result

Matt Garrish: we still have the problem that the spec doesn't say anything about this. There is no authoring requirement for where to put your content (other than below the root). And for RS there is no requirement for how to unpack, etc.
… it seems like it should be common sense. But beyond what we've already said, not sure what we should do. Maybe note it as a potential issue?

Wendy Reid: it probably doesn't hurt to refine language, but at this point creating a firm requirement would impact some existing RS implementations
… and it might make authors uncomfortable
… do we note that there is some confusion as to implementation, but clarify that we aren't going to enforce anything?

Matt Garrish: easiest solution is probably an authoring requirement. Esp. because most authors have probably never tried to do anything like the test epub
… so say authors should put their content under the package document

Brady Duga: this has been an issue forever, and the only time we noticed was with multiple renditions, which hasn't been implemented really. So is a 3rd solution to just leave it?
… if some publisher creates an epub that just doesn't work on Apple, maybe that can just be between that specific publisher and Apple...

Matt Garrish: this whole thing really only came up because of that root-relative thing, so on that issue maybe we just say not to use those

Wendy Reid: right, so we advise not to use root-relative, and we can't say specifically how RS will behave if you do it

Matt Garrish: can we resolve just to use something similar to the note we were going to have for multiple renditions?

Proposed resolution: Provide a note in the core spec that this is a known issue, include non-normative advice about what to do, close issue 1687 (Wendy Reid)

Brady Duga: +1

Wendy Reid: +1

Matthew Chan: +1

Matt Garrish: +1

Masakazu Kitahara: +1

Ben Schroeter: +1

Toshiaki Koike: +1

Shinya Takami (高見真也): +1

Resolution #1: Provide a note in the core spec that this is a known issue, include non-normative advice about what to do, close issue 1687

Wendy Reid: on the other two related issues: first, are root-relative paths valid? Or is this now moot?

Matt Garrish: i think we are on safer ground to just disallow those, especially because in the past epubcheck has had those come up as an error
… it may work on some RS, but that's fine

Proposed resolution: Declare root relative paths not recommended (should not be used), close 1681 (Wendy Reid)

Wendy Reid: +1

Matthew Chan: +1

Matt Garrish: +1

Toshiaki Koike: +1

Masakazu Kitahara: +1

Ben Schroeter: +1

Brady Duga: +1

Resolution #2: Declare root relative paths not recommended (should not be used), close 1681

Wendy Reid: the second one: what should RS do when manifest item has duplicate entries?
… this is worth testing (and should be easy enough to test)

Matt Garrish: i think the issue with this was that if there were multiple copies of the same item in manifest, then RS might not know which manifest item to go to when one copy is referenced

@dauwhe
Contributor

dauwhe commented Jul 2, 2021

Test book works in Kindle Previewer 3 for Mac. Fails in ADE 4.5 on Mac. Fails in iBooks as we already knew. Sigh.

@dauwhe dauwhe removed the Agenda+ Issues that should be discussed during the next working group call. label Jul 7, 2021
@mattgarrish mattgarrish added the EPUB33 Issues addressed in the EPUB 3.3 revision label Sep 14, 2022