consider specifying how EPUB interacts with the MIME sniffing standard #2491
Comments
[My obnoxious admin hat put on] If we do that, this means we introduce a new normative statement to the CR. This means, if I am not mistaken, that a new CR snapshot should be issued, and this would also trigger a round of minimally 28 days of comment deadline (see process doc) which means, in practice, that we cannot move to PR within that 28 days' time limit. This is not a problem per se and, actually, I would think that issuing a CR snapshot with all the changes we have done on the CR would be a proper thing to do, but that means we most probably would have to ask for a charter extension. (Our charter runs out end of February.) Just saying... |
[My obnoxious admin hat put down] I must admit I was not familiar with the Mime Sniff Whatwg document.
I am not familiar with the history, so @mattgarrish @bduga should know better, but I suspect that the goal was more on the informative side. But adding a normative statement of the sort you propose may indeed improve the spec... |
I can't say I recall ever discussing what a reading system has to do with the media types in the manifest, but my memory only goes back to 2011. I'd always assumed they were informative, since they're easily faked. They're meant, as @rdeltour says, for things like checking that CMTs are used, and fallbacks provided when not. It's been more of a security consideration that the resources in the package may not be what the manifest says they are, but defining how to ensure that hasn't been attempted. If we can leverage what the WHATWG spec defines, then it doesn't seem like a bad thing to add, but maybe only as a recommendation since we're late to the game on this. |
I am a little worried about adding any conformance statements here. I expect quite a few existing EPUBs will be considered broken, since these often do not match. Sometimes it is even unavoidable, for instance where the MIME type has drifted over the years (eg fonts). I am not sure what is gained by checking these. I guess fallback chain handling would be a little more reliable, but since that is fairly under-implemented as it is I don't see that as a compelling argument. My initial feeling is to leave this as-is, are there any strong reasons for wanting this change? |
One reason would be better interoperability between reading systems. Another is that there seem to be security considerations involved. HTML has this warning:
But this is beyond my area of expertise… |
To be clear, I'm not aware of any concrete problem that fixing this would solve. |
I am not sure this would really help with interop. I would at least want a
concrete example of an interop problem before making a change here. And
even if we wanted to require (or suggest) that reading systems make sure
the MIME type given in the package doc actually matches the MIME type of
the resource, how do we do that? Won't they have to apply the MIME sniffing
algorithm? Which in turn means having a proper supplied type.
…On Tue, Nov 29, 2022 at 12:04 PM Romain Deltour ***@***.***> wrote:
To be clear, I'm not aware of any concrete problem that fixing this would
solve.
I went down that spec rabbit hole when trying to answer the question "what
is the actual MIME type of a resource?", and noticed EPUB isn't very
specific here. Basically, the RS is free to decide what MIME type it
retrieves resources with, as it is both the "server" and the "UA".
|
What about adding more or less the same text as the one quoted for HTML to the security consideration section of the RS? That would not be a conformance issue for existing RS-s, but would draw attention on possible issues. |
We switched everything in that section to normative recommendations, so would we do the same here? It seems to fall under the malicious content category, so we could just add it as an example if we want to avoid normative statements:
|
yes, but all statements are SHOULD-s. No MUST-s. That being said, I am perfectly happy with the extension of the text you propose, too. |
FWIW, I ran a little experiment with the attached EPUB (mimetypes.zip). It contains:
The single content document has the following markup:
<img id="img" src="jpeg.foreign" alt=""/>
<picture>
  <source srcset="jpeg.foreign" type="image/vnd.epubtest"/>
  <img src="png.png" alt=""/>
</picture>
I tested in both Apple Books and Thorium. Not so surprisingly, the
I'm wondering what happens when the computed MIME type is not dependent on byte pattern sniffing, but is defined as the supplied MIME type? That's the case in a script context. Although I'm not sure if/how the Fetch API works in an interoperable manner in EPUB…
Also, note that the RS spec says:
Fortunately (?) this is a bit vague, in that "the MIME media type of a given publication resource" is not defined. I always thought that was the type declared in the package document. But in practice I think "the MIME media type of a publication resource" is more the computed MIME type of the resource (which depends on how the RS serves the resource and defines the supplied MIME type of the resource). If the RS detects the resource as JPEG and renders it, then no problem. That's what happens in the two RSs I tested. This is subtle, but I think at least adding a note explaining that (and how it is RS-dependent) would be helpful. [edited for clarity / typos] |
The idea is not so much to require RSs to compute a MIME type equal to the one in the package doc, but rather to better define or describe what an RS does to determine the computed MIME type of a resource. Most of it is defined in MIME Sniffing. The RS-dependent part is what the supplied MIME type of the resource is, and that depends specifically on how the RS serves the resource:
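The byte pattern sniffing discussed in this comment can be illustrated with a rough sketch. It checks only the well-known PNG and JPEG file signatures and is not the full image type matching algorithm from MIME Sniffing. The names `IMAGE_SIGNATURES` and `sniffImageType` are hypothetical and chosen for illustration only.

```typescript
// Simplified sketch: only two signatures, no masks, no other image formats.
const IMAGE_SIGNATURES: { mime: string; bytes: number[] }[] = [
  // PNG: 0x89 "PNG" CR LF 0x1A LF
  { mime: "image/png", bytes: [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a] },
  // JPEG: start-of-image marker followed by 0xFF
  { mime: "image/jpeg", bytes: [0xff, 0xd8, 0xff] },
];

// Returns the sniffed image type, or undefined when no signature matches.
function sniffImageType(data: Uint8Array): string | undefined {
  for (const sig of IMAGE_SIGNATURES) {
    const matches =
      data.length >= sig.bytes.length &&
      sig.bytes.every((byte, i) => data[i] === byte);
    if (matches) {
      return sig.mime;
    }
  }
  return undefined;
}
```

Under this sketch, a file named `jpeg.foreign` that starts with the JPEG signature would be detected as `image/jpeg` whatever type the package document declares, which would be consistent with a reading system rendering it anyway.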
All that to say that a lot depends on how the RS actually handles resources internally. I'm not saying that we should enforce one particular way. I'm fine with it being RS-dependent as long as we do not identify a concrete interoperability issue. But:
Yes, they're required (by HTML) to apply the sniffing algorithms, and yes they need a proper supplied type. That was precisely my point: currently, that supplied type is not well defined in EPUB.
|
Looking at this again, I was thinking the original prose was requiring reading systems to verify the media type, but if all this is asking is that reading systems set the supplied media type for resources, could we say:
Does that still capture the essence of what you're after, @rdeltour, without being too specific about how it's done? |
I am fine with making this explicit, but I don't think the history of this property, both the intended use and the actual use, merits recommending its use in the sniffing algorithm. If anything, I would argue for the exact opposite - specify that Reading Systems SHOULD NOT use the mime type as the supplied media type for mimesniff. And maybe add some text around what we intend this to be used for (fallbacks? Anything else?). |
Let me try to clarify: what I am after is understanding and hopefully specifying how the MIME Sniffing standard works in EPUB, which implies how the supplied MIME type detection algorithm is applied in an EPUB context. The key is that algorithm is basically a switch statement based on the protocol used to retrieve the resource. In EPUB, that protocol is implementation specific. So in your proposal @mattgarrish:
I don't think we can even say that a reading system can set a MIME type, nor that a resource MIME type is its supplied media type. That would be a violation of the MIME Sniffing standard, which is normatively referenced by HTML. Conformance to MIME Sniffing means the MIME Sniffing algorithms are applied, which means the computed MIME type is determined from a combination of byte sniffing and the protocol-dependent switch statement mentioned above. As far as I can tell (but I'm not sure), browser engines conform to HTML/Fetch/MIME Sniffing in how they handle resource loading, in an interoperable manner. In an EPUB context, one could argue OCF processing is defining another protocol (point 4 in the supplied MIME type detection algorithm), which would allow us to say EPUB defines the supplied MIME type as the one authored in the package document. But in hindsight, I think that would be a stretch, and it probably does not match what current RSs are doing, as @bduga suggests. So all in all, to summarize, what I am after is:
|
Here's a proposal: Create a new section in 3. Publication resource processing, before all other sections, which would say something like:
|
(the note above could also add, informatively:
although I'm not confident it matches the reality of current implementations, nor that it is particularly helpful to implementing or understanding the specification). |
Is there a way to make this more precise? "such that the result is consistent with"? There are (deliberately obtuse) ways of reading "consistent with" that can make the requirement opaque (e.g., if I follow my own algorithm that leads to a different result, is that "consistent" since they both require following algorithms?). |
Do we really need to add anything, though? Presumably any resources loaded by the UA of the reading system will implement mimesniff, since it is required via HTML. Though, I suppose knowing that you have an HTML document in the first place requires knowing the MIME type. Ugh. This is a bit of a rat's nest: do reading systems need to run the mimesniff algorithm at ingestion time, to make sure the EPUB is valid? So, for instance, if a spine item claims to be HTML, but applying mimesniff calculates the type as JSON, should the reading system never even try to display the spine item? |
Given where we are in the revision, I'm fine if we want to defer the issue. Last minute additions have a way of needing to be fixed later. But I'll leave it to the chairs to decide. |
(My administrative comment...)
That would be my option. We are at the point when we want to seriously look at the implementation reports and, hopefully, move ahead to PR and then to a Rec. Adopting #2491 (comment) (which, content-wise, sounds o.k. to my non-expert eyes) means that:
We will have to discuss how we will proceed with EPUB 3.3 maintenance. One option is that the document will be turned into a "living standard", i.e., a simple maintenance WG may make such changes, possibly one by one, to republish new versions. I think we do not run into any danger of interoperability today if we defer this issue to those times... |
I've taken this language from HTML.
These are all good questions, and it is similar questions that led me to open this issue… The fact that EPUB does not specify how resources are loaded from the container is the real issue here. I think we nailed something to be clarified in EPUB 3.4 😁
Fine by me. Can we still add an informative note, in the spirit of the one I proposed above? I think it would be relevant to acknowledge the issue, and clarify that it is currently implementation specific. |
Which says you must "obtain and interpret". That's the precision lacking here. "Determine" speaks to process, not outcome. |
The issue was discussed in a meeting on 2022-12-08 List of resolutions:
2. consider specifying how EPUB interacts with the MIME sniffing standard (issue epub-specs#2491)
See GitHub issue epub-specs#2491.
Dave Cramer: Romain pointed out that HTML says you have to determine what kind of resource it is, because HTTP headers might be wrong.
Brady Duga: agree. There's not a problem.
Dave Cramer: given the variety of approaches that can be taken (e.g. HTTP server, etc.), do we have to mess with that?
Matt Garrish: don't like the idea of pointing out that there is no standard for this until we come up with a solution.
Dave Cramer: I've been working on EPUB for a decade and I didn't realize that MIMESNIFF was an issue until it was raised.
Matt Garrish: defer it.
Dave Cramer: if someone can come up with a test that behaves differently in different RSs, or finds an example in the real world, definitely bring it up in the next round.
|
FWIW, I did find some cases where different RS behaved differently. With the following items declared in the package doc: <item id="svg-unknown-extension" href="svg.unknown" media-type="image/svg+xml"/>
<item id="svg-foreign" href="svg.foreign" media-type="image/unknown+xml" fallback="png"/>
<item id="png" href="png.png" media-type="image/png"/> And used in HTML with
Tested with Apple Books 1.19, Thorium v2.10, ADE v4.0.x and v4.5.11.
I attached a test EPUB with more cases and examples: mimetypes.zip
I know these are edge cases, unlikely to pop up in real-world content. But they do show that RSs behave differently in how they load resources and handle MIME types.
I believe an informative note would not hurt. But if it's just me, it won't keep me from sleeping at night :) |
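For reference, here is a rough sketch of how a reading system might walk the manifest fallback chain for the three items declared above. It assumes the reading system trusts the media type declared in the package document, which is only one possible reading of the spec and exactly what this issue questions. The names `ManifestItem`, `resolveFallback`, and `manifest` are hypothetical.

```typescript
// Hypothetical in-memory model of a package document manifest item.
interface ManifestItem {
  id: string;
  href: string;
  mediaType: string; // the type declared by the author, not a sniffed one
  fallback?: string; // id of the next item in the fallback chain
}

// Follows the fallback chain until an item with a supported declared type
// is found. Returns undefined if the chain ends or loops without a match.
function resolveFallback(
  items: ManifestItem[],
  startId: string,
  supported: Set<string>
): ManifestItem | undefined {
  const byId = new Map<string, ManifestItem>();
  for (const item of items) {
    byId.set(item.id, item);
  }
  const visited = new Set<string>(); // guards against circular chains
  let current = byId.get(startId);
  while (current !== undefined && !visited.has(current.id)) {
    if (supported.has(current.mediaType)) {
      return current;
    }
    visited.add(current.id);
    current = current.fallback !== undefined ? byId.get(current.fallback) : undefined;
  }
  return undefined;
}

// The items from the package document quoted above.
const manifest: ManifestItem[] = [
  { id: "svg-unknown-extension", href: "svg.unknown", mediaType: "image/svg+xml" },
  { id: "svg-foreign", href: "svg.foreign", mediaType: "image/unknown+xml", fallback: "png" },
  { id: "png", href: "png.png", mediaType: "image/png" },
];
```

A reading system that sniffs the bytes instead of trusting the declaration could render `svg.foreign` directly and never follow the chain, which is one way the differences between reading systems could arise.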
The MIME Sniffing standard is quite central to how HTML defines the loading of resources.
Specifically, in the "Determining the type of a resource" section, HTML says that the Content-Type metadata and computed MIME type of a resource must be obtained in a manner consistent with the requirements of MIME Sniffing.
In turn, MIME Sniffing says that to handle a resource, a user agent must keep track of (among other things) a supplied MIME type, determined by the supplied MIME type detection algorithm. That algorithm looks at various cases to detect the supplied MIME type: if the resource is retrieved via HTTP, or from the file system, or via another protocol.
EPUB sits a bit in-between all this, since it does not specify how resources are loaded from the OCF container. Some RS will serve them over HTTP, some as files, some possibly with another protocol.
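The protocol-dependent part can be pictured as a switch over how the resource was retrieved. The sketch below is a loose simplification and not the normative algorithm: it leaves out several steps of the real supplied MIME type detection algorithm and uses a crude validity check. The `Retrieval` type and the `suppliedMimeType` function are hypothetical names.

```typescript
// Hypothetical description of how a reading system retrieved a resource.
type Retrieval =
  | { via: "http"; contentTypeHeaders: string[] }
  | { via: "file"; fileSystemType?: string }
  | { via: "other"; protocolType?: string };

// Simplified sketch: picks the candidate type according to the retrieval
// mechanism and returns undefined when no plausible MIME type is available.
function suppliedMimeType(retrieval: Retrieval): string | undefined {
  let candidate: string | undefined;
  switch (retrieval.via) {
    case "http":
      // Use the value of the last Content-Type header, if any.
      candidate = retrieval.contentTypeHeaders[retrieval.contentTypeHeaders.length - 1];
      break;
    case "file":
      // Whatever type the file system reports (often extension-based).
      candidate = retrieval.fileSystemType;
      break;
    case "other":
      // Whatever type the other protocol provides, if any.
      candidate = retrieval.protocolType;
      break;
  }
  if (candidate === undefined) {
    return undefined;
  }
  const trimmed = candidate.trim();
  // Crude "type/subtype" check, much looser than real MIME type parsing.
  return /^[^\s\/;]+\/[^\s\/;]+/.test(trimmed) ? trimmed : undefined;
}
```

Because EPUB does not say which branch applies to resources in the container, two reading systems can end up with different supplied types for the same file.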
Are the MIME types defined in the package document meant to be the authoritative source of how a reading system MUST detect a resource’s type? Or are they only informative content used for type support processing (via the fallback mechanism)?
EPUB could say something along these lines, in the RS spec:
maybe somewhere in the OCF ZIP container section. Or in its own "Determining the type of a resource" section, à la HTML, with a similar language.
This is testable when scripting is available, by using the fetch() API to load an in-container resource of a custom unknown-to-the-RS MIME type (e.g. image/vnd.epub+test), and verify that the Content-Type header of the fetch Response object is the one declared in the package document.
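A scripted check along those lines might look roughly like the sketch below. The helper names `mimeEssence` and `servedTypeMatchesDeclared` are hypothetical, as is the idea of passing the fetch function in as a parameter (done here only so the sketch does not depend on a particular runtime). Whether `fetch()` can load in-container resources at all depends on the reading system.

```typescript
// Minimal shape of what the sketch needs from a fetch-like function.
type FetchLike = (
  url: string
) => Promise<{ headers: { get(name: string): string | null } }>;

// Strips parameters such as "; charset=utf-8" and normalizes case.
function mimeEssence(contentType: string): string {
  return contentType.split(";")[0].trim().toLowerCase();
}

// Loads the resource and compares the served Content-Type header with the
// media type declared in the package document.
async function servedTypeMatchesDeclared(
  fetchFn: FetchLike,
  url: string,
  declaredType: string
): Promise<boolean> {
  const response = await fetchFn(url);
  const header = response.headers.get("Content-Type");
  return header !== null && mimeEssence(header) === mimeEssence(declaredType);
}

// Possible use inside a scripted content document (the URL is an assumption):
// servedTypeMatchesDeclared(fetch, "test.foreign", "image/vnd.epub+test")
//   .then((ok) => console.log(ok ? "declared type served" : "different type served"));
```

A `false` result would suggest that the reading system derives the supplied type from something other than the package document, such as the file extension or the sniffed bytes.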