Blobs should not have an associated mime-type #1213
Replies: 8 comments 2 replies
-
Surprising proposal! I don't think I have a strong opinion here in general, but I'd be disappointed if Speaking from experience, regardless of how you design an API or protocol or platform, URLs always leak outside of their original context. Hotlinking is one example, yes, but only one of many. We already have a thriving community of alternative Bluesky frontends and other tools, many with their own internal storage formats, APIs, etc., and I expect some of those pass around raw blob URLs. Hotlinking also isn't always bad; many sites actively encourage it.
You don't necessarily need to go out of your way to actively support all out-of-context URL uses, but your ecosystem will definitely be more robust and healthy if you at least avoid actively breaking them, as suppressing Content-Type would do here.
(I get some of the specific arguments, but I guess I disagree that the tradeoffs are worth it. Yes, XML SVG vulns exist, but this is an awkward, painful sledgehammer for dealing with them, and it wouldn't really prevent them, since consumers still have to parse them. Similarly, multiple MIME types seem pretty esoteric, and applications that really need them (use cases?) can easily use their own logic to provide for them.)
-
Files with multiple valid MIME types are surprisingly common. For example, all JAR files are also strictly valid ZIP files, but most MIME sniffers will only see them as ZIPs. (e.g. It's true that files with multiple intentional MIME types are pretty rare, and I can't think of any off the top of my head, but my main concern is more like: if you upload a JAR blob and it gets sniffed as a ZIP on upload, you're now "stuck" with it like that.
Another concern is that it seems to violate the principles of self-certifying data, on some level. When I request something by CID, I shouldn't expect there to be any "flexible" aspects of that result, and I shouldn't expect the result to be able to change over time. The MIME type is not certified by the CID, and so imho it should not be returned in the first place. Right now, I could make a post with an image with declared MIME type "image/foo", delete it, then make another post with the same image blob but declared as "image/bar". The same
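To make the JAR/ZIP point concrete, here is a minimal sketch. The `sniff_magic` helper is a hypothetical toy sniffer that only looks at leading magic bytes (it is not the sniffer any PDS actually uses), and the "JAR" is just a ZIP archive containing a manifest entry, built in memory with the standard library.

```python
import io
import zipfile


def sniff_magic(data: bytes) -> str:
    """Toy sniffer (hypothetical): classify by leading magic bytes only."""
    if data.startswith(b"PK\x03\x04"):
        return "application/zip"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    return "application/octet-stream"


def make_minimal_jar() -> bytes:
    """Build a minimal JAR-like archive: a ZIP holding META-INF/MANIFEST.MF."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("META-INF/MANIFEST.MF", "Manifest-Version: 1.0\n")
    return buf.getvalue()


jar_bytes = make_minimal_jar()

# The same bytes are a valid ZIP, so a magic-byte sniffer cannot tell
# that the uploader meant them to be a JAR.
sniffed = sniff_magic(jar_bytes)
is_zip = zipfile.is_zipfile(io.BytesIO(jar_bytes))
```

Nothing in the bytes distinguishes the uploader's intent here, which is why a sniffed type can only ever be a guess.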
-
Thanks! These are all useful and helpful. jar/zip is a good real world example of multiple valid types. I agree that sniffing isn't ideal, and at best should be interpreted as a strictly softer "hint" than client-provided types, which should probably be encouraged.
Self-certifying purity, 🤷. The schema now does have blob nodes, which include MIME type and have their own CIDs, as well as raw blobs without the types, so that concern is provided for to some degree. And deleting and recreating, yup, sounds weird, but probably rare. I'd be curious to see an example that's dangerous or exploitable.
In general, I agree that these specific examples (and probably others) are real. They just seem relatively esoteric, rare, and/or harmless, compared to the very real and widespread harm that would come from suppressing known useful
-
Could you give an example of where it would be important for the content-type returned by I was actually wrong about the hotlinking - I created a test webpage that loads a PNG image named Edit: reading more about nosniff, it only prevents unexpected script execution and CSS loading (and does not affect images) https://fetch.spec.whatwg.org/#x-content-type-options-header
-
Sure! Pretty much anything in the greater web ecosystem that works with binary files and that we want to integrate with AT Proto, but already exists or otherwise won't be built native from scratch inside AT Proto, eg:
-
This is a good discussion and I agree with points on both sides. MIME types are indeed fickle and weird. It is not that small of an edge case for files to match multiple types, sometimes in a benign way and sometimes not. There are also potential ambiguities or lack of broad support with things like compression ( One mitigation is to allow clients to indicate the type when they upload, and have the server verify against the provided type, instead of guessing from scratch. I can't remember if we actually do that yet, and it isn't mentioned in the spec. That doesn't fully "solve" anything, and may even make some hostile things easier, but does make it possible to indicate the expected type.
It is true that a blob delete and re-upload could get another type, but we are much more focused on blob embeds in records than the generic getBlob API response, and re-uploading won't change existing records.
One of the main motivations in adding this was to help generic infrastructure like BGS handle records in Lexicons it does not know about, and be able to detect and process blobs referenced from those. E.g., we could do generic image auto-labeling on any record type by recursively detecting blobs in records, then using getBlob to download and scan them. This is also helpful for things like moderation interfaces or debugging tools or fall-backs when working with unknown Lexicons. There would be ways to try to bypass these detection and moderation checks, but it seems like it has a good chance of working out well in most cases, especially if popular end clients reject blobs which don't match their declared/sniffed MIME type.
We kind of expect most clients to be fetching blobs from AppView-provided CDNs, or perhaps their own PDS acting as a cache/CDN, instead of fetching blobs directly from the origin PDS. Hitting the origin PDS for blobs can run up bandwidth expenses really fast, especially with inefficient client behavior.
Mastodon has had some issues with this, even though the ecosystem is not totally naive about it.
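The "recursively detecting blobs in records" idea can be sketched roughly as follows. This is only an illustration: it assumes blob references appear in a record's JSON form as objects with `"$type": "blob"`, a `ref` holding a `$link` CID string, a `mimeType` and a `size`; the record type and CID strings below are made-up placeholders, and no network call to getBlob is made.

```python
from typing import Any, Iterator


def find_blobs(node: Any) -> Iterator[dict]:
    """Walk arbitrary JSON-like data and yield every object that looks like a blob ref.

    Assumption: a blob ref is a dict with "$type" == "blob" and a "ref" key.
    """
    if isinstance(node, dict):
        if node.get("$type") == "blob" and "ref" in node:
            yield node
            return  # do not descend into the blob object itself
        for value in node.values():
            yield from find_blobs(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_blobs(item)


# A record from a Lexicon the indexer does not know about (hypothetical shape).
record = {
    "$type": "com.example.unknown.record",
    "text": "hello",
    "attachments": [
        {
            "caption": "first",
            "file": {
                "$type": "blob",
                "ref": {"$link": "bafkrei-placeholder-1"},
                "mimeType": "image/png",
                "size": 123,
            },
        },
        {
            "nested": {
                "thumb": {
                    "$type": "blob",
                    "ref": {"$link": "bafkrei-placeholder-2"},
                    "mimeType": "image/jpeg",
                    "size": 456,
                }
            }
        },
    ],
}

# Collect (cid, declared type) pairs; a real service would then fetch and scan each blob.
found = [(b["ref"]["$link"], b["mimeType"]) for b in find_blobs(record)]
```

The walk needs no knowledge of the record's schema, which is the property that makes it useful for generic infrastructure; the declared `mimeType` tells the scanner which blobs are worth fetching.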
-
Thanks for all the details @bnewbold! Useful to hear some of the early expected use cases here. And agreed that sniffing and client declaring types both have their uses.
If blobs are only ever expected or intended to be used inside ATProto-native apps and services, then @DavidBuchanan314 your arguments here make more sense. I'm mostly advocating for supporting broader use cases beyond just the ATProto bubble, especially for I love how much the team has already leaned into web-native interop in so many ways! Keeping
-
By the way, I've cheekily implemented blobs in my own PDS, according to my own idea of how they should work. I'm successfully federated with the sandbox network, and the appview's media proxy is happily displaying my images, despite getBlob returning an
-
There is simply no need for it.
Further, getBlob should always return Content-Type: application/octet-stream.
It discourages hotlinking (getBlob is less likely to be a fast-path, or behind a CDN, compared to e.g. dedicated mediaproxy endpoints). IIUC, getBlob is primarily intended as a server-to-server API, for syncing, as opposed to a server-to-client API.
image/svg+xml blob (for example), without needing to rely on CSP as a mitigation (which is not supported in some legacy browsers).
A client can know a blob's mime type, when necessary, through the context provided by the "blob" lexicon field that references it.
In most cases, a PDS (or other agent) can verify compliance with Lexicon "accept" parameters without resorting to general-case mime sniffing (i.e. by iteratively checking if the file is a valid X, for each X in the set of allowed mime types).
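A minimal sketch of that last point: instead of guessing a type from scratch, check the upload only against each type the Lexicon field accepts. The validators here are simple magic-byte checks standing in for real format validation, and `VALIDATORS` and `first_matching_type` are hypothetical names, not part of any atproto library.

```python
from typing import Callable, Dict, Iterable, Optional


def is_png(data: bytes) -> bool:
    return data.startswith(b"\x89PNG\r\n\x1a\n")


def is_jpeg(data: bytes) -> bool:
    return data.startswith(b"\xff\xd8\xff")


def is_gif(data: bytes) -> bool:
    return data[:6] in (b"GIF87a", b"GIF89a")


# One validator per supported type; a real server would parse the file
# more thoroughly than a magic-byte prefix check.
VALIDATORS: Dict[str, Callable[[bytes], bool]] = {
    "image/png": is_png,
    "image/jpeg": is_jpeg,
    "image/gif": is_gif,
}


def first_matching_type(data: bytes, accept: Iterable[str]) -> Optional[str]:
    """Return the first accepted MIME type the data validates as, else None."""
    for mime in accept:
        check = VALIDATORS.get(mime)
        if check is not None and check(data):
            return mime
    return None


png_bytes = b"\x89PNG\r\n\x1a\n" + b"\x00" * 8
accepted = first_matching_type(png_bytes, ["image/jpeg", "image/png"])
rejected = first_matching_type(png_bytes, ["image/jpeg"])
```

Because the check is driven by the accept list, a file that is valid as several types is only ever tested against the types the record field actually allows.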