Blobs should not have an associated mime-type #1213

DavidBuchanan314 · 2023-06-18T17:02:59Z

DavidBuchanan314
Jun 18, 2023

There is simply no need for it.

Further, getBlob should always return Content-Type: application/octet-stream.

There is no need for it to return anything else.
It discourages hotlinking (getBlob is less likely to be a fast-path, or behind a CDN, compared to e.g. dedicated mediaproxy endpoints). IIUC, getBlob is primarily intended as a server-to-server API, for syncing, as opposed to a server-to-client API.
It eliminates subtle XSS security issues caused by linking to an image/svg+xml blob (for example), without needing to rely on CSP as a mitigation (which is not supported in some legacy browsers).
It's possible for one blob to have multiple valid mime types, depending on context. As-is, only one is allowed, and it's whatever the blob was first uploaded as.

A client can know a blob's mime type, when necessary, through the context provided by the "blob" lexicon field that references it.

In most cases, a PDS (or other agent) can verify compliance with Lexicon "accept" parameters without resorting to general-case mime sniffing (i.e. by iteratively checking if the file is a valid X, for each X in the set of allowed mime types).
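To make the "check each allowed type in turn" idea concrete, here is a minimal sketch. The validator functions and the `first_matching_type` helper are hypothetical names made up for illustration, and the validators only look at leading magic bytes, whereas a real PDS would presumably parse the whole file. Wildcard entries such as `image/*` in an "accept" list are not handled here.

```python
from typing import Callable, Iterable, Optional

# Hypothetical per-type validators. Real ones would parse the whole file;
# these only check leading magic bytes to keep the sketch short.
def looks_like_png(data: bytes) -> bool:
    return data.startswith(b"\x89PNG\r\n\x1a\n")

def looks_like_jpeg(data: bytes) -> bool:
    return data.startswith(b"\xff\xd8\xff")

def looks_like_gif(data: bytes) -> bool:
    return data.startswith((b"GIF87a", b"GIF89a"))

VALIDATORS: dict[str, Callable[[bytes], bool]] = {
    "image/png": looks_like_png,
    "image/jpeg": looks_like_jpeg,
    "image/gif": looks_like_gif,
}

def first_matching_type(data: bytes, accept: Iterable[str]) -> Optional[str]:
    """Return the first accepted MIME type the blob validates as, else None."""
    for mime in accept:
        validator = VALIDATORS.get(mime)
        if validator is not None and validator(data):
            return mime
    return None
```

The point of the design is that the question asked is "is this a valid X?" for a small, known set of X, rather than "what is this?" over every format a sniffer knows about.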

snarfed · 2023-06-18T20:30:41Z

snarfed
Jun 18, 2023

Surprising proposal!

I don't think I have a strong opinion here in general, but I'd be disappointed if getBlob ignored a blob's known MIME type and always returned Content-Type: application/octet-stream.

Speaking from experience, regardless of how you design an API or protocol or platform, URLs always leak outside of their original context. Hotlinking is one example, yes, but only one of many. We already have a thriving community of alternative Bluesky frontends and other tools, many with their own internal storage formats, APIs, etc, and I expect some of those pass around raw blob URLs. Hotlinking also isn't always bad; many sites actively encourage it.

You don't necessarily need to go out of your way to actively support all out-of-context URL uses, but your ecosystem will definitely be more robust and healthy if you at least avoid actively breaking them, like suppressing Content-Type would do here.

(I get some of the specific arguments, but I guess I disagree that the tradeoffs are worth it. Yes, XML SVG vulns exist, but this is an awkward, painful sledgehammer for dealing with them, and wouldn't really prevent them, since consumers still have to parse them. Similarly, multiple MIME types seem pretty esoteric, and applications that really need them (use cases?) can easily use their own logic to provide for it.)

DavidBuchanan314 · 2023-06-18T21:07:26Z

DavidBuchanan314
Jun 18, 2023
Author

Files with multiple valid MIME types are surprisingly common. For example, all JAR files are also strictly valid ZIP files, but most MIME sniffers will only see them as ZIPs (e.g. file hello.jar will tell you it's a ZIP).
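The JAR/ZIP overlap is easy to reproduce with nothing but the Python standard library. This sketch builds a tiny JAR-shaped archive in memory (the `build_minimal_jar` helper and the placeholder class bytes are made up for illustration, not a working Java program) and shows that a generic ZIP reader accepts it as-is.

```python
import io
import zipfile

def build_minimal_jar() -> bytes:
    """Build an in-memory archive laid out like a JAR: a ZIP with a manifest."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("META-INF/MANIFEST.MF", "Manifest-Version: 1.0\n")
        # Placeholder bytes only; this is not a real class file.
        zf.writestr("Hello.class", b"\xca\xfe\xba\xbe")
    return buf.getvalue()

jar_bytes = build_minimal_jar()

# A generic ZIP reader accepts the bytes without complaint...
is_zip = zipfile.is_zipfile(io.BytesIO(jar_bytes))

# ...and the only hint that it is "also" a JAR is a naming convention
# inside the archive, which a magic-byte sniffer never looks at.
with zipfile.ZipFile(io.BytesIO(jar_bytes)) as zf:
    has_manifest = "META-INF/MANIFEST.MF" in zf.namelist()
```

Both readings of the bytes are valid, so whichever type gets recorded at upload time is a choice, not a fact about the file.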

It's true that files with multiple intentional MIME types are pretty rare, and I can't think of any off the top of my head, but my main concern is more like: If you upload a JAR blob and it gets sniffed as a ZIP on upload, you're now "stuck" with it like that.

Another concern is that it seems to violate the principles of self-certifying data, on some level. When I request something by CID, I shouldn't expect there to be any "flexible" aspects of that result, and I shouldn't expect the result to be able to change over time. The mime-type is not certified by the CID, and so imho it should not be returned in the first place.

Right now, I could make a post with an image with declared mime type "image/foo", delete it, then make another post with the same image blob but declared as "image/bar". The same getBlob API requests would end up returning different mime types depending on the point in time that you queried it. (feels like a recipe for weirdness in a federated world)
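A rough sketch of that scenario, assuming blob CIDs are CIDv1 with the raw codec and a sha2-256 multihash, encoded as base32. The in-memory `declared_types` store and the `upload` function are toy stand-ins invented for illustration; they do not describe how any real PDS stores blobs.

```python
import base64
import hashlib

def raw_sha256_cid(data: bytes) -> str:
    """CIDv1 (raw codec, sha2-256) in base32, computed from the bytes alone."""
    digest = hashlib.sha256(data).digest()
    # 0x01 = CID version 1, 0x55 = raw codec,
    # 0x12 = sha2-256 multihash code, 0x20 = 32-byte digest length
    cid_bytes = bytes([0x01, 0x55, 0x12, 0x20]) + digest
    encoded = base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")
    return "b" + encoded  # "b" is the multibase prefix for lowercase base32

# Toy blob store: one declared type per CID, overwritten on re-upload.
# This overwrite behaviour is an assumption made for the illustration.
declared_types: dict[str, str] = {}

def upload(data: bytes, declared_mime: str) -> str:
    cid = raw_sha256_cid(data)
    declared_types[cid] = declared_mime
    return cid

image = b"same image bytes"
cid_first = upload(image, "image/foo")
type_then = declared_types[cid_first]

cid_second = upload(image, "image/bar")
type_now = declared_types[cid_second]
```

The CID is identical across both uploads because nothing about the declared type feeds into the hash, while the type a lookup by that CID would report has changed in between.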

snarfed · 2023-06-18T21:23:24Z

snarfed
Jun 18, 2023

Thanks! These are all useful and helpful.

jar/zip is a good real world example of multiple valid types. I agree that sniffing isn't ideal, and at best should be interpreted as a strictly softer "hint" than client-provided types, which should probably be encouraged. Self-certifying purity, 🤷. The schema now does have blob nodes, which include MIME type and have their own CIDs, as well as raw blobs without the types, so that concern is provided for to some degree. And deleting and recreating, yup, sounds weird, but probably rare. I'd be curious to see an example that's dangerous or exploitable.

In general, I agree that these specific examples (and probably others) are real. They just seem relatively esoteric, rare, and/or harmless, compared to the very real and widespread harm that would come from suppressing known useful Content-Type on every getBlob request.

DavidBuchanan314 · 2023-06-18T21:49:12Z

DavidBuchanan314
Jun 18, 2023
Author

Could you give an example of where it would be important for the content-type returned by getBlob to have a particular value?

I was actually wrong about the hotlinking - I created a test webpage that loads a PNG image named test.png.bin, in an <img> element, served with application/octet-stream MIME, and Firefox displays the image correctly: https://retr0.id/stuff/mime_test.html (unsure how this interacts with things like "X-Content-Type-Options: nosniff", more testing needed)

Edit: reading more about nosniff, it only prevents unexpected script execution and CSS loading (and does not affect images) https://fetch.spec.whatwg.org/#x-content-type-options-header
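As a rough model of the check described in that section of the Fetch spec: with nosniff set, a response is blocked only when the request destination is script-like or "style" and the MIME type doesn't match. The function name is made up, the JavaScript MIME type list is abbreviated, and this deliberately ignores any other browser-side checks, so treat it as a reading aid rather than a browser implementation.

```python
from typing import Optional

SCRIPT_LIKE = {
    "audioworklet", "paintworklet", "script",
    "serviceworker", "sharedworker", "worker",
}
# Abbreviated: the spec's list of JavaScript MIME types is longer.
JAVASCRIPT_TYPES = {"text/javascript", "application/javascript"}

def blocked_by_nosniff(destination: str,
                       content_type: Optional[str],
                       nosniff: bool) -> bool:
    """Simplified model of the nosniff blocking check."""
    if not nosniff:
        return False
    # Reduce "text/css; charset=utf-8" to its essence, "text/css".
    essence = content_type.split(";")[0].strip().lower() if content_type else None
    if destination in SCRIPT_LIKE:
        return essence not in JAVASCRIPT_TYPES
    if destination == "style":
        return essence != "text/css"
    # Images, media, documents, etc. are not affected by this check.
    return False
```

Under this model an `<img>` pointing at an application/octet-stream blob is not blocked by nosniff, while the same blob requested as a script or stylesheet is.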

snarfed · 2023-06-19T04:56:39Z

snarfed
Jun 19, 2023

> Could you give an example of where it would be important for the content-type returned by getBlob to have a particular value?

Sure! Pretty much anything in the greater web ecosystem that works with binary files and that we want to integrate with AT Proto, but already exists or otherwise won't be built natively from scratch inside AT Proto, eg:

Media players
Media editors
File storage services, eg Dropbox
CDNs
Academic paper (PDF) tracking browser extensions
Security tools that ingest and analyze executables and other binaries, eg EDR/MDR
Cross-posting tools, eg bots that copy posts from Bluesky to other networks
Read-it-later apps that support PDFs
web-based ebook readers and related tools
3D graphics viewers, editors
GIS tools
WebVR
WASM build toolchains
content moderation tools

2 replies

DavidBuchanan314 Jun 19, 2023
Author

With the exception of CDNs, I don't see why any of those strictly require a Content-Type response header from getBlob.

As a fun example for the first case, I uploaded an h264 movie to Twitter's image CDN, which you can play back in any browser, despite the fact that the Content-Type returned by Twitter's CDN is "incorrect" (Twitter's mime-sniffing thinks it's a bunch of PNGs - and it's not strictly wrong): https://shaka-player-demo.appspot.com/demo/#audiolang=en-GB;textlang=en-GB;uilang=en-GB;asset=https://www.da.vidbuchanan.co.uk/cors/twitter_bbb_4k.mpd;panel=CUSTOM%20CONTENT;build=compiled (does not work in Firefox for boring CORS reasons)

For implementing a "real" CDN on top of atproto, you'd probably want to implement a "CDN AppView", with a new record type that wraps a blob in a lexicon record (encoding mime and length metadata), and a new XRPC endpoint for retrieving the referenced blob with the correct Content-Type response header (thus encoding the mime in a self-certifying way)
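A sketch of what such a wrapper record might look like. The record type `com.example.cdn.file`, the `file` field name, and both helper functions are hypothetical, and the CID is a placeholder string rather than a real one. The nested object follows the JSON shape of a "blob" field (`$type`, `ref`, `mimeType`, `size`) as I understand it.

```python
from typing import Any

def make_file_record(cid: str, mime_type: str, size: int) -> dict[str, Any]:
    """Wrap a blob reference in a record so the type is part of signed data."""
    return {
        "$type": "com.example.cdn.file",  # hypothetical Lexicon NSID
        "file": {
            "$type": "blob",
            "ref": {"$link": cid},
            "mimeType": mime_type,
            "size": size,
        },
    }

def response_headers(record: dict[str, Any]) -> dict[str, str]:
    """Headers a hypothetical CDN AppView endpoint could derive from the record."""
    blob = record["file"]
    return {
        "Content-Type": blob["mimeType"],
        "Content-Length": str(blob["size"]),
    }

record = make_file_record("bafkrei-placeholder", "video/mp4", 1048576)
headers = response_headers(record)
```

Because the MIME type lives in the record rather than beside the raw blob, two records could point at the same bytes with different types without either response being ambiguous.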

snarfed Jun 19, 2023

They pretty much all require (or at least work drastically better with) Content-Type if they aren't ATProto-aware and get passed a raw getBlob URL. I'll elaborate below.

bnewbold · 2023-06-19T06:06:05Z

bnewbold
Jun 19, 2023
Maintainer

This is a good discussion and I agree with points on both sides.

MIME types are indeed fickle and weird. It is not that small of an edge-case for files to match multiple types, sometimes in a benign way and sometimes not. There are also potential ambiguities or lack of broad support with things like compression (.json.gz or .ps.gz or even .tar.gz) or JSON/XML sub-types (geojson, even .docx, .xhtml, etc). Detection is definitely not deterministic.

One mitigation is to allow clients to indicate the type when they upload, and have the server verify against the provided type, instead of guessing from scratch. I can't remember if we actually do that yet, and it isn't mentioned in the spec. That doesn't fully "solve" anything, and may even make some hostile things easier, but does make it possible to indicate the expected type. It is true that a blob delete and re-upload could get another type, but we are much more focused on blob embeds in records than the generic getBlob API response, and re-uploading won't change existing records.

One of the main motivations in adding this was to help generic infrastructure like BGS handle records in Lexicons it does not know about, and be able to detect and process blobs referenced from those. Eg, we could do generic image auto-labeling on any record type by recursively detecting blobs in records, then using getBlob to download and scan them. Also helpful for things like moderation interfaces or debugging tools or fall-backs when working with unknown Lexicons. There would be ways to try and bypass these detection and moderation checks, but it seems like it has a good chance of working out well in most cases, especially if popular end clients reject blobs which don't match their declared/sniffed mimetype.
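The recursive detection step could look roughly like this. `find_blobs` is a made-up helper name, the sample record uses an invented `com.example.unknown.thing` type, and the CIDs are placeholders. The walk assumes records are already decoded to plain dicts and lists, with blobs marked by `"$type": "blob"`.

```python
from typing import Any, Iterator

def find_blobs(node: Any) -> Iterator[dict]:
    """Yield every blob object nested anywhere inside a decoded record."""
    if isinstance(node, dict):
        if node.get("$type") == "blob" and "ref" in node:
            yield node
            return  # a blob object is a leaf; no need to descend further
        for value in node.values():
            yield from find_blobs(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_blobs(item)

# A record in a Lexicon the indexer has never seen (hypothetical shape).
unknown_record = {
    "$type": "com.example.unknown.thing",
    "title": "hello",
    "attachments": [
        {"caption": "a", "data": {"$type": "blob", "ref": {"$link": "cid-one"},
                                  "mimeType": "image/png", "size": 10}},
        {"nested": {"deeper": {"$type": "blob", "ref": {"$link": "cid-two"},
                               "mimeType": "application/pdf", "size": 20}}},
    ],
}

found = [(b["ref"]["$link"], b["mimeType"]) for b in find_blobs(unknown_record)]
```

Nothing here needs the Lexicon schema: the declared `mimeType` travels with the reference, which is what lets a generic service decide which blobs are worth fetching and scanning.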

We kind of expect most clients to be fetching blobs from AppView-provided CDNs, or perhaps their own PDS acting as a cache/CDN, instead of fetching blobs directly from the origin PDS. Hitting origin PDS for blobs can run up bandwidth expenses really fast, especially with inefficient client behavior. Mastodon has had some issues with this, even though the ecosystem is not totally naive about it.

snarfed · 2023-06-19T18:51:24Z

snarfed
Jun 19, 2023

Thanks for all the details @bnewbold! Useful to hear some of the early expected use cases here. And agreed that sniffing and client declaring types both have their uses.

If blobs are only ever expected or intended to be used inside ATProto-native apps and services, then @DavidBuchanan314 your arguments here make more sense. I'm mostly advocating for supporting broader use cases beyond just the ATProto bubble, especially for getBlob URLs, which (like all URLs) will inevitably get passed around and used in the vastly broader web ecosystem outside that bubble. Bluesky and ATProto themselves will be significantly more useful to more people in more places if they make a baseline attempt to interoperate with the web and be good citizens there, which includes Content-Type on HTTP responses.

I love how much the team has already leaned into web-native interop in so many ways! Keeping Content-Type on getBlob is a relatively easy way to not backslide there.

DavidBuchanan314 · 2023-06-21T18:20:51Z

DavidBuchanan314
Jun 21, 2023
Author

By the way, I've cheekily implemented blobs in my own PDS, according to my own idea of how they should work. I'm successfully federated with the sandbox network, and the appview's media proxy is happily displaying my images, despite getBlob returning an application/octet-stream content-type.
