Gateway does not run Unicode Normalization Forms leading to seemingly identical paths not resolving when using different non normalized strings #10286
Comments
@Griss168 thx a lot for the great report.
$ curl -L -vvv http://127.0.0.1:8080/ipfs/QmdmqdwE1ZWzPKJVWyDvLqNY2PLedaeLQfoKuJSViHGpie/A-5x03%20Fenome%CC%81n%20strachu.txt%20
* Trying 127.0.0.1:8080...
* Connected to 127.0.0.1 (127.0.0.1) port 8080
> GET /ipfs/QmdmqdwE1ZWzPKJVWyDvLqNY2PLedaeLQfoKuJSViHGpie/A-5x03%20Fenome%CC%81n%20strachu.txt%20 HTTP/1.1
> Host: 127.0.0.1:8080
> User-Agent: curl/8.5.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Accept-Ranges: bytes
< Access-Control-Allow-Headers: Content-Type
< Access-Control-Allow-Headers: Range
< Access-Control-Allow-Headers: User-Agent
< Access-Control-Allow-Headers: X-Requested-With
< Access-Control-Allow-Methods: GET
< Access-Control-Allow-Methods: HEAD
< Access-Control-Allow-Methods: OPTIONS
< Access-Control-Allow-Origin: *
< Access-Control-Expose-Headers: Content-Length
< Access-Control-Expose-Headers: Content-Range
< Access-Control-Expose-Headers: X-Chunked-Output
< Access-Control-Expose-Headers: X-Ipfs-Path
< Access-Control-Expose-Headers: X-Ipfs-Roots
< Access-Control-Expose-Headers: X-Stream-Output
< Cache-Control: public, max-age=29030400, immutable
< Content-Length: 168
< Content-Type: text/plain; charset=utf-8
< Etag: "QmVht3ZMMcuf4nCsjysExEvvFppUCSZSNc6fxmBquBkJMf"
< X-Ipfs-Path: /ipfs/QmdmqdwE1ZWzPKJVWyDvLqNY2PLedaeLQfoKuJSViHGpie/A-5x03 Fenomén strachu.txt
< X-Ipfs-Roots: QmdmqdwE1ZWzPKJVWyDvLqNY2PLedaeLQfoKuJSViHGpie,QmVht3ZMMcuf4nCsjysExEvvFppUCSZSNc6fxmBquBkJMf
< Date: Thu, 11 Jan 2024 07:52:39 GMT
<
A-5x01 Tíha 1.txt
A-5x02 Tíha 2.txt
A-5x03 Fenomén strachu.txt
A-5x04 Rozklad anděla.txt
A-5x05 Stěna ztracených duší.txt
* Connection #0 to host 127.0.0.1 left intact
A-5x06 Časová smyčka.txt
I don't think non-percent-encoded URLs are supported by any correct HTTP server; browsers sometimes percent-decode the URL they show to users. I can also browse the file you had issues with: Maybe Firefox is doing something Chrome is not? Which browser are you using, please? Last thing: I noticed some of your file names have a trailing space:
After checking, it seems the on-the-wire string could be UTF-8 but should not be: From RFC 3986:
From RFC 7230:
URLs with spaces are not a problem, because browsers and similar clients always encode them. The problem is that the UTF-8 string can be encoded to this URL
By the way, I have no idea how the spaces got to the end of some file names. I just created a file in Sublime Text, saved it to disk under different names, and dragged and dropped it into ipfs-desktop.
@Griss168 I see now, thx; this is not a decoding issue. The two links literally have different binary representations:
The first string (the one you uploaded to your Kubo node) uses multi-codepoint graphemes: it encodes the file name using boring old Latin letters and then applies accent modifiers to them:
The second one uses a codepoint which is literally the letter with the accent (in a single codepoint):
Kubo works on binary; it does not even know that file names are text. So because the binary representations don't match, it complains. What you are asking us to do is to run Unicode Normalization Forms:
However, this is annoying to implement and has security implications (because various implementations might not agree on how to resolve files). I understand the need for people to be able to share file names in their own languages, but I think this needs to be looked over in the I'll create an issue there and send it to our gateway experts.
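The difference described above can be reproduced outside Kubo. The following is a minimal Python sketch, not Kubo code; the variable names and the `describe` helper are illustrative only. It builds both spellings of "Fenomén", shows that their UTF-8 bytes differ, and shows that Unicode normalization maps one onto the other.

```python
import unicodedata

# Two spellings of the same visible word, as discussed in the thread.
decomposed = "Fenome\u0301n"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT
precomposed = "Fenom\u00e9n"   # single code point U+00E9

def describe(text):
    """Return the Unicode name of every code point in text (illustrative helper)."""
    return [unicodedata.name(ch) for ch in text]

print(decomposed == precomposed)          # False: compared code point by code point
print(decomposed.encode("utf-8").hex())   # 46656e6f6d65cc816e
print(precomposed.encode("utf-8").hex())  # 46656e6f6dc3a96e
print(describe(decomposed)[5:7])          # the letter and its combining accent
print(describe(precomposed)[5:6])         # the single accented letter

# NFC composes and NFD decomposes; after either, both spellings compare equal.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
print(unicodedata.normalize("NFD", precomposed) == decomposed)
```

A byte-oriented lookup sees two unrelated names here, which matches the behavior reported in this issue.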
As an alternative to avoid security implications, we could still show an error page but add a link to the matching representation inside the file on the gateway.
@Jorropo Thank you. Yes, that's what I was trying to explain. I hope it will be resolved. Perhaps it would be possible to use an existing solution in the form of a library, as web servers do, for example.
I don't think this is a good solution. It works only if the gateway is used exclusively by humans. My use case is a third-party app that downloads files over HTTP, and it will not be able to understand that the given files can be found in another place.
Interesting! Polish has a bunch of diacritics such as ąęćłśźżóś, but I've never seen them represented as ASCII + modifier rather than a single UTF-8 code point. @Griss168, for the sake of prioritization: how common is this problem in the real world? Is it just this one specific piece of software or website producing file names in a weird notation, or a daily occurrence for you? Which notation is more common in your language? The normalized one? We could fix the UX problem of HTTP 404 here by adding an extra step of retrying in "not found" scenarios, as suggested in ipfs/specs#457 (comment). (Kubo already does this type of retry on subdomain gateways, where it checks for a _redirects file; we could add the Unicode retry before that.)
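The retry idea could look roughly like the Python sketch below. It is an illustration under stated assumptions, not the gateway's implementation: `resolve_with_retry`, `lookup`, and `toy_lookup` are hypothetical names, and `lookup` stands in for whatever exact, byte-for-byte path resolution the gateway already performs. The exact request is always tried first, so a name that exists as requested is never shadowed by a normalized variant.

```python
import unicodedata

def resolve_with_retry(lookup, segments):
    """Try the path as requested, then its NFC and NFD variants.

    `lookup` is a hypothetical callable that returns content for an exact
    list of path segments and raises KeyError when nothing matches.
    """
    candidates = [segments]
    for form in ("NFC", "NFD"):
        variant = [unicodedata.normalize(form, s) for s in segments]
        if variant not in candidates:
            candidates.append(variant)
    for candidate in candidates:
        try:
            return candidate, lookup(candidate)
        except KeyError:
            continue
    raise KeyError("/".join(segments))

# Toy directory standing in for data imported with NFD names (as on macOS).
directory = {("A-5x01 Ti\u0301ha 1.txt",): b"example"}

def toy_lookup(segments):
    return directory[tuple(segments)]

# The client asks with the precomposed (NFC) spelling and still gets the file.
matched, data = resolve_with_retry(toy_lookup, ["A-5x01 T\u00edha 1.txt"])
print(matched, data)
```

This sketch normalizes every segment to the same form, so a path that mixes NFC and NFD across segments would still fail; handling that case would require retrying per segment.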
After your explanation, I did some more tests today and found that the problem lies somewhere deeper. The incorrect representation of UTF-8 characters is only a consequence. I created a test file /Test/Návrat.txt on a USB flash drive formatted with FAT32. Then I added the same folder to IPFS on Windows from the same USB drive: As you can see, I got a different CID for the Test folder from the same data on a different OS. If I then compare the URLs for the Návrat.txt file that are generated on the gateway, I get:
At the very beginning, I tried using a torrent client to download the torrent data from the IPFS gateway via the webseed distribution method. The torrent itself was created on Windows. I uploaded the data from the original torrent to the IPFS daemon on macOS. Subsequently, I added the URL from the gateway to the torrent client on macOS as a webseed. The torrent client reported "File not found" for the webseed. When I used Wireshark to inspect the HTTP requests, I found that the URL from the torrent client and the URL from the gateway differed in how they represent diacritics. The torrent client has the file path and name defined in UTF-8, and its requests use the same representation of UTF-8 characters as defined in the file structure. It looks like the IPFS daemon on macOS changes file name characters from the one-character representation á (0xC3A1) to the two-character representation a ́ (a followed by the combining accent 0xCC81).
Thank you for digging into this across different operating systems. My understanding is that this is not a problem with Kubo or IPFS. 👉 It is macOS being a special snowflake with their NFD normalizations:
This means that macOS Finder and some APIs and tools often change characters with diacritical marks (like accents) so that they are represented by a base character followed by a separate combining diacritical mark (NFD instead of the NFC normalization everyone else uses). This is a well-known headache with macOS, some examples:
@Griss168 this is to say, if Kubo (golang) doing
The only idea I have for dealing with import is that we could use
$ ipfs add --normalize-names none|nfc|nfd # opt-in, no normalization by default
This way, users could force a specific normalization like NFC when importing on macOS, but only in cases like yours, when it matters. This would be in addition to the fixup on the gateway described in ipfs/specs#457, which is a band-aid for data that was imported by other people or requested with invalid normalization. Together, they would give end users enough to get to the data via the gateway.
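Note that `--normalize-names` is only a proposal in this thread, not an existing `ipfs add` flag. Until something like it exists, names can be checked before import. The Python sketch below is a hypothetical helper (`plan_renames` is not part of any IPFS tool) that mirrors the proposed `none|nfc|nfd` choices and reports which names would change under the chosen form.

```python
import unicodedata

def plan_renames(names, form="nfc"):
    """Map each name that is not already in `form` to its normalized spelling.

    `form` mirrors the values proposed in the thread: "none", "nfc", or "nfd".
    """
    if form == "none":
        return {}
    target = form.upper()  # unicodedata expects "NFC" / "NFD"
    return {
        name: unicodedata.normalize(target, name)
        for name in names
        if not unicodedata.is_normalized(target, name)
    }

names = ["Na\u0301vrat.txt", "N\u00e1vrat.txt", "plain.txt"]
print(plan_renames(names, "nfc"))   # only the decomposed spelling would be renamed
print(plan_renames(names, "nfd"))   # only the precomposed spelling would be renamed
print(plan_renames(names, "none"))  # nothing changes by default
```

`unicodedata.is_normalized` requires Python 3.8 or newer. A real tool would also need to detect two names in one directory that collapse to the same normalized spelling, which this sketch does not do.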
I think it's a great solution. This solves my problem of how to add data to the IPFS network on different systems and thus improve their availability in the network.
But I think that this solution is also important because it solves the retrieval of data from the IPFS network. After some of my tests, it turned out that different HTTP clients use different methods of normalization and URL encoding. Both solutions are important, although for my use case normalization during import is more important. Thanks for not giving up :)
Checklist
Installation method
ipfs-desktop
Version
Config
Description
Hello,
I'm using IPFS-Desktop, but that's not important.
I'll create some folder called "Test" and put some random files in it with these file names:
A-5x01 Tíha 1.txt
A-5x02 Tíha 2.txt
A-5x03 Fenomén strachu.txt
A-5x04 Rozklad anděla.txt
A-5x05 Stěna ztracených duší.txt
A-5x06 Časová smyčka.txt
When I add the "Test" folder to IPFS, I get the Test folder's CID QmdmqdwE1ZWzPKJVWyDvLqNY2PLedaeLQfoKuJSViHGpie.
Now I have the file path like this "/ipfs/QmdmqdwE1ZWzPKJVWyDvLqNY2PLedaeLQfoKuJSViHGpie/A-5x01 Tíha 1.txt" for each file.
If the file name or path contains any characters like ÁáÄäÉéĚěÍíÓóÔôÚúŮůÝýČčďťŇňŘřŠšŽž, the link can be written in the following ways:
Path without URL encoding
http://127.0.0.1:8080/ipfs/QmdmqdwE1ZWzPKJVWyDvLqNY2PLedaeLQfoKuJSViHGpie/A-5x01 Tíha 1.txt
URL-encoded with the combining ́ symbol (0xCC81 in UTF-8) added after i. The HTTP gateway can serve this URL.
http://127.0.0.1:8080/ipfs/QmdmqdwE1ZWzPKJVWyDvLqNY2PLedaeLQfoKuJSViHGpie/A-5x01%20Ti%CC%81ha%201.txt
URL-encoded with the single í symbol (0xC3AD in UTF-8). The HTTP gateway can't serve this URL.
http://127.0.0.1:8080/ipfs/QmdmqdwE1ZWzPKJVWyDvLqNY2PLedaeLQfoKuJSViHGpie/A-5x01%20T%c3%adha%201.txt
These special symbols are commonly used in the Czech and Slovak languages and are found in many files and folders.
I've tried both URL-encoded formats on a random Apache web server, and it can serve both links. I'm trying some apps that can download files from HTTP paths, but they use the second encoding method and can't find the file on the HTTP gateway.
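The two percent-encoded paths listed above can be decoded and compared with Python's standard library. This is a small sketch for illustration only; the variable names are not taken from any tool mentioned in this issue.

```python
import unicodedata
from urllib.parse import quote, unquote

# The last path segment of the two encoded URLs from this report.
decomposed_path = "A-5x01%20Ti%CC%81ha%201.txt"   # works on the gateway
precomposed_path = "A-5x01%20T%c3%adha%201.txt"   # not found on the gateway

a = unquote(decomposed_path)
b = unquote(precomposed_path)

print(a == b)                                # False: different code points
print(len(a), len(b))                        # the decomposed name is one code point longer
print(unicodedata.normalize("NFC", a) == b)  # True: same name after NFC

# Re-encoding each decoded name reproduces the path it came from
# (quote emits upper-case hex digits, which are equivalent in a URL).
print(quote(a))
print(quote(b))
```

Both paths are valid percent-encodings of valid UTF-8; they simply name two different byte sequences, so a server that matches names byte for byte treats them as two different files.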
Test files are there: https://ipfs.io/ipfs/QmdmqdwE1ZWzPKJVWyDvLqNY2PLedaeLQfoKuJSViHGpie/
I tested this behavior on IPFS-Desktop 0.32.0 for Windows and macOS, on kubo 0.25.0 for macOS, and also on the https://ipfs.io/ gateway.
I hope it will be useful.