fwrite with compress="gzip" produces gz files with incorrect uncompressed file sizes #6356
To make this actionable, please also share the code that encounters the browser error.
I'm using the output file from R, via Node, Express, streams and some HTTP headers. The condensed version of the code which interfaces with Express and the browser looks like this:

```js
const fs = require('fs');  // import added for completeness

// inside an Express route handler with access to `res`
const filename = 'c:/test.csv.gz';
const headers = {};
headers['Content-Encoding'] = 'gzip';
headers['Vary'] = 'Origin, Accept-Encoding';
headers['Content-Type'] = 'text/csv';
res.set(headers);
const stream = fs.createReadStream(filename);
stream.pipe(res);
await new Promise(resolve => stream.on('end', resolve));
```

It's quite straightforward, I think (Node handles the `Transfer-Encoding: chunked` part). The file is delivered to the browser with the expectation that it'll be gunzipped in the usual fashion. The above works well when the file is produced with gzip instead of fwrite.

You can see the difference in the output from R, when producing files using fwrite and gzip, by opening the different outputs in 7-Zip and looking at the file properties. This is a good example of how the file size should look when using gzip: [screenshot]. This is what it looks like when made with fwrite: [screenshot].

Here is a Stack Overflow question from someone having a similar issue to mine: https://stackoverflow.com/questions/78551554/malformed-gzip-file-with-r-data-table-fwrite

When using the fwrite output, the browser truncates the file early, after the csv header, while the server stream continues trying to send the rest of the gzip data. I can replicate this in Firefox and Chrome.
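As a quick way to see the multi-chunk structure without 7-Zip, here is a heuristic C sketch (the filename is illustrative) that scans for the gzip magic bytes `1f 8b 08`. These bytes can also occur inside compressed data, so the count is only an upper bound, but a file produced by plain gzip should yield 1 while an fwrite(compress = "gzip") file yields several:

```c
/* Heuristic: count candidate gzip member headers by scanning for the
 * magic byte sequence 1f 8b 08 (magic + CM=deflate). Upper bound only,
 * since these bytes can also appear inside compressed data. */
#include <stdio.h>

int main(void) {
    FILE *f = fopen("test.csv.gz", "rb");   /* illustrative filename */
    if (!f) { perror("fopen"); return 1; }
    int count = 0, c, state = 0;
    while ((c = fgetc(f)) != EOF) {
        if (state == 0 && c == 0x1f)      state = 1;
        else if (state == 1 && c == 0x8b) state = 2;
        else if (state == 2 && c == 0x08) { count++; state = 0; }
        else                              state = (c == 0x1f) ? 1 : 0;
    }
    fclose(f);
    printf("%d candidate gzip member header(s)\n", count);
    return 0;
}
```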
Is there anything else you need? I'm not sure I quite answered your question. @MichaelChirico
I'm not sure; I probably won't look closely at this for some time myself. A PR would be ideal if this is urgent for you.
Cool. At least there is an issue here now. I have no idea how to develop R packages, so I'll have to leave this until I have time to learn. Thanks. 👍
I think I may have found the reason:

"is multithreaded and compresses each chunk on-the-fly"
(https://github.com/Rdatatable/data.table/blob/bad266b64285d25b1810a4e646c12fe2d6354461/NEWS.1.md?plain=1#L428)

I think browsers may only support single-chunk gz files. I haven't confirmed this, but it would make sense. The reported size of the file seems to be the size of the last chunk. It seems as though, when using one thread (nThread = 1) and a huge buffer (1024), the csv header line gets written as the first gz chunk. The browser accepts only the header chunk as a single file, and 7-Zip reads the size of the last chunk, since each chunk's uncompressed size (ISIZE) is written in its last 4 bytes.
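To check the reported size directly rather than through 7-Zip, a minimal C sketch (filename illustrative) can read the ISIZE field, which per RFC 1952 is the uncompressed size modulo 2^32, stored little-endian in the last 4 bytes of a gzip member. On a multi-chunk fwrite file this returns the size of the last chunk only, which is exactly the under-reporting described above:

```c
#include <stdio.h>

int main(void) {
    FILE *f = fopen("test.csv.gz", "rb");   /* illustrative filename */
    if (!f) { perror("fopen"); return 1; }
    unsigned char b[4];
    /* ISIZE: uncompressed size mod 2^32, little-endian, last 4 bytes */
    if (fseek(f, -4L, SEEK_END) != 0 || fread(b, 1, 4, f) != 4) {
        fclose(f);
        return 1;
    }
    unsigned long isize = (unsigned long)b[0]
                        | ((unsigned long)b[1] << 8)
                        | ((unsigned long)b[2] << 16)
                        | ((unsigned long)b[3] << 24);
    printf("ISIZE of last gzip member: %lu bytes\n", isize);
    fclose(f);
    return 0;
}
```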
You can try setDTthreads(1) to see if that solves the issue.
It's not purely due to the threads. #6356 (comment) is right:

```r
library(data.table)
setDTthreads(1)
d <- data.table(x = 1:10)
fwrite(d, 'd.csv.gz', compress = 'gzip')
# What does the gzip trailer say?
system('gzip -l d.csv.gz')
#          compressed        uncompressed  ratio uncompressed_name
#                  63                  21 -200.0% d.csv
# There are actually 23 bytes inside, not 21
system('zcat d.csv.gz | wc -c')
# 23
```

Every time the code calls … (lines 574 to 587 in 6cee825) …

Unless called with … (line 764 in 6cee825) … (line 915 in 6cee825) …

A program that only looks at the gzip trailer at the end of the file will indeed only see the trailer of the last chunk. zlib can only be used for parallel compression in separate chunks, so it won't be easy to unite the header chunk and the first data chunk, even if no parallelism is used.
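To make the 21-versus-23 discrepancy concrete, here is a sketch assuming zlib is available (link with -lz; filename illustrative) that inflates the file member by member, the way zcat does: each Z_STREAM_END marks the end of one gzip member, and inflateReset() continues with the next, so all members are counted and all uncompressed bytes recovered, whereas a trailer-only tool such as `gzip -l` sees just the last member's ISIZE:

```c
#include <stdio.h>
#include <string.h>
#include <zlib.h>

int main(void) {
    FILE *f = fopen("d.csv.gz", "rb");      /* illustrative filename */
    if (!f) { perror("fopen"); return 1; }

    z_stream zs;
    memset(&zs, 0, sizeof zs);
    /* 16 + MAX_WBITS tells zlib to expect a gzip (not zlib) header */
    if (inflateInit2(&zs, 16 + MAX_WBITS) != Z_OK) return 1;

    unsigned char in[4096], out[4096];
    unsigned long total = 0;
    int members = 0, ret = Z_OK;

    while ((zs.avail_in = (uInt)fread(in, 1, sizeof in, f)) > 0) {
        zs.next_in = in;
        do {
            zs.next_out = out;
            zs.avail_out = sizeof out;
            ret = inflate(&zs, Z_NO_FLUSH);
            if (ret != Z_OK && ret != Z_STREAM_END) goto done;
            total += sizeof out - zs.avail_out;
            if (ret == Z_STREAM_END) {      /* one member finished */
                members++;
                inflateReset(&zs);          /* ready for the next member */
            }
        } while (zs.avail_in > 0);
    }
done:
    inflateEnd(&zs);
    fclose(f);
    printf("%d gzip member(s), %lu uncompressed bytes in total\n",
           members, total);
    return 0;
}
```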
It sounds fundamentally unfixable. Shall I close, having detailed this issue to its conclusion?
Ahhh, someone spotted my Stack Overflow question. Might I add that it's not just browsers that choke on the gzip file; the GNOME file manager doesn't like it either. But I can uncompress the file just fine using the CLI. I will link this issue to the Stack Overflow question.
I think it's fixable, but with an important rewrite. The gzipped file created by fwrite is actually a sequence of gzip segments with flush option Z_FINISH, because with threads the stream state can't be kept. I had a look at pigz (v1.0), which is an old but simple threaded version of gzip. It uses the flush option Z_SYNC_FLUSH and manually creates a minimal gzip header, plus the length and CRC at the end. The same idea can be used here. I will have a look.
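For illustration, here is a rough serial sketch of the layout being described, assuming zlib (link with -lz; the chunk size and filenames are illustrative, and threading is omitted since only the file format matters here): one hand-written gzip header, each chunk emitted as a raw-deflate piece ended with Z_SYNC_FLUSH, one empty Z_FINISH piece to terminate the stream, and a single CRC32 + ISIZE trailer, so the whole file is one gzip member:

```c
#include <stdio.h>
#include <string.h>
#include <zlib.h>

#define CHUNK 65536

/* write a 32-bit value little-endian, as the gzip trailer requires */
static void put32le(FILE *f, unsigned long v) {
    for (int i = 0; i < 4; i++) fputc((int)((v >> (8 * i)) & 0xff), f);
}

/* compress one piece as raw deflate (negative windowBits = no per-piece
   header/trailer) and flush it byte-aligned; error handling omitted */
static void deflate_piece(FILE *out, unsigned char *buf, size_t n, int flush,
                          unsigned char *obuf, size_t osz) {
    z_stream zs;
    memset(&zs, 0, sizeof zs);
    deflateInit2(&zs, Z_DEFAULT_COMPRESSION, Z_DEFLATED,
                 -MAX_WBITS, 8, Z_DEFAULT_STRATEGY);
    zs.next_in = buf;   zs.avail_in = (uInt)n;
    zs.next_out = obuf; zs.avail_out = (uInt)osz;
    deflate(&zs, flush);
    fwrite(obuf, 1, osz - zs.avail_out, out);
    deflateEnd(&zs);  /* may report the stream as unfinished; that's fine */
}

int main(void) {
    FILE *in = fopen("d.csv", "rb"), *out = fopen("d.csv.gz", "wb");
    if (!in || !out) { perror("fopen"); return 1; }

    /* one minimal gzip header: magic, CM=8 (deflate), no flags, OS=255 */
    const unsigned char hdr[10] = {0x1f, 0x8b, 8, 0, 0, 0, 0, 0, 0, 0xff};
    fwrite(hdr, 1, sizeof hdr, out);

    static unsigned char ibuf[CHUNK], obuf[CHUNK + 128];
    unsigned long crc = crc32(0L, Z_NULL, 0), isize = 0;
    size_t n;
    while ((n = fread(ibuf, 1, CHUNK, in)) > 0) {
        /* Z_SYNC_FLUSH ends the piece on a byte boundary with BFINAL=0,
           so consecutive pieces concatenate into one valid stream */
        deflate_piece(out, ibuf, n, Z_SYNC_FLUSH, obuf, sizeof obuf);
        crc = crc32(crc, ibuf, (uInt)n);
        isize += n;
    }
    /* one empty Z_FINISH piece sets the final-block bit for the stream */
    deflate_piece(out, ibuf, 0, Z_FINISH, obuf, sizeof obuf);

    put32le(out, crc);                  /* single CRC32 over all input */
    put32le(out, isize & 0xffffffffUL); /* single ISIZE (mod 2^32)     */
    fclose(in); fclose(out);
    return 0;
}
```

In a threaded version, each chunk would get its own deflate stream as above and the per-chunk CRCs would be merged with zlib's crc32_combine(); the cost, as in pigz's independent mode, is a slightly worse ratio because each piece starts with an empty dictionary.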
I created PR #6393 and a new branch, fix_fwrite_length. Feel free to install and test, because there are many changes to the fwrite function.
I wonder if this is why …
What else needs to happen here for the PR to get merged? Can I help in any way?
I just need to find time :) If you're able to test the PR and report any issues, that would be great. It's earmarked for 1.17.0, so I definitely plan to include it in the next release.
I just re-ran my …
data.table::fwrite with compress="gzip" produces slightly incompatible gz files with multiple independent chunks.

It means that browsers cannot receive the compressed .gz files using `Content-Encoding: gzip` and `Transfer-Encoding: chunked`, because a browser only processes the first chunk of a gzipped file, in this case the csv header chunk.

My code currently looks like this:

I'm running it on Windows 10, with Rscript (R) version 4.4.1 (2024-06-14) and the latest release of data.table, version 1.15.2.

Please let me know if there's anything else you need from me.