fwrite with compress="gzip" produces gz files with incorrect uncompressed file sizes #6356

Open
oliverfoster opened this issue Aug 5, 2024 · 16 comments · May be fixed by #6393

Comments

@oliverfoster

oliverfoster commented Aug 5, 2024

data.table::fwrite with compress="gzip" produces slightly incompatible gz files with multiple independent chunks.

It means that browsers cannot receive the compressed .gz files using Content-Encoding: gzip and Transfer-Encoding: chunked because a browser only processes the first chunk of a gzipped file, in this case the csv header chunk.

My code currently looks like this:

      filename <- "c:/test.csv.gz"
      # data.table::fwrite(df, file=filename, row.names=FALSE, na="", compress="gzip")
      # ^ inbuilt data.table compression gives incorrect uncompressed size
      # need to intentionally use R.utils for gzip
      interim <- paste(filename, "_raw", sep="")
      data.table::fwrite(df, file=interim, row.names=FALSE, na="", compress="none")
      R.utils::gzip(interim, destname=filename)

I'm running it on Windows 10, with Rscript (R) version 4.4.1 (2024-06-14) and the latest release of data.table version 1.15.2.

Please let me know if there's anything else you need from me.

@MichaelChirico
Member

to make this actionable, please also share the code that encounters the browser error.

@oliverfoster
Author

oliverfoster commented Aug 5, 2024

I'm using the output file from R, via Node, Express, streams, and some HTTP headers. The condensed version of the code which interfaces with Express and the browser looks like this:

   const filename = 'c:/test.csv.gz';
   const headers = {};
   headers['Content-Encoding'] = 'gzip';
   headers['Vary'] = 'Origin, Accept-Encoding';
   headers['Content-Type'] = 'text/csv';
   res.set(headers);
   const stream = fs.createReadStream(filename);
   stream.pipe(res);
   await new Promise(resolve => stream.on('end', resolve));

It's quite straightforward, I think (Node handles the Transfer-Encoding: chunked part). The file is delivered to the browser with the expectation that it'll be gunzipped in the usual fashion. The above works well when the file is produced using gzip instead of fwrite.

You can see the difference in the output from R when producing files using fwrite and gzip by opening the different outputs in 7-Zip and looking at the file properties. With fwrite, the reported size will be the size of the last chunk only, less than the packed size and not at all representative of the original file size.

This is a good example of how the file size should look when using gzip:
[7-Zip file properties screenshot: output produced with gzip]

This is what it looks like when made with fwrite:
[7-Zip file properties screenshot: output produced with fwrite]

Here is a Stack Overflow question from someone having a similar issue to mine: https://stackoverflow.com/questions/78551554/malformed-gzip-file-with-r-data-table-fwrite

When using the fwrite output, the browser truncates the file early, after the CSV header, while the server stream continues trying to send the rest of the gzip data. I can replicate this in Firefox and Chrome.

@oliverfoster
Author

Is there anything else you need? I'm not sure I quite answered your question. @MichaelChirico

@MichaelChirico
Member

MichaelChirico commented Aug 6, 2024 via email

@oliverfoster
Author

Cool. At least there is an issue here now. I have no idea how to develop R packages, so I'll have to leave this until I have time to learn. Thanks. 👍

@oliverfoster
Author

oliverfoster commented Aug 7, 2024

I think I may have found the reason.

"is multithreaded and compresses each chunk on-the-fly"

I think browsers may only support single-chunk gz files. I haven't confirmed this, but it would make sense. The reported size of the file seems to be the size of the last chunk.

It seems as though, even when using one thread via nThread and a huge buffer (1024), the CSV header line gets written as the first gz chunk. The browser accepts only the header chunk as a single file, and 7-Zip reads the size of the last chunk, since each chunk's uncompressed size (ISIZE) is written in its last 4 bytes.
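
For illustration only (not part of the original comment; the file name is hypothetical), a minimal C sketch of what a trailer-only reader such as 7-Zip or gzip -l effectively does: it takes the last 4 bytes of the file as ISIZE, which for a multi-member .gz is the uncompressed size of the final member alone.

/* Illustrative sketch: read ISIZE (uncompressed size mod 2^32) from the
   last 4 bytes of a .gz file. For a file containing several gzip members,
   this is the size of the final member only, not of the whole file. */
#include <stdio.h>
#include <stdint.h>

int main(void) {
  FILE *f = fopen("test.csv.gz", "rb");   /* hypothetical file name */
  if (!f) return 1;
  unsigned char tail[4];
  if (fseek(f, -4L, SEEK_END) != 0 || fread(tail, 1, 4, f) != 4) {
    fclose(f);
    return 1;
  }
  fclose(f);
  uint32_t isize = (uint32_t)tail[0] | ((uint32_t)tail[1] << 8) |
                   ((uint32_t)tail[2] << 16) | ((uint32_t)tail[3] << 24);
  printf("ISIZE recorded in the last trailer: %u bytes\n", isize);
  return 0;
}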

@MichaelChirico
Member

MichaelChirico commented Aug 7, 2024 via email

@aitap
Contributor

aitap commented Aug 18, 2024

It's not purely due to the threads. #6356 (comment) is right, fwrite writes the header in a separate chunk too:

library(data.table)
setDTthreads(1)

d <- data.table(x = 1:10)
fwrite(d, 'd.csv.gz', compress = 'gzip')

# What does the header say?
system('gzip -l d.csv.gz')
#          compressed        uncompressed  ratio uncompressed_name
#                  63                  21 -200.0% d.csv

# There are actually 23 bytes inside, not 21
system('zcat d.csv.gz | wc -c')
# 23

Every time the code calls compressbuff, it writes a separate, self-contained chunk because the flush argument of deflate() is set to Z_FINISH:

data.table/src/fwrite.c

Lines 574 to 587 in 6cee825

int compressbuff(z_stream *stream, void* dest, size_t *destLen, const void* source, size_t sourceLen)
{
  stream->next_out = dest;
  stream->avail_out = *destLen;
  stream->next_in = (Bytef *)source; // don't use z_const anywhere; #3939
  stream->avail_in = sourceLen;
  int err = deflate(stream, Z_FINISH);
  if (err == Z_OK) {
    // with Z_FINISH, deflate must return Z_STREAM_END if correct, otherwise it's an error and we shouldn't return Z_OK (0)
    err = -9; // # nocov
  }
  *destLen = stream->total_out;
  return err == Z_STREAM_END ? Z_OK : err;
}

Unless called with col.names = FALSE or append = TRUE, fwrite will call compressbuff at least twice:

ret1 = compressbuff(&stream, zbuff, &zbuffUsed, buff, (size_t)(ch-buff));

int ret = compressbuff(mystream, myzBuff, &myzbuffUsed, myBuff, (size_t)(ch-myBuff));

A program that only looks at the gzip trailer at the end of the file will indeed only see the trailer of the last chunk. zlib can only be used for parallel compression in separate chunks, so it won't be easy to unite the header chunk and the first data chunk, even when no parallelism is used.
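
To make that concrete, here is a standalone sketch (this is not fwrite's code; zlib is assumed, and the file name and data are made up) showing how each deflate(..., Z_FINISH) call on its own gzip stream emits a complete member, so the CSV header and the data end up as separate members concatenated in one file:

#include <stdio.h>
#include <string.h>
#include <zlib.h>

static int write_member(FILE *out, const char *data) {
  z_stream s;
  memset(&s, 0, sizeof s);
  // windowBits 15 + 16 asks zlib for a gzip (not zlib/raw) wrapper
  if (deflateInit2(&s, Z_DEFAULT_COMPRESSION, Z_DEFLATED, 15 + 16, 8,
                   Z_DEFAULT_STRATEGY) != Z_OK) return -1;
  unsigned char buf[4096];
  s.next_in = (Bytef *)data;
  s.avail_in = (uInt)strlen(data);
  s.next_out = buf;
  s.avail_out = sizeof buf;
  int err = deflate(&s, Z_FINISH);   // Z_FINISH => complete, standalone member
  if (err != Z_STREAM_END) { deflateEnd(&s); return -1; }
  fwrite(buf, 1, s.total_out, out);
  deflateEnd(&s);
  return 0;
}

int main(void) {
  FILE *out = fopen("multi.csv.gz", "wb");   // hypothetical output file
  if (!out) return 1;
  write_member(out, "x\n");                  // header row -> member 1
  write_member(out, "1\n2\n3\n");            // data rows  -> member 2
  fclose(out);
  return 0;
}

gzip and zcat decompress such a concatenation correctly by walking every member, which is why the CLI sees all the data while trailer-only readers report only the last member's size.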

@oliverfoster
Author

It sounds fundamentally unfixable. Shall I close this, now that the issue has been detailed to its conclusion?

@dvictori

Ahhh, someone spotted my Stack Overflow question. Might I add that it's not just browsers that choke on the gzip file; the GNOME file manager does not like it either. But I can uncompress the file just fine using the CLI.

Will link this issue to the Stack Overflow question.

@philippechataignon
Contributor

I think it's fixable, but with an important rewrite. Currently the gzipped file created by fwrite is a sequence of gzip segments, each written with the flush option Z_FINISH, because the stream state can't be kept across threads. I had a look at pigz (v1.0), which is an old but simple threaded version of gzip. It uses the flush option Z_SYNC_FLUSH, creates a minimal gzip header manually, and writes the length and CRC at the end. The same idea can be used here. I will have a look.
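
A rough sketch of that idea (my reading of the pigz approach, not the code in PR #6393; the file name and chunk contents are illustrative): write one minimal gzip header, deflate each chunk as raw deflate data flushed with Z_SYNC_FLUSH (the final chunk with Z_FINISH), keep a running CRC32 and byte count over the uncompressed data, and append a single 8-byte trailer at the end.

/* Illustrative only. One gzip header, raw deflate data flushed per chunk
   with Z_SYNC_FLUSH (last chunk with Z_FINISH), then an 8-byte trailer:
   CRC32 and total uncompressed length (ISIZE). */
#include <stdio.h>
#include <string.h>
#include <zlib.h>

int main(void) {
  const char *chunks[] = { "x\n", "1\n2\n3\n" };   // header row + data rows
  const int nchunk = 2;

  FILE *out = fopen("single.csv.gz", "wb");        // hypothetical output file
  if (!out) return 1;

  // minimal 10-byte gzip header: magic, deflate, no flags, no mtime, OS=Unix
  const unsigned char hdr[10] = {0x1f, 0x8b, 8, 0, 0, 0, 0, 0, 0, 3};
  fwrite(hdr, 1, sizeof hdr, out);

  z_stream s;
  memset(&s, 0, sizeof s);
  // windowBits = -15 => raw deflate, no per-chunk header or trailer
  if (deflateInit2(&s, Z_DEFAULT_COMPRESSION, Z_DEFLATED, -15, 8,
                   Z_DEFAULT_STRATEGY) != Z_OK) return 1;

  uLong crc = crc32(0L, Z_NULL, 0);
  uLong total = 0;
  unsigned char buf[4096];

  for (int i = 0; i < nchunk; i++) {
    uInt len = (uInt)strlen(chunks[i]);
    s.next_in = (Bytef *)chunks[i];
    s.avail_in = len;
    crc = crc32(crc, (const Bytef *)chunks[i], len);
    total += len;
    int flush = (i == nchunk - 1) ? Z_FINISH : Z_SYNC_FLUSH;
    do {
      s.next_out = buf;
      s.avail_out = sizeof buf;
      deflate(&s, flush);
      fwrite(buf, 1, sizeof buf - s.avail_out, out);
    } while (s.avail_out == 0);
  }
  deflateEnd(&s);

  // trailer: CRC32 then ISIZE, both little-endian
  unsigned char tail[8];
  for (int i = 0; i < 4; i++) tail[i]     = (crc   >> (8 * i)) & 0xff;
  for (int i = 0; i < 4; i++) tail[4 + i] = (total >> (8 * i)) & 0xff;
  fwrite(tail, 1, 8, out);
  fclose(out);
  return 0;
}

The result is a single gzip member whose trailer reflects the whole file; pigz additionally does the per-chunk compression in parallel and merges the per-chunk checksums with crc32_combine(), which this sequential sketch omits.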

@philippechataignon philippechataignon self-assigned this Aug 22, 2024
@philippechataignon philippechataignon linked a pull request Aug 23, 2024 that will close this issue
@philippechataignon
Contributor

I created PR #6393 and a new branch, fix_fwrite_length. Feel free to install and test it, because there are many changes to the fwrite function.

@philippechataignon philippechataignon removed their assignment Sep 5, 2024
@hutch3232

I wonder if this is why h2o doesn't correctly import csvs gz-compressed by data.table:
h2oai/h2o-3#6522

@oliverfoster
Author

What else needs to happen here for the PR to get merged? Can I help in any way?

@MichaelChirico
Member

I just need to find time :)

If you're able to test the PR and report any issues, that would be great.

It's earmarked for 1.17.0, so I definitely plan to include it in the next release.

@hutch3232

I just re-ran my h2o test (#6356 (comment)) using @philippechataignon's PR as of cdf4277 and it resolves the issue!
