fwrite with compress="gzip" produces gz files with incorrect uncompressed file sizes #6356

Open
oliverfoster opened this issue Aug 5, 2024 · 16 comments · May be fixed by #6393

Comments

@oliverfoster

oliverfoster commented Aug 5, 2024

data.table::fwrite with compress="gzip" produces slightly incompatible gz files with multiple independent chunks.

It means that browsers cannot receive the compressed .gz files using Content-Encoding: gzip and Transfer-Encoding: chunked because a browser only processes the first chunk of a gzipped file, in this case the csv header chunk.

My code currently looks like this:

      filename <- "c:/test.csv.gz"
      # data.table::fwrite(df, file=filename, row.names=FALSE, na="", compress="gzip")
      # ^ inbuilt data.table compression gives incorrect uncompressed size
      # need to intentionally use R.utils for gzip
      interim <- paste(filename, "_raw", sep="")
      data.table::fwrite(df, file=interim, row.names=FALSE, na="", compress="none")
      R.utils::gzip(interim, destname=filename)

I'm running it on Windows 10, with Rscript (R) version 4.4.1 (2024-06-14) and the latest release of data.table version 1.15.2.

Please let me know if there's anything else you need from me.

@MichaelChirico
Member

to make this actionable, please also share the code that encounters the browser error.

@oliverfoster
Author

oliverfoster commented Aug 5, 2024

I'm using the output file from R, via Node, Express, streams, and some HTTP headers. The condensed version of the code which interfaces with Express and the browser looks like this:

   const filename = 'c:/test.csv.gz';
   const headers = {};
   headers['Content-Encoding'] = 'gzip';
   headers['Vary'] = 'Origin, Accept-Encoding';
   headers['Content-Type'] = 'text/csv';
   res.set(headers);
   const stream = fs.createReadStream(filename);
   stream.pipe(res);
   await new Promise(resolve => stream.on('end', resolve));

It's quite straightforward, I think (Node handles the Transfer-Encoding: chunked part). The file is delivered to the browser with the expectation that it'll be gunzipped in the usual fashion. The above works well when the file is produced using gzip instead of fwrite.

You can see the difference in the output from R when producing files using fwrite and gzip by opening the different outputs in 7-Zip and looking at the file properties. With fwrite, the reported size will be the size of the last chunk only, less than the packed size and not at all representative of the original file size.

This is a good example of how the file size should look when using gzip:
[7-Zip file properties screenshot: output produced with gzip]

This is what it looks like when made with fwrite:
[7-Zip file properties screenshot: output produced with fwrite]

Here is a Stack Overflow question from someone having a similar issue to mine: https://stackoverflow.com/questions/78551554/malformed-gzip-file-with-r-data-table-fwrite

When using the fwrite output, the browser truncates the file early, after the CSV header, while the server stream continues trying to send the rest of the gzip data. I can replicate this in Firefox and Chrome.

@oliverfoster
Author

Is there anything else you need? I'm not sure I quite answered your question. @MichaelChirico

@MichaelChirico
Member

MichaelChirico commented Aug 6, 2024 via email

@oliverfoster
Author

Cool. At least there is an issue here now. I have no idea how to develop R packages, so I'll have to leave this until I have time to learn. Thanks. 👍

@oliverfoster
Author

oliverfoster commented Aug 7, 2024

I think I may have found the reason.

"is multithreaded and compresses each chunk on-the-fly"

I think browsers may only support single-chunk gz files. I haven't confirmed this, but it would make sense. The reported size of the file seems to be the size of the last chunk.

It seems as though, even when using one thread via nThread and a huge buffer (1024), the CSV header line gets written as the first gz chunk. The browser accepts only the header chunk as a single file, and 7-Zip reads the size of the last chunk, since each chunk's uncompressed size (ISIZE) is written in its last 4 bytes.
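
For illustration only (not part of the original comment; the file name is hypothetical), a minimal C sketch of what a trailer-only reader such as 7-Zip or gzip -l effectively does: it takes the last 4 bytes of the file as ISIZE, which for a multi-member .gz is the uncompressed size of the final member alone.

/* Illustrative sketch: read ISIZE (uncompressed size mod 2^32) from the
   last 4 bytes of a .gz file. For a file containing several gzip members,
   this is the size of the final member only, not of the whole file. */
#include <stdio.h>
#include <stdint.h>

int main(void) {
  FILE *f = fopen("test.csv.gz", "rb");   /* hypothetical file name */
  if (!f) return 1;
  unsigned char tail[4];
  if (fseek(f, -4L, SEEK_END) != 0 || fread(tail, 1, 4, f) != 4) {
    fclose(f);
    return 1;
  }
  fclose(f);
  uint32_t isize = (uint32_t)tail[0] | ((uint32_t)tail[1] << 8) |
                   ((uint32_t)tail[2] << 16) | ((uint32_t)tail[3] << 24);
  printf("ISIZE recorded in the last trailer: %u bytes\n", isize);
  return 0;
}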

@MichaelChirico
Member

MichaelChirico commented Aug 7, 2024 via email

@aitap
Contributor

aitap commented Aug 18, 2024

It's not purely due to the threads. #6356 (comment) is right, fwrite writes the header in a separate chunk too:

library(data.table)
setDTthreads(1)

d <- data.table(x = 1:10)
fwrite(d, 'd.csv.gz', compress = 'gzip')

# What does the header say?
system('gzip -l d.csv.gz')
#          compressed        uncompressed  ratio uncompressed_name
#                  63                  21 -200.0% d.csv

# There are actually 23 bytes inside, not 21
system('zcat d.csv.gz | wc -c')
# 23

Every time the code calls compressbuff, it writes a separate, self-contained chunk because the flush argument of deflate() is set to Z_FINISH:

data.table/src/fwrite.c

Lines 574 to 587 in 6cee825

int compressbuff(z_stream *stream, void* dest, size_t *destLen, const void* source, size_t sourceLen)
{
  stream->next_out = dest;
  stream->avail_out = *destLen;
  stream->next_in = (Bytef *)source; // don't use z_const anywhere; #3939
  stream->avail_in = sourceLen;
  int err = deflate(stream, Z_FINISH);
  if (err == Z_OK) {
    // with Z_FINISH, deflate must return Z_STREAM_END if correct, otherwise it's an error and we shouldn't return Z_OK (0)
    err = -9; // # nocov
  }
  *destLen = stream->total_out;
  return err == Z_STREAM_END ? Z_OK : err;
}

Unless called with col.names = FALSE or append = TRUE, fwrite will call compressbuff at least twice:

ret1 = compressbuff(&stream, zbuff, &zbuffUsed, buff, (size_t)(ch-buff));

int ret = compressbuff(mystream, myzBuff, &myzbuffUsed, myBuff, (size_t)(ch-myBuff));

A program that only looks at the gzip trailer at the end of the file will indeed only see the trailer of the last chunk. zlib can only be used for parallel compression in separate chunks, so it won't be easy to unite the header chunk and the first data chunk, even when no parallelism is used.
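
To make that concrete, here is a standalone sketch (this is not fwrite's code; zlib is assumed, and the file name and data are made up) showing how each deflate(..., Z_FINISH) call on its own gzip stream emits a complete member, so the CSV header and the data end up as separate members concatenated in one file:

#include <stdio.h>
#include <string.h>
#include <zlib.h>

static int write_member(FILE *out, const char *data) {
  z_stream s;
  memset(&s, 0, sizeof s);
  // windowBits 15 + 16 asks zlib for a gzip (not zlib/raw) wrapper
  if (deflateInit2(&s, Z_DEFAULT_COMPRESSION, Z_DEFLATED, 15 + 16, 8,
                   Z_DEFAULT_STRATEGY) != Z_OK) return -1;
  unsigned char buf[4096];
  s.next_in = (Bytef *)data;
  s.avail_in = (uInt)strlen(data);
  s.next_out = buf;
  s.avail_out = sizeof buf;
  int err = deflate(&s, Z_FINISH);   // Z_FINISH => complete, standalone member
  if (err != Z_STREAM_END) { deflateEnd(&s); return -1; }
  fwrite(buf, 1, s.total_out, out);
  deflateEnd(&s);
  return 0;
}

int main(void) {
  FILE *out = fopen("multi.csv.gz", "wb");   // hypothetical output file
  if (!out) return 1;
  write_member(out, "x\n");                  // header row -> member 1
  write_member(out, "1\n2\n3\n");            // data rows  -> member 2
  fclose(out);
  return 0;
}

gzip and zcat decompress such a concatenation correctly by walking every member, which is why the CLI sees all the data while trailer-only readers report only the last member's size.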

@oliverfoster
Author

It sounds fundamentally unfixable. Shall I close this, now that the issue has been detailed to its conclusion?

@dvictori

Ahhh, someone spotted my Stack Overflow question. Might I add that it's not just browsers that choke on the gzip file; the GNOME file manager does not like it either. But I can uncompress the file just fine using the CLI.

Will link this issue to the Stack Overflow question.

@philippechataignon
Contributor

I think it's fixable, but with an important rewrite. Currently the gzipped file created by fwrite is a sequence of gzip segments, each written with the flush option Z_FINISH, because the stream state can't be kept across threads. I had a look at pigz (v1.0), which is an old but simple threaded version of gzip. It uses the flush option Z_SYNC_FLUSH, creates a minimal gzip header manually, and writes the length and CRC at the end. The same idea can be used here. I will have a look.
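
A rough sketch of that idea (my reading of the pigz approach, not the code in PR #6393; the file name and chunk contents are illustrative): write one minimal gzip header, deflate each chunk as raw deflate data flushed with Z_SYNC_FLUSH (the final chunk with Z_FINISH), keep a running CRC32 and byte count over the uncompressed data, and append a single 8-byte trailer at the end.

/* Illustrative only. One gzip header, raw deflate data flushed per chunk
   with Z_SYNC_FLUSH (last chunk with Z_FINISH), then an 8-byte trailer:
   CRC32 and total uncompressed length (ISIZE). */
#include <stdio.h>
#include <string.h>
#include <zlib.h>

int main(void) {
  const char *chunks[] = { "x\n", "1\n2\n3\n" };   // header row + data rows
  const int nchunk = 2;

  FILE *out = fopen("single.csv.gz", "wb");        // hypothetical output file
  if (!out) return 1;

  // minimal 10-byte gzip header: magic, deflate, no flags, no mtime, OS=Unix
  const unsigned char hdr[10] = {0x1f, 0x8b, 8, 0, 0, 0, 0, 0, 0, 3};
  fwrite(hdr, 1, sizeof hdr, out);

  z_stream s;
  memset(&s, 0, sizeof s);
  // windowBits = -15 => raw deflate, no per-chunk header or trailer
  if (deflateInit2(&s, Z_DEFAULT_COMPRESSION, Z_DEFLATED, -15, 8,
                   Z_DEFAULT_STRATEGY) != Z_OK) return 1;

  uLong crc = crc32(0L, Z_NULL, 0);
  uLong total = 0;
  unsigned char buf[4096];

  for (int i = 0; i < nchunk; i++) {
    uInt len = (uInt)strlen(chunks[i]);
    s.next_in = (Bytef *)chunks[i];
    s.avail_in = len;
    crc = crc32(crc, (const Bytef *)chunks[i], len);
    total += len;
    int flush = (i == nchunk - 1) ? Z_FINISH : Z_SYNC_FLUSH;
    do {
      s.next_out = buf;
      s.avail_out = sizeof buf;
      deflate(&s, flush);
      fwrite(buf, 1, sizeof buf - s.avail_out, out);
    } while (s.avail_out == 0);
  }
  deflateEnd(&s);

  // trailer: CRC32 then ISIZE, both little-endian
  unsigned char tail[8];
  for (int i = 0; i < 4; i++) tail[i]     = (crc   >> (8 * i)) & 0xff;
  for (int i = 0; i < 4; i++) tail[4 + i] = (total >> (8 * i)) & 0xff;
  fwrite(tail, 1, 8, out);
  fclose(out);
  return 0;
}

The result is a single gzip member whose trailer reflects the whole file; pigz additionally does the per-chunk compression in parallel and merges the per-chunk checksums with crc32_combine(), which this sequential sketch omits.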

@philippechataignon philippechataignon self-assigned this Aug 22, 2024
@philippechataignon philippechataignon linked a pull request Aug 23, 2024 that will close this issue
@philippechataignon
Contributor

I created PR #6393 and a new branch, fix_fwrite_length. Feel free to install and test it, because there are many changes to the fwrite function.

@philippechataignon philippechataignon removed their assignment Sep 5, 2024
@hutch3232

I wonder if this is why h2o doesn't correctly import csvs gz-compressed by data.table:
h2oai/h2o-3#6522

@oliverfoster
Author

What else needs to happen here for the PR to get merged? Can I help in any way?

@MichaelChirico
Member

I just need to find time :)

If you're able to test the PR and report any issues, that would be great.

It's earmarked for 1.17.0, so I definitely plan to include it in the next release.

@hutch3232

I just re-ran my h2o test (#6356 (comment)) using @philippechataignon's PR as of cdf4277 and it resolves the issue!
