
Fix fwrite length for gzip output #6393

Open · wants to merge 46 commits into base: master

Conversation

@philippechataignon (Contributor) commented Aug 23, 2024

Closes #6356. Closes #5506.

This PR is an attempt to make fwrite produce a better gzip file. It is a substantial rewrite because it includes some refactoring of the existing code.

zlib

  • use Z_SYNC_FLUSH instead of Z_FINISH
  • write the gzip header manually and store the CRC and length in the trailer
  • compute the length and CRC per thread and combine them in the main thread
  • the length field in the gzip specification is 32 bits, so the stored value is the uncompressed size modulo 2^32 whenever that size exceeds 4 GiB (see the sketch after this list)
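
A minimal sketch of the header/trailer handling above (illustrative code, not the PR's actual implementation). The gzip header is 10 fixed bytes, and the trailer stores the CRC-32 and the length modulo 2^32, both little-endian, per RFC 1952:

#include <stdint.h>
#include <stdio.h>
#include <zlib.h>

// Fixed 10-byte gzip header written by hand:
// magic, CM=deflate, FLG=0, MTIME=0, XFL=0, OS=unknown.
static const unsigned char gzip_header[10] = {
  0x1f, 0x8b, 8, 0, 0, 0, 0, 0, 0, 0xff
};

// 8-byte gzip trailer: CRC-32 of the uncompressed data, then its
// length modulo 2^32 (the ISIZE field), both little-endian.
static void write_gzip_trailer(FILE *f, uint32_t crc, uint64_t len) {
  unsigned char trailer[8];
  uint32_t isize = (uint32_t)(len & 0xffffffffu);  // length modulo 2^32
  for (int i = 0; i < 4; i++) trailer[i]   = (crc   >> (8*i)) & 0xff;
  for (int i = 0; i < 4; i++) trailer[4+i] = (isize >> (8*i)) & 0xff;
  fwrite(trailer, 1, 8, f);
}

With this layout, each thread deflates its chunk and the main thread sums the lengths and merges the per-thread CRCs with zlib's crc32_combine(crc, chunk_crc, chunk_len).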

C code

  • simplify the implementation to a single #pragma omp parallel for over the chunk loop, with #pragma omp ordered for the writing and summarizing part (see the sketch after this list)
  • Matt Dowle introduced the use of a pool of buffers; that idea is generalized here. The pools are created at the beginning and then used for writing both the header and the rows, so all malloc calls happen early and no separate header buffer is needed.
  • deobfuscate some parts: if () is now followed by a new line, and no more =- or =*. A lot of work remains; perhaps run the indent command?
  • remove some old debug code (msg)
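
A self-contained sketch of that loop shape (an assumption about the structure, not the PR's code; for brevity it writes concatenated zlib streams rather than a real gzip member). Chunks are compressed concurrently, but the ordered block guarantees they are written in input order:

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <zlib.h>

int main(void) {
  enum { NCHUNK = 8, CHUNK = 1 << 16 };
  static unsigned char in[NCHUNK][CHUNK];
  memset(in, 'x', sizeof in);
  FILE *out = fopen("chunks.z", "wb");
  if (!out) return 1;
  #pragma omp parallel for ordered schedule(dynamic)
  for (int i = 0; i < NCHUNK; i++) {
    uLongf zlen = compressBound(CHUNK);
    // A real implementation would take this buffer from a pool created
    // up front, so that no malloc happens inside the parallel region.
    unsigned char *zbuf = malloc(zlen);
    if (!zbuf) abort();
    compress2(zbuf, &zlen, in[i], CHUNK, 6);   // compress in parallel
    #pragma omp ordered
    fwrite(zbuf, 1, zlen, out);                // write strictly in chunk order
    free(zbuf);
  }
  fclose(out);
  return 0;
}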

github-actions bot commented Aug 23, 2024

Comparison Plot

Generated via commit 5c57eba

Download link for the artifact containing the test results: atime-results.zip

Task Duration
R setup and installing dependencies 4 minutes and 42 seconds
Installing different package versions 8 minutes and 11 seconds
Running and plotting the test cases 2 minutes and 21 seconds

@oliverfoster commented Aug 23, 2024

len in gzip specification is 32 bits and then is false for uncompressed size > 4GiB

Is it not meant to be the length modulo 4 GiB rather than false? That is, the remainder after subtracting the largest multiple of 4 GiB?

[screenshot: Screenshot_20240823-194218]

@philippechataignon (Contributor, Author) commented Aug 24, 2024

You're right, and this PR version now stores the length modulo 2**32 as the specification requires, but it is not the true size.
Note that in new versions of gzip (after 1.12), the length stored in the file is no longer used; see the note on gzip -l at https://savannah.gnu.org/news/?id=10121

7z l  # (uses the stored header)
    Date      Time    Attr         Size   Compressed  Name
                     .....   1298954309   2212092276  mtcars.csv

% gzip --version
gzip 1.12
% gzip -lv mtcars.csv.gz  # (decompresses, so it takes much more time)
method  crc     date  time           compressed        uncompressed  ratio uncompressed_name
defla 5cd79282 Aug 24 10:19          2212092276          9888888901  77.6% mtcars.csv
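
The 7z figure is exactly this modulo arithmetic at work: 9888888901 mod 2^32 = 9888888901 − 2 × 4294967296 = 1298954309, which is the value stored in the 32-bit ISIZE field.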

@philippechataignon (Contributor, Author)

Folded PR #5513 into this PR, adding the new compressLevel parameter.

@MichaelChirico (Member)

I documented the current process here,

Thanks Toby, I had looked at the .ci/atime/tests.R script and some {atime} documentation directly and didn't think to check the Wiki. Should we maybe (1) migrate that documentation into .ci/atime directly, (2) add a .ci/atime/README.md pointing to the Wiki, or (3) point to the Wiki from the first line of .ci/atime/tests.R?

@MichaelChirico (Member)

@philippechataignon do you want to have a go at adding an atime performance regression test? Totally fine if not -- what would help at least would be a simple benchmark of gzipped fwrite that you think captures the important pieces of what's changed here. Does that make sense?

@tdhock (Member) commented Sep 3, 2024

Yes, it would be great to point to the Wiki from the first line of .ci/atime/tests.R.

I would suggest keeping the docs on the wiki, which is easier to update and can include screenshots/graphics.

@philippechataignon (Contributor, Author)

@philippechataignon do you want to have a go at adding an atime performance regression test? Totally fine if not -- what would help at least would be a simple benchmark of gzipped fwrite that you think captures the important pieces of what's changed here. Does that make sense?

OK on testing for regressions, but note that the core of fwrite hasn't changed: same buffer sizes, same number of jobs, same number of rows per job. Personally, I observe timings similar to the previous version.

One point of discussion: I notice that #2020 introduced a change I had never realized before this PR. By default, the scipen fwrite parameter uses the value of the R option scipen. Like a lot of people, I have options(scipen = 999) in my Rprofile because I don't like scientific notation on screen, but I never realized that fwrite was penalized by it. Why? Because maxLineLen can become very high, which gives a lot of batches with few rows each and barely-used buffers. In fwrite, scipen is capped at 350, but that is still a very high limit, and maxLineLen is used to determine the number and size of fwrite's chunks.

To test the impact, I use this little program:

library(data.table)

n = 10000
ncol = 1000
dt <- data.table(i=1:n)
dt[, paste0("V", 1:ncol) := lapply(1:ncol, function(x) as.numeric(sample(1:n, replace=TRUE)))]
print(sessionInfo())
system.time(fwrite(dt, "/dev/null", compress="gzip", verbose=TRUE, scipen=0))
system.time(fwrite(dt, "/dev/null", compress="gzip", verbose=TRUE, scipen=999))

With scipen = 0

maxLineLen=61026. Found in 0.000s
Writing bom (false), yaml (0 characters) and column names (true)
Writing 10000 rows in 73 batches of 137 rows, each buffer size 8388608 bytes (8 MiB), showProgress=1, nth=4
zlib: uncompressed length=48947760 (46 MiB), compressed length=21719798 (20 MiB), ratio=44.4%, crc=26a23b0a
Written 10000 rows in 1.569 secs using 4 threads. MaxBuffUsed=7%

With scipen = 999

maxLineLen=761026. Found in 0.000s
Writing bom (false), yaml (0 characters) and column names (true)
Writing 10000 rows in 910 batches of 11 rows, each buffer size 8388608 bytes (8 MiB), showProgress=1, nth=4
zlib: uncompressed length=48947760 (46 MiB), compressed length=22745125 (21 MiB), ratio=46.5%, crc=26a23b0a
Written 10000 rows in 1.202 secs using 4 threads. MaxBuffUsed=0%

In the last case, the real mean line length is about 5000 but it is estimated at 761026. The compression ratio is higher (i.e. worse) because the buffers are barely used. Surprisingly, the timing is better despite the OpenMP threading overhead.
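
Those batch counts follow directly from dividing the 8 MiB buffer by the estimated line length: 8388608 / 61026 ≈ 137 rows per batch (10000 / 137 ≈ 73 batches) versus 8388608 / 761026 ≈ 11 rows per batch (10000 / 11 ≈ 910 batches).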

In my opinion, the scipen parameter in fwrite should be 0 rather than the scipen option, which is tied to the digits option; digits is not an fwrite parameter, and the number of digits in fwrite output is higher anyway (20?).

I use this little benchmark for the scipen impact and I think it can be used for atime. I've tried adding this:

  # Issue with fwrite length for gzip output, fixed in: https://github.com/Rdatatable/data.table/pull/6393
  # No regression timing test
  "No regression fwrite" = atime::atime_test(
    N = 10^seq(2, 8),
    setup = {
      set.seed(1)
      ncol = 1000
      L <- data.table(i=1:N)
      L[, paste0("V", 1:ncol) := lapply(1:ncol, function(x) rnorm(N))]
    },
    expr = {
      fwrite(L, "/dev/null", compress="gzip")  # L, the table built in setup
    },
    Fast = "117ab45674f1e56304abca83f9f0df50ab0274be", # Close-to-last merge commit in the PR
    Slow = "e73c2c849f921cf4ef51e3809842e0fee9b9f52c"),

but I'm not sure that /dev/null is portable, and if we write a real file, the disk I/O dominates the timing.

OK for someone else to take over and test that there is no timing regression.

@Anirban166 (Member)

I documented the current process here,

Should we maybe (1) migrate that documentation into .ci/atime directly, (2) add a .ci/atime/README.md pointing to the Wiki, or (3) point to the Wiki from the first line of .ci/atime/tests.R?

(2) and (3) sound good to me.

Yes, it would be great to point to the Wiki from the first line of .ci/atime/tests.R.

Should I go ahead and make a PR for this quick addition?

I would suggest keeping the docs on the wiki, which is easier to update and can include screenshots/graphics.

I agree, both for being able to include images and because, if we miss something that other people notice, they can fill in the gaps quickly.

@MichaelChirico (Member)

I notice that #2020 introduced a change I had never realized before this PR. By default, the scipen fwrite parameter uses the value of the R option scipen.

Moved this to #6457 for further discussion; I think it's out of scope here. Thanks!

@tdhock (Member) commented Sep 3, 2024

but I'm not sure that /dev/null is portable

This only has to run on the GitHub Actions Ubuntu VM, so /dev/null should be OK in principle, but I changed it to tempfile(), which should be fine too.

Thanks for sharing your code for scipen benchmarking. I adapted it to get the following atime result, which indicates little to no impact on computation time, but a small constant-factor increase in memory usage.

[atime result plot]

# pkg.edit.fun passed to atime: renames the package per commit SHA so that
# multiple data.table versions can be installed and loaded side by side.
edit.data.table = function(old.Package, new.Package, sha, new.pkg.path) {
  pkg_find_replace <- function(glob, FIND, REPLACE) {
    atime::glob_find_replace(file.path(new.pkg.path, glob), FIND, REPLACE)
  }
  Package_regex <- gsub(".", "_?", old.Package, fixed = TRUE)
  Package_ <- gsub(".", "_", old.Package, fixed = TRUE)
  new.Package_ <- paste0(Package_, "_", sha)
  pkg_find_replace(
    "DESCRIPTION",
    paste0("Package:\\s+", old.Package),
    paste("Package:", new.Package))
  pkg_find_replace(
    file.path("src", "Makevars.*in"),
    Package_regex,
    new.Package_)
  pkg_find_replace(
    file.path("R", "onLoad.R"),
    Package_regex,
    new.Package_)
  pkg_find_replace(
    file.path("R", "onLoad.R"),
    sprintf('packageVersion\\("%s"\\)', old.Package),
    sprintf('packageVersion\\("%s"\\)', new.Package))
  pkg_find_replace(
    file.path("src", "init.c"),
    paste0("R_init_", Package_regex),
    paste0("R_init_", gsub("[.]", "_", new.Package_)))
  pkg_find_replace(
    "NAMESPACE",
    sprintf('useDynLib\\("?%s"?', Package_regex),
    paste0('useDynLib(', new.Package_))
}
out.csv <- tempfile()
issue6393 <- atime::atime_versions(
  "~/R/data.table",
  N = 2^seq(1, 20),
  pkg.edit.fun=edit.data.table,
  setup = {
    set.seed(1)
    NC = 10
    L <- data.table(i=1:N)
    L[, paste0("V", 1:NC) := replicate(NC, rnorm(N), simplify=FALSE)]
  },
  expr = {
    data.table::fwrite(L, out.csv, compress="gzip")
  },
  Fast="f339aa64c426a9cd7cf2fcb13d91fc4ed353cd31", # Parent of the first commit https://github.com/Rdatatable/data.table/commit/fcc10d73a20837d0f1ad3278ee9168473afa5ff1 in the PR https://github.com/Rdatatable/data.table/pull/6393/commits with major change to fwrite with gzip.
  PR = "117ab45674f1e56304abca83f9f0df50ab0274be") # Close-to-last merge commit in the PR.
plot(issue6393)

codecov bot commented Sep 27, 2024

Codecov Report

Attention: Patch coverage is 91.85185% with 11 lines in your changes missing coverage. Please review.

Project coverage is 98.54%. Comparing base (23dac21) to head (5c57eba).

Files with missing lines   Patch %   Lines
src/fwrite.c               91.12%    11 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6393      +/-   ##
==========================================
- Coverage   98.61%   98.54%   -0.08%     
==========================================
  Files          79       79              
  Lines       14536    14591      +55     
==========================================
+ Hits        14334    14378      +44     
- Misses        202      213      +11     


@MichaelChirico (Member)

Thanks again @philippechataignon! I still can't say I've understood the C changes thoroughly, but they pass the existing suite and we have user reports that it is working. I am happy to submit now and see if revdep checks tell us anything new. 🚀

DTPRINT(_("Allocate %zu bytes for thread_streams\n"), nth * sizeof(z_stream));
}
if (!thread_streams)
STOP(_("Failed to allocated %d bytes for threads_streams."), (int)(nth * sizeof(z_stream)));

Suggested change
STOP(_("Failed to allocated %d bytes for threads_streams."), (int)(nth * sizeof(z_stream)));
STOP(_("Failed to allocated %d bytes for threads_streams."), (int)(nth * sizeof(z_stream))); // # nocov

// compute zbuffSize which is the same for each thread
z_stream *stream = thread_streams;
if (init_stream(stream) != Z_OK)
STOP(_("Can't init stream structure for deflateBound"));

Suggested change
STOP(_("Can't init stream structure for deflateBound"));
STOP(_("Can't init stream structure for deflateBound")); // # nocov

}
z_stream *stream = thread_streams;
if (init_stream(stream) != Z_OK)
STOP(_("Can't init stream structure for writing header"));

Suggested change
STOP(_("Can't init stream structure for writing header"));
STOP(_("Can't init stream structure for writing header")); // # nocov

@MichaelChirico (Member)

Ah, apologies @philippechataignon, I may draw this out just a wee bit longer 🙃

I think our code coverage suite was not up & running when you first posted this PR -- it's back now.

Would you mind taking a look through the report and suggesting which lines could reasonably be covered by new tests, and which should just get the # nocov treatment? I've suggested a few likely # nocov candidates already.
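
For context, a minimal illustration of the marker (the snippet below is illustrative, not from the PR): covr excludes lines carrying a # nocov comment from the coverage report, which suits branches like allocation failure that tests cannot reliably trigger.

#include <stdlib.h>

static void *safe_malloc(size_t n) {
  void *p = malloc(n);
  if (!p)
    abort(); // # nocov -- allocation failure can't be forced in tests
  return p;
}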
