Fix fwrite length for gzip output #6393
base: master
Conversation
* gzip length and crc are manually computed in each thread and then added/combined
* gzip header is minimal
* remove some old debug code
Generated via commit 5c57eba. Download link for the artifact containing the test results: atime-results.zip
You're right, and this PR version stores the length modulo 2**32 as requested, but it's not the right size.
Put PR #5513 in this PR with new param compressLevel.
Thanks Toby, I had looked at the .ci/atime/tests.R script and some {atime} documentation directly and didn't think to check the Wiki. Should we maybe (1) migrate that documentation into .ci/atime directly, (2) add .ci/atime/README.md pointing to the Wiki, or (3) point to the Wiki from the first line of .ci/atime/tests.R?
@philippechataignon do you want to have a go at adding an atime performance regression test? Totally fine if not -- what would help at least would be a simple benchmark of gzipped fwrite that you think would capture the important pieces of what's changed here. Does that make sense?
Yes, it would be great to point to the Wiki from the first line of .ci/atime/tests.R. I would suggest keeping docs on the wiki, which is easier to update and can include screenshots/graphics.
OK for testing regression, but notice that the core of fwrite hasn't changed: same buffer sizes, same number of jobs, same number of rows per job. Personally I observe timings similar to the previous version. One point of discussion: I notice that #2020 introduced a change that I had never realized before this PR. For testing its impact, I have this little program:
With scipen = 0
With scipen = 999
In the last case the real mean line length is ~5000 but it is estimated at 761026. The compression ratio is higher because the buffers are barely used. Surprisingly, timing is better despite the OpenMP thread overhead. I use this little bench for the scipen impact and I think it can be used for atime. I've tried to add this:

but I'm not sure that /dev/null is portable, and if we write to a real file, that skews the timing. OK for another test to continue and check that there is no time regression.
(2) and (3) sound good to me.
Should I go ahead and make a PR for this quick addition?
I agree, both for being able to include images and in case we miss something that other people notice: they should be able to fill in points quickly.
this only has to run on github actions ubuntu vm, so /dev/null should be ok in principle, but I changed it to tempfile() which should be fine too. Thanks for sharing your code for scipen benchmarking. I adapted it to get the following atime result, which indicates little to no impact on computation time, but a small constant-factor increase in memory usage.

edit.data.table = function(old.Package, new.Package, sha, new.pkg.path) {
pkg_find_replace <- function(glob, FIND, REPLACE) {
atime::glob_find_replace(file.path(new.pkg.path, glob), FIND, REPLACE)
}
Package_regex <- gsub(".", "_?", old.Package, fixed = TRUE)
Package_ <- gsub(".", "_", old.Package, fixed = TRUE)
new.Package_ <- paste0(Package_, "_", sha)
pkg_find_replace(
"DESCRIPTION",
paste0("Package:\\s+", old.Package),
paste("Package:", new.Package))
pkg_find_replace(
file.path("src", "Makevars.*in"),
Package_regex,
new.Package_)
pkg_find_replace(
file.path("R", "onLoad.R"),
Package_regex,
new.Package_)
pkg_find_replace(
file.path("R", "onLoad.R"),
sprintf('packageVersion\\("%s"\\)', old.Package),
sprintf('packageVersion\\("%s"\\)', new.Package))
pkg_find_replace(
file.path("src", "init.c"),
paste0("R_init_", Package_regex),
paste0("R_init_", gsub("[.]", "_", new.Package_)))
pkg_find_replace(
"NAMESPACE",
sprintf('useDynLib\\("?%s"?', Package_regex),
paste0('useDynLib(', new.Package_))
}
out.csv <- tempfile()
issue6393 <- atime::atime_versions(
"~/R/data.table",
N = 2^seq(1, 20),
pkg.edit.fun=edit.data.table,
setup = {
set.seed(1)
NC = 10
L <- data.table(i=1:N)
L[, paste0("V", 1:NC) := replicate(NC, rnorm(N), simplify=FALSE)]
},
expr = {
data.table::fwrite(L, out.csv, compress="gzip")
},
Fast="f339aa64c426a9cd7cf2fcb13d91fc4ed353cd31", # Parent of the first commit https://github.com/Rdatatable/data.table/commit/fcc10d73a20837d0f1ad3278ee9168473afa5ff1 in the PR https://github.com/Rdatatable/data.table/pull/6393/commits with major change to fwrite with gzip.
PR = "117ab45674f1e56304abca83f9f0df50ab0274be") # Close-to-last merge commit in the PR.
plot(issue6393)
Co-authored-by: Michael Chirico <[email protected]>
Codecov Report. Attention: Patch coverage is
Additional details and impacted files:

@@ Coverage Diff @@
## master #6393 +/- ##
==========================================
- Coverage 98.61% 98.54% -0.08%
==========================================
Files 79 79
Lines 14536 14591 +55
==========================================
+ Hits 14334 14378 +44
- Misses 202 213 +11

☔ View full report in Codecov by Sentry.
Thanks again @philippechataignon! I still can't say I've understood the C changes thoroughly, but they pass the existing suite and we have user reports it is working. I am happy to submit now and see if revdep checks tell us anything new. 🚀
  DTPRINT(_("Allocate %zu bytes for thread_streams\n"), nth * sizeof(z_stream));
}
if (!thread_streams)
  STOP(_("Failed to allocated %d bytes for threads_streams."), (int)(nth * sizeof(z_stream)));

Suggested change:
  STOP(_("Failed to allocated %d bytes for threads_streams."), (int)(nth * sizeof(z_stream))); // # nocov
// compute zbuffSize which is the same for each thread
z_stream *stream = thread_streams;
if (init_stream(stream) != Z_OK)
  STOP(_("Can't init stream structure for deflateBound"));

Suggested change:
  STOP(_("Can't init stream structure for deflateBound")); // # nocov
}
z_stream *stream = thread_streams;
if (init_stream(stream) != Z_OK)
  STOP(_("Can't init stream structure for writing header"));

Suggested change:
  STOP(_("Can't init stream structure for writing header")); // # nocov
Ah, apologies @philippechataignon, I may draw this out just a wee bit longer 🙃 I think our code coverage suite was not up & running when you first posted this PR -- it's back now. Would you mind taking a look through the report and suggesting which lines could reasonably be covered by new tests, and which should just get # nocov?
Closes #6356. Closes #5506.

This PR is an attempt to create a better gzip file with fwrite. It's an important rewrite because it includes some refactoring of the current code.

zlib C code:
* #pragma omp parallel for for the chunk loop and #pragma omp ordered for the writing and summarizing part.
* malloc calls occur early and there is no need for a header buffer.
* A lot of work remains. Use of the indent command?