Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix fwrite length for gzip output #6393

Open
wants to merge 46 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 45 commits
Commits
Show all changes
46 commits
Select commit Hold shift + click to select a range
fcc10d7
fwrite with correct file length
philippechataignon Aug 22, 2024
97c919b
Escape with NOZLIB for compilation succeed without zlib
philippechataignon Aug 23, 2024
4914d7a
Move zlib check at start to avoid oufile deletion
philippechataignon Aug 23, 2024
d9e9e92
Indent and add comments
philippechataignon Aug 23, 2024
c8970e1
Buffers unification
philippechataignon Aug 23, 2024
2fc184e
Restore schedule(dynamic) more efficient and progress
philippechataignon Aug 23, 2024
47c16f2
Use alloc_size to see allocation when verbose
philippechataignon Aug 23, 2024
3dcb7d6
Test if stream init succeded
philippechataignon Aug 23, 2024
6569cfe
Add cast to avoid warnings on Windows
philippechataignon Aug 23, 2024
bae11e3
More explicit timing messages
philippechataignon Aug 23, 2024
5ce31f3
Free stream structs
philippechataignon Aug 24, 2024
b07306f
Add option to control compression level for fwrite with gzip
ben-schwen Nov 3, 2022
8722519
Rework namings and default value
ben-schwen Nov 15, 2022
0800f10
Rename gzipLevel to compressLevel
philippechataignon Aug 24, 2024
a0a9c1f
compressLevel param documentation
philippechataignon Aug 24, 2024
a00761d
Put zlib initialization together
philippechataignon Aug 25, 2024
aaf85ab
Refact buffSize, numBatchs and numBatches
philippechataignon Aug 25, 2024
478b862
Add missing NOZLIB
philippechataignon Aug 25, 2024
b09aa34
Increase outputs in last message when verbose
philippechataignon Aug 25, 2024
669eb01
No real init for stream_thread when is_gzip false
philippechataignon Aug 25, 2024
5668431
Minor corrections
philippechataignon Aug 25, 2024
e9e2e83
Uses %zu format for size_t
philippechataignon Aug 25, 2024
7fb8738
Last verbose msg was not printed when not is_gzip
philippechataignon Aug 25, 2024
024a34d
minor operator ws change
MichaelChirico Aug 30, 2024
2fe3099
Add test for compressLevel=1
philippechataignon Sep 2, 2024
e9f2861
Add url link in compressLevel documentation
philippechataignon Sep 2, 2024
6a75749
Add 2 lines in NEWS for fwrite fix and compressLevel
philippechataignon Sep 2, 2024
4936e45
tidy-up, expand NEWS for compressLevel
MichaelChirico Sep 2, 2024
6b76bea
Use match.arg() for arg validation
MichaelChirico Sep 2, 2024
eede93f
add a test for the other extreme compressLevel=9
MichaelChirico Sep 2, 2024
e39259c
partial test fix
MichaelChirico Sep 2, 2024
182432b
fix updated test errors
MichaelChirico Sep 2, 2024
6999dc6
confirmed NEWS wording, fix typo
MichaelChirico Sep 2, 2024
47464db
fix order
MichaelChirico Sep 2, 2024
e2e7022
weak ordering
MichaelChirico Sep 2, 2024
c4971f5
Merge branch 'master' into fix_fwrite_length
MichaelChirico Sep 2, 2024
6bebfc2
place in 1.17.0 NEWS
MichaelChirico Sep 2, 2024
5687a0c
Add parenthesis to be more explicit
philippechataignon Sep 3, 2024
117ab45
Add comment for DeflateInit2
philippechataignon Sep 3, 2024
4e91a21
typo
MichaelChirico Sep 3, 2024
45bf1d4
Merge branch 'master' into fix_fwrite_length
MichaelChirico Sep 3, 2024
255f1ce
Add parenthesis to be more explicit (2)
philippechataignon Sep 4, 2024
cdf4277
Merge branch 'master' into fix_fwrite_length
MichaelChirico Sep 27, 2024
309da8f
Try to emphasize that '-' is "command flag hyphen", not "negative"
MichaelChirico Dec 3, 2024
3630413
Merge branch 'master' into fix_fwrite_length
MichaelChirico Dec 3, 2024
5c57eba
Convert Toby'd comment to atime_test()
MichaelChirico Dec 3, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 6 additions & 2 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -65,6 +65,10 @@ rowwiseDT(

4. `patterns()` in `melt()` combines correctly with user-defined `cols=`, which can be useful to specify a subset of columns to reshape without having to use a regex, for example `patterns("2", cols=c("y1", "y2"))` will only give `y2` even if there are other columns in the input matching `2`, [#6498](https://github.com/Rdatatable/data.table/issues/6498). Thanks to @hongyuanjia for the report, and to @tdhock for the PR.

5. `fwrite()` with `compress="gzip"` produces compatible gz files when composed of multiple independent chunks owing to parallelization, [#6356](https://github.com/Rdatatable/data.table/issues/6356). Earlier `fwrite()` versions could have issues with HTTP upload using `Content-Encoding: gzip` and `Transfer-Encoding: chunked`. Thanks to @oliverfoster for report and @philippechataignon for the fix.

6. `fwrite()` gains a new parameter `compressLevel` to control compression level for gzip, [#5506](https://github.com/Rdatatable/data.table/issues/5506). This parameter balances compression speed and total compression, and corresponds directly to the analogous command-line parameter, e.g. `compressLevel=4` corresponds to passing `-4`; the default, `6`, matches the command-line default, i.e. equivalent to passing `-6`. Thanks @mgarbuzov for the request and @philippechataignon for implementing.

## BUG FIXES

1. `fwrite()` respects `dec=','` for timestamp columns (`POSIXct` or `nanotime`) with sub-second accuracy, [#6446](https://github.com/Rdatatable/data.table/issues/6446). Thanks @kav2k for pointing out the inconsistency and @MichaelChirico for the PR.
Expand Down Expand Up @@ -268,7 +272,7 @@ rowwiseDT(

5. Input files are now kept open during `mmap()` when running under Emscripten, [emscripten-core/emscripten#20459](https://github.com/emscripten-core/emscripten/issues/20459). This avoids an error in `fread()` when running in WebAssembly, [#5969](https://github.com/Rdatatable/data.table/issues/5969). Thanks to @maek-ies for the report and @georgestagg for the PR.

6. `dcast()` improves behavior for the situation that the `fun.aggregate` value of `length()` is used but not provided by the user.
6. `dcast()` improves behavior for the situation that the `fun.aggregate` value of `length()` is used but not provided by the user.

a. This now triggers a warning, not a message, since relying on this default often signals unexpected duplicates in the data, [#5386](https://github.com/Rdatatable/data.table/issues/5386). The warning is classed as `dt_missing_fun_aggregate_warning`, allowing for more targeted handling in user code. Thanks @MichaelChirico for the suggestion and @Nj221102 for the fix.

Expand Down Expand Up @@ -983,7 +987,7 @@ rowwiseDT(

14. The options `datatable.print.class` and `datatable.print.keys` are now `TRUE` by default. They have been available since v1.9.8 (Nov 2016) and v1.11.0 (May 2018) respectively.

15. Thanks to @ssh352, Václav Tlapák, Cole Miller, András Svraka and Toby Dylan Hocking for reporting and bisecting a significant performance regression in dev. This was fixed before release thanks to a PR by Jan Gorecki, [#5463](https://github.com/Rdatatable/data.table/pull/5463).
15. Thanks to @ssh352, Václav Tlapák, Cole Miller, András Svraka and Toby Dylan Hocking for reporting and bisecting a significant performance regression in dev. This was fixed before release thanks to a PR by Jan Gorecki, [#5463](https://github.com/Rdatatable/data.table/pull/5463).

16. `key(x) <- value` is now fully deprecated (from warning to error). Use `setkey()` to set a table's key. We started warning not to use this approach in 2012, with a stronger warning starting in 2019 (1.12.2). This function will be removed in the next release.

Expand Down
20 changes: 11 additions & 9 deletions R/fwrite.R
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,7 @@ fwrite = function(x, file="", append=FALSE, quote="auto",
buffMB=8, nThread=getDTthreads(verbose),
showProgress=getOption("datatable.showProgress", interactive()),
compress = c("auto", "none", "gzip"),
compressLevel = 6L,
yaml = FALSE,
bom = FALSE,
verbose=getOption("datatable.verbose", FALSE),
Expand All @@ -18,12 +19,10 @@ fwrite = function(x, file="", append=FALSE, quote="auto",
if (length(encoding) != 1L || !encoding %chin% c("", "UTF-8", "native")) {
stopf("Argument 'encoding' must be '', 'UTF-8' or 'native'.")
}
if (missing(qmethod)) qmethod = qmethod[1L]
if (missing(compress)) compress = compress[1L]
if (missing(dateTimeAs)) { dateTimeAs = dateTimeAs[1L] }
else if (length(dateTimeAs)>1L) stopf("dateTimeAs must be a single string")
dateTimeAs = chmatch(dateTimeAs, c("ISO","squash","epoch","write.csv"))-1L
if (is.na(dateTimeAs)) stopf("dateTimeAs must be 'ISO','squash','epoch' or 'write.csv'")
qmethod = match.arg(qmethod)
compress = match.arg(compress)
dateTimeAs = match.arg(dateTimeAs)
dateTimeAs = chmatch(dateTimeAs, c("ISO", "squash", "epoch", "write.csv"))-1L
if (!missing(logical01) && !missing(logicalAsInt))
stopf("logicalAsInt has been renamed logical01. Use logical01 only, not both.")
if (!missing(logicalAsInt)) {
Expand All @@ -34,6 +33,7 @@ fwrite = function(x, file="", append=FALSE, quote="auto",
scipen = if (is.numeric(scipen)) as.integer(scipen) else 0L
buffMB = as.integer(buffMB)
nThread = as.integer(nThread)
compressLevel = as.integer(compressLevel)
# write.csv default is 'double' so fwrite follows suit. write.table's default is 'escape'
# validate arguments
if (is.matrix(x)) { # coerce to data.table if input object is matrix
Expand All @@ -46,7 +46,8 @@ fwrite = function(x, file="", append=FALSE, quote="auto",
x = as.data.table(x)
}
}
stopifnot(is.list(x),
stopifnot(
is.list(x),
identical(quote,"auto") || isTRUEorFALSE(quote),
is.character(sep) && length(sep)==1L && (nchar(sep) == 1L || identical(sep, "")),
is.character(sep2) && length(sep2)==3L && nchar(sep2[2L])==1L,
Expand All @@ -55,14 +56,15 @@ fwrite = function(x, file="", append=FALSE, quote="auto",
is.character(eol) && length(eol)==1L,
length(qmethod) == 1L && qmethod %chin% c("double", "escape"),
length(compress) == 1L && compress %chin% c("auto", "none", "gzip"),
length(compressLevel) == 1L && 0 <= compressLevel && compressLevel <= 9,
isTRUEorFALSE(col.names), isTRUEorFALSE(append), isTRUEorFALSE(row.names),
isTRUEorFALSE(verbose), isTRUEorFALSE(showProgress), isTRUEorFALSE(logical01),
isTRUEorFALSE(bom),
length(na) == 1L, #1725, handles NULL or character(0) input
is.character(file) && length(file)==1L && !is.na(file),
length(buffMB)==1L && !is.na(buffMB) && 1L<=buffMB && buffMB<=1024L,
length(nThread)==1L && !is.na(nThread) && nThread>=1L
)
)

is_gzip = compress == "gzip" || (compress == "auto" && endsWithAny(file, ".gz"))

Expand Down Expand Up @@ -119,7 +121,7 @@ fwrite = function(x, file="", append=FALSE, quote="auto",
file = enc2native(file) # CfwriteR cannot handle UTF-8 if that is not the native encoding, see #3078.
.Call(CfwriteR, x, file, sep, sep2, eol, na, dec, quote, qmethod=="escape", append,
row.names, col.names, logical01, scipen, dateTimeAs, buffMB, nThread,
showProgress, is_gzip, bom, yaml, verbose, encoding)
showProgress, is_gzip, compressLevel, bom, yaml, verbose, encoding)
invisible()
}

Expand Down
31 changes: 18 additions & 13 deletions inst/tests/tests.Rraw
Original file line number Diff line number Diff line change
Expand Up @@ -159,6 +159,9 @@ TZnotUTC = !identical(tt,"") && !is_utc(tt)
# (3) function factory for matching messages exactly by substituting anything between delimiters [delim, fmt=TRUE]
# (4) function factory for matching messages exactly by substituting a generic string [fmt=string]
get_msg = function(e, delim, fmt=FALSE) {
ufq = options(useFancyQuotes = FALSE) # otherwise we get angled quotes, hard to match robustly
on.exit(options(ufq))

condition = tryCatch({e; NULL}, error=identity, warning=identity)
if (is.null(condition)) return(condition)
msg = condition$message
Expand All @@ -170,17 +173,13 @@ get_msg = function(e, delim, fmt=FALSE) {
sprintf("%s%s%s", delim[1L], if (fmt) "%s" else ".+", delim[2L]),
msg
)
if (fmt) return(function(x) sprintf(msg, x))
if (fmt) return(function(...) sprintf(msg, ...))
return(msg)
}
base_messages = list(
missing_object = get_msg(`__dt_test_missing_` + 1, "'", fmt=TRUE),
missing_function = get_msg(`__dt_test_missing_`(), '"', fmt=TRUE),
missing_coerce_method = get_msg(delim = '"', {
old = options(useFancyQuotes = FALSE) # otherwise we get angled quotes, hard to match robustly
on.exit(options(old))
methods::as(TRUE, 'foo')
}),
missing_coerce_method = get_msg(methods::as(TRUE, 'foo'), delim = '"'),
missing_dispatch_method = get_msg(conditionMessage(structure(1, class="foo")), '[\'"]'),
invalid_arg_unary_operator = get_msg(-'a'),
invalid_arg_binary_operator = get_msg(1 + 'a'),
Expand All @@ -199,6 +198,8 @@ base_messages = list(
stopifnot = get_msg(stopifnot(FALSE), fmt="FALSE"),
not_yet_used = get_msg(.NotYetUsed("abc"), "'", fmt=TRUE), # NB: need fmt= because the English message has '(yet)' --> parens in regex
ambiguous_date_fmt = get_msg(as.Date('xxx')),
match_arg_length = get_msg(match.arg(c('a', 'b'), letters)),
match_arg_4_choices = get_msg(match.arg('e', letters[1:4]), delim='"', fmt=TRUE),
NULL
)

Expand Down Expand Up @@ -9981,7 +9982,7 @@ test(1658.27, fwrite(DT, na="NA", verbose=TRUE), output='Writing bom .false., ya
test(1658.28, fwrite(ok_dt, 1), error=base_messages$stopifnot("is.character(file) && length(file) == 1L && !is.na(file)"))
test(1658.29, fwrite(ok_dt, quote=123), error="identical\\(quote.*auto.*FALSE.*TRUE")
test(1658.30, fwrite(ok_dt, sep="..."), error="nchar(sep)")
test(1658.31, fwrite(ok_dt, qmethod=c("double", "double")), error="length(qmethod)")
test(1658.31, fwrite(ok_dt, qmethod=c("double", "double")), error=base_messages$match_arg_length)
test(1658.32, fwrite(ok_dt, col.names="foobar"), error="isTRUEorFALSE(col.names)")

# null data table (no columns)
Expand Down Expand Up @@ -10023,8 +10024,12 @@ if (!haszlib()) {
test(1658.423, file.info(f1)$size < file.info(f2)$size) # 74 < 804 (file.size() isn't available in R 3.1.0)
if (test_R.utils) test(1658.43, fread(f1), DT) # use fread to decompress gz (works cross-platform)
fwrite(DT, file=f3<-tempfile(), compress="gzip") # compress to filename not ending .gz
fwrite(DT, file=f4<-tempfile(), compress="gzip", compressLevel=1) # test compressLevel
fwrite(DT, file=f5<-tempfile(), compress="gzip", compressLevel=9)
test(1658.441, file.info(f3)$size, file.info(f1)$size)
unlink(c(f1,f2,f3))
test(1658.442, file.info(f4)$size >= file.info(f1)$size)
test(1658.443, file.info(f1)$size >= file.info(f5)$size)
unlink(c(f1,f2,f3,f4,f5))
}
DT = data.table(a=1:3, b=list(1:4, c(3.14, 100e10), c("foo", "bar", "baz")))
test(1658.45, fwrite(DT), output=c("a,b","1,1|2|3|4","2,3.14|1e+12","3,foo|bar|baz"))
Expand Down Expand Up @@ -10947,9 +10952,9 @@ DT = data.table(
D = as.POSIXct(dt<-paste(d,t), tz="UTC"),
E = as.POSIXct(paste0(dt,c(".999",".0",".5",".111112",".123456",".023",".0",".999999",".99",".0009")), tz="UTC"))

test(1740.0, fwrite(DT,dateTimeAs="iso"), error="dateTimeAs must be 'ISO','squash','epoch' or 'write.csv'")
test(1740.1, fwrite(DT,dateTimeAs=c("ISO","squash")), error="dateTimeAs must be a single string")
test(1740.2, capture.output(fwrite(DT,dateTimeAs="ISO")), c(
test(1740.1, fwrite(DT,dateTimeAs="iso"), error=base_messages$match_arg_4_choices("ISO", "squash", "epoch", "write.csv"))
test(1740.2, fwrite(DT,dateTimeAs=c("ISO","squash")), error=base_messages$match_arg_length)
test(1740.3, capture.output(fwrite(DT,dateTimeAs="ISO")), c(
"A,B,C,D,E",
"1907-10-21,1907-10-21,23:59:59,1907-10-21T23:59:59Z,1907-10-21T23:59:59.999Z",
"1907-10-22,1907-10-22,00:00:00,1907-10-22T00:00:00Z,1907-10-22T00:00:00Z",
Expand All @@ -10961,7 +10966,7 @@ test(1740.2, capture.output(fwrite(DT,dateTimeAs="ISO")), c(
"1999-12-31,1999-12-31,01:23:45,1999-12-31T01:23:45Z,1999-12-31T01:23:45.999999Z",
"2000-02-29,2000-02-29,23:59:59,2000-02-29T23:59:59Z,2000-02-29T23:59:59.990Z",
"2016-09-12,2016-09-12,01:30:30,2016-09-12T01:30:30Z,2016-09-12T01:30:30.000900Z"))
test(1740.3, capture.output(fwrite(DT,dateTimeAs="squash")), c(
test(1740.4, capture.output(fwrite(DT,dateTimeAs="squash")), c(
"A,B,C,D,E",
"19071021,19071021,235959,19071021235959000,19071021235959999",
"19071022,19071022,000000,19071022000000000,19071022000000000",
Expand All @@ -10973,7 +10978,7 @@ test(1740.3, capture.output(fwrite(DT,dateTimeAs="squash")), c(
"19991231,19991231,012345,19991231012345000,19991231012345999",
"20000229,20000229,235959,20000229235959000,20000229235959990",
"20160912,20160912,013030,20160912013030000,20160912013030000"))
test(1740.4, capture.output(fwrite(DT,dateTimeAs="epoch")), c(
test(1740.5, capture.output(fwrite(DT,dateTimeAs="epoch")), c(
"A,B,C,D,E",
"-22718,-22718,86399,-1962748801,-1962748800.001",
"-22717,-22717,0,-1962748800,-1962748800",
Expand Down
2 changes: 2 additions & 0 deletions man/fwrite.Rd
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,7 @@ fwrite(x, file = "", append = FALSE, quote = "auto",
buffMB = 8L, nThread = getDTthreads(verbose),
showProgress = getOption("datatable.showProgress", interactive()),
compress = c("auto", "none", "gzip"),
compressLevel = 6L,
yaml = FALSE,
bom = FALSE,
verbose = getOption("datatable.verbose", FALSE),
Expand Down Expand Up @@ -58,6 +59,7 @@ fwrite(x, file = "", append = FALSE, quote = "auto",
\item{nThread}{The number of threads to use. Experiment to see what works best for your data on your hardware.}
\item{showProgress}{ Display a progress meter on the console? Ignored when \code{file==""}. }
\item{compress}{If \code{compress = "auto"} and if \code{file} ends in \code{.gz} then output format is gzipped csv else csv. If \code{compress = "none"}, output format is always csv. If \code{compress = "gzip"} then format is gzipped csv. Output to the console is never gzipped even if \code{compress = "gzip"}. By default, \code{compress = "auto"}.}
\item{compressLevel}{Level of compression between 1 and 9, 6 by default. See \url{https://linux.die.net/man/1/gzip} for details.}
\item{yaml}{If \code{TRUE}, \code{fwrite} will output a CSVY file, that is, a CSV file with metadata stored as a YAML header, using \code{\link[yaml]{as.yaml}}. See \code{Details}. }
\item{bom}{If \code{TRUE} a BOM (Byte Order Mark) sequence (EF BB BF) is added at the beginning of the file; format 'UTF-8 with BOM'.}
\item{verbose}{Be chatty and report timings?}
Expand Down
2 changes: 1 addition & 1 deletion src/data.table.h
Original file line number Diff line number Diff line change
Expand Up @@ -294,7 +294,7 @@ SEXP chmatch_R(SEXP, SEXP, SEXP);
SEXP chmatchdup_R(SEXP, SEXP, SEXP);
SEXP chin_R(SEXP, SEXP);
SEXP freadR(SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP);
SEXP fwriteR(SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP);
SEXP fwriteR(SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP, SEXP);
SEXP rbindlist(SEXP, SEXP, SEXP, SEXP, SEXP);
SEXP setlistelt(SEXP, SEXP, SEXP);
SEXP address(SEXP);
Expand Down
Loading
Loading