From cde7333938a590a3cddbda1b02103650e2f55d15 Mon Sep 17 00:00:00 2001 From: Michael Chirico Date: Sat, 13 Jan 2024 01:10:42 +0800 Subject: [PATCH] Add an illustrative example to ?GForce when sorting locale matters (#5342) * improve documentation for GForce where sorting affects the result * link issue * tests * typo * mention Sys.setlocale * obsolete comment * 1.15.0 on CRAN. Bump to 1.15.99 * Fix transform slowness (#5493) * Fix 5492 by limiting the costly deparse to `nlines=1` * Implementing PR feedbacks * Added inside * Fix typo in name * Idiomatic use of inside * Separating the deparse line limit to a different PR --------- Co-authored-by: Michael Chirico * Improvements to the introductory vignette (#5836) * Added my improvements to the intro vignette * Removed two lines I added extra as a mistake earlier * Requested changes * Vignette typo patch (#5402) * fix typos and grammatical mistakes * fix typos and punctuation * remove double spaces where it wasn't necessary * fix typos and adhere to British English spelling * fix typos * fix typos * add missing closing bracket * fix typos * review fixes * Update vignettes/datatable-benchmarking.Rmd Co-authored-by: Michael Chirico * Update vignettes/datatable-benchmarking.Rmd Co-authored-by: Michael Chirico * Apply suggestions from code review benchmarking Co-authored-by: Michael Chirico * remove unnecessary [ ] from datatable-keys-fast-subset.Rmd * Update vignettes/datatable-programming.Rmd Co-authored-by: Michael Chirico * Update vignettes/datatable-reshape.Rmd Co-authored-by: Michael Chirico * One last batch of fine-tuning --------- Co-authored-by: Michael Chirico Co-authored-by: Michael Chirico * fix bad merge * Improved handling of list columns with NULL entries (#4250) * Updated documentation for rbindlist(fill=TRUE) * Print NULL entries of list as NULL * Added news item * edit NEWS, use '[NULL]' not 'NULL' * fix test * split NEWS item * add example --------- Co-authored-by: Michael Chirico Co-authored-by: Michael Chirico Co-authored-by: Benjamin Schwendinger * clarify that list input->unnamed list output (#5383) * clarify that list input->unnamed list output * Add example where make.names is used * mention role of make.names * revert from next release branch * manual merge NEWS * manual rebase tests * manual rebase data.table.R * clarify 0 turns off everything --------- Co-authored-by: Ofek Co-authored-by: Ani Co-authored-by: David Budzynski <56514985+davidbudzynski@users.noreply.github.com> Co-authored-by: Scott Ritchie Co-authored-by: Benjamin Schwendinger --- NEWS.md | 2 ++ R/data.table.R | 6 +++--- inst/tests/tests.Rraw | 2 +- man/datatable-optimize.Rd | 24 ++++++++++++++++++++++++ 4 files changed, 30 insertions(+), 4 deletions(-) diff --git a/NEWS.md b/NEWS.md index 5b91ecadd..fa7deb3b9 100644 --- a/NEWS.md +++ b/NEWS.md @@ -617,4 +617,6 @@ 19. In the NEWS for v1.11.0 (May 2018), section "NOTICE OF INTENDED FUTURE POTENTIAL BREAKING CHANGES" item 2, we stated the intention to eventually change `logical01` to be `TRUE` by default. After some consideration, reflection, and community input, we have decided not to move forward with this plan, i.e., `logical01` will remain `FALSE` by default in both `fread()` and `fwrite()`. See discussion in #5856; most importantly, we think changing the default would be a major disruption to reading "sharded" CSVs where data with the same schema is split into many files, some of which could be converted to `logical` while others remain `integer`. We will retain the option `datatable.logical01` for users who wish to use a different default -- for example, if you are doing input/output on tables with a large number of logical columns, where writing '0'/'1' to the CSV many millions of times is preferable to writing 'TRUE'/'FALSE'. +20. Some clarity is added to `?GForce` for the case when subtle changes to `j` produce different results because of differences in locale. Because `data.table` _always_ uses the "C" locale, small changes to queries which activate/deactivate GForce might cause confusingly different results when sorting is involved, [#5331](https://github.com/Rdatatable/data.table/issues/5331). The inspirational example compared `DT[, .(max(a), max(b)), by=grp]` and `DT[, .(max(a), max(tolower(b))), by=grp]` -- in the latter case, GForce is deactivated owing to the _ad-hoc_ column, so the result for `max(a)` might differ for the two queries. An example is added to `?GForce`. As always, there are several options to guarantee consistency, for example (1) use namespace qualification to deactivate GForce: `DT[, .(base::max(a), base::max(b)), by=grp]`; (2) turn off all optimizations with `options(datatable.optimize = 0)`; or (3) set your R session to always sort in C locale with `Sys.setlocale("LC_COLLATE", "C")` (or temporarily with e.g. `withr::with_locale()`). Thanks @markseeto for the example and @michaelchirico for the improved documentation. + # data.table v1.14.10 (Dec 2023) back to v1.10.0 (Dec 2016) has been moved to [NEWS.1.md](https://github.com/Rdatatable/data.table/blob/master/NEWS.1.md) diff --git a/R/data.table.R b/R/data.table.R index 3e003a03b..a7009399f 100644 --- a/R/data.table.R +++ b/R/data.table.R @@ -1733,7 +1733,7 @@ replace_dot_alias = function(e) { GForce = FALSE if ( ((is.name(jsub) && jsub==".N") || (jsub %iscall% 'list' && length(jsub)==2L && jsub[[2L]]==".N")) && !length(lhs) ) { GForce = TRUE - if (verbose) catf("GForce optimized j to '%s'\n",deparse(jsub, width.cutoff=200L, nlines=1L)) + if (verbose) catf("GForce optimized j to '%s' (see ?GForce)\n",deparse(jsub, width.cutoff=200L, nlines=1L)) } } else if (length(lhs) && is.symbol(jsub)) { # turn off GForce for the combination of := and .N GForce = FALSE @@ -1818,8 +1818,8 @@ replace_dot_alias = function(e) { if (length(jsub) == 2L && jsub[[1L]] %chin% c("head", "tail")) jsub[["n"]] = 6L jsub = gforce_jsub(jsub, names_x) } - if (verbose) catf("GForce optimized j to '%s'\n", deparse(jsub, width.cutoff=200L, nlines=1L)) - } else if (verbose) catf("GForce is on, left j unchanged\n"); + if (verbose) catf("GForce optimized j to '%s' (see ?GForce)\n", deparse(jsub, width.cutoff=200L, nlines=1L)) + } else if (verbose) catf("GForce is on, but not activated for this query; left j unchanged (see ?GForce)\n"); } } if (!GForce && !is.name(jsub)) { diff --git a/inst/tests/tests.Rraw b/inst/tests/tests.Rraw index d3b9e2095..935d63825 100644 --- a/inst/tests/tests.Rraw +++ b/inst/tests/tests.Rraw @@ -14918,7 +14918,7 @@ test(2041.2, DT[, median(time), by=g], DT[c(2,5),.(g=g, V1=time)]) # They could run in level 1 with level 2 off, but output= would need to be changed and there's no need. test(2042.1, DT[ , as.character(mean(date)), by=g, verbose=TRUE ], data.table(g=c("a","b"), V1=c("2018-01-04","2018-01-21")), - output=msg<-"GForce is on, left j unchanged.*Old mean optimization is on, left j unchanged") + output=msg<-"GForce is on, but not activated.*Old mean optimization is on, left j unchanged") # Since %b is e.g. "janv." in LC_TIME=fr_FR.UTF-8 locale, we need to # have the target/y value in these tests depend on the locale as well, #3450. Jan.2018 = format(strptime("2018-01-01", "%Y-%m-%d"), "%b-%Y") diff --git a/man/datatable-optimize.Rd b/man/datatable-optimize.Rd index b4ba0f2f8..9ce7f308f 100644 --- a/man/datatable-optimize.Rd +++ b/man/datatable-optimize.Rd @@ -21,6 +21,10 @@ of these optimisations. They happen automatically. Run the code under the \emph{example} section to get a feel for the performance benefits from these optimisations. +Note that for all optimizations involving efficient sorts, the caveat mentioned +in \code{\link{setorder}} applies -- whenever data.table does the sorting, +it does so in "C-locale". This has some subtle implications; see Examples. + } \details{ \code{data.table} reads the global option \code{datatable.optimize} to figure @@ -101,6 +105,8 @@ indices set global option \code{options(datatable.use.index = FALSE)}. \seealso{ \code{\link{setNumericRounding}}, \code{\link{getNumericRounding}} } \examples{ \dontrun{ +old = options(datatable.optimize = Inf) + # Generate a big data.table with a relatively many columns set.seed(1L) DT = lapply(1:20, function(x) sample(c(-100:100), 5e6L, TRUE)) @@ -151,6 +157,24 @@ system.time(ans1 <- DT[id == 100L]) # index + binary search subset system.time(ans2 <- DT[id == 100L]) # only binary search subset system.time(DT[id \%in\% 100:500]) # only binary search subset again +# sensitivity to collate order +old_lc_collate = Sys.getlocale("LC_COLLATE") + +if (old_lc_collate == "C") { + Sys.setlocale("LC_COLLATE", "") +} +DT = data.table( + grp = rep(1:2, each = 4L), + var = c("A", "a", "0", "1", "B", "b", "0", "1") +) +options(datatable.optimize = Inf) +DT[, .(max(var), min(var)), by=grp] +# GForce is deactivated because of the ad-hoc column 'tolower(var)', +# through which the result for 'max(var)' may also change +DT[, .(max(var), min(tolower(var))), by=grp] + +Sys.setlocale("LC_COLLATE", old_lc_collate) +options(old) }} \keyword{ data }