From cde7333938a590a3cddbda1b02103650e2f55d15 Mon Sep 17 00:00:00 2001
From: Michael Chirico <michaelchirico4@gmail.com>
Date: Sat, 13 Jan 2024 01:10:42 +0800
Subject: [PATCH] Add an illustrative example to ?GForce when sorting locale
 matters (#5342)

* improve documentation for GForce where sorting affects the result

* link issue

* tests

* typo

* mention Sys.setlocale

* obsolete comment

* 1.15.0 on CRAN. Bump to 1.15.99

* Fix transform slowness (#5493)

* Fix 5492 by limiting the costly deparse to `nlines=1`

* Implementing PR feedback

* Added  inside

* Fix typo in name

* Idiomatic use of  inside

* Separating the deparse line limit to a different PR

---------

Co-authored-by: Michael Chirico <chiricom@google.com>

* Improvements to the introductory vignette (#5836)

* Added my improvements to the intro vignette

* Removed two lines I added extra as a mistake earlier

* Requested changes

* Vignette typo patch (#5402)

* fix typos and grammatical mistakes

* fix typos and punctuation

* remove double spaces where they weren't necessary

* fix typos and adhere to British English spelling

* fix typos

* fix typos

* add missing closing bracket

* fix typos

* review fixes

* Update vignettes/datatable-benchmarking.Rmd

Co-authored-by: Michael Chirico <michaelchirico4@gmail.com>

* Update vignettes/datatable-benchmarking.Rmd

Co-authored-by: Michael Chirico <michaelchirico4@gmail.com>

* Apply suggestions from code review benchmarking

Co-authored-by: Michael Chirico <michaelchirico4@gmail.com>

* remove unnecessary [ ] from datatable-keys-fast-subset.Rmd

* Update vignettes/datatable-programming.Rmd

Co-authored-by: Michael Chirico <michaelchirico4@gmail.com>

* Update vignettes/datatable-reshape.Rmd

Co-authored-by: Michael Chirico <michaelchirico4@gmail.com>

* One last batch of fine-tuning

---------

Co-authored-by: Michael Chirico <michaelchirico4@gmail.com>
Co-authored-by: Michael Chirico <chiricom@google.com>

* fix bad merge

* Improved handling of list columns with NULL entries (#4250)

* Updated documentation for rbindlist(fill=TRUE)

* Print NULL entries of list as NULL

* Added news item

* edit NEWS, use '[NULL]' not 'NULL'

* fix test

* split NEWS item

* add example

---------

Co-authored-by: Michael Chirico <chiricom@google.com>
Co-authored-by: Michael Chirico <michaelchirico4@gmail.com>
Co-authored-by: Benjamin Schwendinger <benjamin.schwendinger@tuwien.ac.at>

* clarify that list input->unnamed list output (#5383)

* clarify that list input->unnamed list output

* Add example where make.names is used

* mention role of make.names

* revert from next release branch

* manual merge NEWS

* manual rebase tests

* manual rebase data.table.R

* clarify 0 turns off everything

---------

Co-authored-by: Ofek <ofekshilon@gmail.com>
Co-authored-by: Ani <bloodraven166@gmail.com>
Co-authored-by: David Budzynski <56514985+davidbudzynski@users.noreply.github.com>
Co-authored-by: Scott Ritchie <sritchie73@gmail.com>
Co-authored-by: Benjamin Schwendinger <benjamin.schwendinger@tuwien.ac.at>
---
 NEWS.md                   |  2 ++
 R/data.table.R            |  6 +++---
 inst/tests/tests.Rraw     |  2 +-
 man/datatable-optimize.Rd | 24 ++++++++++++++++++++++++
 4 files changed, 30 insertions(+), 4 deletions(-)
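
Note: the locale sensitivity described in the NEWS item can be reproduced with
a short session. This is a minimal sketch, not part of the patch; it assumes
data.table is installed, reuses the table from the new ?GForce example, and the
names gforce_ans, base_ans and c_ans are illustrative only. The first two
results differ only when the session's collation is not "C".

```r
library(data.table)

old_opt = options(datatable.optimize = Inf)   # make sure GForce is available
old_collate = Sys.getlocale("LC_COLLATE")

DT = data.table(
  grp = rep(1:2, each = 4L),
  var = c("A", "a", "0", "1", "B", "b", "0", "1")
)

# GForce-eligible query: data.table computes max/min itself, in C-locale order
gforce_ans = DT[, .(max(var), min(var)), by = grp]

# Namespace qualification deactivates GForce, so base R compares strings
# using the session's LC_COLLATE; this may or may not match gforce_ans
base_ans = DT[, .(base::max(var), base::min(var)), by = grp]

# Pinning collation to "C" makes the base R result agree with GForce
Sys.setlocale("LC_COLLATE", "C")
c_ans = DT[, .(base::max(var), base::min(var)), by = grp]

# Restore the session state
Sys.setlocale("LC_COLLATE", old_collate)
options(old_opt)

print(gforce_ans)
print(base_ans)
print(c_ans)
```

Comparing gforce_ans with base_ans shows whether the current locale is affected;
c_ans illustrates option (3) from the NEWS item.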

diff --git a/NEWS.md b/NEWS.md
index 5b91ecadd..fa7deb3b9 100644
--- a/NEWS.md
+++ b/NEWS.md
@@ -617,4 +617,6 @@
 
 19. In the NEWS for v1.11.0 (May 2018), section "NOTICE OF INTENDED FUTURE POTENTIAL BREAKING CHANGES" item 2, we stated the intention to eventually change `logical01` to be `TRUE` by default. After some consideration, reflection, and community input, we have decided not to move forward with this plan, i.e., `logical01` will remain `FALSE` by default in both `fread()` and `fwrite()`. See discussion in #5856; most importantly, we think changing the default would be a major disruption to reading "sharded" CSVs where data with the same schema is split into many files, some of which could be converted to `logical` while others remain `integer`. We will retain the option `datatable.logical01` for users who wish to use a different default -- for example, if you are doing input/output on tables with a large number of logical columns, where writing '0'/'1' to the CSV many millions of times is preferable to writing 'TRUE'/'FALSE'.
 
+20. `?GForce` now documents the case where subtle changes to `j` produce different results because of differences in locale. Because `data.table` _always_ sorts in the "C" locale, small changes to a query that activate or deactivate GForce can give confusingly different results when sorting is involved, [#5331](https://github.com/Rdatatable/data.table/issues/5331). The motivating example compared `DT[, .(max(a), max(b)), by=grp]` and `DT[, .(max(a), max(tolower(b))), by=grp]`: in the latter case, GForce is deactivated owing to the _ad-hoc_ column, so the result for `max(a)` might differ between the two queries. An example is added to `?GForce`. There are several ways to guarantee consistency, for example (1) use namespace qualification to deactivate GForce: `DT[, .(base::max(a), base::max(b)), by=grp]`; (2) turn off all optimizations with `options(datatable.optimize = 0)`; or (3) set your R session to always sort in C locale with `Sys.setlocale("LC_COLLATE", "C")` (or temporarily with e.g. `withr::with_locale()`). Thanks @markseeto for the example and @michaelchirico for the improved documentation.
+
 # data.table v1.14.10 (Dec 2023) back to v1.10.0 (Dec 2016) has been moved to [NEWS.1.md](https://github.com/Rdatatable/data.table/blob/master/NEWS.1.md)
diff --git a/R/data.table.R b/R/data.table.R
index 3e003a03b..a7009399f 100644
--- a/R/data.table.R
+++ b/R/data.table.R
@@ -1733,7 +1733,7 @@ replace_dot_alias = function(e) {
         GForce = FALSE
         if ( ((is.name(jsub) && jsub==".N") || (jsub %iscall% 'list' && length(jsub)==2L && jsub[[2L]]==".N")) && !length(lhs) ) {
           GForce = TRUE
-          if (verbose) catf("GForce optimized j to '%s'\n",deparse(jsub, width.cutoff=200L, nlines=1L))
+          if (verbose) catf("GForce optimized j to '%s' (see ?GForce)\n",deparse(jsub, width.cutoff=200L, nlines=1L))
         }
       } else if (length(lhs) && is.symbol(jsub)) { # turn off GForce for the combination of := and .N
         GForce = FALSE
@@ -1818,8 +1818,8 @@ replace_dot_alias = function(e) {
             if (length(jsub) == 2L && jsub[[1L]] %chin% c("head", "tail")) jsub[["n"]] = 6L
             jsub = gforce_jsub(jsub, names_x)
           }
-          if (verbose) catf("GForce optimized j to '%s'\n", deparse(jsub, width.cutoff=200L, nlines=1L))
-        } else if (verbose) catf("GForce is on, left j unchanged\n");
+          if (verbose) catf("GForce optimized j to '%s' (see ?GForce)\n", deparse(jsub, width.cutoff=200L, nlines=1L))
+        } else if (verbose) catf("GForce is on, but not activated for this query; left j unchanged (see ?GForce)\n");
       }
     }
     if (!GForce && !is.name(jsub)) {
diff --git a/inst/tests/tests.Rraw b/inst/tests/tests.Rraw
index d3b9e2095..935d63825 100644
--- a/inst/tests/tests.Rraw
+++ b/inst/tests/tests.Rraw
@@ -14918,7 +14918,7 @@ test(2041.2, DT[, median(time), by=g], DT[c(2,5),.(g=g, V1=time)])
 # They could run in level 1 with level 2 off, but output= would need to be changed and there's no need.
 test(2042.1, DT[ , as.character(mean(date)), by=g, verbose=TRUE ],
              data.table(g=c("a","b"), V1=c("2018-01-04","2018-01-21")),
-     output=msg<-"GForce is on, left j unchanged.*Old mean optimization is on, left j unchanged")
+     output=msg<-"GForce is on, but not activated.*Old mean optimization is on, left j unchanged")
 # Since %b is e.g. "janv." in LC_TIME=fr_FR.UTF-8 locale, we need to
 # have the target/y value in these tests depend on the locale as well, #3450.
 Jan.2018 = format(strptime("2018-01-01", "%Y-%m-%d"), "%b-%Y")
diff --git a/man/datatable-optimize.Rd b/man/datatable-optimize.Rd
index b4ba0f2f8..9ce7f308f 100644
--- a/man/datatable-optimize.Rd
+++ b/man/datatable-optimize.Rd
@@ -21,6 +21,10 @@ of these optimisations. They happen automatically.
 Run the code under the \emph{example} section to get a feel for the performance
 benefits from these optimisations.
 
+Note that for all optimizations involving efficient sorts, the caveat mentioned
+in \code{\link{setorder}} applies -- whenever data.table does the sorting,
+it does so in "C-locale". This has some subtle implications; see Examples.
+
 }
 \details{
 \code{data.table} reads the global option \code{datatable.optimize} to figure
@@ -101,6 +105,8 @@ indices set global option \code{options(datatable.use.index = FALSE)}.
 \seealso{ \code{\link{setNumericRounding}}, \code{\link{getNumericRounding}} }
 \examples{
 \dontrun{
+old = options(datatable.optimize = Inf)
+
 # Generate a big data.table with a relatively many columns
 set.seed(1L)
 DT = lapply(1:20, function(x) sample(c(-100:100), 5e6L, TRUE))
@@ -151,6 +157,24 @@ system.time(ans1 <- DT[id == 100L]) # index + binary search subset
 system.time(ans2 <- DT[id == 100L]) # only binary search subset
 system.time(DT[id \%in\% 100:500])    # only binary search subset again
 
+# sensitivity to collate order
+old_lc_collate = Sys.getlocale("LC_COLLATE")
+
+if (old_lc_collate == "C") {
+  Sys.setlocale("LC_COLLATE", "")
+}
+DT = data.table(
+  grp = rep(1:2, each = 4L),
+  var = c("A", "a", "0", "1", "B", "b", "0", "1")
+)
+options(datatable.optimize = Inf)
+DT[, .(max(var), min(var)), by=grp]
+# GForce is deactivated because of the ad-hoc column 'tolower(var)', so
+#   'max(var)' is now evaluated by base R in the session locale and may change
+DT[, .(max(var), min(tolower(var))), by=grp]
+
+Sys.setlocale("LC_COLLATE", old_lc_collate)
+options(old)
 }}
 \keyword{ data }