From 8201f1b4a0ef29f87c08af8a5c1677b747a22e00 Mon Sep 17 00:00:00 2001
From: Sebastian Krantz <sebastian.krantz@graduateinstitute.ch>
Date: Tue, 3 Sep 2024 14:39:47 +0200
Subject: [PATCH 1/4] More extensive checking.

---
 NEWS.md                | 2 +-
 src/data.table_utils.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/NEWS.md b/NEWS.md
index f0385c4d..ce7e0b54 100644
--- a/NEWS.md
+++ b/NEWS.md
@@ -2,7 +2,7 @@
 
 * Fixes an installation bug on some Linux systems (conflicting types) (#613). 
 
-* *collapse* now enforces string encoding in `fmatch()` / `join()`, which caused problems if strings being matched had different encodings (#566, #579, and #618). To avoid noticeable performance implications, checks are done heuristically, i.e., the first, middle and last string of a character vector are checked, and if not UTF8, the entire vector is coerced to UTF8 strings *before* the matching process. In general, character vectors in R can contain strings of different encodings, but this is not the case with most regular data. For performance reasons, *collapse* assumes that character vectors are uniform in terms of string encoding. 
+* *collapse* now enforces string encoding in `fmatch()` / `join()`, which caused problems if strings being matched had different encodings (#566, #579, and #618). To avoid noticeable performance implications, checks are done heuristically, i.e., the first, 25th, 50th, 75th, and last string of a character vector are checked, and if not UTF8, the entire vector is coerced to UTF8 strings *before* the matching process. In general, character vectors in R can contain strings of different encodings, but this is not the case with most regular data. For performance reasons, *collapse* assumes that character vectors are uniform in terms of string encoding. 
 
 * Fixes a bug using qualified names for fast statistical functions inside `across()` (#621, thanks @alinacherkas). 
 
diff --git a/src/data.table_utils.c b/src/data.table_utils.c
index 0572253b..3fc240c6 100644
--- a/src/data.table_utils.c
+++ b/src/data.table_utils.c
@@ -15,7 +15,7 @@ int need2utf8(SEXP x) {
   // }
   // return(false);
   if (xlen <= 1) return xlen == 1 ? NEED2UTF8(xd[0]) : 0;
-  return NEED2UTF8(xd[0]) || NEED2UTF8(xd[xlen/2]) || NEED2UTF8(xd[xlen-1]);
+  return NEED2UTF8(xd[0]) || NEED2UTF8(xd[xlen/4]) || NEED2UTF8(xd[xlen/2]) || NEED2UTF8(xd[xlen/1.3333]) || NEED2UTF8(xd[xlen-1]);
 }
 
 SEXP coerceUtf8IfNeeded(SEXP x) {

From 65f1f7f079f4945c8c5ea01eb03917b7d88292c0 Mon Sep 17 00:00:00 2001
From: Sebastian Krantz <sebastian.krantz@graduateinstitute.ch>
Date: Tue, 3 Sep 2024 14:41:35 +0200
Subject: [PATCH 2/4] Fix.

---
 src/data.table_utils.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
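
Note on the sampled positions: in C an array subscript must have integer type, so `xd[xlen/1.3333]` from the previous patch (a `double` index) does not compile, hence the `(int)` cast here. As a rough sketch of which elements the check now inspects, the index arithmetic can be isolated as below. `sample_indices` is a hypothetical helper written for illustration only; it is not part of this patch or of *collapse*, and it assumes `n >= 2`, since shorter vectors are handled by the early return.

```c
/* Hypothetical helper (illustration only, not part of the patch):
   writes the five positions that need2utf8() inspects for a
   character vector of length n. Assumes n >= 2. */
void sample_indices(int n, int out[5]) {
  out[0] = 0;                  /* first string                        */
  out[1] = n / 4;              /* roughly the 25% position            */
  out[2] = n / 2;              /* roughly the 50% position            */
  out[3] = (int)(n / 1.3333);  /* roughly the 75% position, truncated */
  out[4] = n - 1;              /* last string                         */
}
```

For a vector of length 10 this yields positions 0, 2, 5, 7 and 9. The truncated value never exceeds `n - 1` for `n >= 2`, so the access stays in bounds.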

diff --git a/src/data.table_utils.c b/src/data.table_utils.c
index 3fc240c6..0d3fef95 100644
--- a/src/data.table_utils.c
+++ b/src/data.table_utils.c
@@ -15,7 +15,7 @@ int need2utf8(SEXP x) {
   // }
   // return(false);
   if (xlen <= 1) return xlen == 1 ? NEED2UTF8(xd[0]) : 0;
-  return NEED2UTF8(xd[0]) || NEED2UTF8(xd[xlen/4]) || NEED2UTF8(xd[xlen/2]) || NEED2UTF8(xd[xlen/1.3333]) || NEED2UTF8(xd[xlen-1]);
+  return NEED2UTF8(xd[0]) || NEED2UTF8(xd[xlen/4]) || NEED2UTF8(xd[xlen/2]) || NEED2UTF8(xd[(int)(xlen/1.3333)]) || NEED2UTF8(xd[xlen-1]);
 }
 
 SEXP coerceUtf8IfNeeded(SEXP x) {

From f11bbc7520902770f1630ebcab0f10f644f39144 Mon Sep 17 00:00:00 2001
From: Sebastian Krantz <sebastian.krantz@graduateinstitute.ch>
Date: Tue, 3 Sep 2024 14:43:25 +0200
Subject: [PATCH 3/4] Better.

---
 NEWS.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/NEWS.md b/NEWS.md
index ce7e0b54..ee214c33 100644
--- a/NEWS.md
+++ b/NEWS.md
@@ -2,7 +2,7 @@
 
 * Fixes an installation bug on some Linux systems (conflicting types) (#613). 
 
-* *collapse* now enforces string encoding in `fmatch()` / `join()`, which caused problems if strings being matched had different encodings (#566, #579, and #618). To avoid noticeable performance implications, checks are done heuristically, i.e., the first, 25th, 50th, 75th, and last string of a character vector are checked, and if not UTF8, the entire vector is coerced to UTF8 strings *before* the matching process. In general, character vectors in R can contain strings of different encodings, but this is not the case with most regular data. For performance reasons, *collapse* assumes that character vectors are uniform in terms of string encoding. 
+* *collapse* now enforces string encoding in `fmatch()` / `join()`, which caused problems if strings being matched had different encodings (#566, #579, and #618). To avoid noticeable performance implications, checks are done heuristically, i.e., the first, 25th, 50th and 75th percentile and last string of a character vector are checked, and if not UTF8, the entire vector is coerced to UTF8 strings *before* the matching process. In general, character vectors in R can contain strings of different encodings, but this is not the case with most regular data. For performance reasons, *collapse* assumes that character vectors are uniform in terms of string encoding. 
 
 * Fixes a bug using qualified names for fast statistical functions inside `across()` (#621, thanks @alinacherkas). 
 

From 4baa927f8f7cef25c4e1c8144b60459173edd18c Mon Sep 17 00:00:00 2001
From: Sebastian Krantz <sebastian.krantz@graduateinstitute.ch>
Date: Tue, 3 Sep 2024 14:56:32 +0200
Subject: [PATCH 4/4] Minor wording.

---
 NEWS.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/NEWS.md b/NEWS.md
index ee214c33..44d48684 100644
--- a/NEWS.md
+++ b/NEWS.md
@@ -2,7 +2,7 @@
 
 * Fixes an installation bug on some Linux systems (conflicting types) (#613). 
 
-* *collapse* now enforces string encoding in `fmatch()` / `join()`, which caused problems if strings being matched had different encodings (#566, #579, and #618). To avoid noticeable performance implications, checks are done heuristically, i.e., the first, 25th, 50th and 75th percentile and last string of a character vector are checked, and if not UTF8, the entire vector is coerced to UTF8 strings *before* the matching process. In general, character vectors in R can contain strings of different encodings, but this is not the case with most regular data. For performance reasons, *collapse* assumes that character vectors are uniform in terms of string encoding. 
+* *collapse* now enforces string encoding in `fmatch()` / `join()`; previously, matching could fail if the strings being matched had different encodings (#566, #579, and #618). To avoid noticeable performance implications, checks are done heuristically: the first and last strings of a character vector and the strings at the 25th, 50th, and 75th percentile positions are checked, and if any of them is not UTF-8, the entire vector is internally coerced to UTF-8 *before* the matching process. In general, character vectors in R can contain strings of different encodings, but this is not the case with most regular data. For performance reasons, *collapse* assumes that character vectors are uniform in terms of string encoding. Character vectors with heterogeneous encodings should be coerced beforehand using tools like `stringi::stri_trans_general(x, "latin-ascii")`.
 
 * Fixes a bug using qualified names for fast statistical functions inside `across()` (#621, thanks @alinacherkas).