From 8201f1b4a0ef29f87c08af8a5c1677b747a22e00 Mon Sep 17 00:00:00 2001
From: Sebastian Krantz
Date: Tue, 3 Sep 2024 14:39:47 +0200
Subject: [PATCH 1/4] More extensive checking.

---
 NEWS.md                | 2 +-
 src/data.table_utils.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/NEWS.md b/NEWS.md
index f0385c4d..ce7e0b54 100644
--- a/NEWS.md
+++ b/NEWS.md
@@ -2,7 +2,7 @@
 
 * Fixes an installation bug on some Linux systems (conflicting types) (#613).
 
-* *collapse* now enforces string encoding in `fmatch()` / `join()`, which caused problems if strings being matched had different encodings (#566, #579, and #618). To avoid noticeable performance implications, checks are done heuristically, i.e., the first, middle and last string of a character vector are checked, and if not UTF8, the entire vector is coerced to UTF8 strings *before* the matching process. In general, character vectors in R can contain strings of different encodings, but this is not the case with most regular data. For performance reasons, *collapse* assumes that character vectors are uniform in terms of string encoding.
+* *collapse* now enforces string encoding in `fmatch()` / `join()`, which caused problems if strings being matched had different encodings (#566, #579, and #618). To avoid noticeable performance implications, checks are done heuristically, i.e., the first, 25th, 50th, 75th, and last string of a character vector are checked, and if not UTF8, the entire vector is coerced to UTF8 strings *before* the matching process. In general, character vectors in R can contain strings of different encodings, but this is not the case with most regular data. For performance reasons, *collapse* assumes that character vectors are uniform in terms of string encoding.
 
 * Fixes a bug using qualified names for fast statistical functions inside `across()` (#621, thanks @alinacherkas).
 
diff --git a/src/data.table_utils.c b/src/data.table_utils.c
index 0572253b..3fc240c6 100644
--- a/src/data.table_utils.c
+++ b/src/data.table_utils.c
@@ -15,7 +15,7 @@ int need2utf8(SEXP x) {
   // }
   // return(false);
   if (xlen <= 1) return xlen == 1 ? NEED2UTF8(xd[0]) : 0;
-  return NEED2UTF8(xd[0]) || NEED2UTF8(xd[xlen/2]) || NEED2UTF8(xd[xlen-1]);
+  return NEED2UTF8(xd[0]) || NEED2UTF8(xd[xlen/4]) || NEED2UTF8(xd[xlen/2]) || NEED2UTF8(xd[xlen/1.3333]) || NEED2UTF8(xd[xlen-1]);
 }
 
 SEXP coerceUtf8IfNeeded(SEXP x) {

From 65f1f7f079f4945c8c5ea01eb03917b7d88292c0 Mon Sep 17 00:00:00 2001
From: Sebastian Krantz
Date: Tue, 3 Sep 2024 14:41:35 +0200
Subject: [PATCH 2/4] Fix.

---
 src/data.table_utils.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/data.table_utils.c b/src/data.table_utils.c
index 3fc240c6..0d3fef95 100644
--- a/src/data.table_utils.c
+++ b/src/data.table_utils.c
@@ -15,7 +15,7 @@ int need2utf8(SEXP x) {
   // }
   // return(false);
   if (xlen <= 1) return xlen == 1 ? NEED2UTF8(xd[0]) : 0;
-  return NEED2UTF8(xd[0]) || NEED2UTF8(xd[xlen/4]) || NEED2UTF8(xd[xlen/2]) || NEED2UTF8(xd[xlen/1.3333]) || NEED2UTF8(xd[xlen-1]);
+  return NEED2UTF8(xd[0]) || NEED2UTF8(xd[xlen/4]) || NEED2UTF8(xd[xlen/2]) || NEED2UTF8(xd[(int)(xlen/1.3333)]) || NEED2UTF8(xd[xlen-1]);
 }
 
 SEXP coerceUtf8IfNeeded(SEXP x) {

From f11bbc7520902770f1630ebcab0f10f644f39144 Mon Sep 17 00:00:00 2001
From: Sebastian Krantz
Date: Tue, 3 Sep 2024 14:43:25 +0200
Subject: [PATCH 3/4] Better.

---
 NEWS.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/NEWS.md b/NEWS.md
index ce7e0b54..ee214c33 100644
--- a/NEWS.md
+++ b/NEWS.md
@@ -2,7 +2,7 @@
 
 * Fixes an installation bug on some Linux systems (conflicting types) (#613).
 
-* *collapse* now enforces string encoding in `fmatch()` / `join()`, which caused problems if strings being matched had different encodings (#566, #579, and #618). To avoid noticeable performance implications, checks are done heuristically, i.e., the first, 25th, 50th, 75th, and last string of a character vector are checked, and if not UTF8, the entire vector is coerced to UTF8 strings *before* the matching process. In general, character vectors in R can contain strings of different encodings, but this is not the case with most regular data. For performance reasons, *collapse* assumes that character vectors are uniform in terms of string encoding.
+* *collapse* now enforces string encoding in `fmatch()` / `join()`, which caused problems if strings being matched had different encodings (#566, #579, and #618). To avoid noticeable performance implications, checks are done heuristically, i.e., the first, 25th, 50th and 75th percentile and last string of a character vector are checked, and if not UTF8, the entire vector is coerced to UTF8 strings *before* the matching process. In general, character vectors in R can contain strings of different encodings, but this is not the case with most regular data. For performance reasons, *collapse* assumes that character vectors are uniform in terms of string encoding.
 
 * Fixes a bug using qualified names for fast statistical functions inside `across()` (#621, thanks @alinacherkas).
 

From 4baa927f8f7cef25c4e1c8144b60459173edd18c Mon Sep 17 00:00:00 2001
From: Sebastian Krantz
Date: Tue, 3 Sep 2024 14:56:32 +0200
Subject: [PATCH 4/4] Minors wording.

---
 NEWS.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/NEWS.md b/NEWS.md
index ee214c33..44d48684 100644
--- a/NEWS.md
+++ b/NEWS.md
@@ -2,7 +2,7 @@
 
 * Fixes an installation bug on some Linux systems (conflicting types) (#613).
 
-* *collapse* now enforces string encoding in `fmatch()` / `join()`, which caused problems if strings being matched had different encodings (#566, #579, and #618). To avoid noticeable performance implications, checks are done heuristically, i.e., the first, 25th, 50th and 75th percentile and last string of a character vector are checked, and if not UTF8, the entire vector is coerced to UTF8 strings *before* the matching process. In general, character vectors in R can contain strings of different encodings, but this is not the case with most regular data. For performance reasons, *collapse* assumes that character vectors are uniform in terms of string encoding.
+* *collapse* now enforces string encoding in `fmatch()` / `join()`, which caused problems if strings being matched had different encodings (#566, #579, and #618). To avoid noticeable performance implications, checks are done heuristically, i.e., the first, 25th, 50th and 75th percentile and last string of a character vector are checked, and if not UTF8, the entire vector is internally coerced to UTF8 strings *before* the matching process. In general, character vectors in R can contain strings of different encodings, but this is not the case with most regular data. For performance reasons, *collapse* assumes that character vectors are uniform in terms of string encoding. Heterogeneous strings should be coerced using tools like `stringi::stri_trans_general(x, "latin-ascii")`.
 
 * Fixes a bug using qualified names for fast statistical functions inside `across()` (#621, thanks @alinacherkas).
 
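
Review note: the sampling heuristic described in the NEWS entry can be illustrated outside of R with a small standalone sketch. This is not the code in `src/data.table_utils.c`; `needs_utf8()` is a hypothetical stand-in for the `NEED2UTF8` macro (it only flags non-ASCII bytes, whereas the real macro presumably inspects R's encoding marks), `need2utf8_sketch()` is an illustrative name, and the 75% index is computed as `xlen / 2 + xlen / 4` in integer arithmetic rather than the patch's `(int)(xlen/1.3333)`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for NEED2UTF8: returns 1 if the string contains
   any byte outside the ASCII range. This is an assumption made so the
   sketch compiles without R headers. */
static int needs_utf8(const char *s) {
  for (; *s; ++s) {
    if ((unsigned char)*s > 127) return 1;
  }
  return 0;
}

/* Checks only five positions (first, ~25%, ~50%, ~75%, last) instead of
   scanning the whole vector, mirroring the heuristic described above. */
static int need2utf8_sketch(const char **xd, size_t xlen) {
  assert(xd != NULL || xlen == 0);
  if (xlen <= 1) return xlen == 1 ? needs_utf8(xd[0]) : 0;
  /* xlen/2 + xlen/4 <= 3*xlen/4 < xlen, so the index stays in bounds
     and no floating-point conversion is needed. */
  size_t q3 = xlen / 2 + xlen / 4;
  return needs_utf8(xd[0]) || needs_utf8(xd[xlen / 4]) ||
         needs_utf8(xd[xlen / 2]) || needs_utf8(xd[q3]) ||
         needs_utf8(xd[xlen - 1]);
}
```

For a vector of length 9 the sampled indices are 0, 2, 4, 6 and 8, so a non-ASCII string at index 6 is detected while one at index 3 is not. That gap is the trade-off the NEWS entry accepts by assuming character vectors are uniform in encoding.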