Merge pull request #629 from SebKrantz/development
Development
SebKrantz authored Sep 3, 2024
2 parents 0b1f2c8 + 4baa927 commit 812eba3
Showing 2 changed files with 2 additions and 2 deletions.
2 changes: 1 addition & 1 deletion NEWS.md
@@ -2,7 +2,7 @@

* Fixes an installation bug on some Linux systems (conflicting types) (#613).

- * *collapse* now enforces string encoding in `fmatch()` / `join()`, which caused problems if strings being matched had different encodings (#566, #579, and #618). To avoid noticeable performance implications, checks are done heuristically, i.e., the first, middle and last string of a character vector are checked, and if not UTF8, the entire vector is coerced to UTF8 strings *before* the matching process. In general, character vectors in R can contain strings of different encodings, but this is not the case with most regular data. For performance reasons, *collapse* assumes that character vectors are uniform in terms of string encoding.
+ * *collapse* now enforces string encoding in `fmatch()` / `join()`, which caused problems if strings being matched had different encodings (#566, #579, and #618). To avoid noticeable performance implications, checks are done heuristically, i.e., the first, 25th, 50th and 75th percentile and last string of a character vector are checked, and if not UTF8, the entire vector is internally coerced to UTF8 strings *before* the matching process. In general, character vectors in R can contain strings of different encodings, but this is not the case with most regular data. For performance reasons, *collapse* assumes that character vectors are uniform in terms of string encoding. Heterogeneous strings should be coerced using tools like `stringi::stri_trans_general(x, "latin-ascii")`.

* Fixes a bug using qualified names for fast statistical functions inside `across()` (#621, thanks @alinacherkas).

2 changes: 1 addition & 1 deletion src/data.table_utils.c
@@ -15,7 +15,7 @@ int need2utf8(SEXP x) {
// }
// return(false);
if (xlen <= 1) return xlen == 1 ? NEED2UTF8(xd[0]) : 0;
- return NEED2UTF8(xd[0]) || NEED2UTF8(xd[xlen/2]) || NEED2UTF8(xd[xlen-1]);
+ return NEED2UTF8(xd[0]) || NEED2UTF8(xd[xlen/4]) || NEED2UTF8(xd[xlen/2]) || NEED2UTF8(xd[(int)(xlen/1.3333)]) || NEED2UTF8(xd[xlen-1]);
}

SEXP coerceUtf8IfNeeded(SEXP x) {
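
The five-point sampling in `need2utf8()` can be illustrated outside of R with a minimal, standalone sketch. It is not the package's code: `has_non_ascii()` is a hypothetical stand-in for the `NEED2UTF8` macro (which, in the actual source, operates on R string objects rather than plain C strings), and `sample_needs_check()` is an illustrative name. Only the choice of sampled indices mirrors the changed line above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for NEED2UTF8: flags a string that contains any
   non-ASCII byte. This is an assumption made for illustration only. */
static int has_non_ascii(const char *s) {
    for (const unsigned char *p = (const unsigned char *)s; *p; ++p)
        if (*p > 0x7F) return 1;
    return 0;
}

/* Heuristic check over five sample positions: first, roughly the 25th,
   50th and 75th percentile, and last. Elements in between are never
   inspected, so a lone offending string can go unnoticed. */
static int sample_needs_check(const char *const *x, size_t n) {
    assert(x != NULL || n == 0);
    if (n <= 1) return n == 1 ? has_non_ascii(x[0]) : 0;
    /* Same index arithmetic as the diff: n/4, n/2, n/1.3333, n-1. */
    const size_t idx[5] = { 0, n / 4, n / 2, (size_t)(n / 1.3333), n - 1 };
    for (size_t i = 0; i < 5; ++i)
        if (has_non_ascii(x[idx[i]])) return 1;
    return 0;
}
```

For a vector of length 8 the sampled indices are 0, 2, 4, 6 and 7, so a non-ASCII string at index 6 is detected (the previous three-point check at 0, 4 and 7 would have missed it), while one at index 1 still is not. That residual gap is why the NEWS entry recommends coercing heterogeneous vectors beforehand.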
