Skip to content

Commit

Permalink
rbindlist(l, use.names=TRUE) handle different encodings for column na…
Browse files Browse the repository at this point in the history
…mes (#5453)

* fix handling of different encodings for column names

* improve comment

* write special chars in unicode

* simplify tests

* Fix NEWS numbering

* Update NEWS.md

Co-authored-by: Michael Chirico <[email protected]>

* add comments

* fix lint

---------

Co-authored-by: Michael Chirico <[email protected]>
  • Loading branch information
ben-schwen and MichaelChirico authored Dec 3, 2024
1 parent 014dafb commit f05893e
Show file tree
Hide file tree
Showing 3 changed files with 22 additions and 8 deletions.
2 changes: 2 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -111,6 +111,8 @@ rowwiseDT(
12. Joins on multiple columns, such as `x[y, on=c("x1==y1", "x2==y1")]`, could fail during implicit type coercions if `x1` and `x2` had different but still compatible types, [#6602](https://github.com/Rdatatable/data.table/issues/6602). This was particularly unexpected when columns `x1`, `x2`, and `y1` were all of the same class, e.g. `Date`, but differed in their underlying storage types. Thanks to Benjamin Schwendinger for the report and the fix.
13. `rbindlist(l, use.names=TRUE)` can now handle different encodings for the column names in different entries of `l`, [#5452](https://github.com/Rdatatable/data.table/issues/5452). Thanks to @MEO265 for the report, and Benjamin Schwendinger for the fix.
## NOTES
1. Tests run again when some Suggests packages are missing, [#6411](https://github.com/Rdatatable/data.table/issues/6411). Thanks @aadler for the note and @MichaelChirico for the fix.
Expand Down
12 changes: 12 additions & 0 deletions inst/tests/tests.Rraw
Original file line number Diff line number Diff line change
Expand Up @@ -20639,3 +20639,15 @@ test(2297.22, y[x, on=.(d == a, c == a)], data.table(c=1, d=1))
x = data.table(a=1, b=2L)
y = data.table(c=1.5, d=1L)
test(2297.31, y[x, on=.(c == a, d == a), nomatch=NULL], output="Empty data.table (0 rows and 3 cols): c,d,b")

# rbindlist(l, use.names=TRUE) should handle different colnames encodings #5452
x = data.table(a = 1, b = 2, c = 3)
y = data.table(x = 4, y = 5, z = 6)
# a-umlaut, o-umlaut, u-umlaut
setnames(x , c("\u00e4", "\u00f6", "\u00fc"))
setnames(y , iconv(c("\u00f6", "\u00fc", "\u00e4"), from = "UTF-8", to = "latin1"))
test(2298.1, rbindlist(list(x,y), use.names=TRUE), data.table("\u00e4"=c(1,6), "\u00f6"=c(2,4), "\u00fc"=c(3,5)))
test(2298.2, rbindlist(list(y,x), use.names=TRUE), data.table("\u00f6"=c(4,2), "\u00fc"=c(5,3), "\u00e4"=c(6,1)))
set(y, j="\u00e4", value=NULL)
test(2298.3, rbindlist(list(x,y), use.names=TRUE, fill=TRUE), data.table("\u00e4"=c(1,NA), "\u00f6"=c(2,4), "\u00fc"=c(3,5)))
test(2298.4, rbindlist(list(y,x), use.names=TRUE, fill=TRUE), data.table("\u00f6"=c(4,2), "\u00fc"=c(5,3), "\u00e4"=c(NA,1)))
16 changes: 8 additions & 8 deletions src/rbindlist.c
Original file line number Diff line number Diff line change
Expand Up @@ -76,6 +76,7 @@ SEXP rbindlist(SEXP l, SEXP usenamesArg, SEXP fillArg, SEXP idcolArg, SEXP ignor
error(_("Failed to allocate upper bound of %"PRId64" unique column names [sum(lapply(l,ncol))]"), (int64_t)upperBoundUniqueNames); // # nocov
savetl_init();
int nuniq=0;
// first pass - gather unique column names
for (int i=0; i<LENGTH(l); i++) {
SEXP li = VECTOR_ELT(l, i);
int thisncol=LENGTH(li);
Expand All @@ -84,18 +85,15 @@ SEXP rbindlist(SEXP l, SEXP usenamesArg, SEXP fillArg, SEXP idcolArg, SEXP ignor
if (!length(cn)) continue;
const SEXP *cnp = STRING_PTR_RO(cn);
for (int j=0; j<thisncol; j++) {
SEXP s = cnp[j];
SEXP s = ENC2UTF8(cnp[j]); // convert different encodings for use.names #5452
if (TRUELENGTH(s)<0) continue; // seen this name before
if (TRUELENGTH(s)>0) savetl(s);
uniq[nuniq++] = s;
SET_TRUELENGTH(s,-nuniq);
}
}
if (nuniq>0) {
SEXP *tt = realloc(uniq, nuniq*sizeof(SEXP)); // shrink to only what we need to release the spare
if (!tt) free(uniq); // shrink never fails; just keep codacy happy
uniq = tt;
}
if (nuniq>0) uniq = realloc(uniq, nuniq*sizeof(SEXP)); // shrink to only what we need to release the spare

// now count the dups (if any) and how they're distributed across the items
int *counts = (int *)calloc(nuniq, sizeof(int)); // counts of names for each colnames
int *maxdup = (int *)calloc(nuniq, sizeof(int)); // the most number of dups for any name within one colname vector
Expand All @@ -107,6 +105,7 @@ SEXP rbindlist(SEXP l, SEXP usenamesArg, SEXP fillArg, SEXP idcolArg, SEXP ignor
error(_("Failed to allocate nuniq=%d items working memory in rbindlist.c"), nuniq);
// # nocov end
}
// second pass - count duplicates
for (int i=0; i<LENGTH(l); i++) {
SEXP li = VECTOR_ELT(l, i);
int thisncol=length(li);
Expand All @@ -116,7 +115,7 @@ SEXP rbindlist(SEXP l, SEXP usenamesArg, SEXP fillArg, SEXP idcolArg, SEXP ignor
const SEXP *cnp = STRING_PTR_RO(cn);
memset(counts, 0, nuniq*sizeof(int));
for (int j=0; j<thisncol; j++) {
SEXP s = cnp[j];
SEXP s = ENC2UTF8(cnp[j]); // convert different encodings for use.names #5452
counts[ -TRUELENGTH(s)-1 ]++;
}
for (int u=0; u<nuniq; u++) {
Expand Down Expand Up @@ -145,6 +144,7 @@ SEXP rbindlist(SEXP l, SEXP usenamesArg, SEXP fillArg, SEXP idcolArg, SEXP ignor
for (int i=0; i<ncol; ++i) {uniqMap[i] = dupLink[i] = -1;}
int nextCol=0, lastDup=ncol-1;

// third pass - create final column mapping colMapRaw
for (int i=0; i<LENGTH(l); ++i) {
SEXP li = VECTOR_ELT(l, i);
int thisncol=length(li);
Expand All @@ -156,7 +156,7 @@ SEXP rbindlist(SEXP l, SEXP usenamesArg, SEXP fillArg, SEXP idcolArg, SEXP ignor
const SEXP *cnp = STRING_PTR_RO(cn);
memset(counts, 0, nuniq*sizeof(int));
for (int j=0; j<thisncol; j++) {
SEXP s = cnp[j];
SEXP s = ENC2UTF8(cnp[j]); // convert different encodings for use.names #5452
int w = -TRUELENGTH(s)-1;
int wi = counts[w]++; // how many dups have we seen before of this name within this item
if (uniqMap[w]==-1) {
Expand Down

0 comments on commit f05893e

Please sign in to comment.