Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rbindlist(l, use.names=TRUE) handle different encodings for column names #5453

Merged
merged 14 commits into from
Dec 3, 2024
Merged
2 changes: 2 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -111,6 +111,8 @@ rowwiseDT(

12. Joins on multiple columns, such as `x[y, on=c("x1==y1", "x2==y1")]`, could fail during implicit type coercions if `x1` and `x2` had different but still compatible types, [#6602](https://github.com/Rdatatable/data.table/issues/6602). This was particularly unexpected when columns `x1`, `x2`, and `y1` were all of the same class, e.g. `Date`, but differed in their underlying storage types. Thanks to Benjamin Schwendinger for the report and the fix.

13. `rbindlist(l, use.names=TRUE)` can now handle different encodings for the column names in different entries of `l`, [#5452](https://github.com/Rdatatable/data.table/issues/5452). Thanks to @MEO265 for the report, and Benjamin Schwendinger for the fix.

## NOTES

1. Tests run again when some Suggests packages are missing, [#6411](https://github.com/Rdatatable/data.table/issues/6411). Thanks @aadler for the note and @MichaelChirico for the fix.
Expand Down
12 changes: 12 additions & 0 deletions inst/tests/tests.Rraw
Original file line number Diff line number Diff line change
Expand Up @@ -20639,3 +20639,15 @@ test(2297.22, y[x, on=.(d == a, c == a)], data.table(c=1, d=1))
x = data.table(a=1, b=2L)
y = data.table(c=1.5, d=1L)
test(2297.31, y[x, on=.(c == a, d == a), nomatch=NULL], output="Empty data.table (0 rows and 3 cols): c,d,b")

# rbindlist(l, use.names=TRUE) should handle different colnames encodings #5452
x = data.table(a = 1, b = 2, c = 3)
y = data.table(x = 4, y = 5, z = 6)
# a-umlaut, o-umlaut, u-umlaut
setnames(x , c("\u00e4", "\u00f6", "\u00fc"))
setnames(y , iconv(c("\u00f6", "\u00fc", "\u00e4"), from = "UTF-8", to = "latin1"))
test(2298.1, rbindlist(list(x,y), use.names=TRUE), data.table("\u00e4"=c(1,6), "\u00f6"=c(2,4), "\u00fc"=c(3,5)))
test(2298.2, rbindlist(list(y,x), use.names=TRUE), data.table("\u00f6"=c(4,2), "\u00fc"=c(5,3), "\u00e4"=c(6,1)))
set(y, j="\u00e4", value=NULL)
test(2298.3, rbindlist(list(x,y), use.names=TRUE, fill=TRUE), data.table("\u00e4"=c(1,NA), "\u00f6"=c(2,4), "\u00fc"=c(3,5)))
test(2298.4, rbindlist(list(y,x), use.names=TRUE, fill=TRUE), data.table("\u00f6"=c(4,2), "\u00fc"=c(5,3), "\u00e4"=c(NA,1)))
16 changes: 8 additions & 8 deletions src/rbindlist.c
Original file line number Diff line number Diff line change
Expand Up @@ -76,6 +76,7 @@ SEXP rbindlist(SEXP l, SEXP usenamesArg, SEXP fillArg, SEXP idcolArg, SEXP ignor
error(_("Failed to allocate upper bound of %"PRId64" unique column names [sum(lapply(l,ncol))]"), (int64_t)upperBoundUniqueNames); // # nocov
savetl_init();
int nuniq=0;
// first pass - gather unique column names
for (int i=0; i<LENGTH(l); i++) {
SEXP li = VECTOR_ELT(l, i);
int thisncol=LENGTH(li);
Expand All @@ -84,18 +85,15 @@ SEXP rbindlist(SEXP l, SEXP usenamesArg, SEXP fillArg, SEXP idcolArg, SEXP ignor
if (!length(cn)) continue;
const SEXP *cnp = STRING_PTR_RO(cn);
for (int j=0; j<thisncol; j++) {
SEXP s = cnp[j];
SEXP s = ENC2UTF8(cnp[j]); // convert different encodings for use.names #5452
ben-schwen marked this conversation as resolved.
Show resolved Hide resolved
if (TRUELENGTH(s)<0) continue; // seen this name before
if (TRUELENGTH(s)>0) savetl(s);
uniq[nuniq++] = s;
SET_TRUELENGTH(s,-nuniq);
}
}
if (nuniq>0) {
SEXP *tt = realloc(uniq, nuniq*sizeof(SEXP)); // shrink to only what we need to release the spare
if (!tt) free(uniq); // shrink never fails; just keep codacy happy
uniq = tt;
}
if (nuniq>0) uniq = realloc(uniq, nuniq*sizeof(SEXP)); // shrink to only what we need to release the spare

// now count the dups (if any) and how they're distributed across the items
int *counts = (int *)calloc(nuniq, sizeof(int)); // counts of names for each colnames
int *maxdup = (int *)calloc(nuniq, sizeof(int)); // the most number of dups for any name within one colname vector
Expand All @@ -107,6 +105,7 @@ SEXP rbindlist(SEXP l, SEXP usenamesArg, SEXP fillArg, SEXP idcolArg, SEXP ignor
error(_("Failed to allocate nuniq=%d items working memory in rbindlist.c"), nuniq);
// # nocov end
}
// second pass - count duplicates
for (int i=0; i<LENGTH(l); i++) {
SEXP li = VECTOR_ELT(l, i);
int thisncol=length(li);
Expand All @@ -116,7 +115,7 @@ SEXP rbindlist(SEXP l, SEXP usenamesArg, SEXP fillArg, SEXP idcolArg, SEXP ignor
const SEXP *cnp = STRING_PTR_RO(cn);
memset(counts, 0, nuniq*sizeof(int));
for (int j=0; j<thisncol; j++) {
SEXP s = cnp[j];
SEXP s = ENC2UTF8(cnp[j]); // convert different encodings for use.names #5452
counts[ -TRUELENGTH(s)-1 ]++;
}
for (int u=0; u<nuniq; u++) {
Expand Down Expand Up @@ -145,6 +144,7 @@ SEXP rbindlist(SEXP l, SEXP usenamesArg, SEXP fillArg, SEXP idcolArg, SEXP ignor
for (int i=0; i<ncol; ++i) {uniqMap[i] = dupLink[i] = -1;}
int nextCol=0, lastDup=ncol-1;

// third pass - create final column mapping colMapRaw
for (int i=0; i<LENGTH(l); ++i) {
SEXP li = VECTOR_ELT(l, i);
int thisncol=length(li);
Expand All @@ -156,7 +156,7 @@ SEXP rbindlist(SEXP l, SEXP usenamesArg, SEXP fillArg, SEXP idcolArg, SEXP ignor
const SEXP *cnp = STRING_PTR_RO(cn);
memset(counts, 0, nuniq*sizeof(int));
for (int j=0; j<thisncol; j++) {
SEXP s = cnp[j];
SEXP s = ENC2UTF8(cnp[j]); // convert different encodings for use.names #5452
int w = -TRUELENGTH(s)-1;
int wi = counts[w]++; // how many dups have we seen before of this name within this item
if (uniqMap[w]==-1) {
Expand Down
Loading