rbindlist(l, use.names=TRUE) handle different encodings for column names #5453

Merged: 14 commits, Dec 3, 2024
2 changes: 2 additions & 0 deletions NEWS.md
@@ -12,6 +12,8 @@

2. `fwrite()` respects `dec=','` for timestamp columns (`POSIXct` or `nanotime`) with sub-second accuracy, [#6446](https://github.com/Rdatatable/data.table/issues/6446). Thanks @kav2k for pointing out the inconsistency and @MichaelChirico for the PR.

3. `rbindlist(l, use.names=TRUE)` can now handle different encodings for the column names, [#5452](https://github.com/Rdatatable/data.table/issues/5452). Thanks to @MEO265 for the report, and Benjamin Schwendinger for the fix.

## NOTES

1. Tests run again when some Suggests packages are missing, [#6411](https://github.com/Rdatatable/data.table/issues/6411). Thanks @aadler for the note and @MichaelChirico for the fix.
11 changes: 11 additions & 0 deletions inst/tests/tests.Rraw
@@ -19063,3 +19063,14 @@ test(2280.3, foo(), error="Internal error in foo: broken")
# fwrite respects dec=',' for sub-second timestamps, #6446
test(2281.1, fwrite(data.table(a=.POSIXct(0.001)), dec=',', sep=';'), output="1970-01-01T00:00:00,001Z")
test(2281.2, fwrite(data.table(a=.POSIXct(0.0001)), dec=',', sep=';'), output="1970-01-01T00:00:00,000100Z")

# rbindlist(l, use.names=TRUE) should handle different colnames encodings #5452
x = data.table(a = 1, b = 2, c = 3)
y = data.table(x = 4, y = 5, z = 6)
setnames(x , c("\u00e4", "\u00f6", "\u00fc"))
setnames(y , iconv(c("\u00f6", "\u00fc", "\u00e4"), from = "UTF-8", to = "latin1"))
test(2282.1, rbindlist(list(x,y), use.names=TRUE), data.table("\u00e4"=c(1,6), "\u00f6"=c(2,4), "\u00fc"=c(3,5)))
test(2282.2, rbindlist(list(y,x), use.names=TRUE), data.table("\u00f6"=c(4,2), "\u00fc"=c(5,3), "\u00e4"=c(6,1)))
set(y, j="\u00e4", value=NULL)
test(2282.3, rbindlist(list(x,y), use.names=TRUE, fill=TRUE), data.table("\u00e4"=c(1,NA), "\u00f6"=c(2,4), "\u00fc"=c(3,5)))
test(2282.4, rbindlist(list(y,x), use.names=TRUE, fill=TRUE), data.table("\u00f6"=c(4,2), "\u00fc"=c(5,3), "\u00e4"=c(NA,1)))
6 changes: 3 additions & 3 deletions src/rbindlist.c
@@ -82,7 +82,7 @@ SEXP rbindlist(SEXP l, SEXP usenamesArg, SEXP fillArg, SEXP idcolArg, SEXP ignor
if (!length(cn)) continue;
const SEXP *cnp = STRING_PTR_RO(cn);
for (int j=0; j<thisncol; j++) {
- SEXP s = cnp[j];
+ SEXP s = ENC2UTF8(cnp[j]); // convert different encodings for use.names #5452
if (TRUELENGTH(s)<0) continue; // seen this name before
if (TRUELENGTH(s)>0) savetl(s);
uniq[nuniq++] = s;
@@ -114,7 +114,7 @@ SEXP rbindlist(SEXP l, SEXP usenamesArg, SEXP fillArg, SEXP idcolArg, SEXP ignor
const SEXP *cnp = STRING_PTR_RO(cn);
memset(counts, 0, nuniq*sizeof(int));
for (int j=0; j<thisncol; j++) {
- SEXP s = cnp[j];
+ SEXP s = ENC2UTF8(cnp[j]); // convert different encodings for use.names #5452
counts[ -TRUELENGTH(s)-1 ]++;
}
for (int u=0; u<nuniq; u++) {
@@ -154,7 +154,7 @@ SEXP rbindlist(SEXP l, SEXP usenamesArg, SEXP fillArg, SEXP idcolArg, SEXP ignor
const SEXP *cnp = STRING_PTR_RO(cn);
memset(counts, 0, nuniq*sizeof(int));
for (int j=0; j<thisncol; j++) {
- SEXP s = cnp[j];
+ SEXP s = ENC2UTF8(cnp[j]); // convert different encodings for use.names #5452
int w = -TRUELENGTH(s)-1;
int wi = counts[w]++; // how many dups have we seen before of this name within this item
if (uniqMap[w]==-1) {