join, which=NA, fixes #4303 #4342

Draft
wants to merge 4 commits into
base: master
1 change: 1 addition & 0 deletions .gitignore
@@ -30,6 +30,7 @@ vignettes/plots/figures
*.o
*.so
*.dll
*.dSYM

# temp files
*~
2 changes: 2 additions & 0 deletions NEWS.md
@@ -8,6 +8,8 @@

## NEW FEATURES

0. Using `which = NA` in a join could return incorrect row indices, [#4303](https://github.com/Rdatatable/data.table/issues/4303). Thanks to @cbilot for reporting.

1. `nafill()` now applies `fill=` to the front/back of the vector when `type="locf|nocb"`, [#3594](https://github.com/Rdatatable/data.table/issues/3594). Thanks to @ben519 for the feature request. It also now returns a named object based on the input names. Note that if you are considering joining and then using `nafill(...,type='locf|nocb')` afterwards, please review `roll=`/`rollends=` which should achieve the same result in one step more efficiently. `nafill()` is for when filling-while-joining (i.e. `roll=`/`rollends=`/`nomatch=`) cannot be applied.

2. `mean(na.rm=TRUE)` by group is now GForce optimized, [#4849](https://github.com/Rdatatable/data.table/issues/4849). Thanks to the [h2oai/db-benchmark](https://github.com/h2oai/db-benchmark) project for spotting this issue. The 1 billion row example in the issue shows 48s reduced to 14s. The optimization also applies to type `integer64` resulting in a difference to the `bit64::mean.integer64` method: `data.table` returns a `double` result whereas `bit64` rounds the mean to the nearest integer.
2 changes: 1 addition & 1 deletion R/data.table.R
@@ -518,7 +518,7 @@ replace_dot_alias = function(e) {
# If using secondary key of x, f__ will refer to xo
if (is.na(which)) {
w = if (notjoin) f__!=0L else is.na(f__)
- return( if (length(xo)) fsort(xo[w], internal=TRUE) else which(w) )
+ return( if (length(xo) && notjoin) fsort(xo[w], internal=TRUE) else which(w) )
}
if (mult=="all") {
# is by=.EACHI along with non-equi join?
13 changes: 13 additions & 0 deletions inst/tests/tests.Rraw
@@ -18111,3 +18111,16 @@ test(2238.9, NA %notin% c(1:5, NA), FALSE)

# shift actionable error on matrix input #5287
test(2239.1, shift(matrix(1:10, ncol = 1)), error="consider wrapping")

# which = NA yields incorrect results #4303
customers = data.table(ID = c(
108924851L, 105257553L, 118054200L, 108365953L,
116642294L, 100419961L, 115677488L, 100405475L,
119246064L, 100383251L
))
orders = data.table(ID = c(105257553L))
test(2240.1, customers[orders, on = .(ID), which = NA], integer())
orders = data.table(ID = c(105257554L))
test(2240.2, customers[orders, on = .(ID), which = NA], 1L)
orders = data.table(ID = c(105257554L, 108924851L))
test(2240.3, customers[orders, on = .(ID), which = NA], 1L)
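
For reviewers, here is a minimal sketch of the semantics these tests rely on. `?data.table` documents `which = NA` as returning the row numbers of `i` that have no match in `x`. The names `expected` and `actual` are illustrative only, and the smaller `customers` table is an assumption made for brevity. The reference answer is computed with base R so that it does not depend on the join code path being changed here.

```r
library(data.table)

# x is deliberately unsorted on ID, so the join has to order it internally
customers = data.table(ID = c(108924851L, 105257553L, 118054200L))
# i: the first ID is absent from customers, the second is present
orders = data.table(ID = c(105257554L, 108924851L))

# Reference answer without a join: positions in `orders` (i)
# whose ID does not occur in `customers` (x)
expected = which(!orders$ID %in% customers$ID)

# Join-based answer: row numbers of i with no match in x
actual = customers[orders, on = .(ID), which = NA]

print(expected)
print(actual)
```

With the change in this PR applied, `actual` should agree with `expected`. On an affected build the two may differ, which is the behavior reported in #4303.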
Loading