which = NA yields incorrect results #4303
Comments
Thank you @cbilot for reporting. I am able to reproduce the problem.
jangorecki added the bug and joins labels on Apr 2, 2020
@jangorecki, #4342 does indeed resolve all my use cases.
No need to close the issue; that will happen automatically once the PR referenced above is merged into the master branch. Thanks for the report!
While working with the "which = (FALSE/TRUE/NA)" option of a data.table join, I received some quite unexpected results.
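For context, here is a minimal sketch of what the three `which` settings ask for, using two tiny made-up tables (`x`, `i`, and their IDs are hypothetical and are not the data from this report). Per the `data.table` documentation, `which = TRUE` returns the row numbers of `x` that each row of `i` matches, and `which = NA` returns the row numbers of `i` that have no match in `x`:

```r
library(data.table)

# Hypothetical toy tables, not the data from this report.
x <- data.table(ID = c(1L, 2L, 3L))   # lookup table (think: customers)
i <- data.table(ID = c(2L, 4L))       # query table (think: orders)

# which = FALSE (the default): return the joined rows themselves.
joined <- x[i, on = "ID"]

# which = TRUE: for each row of i, the matching row number in x (NA if none).
matched_rows <- x[i, on = "ID", which = TRUE]

# which = NA: row numbers of i that have no match in x.
unmatched_rows <- x[i, on = "ID", which = NA]

print(matched_rows)
print(unmatched_rows)
```

By that definition, a `which = NA` result should only ever contain values between 1 and `nrow(i)`, and never more than `nrow(i)` of them.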
Let's start with a simple case that works correctly. Intuitively, we are asking the question: "Which orders do not have a valid customer ID?"
This yields the (correct) output:
integer(0)
because ID: 105257553 (the only row in the orders table) is the second ID in the customers table.
FWIW, with datatable.verbose = TRUE, we get the following output:
Now, let's make one tiny change. Let's change only the last digit of the ID in the orders table from 105257553 to 105257554. Note: this ID is not in the customers table.
Now, we get the following output:
[1] 1 2 3 4 5 6 7 8 9 10
This result cannot be correct. For one, the orders table has only one row, so no element of the result should be anything other than 1. Also notice that we receive 10 values in our result -- even though our orders table only has one row.
But things get even stranger. Let's keep ID 105257554 (which is not in the customers table), but add a second ID to the orders table. This time let's add ID: 108924851, which is in the customers table:
Now we get the following output:
[1] 3 4 6 7 10
Certainly not the expected result. But perhaps more strangely, the values 5, 8, and 9 disappear compared to the prior result.
FWIW, I discovered this while working on some largish datasets (i of size ~200 million records, x of size ~25 million records). I received results that often contained a large number of "NA" values, as well as result vectors of strange sizes.
Oddly, when I tried to construct simplified examples for this issue ticket using made-up IDs, all of them worked correctly. I found the examples above by using real data.
As a mitigation for now, I simply revert to using code like:
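A minimal sketch of one such workaround, assuming tables named `customers` and `orders` that share an `ID` column. The table contents are made up apart from the IDs quoted above, and this is not necessarily the code the reporter used. It avoids the join entirely and tests membership with base R's `%in%`:

```r
library(data.table)

# Hypothetical tables; only the IDs quoted in the report are real.
customers <- data.table(ID = c(100000001L, 105257553L, 108924851L))
orders    <- data.table(ID = c(105257554L, 108924851L))

# Row numbers of orders whose ID does not appear in customers.
# This uses base R membership testing instead of a which = NA join.
missing_rows <- which(!(orders$ID %in% customers$ID))

print(missing_rows)

# The offending orders themselves, if the rows are wanted rather than indices.
print(orders[missing_rows])
```

Because `which()` indexes into `orders$ID`, the result can never exceed `nrow(orders)`, which makes it a convenient cross-check against the join-based answer.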
# Output of sessionInfo()