closes #6556 [feature request] diagnostic for merge.data.table when by = key is not present in dt being merged #6691

Closed
wants to merge 2 commits into from
3 changes: 3 additions & 0 deletions NEWS.md
@@ -1021,4 +1021,7 @@ rowwiseDT(

20. Some clarity is added to `?GForce` for the case when subtle changes to `j` produce different results because of differences in locale. Because `data.table` _always_ uses the "C" locale, small changes to queries which activate/deactivate GForce might cause confusingly different results when sorting is involved, [#5331](https://github.com/Rdatatable/data.table/issues/5331). The inspirational example compared `DT[, .(max(a), max(b)), by=grp]` and `DT[, .(max(a), max(tolower(b))), by=grp]` -- in the latter case, GForce is deactivated owing to the _ad-hoc_ column, so the result for `max(a)` might differ for the two queries. An example is added to `?GForce`. As always, there are several options to guarantee consistency, for example (1) use namespace qualification to deactivate GForce: `DT[, .(base::max(a), base::max(b)), by=grp]`; (2) turn off all optimizations with `options(datatable.optimize = 0)`; or (3) set your R session to always sort in C locale with `Sys.setlocale("LC_COLLATE", "C")` (or temporarily with e.g. `withr::with_locale()`). Thanks @markseeto for the example and @michaelchirico for the improved documentation.

Merge.data.table: Improved Error Handling
Argument by in merge.data.table() now provides more informative error messages. When columns specified in by are missing from either of the data tables (x or y), the error will now clearly list which columns are missing from each table, making debugging easier. This improvement ensures better diagnostics for users when performing merges with invalid column names. This change will help users resolve errors more efficiently during data table merges.

# data.table v1.14.10 (Dec 2023) back to v1.10.0 (Dec 2016) has been moved to [NEWS.1.md](https://github.com/Rdatatable/data.table/blob/master/NEWS.1.md)
27 changes: 23 additions & 4 deletions R/merge.R
@@ -50,11 +50,30 @@ merge.data.table = function(x, y, by = NULL, by.x = NULL, by.y = NULL, all = FAL
by = intersect(nm_x, nm_y)
if (length(by) == 0L || !is.character(by))
stopf("A non-empty vector of column names for `by` is required.")
if (!all(by %chin% intersect(nm_x, nm_y)))
stopf("Elements listed in `by` must be valid column names in x and y")
by = unname(by)
by.x = by.y = by
# UPDATED PART STARTS HERE
Contributor

This comment will stop being helpful after the suggested change is accepted, and we have more reliable ways of finding out which parts of the code are updated.

if (!all(by %chin% intersect(nm_x, nm_y))) {
# Identify which keys are missing from each data table
missing_x = setdiff(by, nm_x)
missing_y = setdiff(by, nm_y)

Contributor

@venom1204, do you see a "check warning" from the "lint-r" check around this line when you visit the files changed tab of the pull request? It is asking you to remove the spaces on line 58. Since it doesn't contain any text, it shouldn't contain any spaces either.

# Construct a more detailed error message
error_message = "Elements listed in `by` must be valid column names in x and y."

Contributor

Also remove the spaces at the beginning of line 61.

if (length(missing_x) > 0) {
error_message = paste(error_message, "\nMissing columns in 'x':", paste(missing_x, collapse = ", "))
Contributor

Here, lint-r is asking you to use toString(missing_x) instead of paste(missing_x, collapse = ", ").
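
As a quick check of that suggestion, the two calls build the same string for a plain character vector. This is a minimal base R sketch; the column names in `missing_x` below are made up for illustration and do not come from the pull request.

```r
# Made-up example values, purely for illustration
missing_x <- c("id", "date")

# Form used in the pull request
via_paste <- paste(missing_x, collapse = ", ")

# Form requested by the linter
via_toString <- toString(missing_x)

# Both calls produce the same comma-separated string
identical(via_paste, via_toString)
```

Swapping one for the other therefore should not change the text of the error message, only how the code reads.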

}
if (length(missing_y) > 0) {
error_message = paste(error_message, "\nMissing columns in 'y':", paste(missing_y, collapse = ", "))
Contributor

Same as above.

}

Contributor

Again, there are some spaces at the beginning of this line.

# Raise the error with the detailed message
stopf(error_message)
}
# UPDATED PART ENDS HERE
Contributor

Since you are using a version control system, there is no need to delimit the updated part using comments. The difference between the original code and your change request can be reliably computed by Git itself.


Contributor

Again, there are spaces on an otherwise empty line.

by = unname(by)
by.x = by.y = by
}

# warn about unused arguments #2587
if (length(list(...))) {
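
For readers who want to try the proposed diagnostic outside of data.table, the check in the `R/merge.R` diff can be approximated in base R with the review feedback applied (`toString()` instead of `paste(..., collapse = ", ")`, no marker comments, no whitespace-only lines). This is only a sketch under stated assumptions: `check_by_columns()` is a hypothetical helper name that does not exist in data.table, base `stop()` stands in for the internal `stopf()` seen in the diff, and `paste0()` is used so that no space precedes the line break.

```r
# Hypothetical stand-alone helper; not part of data.table.
# by:   column names requested for the merge
# nm_x: column names of x
# nm_y: column names of y
check_by_columns <- function(by, nm_x, nm_y) {
  # Identify which requested columns are absent from each table
  missing_x <- setdiff(by, nm_x)
  missing_y <- setdiff(by, nm_y)
  if (length(missing_x) == 0L && length(missing_y) == 0L)
    return(invisible(TRUE))

  # Base message taken from the diff, extended per table as needed
  msg <- "Elements listed in `by` must be valid column names in x and y."
  if (length(missing_x) > 0L)
    msg <- paste0(msg, "\nMissing columns in 'x': ", toString(missing_x))
  if (length(missing_y) > 0L)
    msg <- paste0(msg, "\nMissing columns in 'y': ", toString(missing_y))

  # Assumption: base stop() replaces data.table's internal stopf()
  stop(msg, call. = FALSE)
}

# Usage with made-up column names: "id" is absent from x, "grp" from y
result <- tryCatch(
  check_by_columns(c("id", "grp"), nm_x = c("grp", "v1"), nm_y = c("id", "v2")),
  error = function(e) conditionMessage(e)
)
cat(result, "\n")
```

Reporting the missing columns separately for `x` and `y` is the point of the change: the original single-line message says that something in `by` is invalid, but not which table lacks it.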