Skip to content

Commit

Permalink
add support for a list of selection expressions in .SDcols
Browse files Browse the repository at this point in the history
  • Loading branch information
Bill Evans authored and Bill Evans committed Nov 25, 2024
1 parent 546259d commit 3fa4dec
Show file tree
Hide file tree
Showing 5 changed files with 84 additions and 28 deletions.
3 changes: 2 additions & 1 deletion DESCRIPTION
Original file line number Diff line number Diff line change
Expand Up @@ -98,5 +98,6 @@ Authors@R: c(
person("Christian", "Wia", role="ctb"),
person("Elise", "Maigné", role="ctb"),
person("Vincent", "Rocher", role="ctb"),
person("Vijay", "Lulla", role="ctb")
person("Vijay", "Lulla", role="ctb"),
person("Bill", "Evans", role="ctb")
)
23 changes: 23 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -65,6 +65,29 @@ rowwiseDT(

4. `patterns()` in `melt()` combines correctly with user-defined `cols=`, which can be useful to specify a subset of columns to reshape without having to use a regex, for example `patterns("2", cols=c("y1", "y2"))` will only give `y2` even if there are other columns in the input matching `2`, [#6498](https://github.com/Rdatatable/data.table/issues/6498). Thanks to @hongyuanjia for the report, and to @tdhock for the PR.

5. `.SDcols=` now supports a list of expressions. The default action for the second and subsequent expressions will be to add to columns already selected, but if an expression is prefaced with the literal `--` then columns found in that expression will be removed from the columns selected so far. The order matters, columns removed in one `--expr` may be added in the next expression (so order of `--`-expressions within the list of expressions matters). All current forms of expressions normally supported are still allowed in the list of expressions. Thanks to @r2evans for the request and PR.

```r
DT = data.table(int1=1L, int2=2L, chr1="A", chr2="B", num1=1, num2=2, lgl=TRUE)
DT[, .SD, .SDcols = .(is.logical, !is.numeric)]
# lgl chr1 chr2
# 1: TRUE A B
DT[, .SD, .SDcols = .(patterns("r2"), c(1L, 1L, 2L, 3L))]
# chr2 int1 int1 int2 chr1
# 1: B 1 1 2 A
DT[, .SD, .SDcols = .("lgl", is.numeric)]
# lgl int1 int2 num1 num2
# 1: TRUE 1 2 1 2
DT[, .SD, .SDcols = .(patterns("1$"), --is.integer)]
# chr1 num1
# 1: A 1
DT[, .SD, .SDcols = .(patterns("1$"), !is.integer)]
# int1 chr1 num1 chr2 num2 lgl
# 1: 1 A 1 B 2 TRUE
```

Please note in the last two examples that `--` is distinct from `!`/`-`: with `--is.integer`, already-selected columns that are integers will be _removed_ from the set of columns to return, whereas with `!is.integer`, all columns that are not integers will be _added_ to the set of columns to return.

## BUG FIXES

1. `fwrite()` respects `dec=','` for timestamp columns (`POSIXct` or `nanotime`) with sub-second accuracy, [#6446](https://github.com/Rdatatable/data.table/issues/6446). Thanks @kav2k for pointing out the inconsistency and @MichaelChirico for the PR.
Expand Down
65 changes: 39 additions & 26 deletions R/data.table.R
Original file line number Diff line number Diff line change
Expand Up @@ -1004,7 +1004,16 @@ replace_dot_alias = function(e) {
ansvals = chmatchdup(ansvars, names_x)
} else {
# FR #355 - negative numeric and character indices for SDcols
colsub = substitute(.SDcols)
colsubs = substitute(.SDcols)
# FR #6619 - list of expressions for .SDcols
colsubs = if (colsubs %iscall% c(".", "list")) as.list(colsubs)[-1] else list(colsubs)
combine = function(a, b, subtract = FALSE) if (subtract) a[!a %in% b] else c(a, b[!b %in% a])
ansvars = sdvars = character()
ansvals = integer()
for (colsub in colsubs) {
# double-minus means to _remove_ columns from selection
rem_cols = (colsub %iscall% "-") && as.list(colsub[-1])[[1]] %iscall% "-"
if (rem_cols) colsub = colsub[[-1]][[-1]]
# peel from parentheses before negation so (-1L) works as well: as.data.table(as.list(1:3))[, .SD,.SDcols=(-1L)] #4231
while(colsub %iscall% "(") colsub = as.list(colsub)[[-1L]]
# fix for R-Forge #5190. colsub[[1L]] gave error when it's a symbol.
Expand All @@ -1017,48 +1026,52 @@ replace_dot_alias = function(e) {
while(colsub %iscall% "(") colsub = as.list(colsub)[[-1L]]
if (colsub %iscall% ':' && length(colsub)==3L && !is.call(colsub[[2L]]) && !is.call(colsub[[3L]])) {
# .SDcols is of the format a:b, ensure none of : arguments is a call data.table(V1=-1L, V2=-2L, V3=-3L)[,.SD,.SDcols=-V2:-V1] #4231
.SDcols = eval(colsub, setattr(as.list(seq_along(x)), 'names', names_x), parent.frame())
colsub = eval(colsub, setattr(as.list(seq_along(x)), 'names', names_x), parent.frame())
} else {
if (colsub %iscall% 'patterns') {
patterns_list_or_vector = eval_with_cols(colsub, names_x)
.SDcols = if (is.list(patterns_list_or_vector)) {
colsub = if (is.list(patterns_list_or_vector)) {
# each pattern gives a new filter condition, intersect the end result
Reduce(intersect, patterns_list_or_vector)
} else {
patterns_list_or_vector
}
} else {
.SDcols = eval(colsub, parent.frame(), parent.frame())
colsub = eval(colsub, parent.frame(), parent.frame())
# allow filtering via function in .SDcols, #3950
if (is.function(.SDcols)) {
.SDcols = lapply(x, .SDcols)
if (any(idx <- lengths(.SDcols) > 1L | vapply_1c(.SDcols, typeof) != 'logical' | vapply_1b(.SDcols, anyNA)))
if (is.function(colsub)) {
colsub = lapply(x, colsub)
if (any(idx <- lengths(colsub) > 1L | vapply_1c(colsub, typeof) != 'logical' | vapply_1b(colsub, anyNA)))
stopf("When .SDcols is a function, it is applied to each column; the output of this function must be a non-missing boolean scalar signalling inclusion/exclusion of the column. However, these conditions were not met for: %s", brackify(names(x)[idx]))
.SDcols = unlist(.SDcols, use.names = FALSE)
colsub = unlist(colsub, use.names = FALSE)
}
}
}
if (anyNA(.SDcols))
stopf(".SDcols missing at the following indices: %s", brackify(which(is.na(.SDcols))))
if (is.logical(.SDcols)) {
if (length(.SDcols)!=length(x)) stopf(".SDcols is a logical vector of length %d but there are %d columns", length(.SDcols), length(x))
ansvals = which_(.SDcols, !negate_sdcols)
ansvars = sdvars = names_x[ansvals]
} else if (is.numeric(.SDcols)) {
.SDcols = as.integer(.SDcols)
# if .SDcols is numeric, use 'dupdiff' instead of 'setdiff'
if (length(unique(sign(.SDcols))) > 1L) stopf(".SDcols is numeric but has both +ve and -ve indices")
if (any(idx <- abs(.SDcols)>ncol(x) | abs(.SDcols)<1L))
if (anyNA(colsub))
stopf(".SDcols missing at the following indices: %s", brackify(which(is.na(colsub))))
if (is.logical(colsub)) {
if (length(colsub)!=length(x)) stopf(".SDcols is a logical vector of length %d but there are %d columns", length(colsub), length(x))
newvars = which_(colsub, !negate_sdcols)
ansvals = combine(ansvals, newvars, rem_cols)
ansvars = sdvars = combine(ansvars, names_x[newvars], rem_cols)
} else if (is.numeric(colsub)) {
colsub = as.integer(colsub)
# if colsub is numeric, use 'dupdiff' instead of 'setdiff'
if (length(unique(sign(colsub))) > 1L) stopf(".SDcols is numeric but has both +ve and -ve indices")
if (any(idx <- abs(colsub)>ncol(x) | abs(colsub)<1L))
stopf(".SDcols is numeric but out of bounds [1, %d] at: %s", ncol(x), brackify(which(idx)))
ansvars = sdvars = if (negate_sdcols) dupdiff(names_x[-.SDcols], bynames) else names_x[.SDcols]
ansvals = if (negate_sdcols) setdiff(seq_along(names(x)), c(.SDcols, which(names(x) %chin% bynames))) else .SDcols
newvars = if (negate_sdcols) dupdiff(names_x[-colsub], bynames) else names_x[colsub]
ansvars = sdvars = combine(ansvars, newvars, rem_cols)
ansvals = combine(ansvals, if (negate_sdcols) setdiff(seq_along(names(x)), c(colsub, which(names(x) %chin% bynames))) else colsub, rem_cols)
} else {
if (!is.character(.SDcols)) stopf(".SDcols should be column numbers or names")
if (!all(idx <- .SDcols %chin% names_x))
stopf("Some items of .SDcols are not column names: %s", brackify(.SDcols[!idx]))
ansvars = sdvars = if (negate_sdcols) setdiff(names_x, c(.SDcols, bynames)) else .SDcols
if (!is.character(colsub)) stopf(".SDcols should be column numbers or names")
if (!all(idx <- colsub %chin% names_x))
stopf("Some items of .SDcols are not column names: %s", brackify(colsub[!idx]))
newvars = if (negate_sdcols) setdiff(names_x, c(colsub, bynames)) else colsub
ansvars = sdvars = combine(ansvars, newvars, rem_cols)
# dups = FALSE here. DT[, .SD, .SDcols=c("x", "x")] again doesn't really help with which 'x' to keep (and if '-' which x to remove)
ansvals = chmatch(ansvars, names_x)
ansvals = combine(ansvals, chmatch(newvars, names_x), rem_cols)
}
}
}
# fix for long standing FR/bug, #495 and #484
Expand Down
9 changes: 9 additions & 0 deletions inst/tests/tests.Rraw
Original file line number Diff line number Diff line change
Expand Up @@ -20596,3 +20596,12 @@ test(2295.3, is.data.table(d2))

# #6588: .checkTypos used to give arbitrary strings to stopf as the first argument
test(2296, d2[x %no such operator% 1], error = '%no such operator%')

# #6619: .SDcols supports list of expressions
DT = data.table(int1=1L, int2=2L, chr1="A", chr2="B", num1=1, num2=2, lgl=TRUE)
test(2297.1, DT[, .SD, .SDcols = .(is.logical, !is.numeric)], DT[, .(lgl, chr1, chr2)])
test(2297.2, DT[, .SD, .SDcols = .(patterns("r2"), c(1L, 1L, 2L, 3L))], DT[, .(chr2, int1, int1, int2, chr1)])
test(2297.3, DT[, .SD, .SDcols = .("lgl", is.numeric)], DT[, .(lgl, int1, int2, num1, num2)])
# difference between `--` and `!`
test(2297.4, DT[, .SD, .SDcols = .(patterns("1$"), --is.integer)], DT[, .(chr1, num1)])
test(2297.5, DT[, .SD, .SDcols = .(patterns("1$"), !is.integer)], DT[, .(int1, chr1, num1, chr2 ,num2, lgl)])
12 changes: 11 additions & 1 deletion man/data.table.Rd
Original file line number Diff line number Diff line change
Expand Up @@ -152,7 +152,17 @@ data.table(\dots, keep.rownames=FALSE, check.names=FALSE, key=NULL, stringsAsFac
Inversion (column dropping instead of keeping) can be accomplished be prepending the argument with \code{!} or \code{-} (there's no difference between these), e.g. \code{.SDcols = !c('x', 'y')}.

Finally, you can filter columns to include in \code{.SD} based on their \emph{names} according to regular expressions via \code{.SDcols=patterns(regex1, regex2, ...)}. The included columns will be the \emph{intersection} of the columns identified by each pattern; pattern unions can easily be specified with \code{|} in a regex. You can filter columns on \code{values} by passing a function, e.g. \code{.SDcols=\link{is.numeric}}. You can also invert a pattern as usual with \code{.SDcols=!patterns(...)} or \code{.SDcols=!is.numeric}.
You can filter columns to include in \code{.SD} based on their \emph{names} according to regular expressions via \code{.SDcols=patterns(regex1, regex2, ...)}. The included columns will be the \emph{intersection} of the columns identified by each pattern; pattern unions can easily be specified with \code{|} in a regex. You can filter columns on \code{values} by passing a function, e.g. \code{.SDcols=\link{is.numeric}}. You can also invert a pattern as usual with \code{.SDcols=!patterns(...)} or \code{.SDcols=!is.numeric}.

Finally, you can combine any of the other column-selection methods in a list of different column-selection methods. Considerations:

\itemize{
\item The selection of columns is \emph{additive}, meaning that \code{.SDcols=.(is.numeric, 'x')} will include all columns that are numeric and the column named 'x' (whether or not it contains numeric data).
\item Selection-inversion (selecting columns that do not meet a condition) works the same using the traditional \code{!} or \code{-}.
\item Deduplication. Columns in one element that were selected in previous elements will not be added again. For example, if column \code{x} is a numeric column, then \code{.SDcols=.(is.numeric, 'x')} will return only one instance of the \code{x} column. One can still explicitly choose duplicate columns within each selection, such as \code{.SDcols=.(is.numeric, c(1,1,2,3))}.
\item Removing columns that were selected in previous elements in the list is done with a double-minus, as in \code{--cols}. This supports any of the accepted column-selection methods (integers, strings, function calls, etc). For example, \code{.SDcols=.(is.numeric, --'x'} will select all numeric columns except one whose column name is 'x'. Because this only removes columns that were previously selected, a \code{--cols} removal as the first element of \code{.SDcols} has no effect.
}

}
\item{verbose}{ \code{TRUE} turns on status and information messages to the console. Turn this on by default using \code{options(datatable.verbose=TRUE)}. The quantity and types of verbosity may be expanded in future.

Expand Down

0 comments on commit 3fa4dec

Please sign in to comment.