Both columns for rolling and non-equi joins #3093

mattdowle · 2018-10-03T22:00:10Z

I can't push to remote branch sritchie73:non-equi-key so copied it into main from PR #2706.
Looks like commit history from Scott is retained which is good, and retained the branch name.
The extended discussion history will just be in the previous PR #2706 though. Can't think of a way round this. Scott is member of project so can create branches in main going forward.

Closes #1615
Closes #1700
Closes #2006
Closes #2569
Other related issues already closed (e.g. dups) : #1469, #1807, #2307, #2595, #2602

test.data.table() fails with 28 errors when datatable.bothkeycols is set to TRUE (caused by typo datatable.bothkeycol vs datatable.bothkeycols)
bothkeycols= renamed to old.nonequi= (explanation in comments below)
Place extra roll= column just after the first one, not at the end of the result.
Add short examples to the news item

New SQL-like column return now works when j is provided and there are only non-equi / roll key columns in 'on'.

…n(s) 129: added additional V1 column from J() to expected output. 130: added additional V1 column from J() to expected output. 131: added additional V1 column from J() to expected output. 148: added additional V1 column from CJ() to expected output. 149: added additional V1 column from J() to expected output. 917: updated expected output. added i.ts and updated ts (from x) to contain matched values from x$ts instead of matched values from i$ts. 1317.1: updated expected output. added i.y and updated y (from x) to contain matched values from x$y instead of matched values from i$y. 1317.2: updated expected output. added i.y and updated y (from x) to contain matched values from x$y instead of matched values from i$y. 1470.1: rolling join on J(), renamed x column from output as V1 and put in correct position, then added x as DT$x. 1470.2: rolling join on J(), renamed x column from output as V1 and put in correct position, then added x as DT$x. 1470.3: rolling join on SJ(), renamed x column from output as V1 and put in correct position, then added x as DT$x. 1614.1: updated expected output. renamed x column from output as V1 and put in correct position, then added dt$x column to expected output. 1614.2: updated expected output. renamed x column from output as V1 and put in correct position, then added dt$x column to expected output. 1614.3: updated expected output. renamed x column from output as V1 and put in correct position, then added dt$x column to expected output. 1614.4: updated expected output. renamed x column from output as V1 and put in correct position, then added dt$x column to expected output. 1660.1: updated expected return columns. 1660.2: updated expected return columns. 1682.1: Updated `by` statement in chained operator to use column names in dt2 to match columns used by `by=.EACHI` 1682.2: Updated `by` statement in chained operator to use column names in dt2 to match columns used by `by=.EACHI` 1682.3: Updated `by` statement in chained operator to use column names in dt2 to match columns used by `by=.EACHI` 1682.4: Updated `by` statement in chained operator to use column names in dt2 to match columns used by `by=.EACHI` 1682.5: Updated `by` statement in chained operator to use column names in dt2 to match columns used by `by=.EACHI` 1682.6: Updated `by` statement in chained operator to use column names in dt2 to match columns used by `by=.EACHI` 1682.7: Updated `by` statement in chained operator to use column names in dt2 to match columns used by `by=.EACHI` New bugs (test ouput different from expected new behaviour): 1660.1: "End" column missing from output, and Location column has contents of dt1$end even though it is involved in a non-equi join. 1660.2: "Location" column missing from output. 1682.1: Column V2 is missing from non-equi join result 1682.2: Column V2 is missing from non-equi join result 1682.3: Column V2 is missing from non-equi join result 1682.4: Column V2 is missing from non-equi join result 1682.5: Column V2 is missing from non-equi join result 1682.6: Column V2 is missing from non-equi join result 1682.7: Column V2 is missing from non-equi join result Bugs that have regressed with this PR and need to be patched again: 1469.2: after roll DT now has a key when it shouldnt Tests that need following up, but may not be due to introduced bugs 1540.15: unexpected difference in column order (but not contents) after rolling join. 1614.4: values in dt$x do not match those expected based on values in dt$y, but this might be due to rollends=TRUE. 1846.1: columns do not match - check that handling of 0-row dt in i is correct based on linked issue. 1846.2 columns do not match - check that handling of 0-row dt in i is correct based on linked issue.

Expected output still had value from `J(4)` in `a` column instead of `TESTDT$a`.

Expected output still had values from `CJ(2,3)` in `a` column instead of `dt$a`.

missing c() inside of as.integer64

Added columns from empty d1 to expected output and and added column names prefixed with i. for the columns from d2.

Tests 1660.1, 1660.2 and 1846.1 now pass

To get the equivalent grouping of `by = .EACHI` when chaining operators you now have to specify `by = .(key.x = key.i)` to get the same answer and column names

This test checks whether a rolling join can be perform when a key has been set in the left data.table (DT1.copy), but not in the the right data.table (DT3). DT3 has to be reordered in this case so that the column to join on is the first column in DT3. Now that the rolling join column y is returned from both DT1 and DT3, the test needed to be updated to reorder the columns in the expected output to match the ordering in the tested join.

Updated columns extracted in j to `V1` to reflect `x` is now from `DT` rather than `J(c(2,0))`. When extracting `x` instead of `V1`, `irows` is sorted so the result has `key() == "x"`

Functionality tested is captured by earlier unit tests that have been updated.

When TRUE (default) non-equi and roll columns are returned from both x and i.

options(datatable.bothkeycols=FALSE) is set at the top of the file so these tests can remain unmodified

incremented test number in tests.Rraw to solve merge conflict. Merge branch 'master' into non-equi-key # Conflicts: # inst/tests/tests.Rraw

jangorecki · 2020-04-10T12:21:48Z

@sritchie73 could you please merge to recent master and work out code coverage?

tdhock · 2021-10-05T18:29:51Z

hi any update/plan to merge this soon? I would love to use non-equi join more often in research and teaching and this PR would make it so much easier to use/explain.

tdhock · 2024-11-28T18:37:11Z

I'm not sure how easy this will be to get work after all this time, but I think it is an important feature, so I tried to resolve the conflicts with master.

tdhock · 2024-11-28T18:41:32Z

on master we have in data.table.R

      ans = bmerge(i, x, leftcols, rightcols, roll, rollends, nomatch, mult, ops, verbose=verbose)

whereas that line in this branch is now

      ans = bmerge(i, x, leftcols, rightcols, xo, roll, rollends, nomatch, mult, ops, nqgrp, nqmaxgrp, verbose=verbose)

https://github.com/Rdatatable/data.table/blame/c6e5a0384ce2d6fe5f45406ffc07baea9ab2d460/R/data.table.R#L513 shows this line was modified in this pr, but why is the definition of bmerge not consistently updated in this pr?
I guess master removed some arguments in bmerge, relative to when this pr started.
there were conflicts on the line above (usage of bmerge in data.table.R), but not in definition of bmerge in bmerge.R.

currently bmerge.R is defined as below so I guess the solution is to remove the extra arguments.

bmerge = function(i, x, icols, xcols, roll, rollends, nomatch, mult, ops, verbose)

github-actions · 2024-11-28T18:50:55Z

Generated via commit 7889c86

Download link for the artifact containing the test results: ↓ atime-results.zip

Task	Duration
R setup and installing dependencies	4 minutes and 43 seconds
Installing different package versions	7 minutes and 35 seconds
Running and plotting the test cases	2 minutes and 11 seconds

tdhock · 2024-11-28T19:30:59Z

we are getting non_equi not found error now.
https://github.com/Rdatatable/data.table/blame/ded8afba061cf016064d6efaa26483c72f9aca06/R/data.table.R#L502 says that non_equi was present before trying to merge with master. it was previously touched in #2420 (Oct 2017)
if(... is.na(non_equi) ) is deleted on master, but used in this pr.
the merge conflict diff shows that, but not the definition of non_equi.
i guess we have to dig in the previous version of the code before I attempted the merge, https://github.com/Rdatatable/data.table/blob/ded8afba061cf016064d6efaa26483c72f9aca06/R/data.table.R
sorry for the messy merge conflict.
this is the first merge conflict in which I have had to dig for contexual code outside the <<< === >>> merge conflict diff block (which usually is sufficient in my experience).
I computed a diff between the two versions of the code which seem to be analogous

diff -u "c:/Users/hoct2726/R/data.table/R/data.table-master.R" "c:/Users/hoct2726/R/data.table/R/data.table-before-merge.R"
--- c:/Users/hoct2726/R/data.table/R/data.table-master.R	2024-11-28 14:37:50.101759300 -0500
+++ c:/Users/hoct2726/R/data.table/R/data.table-before-merge.R	2024-11-28 14:37:47.261831400 -0500
@@ -1,43 +1,78 @@
-    if (is.data.frame(i)) {
-      if (missing(on)) {
-        if (!haskey(x)) {
-          stopf("When i is a data.table (or character vector), the columns to join by must be specified using 'on=' argument (see ?data.table), by keying x (i.e. sorted, and, marked as sorted, see ?setkey), or by sharing column names between x and i (i.e., a natural join). Keyed joins might have further speed benefits on very large data due to x being sorted in RAM.")
-        }
-      } else if (identical(substitute(on), as.name(".NATURAL"))) {
-        naturaljoin = TRUE
-      }
-      if (naturaljoin) { # natural join #629
-        common_names = intersect(names_x, names(i))
-        len_common_names = length(common_names)
-        if (!len_common_names) stopf("Attempting to do natural join but no common columns in provided tables")
-        if (verbose) {
-          which_cols_msg = if (len_common_names == length(x)) {
-            catf("Joining but 'x' has no key, natural join using all 'x' columns")
-          } else {
-            catf("Joining but 'x' has no key, natural join using: %s", brackify(common_names))
-          }
-        }
-        on = common_names
+    if (is.data.table(i)) {
+      if (!haskey(x) && missing(on) && is.null(xo)) {
+        stop("When i is a data.table (or character vector), the columns to join by must be specified either using 'on=' argument (see ?data.table) or by keying x (i.e. sorted, and, marked as sorted, see ?setkey). Keyed joins might have further speed benefits on very large data due to x being sorted in RAM.")
       }
       if (!missing(on)) {
         # on = .() is now possible, #1257
         on_ops = .parse_on(substitute(on), isnull_inames)
         on = on_ops[[1L]]
         ops = on_ops[[2L]]
-        if (any(ops > 1L)) { ## fix for #4489;  ops = c("==", "<=", "<", ">=", ">", "!=")
-          allow.cartesian = TRUE
-        }
         # TODO: collect all '==' ops first to speeden up Cnestedid
-        rightcols = colnamesInt(x, names(on), check_dups=FALSE)
-        leftcols  = colnamesInt(i, unname(on), check_dups=FALSE)
-      } else {
-        ## missing on
-        rightcols = chmatch(key(x), names_x)   # NAs here (i.e. invalid data.table) checked in bmerge()
+        rightcols = chmatch(names(on), names(x))
+        if (length(nacols <- which(is.na(rightcols))))
+          stop("Column(s) [", paste(names(on)[nacols], collapse=","), "] not found in x")
+        leftcols  = chmatch(unname(on), names(i))
+        if (length(nacols <- which(is.na(leftcols))))
+          stop("Column(s) [", paste(unname(on)[nacols], collapse=","), "] not found in i")
+        # figure out the columns on which to compute groups on
+        non_equi = which.first(ops != 1L) # 1 is "==" operator
+        if (!is.na(non_equi)) {
+          # non-equi operators present.. investigate groups..
+          if (verbose) cat("Non-equi join operators detected ... \n")
+          if (!missingroll) stop("roll is not implemented for non-equi joins yet.")
+          if (verbose) {last.started.at=proc.time();cat("  forder took ... ");flush.console()}
+          # TODO: could check/reuse secondary indices, but we need 'starts' attribute as well!
+          xo = forderv(x, rightcols, retGrp=TRUE)
+          if (verbose) {cat(timetaken(last.started.at),"\n"); flush.console()}
+          xg = attr(xo, 'starts')
+          resetcols = head(rightcols, non_equi-1L)
+          if (length(resetcols)) {
+            # TODO: can we get around having to reorder twice here?
+            # or at least reuse previous order?
+            if (verbose) {last.started.at=proc.time();cat("  Generating group lengths ... ");flush.console()}
+            resetlen = attr(forderv(x, resetcols, retGrp=TRUE), 'starts')
+            resetlen = .Call(Cuniqlengths, resetlen, nrow(x))
+            if (verbose) {cat("done in",timetaken(last.started.at),"\n"); flush.console()}
+          } else resetlen = integer(0L)
+          if (verbose) {last.started.at=proc.time();cat("  Generating non-equi group ids ... ");flush.console()}
+          nqgrp = .Call(Cnestedid, x, rightcols[non_equi:length(rightcols)], xo, xg, resetlen, mult)
+          if (verbose) {cat("done in",timetaken(last.started.at),"\n"); flush.console()}
+          if (length(nqgrp)) nqmaxgrp = max(nqgrp) # fix for #1986, when 'x' is 0-row table max(.) returns -Inf.
+          if (nqmaxgrp > 1L) { # got some non-equi join work to do
+            if ("_nqgrp_" %in% names(x)) stop("Column name '_nqgrp_' is reserved for non-equi joins.")
+            if (verbose) {last.started.at=proc.time();cat("  Recomputing forder with non-equi ids ... ");flush.console()}
+            set(nqx<-shallow(x), j="_nqgrp_", value=nqgrp)
+            xo = forderv(nqx, c(ncol(nqx), rightcols))
+            if (verbose) {cat("done in",timetaken(last.started.at),"\n"); flush.console()}
+          } else nqgrp = integer(0L)
+          if (verbose) cat("  Found", nqmaxgrp, "non-equi group(s) ...\n")
+        }
+        if (is.na(non_equi)) {
+          # equi join. use existing key (#1825) or existing secondary index (#1439)
+          if ( identical(head(key(x), length(on)), names(on)) ) {
+            xo = integer(0L)
+            if (verbose) cat("on= matches existing key, using key\n")
+          } else {
+            if (isTRUE(getOption("datatable.use.index"))) {
+              xo = getindex(x, names(on))
+              if (verbose && !is.null(xo)) cat("on= matches existing index, using index\n")
+            }
+            if (is.null(xo)) {
+              if (verbose) {last.started.at=proc.time(); flush.console()}
+              xo = forderv(x, by = rightcols)
+              if (verbose) {cat("Calculated ad hoc index in",timetaken(last.started.at),"\n"); flush.console()}
+              # TODO: use setindex() instead, so it's cached for future reuse
+            }
+          }
+        }
+      } else if (is.null(xo)) {
+        rightcols = chmatch(key(x),names(x))   # NAs here (i.e. invalid data.table) checked in bmerge()
         leftcols = if (haskey(i))
-          chmatch(head(key(i), length(rightcols)), names(i))
+          chmatch(head(key(i),length(rightcols)),names(i))
         else
           seq_len(min(length(i),length(rightcols)))
         rightcols = head(rightcols,length(leftcols))
+        xo = integer()  ## signifies 1:.N
         ops = rep(1L, length(leftcols))
       }
       # Implementation for not-join along with by=.EACHI, #604

I could use some help from anyone who has hacked data.table.R (this is my first time).

jangorecki · 2024-11-29T12:56:49Z

I would rather look at the diff of this PR vs master at the time of fork, then understand the logic, then apply same logic in a fresh branch forked from current master.

sritchie73 · 2024-11-29T13:05:53Z

👍 to @jangorecki 's suggestion - from recollection at the time it was also one of my first attempts at modifying the data.table package, so the diff between the original upstream commit and the changes made should be relatively small. You may then want to create a new clean branch from the current master and incorporate that same logic as needed, rather than modifying my very old branch.

sritchie73 and others added 30 commits March 26, 2018 13:54

Non-equi and roll keep cols from x and i.

3ff31a3

All columns kept when only non-equi / roll keys

1114038

= instead of <-

c4b5bcf

Updated non-equi / rolling joins with j

f593ac9

New SQL-like column return now works when j is provided and there are only non-equi / roll key columns in 'on'.

non-equi and roll now work with by = .EACHI again

0e9a7cb

typo introduced in test 149

1f05b17

Fixed tests 129, 130, 131

d1719d4

Expected output still had value from `J(4)` in `a` column instead of `TESTDT$a`.

Fixed tests 148 and 149

f64f74b

Expected output still had values from `CJ(2,3)` in `a` column instead of `dt$a`.

Fixed tests 1614.x

fe6ac41

missing c() inside of as.integer64

Fixed tests 1846.x

61e2c92

Added columns from empty d1 to expected output and and added column names prefixed with i. for the columns from d2.

All non-equi keys dropped instead of first.

a955915

Tests 1660.1, 1660.2 and 1846.1 now pass

Updated tests 1682.x

fba50ed

To get the equivalent grouping of `by = .EACHI` when chaining operators you now have to specify `by = .(key.x = key.i)` to get the same answer and column names

Fixed test 1469.2

14768d1

Updated columns extracted in j to `V1` to reflect `x` is now from `DT` rather than `J(c(2,0))`. When extracting `x` instead of `V1`, `irows` is sorted so the result has `key() == "x"`

Additional unit tests not required

9d6c44d

Functionality tested is captured by earlier unit tests that have been updated.

Merge branch 'master' into non-equi-key

df857c8

Added option bothkeycols to control column return

5596774

When TRUE (default) non-equi and roll columns are returned from both x and i.

Copied changed unit tests to end

baf13f1

Reverted tests to use bothkeycols = FALSE

2419f70

Documentation for bothkeycols

6f4f11d

Added option to for bothkeycols and used in tests

403e225

Added NEWS entry

2648d2b

Reverted old unit tests to master

5dfdbd5

options(datatable.bothkeycols=FALSE) is set at the top of the file so these tests can remain unmodified

Reverted newline changes in tests.Rraw

c57a6c5

merging in master

b571618

incremented test number in tests.Rraw to solve merge conflict. Merge branch 'master' into non-equi-key # Conflicts: # inst/tests/tests.Rraw

Merge branch 'master' into non-equi-key

2e3037e

Added additional test to maintain code coverage

5b3cd83

paste(..., sep="") -> paste0 in PR

8b9a920

Merge branch 'master' into non-equi-key

1bb30b5

mattdowle added this to the 1.13.0 milestone Sep 19, 2019

MichaelChirico mentioned this pull request Oct 28, 2019

rolling join keeps only one on column, takes name from x table and values from i table #4005

Open

mattdowle modified the milestones: 1.12.7, 1.12.9 Dec 8, 2019

jangorecki mentioned this pull request May 4, 2020

Unexpected result from non-equi join without explicit declaration of columns in j #4426

Closed

mattdowle modified the milestones: 1.13.1, 1.13.3 Oct 17, 2020

Henrik-P mentioned this pull request Mar 31, 2021

Unexpected result from non-equi-join #4949

Open

jangorecki modified the milestones: 1.14.3, 1.14.5 Jul 19, 2022

tdhock mentioned this pull request Mar 21, 2023

Non-equi join showing columns in a not so perfect style #5593

Open

jangorecki modified the milestones: 1.14.11, 1.15.1 Oct 29, 2023

jangorecki removed this from the 1.16.0 milestone Nov 6, 2023

tdhock mentioned this pull request Jul 18, 2024

Bizarre behavior of by=.EACHI + update join (:=) #6298

Open

myoung3 mentioned this pull request Sep 13, 2024

Conditional left join | left table column data is wrong #6495

Closed

tdhock self-assigned this Nov 28, 2024

tdhock added the non-equi joins rolling, overlapping, non-equi joins label Nov 28, 2024

tdhock added this to the 1.17.0 milestone Nov 28, 2024

merge master

c6e5a03

tdhock requested a review from MichaelChirico as a code owner November 28, 2024 18:35

use new bmerge

7889c86

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Both columns for rolling and non-equi joins #3093

Both columns for rolling and non-equi joins #3093

mattdowle commented Oct 3, 2018 •

edited

Loading

jangorecki commented Apr 10, 2020

tdhock commented Oct 5, 2021

tdhock commented Nov 28, 2024

tdhock commented Nov 28, 2024 •

edited

Loading

github-actions bot commented Nov 28, 2024 •

edited

Loading

tdhock commented Nov 28, 2024 •

edited

Loading

jangorecki commented Nov 29, 2024

sritchie73 commented Nov 29, 2024

Both columns for rolling and non-equi joins #3093

Are you sure you want to change the base?

Both columns for rolling and non-equi joins #3093

Conversation

mattdowle commented Oct 3, 2018 • edited Loading

jangorecki commented Apr 10, 2020

tdhock commented Oct 5, 2021

tdhock commented Nov 28, 2024

tdhock commented Nov 28, 2024 • edited Loading

github-actions bot commented Nov 28, 2024 • edited Loading

tdhock commented Nov 28, 2024 • edited Loading

jangorecki commented Nov 29, 2024

sritchie73 commented Nov 29, 2024

mattdowle commented Oct 3, 2018 •

edited

Loading

tdhock commented Nov 28, 2024 •

edited

Loading

github-actions bot commented Nov 28, 2024 •

edited

Loading

tdhock commented Nov 28, 2024 •

edited

Loading