-
Notifications
You must be signed in to change notification settings - Fork 993
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[cbindlist/mergelist] Implement cbindlist #6435
base: cbind-merge-list-utils
Are you sure you want to change the base?
Conversation
Warning This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
This stack of pull requests is managed by Graphite. Learn more about stacking. |
090a21b
to
f1ee884
Compare
874e7af
to
e4f3bc3
Compare
Generated via commit 4664e84 Download link for the artifact containing the test results: ↓ atime-results.zip
|
e4f3bc3
to
5e149be
Compare
} | ||
\arguments{ | ||
\item{l}{ \code{list} of \code{data.table}s to merge. } | ||
\item{copy}{ \code{logical}, decides if columns has to be copied into resulting object (default) or just referred. } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this description is not clear to me, and there's no example
A new \code{data.table} based on the stacked objects. Eventually when \code{copy} is \code{FALSE}, then resulting object will share columns with \code{l} tables. | ||
} | ||
\note{ | ||
If output object has any duplicate names, then key and indices are removed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does that also apply if the duplicates are not among the key/index columns?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes and this seems the save approach, since setnames
does currently not check if keys/indices are doubled.
\code{\link{data.table}}, \code{\link{rbindlist}} | ||
} | ||
\examples{ | ||
l = list( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add more examples, e.g. copy=FALSE
, and demonstrating the behavior for keys/indices, and possibly duplicate names.
if (isNull(t)) | ||
SET_ATTRIB(to, shallow_duplicate(f)); | ||
else { | ||
for (t = ATTRIB(to); CDR(t) != R_NilValue; t = CDR(t)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I understand what's happening here. Please add a one-sentence description of the mergeIndexAttrib
routine as a comment.
Did you explore implementing this in R without needing any new C code? |
test(13.01, class(d), "list") | ||
setDT(d) | ||
test(13.02, key(d), "x") | ||
# test(13.03, hasindex(d, "y") && hasindex(d, "z")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing test for case when keyed entry is not the first input
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See added testcase below
4b4b117
to
bd543a4
Compare
88b1e3c
to
90a04e3
Compare
bd543a4
to
89baf39
Compare
90a04e3
to
4478411
Compare
89baf39
to
092c4f2
Compare
4478411
to
f9d245f
Compare
092c4f2
to
6bc0ebe
Compare
f9d245f
to
8bba6af
Compare
6bc0ebe
to
054b426
Compare
8bba6af
to
0c42ca9
Compare
054b426
to
104703a
Compare
0c42ca9
to
e6971cf
Compare
104703a
to
4b1ea90
Compare
e6971cf
to
3ef33d7
Compare
4b1ea90
to
b9d0773
Compare
add cbind by reference, timing R prototype of mergelist wording use lower overhead funs stick to int32 for now, correct R_alloc bmerge C refactor for codecov and one loop for speed address revealed codecov gaps refactor vecseq for codecov seqexp helper, some alloccol export on C bmerge codecov, types handled in R bmerge already better comment seqexp bmerge mult=error #655 multiple new C utils swap if branches explain new C utils comments mostly reduce conflicts to PR #4386 comment C code address multiple matches during update-on-join #3747 Revert "address multiple matches during update-on-join #3747" This reverts commit b64c0c3. merge.dt has temporarily mult arg, for testing minor changes to cbindlist c dev mergelist, for single pair now add quiet option to cc() mergelist tests add check for names to perhaps.dt rm mult from merge.dt method rework, clean, polish multer, fix righ and full joins make full join symmetric mergepair inner function to loop on extra check for symmetric mergelist manual ensure no df-dt passed where list expected comments and manual handle 0 cols tables more tests more tests and debugging move more logic closer to bmerge, simplify mergepair more tests revert not used changes reduce not needed checks, cleanup copy arg behavior, manual, no tests yet cbindlist manual, export both cleanup processing bmerge to dtmatch test function match order for easier preview vecseq gets short-circuit batch test allow browser big cleanup remmove unneeded stuff, reduce diff more cleanup, minor manual fixes add proper test scripts Merge branch 'master' into cbind-merge-list comment out not used code for coverage more tests, some nocopy opts rename sql test script, should fix codecov simplify dtmatch inner branch more precise copy, now copy only T or F unused arg not yet in api, wording comments and refer issues codecov hasindex coverage codecov gap tests for join using key, cols argument fix missing import forderv more tests, improve missing on handling more tests for order of inner and full join for long keys new allow.cartesian option, #4383, #914 reduce diff, improve codecov reduce diff, comments need more DT, not lists, mergelist 3+ tbls proper escape heavy check unit tests more tests, address overalloc failure mergelist and cbindlist retain index manual, examples fix manual minor clarify in manual retain keys, right outer join for snowflake schema joins duplicates in cbindlist recycling in cbindlist escape 0 input in copyCols empty input handling closing cbindlist vectorized _on_ and _join.many_ arg rename dtmatch to dtmerge vectorized args: how, mult push down input validation add support for cross join, semi join, anti join full join, reduce overhead for mult=error mult default value dynamic fix manual add "see details" to Rd mention shared on in arg description amend feedback from Michael semi and anti joins will not reorder x columns Merge branch 'master' into cbind-merge-list spelling, thx to @jan-glx check all new funs used and add comments bugfix, sort=T needed for now Merge branch 'master' into cbind-merge-list Update NEWS.md Merge branch 'master' into cbind-merge-list Merge branch 'master' into cbind-merge-list NEWS placement numbering ascArg->order Merge remote-tracking branch 'origin/cbind-merge-list' into cbind-merge-list attempt to restore from master Update to stopf() error style Need isFrame for now More quality checks: any(!x)->!all(x); use vapply_1{b,c,i} really restore from master try to PROTECT() before duplicate() update error message in test appease the rchk gods extraneous space missing ';' use catf simplify perhapsDataTableR move sqlite.Rraw.manual into other.Rraw simplify for loop Merge remote-tracking branch 'origin/cbind-merge-list' into cbind-merge-list
b9d0773
to
4664e84
Compare
Hey @jangorecki, PTAL at the comments here, your feedback would be helpful. Thanks! |
ans = .Call(Ccbindlist, l, copy) | ||
if (anyDuplicated(names(ans))) { ## invalidate key and index | ||
setattr(ans, "sorted", NULL) | ||
setattr(ans, "index", integer()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
setattr(ans, "index", integer()) | |
setattr(ans, "index", NULL) |
Since we also do not set "index"
to integer()
initially, see also .shallow
if (isNull(key)) // first key is retained | ||
key = getAttrib(thisx, sym_sorted); | ||
UNPROTECT(protecti); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} | |
} | |
if (isNull(ATTRIB(index))) | |
setAttrib(ans, sym_index, R_NilValue); |
do not set index as empty vector
l = list( | ||
d1 = data.table(x=1:3, v1=1L), | ||
d2 = data.table(y=3:1, v2=2L), | ||
d3 = data.table(z=2:4, v3=3L) | ||
) | ||
cbindlist(l) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
l = list( | |
d1 = data.table(x=1:3, v1=1L), | |
d2 = data.table(y=3:1, v2=2L), | |
d3 = data.table(z=2:4, v3=3L) | |
) | |
cbindlist(l) | |
d1 = data.table(x=1:3, v1=1L, key="x") | |
d2 = data.table(y=3:1, v2=2L, key="y") | |
d3 = data.table(z=2:4, v3=3L) | |
cbindlist(list(d1, d2, d3)) | |
cbindlist(list(d1, d1)) | |
d4 = cbindlist(list(d1), copy=FALSE) | |
d4[, v1:=2L] | |
identical(d4, d1) |
also subject to the change that cbindlist()
does not automatically add index=integer()
as attribute
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if there possibly was a reason why integer() was used instead of NULL. You may want to look at other PRs that this one was split into, maybe there is a use of it. Unfortunately don't remember now.
if (isNull(from)) | ||
return; | ||
SEXP t = ATTRIB(to), f = ATTRIB(from); | ||
if (isNull(f)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (isNull(f)) | |
if (isNull(f)) // nothing to merge |
SEXP t = ATTRIB(to), f = ATTRIB(from); | ||
if (isNull(f)) | ||
return; | ||
if (isNull(t)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (isNull(t)) | |
if (isNull(t)) // target has no attributes -> overwrite |
if (isNull(t)) | ||
SET_ATTRIB(to, shallow_duplicate(f)); | ||
else { | ||
for (t = ATTRIB(to); CDR(t) != R_NilValue; t = CDR(t)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for (t = ATTRIB(to); CDR(t) != R_NilValue; t = CDR(t)); | |
for (t = ATTRIB(to); CDR(t) != R_NilValue; t = CDR(t)); // traverse to end of attributes list of to |
for (t = ATTRIB(to); CDR(t) != R_NilValue; t = CDR(t)); | ||
SETCDR(t, shallow_duplicate(f)); | ||
} | ||
return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return; |
skip empty return at end
SEXP ans = PROTECT(allocVector(VECSXP, nans)); | ||
SEXP index = PROTECT(allocVector(INTSXP, 0)); | ||
SEXP key = R_NilValue; | ||
setAttrib(ans, sym_index, index); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could also set index below once, after checking if it contains atttributes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We set here integer() btw
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes and AFAIU the reason for setting it to integer()
because it is convenient to merge in the other indices which is pretty neat, but right now we are always setting it regardless of whether the result will have indices or.
ans = cbindlist(l) | ||
test(13.04, key(ans), "id1") | ||
test(13.05, indices(ans), c("id1","id2","id3","id1__id2__id3","id6","id7","id9")) | ||
test(13.06, ii, lapply(l, indices)) ## this tests that original indices have not been touched, shallow_duplicate in mergeIndexAttrib |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
```gt_suggestion
ans = cbindlist(list(data.table(a=1:2), data.table(b=3:4, key="b")))
test(13.07, ans, data.table(a=1:2, b=3:4, key="b"))
test(13.08, key(ans), "b")
<!--__GRAPHITE_HTML_TAG_END__-->
@MichaelChirico I addressed your comments and added some more below. Most important part is the design decision if we set |
Towards #4370