-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reimplement conditions
parameter in dscquery
with R syntax
#167
Conversation
It is not converted to SQL query's
This is what the regular expressions are there for; see code chunks under this lline. The
That is exactly what I did. Also it is reflected in the unit tests for the R package. |
One more question:
@gaow Can you please provide more details? How is this possible if it is being converted to an SQL query? Presumably it only allows R expressions that have corresponding valid SQL syntax, or am I misunderstanding? |
I guess the confusion is what it is meant by SQL query. Does my explanation above clarify it a bit? We do not convert the condition filter in R to SQL's That means it should allow for any R expression. For example, say: conditions = c("apply_customized_criteria($(simulate.n), $(score.error))", ...) where the first condition uses a function |
@gaow I'm attaching an example for discussion: library(dscrutils)
dsc.dir <- system.file("datafiles","ash","dsc_result",package = "dscrutils")
timing1 <- system.time(
dat1 <- dscquery(dsc.dir,
targets = c(paste("simulate",c("nsamp","g"),sep="."),
paste("shrink",c("mixcompdist","beta_est","pi0_est"),
sep=".")),
conditions = "$(simulate.g) == 'list(c(2/3,1/3),c(0,0),c(1,2))'"))
timing2 <- system.time(
dat2 <- dscquery(dsc.dir,
targets = c(paste("simulate",c("nsamp","g"),sep="."),
paste("shrink",c("mixcompdist","beta_est","pi0_est"),
sep="."))))
dat2 <- subset(dat2,simulate.g == "list(c(2/3,1/3),c(0,0),c(1,2))")
print(all(dat1 == dat2))
print(timing1)
print(timing2) A few examples like this could be great for This are the runtimes I got for these two calls (on my MacBook Pro):
Should I expect the first call to be faster? Maybe we can come up with an example that better illustrates the advantage of using |
Thanks @pcarbo this is one of the |
@gaow Two suggested to-do items:
Do you want to handle these, or would you prefer that I do them? |
you mean to use some non-trivial examples to demonstrate benefit of using condition vs not using them, by showing the runtime benchmark? Otherwise we already have those examples that you used. Possibly more relevant is to show the use of arbitrary R expression. But I'm not sure if we should do it in
This should be done. We currently have limited tests for
The example data we have is already enough for adding more unit-tests I think. How about you help with 2, i.e., adding more unit tests (along with the 2nd item #168 because that needs tests anyways)? And I will use some non-trivial examples to demonstrate benefit of 1, by writing a separate vignette on |
Previously I rely on SQL syntax for
conditions
parameter. But there are two issues with it:dsc-query
(the Python command program) , it cannot filter data after extraction. One has to do process the data further.To overcome the above issues, this pull request replaces the
conditions
implementation with R expressions. Interface is otherwise the same except for one important difference: the use of wildcard sigil$()
. For example previous querysimulate.n > 10
has to be$(simulate.n) > 10
. With this wildcard sigil we can allow for arbitrary R expressions in theconditions
query. The code will be able to extract from the sigil variables involved, check and insert them into SQL query to get their values, then filter the result using provided R expression.@pcarbo I think this is more or less the final solution. Please take a look. It works for some examples I have.
BTW I noticed #166 that I'm hoping you could take a look at (I'm not sure how to do it elegantly). I'm going to merge this PR to
master
and we can take from there.