Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reimplement conditions parameter in dscquery with R syntax #167

Merged
merged 3 commits into from
Feb 10, 2019
Merged

Conversation

gaow
Copy link
Member

@gaow gaow commented Feb 10, 2019

Previously I rely on SQL syntax for conditions parameter. But there are two issues with it:

  1. SQL syntax can confuse R users, see here for an example.
  2. Because the condition is specified to dsc-query (the Python command program) , it cannot filter data after extraction. One has to do process the data further.

To overcome the above issues, this pull request replaces the conditions implementation with R expressions. Interface is otherwise the same except for one important difference: the use of wildcard sigil $(). For example previous query simulate.n > 10 has to be $(simulate.n) > 10. With this wildcard sigil we can allow for arbitrary R expressions in the conditions query. The code will be able to extract from the sigil variables involved, check and insert them into SQL query to get their values, then filter the result using provided R expression.

@pcarbo I think this is more or less the final solution. Please take a look. It works for some examples I have.

BTW I noticed #166 that I'm hoping you could take a look at (I'm not sure how to do it elegantly). I'm going to merge this PR to master and we can take from there.

@gaow gaow merged commit ef1d94c into master Feb 10, 2019
@gaow gaow deleted the dev branch February 11, 2019 19:32
@pcarbo
Copy link
Member

pcarbo commented Feb 11, 2019

@gaow How is the R expression converted to an SQL query, and where is i done inside the code?

It works for some examples I have.

@gaow Where are these examples? Maybe we can add a few to the @examples section?

@gaow
Copy link
Member Author

gaow commented Feb 11, 2019

How is the R expression converted to an SQL query

It is not converted to SQL query's WHERE clause, but contributed to SELECT and FROM clause by being part of others parameter for the R function. Because when some columns in the conditions are not targets and since we are doing the condition filter after the SQL step, we need to make sure we do also get these columns in the SQL query by making them others first, then remove them after use.

and where is i done inside the code?

This is what the regular expressions are there for; see code chunks under this lline. The additional_columns are kept for record here and are dropped at the end.

Where are these examples? Maybe we can add a few to the @examples section?

That is exactly what I did. Also it is reflected in the unit tests for the R package.

@pcarbo
Copy link
Member

pcarbo commented Feb 11, 2019

One more question:

we can allow for arbitrary R expressions in the conditions query.

@gaow Can you please provide more details? How is this possible if it is being converted to an SQL query? Presumably it only allows R expressions that have corresponding valid SQL syntax, or am I misunderstanding?

@gaow
Copy link
Member Author

gaow commented Feb 11, 2019

How is this possible if it is being converted to an SQL query?

I guess the confusion is what it is meant by SQL query. Does my explanation above clarify it a bit? We do not convert the condition filter in R to SQL's WHERE clause. All we did was to make sure the SQL query include these relevant columns for next steps filtering (in SELECT and FROM clauses) and extract their information.

That means it should allow for any R expression. For example, say:

conditions = c("apply_customized_criteria($(simulate.n), $(score.error))", ...)

where the first condition uses a function apply_customized_criteria based on simulate.n and score.error. Then the regular expression will detect that simulate.n and score.error are involved, and will add them to dsc-query command's target to ensure they are both extracted. Later when all data are in the R dataframe, we simply use subset(..., eval(parse(text=...))) to extract. Here text= should be able to work with any R expressions. Of course we'll remove in the very end columns thus added yet not in targets or others.

@pcarbo
Copy link
Member

pcarbo commented Feb 11, 2019

@gaow I'm attaching an example for discussion:

library(dscrutils)
dsc.dir <- system.file("datafiles","ash","dsc_result",package = "dscrutils")
timing1 <- system.time(
  dat1 <- dscquery(dsc.dir,
            targets = c(paste("simulate",c("nsamp","g"),sep="."),
                paste("shrink",c("mixcompdist","beta_est","pi0_est"),
                      sep=".")),
            conditions = "$(simulate.g) == 'list(c(2/3,1/3),c(0,0),c(1,2))'"))
timing2 <- system.time(
  dat2 <- dscquery(dsc.dir,
            targets = c(paste("simulate",c("nsamp","g"),sep="."),
    	        paste("shrink",c("mixcompdist","beta_est","pi0_est"),
                      sep="."))))
dat2 <- subset(dat2,simulate.g == "list(c(2/3,1/3),c(0,0),c(1,2))")
print(all(dat1 == dat2))
print(timing1)
print(timing2)

A few examples like this could be great for @examples because it shows clearly what the conditions argument does, and it is a great way to provide a more detailed test.

This are the runtimes I got for these two calls (on my MacBook Pro):

   user  system elapsed
  0.922   0.170   1.107
   user  system elapsed
  0.918   0.166   1.096

Should I expect the first call to be faster? Maybe we can come up with an example that better illustrates the advantage of using conditions over subsequent filtering of the data frame inside R?

@gaow
Copy link
Member Author

gaow commented Feb 12, 2019

Thanks @pcarbo this is one of the @example right? I think we can come up with something simpler because the query condition involves a large string. Yes, we would expect the one with condition to save some computation, but the example might be too minimal to illustrate.

@pcarbo
Copy link
Member

pcarbo commented Feb 12, 2019

@gaow Two suggested to-do items:

  1. Improve the @examples along the lines of what I suggest above.

  2. Improve the tests for the conditions argument along the lines of what I suggest above.

Do you want to handle these, or would you prefer that I do them?

@gaow
Copy link
Member Author

gaow commented Feb 12, 2019

Improve the @examples along the lines of what I suggest above.

you mean to use some non-trivial examples to demonstrate benefit of using condition vs not using them, by showing the runtime benchmark? Otherwise we already have those examples that you used. Possibly more relevant is to show the use of arbitrary R expression. But I'm not sure if we should do it in @example or use a separate documentation page on dsc-wiki.

Improve the tests for the conditions argument along the lines of what I suggest above.

This should be done. We currently have limited tests for conditions.

Do you want to handle these, or would you prefer that I do them?

The example data we have is already enough for adding more unit-tests I think. How about you help with 2, i.e., adding more unit tests (along with the 2nd item #168 because that needs tests anyways)? And I will use some non-trivial examples to demonstrate benefit of 1, by writing a separate vignette on dsc-wiki?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants