Failure to merge multiple pipelines automatically in query when pipeline parameters are involved #145
@gaow Can you create a small DSC example that illustrates this behaviour so I can inspect the output? |
@pcarbo sure! that'd also be the first thing I should do. I'll try to do this later today and if it is straightforward enough to fix I'll just fix it. Otherwise I'll upload the example here, or add to our test, for you to inspect it and discuss desired behavior. |
@gaow Thanks. It would be great if I could first understand this example before you attempt to fix it. |
@pcarbo It has not been straightforward, so I could not fix it earlier this week. I'm still working on it. I'll skip irrelevant details, discuss what I've got so far, and see how to finalize it. Currently, after running separate queries I end up with this table A:
which should further be processed to this table B:
to present it more concisely. I currently struggle with a reliable implementation to go from table A to table B. Previously I relied on SQL |
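Since tables A and B are not shown here, a rough base-R sketch of the kind of collapse being discussed (complementary NA blocks folded into one row per key, without SQL) might look like this; the data and column names are made up for illustration:

```r
# Toy stand-in for "table A": two rows per DSC value whose non-NA
# columns are complementary.
a <- data.frame(DSC  = c(1, 1, 2, 2),
                aa.n = c(5, NA, 6, NA),
                bb.p = c(NA, 0.1, NA, 0.2))

# Collapse to "table B": for each DSC value, keep the single non-NA
# entry found in each column.
b <- aggregate(. ~ DSC, data = a, na.action = na.pass,
               FUN = function(v) v[!is.na(v)][1])
```

This works only when each key has exactly one non-NA value per column, which is part of why a reliable general implementation is hard.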
Another question is: when the two parameters are different in length (as described here), or even in nature (see below), should we be merging them at all? For example:

```
  DSC aa.n bb.p
0   1    1   NA
1   1    2   NA
2   1    3   NA
3   1   NA  0.1
```

does it make sense to present it as:

```
  DSC aa.n bb.p
0   1    1  0.1
1   1    2   NA
2   1    3   NA
```

or does it make more sense to leave it without merging? Keeping it as is seems to make sense. I think it is a difficult question to answer without knowing the context of the DSC -- for example, to compare two fine-mapping tools I'd like to set the --max-causal parameter for both of them to, say, 1,2,3 and have them merged to compare 1 vs 1, 2 vs 2 and 3 vs 3 -- it makes sense to merge. But if I set the other one to 4,5,6 then it does not make sense to merge, e.g.:

```
  DSC method1.max_n method2.max_n
0   1             1             4
1   1             2             5
2   1             3             6
```

From the engineering point of view I wonder if it should be a post-processing step of the dsc-query result done at the R level, considering that whether or not to merge is context dependent and we cannot easily tell by guessing from parameter names or values. |
I think I don't understand this issue well enough. But let me say my understanding/suggestion is that the output of dsc_query should be one row per pipeline instance. Maybe discuss in person?
Matthew
|
Yes, this is the current behavior. But sometimes it is convenient to further merge the outcome into a more compact, easier-to-use table -- only if we know how to do this appropriately! Sure, let's talk about it on Wednesday. |
once you have one row per pipeline you can merge using the many available
tools in R.
No need to do that in dscquery
|
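Matthew's point -- merge the per-method results with ordinary R tools after querying -- can be sketched in a few lines of base R. The data frames below are stand-ins for the results of two separate dscquery calls, with made-up column names:

```r
# Stand-ins for two separate query results: method A has one row per
# replicate, method B has one row per parameter value per replicate.
res_a <- data.frame(DSC   = 1:3,
                    a.est = c(0.11, 0.22, 0.33))
res_b <- data.frame(DSC   = rep(1:3, each = 3),
                    b.k   = rep(c(0.5, 1.5, 5), times = 3),
                    b.est = seq(0.1, 0.9, by = 0.1))

# Join on the shared replicate column: one row per (replicate, b.k) pair.
both <- merge(res_a, res_b, by = "DSC")
```

This yields the nine-row "one row per pipeline instance" shape discussed above, without any merging logic inside dscquery itself.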
@gaow Like Matthew, I'm confused by your example---I don't see how dsc could automatically know when to merge and when not to merge. It seems this would require an understanding of how the parameters are used in the code, which would require a human-level understanding of the benchmark. |
Sorry, I am just looking at this discussion. @gaow, @pcarbo, @stephens999 I think the problem is actually that we don't get one row per pipeline run. That would be a fine solution. I made an example illustrating the behavior; I am attaching the dsc file.

Basically the problem is this. I run a pipeline simulate*(A, B)*summary where method B has three possible parameters it can run with. I run three replicates. Extracting method A results only, I expect a data frame with three rows -- this is observed. Extracting method B results only, I expect a data frame with 9 rows (one row per parameter value per replicate) -- this is also observed. Extracting both, the ideal solution would be to also have nine rows. This might be complicated, however, and it doesn't easily extend to when both methods have multiple parameters.

One solution, described by @gaow, would be to have 12 rows, with NAs for method A on nine of them and NAs for method B on three of them. However, what is observed is a data frame with three rows where only the first parameter value for method B is extracted. Another possible solution would be to go to wide format, where the results for method B give three columns, each corresponding to a different parameter value. |
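The wide-format option mentioned at the end of that comment can be sketched with base R's reshape; the data and column names here are toy stand-ins for method B's output:

```r
# Long format: one row per (replicate, parameter value) for method B.
long <- data.frame(DSC   = rep(1:3, each = 3),
                   b.k   = rep(c(0.5, 1.5, 5), times = 3),
                   b.est = 1:9)

# Wide format: one row per replicate, one b.est column per parameter value.
wide <- reshape(long, idvar = "DSC", timevar = "b.k", direction = "wide")
```

This gives three rows and three `b.est.*` columns, one per value of the parameter.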
Thank you @jean997 for clarifying it! Would you mind updating to the current
Would you verify if it is the case? And ideally, yes, we want to be able to reliably achieve this:
But I'm having trouble with this, because it is logically difficult to properly determine and merge those. Since for a simple case such as yours the merging would make sense, I'm now more inclined to add a step in the R utility to post-process it depending on the user's need. RE the |
@jean997 I fixed the
You'll need to install the current @pcarbo Your previous assessment is pretty much what happens, but to get a more concrete idea you can still update DSC, run Jean's example, and see for yourself. This is a pretty good example that Jean put together. The DSC part is just
|
Thanks @pcarbo for confirming it! To motivate my claim that we should provide some post-processing utility, I updated my post above with an example from Jean's data that we actually may want to merge. Posting it again here:

```
> dscout
  DSC mean.est_mean huber_mean.est_mean
1   1   0.008447484                  NA
2   2   0.075343752                  NA
3   3   0.024550986                  NA
4   1            NA         0.009774250
5   2            NA         0.061133496
6   3            NA         0.025887638
```

is nicer presented as (or post-processed to):

```
> dscout
  DSC mean.est_mean huber_mean.est_mean
1   1   0.008447484         0.009774250
2   2   0.075343752         0.061133496
3   3   0.024550986         0.025887638
```

This was initially also requested by @jean997 via slack a while ago, and I responded by doing this automatic merging assuming there is always an equal number of NA in each missing block -- an assumption that has led to the issues reported in this ticket. Now I think we all agree not to auto-merge, for reasons discussed previously. But should we provide utility functions for post-processing it like above, if users deem it proper? @jean997 do you have some utility functions in R that you used for this sort of merger before I responded to your slack request? |
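The merge gaow describes can be done in a few lines of base R, but only under the equal-sized-NA-blocks assumption that caused this issue in the first place; the values below are the ones from Jean's example in this thread:

```r
dscout <- data.frame(
  DSC = c(1, 2, 3, 1, 2, 3),
  mean.est_mean       = c(0.008447484, 0.075343752, 0.024550986, NA, NA, NA),
  huber_mean.est_mean = c(NA, NA, NA, 0.009774250, 0.061133496, 0.025887638))

# Collapse the complementary NA blocks: keep each column's non-NA values,
# pairing them up row by row. Only valid when the blocks line up exactly.
merged <- data.frame(
  DSC                 = unique(dscout$DSC),
  mean.est_mean       = as.numeric(na.omit(dscout$mean.est_mean)),
  huber_mean.est_mean = as.numeric(na.omit(dscout$huber_mean.est_mean)))
```

If the blocks are not the same length (the case discussed earlier in this thread), this silently misaligns or fails, which is exactly why auto-merging was abandoned.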
I'll repeat my resistance to helping users with what is basically a programming task. It's up to them how to deal with results once they are in a data frame.

However, this process can be made easier by using dsc in particular ways. Here the "problem" is created by not defining a group mean_est: (mean, huber_mean). With that group suitably defined I believe you would have columns that look something like:

```
DSC, mean_est, k, est_mean
1, huber_mean, 0.5, xx
1, huber_mean, 1.5, xx
1, huber_mean, 5, xx
1, mean, NA, xx
```

etc.

Put another way, I agree that:

```
dscout
  DSC mean.est_mean huber_mean.est_mean
1   1   0.008447484                  NA
2   2   0.075343752                  NA
3   3   0.024550986                  NA
4   1            NA         0.009774250
5   2            NA         0.061133496
6   3            NA         0.025887638
```

is not how we want things. We want a single column "est_mean" and another column that contains the method used (mean/huber_mean). I believe the use of a module group would have achieved this?

Matthew
|
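The single est_mean column plus a method column that Matthew describes can be sketched in base R from the two-column table above; this is an illustration built by hand, not dscquery output:

```r
dscout <- data.frame(
  DSC = c(1, 2, 3, 1, 2, 3),
  mean.est_mean       = c(0.008447484, 0.075343752, 0.024550986, NA, NA, NA),
  huber_mean.est_mean = c(NA, NA, NA, 0.009774250, 0.061133496, 0.025887638))

# One row per pipeline instance: a method label plus a single est_mean
# column, taking whichever of the two source columns is non-missing.
method   <- ifelse(is.na(dscout$mean.est_mean), "huber_mean", "mean")
est_mean <- ifelse(method == "mean",
                   dscout$mean.est_mean, dscout$huber_mean.est_mean)
long <- data.frame(DSC = dscout$DSC, method = method, est_mean = est_mean)
```

This is the shape a module group produces directly, which is why the group is the preferred fix.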
Thank you @stephens999 ! The reasons I did not advertise too much about
@jean997 indeed, for this particular example, what Matthew has proposed can already be achieved dynamically at the query level, that is:
and you get:
Not exactly what you asked for, but it is a viable alternative. Of course we might need to document it better, which I just did:
along with a minor format adjustment. Please update |
@gaow @stephens999 @jean997 I agree that groups is the best way to resolve this. Here's my implementation of Matthew's suggestion:

```
simulate: R(dat <- rnorm(n = 1000,mean = 0,sd = 1) +
            rbinom(n = 1000,size = 1,prob = 0.05) *
            rexp(n = 1000,rate = 3))
  $dat: dat

unbiased_mean: R(m <- mean(x))
  x: $dat
  $est_mean: m

huber_mean: R(library(MASS); m <- huber(x,k = k)$mu)
  x: $dat
  k: 0.5, 1.5, 5
  $est_mean: m

sq_error: R(sq_error <- est^2)
  $sq_err: sq_error
  est: $est_mean

DSC:
  define:
    mean: unbiased_mean, huber_mean
  run: simulate * mean * sq_error
  replicate: 3
```

For example,

```
dscrutils::dscquery(dsc.outdir = "example2",
                    targets = c("mean.est_mean","huber_mean.k","sq_error.sq_err"))
```

returns the following data frame:
|
I think a group is clearly the right answer for this example. This came up for me initially in a context where groups would not have been the right choice, because methods A and B produce totally different summaries. I think @stephens999 is right that that is probably a situation where the user needs to solve their own problem, but dsc also shouldn't produce nonsensical output if it is provided with a poorly thought-out query. I think a lot of users will do what I did first, before they figure out that they need to query the results in two steps and then merge. One option would be to just have @pcarbo In your example above, should the first three rows have |
I dunno how you would solve this---if you use a programming language to develop poorly thought-out code, you will get poor results! This is a human problem, not a programming language problem. :)
@jean997 You are absolutely right. @gaow Any idea why the "mean" column only shows |
I just had a chance to re-run my example using the new version of
I think the last nine rows of this data frame should not appear? |
@jean997 Yes, this seems like another bug or undesirable outcome. |
I might be misunderstanding, but this looks correct to me. For each pipeline it has extracted the variables you asked for where it can, and filled the others with NA.

The way to exclude those rows is to filter on pipelines, I think. That is, run the query only on a subset of pipelines. I'm not sure what filter facilities we currently have in place, though.

Matthew
|
Yes, exactly, @stephens999 is correct. And this is when row filtering is needed, via conditions:

```
dscout <- dscquery(dsc.outdir="results",
                   targets=c("mean.est_mean", "sq_error.sq_err"),
                   conditions="mean.est_mean not null")
```

(need to verify) But I was mostly concerned that if we find this confusing in house, we perhaps should do something about it. I think for this situation it might make sense to automatically add an additional column:

```
mean
unbiased_mean
huber_mean
...
```

I did not find a chance to implement this during the week but will try to fix this ticket in the next couple of days. |
NULL and NA are different in R...
Not sure if that makes the syntax confusing. Presumably that is SQL syntax?
Matthew
|
I'm not convinced that if we find it confusing in house then adding a column automatically under certain circumstances is a good response. Such fixes, by adding new rules, make things harder to reason about, not easier. My suggestion would be to try to make the logic clearer... In this example that is achieved by using groups, I think.

I understand that there exists another example where the groups do not fix it, but I have not seen that example so don't really understand the issue.
|
Yes. I understand it is confusing. In particular, R has the most diverse types of missing data. Anyway, I think we should not dwell too much on that exact syntax to remove
I think adding a column automatically whenever groups appear in the query should be made the default behavior, whether there is missing data or not -- I think that's a proper fix and a general solution, not one that deals only with this specific circumstance. |
Okay, current behavior adjusted to:

```
dscrutils::dscquery(dsc.outdir = "test",
                    targets = c("mean.est_mean","huber_mean.k","sq_error.sq_err"))
```

Notice here the change of
I think making rules is not bad (in fact, I personally feel I make fewer rules than much of the software I use). I think what's important is consistency in behavior. In principle I would not propose, or tolerate, solutions that lack generality.

```
dscrutils::dscquery(dsc.outdir="test",
                    targets=c("unbiased_mean.est_mean", "huber_mean.est_mean", "sq_error.sq_err"))
```

Compare above to below: again introducing

```
dscrutils::dscquery(dsc.outdir="test",
                    targets=c("unbiased_mean.est_mean", "huber_mean.est_mean", "mean.k", "sq_error.sq_err"))
```

```
dscrutils::dscquery(dsc.outdir="test",
                    targets=c("huber_mean.k", "sq_error.sq_err"))
```

compare above to below:

```
dscrutils::dscquery(dsc.outdir="test",
                    targets=c("mean.k", "sq_error.sq_err"))
```
|
@gaow I don't have any fundamental issues with your proposal, but I do worry that it might be difficult to implement your rule in general, or to explain it. Here is a simpler rule that I think is more intuitive, easier to implement, and easier to explain: When (1) An error when (2) (3) The output will not always be as attractive or concise as it could be, but from the user's point of view it is nice when all the targets you requested appear in the output table (unless it generates an error). |
In principle yes, but in practice we do not get to filter the table until we have the complete table. We would thus have to invent a customized convention in the R code to distinguish these two types of
It is a compromise, but I'm always hesitant to add more options like this simply because we did not agree on the expected behavior ourselves. I am leaning towards filtering it, even though it involves more work than it appears. I just want to make sure there are no bad side effects. |
If it's unclear what the best way forward is, then maybe we should just wait until we have more user feedback. I am satisfied to use |
@jdblischak @gaow This is one of those situations where we didn't properly anticipate all the use cases when designing the internal data structures. |
I don't understand the example.
Is there still a conditions parameter that can be used to filter out
pipelines?
|
Yes, and I think a carefully crafted condition parameter can give clean results. But more often than not we tend to use dscquery without condition constraints, then use R to post-process it, so that we only query it once. The condition query is perhaps only useful for very large benchmarks, when it is too much computation to load every file generated for all conditions.

The pipelines are:

```
run:
  pipe_de: de * analyze * (auc, true_fdr)
  pipe_zeros: zeros * analyze * type_one_error
```

then the query targets are de, analyze, true_fdr. You see de and true_fdr are only relevant to the first pipeline, but analyze is relevant to both pipelines. As explained earlier, there is no concept of pipeline in queries. As a result, lines 36 to 45 are in fact analyze from the 2nd pipeline, where de and true_fdr are missing. |
I'm not sure I understand the query behavior. Why are there results returned for the pipe_zeros pipelines but not for the pipelines of the form de, analyze, auc? Can you point me to the docs that describe the query behavior?
|
Sorry, in the ticket I truncated the output to highlight the problem. The complete output is here:

```
> df_auc <- dscquery("soneson2013/",
                     targets = c("de",
                                 "analyze",
                                 "auc.auc"))
```

Here is the DSC file. Two pipelines are executed. Lines before 35 are output for the
The "official" approach is to use dscquery in R; see the docs here. It is a wrapper of this program. However, I don't think this documentation is very helpful for this specific issue. |
Thanks @gaow. You say two pipelines are executed, although of course it is many more than that... In any case I have a question. In the results I see there are 35 pipelines created by (de * analyze * auc) and 20 by Where are the results for the other 35 pipelines created by |
They are not available because they are not part of the query. When I query this:

```
library(dscrutils)
df_auc <- dscquery("soneson2013/",
                   targets = c("de",
                               "analyze",
                               "auc.auc",
                               "true_fdr.true_fdr"))
```

I get:
We can discuss it when we meet, but here is a minimal working example. To run this example, first update DSC:
then run it via:
It should work without having to install any packages. The results can be queried using the R code at the beginning of this post. |
@stephens999 to answer again your question "Where are the results for the other 35 pipelines created by
where the last column is I agree one would expect it. In fact you will see it if I modify this line: Line 273 in 7d26602
by just removing the function
will completely overlap with pipeline
but in pipeline
Since the first pipeline instance However, the story with overlapping
In sum the behavior of |
@jdblischak With this pull request you can change your query to:

```
library(dscrutils)
df_auc <- dscquery("soneson2013/",
                   targets = "auc.auc",
                   others = c("de", "analyze"),
                   omit.file.columns = TRUE)
```

and get what you actually wanted. Notice you can now use a new option @stephens999 I used the convention
|
I close this ticket now since the current behavior is what we've discussed. I'm still open to interface changes -- we can use separate tickets for it. |
@gaow I'm finally catching up on DSC. From
First, this description should be given in the "target" argument. Second, this behaviour doesn't seem like quite what we discussed; from my understanding, if any of the targets have a missing value, then the row should be removed or, equivalently, the row is kept only if all targets have non-missing values. Third, can we change the names of the "targets" and "others" arguments? In particular, "others" is not a great name for an argument. What about "target.cols" and "other.cols", for example? |
Sure, please feel free to make changes.
We have not discussed what to do with multiple targets. @stephens999 initially suggested limiting it to only one column. I used
Again, there is a backwards-compatibility issue, which might impact me, Jean, Kaiqian, Yuxin and possibly John -- perhaps mostly me. I'm not against changing it, particularly at this point; I'm only saying that we should be careful if we want to make interface changes. |
Sounds like we weren't precise enough in our conversation. I'm open to the "or" behaviour, but my understanding in our discussion is that we want the "and" behaviour. Let's see what @stephens999 says. I think if we name the arguments in a clever way, we can remain backward compatible. @stephens999 Can we reopen this so that we remember to check in with Matthew about this? |
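The "and" versus "or" behaviours under discussion differ by one line of base R; a sketch on toy data, with hypothetical column names:

```r
d <- data.frame(DSC = 1:4,
                a   = c(1, NA, 3, NA),
                b   = c(NA, 2, 3, NA))
targets <- c("a", "b")

# "and": keep a row only if every requested target is non-missing.
and_kept <- d[complete.cases(d[, targets]), ]

# "or": keep a row if at least one requested target is non-missing.
or_kept <- d[rowSums(!is.na(d[, targets])) > 0, ]
```

On this toy frame the "and" rule keeps one row (the one where both targets were produced) while the "or" rule keeps three, which is the backward-compatibility gap being debated.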
I don't really have a use case for multiple targets in mind, so I suggested to only allow one, to avoid having to make a decision until we have a use case... Backwards compatibility is of course nice.
Matthew
|
I'm okay with that. But to ensure backward compatibility, what we would have to do is add a new argument to
Here is an idea that could be more user-friendly: many functions in R that operate on vectors or matrices have to deal with NAs as special cases, and there is often an argument to specify the behaviour for missing values; e.g.,
Maybe we could have an argument na.rm that accepts one (or more) of the names specified in targets. |
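A sketch of how an na.rm-style argument might behave; the function and its argument are a proposal being discussed here, not part of the dscrutils API:

```r
# Hypothetical filter: drop rows with NA in the named target columns only,
# leaving NAs in other columns untouched.
filter_targets <- function(d, na.rm = NULL) {
  if (is.null(na.rm)) return(d)
  d[complete.cases(d[, na.rm, drop = FALSE]), , drop = FALSE]
}

d <- data.frame(auc = c(0.9, NA, 0.8), fdr = c(NA, 0.1, 0.2))
```

Calling filter_targets(d, "auc") would keep only the rows where auc has a value, mirroring the per-target missing-value handling being proposed.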
I like this proposal! Are there any obvious limitations? |
It kind of goes back to the difference between NA and not run. I thought the intended behavior was to remove pipelines that did not run auc (or whatever is in targets), rather than removing pipelines that ran it but produced NA?
.
|
This is important to distinguish. Currently the intended behavior is as you suggested above. It can still be the case when we implement the
@stephens999 this is currently already the case when |
@gaow None that I see.
I thought we recognized that it is currently not possible to distinguish between the two based on how the DSC results are stored. My understanding is that sometimes DSC assigns NA to auc when the module generating a value for auc is not run. |
I don't fully understand the process, but I believe they are distinguishable before result extraction. They are just not distinguishable after extraction into a data frame, as they are both represented as NA. But during extraction it should be possible to not extract results from things that were not run. Gao can correct me if I'm wrong.
Matthew
|
Yes, @stephens999 is correct. This is exactly what happened -- thanks to #167, all filtering is now in R and we can filter as many rounds as we want. We now filter twice: before extracting results, and after extracting values. The "not run" filter only applies to the first round. |
@gaow And only the |
Columns involved in |
The issue was reported by @jean997 over slack:

Let's say I have two tracks that get run: AB and AC. Let's say that C has three possible parameter values and B only has 1. Right now if I do something like
I will only get results for the first parameter value for C, but if I do
I get all three. This makes sense, but it is awkward because I have to merge data frames.

My response:

This is a result of joining tables by a common key when constructing the SQL query. But I suppose I can break it up into multiple SQL queries more properly, join separate tables, and then merge the outcome internally, so that you'll not have to merge tables outside the query function.