Standardize the output format of the test functions #222

Closed
jdblischak opened this issue Mar 20, 2024 · 11 comments

@jdblischak
Collaborator

jdblischak commented Mar 20, 2024

We have ambitious goals to make experiments with different test functions extremely flexible. For example:

  • Enable sim_gs_n() to apply a different test at each cut, and even potentially multiple tests per cut
  • Enable maxcombo() to accept an arbitrary number of test functions and return the best result
  • Enable multitest() to run multiple tests on a given data set and return the results (currently works by returning a list, but a data frame would be easier for downstream analysis)

Currently this flexibility isn't possible. While the test functions all accept a data frame of trial data as their first positional argument, the outputs are not uniform. This makes it impossible to combine the results into a data frame to facilitate downstream analysis. I also don't understand how the maxcombo() function can be extended to accept any number of arbitrary tests without a common test statistic to compare them (though I admit I may just not fully understand the future plans for maxcombo()).

The example below demonstrates the diversity of the outputs:

remotes::install_github("Merck/simtrial@2c015f6", upgrade = FALSE)
library("simtrial")
packageVersion("simtrial")
## [1] ‘0.3.2.13’

trial_data <- sim_pw_surv()
trial_data_cut <- cut_data_by_event(trial_data, 150)

wlr(trial_data_cut, weight = fh(rho = 0, gamma = 0))
##   rho gamma         z
## 1   0     0 -1.392586
wlr(trial_data_cut, weight = fh(rho = 0, gamma = 0.5))
##   rho gamma         z
## 1   0   0.5 -1.650691
wlr(trial_data_cut, weight = mb(delay = 3))
##           z
## 1 -1.425919
rmst(trial_data_cut, tau = 20)
##   rmst_arm1 rmst_arm0 rmst_diff         z
## 1  11.16504  10.20869 0.9563463 0.6412476
milestone(trial_data_cut, ms_time = 10)
##      method          z ms_time surv_ctrl surv_exp surv_diff std_err_ctrl std_err_exp
## 1 milestone 0.04013879      10      0.46     0.48      0.02   0.07048404  0.07065409
maxcombo(trial_data_cut, rho = c(0, 0), gamma = c(0, 0.5))
##      p_value
## 1 0.06348763

I propose that each test function return a named vector with three elements:

  • description: brief description of the test performed
  • test statistic: whatever makes sense to return from the test
  • p-value: to facilitate comparing across tests with different test statistics

If this is too simplistic, we could include more information (e.g., the input parameters) in one of a few ways:

  • Convert the named vector to a named list. Then we could add more complex elements like parameters, which could itself be a named vector, and the results could still be converted to a data frame using a list column for parameters
  • Switch to OOP and define an S3 class for each test result. It could return much more data, and an as.data.frame() method would extract a minimal set of columns to combine the results of many tests into a single data frame (a rough sketch of this option follows below)
  • Pack some more information into the attributes of the named vector. But this isn't really friendly for downstream analysis, and it would likely be lost in functions like sim_gs_n() that return a data frame

xref: #215
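
A rough sketch of the S3 option (hypothetical class and function names, illustrative numbers; not actual simtrial code): each test could return an object of a class such as "test_result", and an as.data.frame() method would extract a minimal, uniform set of columns so that many results can be row-bound.

new_test_result <- function(description, statistic, p_value, ...) {
  structure(
    list(
      description = description,
      statistic = statistic,
      p_value = p_value,
      extra = list(...) # test-specific details, e.g. input parameters
    ),
    class = "test_result"
  )
}

as.data.frame.test_result <- function(x, ...) {
  # keep only the fields shared by every test
  data.frame(description = x$description, statistic = x$statistic, p_value = x$p_value)
}

r1 <- new_test_result("WLR FH(rho=0, gamma=0)", statistic = -1.39, p_value = 0.16)
r2 <- new_test_result("RMST tau=20", statistic = 0.64, p_value = 0.52)
do.call(rbind, lapply(list(r1, r2), as.data.frame))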

@keaven
Collaborator

keaven commented Mar 20, 2024

First of all: very nice!
One immediate comment: a milestone difference between .48 and .46 with the given standard errors should not have a p-value of .04 unless I misunderstand something. Also, an explanation of the detail that differs between tests is, in my mind, essential. I would probably be fine with a description, a p-value, and a list (if a data.frame column can have type list???) whose contents depend on the description. The list for any given description should be structured consistently. It could be a mix of parameters and descriptive or other statistics. The OOP option seems potentially interesting, but I would like to see the above example done with that. You are right that we are asking for a lot of flexibility... and probably no matter what we do, some statistician(s) will want something that does not fit the model.

@jdblischak
Collaborator Author

One immediate comment: a milestone difference between .48 and .46 with the given standard errors should not have a p-value of .04 unless I misunderstand something.

I think the .04 is the z-score, not the p-value

Also, an explanation of the detail that differs between tests is, in my mind, essential. I would probably be fine with a description, a p-value, and a list (if a data.frame column can have type list???) whose contents depend on the description.

I think it should be possible. I'm not a big fan of list-columns in general (since they make things more complicated), but they would be a way to store this heterogeneous data.
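
For example, a minimal sketch (plain base R, not from simtrial) showing that a data frame column can indeed hold a list, which could store per-test parameters of varying shape:

df <- data.frame(test = c("wlr", "rmst"), z = c(-1.39, 0.64))
df$parameters <- list(c(rho = 0, gamma = 0.5), c(tau = 20))
str(df$parameters)           # list of 2 named numeric vectors
df$parameters[[1]]["gamma"]  # 0.5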

The list for any given description should be done consistently. It could be a mix of parameters and descriptive or other statistics.

Exactly. We can use this Issue to discuss the various descriptions/parameters/statistics/etc. that we need returned from each test function and decide on a uniform output format.

@nanxstats
Collaborator

100% agree with the OOP/S3 method comment: it would provide the interface for the returned data (the implementation), as described by the principle "code against the interface, not the implementation".

@jdblischak
Collaborator Author

After discussion, we are planning to try returning a named vector with the following values. If this becomes too unmanageable, then we can consider trying more sophisticated options like S3 or list-columns.

  • test_type - the name of the test: wlr, rmst, maxcombo, etc
  • estimate - point estimate
  • se - standard error
  • z - z-score
  • p - p-value
  • parameters - string of parameter values, eg "rho=0;gamma=0.5"

@LittleBeannie please feel free to adjust these names as you see fit. They are just my suggestions
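
For illustration only (made-up numbers, not simtrial output), the proposed named vector might look like the sketch below. Note that mixing numbers and strings in a named vector coerces everything to character, which is one reason the more sophisticated options above may be worth revisiting if this becomes unmanageable.

res <- c(
  test_type  = "wlr",
  estimate   = -12.34,
  se         = 4.49,
  z          = -2.75,
  p          = 0.003,
  parameters = "rho=0;gamma=0.5"
)
class(res)  # "character": every element is coerced to a string
res["z"]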

@LittleBeannie
Collaborator

After discussion, we are planning to try returning a named vector with the following values. If this becomes too unmanageable, then we can consider trying more sophisticated options like S3 or list-columns.

  • test_type - the name of the test: wlr, rmst, maxcombo, etc
  • estimate - point estimate
  • se - standard error
  • z - z-score
  • p - p-value
  • parameters - string of parameter values, eg "rho=0;gamma=0.5"

@LittleBeannie please feel free to adjust these names as you see fit. They are just my suggestions

Hi John, looks great to me, thanks for the nice summary! Can I raise some minor comments on the names?

  • Can we rename test_type to test? In gsDesign, we have an argument called test.type which refers to different designs (1-sided, 2-sided, etc.). Since many gsDesign users are also simtrial users, they may find it confusing that the two arguments have similar names but different meanings.
  • Can we rename estimate to est for consistency with Keaven's other package, nphsim, mentioned in today's meeting? https://github.com/keaven/nphsim/blob/master/R/cox_pw.r#L25
  • Can we rename p to pval, again, for consistency with nphsim? https://github.com/keaven/nphsim/blob/master/R/combo.wlr.r#L53
  • Can we rename parameters to parameter? Everything will be singular in simtrial/gsDesign2.

@jdblischak
Collaborator Author

Can I raise some minor comments on the names

Those names are all fine by me. Like I said, mine were only suggestions

@nanxstats
Collaborator

Good discussion! How about this: no abbreviations at all, to make them easy to search:

  • test - the name of the test: wlr, rmst, maxcombo, etc
  • point_estimate - point estimate
  • standard_error - standard error
  • z_score - Z-score
  • p_value - p-value
  • parameter - string of parameter values, e.g., "rho=0;gamma=0.5"

This also aligns with the general rule of thumb, suggested by R4DS, of preferring long, descriptive names.

nphsim consistency is a good topic, although I don't think it's that important because of its experimental nature... and we should take its code style with a grain of salt.

@jdblischak
Collaborator Author

Here are the outputs of the testing functions now that #227 was merged (75d7b67). I am going to work on extending sim_gs_n() to be more flexible with wlr(), rmst(), and milestone()

remotes::install_github("Merck/simtrial@75d7b67", upgrade = FALSE)
library("simtrial")
packageVersion("simtrial")
## [1] ‘0.3.2.14’

trial_data <- sim_pw_surv()
trial_data_cut <- cut_data_by_event(trial_data, 150)

wlr(trial_data_cut, weight = fh(rho = 0, gamma = 0))
## $method
## [1] "WLR"
##
## $parameter
## [1] "FH(rho=0, gamma=0)"
##
## $estimation
## [1] -12.3417
##
## $se
## [1] 4.486425
##
## $z
## [1] -2.750898
##
wlr(trial_data_cut, weight = fh(rho = 0, gamma = 0.5))
## $method
## [1] "WLR"
##
## $parameter
## [1] "FH(rho=0, gamma=0.5)"
##
## $estimation
## [1] -10.15942
##
## $se
## [1] 2.848798
##
## $z
## [1] -3.566213
##
wlr(trial_data_cut, weight = mb(delay = 3))
## $method
## [1] "WLR"
##
## $parameter
## [1] "MB(delay = 3, max_weight = Inf)"
##
## $estimate
## [1] -14.61053
##
## $se
## [1] 5.199295
##
## $z
## [1] -2.810097
##
rmst(trial_data_cut, tau = 20)
## $method
## [1] "RMST"
##
## $parameter
## [1] 20
##
## $estimation
## [1] 1.867087
##
## $se
## [1] 1.470548
##
## $z
## [1] 1.269654
##
milestone(trial_data_cut, ms_time = 10)
## $method
## [1] "milestone"
##
## $parameter
## [1] 10
##
## $estimation
## [1] 0.1
##
## $se
## [1] 0.07019972 0.07048404
##
## $z
## [1] 0.9964078
##
maxcombo(trial_data_cut, rho = c(0, 0), gamma = c(0, 0.5))
## $method
## [1] "Maxcombo"
##
## $parameter
## [1] "FH(0, 0) + FH(0, 0.5)"
##
## $z
## [1] -2.750898 -3.566213
##
## $p_value
## [1] 0.0002723921
##

@jdblischak
Collaborator Author

In #229, I updated all the test functions to return "estimate" instead of "estimation"
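
As a rough sketch (not from the thread) of the downstream analysis this standardization enables: reusing trial_data_cut from the example above, and assuming each result is a plain list whose elements all have length 1 (as in the wlr() output shown), several results can be stacked into a single data frame.

results <- list(
  wlr(trial_data_cut, weight = fh(rho = 0, gamma = 0)),
  wlr(trial_data_cut, weight = fh(rho = 0, gamma = 0.5)),
  wlr(trial_data_cut, weight = mb(delay = 3))
)

# one row per test: method, parameter, estimate, se, z
do.call(rbind, lapply(results, as.data.frame))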

@nanxstats added the API label on Jun 5, 2024
@LittleBeannie
Collaborator

Hi @jdblischak, is it okay if we close this issue as complete?

@jdblischak
Collaborator Author

is it okay if we close this issue as complete?

Yes. The main remaining piece is multiple tests per cut for sim_gs_n(), so I created a new Issue for this final task (#258)
