-
Notifications
You must be signed in to change notification settings - Fork 13
/
Copy pathTODO.txt
100 lines (95 loc) · 4.86 KB
/
TODO.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
VarImpact TODOs
Alan priorities:
- Display true bin ranges for continuous variables in results tables.
- Unadjusted difference in means estimate (t-test) to compare to TMLE estimate
- Cell sizes in the bins/levels that it chooses (A = 0 and A = 1)
- Calculate RR treatment effects (see tmle, ltmle, Ben Arnold's package).
* OR doesn't need to be in there, byproduct of parametric regression.
- Test CV-TMLE with non-binary outcomes. E.g. transform theta?
Mark priorities:
- Support for 10-fold CV-TMLE - IN PROGRESS
Chris items:
* PRIORITY: for continuous outcomes, need to transform effect size and CIs back to the original scale (multiply by range of the original bounds.)
* PRIORITY: Track and report why variables are removed from analysis
- minCell, minYs, error during estimation, missing values, restrict_by_quantiles(), etc.
* Simulation study
* Investigate penalized histogramming of numeric variables
- Needs more tests and probably its own function.
* Give a warning if data is a matrix rather than dataframe. Causes an error in separate_factors_numerics.
* Methods
- Rename exportLatex to export_latex; support but deprecate old method name.
- Data reduction: switch from HOPACH to a more reliable clustering algorithm.
- Or upgrade HOPACH to support more than 10 (or 15?) variables.
* Arguments
- Argument for p-value cutoff for consistent results.
* Coding style
- Create more subfunctions.
* Output
- Add "Variable" column name to $results_by_fold.
- Results by fold should be divided into two tables:
- by_fold_estimates - psi estimates for each fold
- by_fold_levels - high and low levels for each fold
- Show a_L and a_H in consistent results table - critical for interpretation.
* Unfortunately this requires per-fold results for numeric variables, due to
the "directionality" definition of consistency.
- Sort consistent results by p-value then estimate size?
- Support summary()
- Return SEs in results tables.
- Fix exportTable() when vim_result is NULL due to lack of results.
* Quality
- Warn/error if there are character columns in the data, as a bonus check.
- Figure out how to reduce the number of potential failures in various steps.
- Automatically handle rows that are all NAs?
- Automatically remove any data column that is 100% equal to the outcome (so it can stay in df).
* Prediction Accuracy
- Save risk estimates for g and Q during CV-TMLE. - IN PROGRESS
- Save and report range of g to examine sparsity.
* Testing
- Need tests with single factors and numerics. Lots of lingering bugs.
- Add other tests.
* Visualization
- Visualize the treatment-specific means for each level of the treatment, with CIs
- Label the high and low level
- Layer on the raw density of the variable
- Plot of BH correction ala figure 10.6 in All of Statistics.
* Performance
- Try Rprofvis
- Investigate Jeremy's method of not wasting SL folds during CV-TMLE.
* Documentation
- Improve varImpact() roxygen description.
- Document the return object
- Vignette
* Clarify consistency definition
* Examples
* Benchmarking
- Benchmark to random forest, lasso, OLS and SL-perturbation algorithms.
* BUGS/ERRORS
- HOPACH can fail twice in a row. Need to figure out why and fix (see factor test 1).
- HOPACH (Minh example): Error in rowMeans(dmat[labels == ordlabels[j], labels == labels[medoid2]]) :
* seem to occur in hopach::mssnextlevel()
* possibly due to call to msssplitcluster()
* may need a ", drop = F" when subsetting a matrix somewhere.
* TODO: fork HOPACH and add a dimension check / debug stuff near there.
'x' must be an array of at least two dimensions
- Error in predmat[which, seq(nlami)] = preds : replacement has length zero
- Occurs when estimating TMLE on training samples.
- Error in lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs, :
one multinomial or binomial class has 1 or 0 observations; not allowed
-- Can happen with BreastCancer dataset, seems to depend on seed.
-- Happens in parkinson's competition dataset.
-- Seems to occur when a variable has value with a small number of obs.
Then when estimating the treatment propensity score for a given fold
there may be only a single value for all observations.
- Error in y %*% rep(1, nc) : non-conformable arguments
-- When estimating TMLE
- "Missing value(s) in medoidsdata in collap()"
* WARNINGS
- Warning in collap(data, level, d, dmat, newmed) (conducting HOPACH, at least on factors):
Not enough medoids to use newmed='medsil' in collap() -
using newmed='nn' instead
(CK: suppressing warnings currently to avoid this one.)
From existing documentation:
- Allow reporting of results that randomly do not have estimates for some of validation samples.
Wishlist
- Support clustering or variable reduction algorithms other than HOPACH.
- Support more than 2 external (CV-TMLE) folds.