TODO.txt

VarImpact TODOs

Alan priorities:
- Display true bin ranges for continuous variables in results tables.
- Unadjusted difference in means estimate (t-test) to compare to TMLE estimate
- Cell sizes in the bins/levels that it chooses (A = 0 and A = 1)
- Calculate RR treatment effects (see tmle, ltmle, Ben Arnold's package).
   * OR doesn't need to be in there, byproduct of parametric regression.
- Test CV-TMLE with non-binary outcomes. E.g. transform theta?

Mark priorities:
- Support for 10-fold CV-TMLE - IN PROGRESS

Chris items:
* PRIORITY: for continuous outcomes, need to transform effect size and CIs back to the original scale (multiply by range of the original bounds.)
* PRIORITY: Track and report why variables are removed from analysis
 - minCell, minYs, error during estimation, missing values, restrict_by_quantiles(), etc.
* Simulation study
* Investigate penalized histogramming of numeric variables
  - Needs more tests and probably its own function.
* Give a warning if data is a matrix rather than dataframe. Causes an error in separate_factors_numerics.
* Methods
   - Rename exportLatex to export_latex; support but deprecate old method name.
   - Data reduction: switch from HOPACH to a more reliable clustering algorithm.
       - Or upgrade HOPACH to support more than 10 (or 15?) variables.
* Arguments
  - Argument for p-value cutoff for consistent results.
* Coding style
  - Create more subfunctions.
* Output
  - Add "Variable" column name to $results_by_fold.
  - Results by fold should be divided into two tables:
     - by_fold_estimates - psi estimates for each fold
     - by_fold_levels - high and low levels for each fold
  - Show a_L and a_H in consistent results table - critical for interpretation.
     * Unfortunately this requires per-fold results for numeric variables, due to
       the "directionality" definition of consistency.
  - Sort consistent results by p-value then estimate size?
  - Support summary()
  - Return SEs in results tables.
  - Fix exportTable() when vim_result is NULL due to lack of results.
* Quality
  - Warn/error if there are character columns in the data, as a bonus check.
  - Figure out how to reduce the number of potential failures in various steps.
  - Automatically handle rows that are all NAs?
  - Automatically remove any data column that is 100% equal to the outcome (so it can stay in df).
* Prediction Accuracy
  - Save risk estimates for g and Q during CV-TMLE. - IN PROGRESS
  - Save and report range of g to examine sparsity.
* Testing
  - Need tests with single factors and numerics. Lots of lingering bugs.
  - Add other tests.
* Visualization
  - Visualize the treatment-specific means for each level of the treatment, with CIs
     - Label the high and low level
     - Layer on the raw density of the variable
  - Plot of BH correction ala figure 10.6 in All of Statistics.
* Performance
  - Try Rprofvis
  - Investigate Jeremy's method of not wasting SL folds during CV-TMLE.
* Documentation
  - Improve varImpact() roxygen description.
  - Document the return object
  - Vignette
     * Clarify consistency definition
* Examples
* Benchmarking
  - Benchmark to random forest, lasso, OLS and SL-perturbation algorithms.
* BUGS/ERRORS
  - HOPACH can fail twice in a row. Need to figure out why and fix (see factor test 1).
  - HOPACH (Minh example): Error in rowMeans(dmat[labels == ordlabels[j], labels == labels[medoid2]]) :
     * seem to occur in hopach::mssnextlevel()
     * possibly due to call to msssplitcluster()
     * may need a ", drop = F" when subsetting a matrix somewhere.
     * TODO: fork HOPACH and add a dimension check / debug stuff near there.
  'x' must be an array of at least two dimensions
  - Error in predmat[which, seq(nlami)] = preds : replacement has length zero
     - Occurs when estimating TMLE on training samples.
  - Error in lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs,  :
  one multinomial or binomial class has 1 or 0 observations; not allowed
     -- Can happen with BreastCancer dataset, seems to depend on seed.
     -- Happens in parkinson's competition dataset.
     -- Seems to occur when a variable has value with a small number of obs.
      Then when estimating the treatment propensity score for a given fold
      there may be only a single value for all observations.
  - Error in y %*% rep(1, nc) : non-conformable arguments
     -- When estimating TMLE
  - "Missing value(s) in medoidsdata in collap()"
* WARNINGS
  - Warning in collap(data, level, d, dmat, newmed) (conducting HOPACH, at least on factors):
  Not enough medoids to use newmed='medsil' in collap() -
 using newmed='nn' instead
   (CK: suppressing warnings currently to avoid this one.)

From existing documentation:
- Allow reporting of results that randomly do not have estimates for some of validation samples.

Wishlist
- Support clustering or variable reduction algorithms other than HOPACH.
- Support more than 2 external (CV-TMLE) folds.