Survival prediction for patients diagnosed with ARDS receiving ECMO treatment
Thesis project for the MSc in Biostatistics at the University of Glasgow, 2019.
I will include a write-up similar to the final thesis as an HTML R Markdown document.
ARDSdata.csv - I have no information on who collected this data, or where and how it was obtained.
The mice package implements a method to deal with missing data. The package creates multiple imputations (replacement values) for multivariate missing data. The method is based on Fully Conditional Specification, where each incomplete variable is imputed by a separate model. The MICE algorithm can impute mixes of continuous, binary, unordered categorical and ordered categorical data. In addition, MICE can impute continuous two-level data, and maintain consistency between imputations by means of passive imputation. Many diagnostic plots are implemented to inspect the quality of the imputations.
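A minimal usage sketch of the impute / fit / pool workflow, assuming the mice package is installed (nhanes is a small example dataset that ships with mice):

```r
library(mice)

# nhanes ships with mice and contains missing values in bmi, hyp and chl
imp <- mice(nhanes, m = 5, method = "pmm", seed = 123, printFlag = FALSE)

fit <- with(imp, lm(chl ~ age + bmi))  # fit the model on each imputed set
summary(pool(fit))                     # pool the m fits via Rubin's rules
```

`complete(imp, 1)` returns the first imputed dataset if a single filled-in copy is needed.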
- Van Buuren, S. (2018). Flexible Imputation of Missing Data. Second Edition. Chapman & Hall/CRC, Boca Raton, FL.
- Ad hoc methods and the MICE algorithm
- Convergence and pooling
- Inspecting how the observed data and missingness are related
- Passive imputation and post-processing
- Imputing multilevel data
- Sensitivity analysis with mice
- Generate missing values with ampute
- Imputing Missing Data with R; MICE package
- How do I perform Multiple Imputation using Predictive Mean Matching in R?
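The predictive-mean-matching idea behind the link above can be sketched by hand in base R: for each missing value, the donors are the observed rows whose model-predicted values are closest, and one donor's real value is copied over. (`pmm_impute` is a hypothetical helper on toy data, not the mice implementation.)

```r
set.seed(42)
x <- runif(100)
y <- 2 * x + rnorm(100, sd = 0.2)
y[sample(100, 20)] <- NA              # introduce missingness

obs  <- !is.na(y)
fit  <- lm(y ~ x, subset = obs)       # model fit on complete cases only
pred <- predict(fit, newdata = data.frame(x = x))

pmm_impute <- function(i, k = 5) {
  d      <- abs(pred[obs] - pred[i])  # distance in predicted space
  donors <- which(obs)[order(d)[1:k]] # the k closest observed rows
  y[sample(donors, 1)]                # draw one donor's real observed value
}
y_imp <- y
y_imp[!obs] <- sapply(which(!obs), pmm_impute)
```

Because imputed values are always drawn from observed donors, PMM never produces impossible values (e.g., negative ages).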
- Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58(1), 267-288.
- Hechenbichler, K. and Schliep, K.P. (2004). Weighted k-Nearest-Neighbor Techniques and Ordinal Classification. Discussion Paper 399, SFB 386, Ludwig-Maximilians University Munich (http://www.stat.uni-muenchen.de/sfb386/papers/dsp/paper399.ps)
- Hechenbichler, K. (2005). Ensemble-Techniken und ordinale Klassifikation [Ensemble techniques and ordinal classification]. PhD thesis.
- Samworth, R.J. (2012). Optimal weighted nearest neighbour classifiers. Annals of Statistics, 40, 2733-2763. (available from http://www.statslab.cam.ac.uk/~rjs57/Research.html)
Ran into rank-deficiency problems when training QDA on the imputed datasets.
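QDA estimates a separate covariance matrix per class, so it fails with rank-deficiency errors whenever a class has fewer observations than predictors, or the predictors are collinear within a class. A quick pre-check before calling MASS::qda (`qda_ok` is a hypothetical helper name):

```r
# Returns TRUE when every class has more rows than columns and the
# centred class submatrix has full column rank (no collinearity).
qda_ok <- function(X, g) {
  X <- as.matrix(X)
  all(sapply(split(seq_len(nrow(X)), g), function(idx) {
    Xi <- scale(X[idx, , drop = FALSE], scale = FALSE)  # centre per class
    nrow(Xi) > ncol(Xi) && qr(Xi)$rank == ncol(Xi)      # full column rank?
  }))
}
```

When the check fails, dropping near-constant or collinear columns first (e.g., with caret::findLinearCombos) is one way to make QDA trainable.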
- Tune ML Algorithms in R
- Accuracy Metrics in caret
- Accuracy & Cohen's Kappa - Kappa good for imbalanced datasets
- ROC & AUC - Good for binary outcome
- RMSE & R^2 - Good for continuous outcome
- Logarithmic Loss - Good for multiclass outcomes; a linear translation of the log-likelihood for a binary outcome model (Bernoulli trials).
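The metrics above can be made concrete with small base-R implementations (caret computes all of these itself; this is only an illustrative sketch, and the function names are my own):

```r
accuracy <- function(truth, pred) mean(truth == pred)

cohens_kappa <- function(truth, pred) {
  tab <- table(truth, pred)
  po  <- sum(diag(tab)) / sum(tab)                      # observed agreement
  pe  <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2  # chance agreement
  (po - pe) / (1 - pe)                                  # agreement above chance
}

# Binary log loss; probabilities are clipped to avoid log(0)
log_loss <- function(truth01, prob, eps = 1e-15) {
  p <- pmin(pmax(prob, eps), 1 - eps)
  -mean(truth01 * log(p) + (1 - truth01) * log(1 - p))
}
```

Kappa rescales accuracy by the agreement expected from the class frequencies alone, which is why it is more informative than raw accuracy on imbalanced data.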
- Save trained models
- Try out caret "recipes" (http://topepo.github.io/caret/using-recipes-with-train.html)
- ROC plots
- Performance Tables
- Switch to "caret" package
- Add Parallel Processing
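caret parallelizes train() automatically once a foreach backend is registered (typically doParallel::registerDoParallel(cl)). The base `parallel` package that ships with R illustrates the same worker-cluster idea without extra dependencies:

```r
library(parallel)

cl  <- makeCluster(2)                         # spin up 2 worker processes
res <- parSapply(cl, 1:4, function(i) i^2)    # work is split across workers
stopCluster(cl)                               # always release the workers
```

With caret, the same makeCluster/stopCluster pair wraps the train() call, and each resample/tuning combination is fitted on a separate worker.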
- Multiple Imputation and Ensemble Learning for Classification with Incomplete Data
- Method: Build multiple datasets using Multiple Imputation, then use an ensemble method to combine the classification results.
- Results: "...using the diversity of imputed datasets in an ensemble method leads to a more effective classifier."
- Notes: Flow charts showing imputation / training / ensembling
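The impute / train / ensemble flow can be sketched in a few lines of base R. This is a toy illustration: a crude stochastic mean-plus-noise draw stands in for a real multiple-imputation method such as mice, and glm stands in for whatever classifier is ensembled.

```r
set.seed(1)
n <- 200
x <- rnorm(n)
y <- as.integer(x + rnorm(n, sd = 0.5) > 0)
x[sample(n, 40)] <- NA                        # introduce missingness

m <- 5
models <- lapply(1:m, function(j) {
  xi <- x
  # toy stochastic imputation: one independent draw per imputed copy
  xi[is.na(xi)] <- mean(x, na.rm = TRUE) +
    rnorm(sum(is.na(x)), sd = sd(x, na.rm = TRUE))
  glm(y ~ xi, family = binomial)              # one classifier per copy
})

# majority vote across the m fitted classifiers at a new point
x0    <- data.frame(xi = 1.5)
votes <- sapply(models, function(f) predict(f, newdata = x0, type = "response") > 0.5)
final <- as.integer(mean(votes) > 0.5)
```

The diversity the paper exploits comes from the between-imputation variability: each copy differs only in the imputed cells, so disagreement among the m classifiers reflects imputation uncertainty.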
- Methods to Combine Multiple Imputations
- Method: Ensemble method / stacked dataset with dummy variables for imputed values
- Results: No Sources
- Notes: Original paper on missing data - MCAR, MAR, MNAR
- Classification Uncertainty of Multiple Imputed Data
- Method: Random Forest imputation with trees = 10
- Notes: Discusses methods for making classifications on imputed data.
- Keywords: White paper, Uncertainty measures, Discussion of Rubin's Rules
- Handling missing values in kernel methods with application to microbiology data
- Method:
- Concatenate the multiple imputed data sets and optimize an SVM classifier on the resulting set; this accounts not only for the variability of the parameter estimates but also for the variability of the training observations with respect to the imputed values (the IMI algorithm). In the first algorithm, the training data set was imputed m times, merged into a single large data set, and then used to train a classifier (an SVM in this case). The test data set was then concatenated with the stacked training data set, imputed m times, and separated back out from the training samples for prediction. Each of the m now-complete test data sets was used for prediction with the classifier trained in the previous step. Therefore, for each sample in the test data set, m predictions were produced and a majority vote formed the final prediction.
- A more standard procedure involves fitting separate SVMs to each imputed data set and pooling (i.e., averaging) the performance of the different SVMs. In the second algorithm, the training data was again imputed m times and a classifier was trained on each of the complete data sets. The test data set was then concatenated with each of the m imputed training data sets, imputed once (i.e., m = 1), and used for prediction. Again, m predictions were produced and a majority vote formed the final prediction.
- The IMI algorithm was found to perform better in general.
- Notes: Two general principles should be kept in mind for performing MI at test time:
- Imputation of test data must be done at test time; that is, it is not possible to impute all the data together (training and test).
- When imputing the missing values in test data, it is not possible to use the class (target) variable for the imputation (only the predictors can be used).
- Keywords: Kendall coefficient, Proportion of Useable Cases
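The stacked (IMI-style) training step described above can be sketched in base R. As before, a toy mean-plus-noise draw stands in for a real imputation method, and glm stands in for the SVM used in the paper:

```r
set.seed(2)
n <- 200
x <- rnorm(n)
y <- as.integer(x + rnorm(n, sd = 0.5) > 0)
x[sample(n, 40)] <- NA                        # introduce missingness

m <- 5
# impute the training data m times and row-bind the copies into one set
stacked <- do.call(rbind, lapply(1:m, function(j) {
  xi <- x
  xi[is.na(xi)] <- mean(x, na.rm = TRUE) +
    rnorm(sum(is.na(x)), sd = sd(x, na.rm = TRUE))  # toy stochastic draw
  data.frame(y = y, xi = xi)
}))

fit <- glm(y ~ xi, family = binomial, data = stacked)  # one model on m*n rows
```

At test time, each incomplete test row would itself be imputed m times (without using the class label), scored m times with `fit`, and combined by majority vote, as in the paper's IMI algorithm.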
- Multiple Imputation for Nonresponse in Surveys
- Notes: Rubin's Rules (pooling estimates)
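For a scalar estimate, Rubin's rules combine the m per-imputation estimates q and their variances u as: pooled estimate q̄ = mean(q), within-imputation variance W = mean(u), between-imputation variance B = var(q), and total variance T = W + (1 + 1/m)B. A small sketch (`pool_rubin` is my own helper name; mice's pool() does this, plus the degrees-of-freedom correction, for you):

```r
pool_rubin <- function(q, u) {
  m    <- length(q)
  qbar <- mean(q)                # pooled point estimate
  W    <- mean(u)                # average within-imputation variance
  B    <- var(q)                 # between-imputation variance
  Tvar <- W + (1 + 1/m) * B      # total variance of the pooled estimate
  c(estimate = qbar, variance = Tvar)
}

pool_rubin(q = c(1.0, 1.2, 0.9), u = c(0.04, 0.05, 0.04))
```

The (1 + 1/m) factor inflates the between-imputation component to account for using a finite number of imputations.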
- Khan, S.S., Ahmad, A. and Mihailidis, A. (2018). Bootstrapping and Multiple Imputation Ensemble Approaches for Missing Data.
- Notes: Good literature review on Multiple Imputation and ensembling results.
- Keywords: Ensemble, Multiple Imputation, Bagging, Flow Chart, Classifier Fusion Techniques, Expectation Maximization
- Impact of imputation of missing values on classification error for discrete data
- Notes: Comparison of imputation methods. Good write up on missing data and imputation methods.
- Barnard, J. and Rubin, D.B. (1999). Small sample degrees of freedom with multiple imputation. Biometrika, 86, 948-955.
- Rubin, D.B. (1987). Multiple Imputation for Nonresponse in Surveys. New York: John Wiley and Sons.
- van Buuren, S. and Groothuis-Oudshoorn, K. (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3), 1-67. https://www.jstatsoft.org/v45/i03/
- Notes: MICE package