
Classification analysis of Acute Respiratory Distress Syndrome

rtedwards/ARDS-classification


I. πŸ’‰ ARDS Survival Prediction

Survival prediction for patients diagnosed with ARDS and treated with ECMO (extracorporeal membrane oxygenation)

πŸ§›πŸΌβ€β™‚οΈBackground

🌱 Motivation

Thesis project for MSc Biostatistics at University of Glasgow 2019

πŸ““ Notebooks

I will include a write-up similar to the final thesis as an HTML R Markdown document.

πŸ“ Datasets

ARDSdata.csv - I don't have information on who collected this data, or where and how it was obtained.

II. πŸ“¦ Packages

Caret πŸ₯•

Books πŸ“š

Vignettes 🎻

  1. A Short Introduction to the caret Package
  2. Caret Practice

Parallel Computing πŸ’Ύ

Vignette 🎻
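A minimal sketch of registering a parallel backend for caret with the doParallel package (the worker count here is illustrative; this snippet is not taken from the repo's own code):

```r
library(doParallel)

cl <- makePSOCKcluster(2)   # start 2 worker processes
registerDoParallel(cl)      # caret::train() picks this backend up
                            # automatically for its resampling loops

# ... run caret::train(...) here ...

stopCluster(cl)             # always release the workers when done
```

Registering the backend once is enough; caret detects it via foreach and parallelizes cross-validation folds without further changes.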

Imputation 🐁

The mice package implements a method to deal with missing data. The package creates multiple imputations (replacement values) for multivariate missing data. The method is based on Fully Conditional Specification, where each incomplete variable is imputed by a separate model. The MICE algorithm can impute mixes of continuous, binary, unordered categorical and ordered categorical data. In addition, MICE can impute continuous two-level data, and maintain consistency between imputations by means of passive imputation. Many diagnostic plots are implemented to inspect the quality of the imputations.

– Package 'mice'
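As a minimal sketch of the multiple-imputation workflow described above, using the small nhanes example dataset that ships with mice rather than the ARDS data:

```r
library(mice)

# nhanes ships with mice and has missing values in several columns
imp <- mice(nhanes, m = 5, method = "pmm", seed = 123, printFlag = FALSE)

# Fit the same model on each of the 5 completed datasets,
# then pool the estimates with Rubin's rules
fit <- with(imp, lm(chl ~ age + bmi))
summary(pool(fit))
```

`complete(imp, i)` returns the i-th completed dataset if the downstream classifier needs plain data frames instead of a `mira` object.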

Books πŸ“š

Vignettes 🎻

  1. Ad hoc methods and the MICE algorithm
  2. Convergence and pooling
  3. Inspecting how the observed data and missingness are related
  4. Passive imputation and post-processing
  5. Imputing multilevel data
  6. Sensitivity analysis with mice
  7. Generate missing values with ampute

More Examples

III. πŸ› οΈ Methods

Logistic Regression + LASSO Regularization

References

  • Tibshirani, Robert (1996). "Regression Shrinkage and Selection via the Lasso". Journal of the Royal Statistical Society, Series B, 58(1), 267–288.
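A minimal sketch of LASSO-regularized logistic regression with glmnet; the simulated predictors and outcome below are illustrative, not from the ARDS data:

```r
library(glmnet)

set.seed(1)
x <- matrix(rnorm(200 * 10), nrow = 200)            # 10 candidate predictors
y <- rbinom(200, 1, plogis(x[, 1] - 0.5 * x[, 2]))  # outcome driven by 2 of them

# alpha = 1 selects the LASSO penalty; cv.glmnet picks lambda by cross-validation
cv <- cv.glmnet(x, y, family = "binomial", alpha = 1)

# Coefficients at the selected lambda; many are shrunk exactly to zero
coef(cv, s = "lambda.min")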

Linear Discriminant Analysis

References

Vignettes 🎻
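A minimal sketch of LDA with MASS::lda(), shown on the built-in iris data as a stand-in for the ARDS predictors:

```r
library(MASS)

# lda() fits Fisher's linear discriminant: one shared covariance matrix,
# class-specific means
fit <- lda(Species ~ ., data = iris)

pred <- predict(fit, iris)$class
mean(pred == iris$Species)   # resubstitution accuracy
```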

Quadratic Discriminant Analysis

References

Vignettes 🎻

K-Nearest Neighbors

References
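A minimal KNN sketch with class::knn(), again using iris as illustrative data; k = 5 is an arbitrary choice here, not a tuned value:

```r
library(class)

set.seed(1)
idx   <- sample(nrow(iris), 100)
train <- iris[idx, 1:4]
test  <- iris[-idx, 1:4]

# knn() labels each test row by majority vote among its k nearest neighbors
pred <- knn(train, test, cl = iris$Species[idx], k = 5)
mean(pred == iris$Species[-idx])   # held-out accuracy
```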

Random Forests

References
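A minimal random forest sketch with the randomForest package (iris again stands in for the ARDS data; ntree = 500 is the package default, stated explicitly):

```r
library(randomForest)

set.seed(1)
rf <- randomForest(Species ~ ., data = iris, ntree = 500)

rf$confusion     # out-of-bag confusion matrix
importance(rf)   # variable importance (mean decrease in Gini)
```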

Support Vector Machines

References

Vignettes 🎻
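A minimal SVM sketch with e1071::svm(); the radial kernel and cost = 1 are illustrative defaults, not tuned values:

```r
library(e1071)

set.seed(1)
fit <- svm(Species ~ ., data = iris, kernel = "radial", cost = 1)

mean(predict(fit, iris) == iris$Species)   # training accuracy
```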

IV. Misc Topics

Imbalanced Data Sets

Vignettes 🎻

Rank Deficiency

Ran into some rank deficiency problems when training QDA on the imputed datasets. QDA estimates a separate covariance matrix for each class, and each of those matrices must be full rank; collinear predictors or classes with too few observations make a matrix singular and the fit fails.
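The failure mode is easy to reproduce: a sketch with MASS::qda() and a deliberately duplicated column (iris again stands in for the imputed ARDS data):

```r
library(MASS)

# QDA estimates a separate covariance matrix per class; an exactly
# collinear predictor makes that matrix singular (rank deficient)
d <- iris
d$dup <- d$Sepal.Length * 2   # exact linear copy of an existing column

res <- try(qda(Species ~ ., data = d), silent = TRUE)
inherits(res, "try-error")    # TRUE: qda() reports a rank deficiency
```

Dropping one of the collinear columns (or any predictor flagged by caret::findLinearCombos()) lets the fit go through.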

Tuning Parameters πŸ“»

  • Tune ML Algorithms in R
  • Accuracy Metrics in caret
    • Accuracy & Cohen's Kappa - Kappa good for imbalanced datasets
    • ROC & AUC - Good for binary outcome
    • RMSE & R^2 - Good for continuous outcome
    • Logarithmic Loss - Good for multiclass outcomes. For a binary outcome model (Bernoulli trials), it is a linear translation of the log-likelihood.
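A minimal sketch of choosing the tuning metric in caret; setting metric = "Kappa" makes train() select the tuning parameter by Cohen's kappa rather than raw accuracy, which is the relevant choice for imbalanced classes (the grid of k values below is illustrative):

```r
library(caret)

set.seed(1)
ctrl <- trainControl(method = "cv", number = 5)

# metric = "Kappa" makes caret pick k by Cohen's kappa, not raw accuracy
fit <- train(Species ~ ., data = iris, method = "knn",
             metric = "Kappa", trControl = ctrl,
             tuneGrid = data.frame(k = c(3, 5, 7)))

fit$bestTune   # the k selected by cross-validated kappa
```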

V. To-Do

Bibliography

  • Multiple Imputation and Ensemble Learning for Classification with Incomplete Data

    • Method: Build multiple datasets using multiple imputation, then use an ensemble method to combine the classification results.
    • Results: "...using the diversity of imputed datasets in an ensemble method leads to a more effective classifier."
    • Notes: Flow charts showing imputation / training / ensembling
  • Methods to Combine Multiple Imputations

    • Method: Ensemble method / stacked dataset with dummy variables for imputed values
    • Results: No Sources
  • Inference and Missing Data

  • Notes: Original paper on missing data - MCAR, MAR, MNAR

  • Classification Uncertainty of Multiple Imputed Data

    • Method: Random Forest imputation with trees = 10
    • Notes: Discusses methods for making classifications on imputed data.
    • Keywords: White paper, Uncertainty measures, Discussion of Rubin's Rules
  • Handling missing values in kernel methods with application to microbiology data

    • Method:
      1. Concatenate the multiple imputed datasets and optimize an SVM classifier on the resulting set; this accounts not only for the variability of the parameter estimates but also for the variability of the training observations with respect to the imputed values (the IMI algorithm). The training dataset is imputed m times and merged into a single large dataset, which is used to train a classifier (an SVM in this case). The test dataset is then concatenated with the stacked training dataset, imputed m times, and separated from the training samples again for prediction. Each of the m now-complete test datasets is run through the classifier trained in the previous step, so every test sample receives m predictions, and a majority vote forms its final prediction.
      2. A more standard procedure fits a separate SVM to each imputed dataset and pools (i.e., averages) the performance of the different SVMs. Here the training data is again imputed m times and a classifier is trained on each completed dataset. The test dataset is then concatenated with each of the m imputed training datasets, imputed once (i.e., m = 1), and used for prediction. Again, m predictions are produced and a majority vote forms the final prediction.
    • The IMI algorithm was found to perform better in general.
    • Notes: Two general principles should be kept in mind for performing MI at test time:
      1. Imputation of test data must be done in test time, that is, it is not possible to do the imputation of all data altogether (training and test).
      2. When imputing the missing values in test data, it is not possible to use the class (target) variable for the imputation (only the predictors can be used).
    • Keywords: Kendall coefficient, Proportion of Useable Cases
  • Multiple Imputation for Nonresponse in Surveys

    • Notes: Rubin's Rules (pooling estimates)
  • Khan, Shehroz S., Ahmad, Amir, and Mihailidis, Alex (2018). Bootstrapping and Multiple Imputation Ensemble Approaches for Missing Data.

    • Notes: Good literature review on Multiple Imputation and ensembling results.
    • Keywords: Ensemble, Multiple Imputation, Bagging, Flow Chart, Classifier Fusion Techniques, Expectation Maximization
  • Impact of imputation of missing values on classification error for discrete data

    • Notes: Comparison of imputation methods. Good write up on missing data and imputation methods.
  • Barnard, J. and Rubin, D.B. (1999). Small sample degrees of freedom with multiple imputation. Biometrika, 86, 948-955.

  • Rubin, D.B. (1987). Multiple Imputation for Nonresponse in Surveys. New York: John Wiley and Sons.

  • van Buuren S and Groothuis-Oudshoorn K (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3), 1-67. https://www.jstatsoft.org/v45/i03/

    • Notes: MICE package
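The majority-vote step that several of the papers above share (m predictions per test sample, one per imputation, combined into a final label) can be sketched in a few lines of base R; `majority_vote` and the toy `preds` matrix are illustrative names, not from any of the cited papers:

```r
# 'preds' is an n_test x m matrix of class labels, one column per imputation.
majority_vote <- function(preds) {
  apply(preds, 1, function(row) names(which.max(table(row))))
}

preds <- matrix(c("A", "A", "B",
                  "B", "B", "B"),
                nrow = 2, byrow = TRUE)
majority_vote(preds)   # "A" "B"
```

Note that `which.max` breaks exact ties by taking the first label in table order; a real implementation may want an explicit tie-breaking rule.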
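Rubin's rules, referenced repeatedly above for pooling estimates across imputations, combine m point estimates and their variances as: pooled estimate = mean of the estimates, total variance = within-imputation variance + (1 + 1/m) × between-imputation variance. A sketch with illustrative numbers:

```r
est <- c(1.02, 0.97, 1.05)   # estimates from m = 3 imputed datasets
se  <- c(0.10, 0.11, 0.09)   # their standard errors
m   <- length(est)

q_bar <- mean(est)                 # pooled point estimate
u_bar <- mean(se^2)                # within-imputation variance
b     <- var(est)                  # between-imputation variance
t_var <- u_bar + (1 + 1/m) * b     # total variance

c(estimate = q_bar, se = sqrt(t_var))
```

This is exactly what mice::pool() computes (along with Barnard–Rubin degrees of freedom) when given a list of fitted models.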
