-Clustering Project: Drivers of time in hospital for diabetic patients.
- Data Science Team Members: Gabby Broussard
- I have been asked to analyze data obtained from hospital admissions for diabetic patients between 1999 and 2008 in order to discover the drivers of the length of time spent in the hospital.
Data Source: https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008
- Deliver a notebook presentation of models used to isolate drivers of increased length of time spent in the hospital.
- Use clustering methodologies to engineer new features and visualize factors that contribute to length of time spent in the hospital.
- All files referenced in this presentation are available in the github repository for this project: https://github.com/GabbyBarajasBroussard/diabetic-clustering-project
alpha(𝛼): 1 - confidence level (95% confidence level -> 𝛼=.05 )
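As a minimal sketch of how this alpha value is applied in a statistical test, the example below runs a Pearson correlation on synthetic data and compares the p-value to 𝛼 = .05. The data, the variable names, and the helper `is_significant` are illustrative assumptions and are not taken from the project notebook.

```python
import numpy as np
from scipy import stats

ALPHA = 0.05  # 1 - 0.95 confidence level


def is_significant(p_value, alpha=ALPHA):
    """Return True when the null hypothesis is rejected at the given alpha."""
    return bool(p_value < alpha)


# Synthetic stand-in data (assumption): this is NOT the hospital dataset.
rng = np.random.default_rng(42)
num_medications = rng.integers(1, 40, size=500).astype(float)
time_in_hospital = 1 + 0.2 * num_medications + rng.normal(0, 2, size=500)

# H0: there is no linear correlation between the two variables.
r, p = stats.pearsonr(num_medications, time_in_hospital)
print(f"r = {r:.3f}, p = {p:.3g}, reject H0: {is_significant(p)}")
```

A p-value below alpha means the null hypothesis is rejected; otherwise we fail to reject it.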
- The insulin clusters were the only clusters selected as best features by Select K Best and RFE.
- This is evidenced by the insulin features being listed among the top features for predicting length of hospital stay.
- When these clusters were fit to regression models, the best-performing model was linear regression, which scored 2.918, beating the baseline of 2.939.
- With more time, I would like to identify which patient demographics within the insulin clusters had an increased length of stay. I would also like to compare pediatric and adult lengths of stay.
- With more resources, I would like to include medication compliance and whether the patient had documented diabetes education prior to admission as features.
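The feature selection and baseline comparison described in the findings above could look roughly like the sketch below. It uses synthetic data with hypothetical column names (`insulin_cluster_0`, `noise_a`, and so on), and it assumes RMSE as the error metric, since the metric behind the reported scores is not named here. It is not the project's actual code.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic stand-in data (assumption): two informative cluster flags, two noise columns.
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "insulin_cluster_0": rng.integers(0, 2, n),
    "insulin_cluster_1": rng.integers(0, 2, n),
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
})
y = 3 + 2 * X["insulin_cluster_0"] + 1.5 * X["insulin_cluster_1"] + rng.normal(0, 1, n)

# Select K Best: rank features by a univariate F-test against the target.
kbest = SelectKBest(score_func=f_regression, k=2).fit(X, y)
kbest_features = list(X.columns[kbest.get_support()])

# RFE: repeatedly drop the weakest feature according to the model's coefficients.
rfe = RFE(estimator=LinearRegression(), n_features_to_select=2).fit(X, y)
rfe_features = list(X.columns[rfe.get_support()])

# Baseline: always predict the mean of the target.
baseline_rmse = mean_squared_error(y, np.full(len(y), y.mean())) ** 0.5

# Linear regression on the selected features.
model = LinearRegression().fit(X[kbest_features], y)
model_rmse = mean_squared_error(y, model.predict(X[kbest_features])) ** 0.5

print("Select K Best:", kbest_features)
print("RFE:", rfe_features)
print(f"baseline RMSE = {baseline_rmse:.3f}, model RMSE = {model_rmse:.3f}")
```

A model is only useful if its error is lower than that of the mean baseline, which is the comparison reported in the findings.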
Progression through the Data Science Pipeline: PLAN -> ACQUIRE -> PREPARE -> EXPLORE -> MODEL -> DELIVER
Each step in my process is recorded and staged on a Trello board at: https://trello.com/b/QJblwzYD
- Create GitHub organization and set up GitHub repo, to include readme.md and .gitignore.
- Brainstorm a list of questions and form hypotheses about how variables might impact one another.
- Read data from UCI's Machine Learning Repository database into a Pandas dataframe to be analyzed using Python.
- Created a function, acquire(df), as a reproducible component for acquiring necessary data.
- Carefully reviewed data, identifying any missing, erroneous, or invalid values.
- Explored value counts of the dataframe.
- Created and called a function, prepare, as a reproducible component that cleans and prepares data for analysis by renaming columns, handling missing values, adjusting data types, and handling any data integrity issues.
- Split the data into train, validate and test sets.
- Visualized all combinations of variables to explore relationships.
- Tested for independent variables that correlate with time in hospital.
- Developed hypotheses and ran statistical tests to accept or reject null hypotheses.
- Summarized takeaways and conclusions.
- Scaled data using MinMax scaler.
- Used clustering methodologies to create new features to model on.
- Ran additional statistical tests on clustered data.
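The split, scaling, and clustering steps listed above can be sketched as follows. This is a minimal example under stated assumptions: the data is synthetic, the split proportions, the choice of three clusters, and the helper `add_cluster` are hypothetical, and the column names are used only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption): NOT the hospital dataset.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "num_medications": rng.integers(1, 40, 300),
    "num_lab_procedures": rng.integers(1, 100, 300),
    "time_in_hospital": rng.integers(1, 15, 300),
})

# Split into train, validate, and test sets (proportions are an assumption).
train_validate, test = train_test_split(df, test_size=0.2, random_state=123)
train, validate = train_test_split(train_validate, test_size=0.3, random_state=123)

# Fit the MinMax scaler on train only, so validate and test do not leak into it.
features = ["num_medications", "num_lab_procedures"]
scaler = MinMaxScaler().fit(train[features])

# Fit KMeans on the scaled train data; the cluster label becomes a new feature.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=123)
kmeans.fit(scaler.transform(train[features]))


def add_cluster(split):
    """Return a copy of the split with a 'cluster' column added."""
    out = split.copy()
    out["cluster"] = kmeans.predict(scaler.transform(split[features]))
    return out


train, validate, test = (add_cluster(s) for s in (train, validate, test))
print(train["cluster"].value_counts())
```

Fitting the scaler and the clustering model on the train split only keeps the validate and test sets unseen until evaluation.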
- Developed a baseline model.
- Modeled train and validate data on OLS, Lasso Lars, and GLM Regression Models.
- Modeled test on Linear Regression.
Deliver:
- Clearly document all code in a reproducible Jupyter notebook called Walkthrough.
Instructions for Reproducing My Findings:
- Start by cloning the GitHub repository. From your terminal command line, type: git clone git@github.com:GabbyBarajasBroussard/diabetic-clustering-project.git
Download the following files from https://github.com/GabbyBarajasBroussard/diabetic-clustering-project to your working directory:
- acquire.py
- prepare.py
- Download the .csv from: https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008
Data obtained from: https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008
Photo by Diabetesmagazijn.nl on Unsplash