This repository contains a series of data science exercises and projects covering clustering, data wrangling, simulation, time series analysis, and classification. Each project focuses on specific techniques applied to its own dataset.
- Clustering Activities
- Data Wrangling Medicaid
- Simulation and Hypothesis: Memberships
- Time Series Analysis
- Wine Quality Classification
In this project, we use the k-means clustering algorithm to analyze time series data collected from wearable devices during activities like sleeping, running, and walking.
- Dataset: data/activity.csv
- Goal: Identify the number of activities and group individuals into clusters based on their activity patterns.
- Challenge: The sampling rate is inconsistent: one cohort was sampled every second and another every 2 seconds.
- Read and inspect the dataset for missing values.
- Perform clustering to uncover hidden patterns.
- Visualize and interpret the results.
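One possible way to handle the mixed sampling rates before clustering is to bring both cohorts onto a common 2-second grid by keeping every other sample from the 1-second cohort. The sketch below does that and then runs a plain NumPy version of k-means (Lloyd's algorithm). It does not read data/activity.csv: the signals are synthetic stand-ins, and the `downsample` and `kmeans` helpers, the intensity levels, and the mean/standard-deviation features are all assumptions made for illustration.

```python
import numpy as np

def downsample(signal, factor=2):
    """Keep every `factor`-th sample, e.g. 1 s sampling -> 2 s sampling."""
    return signal[::factor]

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid; keep the old one if a cluster is empty.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Synthetic stand-in data: made-up intensity levels for three activities.
rng = np.random.default_rng(42)
levels = {"sleeping": 0.1, "walking": 1.0, "running": 3.0}
features = []
for level in levels.values():
    for person in range(10):
        fast_cohort = person % 2 == 0        # this cohort is sampled every second
        n = 120 if fast_cohort else 60       # same 2-minute window for both cohorts
        signal = level + 0.1 * rng.standard_normal(n)
        if fast_cohort:
            signal = downsample(signal, 2)   # now effectively every 2 seconds
        features.append([signal.mean(), signal.std()])

X = np.array(features)                       # one row of features per individual
labels, centroids = kmeans(X, k=3)
```

In practice a library implementation such as `sklearn.cluster.KMeans` would likely replace the hand-written `kmeans` helper, and the number of clusters would be chosen by inspecting the data rather than fixed at 3.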
This exercise involves working with two datasets:
- IRS Statistics of Income (SOI)
- Medicaid Data per State
The goal is to create a summary table that explores medication costs per Medicaid enrollee by state.
- Answer questions such as:
  - What drugs contribute most to a state's spending?
  - Are there regional patterns in drug prescriptions?
The exercise builds insight into healthcare spending and gives practice with the data wrangling skills needed in real-world projects.
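A summary table of this kind can be sketched with a pandas group-by and merge. The example below is only an illustration: the two small frames are toy stand-ins, and every column name (`state`, `drug`, `amount_reimbursed`, `enrollees`) and value is invented rather than taken from the actual datasets.

```python
import pandas as pd

# Toy stand-in rows; all column names and values are invented for illustration.
drug_spend = pd.DataFrame({
    "state": ["CA", "CA", "TX", "TX"],
    "drug": ["drug_a", "drug_b", "drug_a", "drug_b"],
    "amount_reimbursed": [300.0, 100.0, 80.0, 120.0],
})
enrollment = pd.DataFrame({
    "state": ["CA", "TX"],
    "enrollees": [100, 40],
})

# Total spending per state, joined to the enrollment counts.
summary = (
    drug_spend.groupby("state", as_index=False)["amount_reimbursed"].sum()
    .merge(enrollment, on="state", how="left")
)
summary["cost_per_enrollee"] = summary["amount_reimbursed"] / summary["enrollees"]

# Drug with the largest spending in each state.
top_rows = drug_spend.groupby("state")["amount_reimbursed"].idxmax()
top_drug = drug_spend.loc[top_rows, ["state", "drug"]]
```

The left merge keeps every state that has spending data even if its enrollment count is missing, which makes gaps visible as `NaN` instead of silently dropping rows.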
This project models revenue for a membership-based training website.
- Dataset: memberships_info.csv
- Goal: Develop a generative model to predict revenue for the upcoming year.
- Gender differences in training completion rates and dropout probabilities.
- Annual growth in memberships (13% increase, SD: 1.4%).
- Historical data insights for over 90,000 enrollees in 2021.
Use the model to estimate revenue trends and understand membership dynamics.
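The growth assumption above lends itself to a small Monte Carlo simulation. The sketch below draws the annual growth rate from a normal distribution with the stated mean and standard deviation and turns it into a revenue distribution. It is a simplification under stated assumptions: it uses 90,000 as a round base figure, the `ANNUAL_FEE` value is hypothetical, and completion and dropout effects are ignored.

```python
import numpy as np

rng = np.random.default_rng(0)

BASE_MEMBERS = 90_000   # round figure; the text says "over 90,000" enrollees in 2021
GROWTH_MEAN = 0.13      # 13% annual increase
GROWTH_SD = 0.014       # standard deviation of 1.4%
ANNUAL_FEE = 100.0      # hypothetical price per member, not from the dataset
N_SIMS = 10_000

# One simulated growth rate per run, then the implied membership and revenue.
growth = rng.normal(GROWTH_MEAN, GROWTH_SD, size=N_SIMS)
members = BASE_MEMBERS * (1.0 + growth)
revenue = members * ANNUAL_FEE

mean_revenue = revenue.mean()
low, high = np.percentile(revenue, [2.5, 97.5])   # central 95% interval
```

Reporting an interval alongside the mean conveys how much of the revenue uncertainty comes from the growth rate alone.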
Analyze and generate synthetic time series data based on the following components:
- Trend: Exponential growth function.
- Seasonality: Quarterly sine wave.
- Noise: Gaussian distribution.
- Compute and plot each component independently.
- Combine the components to create the final time series.
- Use np.random.seed(42) to ensure reproducibility.
The dataset spans 200 months, showcasing seasonal patterns and trends.
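The three components can be generated and combined along the following lines. This is a minimal sketch: the trend scale and rate, the seasonal amplitude, and the noise scale are made-up values, and "quarterly" is assumed to mean one full sine cycle every 3 months.

```python
import numpy as np

np.random.seed(42)  # fixed seed, as recommended above, for reproducibility

n_months = 200
t = np.arange(n_months)

# Trend: exponential growth (scale 10 and rate 0.01 are made-up values).
trend = 10.0 * np.exp(0.01 * t)

# Seasonality: sine wave with an assumed period of 3 months (one quarter).
seasonality = 5.0 * np.sin(2 * np.pi * t / 3)

# Noise: zero-mean Gaussian with unit standard deviation (assumed scale).
noise = np.random.normal(loc=0.0, scale=1.0, size=n_months)

# Final series: additive combination of the three components.
series = trend + seasonality + noise
```

Each of `trend`, `seasonality`, `noise`, and `series` can then be plotted against `t` separately, for example with matplotlib, to see how the components add up.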
Classify wine quality (scale: 0–10) using various machine learning models.
- Dataset: Contains features such as acidity and density.
- Goal: Build models to classify wine quality and evaluate performance.
- K-Nearest Neighbors (KNN)
- Logistic Regression
- Random Forest
- XGBoost
- Compare models based on classification accuracy.
- Visualize confusion matrix heatmaps to analyze prediction quality.
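To make the evaluation step concrete, the sketch below builds a confusion matrix and an accuracy score with plain NumPy and uses them to compare two sets of predictions. The labels and the `model_a` / `model_b` predictions are made-up placeholders, not output from the models listed above, and the `confusion_matrix` and `accuracy` helpers are written here for illustration.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Count matrix: rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)   # increment one cell per sample
    return cm

def accuracy(cm):
    """Fraction of samples on the diagonal (correct predictions)."""
    return np.trace(cm) / cm.sum()

N_CLASSES = 11  # quality scores 0 through 10

# Made-up labels and predictions, purely illustrative.
y_true = np.array([5, 6, 5, 7, 6, 5])
preds = {
    "model_a": np.array([5, 6, 6, 7, 6, 5]),
    "model_b": np.array([5, 5, 5, 6, 6, 6]),
}

matrices = {name: confusion_matrix(y_true, p, N_CLASSES) for name, p in preds.items()}
scores = {name: accuracy(cm) for name, cm in matrices.items()}
best = max(scores, key=scores.get)
```

With scikit-learn installed, `sklearn.metrics.confusion_matrix` produces the same kind of matrix, and each entry of `matrices` can be passed to a heatmap function such as `seaborn.heatmap` for the visual comparison.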
- Clone the repository: git clone https://github.com/data-science-notebooks
Happy coding! 🚀