AC209: Data Science is a course offered at Harvard SEAS. I took it in fall 2016 and completed eight coding assignments and a final group project.
The course focuses on analyzing messy, real-life data and making predictions with statistical and machine learning methods. The material integrates the five key facets of a data-driven investigation:
(1) data collection: data wrangling, cleaning, and sampling to get a suitable data set;
(2) data management: accessing data quickly and reliably;
(3) exploratory data analysis: generating hypotheses and building intuition;
(4) prediction or statistical learning;
(5) communication: summarizing results through visualization, stories, and interpretable summaries.
Libraries used: NumPy, pandas, SciPy, scikit-learn, Matplotlib, BeautifulSoup
- Linear regression
- Linear regression with regularization (Ridge and Lasso)
- Logistic regression
- Multinomial logistic regression
- LDA and QDA
- KNN
- Random forest
- Bagging and boosting
- SVM
- dimension reduction
- variable selection
- parameter tuning
- bootstrapping and cross-validation
- model evaluation
- pulling data out of HTML and XML files
- imbalanced data
- missing data
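
To show how several of the topics above fit together, here is a minimal sketch in scikit-learn. It is not taken from the course assignments. The synthetic data set, the 5% missing-value rate, the grid of `C` values, and the choice of balanced accuracy as the metric are all assumptions made for illustration. The sketch imputes missing values, reweights an imbalanced class, fits a regularized logistic regression, tunes the regularization strength with 5-fold cross-validation, and evaluates the result on held-out data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced data (roughly 90% / 10%); illustrative only.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4,
    weights=[0.9, 0.1], random_state=0,
)

# Blank out about 5% of the entries to simulate missing data.
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

# Stratify so both splits keep the same class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Imputation and scaling live inside the pipeline, so they are
# re-fitted on each cross-validation training fold (no leakage).
model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    # class_weight="balanced" reweights classes inversely to frequency.
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# Parameter tuning: 5-fold cross-validation over the inverse
# regularization strength C (hypothetical grid).
search = GridSearchCV(
    model,
    param_grid={"clf__C": [0.01, 0.1, 1.0, 10.0]},
    scoring="balanced_accuracy",
    cv=5,
)
search.fit(X_train, y_train)

# Model evaluation on data the search never saw.
best_C = search.best_params_["clf__C"]
test_score = balanced_accuracy_score(y_test, search.predict(X_test))
print(f"best C: {best_C}, held-out balanced accuracy: {test_score:.3f}")
```

Balanced accuracy is used here instead of plain accuracy because, with a 90/10 class split, a model that always predicts the majority class would already score about 90% on plain accuracy.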