Welcome to Data Science at General Assembly! This is where we will be hosting all class slides, assignments, resources, and more. Course materials for General Assembly's Data Science course in Washington, DC (1/12/16 - 3/31/16).
- Instructors: Aleks Ontman & Alex Sherman
- TA: Al Johri
Tuesday | Thursday |
---|---|
1/12: Introduction to Data Science | 1/14: Python Data Model; Data Reading and Cleaning |
1/19: Command Line and Version Control | 1/21: Exploratory Data Analysis |
1/26: Data Visualization | 1/28: Machine Learning Introduction |
2/2: K-Nearest Neighbors | 2/4: Linear Regression |
2/9: Web Scraping and Data Cleansing | 2/11: Basic Model Evaluation |
2/16: Logistic Regression | 2/18: Advanced Model Evaluation |
2/23: First Project Presentation | 2/25: Naive Bayes and Text Data |
3/1: Natural Language Processing | 3/3: Kaggle Competition |
3/8: Decision Trees | 3/10: Ensembling (Random Forest) |
3/15: Advanced scikit-learn/Clustering | 3/17: Final Project Presentation |
3/22: Final Project Presentation | 3/24: Selected Topics, Wrap-up |
Date | Assignment |
---|---|
1/21 | HW#1: Chipotle Python |
1/26 | HW#2: Command Line |
1/28 | HW#3: IMDB with Pandas |
2/4 | Project Brainstorming Deadline |
2/9 | HW#4: Yelp Votes Linear Regression |
2/11 | Project Question and Dataset Due |
2/16 | HW#5: Web Scraping - IMDB (Optional) |
2/23 | First Project Presentation |
3/8 | HW#6: Naive Bayes with Yelp Review Text (Optional) & Draft Project Paper Due |
3/15 | Peer Review Due |
3/22 | Final Project Presentations |
- Welcome from General Assembly staff
- Course overview (slides)
- Introduction to data science (slides)
- Types of data (slides)
- Data science tools (slides)
- Doing Data Science at Twitter
- 17 types of data science
- The Python Data Model
- Wrap up: Slack tour, feedback form
- Codecademy's Python course: Good beginner material, including tons of in-browser exercises.
- DataQuest: Similar interface to Codecademy, but focused on teaching Python in the context of data science.
- Google's Python Class: Slightly more advanced, including hours of useful lecture videos and downloadable exercises (with solutions).
- A Crash Course in Python for Scientists: Read through the Overview section for a quick introduction to Python.
- Python for Informatics: A very beginner-oriented book, with associated slides and videos.
- Beginner and intermediate workshop code: Useful for review and reference.
- Python Tutor: Allows you to visualize the execution of Python code.
Resources:
- For a useful look at the different types of data scientists, read Analyzing the Analyzers (32 pages).
- For some thoughts on what it's like to be a data scientist, read these short posts from Win-Vector and Datascope Analytics.
- Quora has a data science topic FAQ with lots of interesting Q&A.
- Keep up with local data-related events through the Data Community DC event calendar or weekly newsletter.
- Data Science vs Statistics
- 15 Books every Data Scientist Should Read
- 50+ Free Data Science Books
- FREE BOOK: Introduction to Statistical Learning
- Building Data Science Teams
- R for Everyone: Great reference for R
- Doing Data Science
- Python for Data Analysis
- Getting Started with Data Science
- Python:
- Discuss Course Project
- Wrap up: Course schedule, office hours
Homework:
- Complete the Python homework assignment with the Chipotle data, add a commented Python script to your GitHub repo, and submit a link using the homework submission form. (Note: Pandas, which is covered in class 4, should not be used for this assignment.)
- Review the code from the beginner and intermediate Python workshops. If you don't feel comfortable with any of the content (excluding the "requests" and "APIs" sections), you should spend some time this weekend practicing Python:
  - Introduction to Python does a great job explaining Python essentials and includes tons of example code.
  - If you like learning from a book, Python for Informatics has useful chapters on strings, lists, and dictionaries.
  - If you prefer interactive exercises, try these lessons from Codecademy: "Python Lists and Dictionaries" and "A Day at the Supermarket".
  - If you have more time, try missions 2 and 3 from DataQuest's Learning Python course.
  - If you've already mastered these topics and want more of a challenge, try solving Python Challenge number 1 (decoding a message) and send me your code in Slack.
- To give you a framework for thinking about your project, watch What is machine learning, and how does it work? (10 minutes). (This is the IPython notebook shown in the video.) Alternatively, read A Visual Introduction to Machine Learning, which focuses on a specific machine learning model called decision trees.
- Optional: Browse through some more example student projects, which may help to inspire your own project!
- Install the Anaconda distribution of Python 2.7.x.
  - If you choose not to use Anaconda, here is a list of the Python packages you will need to install during the course.
Resources:
- Want to understand Python's comprehensions? Think in Excel or SQL may be helpful if you are still confused by list comprehensions.
- My code isn't working is a great flowchart explaining how to debug Python errors.
- PEP 8 is Python's "classic" style guide, and is worth a read if you want to write readable code that is consistent with the rest of the Python community.
- If you want to understand Python at a deeper level: Ned Batchelder's Loop Like A Native, Python Names and Values, Raymond Hettinger's Transforming Code into Beautiful, Idiomatic Python and Python Epiphanies are excellent presentations.
- Everything is an object in Python
- Nate Silver on the Art and Science of Prediction
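For anyone still puzzling over list comprehensions, here is a minimal sketch comparing a plain loop with the equivalent comprehension. The item names are made up for illustration and are not taken from any course dataset.

```python
# Hypothetical sample data, made up for illustration.
items = ["Chicken Burrito", "Chips", "Steak Burrito", "Canned Soda"]

# Loop version: keep only the burritos and upper-case their names.
burritos_loop = []
for item in items:
    if "Burrito" in item:
        burritos_loop.append(item.upper())

# Comprehension version: the `if` clause filters (like WHERE in SQL) and
# the leading expression transforms (like SELECT in SQL).
burritos_comp = [item.upper() for item in items if "Burrito" in item]

print(burritos_loop == burritos_comp)
print(burritos_comp)
```

Both versions build the same list; the comprehension simply states the filter and the transformation in one expression.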
Homework:
- Work through GA's friendly command line tutorial using Terminal (Linux/Mac) or Git Bash (Windows), and then browse through this command line reference.
- Watch videos 1 through 8 (21 minutes) of Introduction to Git and GitHub.
Create a Markdown document that includes your answers to questions 1-3 below and the code you used to arrive at those answers. Add this file to a GitHub repo that you'll use for all of your coursework, and submit a link to your repo using the homework submission form:
- Complete the command line homework assignment with the Chipotle data.
- Using `chipotle.tsv` in the `data` subdirectory:
  - Look at the head and the tail, and think for a minute about how the data is structured. What do you think each column means? What do you think each row means? Tell me! (If you're unsure, look at more of the file contents.)
  - How many orders do there appear to be?
  - How many lines are in the file?
  - Which burrito is more popular, steak or chicken?
  - Do chicken burritos more often have black beans or pinto beans?
- Count the number of occurrences of the word 'dictionary' (regardless of case) across all files in the DAT9 repo.
- Optional: Use the command line to discover something "interesting" about the Chipotle data. The advanced commands may be helpful to you!
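If you want to sanity-check your command line answers later, the same kind of inspection can be sketched in Python. The rows and column names below (`order_id`, `quantity`, `item_name`) are made up for illustration and are an assumption, not the actual contents of `chipotle.tsv`.

```python
import csv

# Made-up rows in an assumed tab-separated layout; NOT the real chipotle.tsv.
sample_tsv = (
    "order_id\tquantity\titem_name\n"
    "1\t1\tChips\n"
    "1\t1\tIzze\n"
    "2\t2\tChicken Bowl\n"
    "3\t1\tSteak Burrito\n"
)

lines = sample_tsv.splitlines()
print(lines[:2])    # roughly what `head -n 2` shows
print(lines[-2:])   # roughly what `tail -n 2` shows
print(len(lines))   # roughly what `wc -l` reports (header included)

# Parse the lines so that columns can be referred to by name.
rows = list(csv.DictReader(lines, delimiter="\t"))

# Several rows can share one order_id, so count distinct ids, not rows.
order_ids = {row["order_id"] for row in rows}
print(len(order_ids))
```

The point of the sketch is the distinction between lines and orders: one order can span several lines, so the two counts generally differ.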
Git and Markdown Resources:
- Pro Git is an excellent book for learning Git. Read the first two chapters to gain a deeper understanding of version control and basic commands.
- How to remove .DS_Store from GitHub
- GitHub for Beginners
- If you want to practice a lot of Git (and learn many more commands), Git Immersion looks promising.
- If you want to understand how to contribute on GitHub, you first have to understand forks and pull requests.
- GitRef is my favorite reference guide for Git commands, and Git quick reference for beginners is a shorter guide with commands grouped by workflow.
- Markdown Cheatsheet provides a thorough set of Markdown examples with concise explanations. GitHub's Mastering Markdown is a simpler and more attractive guide, but is less comprehensive.
- Introducing GitHub is a nice intro to GitHub that reads quickly
- Version Control with Git
- Cracking the Code to GitHub's Growth explains why GitHub is so popular among developers.
Command Line Resources:
- The Linux command line
- If you want to go much deeper into the command line, Data Science at the Command Line is a great book. The companion website provides installation instructions for a "data science toolbox" (a virtual machine with many more command line tools), as well as a long reference guide to popular command line tools.
- If you want to do more at the command line with CSV files, try out csvkit, which can be installed via `pip`.
Copy the link from your new forked repo.
Clone your new forked repo to your computer:
git clone git@github.com:YOUR_USERNAME/DAT-DC-11.git
cd (change directory) into the cloned repo, then add the class repo as a remote named upstream:
git remote add upstream https://github.com/ga-students/DAT-DC-11
Repeat this step often to keep your repo up to date with the class repo:
git fetch upstream
git merge upstream/master
Resources:
- Watch Syncing Your GitHub Fork to learn more about GitHub forks
- or read Simple guide to forks in GitHub and Git
- Pandas (code):
- Project question exercise
Homework:
- Read How Software in Half of NYC Cabs Generates $5.2 Million a Year in Extra Tips for an excellent example of exploratory data analysis.
- Read Anscombe's Quartet, and Why Summary Statistics Don't Tell the Whole Story for a classic example of why visualization is useful.
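As a small illustration of the Anscombe's Quartet reading, the sketch below compares summary statistics for the first two of the four datasets. The helper functions `mean` and `sample_variance` are written here for illustration only; in practice Pandas or NumPy would do the same job.

```python
# Datasets I and II of Anscombe's quartet share the same x values.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / float(len(values))


def sample_variance(values):
    """Sample variance (divides by n - 1)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


for name, y in [("I", y1), ("II", y2)]:
    print("dataset %s: mean=%.2f variance=%.2f"
          % (name, mean(y), sample_variance(y)))
```

The two datasets print essentially the same mean and variance even though their y values differ point by point, which is why plotting them is the only way to see that one is roughly linear and the other is curved.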
Resources:
- Browsing or searching the Pandas API Reference is an excellent way to locate a function even if you don't know its exact name.
- What I do when I get a new data set as told through tweets is a fun (yet enlightening) look at the process of exploratory data analysis.
- Part 2 of Exploratory Data Analysis with Pandas (code)
- Visualization with Pandas and Matplotlib (notebooks)
- Python homework with the Chipotle data ([Solution](homework solutions/03_python_homework_chipotle_explained.ipynb))
Homework:
- Complete the Pandas homework assignment with the IMDb data.
Pandas Resources:
- To learn more Pandas, read this three-part tutorial, or review these two excellent (but extremely long) notebooks on Pandas: introduction and data wrangling.
- If you want to go really deep into Pandas (and NumPy), read the book Python for Data Analysis, written by the creator of Pandas.
- This notebook demonstrates the different types of joins in Pandas, for when you need to figure out how to merge two DataFrames.
- This is a nice, short tutorial on pivot tables in Pandas.
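To make the join types concrete, here is a minimal sketch using `pd.merge` on two tiny DataFrames. The tables, column names, and values are hypothetical and exist only to show how the `how` argument changes which rows survive.

```python
import pandas as pd

# Hypothetical tables, made up for illustration.
movies = pd.DataFrame({"movie_id": [1, 2, 3], "title": ["A", "B", "C"]})
ratings = pd.DataFrame({"movie_id": [2, 3, 4], "stars": [8.1, 7.4, 6.9]})

# inner: only movie_ids present in BOTH tables (2 and 3).
inner = pd.merge(movies, ratings, on="movie_id", how="inner")

# left: every row of `movies`; missing ratings become NaN (movie 1).
left = pd.merge(movies, ratings, on="movie_id", how="left")

# outer: every movie_id from either table (1, 2, 3, and 4).
outer = pd.merge(movies, ratings, on="movie_id", how="outer")

print("inner=%d left=%d outer=%d" % (len(inner), len(left), len(outer)))
```

Checking the row count after a merge, as the last line does, is a cheap way to catch an unintended join type or duplicated keys.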
Visualization Resources:
- Harvard's Data Science course includes an excellent lecture on Visualization Goals, Data Types, and Statistical Graphs (83 minutes), for which the slides are also available.
- Watch Look at Your Data (18 minutes) for an excellent example of why visualization is useful for understanding your data.
- For more on Pandas plotting, read this notebook or the visualization page from the official Pandas documentation.
- To learn how to customize your plots further, browse through this notebook on matplotlib or this similar notebook.
- Read Overview of Python Visualization Tools for a useful comparison of Matplotlib, Pandas, Seaborn, ggplot, Bokeh, Pygal, and Plotly.
- To explore different types of visualizations and when to use them, Choosing a Good Chart and The Graphic Continuum are nice one-page references, and the interactive R Graph Catalog has handy filtering capabilities.
- This PowerPoint presentation from Columbia's Data Mining class contains lots of good advice for properly using different types of visualizations.
- Part 2 of Visualization with Pandas and Matplotlib (notebook)
- "Human learning" exercise:
- Iris dataset hosted by the UCI Machine Learning Repository
- Iris photo
- Iris exercise Notebook
- Iris answers Notebook available after class ends
- Introduction to machine learning (slides)
Homework:
- Optional: For an introduction to linear regression, watch The Easiest Introduction to Regression Analysis (14 minutes).
- If you're not using Anaconda, install requests and Beautiful Soup 4 using `pip`. (Both of these packages are included with Anaconda.)
Machine Learning Resources:
- For a very quick summary of the key points about machine learning, watch What is machine learning, and how does it work? (10 minutes) or read the associated notebook.
- For a more in-depth introduction to machine learning, read section 2.1 (14 pages) of Hastie and Tibshirani's excellent book, An Introduction to Statistical Learning. (It's a free PDF download!)
- The Learning Paradigms video (13 minutes) from Caltech's Learning From Data course provides a nice comparison of supervised versus unsupervised learning, as well as an introduction to "reinforcement learning".
- Real-World Active Learning is a readable and thorough introduction to "active learning", a variation of machine learning in which humans label only the most "important" observations.
- For a preview of some of the machine learning content we will cover during the course, read Sebastian Raschka's overview of the supervised learning process.
- Data Science, Machine Learning, and Statistics: What is in a Name? discusses the differences between these (and other) terms.
- The Emoji Translation Project is a really fun application of machine learning.
- Look up the characteristics of your zip code, and then read about the 67 distinct segments in detail.
IPython Notebook Resources:
- If you would like to learn the IPython Notebook, the official Notebook tutorials are useful.
- This Reddit discussion compares the relative strengths of the IPython Notebook and Spyder.
- Finish the Iris exercise (Iris answers Notebook)
- Discuss DataFrame iteration approaches (iteration time test)
- K-nearest neighbors and scikit-learn (notebook)
- Exercise with NBA player data (notebook, data, data dictionary)
Homework:
- Read introduction to reproducibility, read Jeff Leek's guide to creating a reproducible analysis, and watch this related Colbert Report video (8 minutes).
- Optional: If you're not using Anaconda, install Seaborn using pip. If you're using Anaconda, install Seaborn by running conda install seaborn at the command line. (Note that some students in past courses have had problems with Anaconda after installing Seaborn.)
- Work on your project!
KNN Resources:
- For a recap of the key points about KNN and scikit-learn, watch Getting started in scikit-learn with the famous iris dataset (15 minutes) and Training a machine learning model with scikit-learn (20 minutes).
- KNN supports distance metrics other than Euclidean distance, such as Mahalanobis distance, which takes the scale of the data into account.
- A Detailed Introduction to KNN is a bit dense, but provides a more thorough introduction to KNN and its applications.
- This lecture on Image Classification shows how KNN could be used for detecting similar images, and also touches on topics we will cover in future classes (hyperparameter tuning and cross-validation).
- Some applications for which KNN is well-suited are object recognition, satellite image enhancement, document categorization, and gene expression analysis.
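To show the idea behind KNN without any library, here is a minimal pure-Python sketch. It is not scikit-learn's implementation; the function names (`knn_predict`, `euclidean`, `manhattan`) and the toy points are assumptions made up for illustration. The `distance` parameter shows how a different metric can be swapped in.

```python
from collections import Counter
import math


def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(train_points, train_labels, query, k=3, distance=euclidean):
    """Predict the majority label among the k training points nearest to query."""
    order = sorted(range(len(train_points)),
                   key=lambda i: distance(train_points[i], query))
    nearest_labels = [train_labels[i] for i in order[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Made-up 2D points forming two well-separated groups.
points = [(1.0, 1.1), (1.2, 0.9), (0.8, 1.0),
          (5.0, 5.2), (5.1, 4.9), (4.8, 5.0)]
labels = ["small", "small", "small", "large", "large", "large"]

print(knn_predict(points, labels, (1.1, 1.0)))
print(knn_predict(points, labels, (4.9, 5.1), distance=manhattan))
```

Because the prediction depends entirely on distances, features on very different scales can dominate the result, which is one reason scale-aware metrics and feature scaling come up so often with KNN.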
Seaborn Resources:
- To get started with Seaborn for visualization, the official website has a series of detailed tutorials and an example gallery.
- Data visualization with Seaborn is a quick tour of some of the popular types of Seaborn plots.
- Visualizing Google Forms Data with Seaborn and How to Create NBA Shot Charts in Python are both good examples of Seaborn usage on real-world data.
- Machine learning exercise (article)
- Linear regression (notebook)
- Capital Bikeshare dataset used in a Kaggle competition
- Data dictionary
- Feature engineering example: Predicting User Engagement in Corporate Collaboration Network
- Exploring the bias-variance tradeoff notebook
Homework:
- Reading assignment on the bias-variance tradeoff
Linear Regression Resources:
- To go much more in-depth on linear regression, read Chapter 3 of An Introduction to Statistical Learning. Alternatively, watch the related videos or read my quick reference guide to the key points in that chapter.
- This introduction to linear regression is more detailed and mathematically thorough, and includes lots of good advice.
- This is a relatively quick post on the assumptions of linear regression.
- Setosa has an interactive visualization of linear regression.
- For a brief introduction to confidence intervals, hypothesis testing, p-values, and R-squared, as well as a comparison between scikit-learn code and Statsmodels code, read my DAT7 lesson on linear regression.
- Here is a useful explanation of confidence intervals from Quora.
- Hypothesis Testing: The Basics provides a nice overview of the topic, and John Rauser's talk on Statistics Without the Agonizing Pain (12 minutes) gives a great explanation of how the null hypothesis is rejected.
- In 2015, a major scientific journal banned the use of p-values:
- Scientific American has a nice summary of the ban.
- This response to the ban in Nature argues that "decisions that are made earlier in data analysis have a much greater impact on results".
- Andrew Gelman has a readable paper in which he argues that "it's easy to find a p < .05 comparison even if nothing is going on, if you look hard enough".
- Science Isn't Broken includes a neat tool that allows you to "p-hack" your way to "statistically significant" results.
- Accurately Measuring Model Prediction Error compares adjusted R-squared, AIC and BIC, train/test split, and cross-validation.
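For intuition about what a simple linear regression computes, here is a minimal sketch of ordinary least squares with one feature, plus R-squared. It is not the scikit-learn or Statsmodels code from class; the function names and the toy numbers are assumptions made up for illustration.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for a single feature."""
    n = float(len(x))
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


def r_squared(x, y, intercept, slope):
    """Share of the variance in y explained by the fitted line."""
    y_mean = sum(y) / float(len(y))
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Made-up data lying close to the line y = 2x + 1.
feature = [1, 2, 3, 4, 5]
response = [3.1, 4.9, 7.2, 8.8, 11.0]

intercept, slope = fit_line(feature, response)
r2 = r_squared(feature, response, intercept, slope)
print("intercept=%.2f slope=%.2f r_squared=%.3f" % (intercept, slope, r2))
```

Note that this R-squared is computed on the same data used for fitting, so it says nothing about out-of-sample error; that is the gap that train/test split and cross-validation are meant to close.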
Other Resources:
- Section 3.3.1 of An Introduction to Statistical Learning (4 pages) has a great explanation of dummy encoding for categorical features.
- Kaggle has some nice visualizations of the bikeshare data we used today.
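Dummy encoding is easier to see on a tiny example. The sketch below is a hand-rolled illustration, not the Pandas or scikit-learn API; the function name `dummy_encode` and the season values are made up. It drops the first category so that it serves as the baseline level.

```python
def dummy_encode(values, drop_first=True):
    """Return (column_names, rows) of 0/1 indicators for a categorical list."""
    categories = sorted(set(values))
    # Dropping one category avoids a redundant column: a row of all
    # zeros then means "the baseline category".
    kept = categories[1:] if drop_first else categories
    rows = [[1 if value == category else 0 for category in kept]
            for value in values]
    return kept, rows


# Made-up categorical feature.
seasons = ["spring", "summer", "fall", "summer"]
columns, encoded = dummy_encode(seasons)

print(columns)
for season, row in zip(seasons, encoded):
    print("%s -> %s" % (season, row))
```

With three categories only two columns are needed; "fall" is represented by zeros in both, and each regression coefficient is then read relative to that baseline.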
Aleks Ontman (Instructor)
Dr. Ontman joined Deloitte in 2012 and is currently a Sr. Data Scientist in Deloitte's Advanced Analytics Visualization team (VizStudio), specializing in machine learning, design thinking for prototyping new solutions, ideation workshops, and guided interactions with Big Data. The team's projects have involved big data solutions for 5+ Fortune 100 companies, 10+ Fortune 500 companies, and several Federal Agencies.
Alex Sherman (Instructor)
Alex is a passionate business analytics advocate. He currently works as a Technology Consultant at Deloitte Consulting, where he leads the design and implementation of informatics and analytics software development projects, repurposing semantic open source software to enhance data access for federal health care clients. In his free time, Alex is an avid jazz percussionist and the self-proclaimed best drum stick spinner in the DC metro area.
- Contact Info:
- Email: [email protected]
Al Johri (TA)
Al is interested in creatively applying software engineering and data science to solve real-world problems. He recently graduated with a degree in computer science from the McCormick School of Engineering at Northwestern University and currently works at the Washington Post as a Data Scientist in the Big Data & Data Warehouse Solutions department.
As your Course Producer, it's Tim's job to make sure that you (and your instructors) have everything you need for a successful experience in DAT9. If you've got a question, and you're not sure who to ask, start with Tim!
Before GA, Tim lived and worked in China as a facilitator and program-designer for youth leadership programs at international schools all over Asia (e.g. student-council retreats, backpacking trips, etc.). After a year abroad, he was ready to move back to the good ol' USA. Tim started out at the front desk as a member of GA's Front Lines team, moved up to "Campus Commander" (yes, a real title), and then in January started as a full-time Course Producer. In addition to Data Science, Tim also produces Front- and Back-End Web Development, Data Analytics, Mobile Development, and Digital Marketing. Tim has been trying to learn Esperanto since high school.
Contact Info
- Email: [email protected]
- Phone: 202-748-3694
- and Slack too!