CORE Skills Data Science Springboard - Day 4 - Getting to Know the Tools

The aim of today's session is to introduce methods for making sure that you're starting with quality data. All data science methods are garbage in, garbage out, so you need to be able to explore new datasets quickly to assess whether your approach is viable. We will work towards building a basic exploratory data analysis framework with a checklist of things you should be looking out for.

You should aim to get familiar with the pandas interface for manipulating (munging) tabular data, learn how to create and interpret basic summary statistics, learn how to identify appropriate QA/QC checks, and gain a basic understanding of 'tidy data' and data formats.
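As a taste of what summary statistics and basic QA/QC checks can look like in pandas, here is a minimal sketch. The dataset, the column names and the `quick_checks` helper are all made up for illustration; they are not part of the course material, and a real checklist would have more items on it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset, invented purely for illustration.
df = pd.DataFrame({
    "site": ["A", "A", "B", "B", "B"],
    "depth_m": [10.0, 12.5, np.nan, 40.0, 40.0],
    "grade": [1.2, 0.8, 2.1, -5.0, -5.0],
})


def quick_checks(frame):
    """Return a few basic QA/QC indicators for a DataFrame (hypothetical helper)."""
    return {
        "n_rows": len(frame),
        # Number of missing values in each column
        "missing_per_column": frame.isna().sum().to_dict(),
        # Rows that exactly repeat an earlier row
        "n_duplicate_rows": int(frame.duplicated().sum()),
        # Column types, useful for spotting numbers stored as text
        "dtypes": frame.dtypes.astype(str).to_dict(),
    }


# count, mean, std, min, quartiles and max for the numeric columns
summary = df.describe()
checks = quick_checks(df)

print(summary)
print(checks)
```

Things to look for in the output: a `count` lower than the number of rows (missing values), a `min` or `max` that isn't physically sensible (here a negative `grade`), and duplicated rows.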

Pre-session Reading & Resources

This week we're going to be looking at exploratory data analysis with new datasets. You'll find that this process can take around 60-90% of the time on any data science project, so it's worth (a) getting good at it and (b) looking at ways to make it easier. One approach is to put data in a tidy form as soon as possible.

Hadley Wickham (of R fame) wrote the original paper on this and you can read it here: https://www.jstatsoft.org/article/view/v059i10/v59i10.pdf
There's a nice example of following this approach for a dataset in Python here: http://www.jeannicholashould.com/tidy-data-in-python.html
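To make the idea of 'tidy form' concrete, here is a small sketch using `DataFrame.melt` in pandas. The table and its column names are invented for illustration and are not taken from either of the resources above. The assumption is that each (site, year) pair is one observation, so it should end up on its own row.

```python
import pandas as pd

# Hypothetical "wide" table: one column per year, values are sample counts.
wide = pd.DataFrame({
    "site": ["A", "B"],
    "2016": [3, 5],
    "2017": [4, 6],
})

# melt moves the year columns into rows, giving one observation per row
tidy = wide.melt(id_vars="site", var_name="year", value_name="n_samples")

# The years came from column headers, so they are text; convert them to integers
tidy["year"] = tidy["year"].astype(int)

tidy = tidy.sort_values(["site", "year"]).reset_index(drop=True)
print(tidy)
```

Once the data is in this shape, grouping, filtering and plotting by `year` or `site` all work on ordinary columns rather than on column headers.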

We're also going to be using some of the more advanced methods that pandas offers - you should aim to get as familiar with pandas as you can, as it really is the Swiss Army knife of data munging in Python. If you've used R before, a number of things will feel very familiar. There are a number of good technical resources online for getting your head around pandas which you might like to stash away for reference:

Jake VanderPlas has a very good book on data science with Python which has lots of examples using pandas - it's available as a series of Jupyter notebooks on GitHub here: https://github.com/jakevdp/PythonDataScienceHandbook. You can buy a hard copy from O'Reilly Books here: http://shop.oreilly.com/product/0636920034919.do.
The pandas documentation is good if you want more info about a particular function (although less useful for how to piece everything together). It's available here: https://pandas.pydata.org
If you get stuck then someone else has probably already had your problem and got it fixed on Stack Overflow - you can see questions tagged with pandas here: https://stackoverflow.com/questions/tagged/pandas
