The aim of today's session is to introduce methods for making sure that you're starting with quality data. Because all data science methods are garbage in, garbage out, you need to be able to explore new datasets quickly to assess whether your approach is viable. We will work towards building a basic exploratory data analysis framework with a checklist of things you should be looking out for.
You should aim to get familiar with the pandas interface for manipulating (munging) tabular data, learn how to create and interpret basic summary statistics, learn how to identify appropriate QA/QC checks, and gain a basic understanding of 'tidy data' and data formats.
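As a rough illustration of the kind of first-pass checks these goals point towards, here is a minimal sketch using pandas. The toy dataset, its column names, and the `quick_report` helper are all hypothetical and chosen only for illustration; the valid range for `porosity` (0 to 1) is likewise an assumption about this made-up data, not a general rule.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset standing in for a freshly loaded file.
df = pd.DataFrame(
    {
        "site": ["A", "A", "B", "B", "B"],
        "depth_m": [10.0, 12.5, np.nan, 11.0, 11.0],
        "porosity": [0.21, 0.19, 0.25, 1.7, 1.7],  # 1.7 is outside the assumed 0-1 range
    }
)


def quick_report(frame):
    """Collect a few basic QA/QC indicators for a DataFrame (illustrative helper)."""
    return {
        "shape": frame.shape,
        "dtypes": frame.dtypes.astype(str).to_dict(),
        "missing_per_column": frame.isna().sum().to_dict(),
        "duplicate_rows": int(frame.duplicated().sum()),
    }


report = quick_report(df)

# Basic summary statistics (count, mean, std, quartiles) for the numeric columns.
summary = df.describe()

# A simple range check: flag rows whose values fall outside the assumed valid range.
out_of_range = df[(df["porosity"] < 0) | (df["porosity"] > 1)]

print(report)
print(summary)
print(out_of_range)
```

Missing values, duplicated rows, and out-of-range values are only a starting point; which checks are appropriate will depend on the dataset in front of you.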
This week we're going to be looking at exploratory data analysis with new datasets. You'll find that this process takes around 60-90% of any data science project, so it's worth (a) getting good at it and (b) looking at ways to make it easier. One approach is to put data in a tidy form as soon as possible.
- Hadley Wickham (of R fame) wrote the original paper on this and you can read it here: https://www.jstatsoft.org/article/view/v059i10/v59i10.pdf
- There's a nice example of following this approach for a dataset in Python here: http://www.jeannicholashould.com/tidy-data-in-python.html
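To make the idea of tidy form concrete (each variable in its own column, each observation in its own row), here is a small sketch that reshapes a 'wide' table into a tidy one with `DataFrame.melt`. The table, its column names, and the values are invented for illustration and do not come from either of the resources above.

```python
import pandas as pd

# Hypothetical "wide" table: the year variable is spread across column headers.
wide = pd.DataFrame(
    {
        "country": ["UK", "France"],
        "2019": [10, 20],
        "2020": [15, 25],
    }
)

# Melt the year columns into rows so that each row is one (country, year) observation.
tidy = wide.melt(id_vars="country", var_name="year", value_name="cases")

# The years came from column headers, so they are strings until converted.
tidy["year"] = tidy["year"].astype(int)

print(tidy)
```

Once the data is in this shape, grouping, filtering, and plotting by `year` or `country` tend to need far less special-case code.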
We're also going to be using some of the more advanced methods that pandas offers. You should aim to get as familiar with these as you can, as pandas really is the Swiss Army knife of data munging in Python. If you've used R before, a number of things will feel very familiar. There are a number of good technical resources online for getting your head around pandas which you might like to stash away for reference:
- Jake VanderPlas has a very good book on data science with Python which has lots of examples using pandas. It's available as a series of Jupyter notebooks on GitHub here: https://github.com/jakevdp/PythonDataScienceHandbook. You can buy a hard copy from O'Reilly Books here: http://shop.oreilly.com/product/0636920034919.do.
- The pandas documentation is good if you want more info about a particular function (although less useful for how to piece everything together). It's available here: https://pandas.pydata.org
- If you get stuck then someone else has probably already had your problem and got it fixed on Stack Overflow. You can see questions tagged with pandas here: https://stackoverflow.com/questions/tagged/pandas