Collecting, making sense of, and classifying data has never been more important. Using Python, students will learn how to:
- Collect dispersed but publicly available data using tools such as Selenium for web scraping, and application programming interfaces (APIs).
- Analyse data using dataframes and arrays, with special attention to how to optimise code for big-data problems using just-in-time compilers such as Numba.
- Make sense of, and perform quality control on, high-dimensional data sets, through dimensionality-reduction methods such as Principal Component Analysis (PCA) and multi-dimensional scaling (MDS).
- Exploit recent advances in Artificial Intelligence (AI) software, such as Keras, to classify data using neural networks.
By the end of the module, students will be able to:
- Refer to, and adapt, a code base of taught examples covering data collection, analysis, dimensionality reduction and classification.
- Build their own code base around a personally selected problem touching on the areas taught.
- Demonstrate a good understanding of the ethics of, and common problems encountered in, data collection, processing and classification.
- Experience in Python will be a bonus. For those new to Python, I recommend working through the Learn Python “Learn the Basics” and “Data Science Tutorials” chapters before the module: https://www.learnpython.org/
- Optional reading for a fuller background to Data Science in Python: https://jakevdp.github.io/PythonDataScienceHandbook/
- Day 1: Python, version control and collecting publicly available data.
- Day 2: Data science in Python and machine learning
- Day 3: Large, multidimensional datasets
- Day 4: Classification of data using neural networks
- Anaconda3 (for Python 3 notebooks). We will mainly be using Python notebooks for this module, with specific packages to install for each day, listed on that day's 'To do before the start' page.
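Once Anaconda3 is installed, extra packages can be added from the command line. The package names below are illustrative examples only, not the official per-day lists, which are given on each day's 'To do before the start' page.

```shell
# Example only: install packages of the kind used in the module
conda install numpy pandas
conda install -c conda-forge numba keras

# Launch the notebook interface
jupyter notebook
```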