Welcome to the course! This week we talk about -
- Basic Python
- Data Visualisation using Matplotlib
- Data Distribution
- NumPy
To help you understand each topic, we give a brief description followed by links ranging from beginner to advanced material. It is okay if you don't understand everything on the first pass. These topics lay the foundations of Machine Learning, so take your time with them. You do not need to work through everything rigorously right now; you can do that later in the course as the need arises. Do read through the material, though, and get an overall picture.
Before beginning the course, you should have at least a little experience of coding in Python. This should include the different data types and data structures in Python, the basic syntax for the different types of loops, defining functions, reading from and writing to files, etc. For basic Python tutorials, refer to the links below -
For the purpose of this course you need to watch till Tutorial #15 (for the first link) or till Tutorial #24 (for the second link), but we strongly recommend that you go through all the tutorials, as it will help you figure things out much faster later on when you try out new stuff.
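As a rough self-check of the level expected, the short sketch below touches each item listed above: a list and a dictionary, a loop, a function, and writing to and reading from a file. The function name `word_lengths` and the file name `word_lengths.txt` are made up for illustration and do not come from the linked tutorials.

```python
import os
import tempfile


def word_lengths(words):
    """Return a dictionary mapping each word to its length."""
    lengths = {}
    for word in words:  # a basic for loop over a list
        lengths[word] = len(word)
    return lengths


words = ["numpy", "matplotlib", "seaborn"]  # a list of strings
lengths = word_lengths(words)               # a dict of str -> int

# Write one "word,length" line per entry to a file in the temp directory.
path = os.path.join(tempfile.gettempdir(), "word_lengths.txt")
with open(path, "w") as f:
    for word, n in lengths.items():
        f.write(f"{word},{n}\n")

# Read the file back, stripping the trailing newline from each line.
with open(path) as f:
    lines = [line.strip() for line in f]

print(lines)  # ['numpy,5', 'matplotlib,10', 'seaborn,7']
```

If every line here makes sense to you, you probably have enough Python to start.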
Once you are comfortable with elementary Python, you should get a basic idea of what Jupyter Notebooks are and how to run code in them. For an introduction to Jupyter Notebooks, refer here
That's it! You are now ready to get started.
Before creating analytical models, a data scientist must develop an understanding of the properties and relationships in a dataset. There are two goals for data exploration and visualization -
- To understand the relationships between the data columns.
- To identify features that may be useful for predicting labels in machine learning projects. Additionally, redundant, collinear features can be identified.
Thus, visualization for data exploration is an essential data science skill. Here, we’ll learn to analyze data using the various types of plots offered by the Matplotlib and seaborn libraries.
Don't worry if you don't understand everything now. It will become clearer once you start implementing these plots in the assignments.
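To make the two goals concrete, here is a minimal sketch that uses only Matplotlib (seaborn is left out to keep it short). The data is synthetic and generated purely for illustration, and the helper name `explore` is hypothetical. The histogram shows how one column is distributed, and the scatter plot shows how two columns relate to each other.

```python
import numpy as np
import matplotlib.pyplot as plt


def explore(x, y):
    """Draw a histogram of x next to a scatter plot of x against y."""
    fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(10, 4))

    ax_hist.hist(x, bins=20)
    ax_hist.set_title("Distribution of x")
    ax_hist.set_xlabel("x")
    ax_hist.set_ylabel("count")

    ax_scatter.scatter(x, y, s=10)
    ax_scatter.set_title("x against y")
    ax_scatter.set_xlabel("x")
    ax_scatter.set_ylabel("y")

    fig.tight_layout()
    return fig


# Synthetic data (an assumption for this sketch): y depends on x plus noise.
rng = np.random.default_rng(seed=0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.5, size=200)

fig = explore(x, y)

# A correlation close to 1 or -1 hints that two columns are collinear.
corr = np.corrcoef(x, y)[0, 1]
print(f"correlation between x and y: {corr:.2f}")
```

In a Jupyter Notebook the figure is normally displayed when the cell finishes; in a plain script you would call `plt.show()` at the end.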
The data that we have for our model can come from a variety of distributions. Having an understanding of the data distribution helps in making an informed decision about the model that we can use. Let us briefly talk about some data distributions -
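- Bernoulli Distribution - It has only two possible outcomes, for example success (with probability p) and failure (with probability 1 - p). A single coin toss is an example.
- Uniform Distribution - The probability of occurrence of all outcomes is the same. A roll of a fair die is an example.
- Normal Distribution - The probability density forms a symmetric, bell-shaped curve centred on the mean, with its spread set by the standard deviation.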
Go through this article to go deeper into the various data distributions that are common in Machine Learning.
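One way to get a feel for the three distributions described above is to draw samples from each and look at simple summary statistics. The sketch below uses NumPy's random generator; the parameter choices (p = 0.3, a six-sided die, mean 0 and standard deviation 1) are arbitrary values picked for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
n = 10_000  # number of samples drawn from each distribution

# Bernoulli: each sample is 1 with probability 0.3, otherwise 0.
# A binomial distribution with a single trial (n=1) is a Bernoulli.
bernoulli = rng.binomial(n=1, p=0.3, size=n)

# Discrete uniform: a fair die, every face from 1 to 6 equally likely.
# The upper bound of rng.integers is exclusive, hence 7.
uniform = rng.integers(1, 7, size=n)

# Normal: bell-shaped curve with mean 0 and standard deviation 1.
normal = rng.normal(loc=0.0, scale=1.0, size=n)

print("Bernoulli: fraction of ones =", bernoulli.mean())
print("Uniform: fraction per face  =", np.bincount(uniform)[1:] / n)
print("Normal: mean, std           =", normal.mean(), normal.std())
```

The fraction of ones should land near 0.3, each die face near 1/6, and the normal samples near mean 0 and standard deviation 1. Plotting a histogram of each array makes the different shapes visible.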
NumPy (Numerical Python) is a Python library for numerical computing, including linear algebra. It is a very important library: almost every data science or machine learning package in Python, such as SciPy (Scientific Python), Matplotlib (plotting library), and Scikit-learn, depends on it to a great extent.
NumPy is very useful for performing mathematical and logical operations on arrays. It provides an abundance of useful features for operations on n-dimensional arrays and matrices in Python.
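A few of these operations on small 2 x 2 arrays, as a minimal sketch (the values are arbitrary):

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[10, 20], [30, 40]])

# Mathematical operations are applied element by element.
print(a + b)   # [[11 22] [33 44]]
print(a * 2)   # [[2 4] [6 8]]

# The @ operator is the matrix product, not element-wise multiplication.
print(a @ b)   # [[ 70 100] [150 220]]

# Logical operations return boolean arrays, which can be used as masks.
mask = a > 2
print(mask)     # [[False False] [ True  True]]
print(a[mask])  # [3 4]

# Shape and transpose are available as attributes.
print(a.shape)  # (2, 2)
print(a.T)      # [[1 3] [2 4]]
```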
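One of the main advantages of NumPy is that vectorisation using NumPy arrays makes code very time efficient. A vectorised operation is applied to a whole array at once in optimised, compiled code rather than in an element-by-element Python loop, which is what makes it so fast and hence extremely useful.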
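The difference is easy to measure. The sketch below computes the same dot product twice, once with a plain Python loop and once with `np.dot`, and times both. The array size and the helper name `dot_loop` are arbitrary choices for illustration, and the exact timings will vary from machine to machine.

```python
import time

import numpy as np


def dot_loop(x, y):
    """Dot product computed with an explicit Python loop."""
    total = 0.0
    for xi, yi in zip(x, y):
        total += xi * yi
    return total


rng = np.random.default_rng(seed=0)
x = rng.random(200_000)
y = rng.random(200_000)

start = time.perf_counter()
loop_result = dot_loop(x, y)
loop_time = time.perf_counter() - start

start = time.perf_counter()
vec_result = np.dot(x, y)  # vectorised: a single call on the whole arrays
vec_time = time.perf_counter() - start

print(f"loop:       {loop_result:.4f} in {loop_time:.4f} s")
print(f"vectorised: {vec_result:.4f} in {vec_time:.4f} s")
```

Both versions give the same answer up to floating-point rounding, but the vectorised call is typically much faster.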