.. raw:: html
<h1 align="center">
Introduction to Matplotlib and Line Plots
.. raw:: html
</h1>
Introduction
------------
The aim of these labs is to introduce you to data visualization with
Python as concretely and as consistently as possible. Speaking of
consistency, because there was no single *best* data visualization library
available for Python at the time these labs were created, we have to introduce
different libraries and show their benefits when we are discussing new
visualization concepts. In doing so, we hope to make students well-rounded
with visualization libraries and concepts so that they are able to judge
and decide on the best visualization technique and tool for a given
problem *and* audience.
Please make sure that you have completed the prerequisites for this
course, namely \ **Python for Data Science**\ and \ **Data Analysis
with Python**\ , which are part of this specialization.
**Note**: The majority of the plots and visualizations will be generated
using data stored in *pandas* dataframes. Therefore, in this lab, we
provide a brief crash course on *pandas*. However, if you are interested
in learning more about the *pandas* library, a detailed description and
explanation of how to use it and how to clean, munge, and process data
stored in a *pandas* dataframe is provided in our course \ **Data
Analysis with Python**\ , which is also part of this specialization.
--------------
Table of Contents
-----------------
.. raw:: html
<div class="alert alert-block alert-info" style="margin-top: 20px">
1. `Exploring Datasets with *pandas* <#0>`__

   - 1.1 `The Dataset: Immigration to Canada from 1980 to 2013 <#2>`__
   - 1.2 `*pandas* Basics <#4>`__
   - 1.3 `*pandas* Intermediate: Indexing and Selection <#6>`__

2. `Visualizing Data using Matplotlib <#8>`__

   - 2.1 `Matplotlib: Standard Python Visualization Library <#10>`__

3. `Line Plots <#12>`__
.. raw:: html
</div>
.. raw:: html
<hr>
Exploring Datasets with *pandas*
=================================
*pandas* is an essential data analysis toolkit for Python. From their
`website <http://pandas.pydata.org/>`__: >\ *pandas* is a Python package
providing fast, flexible, and expressive data structures designed to
make working with “relational” or “labeled” data both easy and
intuitive. It aims to be the fundamental high-level building block for
doing practical, **real world** data analysis in Python.
This course relies heavily on *pandas* for data wrangling, analysis, and
visualization. We encourage you to spend some time familiarizing
yourself with the *pandas* API Reference:
http://pandas.pydata.org/pandas-docs/stable/api.html.
The Dataset: Immigration to Canada from 1980 to 2013
-----------------------------------------------------
Dataset Source: `International migration flows to and from selected
countries - The 2015
revision <http://www.un.org/en/development/desa/population/migration/data/empirical2/migrationflows.shtml>`__.
The dataset contains annual data on the flows of international
immigrants as recorded by the countries of destination. The data
presents both inflows and outflows according to the place of birth,
citizenship or place of previous / next residence both for foreigners
and nationals. The current version presents data pertaining to 45
countries.
In this lab, we will focus on the Canadian immigration data.
For the sake of simplicity, Canada's immigration data has been extracted and
uploaded to one of IBM's servers. You can fetch the data from
`here <https://ibm.box.com/shared/static/lw190pt9zpy5bd1ptyg2aw15awomz9pu.xlsx>`__.
--------------
*pandas* Basics
---------------
The first thing we'll do is import two key data analysis modules:
*pandas* and **NumPy**.
.. code:: ipython3
import numpy as np # useful for scientific computing in Python
import pandas as pd # primary data structure library
Let's download and import our primary Canadian Immigration dataset using
the *pandas* ``read_excel()`` function. Normally, before we can do that, we
would need to install a module which *pandas* requires to read in Excel
files. This module is **xlrd**. For your convenience, we have
pre-installed this module, so you do not have to worry about that.
Otherwise, you would need to install it yourself. Note that with xlrd 2.0
or later, which reads only ``.xls`` files, *pandas* needs the **openpyxl**
module instead to read an ``.xlsx`` file such as this one. The following
line of code installs the **xlrd** module:
::
!conda install -c anaconda xlrd --yes
Now we are ready to read in our data.
.. code:: ipython3
df_can = pd.read_excel('https://ibm.box.com/shared/static/lw190pt9zpy5bd1ptyg2aw15awomz9pu.xlsx',
sheet_name='Canada by Citizenship',
skiprows=range(20),
skipfooter=2)
print ('Data read into a pandas dataframe!')
Let's view the top 5 rows of the dataset using the ``head()`` function.
.. code:: ipython3
df_can.head()
# tip: You can specify the number of rows you'd like to see as follows: df_can.head(10)
We can also view the bottom 5 rows of the dataset using the ``tail()``
function.
.. code:: ipython3
df_can.tail()
When analyzing a dataset, it's always a good idea to start by getting
basic information about your dataframe. We can do this by using the
``info()`` method.
.. code:: ipython3
df_can.info()
To get the list of column headers we can call upon the dataframe's
``.columns`` attribute.
.. code:: ipython3
df_can.columns.values
Similarly, to get the list of indices we use the ``.index`` attribute.
.. code:: ipython3
df_can.index.values
Note: The default type of index and columns is NOT list.
.. code:: ipython3
print(type(df_can.columns))
print(type(df_can.index))
To get the index and columns as lists, we can use the ``tolist()``
method.
.. code:: ipython3
df_can.columns.tolist()
df_can.index.tolist()
print (type(df_can.columns.tolist()))
print (type(df_can.index.tolist()))
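
To make the difference between the index objects and plain lists concrete, here is a minimal sketch on a toy dataframe. The name ``df_toy`` and its values are made up for illustration and are not part of the immigration dataset.

```python
import pandas as pd

# toy dataframe; names and values are illustrative only
df_toy = pd.DataFrame({'OdName': ['A', 'B', 'C'],
                       1980: [10, 20, 30],
                       1981: [11, 21, 31]})

# columns and index are pandas Index objects, not lists
print(type(df_toy.columns))
print(type(df_toy.index))

# tolist() converts them into ordinary Python lists
cols = df_toy.columns.tolist()
rows = df_toy.index.tolist()
print(cols)
print(rows)
```

Note that the column labels keep their original types in the list: ``'OdName'`` is a string, while the years are integers.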
To view the dimensions of the dataframe, we use the ``.shape``
attribute.
.. code:: ipython3
# size of dataframe (rows, columns)
df_can.shape
Note: The main types stored in *pandas* objects are *float*, *int*,
*bool*, *datetime64[ns]* and *datetime64[ns, tz] (in >= 0.17.0)*,
*timedelta[ns]*, *category (in >= 0.15.0)*, and *object* (string). In
addition these dtypes have item sizes, e.g. int64 and int32.
Let's clean the data set to remove a few unnecessary columns. We can use
the *pandas* ``drop()`` method as follows:
.. code:: ipython3
# in pandas axis=0 represents rows (default) and axis=1 represents columns.
df_can.drop(['AREA','REG','DEV','Type','Coverage'], axis=1, inplace=True)
df_can.head(2)
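
If you would like to try ``drop()`` without modifying ``df_can``, the sketch below uses a small toy dataframe (``df_toy`` and its values are assumptions made for illustration). It also shows that, without ``inplace=True``, ``drop()`` returns a new dataframe and leaves the original untouched.

```python
import pandas as pd

# toy dataframe; names and values are illustrative only
df_toy = pd.DataFrame({'OdName': ['A', 'B'],
                       'AREA': [1, 2],
                       'REG': [3, 4],
                       1980: [10, 20]})

# axis=1 drops columns; without inplace=True a new dataframe is returned
dropped = df_toy.drop(['AREA', 'REG'], axis=1)

# equivalent call using the columns keyword instead of axis=1
dropped_kw = df_toy.drop(columns=['AREA', 'REG'])

print(dropped.columns.tolist())
print(df_toy.shape, dropped.shape)  # the original still has all 4 columns
```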
Let's rename the columns so that they make sense. We can use the
``rename()`` method by passing in a dictionary of old and new names as
follows:
.. code:: ipython3
df_can.rename(columns={'OdName':'Country', 'AreaName':'Continent', 'RegName':'Region'}, inplace=True)
df_can.columns
We will also add a 'Total' column that sums up the total immigrants by
country over the entire period 1980 - 2013, as follows:
.. code:: ipython3
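# numeric_only=True restricts the row sums to the numeric (year) columns and skips the text columns
df_can['Total'] = df_can.sum(axis=1, numeric_only=True)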
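
As a rough illustration of how a row-wise total behaves when a dataframe mixes text and numeric columns, consider the toy example below. The dataframe ``df_toy`` and its values are hypothetical; ``numeric_only=True`` is used so that the text columns are left out of the sum.

```python
import pandas as pd

# toy dataframe mixing text columns and year columns; values are illustrative only
df_toy = pd.DataFrame({'Country': ['A', 'B'],
                       'Continent': ['Asia', 'Europe'],
                       1980: [10, 20],
                       1981: [5, 7]})

# axis=1 sums across each row; numeric_only=True skips the text columns
df_toy['Total'] = df_toy.sum(axis=1, numeric_only=True)
print(df_toy)
```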
We can check to see how many null objects we have in the dataset as
follows:
.. code:: ipython3
df_can.isnull().sum()
Finally, let's view a quick summary of each column in our dataframe
using the ``describe()`` method.
.. code:: ipython3
df_can.describe()
--------------
*pandas* Intermediate: Indexing and Selection (slicing)
-------------------------------------------------------
Select Column
~~~~~~~~~~~~~
**There are two ways to filter on a column name:**
Method 1: Quick and easy, but only works if the column name does NOT
have spaces or special characters.
.. code:: python
df.column_name
(returns series)
Method 2: More robust, and can filter on multiple columns.
.. code:: python
df['column']
(returns series)
.. code:: python
df[['column 1', 'column 2']]
(returns dataframe)
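
The two methods can be compared side by side on a toy dataframe. The names ``df_toy`` and ``'Region Name'`` are made up for this sketch; the latter is chosen to show a column name that cannot be reached with the dot notation.

```python
import pandas as pd

# toy dataframe; names and values are illustrative only
df_toy = pd.DataFrame({'Country': ['A', 'B'],
                       'Region Name': ['X', 'Y'],
                       1980: [10, 20]})

s1 = df_toy.Country              # Method 1: works for simple names only
s2 = df_toy['Region Name']       # Method 2: needed when the name has a space
sub = df_toy[['Country', 1980]]  # a list of names returns a dataframe

print(type(s1).__name__, type(s2).__name__, type(sub).__name__)
```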
--------------
Example: Let's try filtering on the list of countries ('Country').
.. code:: ipython3
df_can.Country # returns a series
Let's try filtering on the list of countries ('Country') and the data for
years: 1980 - 1985.
.. code:: ipython3
df_can[['Country', 1980, 1981, 1982, 1983, 1984, 1985]] # returns a dataframe
# notice that 'Country' is string, and the years are integers.
# for the sake of consistency, we will convert all column names to string later on.
Select Row
~~~~~~~~~~
There are two main ways to select rows:
.. code:: python
df.loc[label]
#filters by the labels of the index/column
df.iloc[index]
#filters by the positions of the index/column
Before we proceed, notice that the default index of the dataset is a
numeric range from 0 to 194. This makes it very difficult to query
by a specific country. For example, to search for data on Japan, we need
to know the corresponding index value.
This can be fixed very easily by setting the 'Country' column as the
index using the ``set_index()`` method.
.. code:: ipython3
df_can.set_index('Country', inplace=True)
# tip: The opposite of set is reset. So to reset the index, we can use df_can.reset_index()
.. code:: ipython3
df_can.head(3)
.. code:: ipython3
# optional: to remove the name of the index
df_can.index.name = None
Example: Let's view the number of immigrants from Japan (row 87) for the
following scenarios:

1. The full row data (all columns)
2. For year 2013
3. For years 1980 to 1985

.. code:: ipython3
# 1. the full row data (all columns)
print(df_can.loc['Japan'])
# alternate methods
print(df_can.iloc[87])
print(df_can[df_can.index == 'Japan'].T.squeeze())
.. code:: ipython3
# 2. for year 2013
print(df_can.loc['Japan', 2013])
# alternate method
print(df_can.iloc[87, 36]) # year 2013 has a positional index of 36; the last column (position 37) is 'Total'
.. code:: ipython3
# 3. for years 1980 to 1985
print(df_can.loc['Japan', [1980, 1981, 1982, 1983, 1984, 1985]])
print(df_can.iloc[87, [3, 4, 5, 6, 7, 8]])
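
The relationship between label-based and position-based selection is easier to see on a small example. The sketch below uses a toy dataframe (``df_toy`` and its values are assumptions for illustration) and ``Index.get_loc()`` to look up a row position rather than hard-coding it.

```python
import pandas as pd

# toy dataframe indexed by country; values are illustrative only
df_toy = pd.DataFrame({'Continent': ['Asia', 'Asia', 'Europe'],
                       1980: [100, 200, 300],
                       1981: [110, 210, 310]},
                      index=['India', 'Japan', 'France'])

by_label = df_toy.loc['Japan', 1981]   # row label, column label
by_position = df_toy.iloc[1, 2]        # row position 1, column position 2

# look up the position of a label instead of hard-coding it
pos = df_toy.index.get_loc('Japan')

# several columns at once, selected by label
years_slice = df_toy.loc['Japan', [1980, 1981]]
print(by_label, by_position, pos)
print(years_slice)
```

Looking up the position with ``get_loc()`` keeps the code correct even if rows are later added, removed, or reordered.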
Column names that are integers (such as the years) might introduce some
confusion. For example, when we are referencing the year 2013, one might
confuse it with the 2013th positional index.
To avoid this ambiguity, let's convert the column names into strings:
'1980' to '2013'.
.. code:: ipython3
df_can.columns = list(map(str, df_can.columns))
# [print (type(x)) for x in df_can.columns.values] #<-- uncomment to check type of column headers
Since we converted the years to string, let's declare a variable that
will allow us to easily call upon the full range of years:
.. code:: ipython3
# useful for plotting later on
years = list(map(str, range(1980, 2014)))
years
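
The same conversion can be tried on a toy dataframe first. In this sketch, ``df_toy`` and ``years_toy`` are hypothetical stand-ins for ``df_can`` and ``years``, covering only two years.

```python
import pandas as pd

# toy dataframe with integer year columns; values are illustrative only
df_toy = pd.DataFrame({'Continent': ['Asia'], 1980: [100], 1981: [110]},
                      index=['Japan'])

# convert every column label to a string
df_toy.columns = list(map(str, df_toy.columns))

# a list of year labels that now matches the string column names
years_toy = list(map(str, range(1980, 1982)))
print(years_toy)
print(df_toy.loc['Japan', years_toy])
```

Because ``range()`` excludes its end value, ``range(1980, 2014)`` in the lab covers the years 1980 through 2013.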
Filtering based on a criterion
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To filter the dataframe based on a condition, we simply pass the
condition as a boolean vector.
For example, let's filter the dataframe to show the data on Asian
countries (Continent = Asia).
.. code:: ipython3
# 1. create the condition boolean series
condition = df_can['Continent'] == 'Asia'
print (condition)
.. code:: ipython3
# 2. pass this condition into the dataFrame
df_can[condition]
.. code:: ipython3
# we can pass multiple criteria in the same line.
# let's filter for Continent = Asia and Region = Southern Asia
df_can[(df_can['Continent']=='Asia') & (df_can['Region']=='Southern Asia')]
# note: When using 'and' and 'or' operators, pandas requires we use '&' and '|' instead of 'and' and 'or'
# don't forget to enclose the two conditions in parentheses
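
For reference, here is a small sketch of how boolean conditions combine, using a toy dataframe (``df_toy`` and its values are made up for illustration). Besides ``&`` and ``|``, it also shows ``~`` for negation.

```python
import pandas as pd

# toy dataframe; values are illustrative only
df_toy = pd.DataFrame({'Continent': ['Asia', 'Asia', 'Europe'],
                       'Region': ['Southern Asia', 'Eastern Asia', 'Western Europe'],
                       '2013': [300, 200, 100]},
                      index=['India', 'Japan', 'France'])

is_asia = df_toy['Continent'] == 'Asia'            # boolean series
is_southern = df_toy['Region'] == 'Southern Asia'  # boolean series

both = df_toy[is_asia & is_southern]               # AND: both conditions hold
either = df_toy[is_asia | (df_toy['2013'] < 150)]  # OR: at least one holds
not_asia = df_toy[~is_asia]                        # NOT: condition is negated

print(both.index.tolist())
print(either.index.tolist())
print(not_asia.index.tolist())
```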
Before we proceed: let's review the changes we have made to our
dataframe.
.. code:: ipython3
print ('data dimensions:', df_can.shape)
print(df_can.columns)
df_can.head(2)
--------------
Visualizing Data using Matplotlib
=================================
Matplotlib: Standard Python Visualization Library
-------------------------------------------------
The primary plotting library we will explore in the course is
`Matplotlib <http://matplotlib.org/>`__. As mentioned on their website:
>Matplotlib is a Python 2D plotting library which produces publication
quality figures in a variety of hardcopy formats and interactive
environments across platforms. Matplotlib can be used in Python scripts,
the Python and IPython shell, the jupyter notebook, web application
servers, and four graphical user interface toolkits.
If you are aspiring to create impactful visualizations with Python,
Matplotlib is an essential tool to have at your disposal.
Matplotlib.Pyplot
~~~~~~~~~~~~~~~~~
One of the core aspects of Matplotlib is ``matplotlib.pyplot``. It is
Matplotlib's scripting layer which we studied in detail in the videos
about Matplotlib. Recall that it is a collection of command style
functions that make Matplotlib work like MATLAB. Each ``pyplot``
function makes some change to a figure: e.g., creates a figure, creates
a plotting area in a figure, plots some lines in a plotting area,
decorates the plot with labels, etc. In this lab, we will work with the
scripting layer to learn how to generate line plots. In future labs, we
will get to work with the Artist layer as well to experiment first hand
how it differs from the scripting layer.
Let's start by importing ``matplotlib`` and ``matplotlib.pyplot`` as
follows:
.. code:: ipython3
# we are using the inline backend
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
\*optional: check if Matplotlib is loaded.
.. code:: ipython3
print ('Matplotlib version: ', mpl.__version__) # >= 2.0.0
\*optional: apply a style to Matplotlib.
.. code:: ipython3
print(plt.style.available)
mpl.style.use(['ggplot']) # optional: for ggplot-like style
Plotting in *pandas*
~~~~~~~~~~~~~~~~~~~~
Fortunately, *pandas* has built-in plotting support based on Matplotlib that we
can use. Plotting in *pandas* is as simple as appending a ``.plot()``
method to a series or dataframe.
Documentation:

- `Plotting with Series <http://pandas.pydata.org/pandas-docs/stable/api.html#plotting>`__
- `Plotting with Dataframes <http://pandas.pydata.org/pandas-docs/stable/api.html#api-dataframe-plotting>`__

Line Plots (Series/Dataframe)
=============================
**What is a line plot and why use it?**
A line chart or line plot is a type of plot which displays information
as a series of data points called 'markers' connected by straight line
segments. It is a basic type of chart common in many fields. Use a line
plot when you have a continuous data set. Line plots are best suited for
trend-based visualizations of data over a period of time.
**Let's start with a case study:**
In 2010, Haiti suffered a catastrophic magnitude 7.0 earthquake. The
quake caused widespread devastation and loss of life, and about three
million people were affected by this natural disaster. As part of
Canada's humanitarian effort, the Government of Canada stepped up its
effort in accepting refugees from Haiti. We can quickly visualize this
effort using a ``Line`` plot:
**Question:** Plot a line graph of immigration from Haiti using
``df.plot()``.
First, we will extract the data series for Haiti.
.. code:: ipython3
haiti = df_can.loc['Haiti', years] # passing in years 1980 - 2013 to exclude the 'total' column
haiti.head()
Next, we will plot a line plot by appending ``.plot()`` to the ``haiti``
series.
.. code:: ipython3
haiti.plot()
*pandas* automatically populated the x-axis with the index values
(years), and the y-axis with the column values (population). However,
notice how the years were not displayed because they are of type
*string*. Therefore, let's change the type of the index values to
*integer* for plotting.
Also, let's add a title and label the x and y axes using ``plt.title()``,
``plt.ylabel()``, and ``plt.xlabel()`` as follows:
.. code:: ipython3
haiti.index = haiti.index.map(int) # let's change the index values of Haiti to type integer for plotting
haiti.plot(kind='line')
plt.title('Immigration from Haiti')
plt.ylabel('Number of immigrants')
plt.xlabel('Years')
plt.show() # need this line to show the updates made to the figure
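
To see the effect of converting the index to integers in isolation, the sketch below plots a toy series. The series ``toy`` and its values are invented, and the ``Agg`` backend is an assumption made so that the code also runs outside a notebook, without a display.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as plt
import pandas as pd

# toy series with string year labels; values are illustrative only
toy = pd.Series([10, 12, 9, 30], index=['1980', '1981', '1982', '1983'])
toy.index = toy.index.map(int)   # string labels -> integers

ax = toy.plot(kind='line')       # plot() returns the Matplotlib Axes it drew on
plt.title('Toy series')
plt.xlabel('Years')
plt.ylabel('Number of immigrants')

# the x values of the plotted line come straight from the integer index
xdata = list(ax.get_lines()[0].get_xdata())
print(xdata)
plt.close('all')
```

Since the x values are now real years, anything added to the plot afterwards can be positioned by year as well.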
We can clearly notice how the number of immigrants from Haiti spiked up from
2010 as Canada stepped up its efforts to accept refugees from Haiti.
Let's annotate this spike in the plot by using the ``plt.text()``
method.
.. code:: ipython3
haiti.plot(kind='line')
plt.title('Immigration from Haiti')
plt.ylabel('Number of Immigrants')
plt.xlabel('Years')
# annotate the 2010 Earthquake.
# syntax: plt.text(x, y, label)
plt.text(2000, 6000, '2010 Earthquake') # see note below
plt.show()
With just a few lines of code, you were able to quickly identify and
visualize the spike in immigration!
Quick note on x and y values in ``plt.text(x, y, label)``:
::
Since the x-axis (years) is type 'integer', we specified x as a year. The y axis (number of immigrants) is type 'integer', so we can just specify the value y = 6000.
.. code:: python
plt.text(2000, 6000, '2010 Earthquake') # years stored as type int
::
If the years were stored as type 'string', we would need to specify x as the index position of the year. E.g., index 20 is year 2000, since 2000 is 20 years after the base year of 1980.
.. code:: python
plt.text(20, 6000, '2010 Earthquake') # years stored as type str
::
We will cover advanced annotation methods in later modules.
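
The two cases in the note above can be checked with a short sketch. The series ``s`` and its values are made up, and the ``Agg`` backend is an assumption so that the code runs without a display. ``Index.get_loc()`` is used to find the position that corresponds to a year label when the index holds strings.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as plt
import pandas as pd

# toy series indexed by integer years; values are illustrative only
years_toy = list(range(1980, 1990))
s = pd.Series(range(10), index=years_toy)

ax = s.plot(kind='line')
# integer index: x is given in data coordinates, i.e. as a year
label = plt.text(1985, 5, 'midpoint')
print(label.get_position())

# string index: the matching x is the position of the label in the index
s_str = s.copy()
s_str.index = s_str.index.map(str)
x_pos = s_str.index.get_loc('1985')
print(x_pos)
plt.close('all')
```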
We can easily add more countries to the line plot to make meaningful
comparisons of immigration from different countries.
**Question:** Let's compare the number of immigrants from India and
China from 1980 to 2013.
Step 1: Get the data set for China and India, and display the dataframe.
.. code:: ipython3
### type your answer here
Double-click **here** for the solution.
Step 2: Plot the graph. We will explicitly specify a line plot by passing in
the ``kind`` parameter to ``plot()``.
.. code:: ipython3
### type your answer here
Double-click **here** for the solution.
That doesn't look right...
Recall that *pandas* plots the indices on the x-axis and the columns as
individual lines on the y-axis. Since ``df_CI`` is a dataframe with the
``country`` as the index and ``years`` as the columns, we must first
transpose the dataframe using the ``transpose()`` method to swap the rows and
columns.
.. code:: ipython3
df_CI = df_CI.transpose()
df_CI.head()
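
To see what the transpose does to the shape and labels, here is a sketch on a toy dataframe standing in for ``df_CI``. The name ``df_toy`` and all values are made up; only the layout (countries as rows, years as columns) mirrors the lab.

```python
import pandas as pd

# toy stand-in for df_CI: countries as rows, years as columns
years_toy = ['1980', '1981', '1982']
df_toy = pd.DataFrame([[1, 2, 3], [4, 5, 6]],
                      index=['China', 'India'], columns=years_toy)

# as is, plot() would draw one line per year column
print(df_toy.shape)

df_t = df_toy.transpose()          # same result as df_toy.T
df_t.index = df_t.index.map(int)   # years as integers for the x-axis

# after transposing, each country is a column, so each becomes one line
print(df_t.shape)
print(df_t.columns.tolist())
```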
*pandas* will automatically graph the two countries on the same graph. Go
ahead and plot the new transposed dataframe. Make sure to add a title to
the plot and label the axes.
.. code:: ipython3
### type your answer here
Double-click **here** for the solution.
.. raw:: html
<!--
plt.title('Immigrants from China and India')
plt.ylabel('Number of Immigrants')
plt.xlabel('Years')
-->
.. raw:: html
<!--
plt.show()
-->
From the above plot, we can observe that China and India have very
similar immigration trends through the years.
*Note*: How come we didn't need to transpose Haiti's dataframe before
plotting (like we did for df\_CI)?
That's because ``haiti`` is a series as opposed to a dataframe, and has
the years as its indices as shown below.
.. code:: python
print(type(haiti))
print(haiti.head(5))
::

    <class 'pandas.core.series.Series'>
    1980    1666
    1981    3692
    1982    3498
    1983    2860
    1984    1418
    Name: Haiti, dtype: int64

A line plot is a handy tool to display several dependent variables against
one independent variable. However, it is recommended to plot no more than
5-10 lines on a single graph; any more than that and it becomes
difficult to interpret.
**Question:** Compare the trends of the top 5 countries that contributed the
most to immigration to Canada.
.. code:: ipython3
### type your answer here
Double-click **here** for the solution.
.. raw:: html
<!--
# sort by 'Total' in descending order, then get the top 5 entries
df_top5 = df_can.sort_values(by='Total', ascending=False).head(5)
-->
.. raw:: html
<!--
# transpose the dataframe
df_top5 = df_top5[years].transpose()
-->
.. raw:: html
<!--
print(df_top5)
-->
.. raw:: html
<!--
# Step 2: Plot the dataframe. To make the plot more readable, we will change the size using the `figsize` parameter.
df_top5.index = df_top5.index.map(int) # let's change the index values of df_top5 to type integer for plotting
df_top5.plot(kind='line', figsize=(14, 8)) # pass a tuple (x, y) size
-->
.. raw:: html
<!--
plt.title('Immigration Trend of Top 5 Countries')
plt.ylabel('Number of Immigrants')
plt.xlabel('Years')
-->
.. raw:: html
<!--
plt.show()
-->
Other Plots
~~~~~~~~~~~
Congratulations! You have learned how to wrangle data with Python and
create a line plot with Matplotlib. There are many other plotting styles
available other than the default line plot, all of which can be accessed
by passing the ``kind`` keyword to ``plot()``. The full list of available
plots is as follows:
- ``bar`` for vertical bar plots
- ``barh`` for horizontal bar plots
- ``hist`` for histogram
- ``box`` for boxplot
- ``kde`` or ``density`` for density plots
- ``area`` for area plots
- ``pie`` for pie plots
- ``scatter`` for scatter plots
- ``hexbin`` for hexbin plot
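
As a quick illustration of switching plot types, the sketch below draws a bar plot from a toy series by changing only the ``kind`` argument. The series ``s`` and its values are made up, and the ``Agg`` backend is an assumption so that the code runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as plt
import pandas as pd

# toy series; values are illustrative only
s = pd.Series([3, 1, 2], index=['A', 'B', 'C'])

ax = s.plot(kind='bar')     # same call as for a line plot, different kind
n_bars = len(ax.patches)    # one rectangle is drawn per value
print(n_bars)
plt.close('all')
```

Each kind is also available as a method on the plot accessor, so ``s.plot.bar()`` should give the same result as ``s.plot(kind='bar')``.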
Thank you for completing this lab!
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This notebook was originally created by `Jay
Rajasekharan <https://www.linkedin.com/in/jayrajasekharan>`__ with
contributions from `Ehsan M.
Kermani <https://www.linkedin.com/in/ehsanmkermani>`__, and `Slobodan
Markovic <https://www.linkedin.com/in/slobodan-markovic>`__.
This notebook was recently revised by `Alex
Aklson <https://www.linkedin.com/in/aklson/>`__. I hope you found this
lab session interesting. Feel free to contact me if you have any
questions!
This notebook is part of a course on **Coursera** called *Data
Visualization with Python*. If you accessed this notebook outside the
course, you can take this course online by clicking
`here <http://cocl.us/DV0101EN_Coursera_Week1_LAB1>`__.
.. raw:: html
<hr>
Copyright © 2018 `Cognitive
Class <https://cognitiveclass.ai/?utm_source=bducopyrightlink&utm_medium=dswb&utm_campaign=bdu>`__.
This notebook and its source code are released under the terms of the
`MIT License <https://bigdatauniversity.com/mit-license/>`__.