CI_Workshop_Teplitskiy_Introduction.py
# -*- coding: utf-8 -*-
# <nbformat>3.0</nbformat>
# <markdowncell>
# #Statistical Learning with Python
#
# Agenda:
#
# **Intro and Plugs**
# - Sociology / computational social science
# - <a href=http://www.knowledgelab.org>www.knowledgelab.org</a>
# - <a href=http://www.dssg.io>www.dssg.io</a>
#
# **If you couldn't get your environment set up!**
# - Use: https://wakari.io/
#
# **IPython Notebook** as your IDE
# - Advantages/disadvantages
# - notebook: markdown, code, inline images
# - server
# - "--script"
# - Sharing is caring: http://nbviewer.ipython.org/
# - Keyboard shortcuts
# <markdowncell>
# <img src=http://i.minus.com/iEdBFdHPKBG8Q.gif>
# <markdowncell>
# **Pandas**
# <markdowncell>
# - Creating Series and DataFrames
# - Setting column names, index, datatypes
# - Indexing
# - By index, by label
# - Subsetting
# - Missing values
# <markdowncell>
# <img src=http://img3.wikia.nocookie.net/__cb20131231081108/degrassi/images/9/93/Panda-gif.gif>
# <markdowncell>
# **Matplotlib**
# - scatter, plot, hist
# - useful plot customization
# - plots inside of pandas
#
# **Regression Example**
#
# **Classification Example**
#
# * 2 examples, 2 research communities
# * Understanding vs. predicting
# <markdowncell>
# <img src=http://www.totalprosports.com/wp-content/uploads/2012/11/14-nolan-ryan-high-fives-george-w-bush-gif.gif>
# <codecell>
%matplotlib inline
# <markdowncell>
# #Pandas
# <codecell>
import pandas as pd
# <markdowncell>
# Provides a crucial 2-d data structure: the ``pandas.DataFrame``
# - pandas.Series is 1-d analogue
# - Like the ``R`` data frames
#
# ``numpy`` provides a 2-d structure too, BUT a ``pandas.DataFrame``
#
# 1. can hold *heterogeneous data*: each column can have its own data type,
# 2. has *labeled* axes: column names and row indices.
#
# Perfect for data-wrangling: can take subsets, apply functions, join with other DataFrames, etc.
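# <markdowncell>
# A minimal sketch of the two points above, using a tiny hand-made table rather than the car dataset. The name ``toy`` and its values are made up for illustration; only ``pd.DataFrame`` and ``pd.Series`` are real pandas objects.
# <codecell>
import pandas as pd

# Hypothetical toy data: each column has its own dtype
toy = pd.DataFrame(
    {'mpg': [18.0, 15.0, 26.0],
     'cylinders': [8, 8, 4],
     'origin': ['usa', 'usa', 'japan']},
    index=['chevelle', 'skylark', 'corona'])  # row labels

print(toy.dtypes)   # one dtype per column
print(toy.columns)  # column labels
print(toy.index)    # row labels

# A single column is a pandas.Series, the 1-d analogue; it keeps the row labels
col = toy['mpg']
print(type(col))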
# <codecell>
# Load car dataset
df = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Auto.csv')
df.head() # show the first five rows
# <codecell>
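# Single-argument print(...) behaves the same under Python 2 and Python 3
print('Shape of DataFrame: {}'.format(df.shape))
print('\nColumns: {}'.format(df.columns))
print('\nIndex: {}'.format(df.index[:10]))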
# <markdowncell>
# ###Get the ``df`` nice and cozy
# <codecell>
df.index = df.name  # use the car name as the row label
del df['name']      # drop the now-redundant column
df.head()
# <markdowncell>
# ###Accessing, adding data
# You can use either dot ``.`` or bracket ``[]`` notation to access existing columns of the dataset. To add a new column you have to use bracket ``[]`` notation.
# <codecell>
mpg = df.mpg # get mpg column using '.'
weight = df['weight'] # get weight column using brackets
df['mpg_per_weight'] = mpg / weight # note the element-wise division
print(df[['mpg', 'weight', 'mpg_per_weight']].head()) # get a bunch of columns at the same time
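# <markdowncell>
# A small sketch of why new columns need brackets, on a made-up two-row frame (``toy`` and the column names ``ratio`` / ``ratio_attr`` are hypothetical). Dot access reads an existing column, but assigning to a new dotted name only sets a Python attribute on the object; pandas may also emit a ``UserWarning`` about it, which is silenced here.
# <codecell>
import warnings
import pandas as pd

toy = pd.DataFrame({'mpg': [18.0, 26.0], 'weight': [3504, 2372]})

# Reading: both notations return the same column
same_column = toy.mpg.equals(toy['mpg'])

# Adding with brackets creates a real column
toy['ratio'] = toy['mpg'] / toy['weight']

# Adding with a dot does NOT create a column, only an attribute
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    toy.ratio_attr = toy['mpg'] / toy['weight']

print(list(toy.columns))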
# <markdowncell>
# ##Looking at data
# <markdowncell>
# ###Pandas indexing is really smart!
# <codecell>
# To look at all the Fords, build a list of booleans, one per row, that is True where 'ford' appears in the car name
arr_for_indexing = ['ford' in name for name in df.index]
df[arr_for_indexing].head()
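# <markdowncell>
# The list-of-booleans trick can be sketched on a made-up index (the car names below are illustrative only). The vectorized ``.str.contains`` call is an alternative, not something the cell above uses; both build one True/False per row.
# <codecell>
import pandas as pd

toy = pd.DataFrame({'mpg': [17.0, 24.0, 10.0]},
                   index=['ford torino', 'toyota corona', 'ford f250'])

# One boolean per row: True where 'ford' appears in the row label
mask_list = ['ford' in name for name in toy.index]

# Same mask, built with pandas' vectorized string methods
mask_vec = toy.index.str.contains('ford')

print(toy[mask_list])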
# <markdowncell>
# ###But it can get confused: Indexing by "label" and by "location", ``.loc`` vs ``.iloc`` vs ``.ix``
# <markdowncell>
# ``.loc`` -- by label
#
# ``.iloc`` -- by location
#
# ``.ix`` -- by a mix (deprecated in pandas 0.20 and removed in 1.0; prefer ``.loc`` or ``.iloc``)
# <codecell>
# .ix no longer exists in pandas >= 1.0, so select rows by position and columns by name
df.iloc[0:5][['weight', 'mpg']] # select the first 5 rows and two columns weight and mpg
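# <markdowncell>
# A hedged sketch of label-based versus position-based selection on a made-up frame (``toy`` and its labels are hypothetical). Note that a ``.loc`` slice includes its end label, while an ``.iloc`` slice excludes its end position, like ordinary Python slicing.
# <codecell>
import pandas as pd

toy = pd.DataFrame({'mpg': [18.0, 15.0, 26.0, 24.0],
                    'weight': [3504, 3693, 2372, 2430]},
                   index=['a', 'b', 'c', 'd'])

# By label: rows 'a', 'b' AND 'c' (end label included), columns by name
by_label = toy.loc['a':'c', ['weight', 'mpg']]

# By position: rows 0 and 1 only (end position excluded), then columns by name
by_position = toy.iloc[0:2][['weight', 'mpg']]

print(by_label)
print(by_position)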
# <codecell>
# useful function!: value_counts()
df.year.value_counts()
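# <markdowncell>
# For a feel of what ``value_counts()`` returns, here is a sketch on a made-up Series (``toy_years`` and its values are illustrative): the result is itself a Series, indexed by the distinct values and sorted by count, largest first.
# <codecell>
import pandas as pd

toy_years = pd.Series([70, 71, 70, 72, 70, 71])
counts = toy_years.value_counts()

print(counts)
print(counts.loc[70])  # how many times the value 70 occurs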
# <markdowncell>
# ###Let's change year from "70" to "1970"
# <codecell>
df.year.apply(lambda x: '19' + str(x)) # this spits out the Series we like
df.year = df.year.apply(lambda x: '19' + str(x))
# <codecell>
# Uh oh, let's change it back!
# Note: this only displays the two-digit strings; assign the result to df.year to keep it
df.year.str[-2:]
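# <markdowncell>
# The round trip between "70" and "1970" can be sketched on a made-up Series (``year_codes`` is hypothetical). ``astype(str)`` plus string concatenation is a vectorized alternative to ``apply``; neither call changes the data in place, so the result has to be assigned to keep it.
# <codecell>
import pandas as pd

year_codes = pd.Series([70, 71, 82])

long_apply = year_codes.apply(lambda x: '19' + str(x))  # element by element
long_vec = '19' + year_codes.astype(str)                # vectorized alternative

# Slicing with .str returns a new Series of strings; convert to get numbers back
short_again = long_vec.str[-2:].astype(int)

print(long_vec)
print(short_again)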
# <markdowncell>
# #Visualizing data
# <markdowncell>
# Most popular library: ``matplotlib``
#
# Others:
# - ``seaborn``
# - ``ggplot``
# - ``prettyplotlib``
# - ``bokeh``
# <markdowncell>
# ### common matplotlib plots
# - plt.hist <-- histograms
# - plt.scatter <-- scatter plot
# - plt.plot <-- most others
# <codecell>
import matplotlib.pyplot as plt
plt.hist(df.weight)
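# <markdowncell>
# The cell above only shows ``plt.hist``. As a rough sketch of the other two calls in the list, with made-up numbers (``weights`` and ``mpgs`` are illustrative, not taken from the car dataset):
# <codecell>
import matplotlib.pyplot as plt

weights = [2000, 2500, 3000, 3500, 4000]
mpgs = [34.0, 28.0, 23.0, 18.0, 14.0]

plt.figure()                                      # start a fresh figure
points = plt.scatter(weights, mpgs)               # one marker per (x, y) pair
lines = plt.plot(weights, mpgs, linestyle='--')   # same data joined by a dashed line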
# <markdowncell>
# ### common plot features to tweak
# - plt.title('Sk00l Rox', fontsize=20)
# - plt.xlabel('')
# - plt.ylabel('')
# - plt.xlim(min, max)
# - plt.legend()
# <markdowncell>
# ###We can also use pandas' ``plot`` and other plotting functions!
# <codecell>
df.weight.hist(label='weight')  # the label gives plt.legend() something to show
plt.title('OMG THERES A TITLE!!!11', fontsize=20)
# let's add decoration
plt.xlabel('weight')
plt.ylabel('frequency')
plt.xlim(0, df.weight.max())
plt.legend()
# <codecell>
plt.scatter(df.year.astype(int), df.weight)
# <codecell>
df.boxplot('weight')
# df.boxplot('weight', 'year')
# <codecell>
# scatter_matrix lives in pandas.plotting since pandas 0.20 (pandas.tools.plotting was removed)
from pandas.plotting import scatter_matrix
_ = scatter_matrix(df[['mpg', 'cylinders', 'displacement']], figsize=(14, 10))
# <markdowncell>
# #Regression next. But first...
# <codecell>
plt.xkcd()
# <markdowncell>
# <img src=http://replygif.net/i/209.gif>
# <codecell>
df.weight.hist()
plt.title('WOT, THERES AN XKCD STYLE???', fontsize=18)
plt.xlabel('weight')
plt.ylabel('freq.')