<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>
A visual Machine Learning workflow
</title>
<!-- Bootstrap -->
<link href="/MachineLearningIntro/static/app/components/bootstrap/css/bootstrap.css" rel="stylesheet">
<link href="/MachineLearningIntro/static/app/global/style.css" rel="stylesheet">
<!-- HTML5 Shim and Respond.js IE8 support of HTML5 elements and media queries -->
<!-- WARNING: Respond.js doesn't work if you view the page via file:// -->
<!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js"></script>
<script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
<![endif]-->
<!-- Amplitude Analytics -->
<script type="text/javascript">
(function(e,t){var r=e.amplitude||{};var n=t.createElement("script");n.type="text/javascript";
n.async=true;n.src="https://d24n15hnbwhuhn.cloudfront.net/libs/amplitude-2.2.0-min.gz.js";
var s=t.getElementsByTagName("script")[0];s.parentNode.insertBefore(n,s);r._q=[];
function a(e){r[e]=function(){r._q.push([e].concat(Array.prototype.slice.call(arguments,0)));
}}var i=["init","logEvent","logRevenue","setUserId","setUserProperties","setOptOut","setVersionName","setDomain","setDeviceId","setGlobalUserProperties"];
for(var o=0;o<i.length;o++){a(i[o])}e.amplitude=r})(window,document);
amplitude.init("2e21c2f35eff47074772400137647a56");
</script>
<link href="/MachineLearningIntro/static/page/style.css" rel="stylesheet">
</head>
<body>
<div id="header">
<div class="container">
<div class="row">
<div class="col-xs-11 col-sm-4">
<a id="logo" class="hide-text" href="/">LOGO placeholder</a>
</div>
</div>
</div>
</div>
<div class="container" id="main">
<div class="row" id="intro">
<div class="col-xs-5">
<div id="set-up" class="tracking-section">
<h1>A visual Machine Learning workflow</h1>
<p id="translations" class="small"><a href="https://en.wikipedia.org/wiki/Decision_tree_learning">Decision tree learning</a> uses a decision tree as a predictive model which maps observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves).</p>
<p>In machine learning, computers apply <strong>statistical learning</strong> techniques to automatically identify patterns in data. These techniques can be used to make highly accurate predictions. </p>
<p><em>Keep scrolling.</em> Using a data set about homes, we will create a machine learning model to distinguish homes in Place A from homes in Place B.</p>
</div>
<div id="keep-scrolling">
<div id="animated-arrow">
<div class="co_mouse_ani">
<span class="co_mouse">
<span class="co_mouse-movement"></span>
</span>
</div>
</div>
</div>
<hr class="whitespace" style="height: 30vh;" />
<div id="first-two" class="tracking-section">
<h2>First, some intuition</h2>
<p>Let’s say you had to determine whether a home is in <strong style="color:rgb(65, 153, 43);">Place A</strong> or in <strong style="color:rgb(16, 70, 131);">Place B</strong>. In machine learning terms, categorizing data points is a <strong>classification</strong> task.</p>
<p>Since Place A is relatively hilly, the elevation of a home may be a good way to distinguish the two places.</p>
<p>Based on the home-elevation data to the right, you could argue that a home above 240 ft should be <strong>classified</strong> as one in Place A.</p>
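<p>That single if-then rule can be written down directly. A minimal sketch, using the 240 ft cutoff from the text (the function name and labels are illustrative):</p>

```javascript
// Classify a home by elevation alone, using the 240 ft cutoff
// described above. Name and labels are illustrative.
function classifyByElevation(elevationFt) {
  return elevationFt > 240 ? "Place A" : "Place B";
}

console.log(classifyByElevation(300)); // → "Place A"
console.log(classifyByElevation(100)); // → "Place B"
```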
</div>
<hr class="whitespace" style="height: 40vh;" />
<div id="add-nuance" class="tracking-section">
<h2>Adding nuance</h2>
<p>Adding another <strong>dimension</strong> allows for more nuance. For example, Place B apartments can be extremely expensive per square foot.</p>
<p>So visualizing elevation <em>and</em> price per square foot in a <strong>scatterplot</strong> helps us distinguish lower-elevation homes.</p>
<p>The data suggests that, among homes at or below 240 ft, those that cost more than £1776 per square foot are in Place B.</p>
<p>Dimensions in a data set are often called <strong>features</strong>, <strong>predictors</strong>, or <strong>variables</strong>. <span class="footnote-anchor"></span></p>
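<p>Combining both features gives a nested if-then rule. A sketch using the thresholds from the text (240 ft, £1776/sqft); homes below both thresholds are left undecided here, since the data so far cannot separate them:</p>

```javascript
// Two-feature classifier: elevation first, then price per square foot.
// Thresholds come from the text; names are illustrative.
function classifyHome(home) {
  if (home.elevationFt > 240) return "Place A";
  return home.pricePerSqft > 1776 ? "Place B" : "unknown";
}

console.log(classifyHome({ elevationFt: 100, pricePerSqft: 2500 })); // → "Place B"
```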
</div>
<hr class="whitespace" style="height: 30vh;" />
<div id="set-boundaries" class="tracking-section">
<h2>Drawing boundaries</h2>
<p>You can visualize your elevation (&gt;240 ft) and price per square foot (&gt;£1776) observations as the boundaries of regions in your scatterplot. Homes plotted in the green and blue regions would be in Place A and Place B, respectively.</p>
<p>Identifying boundaries in data using math is the essence of statistical learning.</p>
<p>Of course, you’ll need additional information to distinguish homes with lower elevations <em>and</em> lower per-square-foot prices.</p>
</div>
<hr class="whitespace" style="height: 55vh;" />
<div id="more-variables">
<div id="getting-more-data" class="tracking-section">
</div>
<div id="listing-the-variables">
<!--<div id="data-table"></div>-->
<p>The data set we are using to create the model has 7 different dimensions. Creating a model is also known as <strong>training</strong> a model.</p>
<p>On the right, we are visualizing the variables in a <strong>scatterplot matrix</strong> to show the relationships between each pair of dimensions.</p>
<p>There are clearly patterns in the data, but the boundaries for delineating them are not obvious.</p>
</div>
<div id="from-boundaries-to-pattern" class="tracking-section">
<hr class="whitespace" style="height: 30vh;" />
<h2>And now, machine learning</h2>
<p>Finding patterns in data is where machine learning comes in. Machine learning methods use statistical learning to identify boundaries.</p>
<p>One example of a machine learning method is a <strong>decision tree</strong>. Decision trees look at one variable at a time and are a reasonably accessible (though rudimentary) machine learning method. </p>
</div>
<hr class="whitespace" style="height: 30vh;" />
</div>
</div>
</div>
<div class="row" id="split">
<div class="col-xs-4 col-xs-push-8">
<div id="elevation-to-histogram" class="tracking-section">
<hr class="whitespace" style="height: 20vh;" />
<h2>Finding better boundaries</h2>
<p>Let's revisit the 240-ft elevation boundary proposed previously to see how we can improve upon our intuition.</p>
<p>Clearly, this requires a different perspective.</p>
<hr class="whitespace" style="height: 40vh;" />
<p>By transforming our visualization into a <strong>histogram</strong>, we can better see how frequently homes appear at each elevation.</p>
<p>While the highest home in Place B sits at ~240 ft, the majority of Place B homes have far lower elevations.</p>
</div>
<div id="introduce-split" class="tracking-section">
<hr class="whitespace" style="height: 30vh;" />
<h2>Your first fork</h2>
<p>A decision tree uses if-then statements to define patterns in data.</p>
<p>For example, <strong>if</strong> a home's elevation is above some number, <strong>then</strong> the home is probably in Place A.</p>
</div>
<div id="explain-gini" class="tracking-section">
<hr class="whitespace" style="height: 30vh;" />
<p>In machine learning, these statements are called <strong>forks</strong>, and they split the data into two <strong>branches</strong> based on some value.</p>
<p>That value between the branches is called a <strong>split point</strong>. Homes to the left of that point get categorized in one way, while those to the right are categorized in another. A split point is the decision tree's version of a boundary. </p>
<hr class="whitespace" style="height: 50vh;" />
<h2>Tradeoffs</h2>
<p>Picking a split point has tradeoffs. Our initial split (~240 ft) incorrectly classifies some Place A homes as Place B ones.</p>
<p>Look at that large slice of green in the left pie chart: those are all the Place A homes that are misclassified. These are called <strong>false negatives</strong>.</p>
<hr class="whitespace" style="height: 50vh;" />
<p>However, a split point meant to capture every Place A home will include many Place B homes as well. These are called <strong>false positives</strong>.</p>
<hr class="whitespace" style="height: 50vh;" />
<h2>The best split</h2>
<p>At the <strong>best split</strong>, the results of each branch should be as homogeneous (or pure) as possible. There are several mathematical methods you can choose between to calculate the best split.<span class="footnote-anchor"></span></p>
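<p>One common purity measure is the Gini impurity (see the footnotes). A sketch of scoring a candidate split point with it — the lower the weighted impurity of the two branches, the better the split. Function names are illustrative:</p>

```javascript
// Gini impurity of a set of labels: 1 - sum over classes of p^2.
// 0 means perfectly pure; higher means more mixed.
function gini(labels) {
  const counts = {};
  for (const l of labels) counts[l] = (counts[l] || 0) + 1;
  let sumSq = 0;
  for (const l in counts) {
    const p = counts[l] / labels.length;
    sumSq += p * p;
  }
  return 1 - sumSq;
}

// Score a split point by the size-weighted impurity of the two branches.
function splitScore(values, labels, splitPoint) {
  const left = [], right = [];
  values.forEach((v, i) => (v <= splitPoint ? left : right).push(labels[i]));
  return (left.length * gini(left) + right.length * gini(right)) / labels.length;
}

console.log(splitScore([1, 2, 3, 4], ["A", "A", "B", "B"], 2)); // → 0 (a perfect split)
```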
<hr class="whitespace" style="height: 20vh;" />
<p>As we see here, even the best split on a single feature does not fully separate the Place A homes from the Place B ones.</p>
<hr class="whitespace" style="height: 10vh;" />
</div>
<div id="further-split" class="tracking-section">
<hr class="whitespace" style="height: 30vh;" />
<h2>Recursion</h2>
<p>To add another split point, the algorithm repeats the process above on the subsets of data. This repetition is called <strong>recursion</strong>, and it is a concept that appears frequently in training models.<span class="footnote-anchor"></span></p>
<p class="small">The histograms to the left show the distribution of each subset, repeated for each variable.</p>
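<p>The recursive training loop can be sketched as: if a node is pure, stop; otherwise find the best split and recurse on each branch. This is a simplified sketch of that process — the <code>bestSplit</code> search is assumed to exist (for example, the impurity scoring described above), and the node shape is illustrative:</p>

```javascript
// Recursive tree building (simplified): stop when a node is pure,
// otherwise split and recurse on each branch. bestSplit(rows) is
// assumed to return { feature, value } for the best split point.
function buildTree(rows, bestSplit) {
  const labels = new Set(rows.map(r => r.label));
  if (labels.size === 1) {
    return { leaf: true, label: rows[0].label };
  }
  const { feature, value } = bestSplit(rows);
  const left = rows.filter(r => r[feature] <= value);
  const right = rows.filter(r => r[feature] > value);
  return {
    leaf: false,
    feature,
    value,
    left: buildTree(left, bestSplit),
    right: buildTree(right, bestSplit),
  };
}
```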
<hr class="whitespace" style="height: 30vh;" />
<p>The best split will vary based on which branch of the tree you are looking at. <span class="footnote-anchor"></span></p>
<p>For lower elevation homes, price per square foot, at <em id="left-side-split-value">X pounds per sqft</em>, is the best variable for the next if-then statement. For higher elevation homes, it is <em id="right-side-split-attribute">price</em>, at <em id="right-side-split-value">Y pounds</em>.</p>
<hr class="whitespace" style="height: 40vh;" />
</div>
</div>
</div>
<div class="row" id="tree">
<div class="col-xs-4">
<div class="growing-the-tree tracking-section">
<hr class="whitespace" style="height: 30vh;" />
<h2>Growing a tree</h2>
<p>Additional forks will add new information that can increase a tree's <strong>prediction accuracy</strong>. </p>
<hr class="whitespace" style="height: 40vh;" />
<p>Splitting one layer deeper, the tree's accuracy improves to <strong>84%</strong>.</p>
<hr class="whitespace" style="height: 60vh;" />
<p>Adding several more layers, we get to <strong>96%</strong>.</p>
<hr class="whitespace" style="height: 60vh;" />
<p>You could even continue to add branches until the tree's predictions are <strong>100% accurate</strong>, so that at the end of every branch, the homes are purely in Place A or purely in Place B.</p>
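<p>Prediction accuracy is simply the fraction of homes the tree labels correctly. A minimal sketch:</p>

```javascript
// Accuracy: fraction of predictions that match the true labels.
function accuracy(predicted, actual) {
  let correct = 0;
  for (let i = 0; i < actual.length; i++) {
    if (predicted[i] === actual[i]) correct++;
  }
  return correct / actual.length;
}

console.log(accuracy(["A", "A", "B", "B"], ["A", "B", "B", "B"])); // → 0.75
```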
</div>
<div class="leaf-nodes tracking-section">
<hr class="whitespace" style="height: 60vh;" />
<p>These ultimate branches of the tree are called <strong>leaf nodes</strong>. Our decision tree model will classify the homes in each leaf node according to which class of homes is in the majority.</p>
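<p>The majority vote at a leaf can be sketched as follows (on a tie, this version keeps whichever label it saw first; the function name is illustrative):</p>

```javascript
// Majority class at a leaf node: the most frequent label wins.
// On a tie, the first label encountered is kept.
function majorityLabel(labels) {
  const counts = {};
  for (const l of labels) counts[l] = (counts[l] || 0) + 1;
  return Object.keys(counts).reduce((a, b) => (counts[a] >= counts[b] ? a : b));
}

console.log(majorityLabel(["A", "A", "B"])); // → "A"
```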
<hr class="whitespace" style="height: 30vh;" />
</div>
</div>
</div>
<div class="row" id="test">
<div class="col-xs-4">
<div id="classify-training-data" class="tracking-section">
<hr class="whitespace" style="height: 35vh;" />
<h2>Making predictions</h2>
<p>The newly trained decision tree model determines whether a home is in Place A or Place B by running each data point through the branches.</p>
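<p>Running a data point through the branches amounts to walking from the root to a leaf. A sketch — the node shape (<code>leaf</code>, <code>label</code>, <code>feature</code>, <code>value</code>, <code>left</code>, <code>right</code>) is an assumption for this illustration:</p>

```javascript
// Walk a data point down the tree: at each fork, compare the point's
// feature value to the split point and follow the matching branch.
function predict(node, point) {
  while (!node.leaf) {
    node = point[node.feature] <= node.value ? node.left : node.right;
  }
  return node.label;
}
```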
<hr class="whitespace" style="height: 35vh;" />
<p>Here you can see the data that was used to train the tree flow through the tree.</p>
<p>This data is called <strong>training data</strong> because it was used to train the model.</p>
<hr class="whitespace" style="height: 35vh;" />
<p>Because we grew the tree until it was 100% accurate, this tree maps each training data point perfectly to the place it is in.</p>
</div>
<div id="classify-test-data" class="tracking-section">
<hr class="whitespace" style="height: 40vh;" />
<h2>Reality check</h2>
<p>Of course, what matters more is how the tree performs on previously unseen data.</p>
<hr class="whitespace" style="height: 40vh;" />
<p>To <strong>test</strong> the tree's performance on new data, we need to apply it to data points that it has never seen before. This previously unused data is called <strong>test data</strong>.</p>
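<p>A common way to reserve test data is to hold out part of the data set before training. A sketch — the 70/30 ratio is illustrative, and in practice the rows should be shuffled first:</p>

```javascript
// Hold out part of the data as a test set before training.
// The 70/30 ratio is illustrative; shuffle the rows first in practice.
function trainTestSplit(rows, trainFraction) {
  const cut = Math.floor(rows.length * trainFraction);
  return { train: rows.slice(0, cut), test: rows.slice(cut) };
}

const { train, test } = trainTestSplit([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 0.7);
console.log(train.length, test.length); // → 7 3
```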
<hr class="whitespace" style="height: 40vh;" />
<p>Ideally, the tree should perform similarly on both known and unknown data.</p>
<hr class="whitespace" style="height: 25vh;" />
<p>So this one is less than ideal.<span class="footnote-anchor"></span></p>
<hr class="whitespace" style="height: 25vh;" />
</div>
<div id="misclassification" class="tracking-section">
<p>These errors are due to <strong>overfitting</strong>. Our model has learned to treat every detail in the training data as important, even details that turned out to be irrelevant.</p>
<p>Overfitting is part of a fundamental concept in machine learning.<span class="footnote-anchor"></span></p>
<hr class="whitespace" style="height: 30vh;" />
</div>
</div>
</div>
<div id="conclusion" class="tracking-section">
<div class="row">
<div class="col-xs-8 col-xs-offset-2">
<hr class="whitespace" style="height: 10vh;" />
<h2>Recap</h2>
<ol>
<li>Machine learning uses <strong>statistical learning</strong> techniques to identify patterns by unearthing <strong>boundaries</strong> in data sets. You can use those boundaries to make predictions.</li>
<li>One method for making predictions is called a <strong>decision tree</strong>, which uses a series of if-then statements to identify boundaries and define patterns in the data.</li>
<li><strong>Overfitting</strong> happens when some boundaries are based on <em>distinctions that don't make a difference</em>. You can see whether a model overfits by running test data through the model.</li>
</ol>
<hr class="whitespace" style="height: 25vh;" />
</div>
</div>
<div class="row">
<div class="col-xs-8 col-xs-offset-2">
<h2>Footnote</h2>
<p>Finding structures in multidimensional data sets, be they measurement data, statistics or textual documents, is difficult and time-consuming.</p>
<p>Interesting, novel relations between the data items may be hidden in the data. Decision trees are well positioned to bring the structure of high-dimensional, multivariate data sets to the surface.</p>
<p>A confusion matrix has been identified as an appropriate measure for evaluating how well different types of maps represent a given data set, and for measuring the robustness of the illustration.</p>
<p>The same measures may also be used for comparing the knowledge that different maps represent.</p>
<hr class="whitespace" style="height: 5vh;" />
</div>
<div class="col-xs-5">
<p class="small">Some text goes here</p>
</div>
<div class="col-xs-5 col-xs-offset-1">
<p class="small">Some text goes here</p>
<hr class="whitespace" style="height: 25vh;" />
</div>
</div>
<div class="row" id="footnotes">
<div class="col-xs-8">
<h3>Footnotes</h3>
<ol id="footnote-list">
<li>Machine learning concepts have arisen across disciplines (computer science, statistics, engineering, psychology, etc.), hence the different nomenclature.</li>
<li>To learn more about calculating the optimal split, search for 'Gini index' or 'cross entropy'.</li>
<li>One reason computers are so good at applying statistical learning techniques is that they're able to do repetitive tasks very quickly and without getting bored.</li>
<li>The algorithm described here is <em>greedy</em> because it makes the locally best choice at each step: it looks for the variable that makes each subset the most homogeneous <em>at that moment</em>, without revisiting earlier splits.</li>
<li>Hover over the dots to see the path each one took in the tree.</li>
<li>Spoiler alert: It's the bias/variance tradeoff!</li>
</ol>
</div>
</div>
</div>
</div>
<div class="static-container" id="shadow-scatterplot"></div>
<div class="static-container" id="intro-scatterplot"></div>
<div class="static-container" id="split-quality"></div>
<div class="static-container" id="decision-tree"></div>
<div class="static-container" id="train-vs-test"></div>
<div id="footer" class="tracking-section">
</div>
<!-- jQuery (necessary for Bootstrap's JavaScript plugins) -->
<script src="/MachineLearningIntro/static/app/components/jquery/jquery.js"></script>
<!-- Include all compiled plugins (below), or include individual files as needed -->
<script src="/MachineLearningIntro/static/app/components/bootstrap/js/bootstrap.js"></script>
<script src="/MachineLearningIntro/static/app/components/underscore/underscore.js"></script>
<script src="/MachineLearningIntro/static/app/components/d3/d3.js"></script>
<script src="/MachineLearningIntro/static/app/components/riveted/riveted.js"></script>
<script src="/MachineLearningIntro/static/page/housing-data/tree-training-set-98.js"></script>
<script src="/MachineLearningIntro/static/app/components/backbone/backbone.js"></script>
<script src="/MachineLearningIntro/static/page/rAF-polyfill.js"></script>
<script src="/MachineLearningIntro/static/page/helpers.js"></script>
<script src="/MachineLearningIntro/static/page/main.js"></script>
</body>
</html>