<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>
A visual Machine Learning workflow
</title>
<!-- Bootstrap -->
<link href="/MachineLearningIntro/static/app/components/bootstrap/css/bootstrap.css" rel="stylesheet">
<link href="/MachineLearningIntro/static/app/global/style.css" rel="stylesheet">
<!-- HTML5 Shim and Respond.js IE8 support of HTML5 elements and media queries -->
<!-- WARNING: Respond.js doesn't work if you view the page via file:// -->
<!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js"></script>
<script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
<![endif]-->
<!-- Amplitude Analytics -->
<script type="text/javascript">
(function(e,t){var r=e.amplitude||{};var n=t.createElement("script");n.type="text/javascript";
n.async=true;n.src="https://d24n15hnbwhuhn.cloudfront.net/libs/amplitude-2.2.0-min.gz.js";
var s=t.getElementsByTagName("script")[0];s.parentNode.insertBefore(n,s);r._q=[];
function a(e){r[e]=function(){r._q.push([e].concat(Array.prototype.slice.call(arguments,0)));
}}var i=["init","logEvent","logRevenue","setUserId","setUserProperties","setOptOut","setVersionName","setDomain","setDeviceId","setGlobalUserProperties"];
for(var o=0;o<i.length;o++){a(i[o])}e.amplitude=r})(window,document);
amplitude.init("2e21c2f35eff47074772400137647a56");
</script>
<link href="/MachineLearningIntro/static/page/style.css" rel="stylesheet">
</head>
<body>
<div id="header">
<div class="container">
<div class="row">
<div class="col-xs-11 col-sm-4">
<a id="logo" class="hide-text" href="/">LOGO placeholder</a>
</div>
</div>
</div>
</div>
<div class="container" id="main">
<div class="row" id="intro">
<div class="col-xs-5">
<div id="set-up" class="tracking-section">
<h1>A visual Machine Learning workflow</h1>
<p id="translations" class="small"><a href="https://en.wikipedia.org/wiki/Decision_tree_learning">Decision tree learning</a> uses a decision tree as a predictive model which maps observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves).</p>
<p>In machine learning, computers apply <strong>statistical learning</strong> techniques to automatically identify patterns in data. These techniques can be used to make highly accurate predictions. </p>
<p><em>Keep scrolling.</em> Using a data set about homes, we will create a machine learning model to distinguish homes in Place A from homes in Place B.</p>
</div>
<div id="keep-scrolling">
<div id="animated-arrow">
<div class="co_mouse_ani">
<span class="co_mouse">
<span class="co_mouse-movement"></span>
</span>
</div>
</div>
</div>
<hr class="whitespace" style="height: 30vh;" />
<div id="first-two" class="tracking-section">
<h2>First, some intuition</h2>
<p>Let’s say you had to determine whether a home is in <strong style="color:rgb(65, 153, 43);">Place A</strong> or in <strong style="color:rgb(16, 70, 131);">Place B</strong>. In machine learning terms, categorizing data points is a <strong>classification</strong> task.</p>
<p>Since Place A is relatively hilly, the elevation of a home may be a good way to distinguish the two places.</p>
<p>Based on the home-elevation data to the right, you could argue that a home above 240 ft should be <strong>classified</strong> as one in Place A.</p>
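<p>That single if-then rule can be written down directly. A minimal sketch, using the 240 ft cutoff from the text (the function name and labels are illustrative):</p>

```javascript
// Classify a home by elevation alone, using the 240 ft cutoff
// described above. Name and labels are illustrative.
function classifyByElevation(elevationFt) {
  return elevationFt > 240 ? "Place A" : "Place B";
}

console.log(classifyByElevation(300)); // → "Place A"
console.log(classifyByElevation(100)); // → "Place B"
```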
</div>
<hr class="whitespace" style="height: 40vh;" />
<div id="add-nuance" class="tracking-section">
<h2>Adding nuance</h2>
<p>Adding another <strong>dimension</strong> allows for more nuance. For example, Place B apartments can be extremely expensive per square foot.</p>
<p>So visualizing elevation <em>and</em> price per square foot in a <strong>scatterplot</strong> helps us distinguish lower-elevation homes.</p>
<p>The data suggests that, among homes at or below 240 ft, those that cost more than £1776 per square foot are in Place B.</p>
<p>Dimensions in a data set are often called <strong>features</strong>, <strong>predictors</strong>, or <strong>variables</strong>. <span class="footnote-anchor"></span></p>
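<p>Combining both features gives a nested if-then rule. A sketch using the thresholds from the text (240 ft, £1776/sqft); homes below both thresholds are left undecided here, since the data so far cannot separate them:</p>

```javascript
// Two-feature classifier: elevation first, then price per square foot.
// Thresholds come from the text; names are illustrative.
function classifyHome(home) {
  if (home.elevationFt > 240) return "Place A";
  return home.pricePerSqft > 1776 ? "Place B" : "unknown";
}

console.log(classifyHome({ elevationFt: 100, pricePerSqft: 2500 })); // → "Place B"
```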
</div>
<hr class="whitespace" style="height: 30vh;" />
<div id="set-boundaries" class="tracking-section">
<h2>Drawing boundaries</h2>
<p>You can visualize your elevation (&gt;240 ft) and price per square foot (&gt;£1776) observations as the boundaries of regions in your scatterplot. Homes plotted in the green and blue regions would be in Place A and Place B, respectively.</p>
<p>Identifying boundaries in data using math is the essence of statistical learning.</p>
<p>Of course, you’ll need additional information to distinguish homes with lower elevations <em>and</em> lower per-square-foot prices.</p>
</div>
<hr class="whitespace" style="height: 55vh;" />
<div id="more-variables">
<div id="getting-more-data" class="tracking-section">
</div>
<div id="listing-the-variables">
<!--<div id="data-table"></div>-->
<p>The data set we are using to create the model has 7 different dimensions. Creating a model is also known as <strong>training</strong> a model.</p>
<p>On the right, we are visualizing the variables in a <strong>scatterplot matrix</strong> to show the relationships between each pair of dimensions.</p>
<p>There are clearly patterns in the data, but the boundaries for delineating them are not obvious.</p>
</div>
<div id="from-boundaries-to-pattern" class="tracking-section">
<hr class="whitespace" style="height: 30vh;" />
<h2>And now, machine learning</h2>
<p>Finding patterns in data is where machine learning comes in. Machine learning methods use statistical learning to identify boundaries.</p>
<p>One example of a machine learning method is a <strong>decision tree</strong>. Decision trees look at one variable at a time and are a reasonably accessible (though rudimentary) machine learning method. </p>
</div>
<hr class="whitespace" style="height: 30vh;" />
</div>
</div>
</div>
<div class="row" id="split">
<div class="col-xs-4 col-xs-push-8">
<div id="elevation-to-histogram" class="tracking-section">
<hr class="whitespace" style="height: 20vh;" />
<h2>Finding better boundaries</h2>
<p>Let's revisit the 240-ft elevation boundary proposed previously to see how we can improve upon our intuition.</p>
<p>Clearly, this requires a different perspective.</p>
<hr class="whitespace" style="height: 40vh;" />
<p>By transforming our visualization into a <strong>histogram</strong>, we can better see how frequently homes appear at each elevation.</p>
<p>While the highest home in Place B sits at ~240 ft, the majority of Place B homes have far lower elevations.</p>
</div>
<div id="introduce-split" class="tracking-section">
<hr class="whitespace" style="height: 30vh;" />
<h2>Your first fork</h2>
<p>A decision tree uses if-then statements to define patterns in data.</p>
<p>For example, <strong>if</strong> a home's elevation is above some number, <strong>then</strong> the home is probably in Place A.</p>
</div>
<div id="explain-gini" class="tracking-section">
<hr class="whitespace" style="height: 30vh;" />
<p>In machine learning, these statements are called <strong>forks</strong>, and they split the data into two <strong>branches</strong> based on some value.</p>
<p>That value between the branches is called a <strong>split point</strong>. Homes to the left of that point get categorized in one way, while those to the right are categorized in another. A split point is the decision tree's version of a boundary. </p>
<hr class="whitespace" style="height: 50vh;" />
<h2>Tradeoffs</h2>
<p>Picking a split point has tradeoffs. Our initial split (~240 ft) incorrectly classifies some Place A homes as Place B ones.</p>
<p>Look at that large slice of green in the left pie chart: those are all the Place A homes that are misclassified. These are called <strong>false negatives</strong>.</p>
<hr class="whitespace" style="height: 50vh;" />
<p>However, a split point meant to capture every Place A home will include many Place B homes as well. These are called <strong>false positives</strong>.</p>
<hr class="whitespace" style="height: 50vh;" />
<h2>The best split</h2>
<p>At the <strong>best split</strong>, the results of each branch should be as homogeneous (or pure) as possible. There are several mathematical methods you can choose between to calculate the best split.<span class="footnote-anchor"></span></p>
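<p>One common purity measure is the Gini impurity (see the footnotes). A sketch of scoring a candidate split point with it — the lower the weighted impurity of the two branches, the better the split. Function names are illustrative:</p>

```javascript
// Gini impurity of a set of labels: 1 - sum over classes of p^2.
// 0 means perfectly pure; higher means more mixed.
function gini(labels) {
  const counts = {};
  for (const l of labels) counts[l] = (counts[l] || 0) + 1;
  let sumSq = 0;
  for (const l in counts) {
    const p = counts[l] / labels.length;
    sumSq += p * p;
  }
  return 1 - sumSq;
}

// Score a split point by the size-weighted impurity of the two branches.
function splitScore(values, labels, splitPoint) {
  const left = [], right = [];
  values.forEach((v, i) => (v <= splitPoint ? left : right).push(labels[i]));
  return (left.length * gini(left) + right.length * gini(right)) / labels.length;
}

console.log(splitScore([1, 2, 3, 4], ["A", "A", "B", "B"], 2)); // → 0 (a perfect split)
```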
<hr class="whitespace" style="height: 20vh;" />
<p>As we see here, even the best split on a single feature does not fully separate the Place A homes from the Place B ones.</p>
<hr class="whitespace" style="height: 10vh;" />
</div>
<div id="further-split" class="tracking-section">
<hr class="whitespace" style="height: 30vh;" />
<h2>Recursion</h2>
<p>To add another split point, the algorithm repeats the process above on the subsets of data. This repetition is called <strong>recursion</strong>, and it is a concept that appears frequently in training models.<span class="footnote-anchor"></span></p>
<p class="small">The histograms to the left show the distribution of each subset, repeated for each variable.</p>
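<p>The recursive training loop can be sketched as: if a node is pure, stop; otherwise find the best split and recurse on each branch. This is a simplified sketch of that process — the <code>bestSplit</code> search is assumed to exist (for example, the impurity scoring described above), and the node shape is illustrative:</p>

```javascript
// Recursive tree building (simplified): stop when a node is pure,
// otherwise split and recurse on each branch. bestSplit(rows) is
// assumed to return { feature, value } for the best split point.
function buildTree(rows, bestSplit) {
  const labels = new Set(rows.map(r => r.label));
  if (labels.size === 1) {
    return { leaf: true, label: rows[0].label };
  }
  const { feature, value } = bestSplit(rows);
  const left = rows.filter(r => r[feature] <= value);
  const right = rows.filter(r => r[feature] > value);
  return {
    leaf: false,
    feature,
    value,
    left: buildTree(left, bestSplit),
    right: buildTree(right, bestSplit),
  };
}
```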
<hr class="whitespace" style="height: 30vh;" />
<p>The best split will vary based on which branch of the tree you are looking at. <span class="footnote-anchor"></span></p>
<p>For lower elevation homes, price per square foot, at <em id="left-side-split-value">X pounds per sqft</em>, is the best variable for the next if-then statement. For higher elevation homes, it is <em id="right-side-split-attribute">price</em>, at <em id="right-side-split-value">Y pounds</em>.</p>
<hr class="whitespace" style="height: 40vh;" />
</div>
</div>
</div>
<div class="row" id="tree">
<div class="col-xs-4">
<div class="growing-the-tree tracking-section">
<hr class="whitespace" style="height: 30vh;" />
<h2>Growing a tree</h2>
<p>Additional forks will add new information that can increase a tree's <strong>prediction accuracy</strong>. </p>
<hr class="whitespace" style="height: 40vh;" />
<p>Splitting one layer deeper, the tree's accuracy improves to <strong>84%</strong>.</p>
<hr class="whitespace" style="height: 60vh;" />
<p>Adding several more layers, we get to <strong>96%</strong>.</p>
<hr class="whitespace" style="height: 60vh;" />
<p>You could even continue to add branches until the tree's predictions are <strong>100% accurate</strong>, so that at the end of every branch, the homes are purely in Place A or purely in Place B.</p>
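<p>Prediction accuracy is simply the fraction of homes the tree labels correctly. A minimal sketch:</p>

```javascript
// Accuracy: fraction of predictions that match the true labels.
function accuracy(predicted, actual) {
  let correct = 0;
  for (let i = 0; i < actual.length; i++) {
    if (predicted[i] === actual[i]) correct++;
  }
  return correct / actual.length;
}

console.log(accuracy(["A", "A", "B", "B"], ["A", "B", "B", "B"])); // → 0.75
```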
</div>
<div class="leaf-nodes tracking-section">
<hr class="whitespace" style="height: 60vh;" />
<p>These ultimate branches of the tree are called <strong>leaf nodes</strong>. Our decision tree model will classify the homes in each leaf node according to which class of homes is in the majority.</p>
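<p>The majority vote at a leaf can be sketched as follows (on a tie, this version keeps whichever label it saw first; the function name is illustrative):</p>

```javascript
// Majority class at a leaf node: the most frequent label wins.
// On a tie, the first label encountered is kept.
function majorityLabel(labels) {
  const counts = {};
  for (const l of labels) counts[l] = (counts[l] || 0) + 1;
  return Object.keys(counts).reduce((a, b) => (counts[a] >= counts[b] ? a : b));
}

console.log(majorityLabel(["A", "A", "B"])); // → "A"
```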
<hr class="whitespace" style="height: 30vh;" />
</div>
</div>
</div>
<div class="row" id="test">
<div class="col-xs-4">
<div id="classify-training-data" class="tracking-section">
<hr class="whitespace" style="height: 35vh;" />
<h2>Making predictions</h2>
<p>The newly trained decision tree model determines whether a home is in Place A or Place B by running each data point through the branches.</p>
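<p>Running a data point through the branches amounts to walking from the root to a leaf. A sketch — the node shape (<code>leaf</code>, <code>label</code>, <code>feature</code>, <code>value</code>, <code>left</code>, <code>right</code>) is an assumption for this illustration:</p>

```javascript
// Walk a data point down the tree: at each fork, compare the point's
// feature value to the split point and follow the matching branch.
function predict(node, point) {
  while (!node.leaf) {
    node = point[node.feature] <= node.value ? node.left : node.right;
  }
  return node.label;
}
```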
<hr class="whitespace" style="height: 35vh;" />
<p>Here you can see the data that was used to train the tree flow through the tree.</p>
<p>This data is called <strong>training data</strong> because it was used to train the model.</p>
<hr class="whitespace" style="height: 35vh;" />
<p>Because we grew the tree until it was 100% accurate, this tree maps each training data point perfectly to the place it is in.</p>
</div>
<div id="classify-test-data" class="tracking-section">
<hr class="whitespace" style="height: 40vh;" />
<h2>Reality check</h2>
<p>Of course, what matters more is how the tree performs on previously unseen data.</p>
<hr class="whitespace" style="height: 40vh;" />
<p>To <strong>test</strong> the tree's performance on new data, we need to apply it to data points that it has never seen before. This previously unused data is called <strong>test data</strong>.</p>
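<p>A common way to reserve test data is to hold out part of the data set before training. A sketch — the 70/30 ratio is illustrative, and in practice the rows should be shuffled first:</p>

```javascript
// Hold out part of the data as a test set before training.
// The 70/30 ratio is illustrative; shuffle the rows first in practice.
function trainTestSplit(rows, trainFraction) {
  const cut = Math.floor(rows.length * trainFraction);
  return { train: rows.slice(0, cut), test: rows.slice(cut) };
}

const { train, test } = trainTestSplit([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 0.7);
console.log(train.length, test.length); // → 7 3
```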
<hr class="whitespace" style="height: 40vh;" />
<p>Ideally, the tree should perform similarly on both known and unknown data.</p>
<hr class="whitespace" style="height: 25vh;" />
<p>So this one is less than ideal.<span class="footnote-anchor"></span></p>
<hr class="whitespace" style="height: 25vh;" />
</div>
<div id="misclassification" class="tracking-section">
<p>These errors are due to <strong>overfitting</strong>. Our model has learned to treat every detail in the training data as important, even details that turned out to be irrelevant.</p>
<p>Overfitting is part of a fundamental concept in machine learning.<span class="footnote-anchor"></span></p>
<hr class="whitespace" style="height: 30vh;" />
</div>
</div>
</div>
<div id="conclusion" class="tracking-section">
<div class="row">
<div class="col-xs-8 col-xs-offset-2">
<hr class="whitespace" style="height: 10vh;" />
<h2>Recap</h2>
<ol>
<li>Machine learning uses <strong>statistical learning</strong> techniques to identify patterns by unearthing <strong>boundaries</strong> in data sets. You can use those boundaries to make predictions.</li>
<li>One method for making predictions is called a <strong>decision tree</strong>, which uses a series of if-then statements to identify boundaries and define patterns in the data.</li>
<li><strong>Overfitting</strong> happens when some boundaries are based on <em>distinctions that don't make a difference</em>. You can see whether a model overfits by running test data through the model.</li>
</ol>
<hr class="whitespace" style="height: 25vh;" />
</div>
</div>
<div class="row">
<div class="col-xs-8 col-xs-offset-2">
<h2>Footnote</h2>
<p>Finding structures in multidimensional data sets, be they measurement data, statistics or textual documents, is difficult and time-consuming.</p>
<p>Interesting, novel relations between the data items may be hidden in the data. Decision trees are well positioned to bring the structure of high-dimensional, multivariate data sets to the surface.</p>
<p>A confusion matrix has been identified as an appropriate measure for evaluating how well different types of maps represent a given data set, and for measuring the robustness of the illustration.</p>
<p>The same measures may also be used for comparing the knowledge that different maps represent.</p>
<hr class="whitespace" style="height: 5vh;" />
</div>
<div class="col-xs-5">
<p class="small">Some text goes here</p>
</div>
<div class="col-xs-5 col-xs-offset-1">
<p class="small">Some text goes here</p>
<hr class="whitespace" style="height: 25vh;" />
</div>
</div>
<div class="row" id="footnotes">
<div class="col-xs-8">
<h3>Footnotes</h3>
<ol id="footnote-list">
<li>Machine learning concepts have arisen across disciplines (computer science, statistics, engineering, psychology, etc.), hence the different nomenclature.</li>
<li>To learn more about calculating the optimal split, search for 'Gini index' or 'cross entropy'.</li>
<li>One reason computers are so good at applying statistical learning techniques is that they're able to do repetitive tasks very quickly and without getting bored.</li>
<li>The algorithm described here is <em>greedy</em> because it makes the locally best choice at each step: it looks for the variable that makes each subset the most homogeneous <em>at that moment</em>, without revisiting earlier splits.</li>
<li>Hover over the dots to see the path each one took in the tree.</li>
<li>Spoiler alert: It's the bias/variance tradeoff!</li>
</ol>
</div>
</div>
</div>
</div>
<div class="static-container" id="shadow-scatterplot"></div>
<div class="static-container" id="intro-scatterplot"></div>
<div class="static-container" id="split-quality"></div>
<div class="static-container" id="decision-tree"></div>
<div class="static-container" id="train-vs-test"></div>
<div id="footer" class="tracking-section">
</div>
<!-- jQuery (necessary for Bootstrap's JavaScript plugins) -->
<script src="/MachineLearningIntro/static/app/components/jquery/jquery.js"></script>
<!-- Include all compiled plugins (below), or include individual files as needed -->
<script src="/MachineLearningIntro/static/app/components/bootstrap/js/bootstrap.js"></script>
<script src="/MachineLearningIntro/static/app/components/underscore/underscore.js"></script>
<script src="/MachineLearningIntro/static/app/components/d3/d3.js"></script>
<script src="/MachineLearningIntro/static/app/components/riveted/riveted.js"></script>
<script src="/MachineLearningIntro/static/page/housing-data/tree-training-set-98.js"></script>
<script src="/MachineLearningIntro/static/app/components/backbone/backbone.js"></script>
<script src="/MachineLearningIntro/static/page/rAF-polyfill.js"></script>
<script src="/MachineLearningIntro/static/page/helpers.js"></script>
<script src="/MachineLearningIntro/static/page/main.js"></script>
</body>
</html>