PracticalDataScienceWithRErrata.html

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
  <head>
    <title>Practical Data Science with R Errata</title>
    <style>
      body {
        font-size: 10pt;
        margin: 1.5em;
        background-color: lightblue;
        color: darkblue;
        font-family: Verdana,sans-serif;
      }
      h1 {
        font-size: 1.2em;
        font-weight: bold;
        margin-top: 2em;
      }
      h2 {
        font-size: 1.1em;
        font-weight: bold;
      }
      fieldset {
        width: 740px;
        margin-bottom: 12px;
        border-color: #00457b;
        background-color: #cfeace;
      }

      fieldset div {
        margin-bottom: 6px;
        font-weight: normal;
      }

      legend {
        border: 2px ridge #00457b;
        font-size: 1.2em;
        font-weight: bold;
        background-color: #e36a51;
        color: white;
        padding: 8px 16px;
      }
    </style>
  </head>

  <body>

    <fieldset>
      <legend>Practical Data Science with R Errata</legend>

      <p>
      Your authors (<a href="http://ninazumel.com/" target="_blank">Nina Zumel</a> and <a href="http://johnmount.com/">John Mount</a>) 
      and <a href="http://www.manning.com" target="_blank">Manning Publications, Co.</a> have all worked very hard to make
      <a target="_blank" href="http://manning.com/PracticalDataSciencewithR"><b><i>Practical Data Science with R</i></b></a> an excellent
      and instructive book that is well worth your investment of time and money.
      However, errors do creep in and we do feel bad about that.
      To help we are maintaining a public errata list 
      <a href="http://winvector.github.io/PDSwR/PracticalDataScienceWithRErrata.html">here</a> (and <a target="_blank" href="https://github.com/WinVector/zmPDSwR/blob/master/PracticalDataScienceWithRErrata.html">source</a>).
      </p>
      <p>
        If you find any any errors in <a target="_blank" href="http://manning.com/PracticalDataSciencewithR"><b><i>Practical Data Science with R</i></b></a> 
        not listed below, or
        just find something that you think is not well explained:
        please post in the book's
        <a href="http://www.manning-sandbox.com/forum.jspa?forumID=863" target="_blank">Author Online Forum</a>
        so that they may be collected here for everyone's benefit. Thanks!
      </p>

      <h1>Errata</h1>
      <ul>
        <li>
          <h2>Page 12, Listing 1.2</h2>
          <p>
           Start the listing with the line <code>creditdata &lt;- d</code>.
          </p>
        </li>

        <li>
          <h2>Page 27 listing 2.8, page 30 listing 2.11 and any other use of H2DB</h2>
          <p>
           Starting with version 1.4 the H2DB database enforces that the
           path specified for the database files must be absolute
           (starting with "./", "/", or "~/").  The book examples were
           written using 1.3 versions of H2DB which did not enforce this.
           The fix is to edit the db definition xml files and R
           connection commands to make sure
           you have absolute paths. Use:
           <code>jdbc:h2:./H2DB;...</code>
           and not: <code>jdbc:h2:H2DB;...</code>.
          </p>
        </li>

        <li>
          <h2>Page 86, Table 5.1, second row, second column</h2>
          <p>
           The random forests cross-reference should be to section 9.1.2.
          </p>
        </li>

        <li>
          <h2>Page 126. Listing 6.11 Basic variable selection</h2>
          <p>
            Second "Run through categorical variables and pick based
            on a deviance improvement" should read "Run through numerical
            variables and pick based on a deviance improvement."
          </p>
        </li>


        <li>
          <h2>Page 126. Text and Listing 6.11 Basic variable selection</h2>
          <p>
          The "2 times" in listing 6.11 is because statistical deviance is
          traditionally written as -2*(log(p|model) - log(p|saturatedModel)).
          We are using a "delta deviance" or improvement of deviance given by
          (-2*(log(p|model) - log(p|saturatedModel))_-(-2*(log(p|nullModel) -
          log(p|saturatedModel))).  We wrote this as 2*log(p|model)-baseRateCheck)
          and it is a deviance style metric (not an AIC style metric as claimed in the text,
          which got behind by one revision).
          <p>
          </p>The
          text above listing 6.11 on page 126 claims we are going to use an AIC-like metric.  The original
          implementation did this by subtracting a variable entropy
          estimate from each liCheck in the the catVars list and subtracted 1
          from each numeric variable (a placeholder).  This turned out to not be
          useful for the data set at hand (this is the sort of thing you do want to try variations of), 
          and evidently we removed it without
          completely updating the text.
          </p>
          <p>Corrected text, listings, and results now 
          <a href="https://github.com/WinVector/zmPDSwR/blob/master/CodeExamples/c06_Memorization_methods/00081_example_6.11_of_section_6.3.1.R" target="_blank">here</a> 
          and <a href="https://github.com/WinVector/zmPDSwR/blob/master/CodeExamples/c06_Memorization_methods/00082_example_6.12_of_section_6.3.1.txt" target="_blank">here</a> .</p>
        </li>

        <li>
          <h2>Page 130. HOW DECISION TREE MODELS WORK</h2>
          <p>In the text explaining the decision tree
          "predVar126 &lt; -0.002810871" should read 
          "predVar126 &lt; 0.07366888".
          </p>
        </li>

        <li>
          <h2>Page 148, "Are there systematic errors?"</h2>
          <h2>Page 224, Footnote 5</h2>
          <p>
           Our description of "heteroscedasticity" is wrong and
           (unintentionally) conflates a number of different negative issues.
           Our intent was to expand on a simple definition such as "In
           statistics, a collection of random variables is
           heteroscedastic if there are sub-populations that have
           different variabilities from others" (taken
           from <a target="_blank"
           href="http://en.wikipedia.org/wiki/Heteroscedasticity">Wikipedia:
           Heteroscedasticity</a>).  Unfortunately we pushed this too
           far (roughly saying "it is bad if errors correlate with y's"
           when we
           meant to say "it was bad if errors correlate with unobserved
           ideal y's which don't already have the error term added in", i.e. with functions of the x's).
           </p>
           <p>The correct thing to do is: state that regression
           (and in particular the diagnostics of a regression) depends
           on a number of assumptions about the data.  Important properties
           include (paraphrased from Gelman, Carlin,
           Stern, Dunson, Vehtari, and Rubin "Bayesian
           Data Analysis" 3rd edition pp. 369-376): model
           structure, linearity of expected value of y as a function
           of the x's, bias of error terms, normality of error terms,
           independence of observations, and constancy of variance.
           For more see:
           <a target="_blank" href="http://andrewgelman.com/2013/08/04/19470/">What are the key assumptions of linear regression?</a>.
          </p>
           <p>
           Also: the frequenstist version of regression is commonly said to make no assumptions
           on the distribution of data or parameters (only on distribution of errors).
           This is not quite true as even the frequentist analysis depends on independence
           of observations (which is a fact about x's in addition to y's and errors).
           And a Bayesian treatment may need to assume some form of priors on one
           or more of these to get well behaved detailed posterior estimates.
          </p>
          <p>We will correct later editions of the book, and hope
          our error does not overly distract from the important 
          lesson that you must be aware of modeling assumptions.
          We do note with some amusement how rarely "Bayesian
          Data Analysis" 3rd edition pp. 369-376 uses the
          term "heteroscedasticity" even though these are
          the pages tagged with this term in the index (these authors
          rightly emphasizing concepts over naming).
          </p>
        </li>

        <li>
          <h2>Page 157 Section 7.2.1 "Understanding logistic regression"</h2>
          <p>Unfortunate typo: the sigmoid function is <code>s(z) = 1/(1+exp(-z))</code>
          (not <code>1/(1+exp(z))</code>).
          </p>
        </li>

        <li>
          <h2>Page 387 sqldf index entry</h2>
          <p>(Not an error) Another important cross-reference for sqldf
          is page 327, where we show an option that prevents an R-crash
          on OSX.
          </p>
        </li>

      </ul>

    </fieldset>


  </body>
</html>