We studied confidence intervals in Chapter \ref{confidence-intervals}. We now introduce hypothesis testing, another widely used method for statistical inference. A claim is made about a value or characteristic of the population, and a random sample is then used to assess the plausibility of this claim or hypothesis. For example, in Section \ref{ht-activity}, we use data collected from Spotify to investigate whether metal music is more popular than deep-house music.
We have already introduced many of the relevant concepts and ideas needed to understand hypothesis testing in Chapters \ref{sampling} and \ref{confidence-intervals}. We can now expand on these ideas and provide a general framework for understanding hypothesis tests. By understanding this general framework, you will be able to adapt it to many different scenarios.
The same can be said for confidence intervals: there is one general framework that applies to them, and the \texttt{infer} package was designed around this framework. While the specifics may change slightly for different types of confidence intervals, the general framework stays the same.
\subsection{Is metal music more popular than deep house music?}
% Table \ref{tab:unnamed-chunk-453}: Sample of twelve songs from the Spotify data frame.
The resulting data frame has 2,000,000 rows. This is because we performed shuffles/permutations for each of the 2000 rows 1000 times and \(2,000,000 = 1000 \cdot 2000\). If you explore the \texttt{spotify\_generate} data frame with \texttt{View()}, you will notice that the variable \texttt{replicate} indicates which resample each row belongs to. So it has the value \texttt{1} 2000 times, the value \texttt{2} 2000 times, all the way through to the value \texttt{1000} 2000 times.
Now that we have generated 1000 replicates of ``shuffles'' assuming the null hypothesis is true, let's \texttt{calculate()} \index{R packages!infer!calculate()} the appropriate summary statistic for each of our 1000 shuffles. From Section \ref{understanding-ht}, point estimates related to hypothesis testing have a specific name: \emph{test statistics}. Since the unknown population parameter of interest is the difference in population proportions \(p_{m} - p_{d}\), the test statistic here is the difference in sample proportions \(\widehat{p}_{m} - \widehat{p}_{d}\).
For each of our 1000 shuffles, we can calculate this test statistic by setting \texttt{stat\ =\ "diff\ in\ props"}. Furthermore, since we are interested in \(\widehat{p}_{m} - \widehat{p}_{d}\) we set \texttt{order\ =\ c("metal",\ "deep-house")}. As we stated earlier, the order of the subtraction does not matter, so long as you stay consistent throughout your analysis and tailor your interpretations accordingly. Let's save the result in a data frame called \texttt{null\_distribution}:
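The permutation idea behind these shuffles can be sketched in base R. This is a minimal illustration, not the book's \texttt{infer} pipeline: the song data below are simulated (hypothetical), and the helper \texttt{diff\_in\_props()} is ours.

```r
# Minimal base-R sketch of building a permutation null distribution.
# The data are simulated here (hypothetical); the book instead pipes the
# real Spotify sample through infer's generate() and calculate().
set.seed(76)
n_each  <- 1000                                      # songs per genre
genre   <- rep(c("metal", "deep-house"), each = n_each)
popular <- rbinom(2 * n_each, size = 1, prob = 0.5)  # 1 = "popular"

# test statistic: difference in sample proportions, p-hat_m - p-hat_d
diff_in_props <- function(genre, popular) {
  mean(popular[genre == "metal"]) - mean(popular[genre == "deep-house"])
}

# Under the null hypothesis the genre labels are exchangeable,
# so shuffle them 1000 times and recompute the test statistic each time
null_distribution <- replicate(1000, diff_in_props(sample(genre), popular))
length(null_distribution)  # one test statistic per shuffle: 1000
```

Each shuffle plays the role of one `replicate` in the \texttt{infer} output; the 1000 resulting statistics form the null distribution.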
<!-- v2/appendixA.html -->
<span class="header-section-number">A.2.1</span> Additional normal calculations<a class="anchor" aria-label="anchor" href="#additional-normal-calculations"><i class="fas fa-link"></i></a>
</h3>
<p>For a normal density curve, the probability or area for any given interval can be obtained using the R function <code><a href="https://rdrr.io/r/stats/Normal.html">pnorm()</a></code>. Think of the <code>p</code> in the name as <strong>p</strong>robability or <strong>p</strong>ercentage, as this function finds the area under the curve to the left of any given value, which is the probability of observing any number less than or equal to that value. The appropriate expected value and standard deviation can be supplied as arguments, but the defaults are the standard normal values, <span class="math inline">\(\mu = 0\)</span> and <span class="math inline">\(\sigma = 1\)</span>. For example, the probability of observing a value that is less than or equal to 1 in the standard normal curve is given by:</p>
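<p>A minimal call, using only base R:</p>

```r
pnorm(1)
#> 0.8413447
```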
<p>or 84%. This is the probability of observing a value that is less than or equal to one standard deviation above the mean.</p>
<p>Similarly, the probability of observing a standard value between -1 and 1 is given by subtracting the area to the left of -1 from the area to the left of 1. In R, we obtain this probability as follows:</p>
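<p>For example, in base R:</p>

```r
pnorm(1) - pnorm(-1)
#> 0.6826895
```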
<p>The probability of getting a standard value between -1 and 1, or equivalently, the probability of observing a value within one standard deviation from the mean is about 68%. Similarly, the probability of getting a value within 2 standard deviations from the mean is given by</p>
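<p>Replacing 1 by 2 in the same calculation:</p>

```r
pnorm(2) - pnorm(-2)
#> 0.9544997
```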
<p>Moreover, we do not need to restrict our study to areas within one or two standard deviations from the mean. We can find the number of standard deviations needed for any desired percentage around the mean using the R function <code><a href="https://rdrr.io/r/stats/Normal.html">qnorm()</a></code>. The <code>q</code> in the name stands for <span class="math inline">\(quantile\)</span>, and this function can be thought of as the inverse of <code><a href="https://rdrr.io/r/stats/Normal.html">pnorm()</a></code>: it finds the value of the random variable for a given area under the curve to the left of this value. When using the standard normal, the quantile also represents the number of standard deviations. For example, we learned that the area under the standard normal curve to the left of a standard value of 1 was approximately 84%. If, instead, we want to find the standard value that corresponds to exactly an area of 84% under the curve to the left of this value, we can use the following syntax:</p>
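<p>In base R this is a single call:</p>

```r
qnorm(0.84)
#> 0.9944579
```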
<p>In other words, there is exactly an 84% chance that the observed standard value is less than or equal to 0.994. Similarly, to have exactly a 95% chance of obtaining a value within <code>q</code> standard deviations from the mean, we need to find the appropriate value of <code>q</code> using <code><a href="https://rdrr.io/r/stats/Normal.html">qnorm()</a></code>.</p>
<span><span class="fu"><a href="https://ggplot2.tidyverse.org/reference/geom_ribbon.html">geom_area</a></span><span class="op">(</span>stat <span class="op">=</span> <span class="st">"function"</span>, fun <span class="op">=</span> <span class="va">dnorm</span>, fill <span class="op">=</span> <span class="st">"grey100"</span>, xlim <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/c.html">c</a></span><span class="op">(</span><span class="op">-</span><span class="fl">4</span>, <span class="op">-</span><span class="fl">2</span><span class="op">)</span><span class="op">)</span> <span class="op">+</span></span>
<span><span class="fu"><a href="https://ggplot2.tidyverse.org/reference/geom_ribbon.html">geom_area</a></span><span class="op">(</span>stat <span class="op">=</span> <span class="st">"function"</span>, fun <span class="op">=</span> <span class="va">dnorm</span>, fill <span class="op">=</span> <span class="st">"grey80"</span>, xlim <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/c.html">c</a></span><span class="op">(</span><span class="op">-</span><span class="fl">2</span>, <span class="fl">2</span><span class="op">)</span><span class="op">)</span> <span class="op">+</span></span>
<p>We want to find the standard value <code>q</code> such that the area in the middle is exactly 0.95 (or 95%). Before using <code><a href="https://rdrr.io/r/stats/Normal.html">qnorm()</a></code>, we need to provide the total area under the curve to the left of <code>q</code>. Since the total area under the normal density curve is 1, the curve is symmetric, and the area in the middle is 0.95, the total area on the tails is 1 - 0.95 = 0.05 (or 5%), and the area on each tail is 0.05/2 = 0.025 (or 2.5%). The total area under the curve to the left of <code>q</code> is therefore the area in the middle plus the area on the left tail, or 0.95 + 0.025 = 0.975. We can now obtain the standard value <code>q</code> by using <code><a href="https://rdrr.io/r/stats/Normal.html">qnorm()</a></code>:</p>
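<p>In base R:</p>

```r
qnorm(0.975)
#> 1.959964
```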
<p>The probability of observing a value within 1.96 standard deviations from the mean is exactly 95%.</p>
<p>We can follow this method to obtain the number of standard deviations needed for any area, or probability, around the mean. For example, if we want an area of 98% around the mean, the area on the tails is 1 - 0.98 = 0.02, or 0.02/2 = 0.01 on each tail. The area under the curve to the left of the desired <code>q</code> value is then 0.98 + 0.01 = 0.99, so</p>
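<p>the call in base R is:</p>

```r
qnorm(0.99)
#> 2.326348
```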
<p>The area within 2.33 standard deviations from the mean is 98%, or there is a 98% chance of choosing a value within 2.33 standard deviations from the mean. This information will be very useful to us.</p>
In addition, the <span class="math inline">\(t\)</span> distribution requires one additional parameter, the degrees of freedom. For sample mean problems, the degrees of freedom are exactly <span class="math inline">\(n-1\)</span>, the sample size minus one.</p>
<p>We construct again a 95% confidence interval for the population mean, but this time using the sample standard deviation to estimate the standard error and the <span class="math inline">\(t\)</span> distribution to determine how wide the confidence interval should be.</p>
<p>We start by obtaining the sample statistics:</p>
<p>To obtain the number of standard deviations on the <span class="math inline">\(t\)</span> distribution that account for 95% of the values, we proceed as we did in the normal case: the area in the middle is 0.95, so the area on the tails is 1 - 0.95 = 0.05. Since the <span class="math inline">\(t\)</span> distribution is also symmetric, the area on each tail is 0.05/2 = 0.025. The number of standard deviations around the center is given by the value <span class="math inline">\(q\)</span> such that the area under the <span class="math inline">\(t\)</span> curve to the left of <span class="math inline">\(q\)</span> is exactly <span class="math inline">\(0.95 + 0.025 = 0.975\)</span>. Using R we get:</p>
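<p>The sample size is not shown in this excerpt, so the degrees of freedom below (<code>df = 120</code>) are a hypothetical stand-in; a sketch of the call with <code><a href="https://rdrr.io/r/stats/TDist.html">qt()</a></code>:</p>

```r
# df = n - 1; the value 120 here is an assumed, illustrative sample size
qt(0.975, df = 120)
```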
<p>So, in order to account for 95% of the observations around the mean, we need to take into account all the values within 1.98 standard deviations from the mean. Compare this number with the 1.96 obtained for the standard normal; the difference is because the <span class="math inline">\(t\)</span> curve has thicker tails than the standard normal.
We can now construct the 95% confidence interval</p>