We studied confidence intervals in Chapter \ref{confidence-intervals}. We now introduce hypothesis testing, another widely used method for statistical inference. A claim is made about a value or characteristic of the population, and a random sample is then used to assess the plausibility of this claim or hypothesis. For example, in Section \ref{ht-activity}, we use data collected from Spotify to investigate whether metal music is more popular than deep-house music.
We have already introduced many of the relevant concepts and ideas needed to understand hypothesis testing in Chapters \ref{sampling} and \ref{confidence-intervals}. We can now expand on these ideas and provide a general framework for understanding hypothesis tests. By understanding this general framework, you will be able to adapt it to many different scenarios.
The same can be said for confidence intervals: there is one general framework that applies to them, and the \texttt{infer} package was designed around this framework. While the specifics may change slightly for different types of confidence intervals, the general framework stays the same.
\subsection{Is metal music more popular than deep house music?}
% Table \ref{tab:unnamed-chunk-453}: Sample of twelve songs from the Spotify data frame.
The resulting data frame has 2,000,000 rows. This is because we performed shuffles/permutations for each of the 2000 rows 1000 times and \(2,000,000 = 1000 \cdot 2000\). If you explore the \texttt{spotify\_generate} data frame with \texttt{View()}, you will notice that the variable \texttt{replicate} indicates which resample each row belongs to. So it has the value \texttt{1} 2000 times, the value \texttt{2} 2000 times, all the way through to the value \texttt{1000} 2000 times.
Now that we have generated 1000 replicates of ``shuffles'' assuming the null hypothesis is true, let's \texttt{calculate()} \index{R packages!infer!calculate()} the appropriate summary statistic for each of our 1000 shuffles. From Section \ref{understanding-ht}, point estimates related to hypothesis testing have a specific name: \emph{test statistics}. Since the unknown population parameter of interest is the difference in population proportions \(p_{m} - p_{d}\), the test statistic here is the difference in sample proportions \(\widehat{p}_{m} - \widehat{p}_{d}\).
For each of our 1000 shuffles, we can calculate this test statistic by setting \texttt{stat\ =\ "diff\ in\ props"}. Furthermore, since we are interested in \(\widehat{p}_{m} - \widehat{p}_{d}\) we set \texttt{order\ =\ c("metal",\ "deep-house")}. As we stated earlier, the order of the subtraction does not matter, so long as you stay consistent throughout your analysis and tailor your interpretations accordingly. Let's save the result in a data frame called \texttt{null\_distribution}:
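The permutation idea behind these shuffles can be sketched in base R. This is a minimal illustration, not the book's \texttt{infer} pipeline: the song data below are simulated (hypothetical), and the helper \texttt{diff\_in\_props()} is ours.

```r
# Minimal base-R sketch of building a permutation null distribution.
# The data are simulated here (hypothetical); the book instead pipes the
# real Spotify sample through infer's generate() and calculate().
set.seed(76)
n_each  <- 1000                                      # songs per genre
genre   <- rep(c("metal", "deep-house"), each = n_each)
popular <- rbinom(2 * n_each, size = 1, prob = 0.5)  # 1 = "popular"

# test statistic: difference in sample proportions, p-hat_m - p-hat_d
diff_in_props <- function(genre, popular) {
  mean(popular[genre == "metal"]) - mean(popular[genre == "deep-house"])
}

# Under the null hypothesis the genre labels are exchangeable,
# so shuffle them 1000 times and recompute the test statistic each time
null_distribution <- replicate(1000, diff_in_props(sample(genre), popular))
length(null_distribution)  # one test statistic per shuffle: 1000
```

Each shuffle plays the role of one `replicate` in the \texttt{infer} output; the 1000 resulting statistics form the null distribution.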
<!-- v2/appendixA.html -->
<span class="header-section-number">A.2.1</span> Additional normal calculations<a class="anchor" aria-label="anchor" href="#additional-normal-calculations"><i class="fas fa-link"></i></a>
</h3>
<p>For a normal density curve, the probability or area for any given interval can be obtained using the R function <code><a href="https://rdrr.io/r/stats/Normal.html">pnorm()</a></code>. Think of the <code>p</code> in the name as <strong>p</strong>robability or <strong>p</strong>ercentage, as this function finds the area under the curve to the left of any given value, which is the probability of observing any number less than or equal to that value. The appropriate expected value and standard deviation can be supplied as arguments, but the defaults are the standard normal values, <span class="math inline">\(\mu = 0\)</span> and <span class="math inline">\(\sigma = 1\)</span>. For example, the probability of observing a value that is less than or equal to 1 in the standard normal curve is given by:</p>
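<p>A minimal call, using only base R:</p>

```r
pnorm(1)
#> 0.8413447
```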
<p>or 84%. This is the probability of observing a value that is less than or equal to one standard deviation above the mean.</p>
<p>Similarly, the probability of observing a standard value between -1 and 1 is given by subtracting the area to the left of -1 from the area to the left of 1. In R, we obtain this probability as follows:</p>
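<p>For example, in base R:</p>

```r
pnorm(1) - pnorm(-1)
#> 0.6826895
```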
<p>The probability of getting a standard value between -1 and 1, or equivalently, the probability of observing a value within one standard deviation from the mean is about 68%. Similarly, the probability of getting a value within 2 standard deviations from the mean is given by</p>
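<p>Replacing 1 by 2 in the same calculation:</p>

```r
pnorm(2) - pnorm(-2)
#> 0.9544997
```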
<p>Moreover, we do not need to restrict our study to areas within one or two standard deviations from the mean. We can find the number of standard deviations needed for any desired percentage around the mean using the R function <code><a href="https://rdrr.io/r/stats/Normal.html">qnorm()</a></code>. The <code>q</code> in the name stands for <span class="math inline">\(quantile\)</span>, and this function can be thought of as the inverse of <code><a href="https://rdrr.io/r/stats/Normal.html">pnorm()</a></code>: it finds the value of the random variable for a given area under the curve to the left of this value. When using the standard normal, the quantile also represents the number of standard deviations. For example, we learned that the area under the standard normal curve to the left of a standard value of 1 was approximately 84%. If, instead, we want to find the standard value that corresponds to exactly an area of 84% under the curve to the left of this value, we can use the following syntax:</p>
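<p>In base R this is a single call:</p>

```r
qnorm(0.84)
#> 0.9944579
```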
<p>In other words, there is exactly an 84% chance that the observed standard value is less than or equal to 0.994. Similarly, to have exactly a 95% chance of obtaining a value within <code>q</code> standard deviations from the mean, we need to find the appropriate value of <code>q</code> using <code><a href="https://rdrr.io/r/stats/Normal.html">qnorm()</a></code>.</p>
<span><span class="fu"><a href="https://ggplot2.tidyverse.org/reference/geom_ribbon.html">geom_area</a></span><span class="op">(</span>stat <span class="op">=</span> <span class="st">"function"</span>, fun <span class="op">=</span> <span class="va">dnorm</span>, fill <span class="op">=</span> <span class="st">"grey100"</span>, xlim <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/c.html">c</a></span><span class="op">(</span><span class="op">-</span><span class="fl">4</span>, <span class="op">-</span><span class="fl">2</span><span class="op">)</span><span class="op">)</span> <span class="op">+</span></span>
<span><span class="fu"><a href="https://ggplot2.tidyverse.org/reference/geom_ribbon.html">geom_area</a></span><span class="op">(</span>stat <span class="op">=</span> <span class="st">"function"</span>, fun <span class="op">=</span> <span class="va">dnorm</span>, fill <span class="op">=</span> <span class="st">"grey80"</span>, xlim <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/c.html">c</a></span><span class="op">(</span><span class="op">-</span><span class="fl">2</span>, <span class="fl">2</span><span class="op">)</span><span class="op">)</span> <span class="op">+</span></span>
<p>We want to find the standard value <code>q</code> such that the area in the middle is exactly 0.95 (or 95%). Before using <code><a href="https://rdrr.io/r/stats/Normal.html">qnorm()</a></code>, we need to provide the total area under the curve to the left of <code>q</code>. Since the total area under the normal density curve is 1, the curve is symmetric, and the area in the middle is 0.95, the total area on the tails is 1 - 0.95 = 0.05 (or 5%), and the area on each tail is 0.05/2 = 0.025 (or 2.5%). The total area under the curve to the left of <code>q</code> is therefore the area in the middle plus the area on the left tail, or 0.95 + 0.025 = 0.975. We can now obtain the standard value <code>q</code> by using <code><a href="https://rdrr.io/r/stats/Normal.html">qnorm()</a></code>:</p>
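<p>In base R:</p>

```r
qnorm(0.975)
#> 1.959964
```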
<p>The probability of observing a value within 1.96 standard deviations from the mean is exactly 95%.</p>
<p>We can follow this method to obtain the number of standard deviations needed for any area, or probability, around the mean. For example, if we want an area of 98% around the mean, the area on the tails is 1 - 0.98 = 0.02, or 0.02/2 = 0.01 on each tail. The area under the curve to the left of the desired <code>q</code> value is then 0.98 + 0.01 = 0.99, so</p>
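<p>the call in base R is:</p>

```r
qnorm(0.99)
#> 2.326348
```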
<p>The area within 2.33 standard deviations from the mean is 98%, or there is a 98% chance of choosing a value within 2.33 standard deviations from the mean. This information will be very useful to us.</p>
In addition, the <span class="math inline">\(t\)</span> distribution requires one additional parameter, the degrees of freedom. For sample mean problems, the degrees of freedom are exactly <span class="math inline">\(n-1\)</span>, the sample size minus one.</p>
<p>We construct again a 95% confidence interval for the population mean, but this time using the sample standard deviation to estimate the standard error and the <span class="math inline">\(t\)</span> distribution to determine how wide the confidence interval should be.</p>
<p>We start by obtaining the sample statistics:</p>
<p>To obtain the number of standard deviations on the <span class="math inline">\(t\)</span> distribution that account for 95% of the values, we proceed as we did in the normal case: the area in the middle is 0.95, so the area on the tails is 1 - 0.95 = 0.05. Since the <span class="math inline">\(t\)</span> distribution is also symmetric, the area on each tail is 0.05/2 = 0.025. The number of standard deviations around the center is given by the value <span class="math inline">\(q\)</span> such that the area under the <span class="math inline">\(t\)</span> curve to the left of <span class="math inline">\(q\)</span> is exactly <span class="math inline">\(0.95 + 0.025 = 0.975\)</span>. Using R we get:</p>
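<p>The sample size is not shown in this excerpt, so the degrees of freedom below (<code>df = 120</code>) are a hypothetical stand-in; a sketch of the call with <code><a href="https://rdrr.io/r/stats/TDist.html">qt()</a></code>:</p>

```r
# df = n - 1; the value 120 here is an assumed, illustrative sample size
qt(0.975, df = 120)
```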
<p>So, in order to account for 95% of the observations around the mean, we need to take into account all the values within 1.98 standard deviations from the mean. Compare this number with the 1.96 obtained for the standard normal; the difference is because the <span class="math inline">\(t\)</span> curve has thicker tails than the standard normal.
We can now construct the 95% confidence interval</p>