Commit 0532cc8

Ex 05 simplifying by dropping CI calculation

1 parent: eeeb143

File tree

1 file changed: +1 −52 lines


Exercise_05_JAGS.Rmd

@@ -365,59 +365,8 @@ summary(SD)
 * Describe and explain the parameter covariances that you observe in the pairs plot and parameter correlation matrix.
 * Compare the summary statistics for the Bayesian regression model to those from the classical regression: summary(lm( y ~ x1 )). This should include a comparison of the means and uncertainties of **all 3 model parameters**

-# Regression Credible Intervals

-When fitting the mean of a distribution, the density plot of that distribution, and its moments/quantiles, provide all the information we need to evaluate the performance of the model. By contrast, when fitting a process model that has covariates, whether it be a simple regression or a complex nonlinear model, we are also often interested in the error estimate on the overall model. This error estimate comes in two forms: the credible interval and the prediction interval. The credible interval, which is the Bayesian analog to the frequentist confidence interval, provides an uncertainty estimate based on the uncertainty in the model parameters. We have already seen Bayesian credible intervals frequently as the quantiles of the posterior distributions of individual parameters. We can extend this to an estimate of model uncertainty by looking at the uncertainty in the distribution of the model itself, calculating a credible interval around the regression line. Numerically, we do this by evaluating the model over a sequence of covariate values for every pair of parameter values in the MCMC sequence. Let's begin by looking at a subset of the MCMC.
-
-```{r}
-xpred <- 0:20
-plot(x1,y)
-for(i in 1:10){
-  lines(xpred, out[i,"b[1]"] + out[i,"b[2]"]*xpred)
-}
-```
-
-
-You can see within the loop that we're plotting our process model – b1 + b2*x – for pairs of regression parameters from the MCMC, which creates a distribution of models. The reason that we use pairs of values from the posterior (i.e. rows in out) rather than simulating values from the posterior of each parameter independently is to account for the covariance structure of the parameters. As we saw in the pairs plot above, the covariance of the slope and intercept is considerable, and thus independent sampling of their marginal posterior distributions would lead to credible intervals that are substantially too large.
-This distribution of models is by itself not easy to interpret, especially if we added lines for EVERY pair of points in our posterior, so instead we'll calculate the quantiles of the posterior distribution of all of these lines. To do this we'll need to first make the predictions for all of these lines and then look at the posterior distribution of predicted y values given our sequence of x values.
-Before we dive into that, let's also consider the other quantity we're interested in calculating, the predictive interval, since it is easiest to calculate both at the same time. While the credible interval is solely concerned with the uncertainty in the model parameters, the predictive interval also accounts for the residual error between the model and the data. When we make a prediction with a model and want to evaluate how well the model matches data, we obviously need to consider our data model. Numerically, we do this by generating pseudodata from our model, conditioned on the current values of not only the regression parameters but also the variance parameter, and then looking at the distribution of predictions.
-In R we'll begin the calculation of our credible and predictive intervals by setting up data structures to store all the calculations.
-
-```{r}
-## credible and prediction intervals
-nsamp <- 5000
-samp <- sample.int(nrow(out),nsamp)
-xpred <- 0:20          ## sequence of x values we're going to
-npred <- length(xpred) ## make predictions for
-ypred <- matrix(0.0,nrow=nsamp,ncol=npred) ## storage for predictive interval
-ycred <- matrix(0.0,nrow=nsamp,ncol=npred) ## storage for credible interval
-```
-
-Next we'll set up a loop where we'll calculate the expected value of y at each x for each pair of regression parameters, and then add additional random error from the data model. When looping through the posterior MCMC we'll obviously want to account for any burn-in period and thinning.
-
-```{r}
-for(g in seq_len(nsamp)){
-  theta <- out[samp[g],]
-  ycred[g,] <- theta["b[1]"] + theta["b[2]"]*xpred
-  ypred[g,] <- rnorm(npred,ycred[g,],1/sqrt(theta["S"]))
-}
-```
-
-Once we have the full matrix of predicted values we'll calculate the quantiles by column (i.e. for each x value) and then plot them vs. the data. By selecting the 2.5% and 97.5% quantiles we are generating a 95% interval, because 95% of the calculated values fall in the middle and 2.5% fall in each of the upper and lower tails. We could construct alternative interval estimates by just calculating different quantiles; for example, the 5% and 95% quantiles would provide a 90% interval estimate.
-
-```{r}
-ci <- apply(ycred,2,quantile,c(0.025,0.5,0.975)) ## credible interval and median
-pi <- apply(ypred,2,quantile,c(0.025,0.975))     ## prediction interval
-
-plot(x1,y,cex=0.5,xlim=c(0,20),ylim=c(0,50))
-lines(xpred,ci[1,],col=3,lty=2) ## lower CI
-lines(xpred,ci[2,],col=3,lwd=3) ## median
-lines(xpred,ci[3,],col=3,lty=2) ## upper CI
-lines(xpred,pi[1,],col=4,lty=2) ## lower PI
-lines(xpred,pi[2,],col=4,lty=2) ## upper PI
-```
-
-# Extra Credit: Multiple Regression
+# Multiple Regression

 Using the dataset "data/Ex05.Part2.RData", extend your univariate regression to a multiple regression model with two covariates (x1, x2) and an interaction term (x1*x2). In doing so, not only do you need to update your process model, but you also need to make sure to update the dimension of your prior and initial conditions on $b$ from 2 to 4.
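The multiple-regression extension retained above might be sketched as follows. This is only an illustration, not the exercise solution: the JAGS model text and the names `b0`, `Vb`, `s1`, and `s2` follow common JAGS-regression conventions and are assumptions here, since the original univariate model code is outside this hunk.

```{r}
## hypothetical sketch: process model extended to x2 and an x1*x2 interaction
multiple_regression <- "
model{
  b ~ dmnorm(b0,Vb)  ## multivariate Normal prior, now on 4 coefficients
  S ~ dgamma(s1,s2)  ## prior on the precision
  for(i in 1:n){
    mu[i] <- b[1] + b[2]*x1[i] + b[3]*x2[i] + b[4]*x1[i]*x2[i] ## process model
    y[i]  ~ dnorm(mu[i],S)                                     ## data model
  }
}
"
## prior dimension updated from 2 to 4 (names b0, Vb are assumed)
b0 <- rep(0,4)       ## prior mean vector
Vb <- diag(0.001,4)  ## vague prior precision matrix
```

Initial conditions on $b$ would likewise become length-4 vectors, e.g. something like `inits <- list(b = rep(0,4))` for each chain.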

0 commit comments