Due Sunday by 11:59 pm.
-
(5 points) Generate a data set with
$p = 20$ features,$n = 1000$ observations, and an associated quantitative response vector generated according to the model$$Y = X\beta + \epsilon$$ where
$\beta$ has some elements that are exactly equal to zero. -
(10 points) Split your data set into a training set containing 100 observations and a test set containing 900 observations.
-
(10 points) Perform best subset selection on the training set, and plot the training set MSE associated with the best model of each size. For which model size does the training set MSE take on its minimum value?
-
(5 points) Plot the test set MSE associated with the best model of each size.
-
(5 points) For which model size does the test set MSE take on its minimum value? Comment on your results.
If it takes on its minimum value for a model containing only an intercept or a model containing all of the features, then play around with the way that you generate the data previously until you create a data generating process in which the test set MSE is minimized for an intermediate model size.
-
(10 points) How does the model at which the test set MSE is minimized compare to the true model used to generate the data? Comment on the coefficient sizes.
-
(10 points) Create a plot displaying
$$\sqrt{\sum_{j=1}^p (\beta_j - \hat{\beta}_j^r)^2}$$ for a range of values of
$r$ , where$\hat{\beta}_j^r$ is the $j$th coefficient estimate for the best model containing$r$ coefficients. Comment on what you observe. How does this compare to the test MSE plot?
The General Social Survey is a biannual survey of the American public.^[Conducted by NORC at the University of Chicago.]
In this problem set, you are going to predict individual feelings towards egalitarianism. Specifically, egalit_scale
is an additive index constructed from a series of questions designed to measure how egalitarian individuals are -- that is, the extent to which they think economic opportunities should be distributed more equally in society. The feature ranges from 1 (low egalitarianism) to 35 (high egalitarianism).
gss_*.csv
contain a selection of variables from the 2012 GSS. Documentation for the other features (if not clearly coded) can be viewed here. Some data pre-processing has been done in advance for you to ease your model fitting:
- Missing values have been imputed
- Nominal variables with more than two classes have been converted to dummy variables
- Remaining categorical variables have been converted to integer values
Your task is to construct a series of statistical/machine learning models to accurately predict an individual's egalitarianism using model selection and regularization methods. Use all the available predictors for each model unless otherwise specified.
-
(10 points) Fit a least squares linear model on the training set, and report the test MSE.
-
(10 points) Fit a ridge regression model on the training set, with
$\lambda$ chosen by 10-fold cross-validation. Report the test MSE. -
(10 points) Fit a lasso regression on the training set, with
$\lambda$ chosen by 10-fold cross-validation. Report the test MSE, along with the number of non-zero coefficient estimates. -
(10 points) Fit an elastic net regression model on the training set, with
$\alpha$ and$\lambda$ chosen by 10-fold cross-validation. That is, estimate models with$\alpha = 0, 0.1, 0.2, \ldots, 1$ using the same values for$\lambda$ across each model. Select the combination of$\alpha$ and$\lambda$ with the lowest cross-validation MSE. For that combination, report the test MSE along with the number of non-zero coefficient estimates. -
(5 points) Comment on the results obtained. How accurately can we predict an individual's egalitarianism? Is there much difference among the test errors resulting from these approaches?