05-Week2_home.Rmd

# Week 2 - Home

```{r include=FALSE}
knitr::opts_chunk$set(warning=FALSE, message=FALSE)
```
This exercise is based on:

Kestilä, Elina (2006). Is There Demand for Radical Right
Populism in the Finnish Electorate? *Scandinavian Political Studies 29*(3),169-191 
 
You have read and answered questions about the article in the reading questions. In this 
exercise, as well as in the second class practical, we will analyze these data ourselves. 

The data for this practical stem from the first round of the European Social Survey (ESS). 
This is a repeated cross-sectional survey across 32 European countries. The first round 
was held in 2002, and since then, subsequent rounds of data-collection are held bi-
anually. More info, as well as access to all data -> www.europeansocialsurvey.org. 

The raw, first round data can also be found [here](https://github.com/cjvanlissa/TCSM_student). The file is called ESSround1-
a.sav. This file contains data for all respondents, but only those variables are included 
that you will need in this exercise. 

### Question 1.a

Download the file, and import it in R. Inspect the file (no. of cases and no. of variables) to see if the file opened well. 

```{r, message=FALSE, warning=FALSE, eval=FALSE}
library(foreign)
data <- read.spss("ESSround1-a.sav", to.data.frame = TRUE)
```
```{r, echo=FALSE, warning=FALSE}
library(foreign)
data <- read.spss("TCSM_student/ESSround1-a.sav", to.data.frame = TRUE)
```

<details>
  <summary><b>For a description of all variables in the dataset, click here!</b></summary>
```{r, results = "asis", echo = FALSE}
desc <- c("Title of dataset",
"ESS round",
"Edition",
"Production date",
"Country",
"Respondent's identification number",
"Trust in the legal system",
"Trust in the police",
"Trust in the United Nations",
"Trust in the European Parliament",
"Trust in country's parliament",
"State of health services in country nowadays",
"State of education in country nowadays",
"How satisfied with present state of economy in country",
"How satisfied with the national government",
"How satisfied with the way democracy works in country",
"Politicians interested in votes rather than peoples opinions",
"Politicians in general care what people like respondent think",
"Trust in politicians",
"Allow many/few immigrants of same race/ethnic group as majority",
"Allow many/few immigrants of different race/ethnic group from majority",
"Allow many/few immigrants from richer countries in Europe",
"Allow many/few immigrants from poorer countries in Europe",
"Allow many/few immigrants from richer countries outside Europe",
"Allow many/few immigrants from poorer countries outside Europe",
"Qualification for immigration: christian background",
"Qualification for immigration: be white",
"Average wages/salaries generally brought down by immigrants",
"Immigrants harm economic prospects of the poor more than the rich",
"Immigrants take jobs away in country or create new jobs",
"Taxes and services: immigrants take out more than they put in or less",
"Immigration bad or good for country's economy",
"Country's cultural life undermined or enriched by immigrants",
"Immigrants make country worse or better place to live",
"Immigrants make country's crime problems worse or better",
"Richer countries should be responsible for accepting people from poorer countries",
"Better for a country if almost everyone share customs and traditions",
"Better for a country if a variety of different religions",
"Country has more than its fair share of people applying refugee status",
"People applying refugee status allowed to work while cases considered",
"Government should be generous judging applications for refugee status",
"Most refugee applicants not in real fear of persecution own countries",
"Financial support to refugee applicants while cases considered",
"Granted refugees should be entitled to bring close family members",
"Gender",
"Year of birth",
"Highest level of education",
"Years of full-time education completed",
"How interested in politics",
"Placement on left right scale")

library(kableExtra)

kable(data.frame(Variable = names(data), Description = desc)) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

```
`r if(knitr::is_html_output()){"\\details"}`

### Question 1.b

The ESS-file contains much more information than we need to re-analyze the paper by 
Kestilä. We need to reduce the number of cases to make sure our results pertain to the relevant target population.  

Kestilä only uses data from ten countries: `c("Austria", "Belgium", "Denmark", "Finland", "France", "Germany", "Italy", "Netherlands", "Norway", "Sweden")`.

As explained in the tutorial chapters at the beginning of this GitBook, it is possible to select rows from the data. In this case, we will select only rows from these countries by means of boolean indexing, using the `%in%` function to check if the value of `$cntry` is in the list.

*Hint: Use* `[]; %in%`

<details>
  <summary>Click for explanation</summary>

Use the following code:
  
```{r, echo = TRUE}
df <- data[data$cntry %in% c("Austria", "Belgium", "Denmark", "Finland", "France", "Germany", "Italy", "Netherlands", "Norway", "Sweden"), ]
```

`r if(knitr::is_html_output()){"\\details"}`


### Question 1.c

Inspect the data file again to see whether step 1b went ok.

### Question 1.d 

Before we can start the analyses, we first need to screen the 
data. What are the things we need to watch for? (think about your earlier 
statistics-courses)?

<details>
  <summary>Click for explanation</summary>
  This question is open to interpretation. One thing you might notice is that all variables the authors used are currently coded as factor variables (e.g., "Factor w/ 11 levels"):

```{r, echo = TRUE}
library(tidySEM)
descriptives(df)
levels(df$trstlgl)
```

In keeping with conventions, we could treat ordinal Likert scales with >5 levels as continuous. We can either re-code the data, or prevent `read.spss()` from coding these variables as factors when it reads the data. Here is code for both approaches.

#### Re-coding factors to numeric

We can convert a factor to numeric using the function `as.numeric()`. However, in this case, we have `r 45-7` variables to convert - that's a lot of code.

Thankfully, there is a function to apply this transformation to each column of the data: `lapply()`, short for list apply. This function takes each list element (column of the `data.frame`), and applies the function `as.numeric()` to it:

```{r, echo = TRUE}
df[7:44] <- lapply(df[7:44], as.numeric)
```

#### Reading data without coding factors

An alternative solution is to prevent the function `read.spss()` from using value labels to code variables as factors when the data are loaded. However, we're not going to use this right now, so the information below is merely illustrative:

```{r, eval = FALSE, echo = TRUE}
# The option use.value.labels = FALSE stops the function from coding factors:
data <- read.spss("ESSround1-a.sav", to.data.frame = TRUE, use.value.labels = FALSE)
# Then, re-select the subset of data. The countries are now also unlabeled, so
# we select them by number:
df <- data[data$cntry %in% c(21,18,17,15,9,8,6,5,2,1), ]
```

`r if(knitr::is_html_output()){"\\details"}`

### Question 1.e

Aside from screening variables by looking at summary statistics, we can also plot their distributions. You already know how to do this for one single variable.
Yet it is not very difficult to plot all variables in a data.frame.
To do this, we need to turn the data from "wide format" (one column per variable)
into "long format" (one column with the variable names, one column with the values).
The package `tidyr` has a convenient function to reshape data to long format:
`pivot_longer()`.

First, make a long data.frame containing variables 7:44.

```{r, eval = FALSE, echo = TRUE}
install.packages("tidyr")
library(tidyr)
df_plot <- df[7:44]
df_plot <- pivot_longer(df_plot, names(df_plot))
```
```{r, echo = FALSE}
library(tidyr)
df_plot <- df[7:44]
df_plot <- pivot_longer(df_plot, names(df_plot))
```

Next, we can plot the data - using `geom_histogram()`, `geom_density()` or `geom_boxplot()` as before. A new additional function allows us to make separate plots for each variable: `+ facet_wrap(~name, scales = "free_x")`.

<details>
  <summary>Click for explanation</summary>

```{r, echo = TRUE}
library(ggplot2)
ggplot(df_plot, aes(x = value)) + geom_histogram() + facet_wrap(~name, scales = "free_x")
```

Notice the fact that you can see that the scales are actually categorical (because of the gaps between bars), and that most variables look relatively normally distributed despite being categorical. It's probably fine to treat them as continuous.
`r if(knitr::is_html_output()){"\\details"}`

### Question 1.f

Check the scale descriptives table again. Are there any incorrectly coded missing value labels, or other inexplicable values?

### Question 1.g

The first step in re-analyzing data is replicating the results from the paper by 
Kestilä. Run a Principal Component Analysis using `psych::principal()`, and 
choose the exact same specification as Kestilä concerning estimation method, 
rotation etc. Do two analyses: one for trust in politics, and one for attitudes towards immigration. Remember that you can view the help file for `psych::principal()` by running
`?psych::principal`.

*Hint: Use* `psych::principal()`

<details>
  <summary>Click for explanation</summary>

#### Trust in politics

Kestilä extracted three components, with VARIMAX rotation. When we print the results, we can hide all factor loadings smaller than the smallest one in their table, to make it easier to read:
  
```{r, echo = TRUE}
library(psych)
pca_trust <- principal(df[, 7:19], nfactors = 3, rotate = "varimax")
print(pca_trust, cut = .3, digits =3)
```

For attitude towards immigration, Kestilä extracted five components, with VARIMAX rotation:
  
```{r, echo = TRUE}
library(psych)
pca_att <- principal(df[, 20:44], nfactors = 5, rotate = "varimax")
print(pca_att, cut = .3)
```
`r if(knitr::is_html_output()){"\\details"}`

### Question 1.h

<!-- Video here about extracting factor scores-->
Extract the PCA factor scores from the results objects, and add them to the data.frame. Give the PCA scores informative names, based on your interpretation of the factor loadings, so that you understand what they summarize. 

*Hint: Use* `$; colnames(); cbind()`

<details>
  <summary>Click for explanation</summary>
Extracting factor scores

The factor scores are inside of the results objects. Use the `$` operator to access them:

```{r, echo = TRUE}
head(pca_att$scores)
```

We're going to give these factor scores some informative names, and add them to our data.frame. **You should give them different, informative names based on the meaning of the factors!**

```{r, echo = TRUE}
# Print names
colnames(pca_att$scores)
# Change names
colnames(pca_att$scores) <- c("Att1", "Att2", "Att3", "Att4", "Att5")
colnames(pca_trust$scores) <- c("Trust1", "Trust2", "Trust3")
# Add columns
df <- cbind(df, pca_trust$scores, pca_att$scores)
```

`r if(knitr::is_html_output()){"\\details"}`


### Question 1.i

Are you able to replicate her results? 

<details>
  <summary>Click for explanation</summary>
  No, probably not. The team of teachers was unable to reproduce this analysis, even though it should be pretty easy to do so...

`r if(knitr::is_html_output()){"\\details"}`


### Question 1.j

Save your syntax and bring your data and syntax to the practical on Thursday.