# Week 2 - Class
```{r include=FALSE}
knitr::opts_chunk$set(warning=FALSE, message=FALSE, echo = TRUE)
```
In the unlikely event that you were able to replicate the results of Kestilä in the take-home exercise, rerun your analysis script. If you did not manage to replicate the results, load the dataset `ESSround1-b.csv` into RStudio.
### Loading the data
If you don't know how to load the data, click below.
<details>
<summary>Click for explanation</summary>
``` {r, eval = FALSE, warning = FALSE}
df <- read.csv("ESSround1-b.csv", stringsAsFactors = FALSE)
```
``` {r, echo = FALSE, message = FALSE, warning = FALSE}
df <- read.csv("TCSM_student/ESSround1-b.csv", stringsAsFactors = FALSE)
```
`r if(knitr::is_html_output()){"\\details"}`
### Question 1
Kestilä states that running a Principal Components Analysis is a good way to test whether the survey questions in the ESS measure attitudes towards immigration and trust in politics.
Based on your reading of Preacher and MacCallum (2003), do you agree with this position?
### Question 2
If you had to choose a method for constructing the 'trust in politics' and 'attitude towards immigration' scales, based on the theory and background information in the Kestilä article, what type of factor analysis would you choose? Which key factors influence your decision?
<details>
<summary>Click for information</summary>
Key factors include:
* Theory-driven or exploratory?
* Estimation method
* Rotation method
* Method to establish how many factors are needed
`r if(knitr::is_html_output()){"\\details"}`
### Question 3
Run two factor analyses, one for each PCA from the original article. Inspect how many factors are needed, evaluate the rotation method, and, if necessary, run the factor analysis again with adapted settings (a different rotation method and/or number of factors). How many factors are there?
*Hint: Examine the help file:* `?psych::fa`
<details>
<summary>Click for explanation</summary>
To perform an exploratory factor analysis, you can use the `fa` function from the `psych` package. You have to specify the data and the variables you want to include in the factor analysis, as well as the number of factors to extract, the rotation method, and the estimation method.
```{r}
library(psych)
library(GPArotation)
efa_trust <- fa(df[, 7:19], nfactors = 3, rotate = "promax", fm = "ml", scores = "Bartlett")
```
In order to determine the number of factors to extract, you might want to look at the eigenvalues of the factors, which can be accessed by using the following code:
```{r}
round(efa_trust$values, digits = 3)
```
Another common approach is to make a scree plot. All that is required is to pass the eigenvalues to the `qplot()` function and add a `geom_path()`:
```{r}
library(ggplot2)
qplot(y = efa_trust$values) + geom_path()
```
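Note that `qplot()` has been deprecated in recent versions of ggplot2 (it still works, but emits a deprecation warning). If you prefer to avoid it, the same scree plot can be made with the main `ggplot()` interface; a minimal sketch, assuming the `efa_trust` object created above:
```{r, eval = FALSE}
# Same scree plot without qplot(): put the eigenvalues in a data.frame first
scree_data <- data.frame(factor     = seq_along(efa_trust$values),
                         eigenvalue = efa_trust$values)
ggplot(scree_data, aes(x = factor, y = eigenvalue)) +
  geom_point() +
  geom_path()
```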
You can do the same for the attitude variables:
```{r, warning = FALSE}
efa_att <- fa(df[, 20:44], nfactors = 5, rotate = "promax", fm = "ml", scores = "Bartlett")
round(efa_att$values, digits = 3)
qplot(y = efa_att$values) + geom_path()
```
As recommended in the lecture, a better and more quantitative way to establish the number of factors is "parallel analysis" (Horn, 1965). Here is how to do it in R:
```{r}
fa.parallel(df[, 7:19])
```
Note that, in this case, parallel analysis recommends as many as 6 factors. However, to be able to compare our results with those of Kestilä, we might consider aligning ourselves with the literature and extracting the same number of factors as there are components in the article.
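The same check can be run for the attitude items; a sketch, using the same column range as in the earlier chunks:
```{r, eval = FALSE}
# Parallel analysis for the attitude items (columns 20 through 44)
fa.parallel(df[, 20:44])
```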
`r if(knitr::is_html_output()){"\\details"}`
### Question 4
Apart from the number of factors, you also want to look at the factor loadings, which can be found in the "pattern matrix". The higher an item's factor loading, the more indicative that item is of the latent factor. If some items have only very low loadings (indicating that they provide little information about the factor), you may choose to exclude them from your analysis. This means rerunning the analysis from question 3.
<details>
<summary>Click for explanation</summary>
You can display the factor loadings with the `print()` function used in the take-home exercise, or extract the `loadings` element from the results object to get just the information you are looking for.
```{r}
efa_trust$loadings
efa_att$loadings
```
The matrix of loadings indicates how strongly each factor (columns) is associated with the items (rows).
Below the matrix of loadings, we see a second matrix, which indicates (amongst other things) the
*proportion var*: How much variance in the items is explained by each of the factors.
Each subsequent factor explains slightly less variance than the ones before it (this is a property of exploratory factor analysis).
The *cumulative var* indicates how much variance the factors explain, in total. If you estimated as many factors as items, then the last value for *cumulative var* would be `1.00` (100%).
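If you want these proportions as numbers rather than reading them off the printed output, you can compute them by hand from the loadings; a minimal sketch, assuming the `efa_trust` object from above (the "SS loadings" are the column sums of the squared loadings, and "Proportion Var" divides these by the number of items):
```{r, eval = FALSE}
# Reproduce the "SS loadings" / "Proportion Var" / "Cumulative Var" rows by hand
ss_loadings <- colSums(efa_trust$loadings^2)
prop_var    <- ss_loadings / nrow(efa_trust$loadings)
round(rbind("SS loadings"    = ss_loadings,
            "Proportion Var" = prop_var,
            "Cumulative Var" = cumsum(prop_var)), digits = 2)
```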
The factor loading matrix is somewhat hard to read because of the jumble of loadings. For clarity, it is convenient to suppress loadings below .30.
```{r}
print(efa_trust$loadings, cut = .30, digits = 2)
```
Furthermore, if you want to perform a factor analysis without, say, `stfedu`, while keeping all other variables in the analysis, you can simply leave the column number of `stfedu` (13) out of the command:
```{r, eval = FALSE}
efa_trust_without_stfedu <- fa(df[, c(7:12, 14:19)], nfactors = 3, rotate = "promax", fm = "ml")
```
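An alternative that does not depend on remembering column positions is to select the items by name; a sketch, assuming `stfedu` is among the column names of `df`:
```{r, eval = FALSE}
# Select all trust items except stfedu by name rather than by column index
trust_items <- names(df)[7:19]
efa_trust_without_stfedu <- fa(df[, setdiff(trust_items, "stfedu")],
                               nfactors = 3, rotate = "promax", fm = "ml")
```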
`r if(knitr::is_html_output()){"\\details"}`
### Question 5
Give the factor scores an appropriate name. You can do this by inspecting the items that load on one factor. What do these items have in common substantively? The goal of a factor analysis usually is to create interpretable factors. If you have trouble interpreting the factors, you can choose to tweak the analysis by changing the options, or including/excluding more items.
Then, after you have named the factor scores, extract them from the results object and add them to the data.frame.
*Hint: If you do not know how to do this, have a look at question 1.h from the take-home exercise of week 2.*
**Please note that the `colnames` will be specified from left to right, and not, for example, from ML1 to ML5.**
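A minimal sketch of this step (the names below are placeholders; replace them with labels that reflect your own interpretation of the factors):
```{r, eval = FALSE}
# Assign substantive names to the factor-score columns (left to right)
colnames(efa_trust$scores) <- c("factor1_name", "factor2_name", "factor3_name")
colnames(efa_att$scores)   <- c("att_factor1", "att_factor2", "att_factor3",
                                "att_factor4", "att_factor5")
# Add the renamed scores to the data.frame
df_with_scores <- cbind(df, efa_trust$scores, efa_att$scores)
```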
```{r, echo = FALSE}
# Change names
colnames(efa_trust$scores) <- c("trustpolEFA", "satcntryEFA", "trustinstEFA")
colnames(efa_att$scores) <- c("effectimmiEFA", "allowimmiEFA", "allowrefEFA", "ethnicEFA", "immieurEFA")
# Add columns
df_with_scores <- cbind(df, efa_trust$scores, efa_att$scores)
```
### Question 6
The next step is to assess whether the items that together form one factor also form a reliable scale. Run separate reliability analyses with the function `alpha` for the items that together form one factor, and evaluate Cronbach's alpha to see whether the scales are internally consistent. The "reliability if an item is dropped" (`alpha.drop`) information may be handy for inspecting what would happen if you deleted one item; you can find it inside the reliability analysis object.
*Hint: A Cronbach's alpha above .7 is deemed acceptable; above .8 is good.*
If Cronbach's alpha is not acceptable, deselect one survey item and run the analyses under questions 4 and 5 again.
<details>
<summary>Click for explanation</summary>
If you want to assess the reliability of the variables `pltcare`, `pltinvt`, `trstprl`, `trstplt`, and `trstep` you can run a reliability analysis as follows.
*Hint: name the new objects substantively, instead of numbering them.*
```{r}
library(psych)
reli_1 <- psych::alpha(df_with_scores[, c(7, 8, 9, 12, 13)])
reli_1
```
Sometimes, the "reliability if an item is dropped" table will indicate that Cronbach's alpha increases when you drop a variable from the analysis. Note, however, that doing so is an exploratory step, and it may make your work incomparable to other publications using the same scale.
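The "reliability if an item is dropped" table can also be pulled out of the results object directly; a sketch, assuming the `reli_1` object created above:
```{r, eval = FALSE}
# The alpha.drop element contains the "reliability if an item is dropped" statistics
reli_1$alpha.drop
```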
`r if(knitr::is_html_output()){"\\details"}`
### Question 7
Now you can examine the differences between the factor scores from the PCA (take-home exercise 2) and from the EFA by plotting them against each other in a series of bivariate scatterplots using `ggplot2`. The PCA factor scores are already stored in the dataset `ESSround1-b.csv`, in columns 45 through 52 (`df[, 45:52]`).
Plot the scores for one of your new factors on the x axis, and try to match it with a corresponding PCA component from the dataset. Note that if you chose a different number of factors than Kestilä's components (3 for 'trust in politics' and 5 for 'attitudes towards immigration'), some of your factors may show little correlation with any of the components.
<details>
<summary>Click for explanation</summary>
*Make sure to adjust the arguments `x = EFA_Trust1` and `y = PCA_Trust1` to refer to _your own_ EFA factor column name and the corresponding PCA component column name _from the data_, respectively.*
```{r, eval = FALSE}
library(ggplot2)
ggplot(data = df_with_scores) +
geom_point(mapping = aes(x = EFA_Trust1, y = PCA_Trust1))
```
```{r, echo = FALSE, warning = FALSE, message = FALSE}
library(ggplot2)
ggplot(data = df_with_scores) +
geom_point(mapping = aes(x = trustinstEFA, y = trustinst)) +
labs(x = "EFA_Trust1",
y = "PCA_Trust1")
```
`r if(knitr::is_html_output()){"\\details"}`
### Question 8
Examine the correlations between the PCA and EFA scales for the 'trust in politics' scores, and for the 'immigration' scores. What is your conclusion: is there a difference between them?
<details>
<summary>Click for explanation</summary>
*Hint: Name the new objects substantively once again.*
Example for trust in politics:
```{r, eval = FALSE}
library(psych)
corr.test(df_with_scores$EFA_Trust1, df_with_scores$PCA_Trust1, use = "complete.obs")
```
```{r, echo = FALSE, warning = FALSE}
library(psych)
corr.test(df_with_scores$trustinstEFA, df_with_scores$trustinst, use = "complete.obs")
```
`r if(knitr::is_html_output()){"\\details"}`
### Question 9
Kestilä uses the PCA factor scores to evaluate country-level differences in (1) attitudes towards immigration and (2) political trust. Repeat her analyses using the factor scores you saved in question 5. Think about the statistical test you would like to use. Do you draw similar or different conclusions?
<details>
<summary>Click for explanation</summary>
We can use an ANOVA to test whether the countries differ in the amount of political trust the participants have.
```{r}
fit_ANOVA <- aov(trustinstEFA ~ cntry, data = df_with_scores)
summary(fit_ANOVA)
```
Here, `trustinstEFA` is the dependent variable, `cntry` is the grouping variable, and `df_with_scores` is the dataset; `summary()` provides the results. However, it only indicates whether the effect of `cntry` is significant, so you cannot tell which countries differ in their political trust. To find out which countries differ, you can run a pairwise comparison test with a Bonferroni adjustment for multiple testing.
```{r}
pair_comparison <- pairwise.t.test(df_with_scores$trustinstEFA, df_with_scores$cntry,
                                   p.adjust.method = "bonf")
round(pair_comparison$p.value, digits = 3)
```
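To see which countries score relatively high or low on political trust, it can also help to inspect the group means directly; a minimal sketch:
```{r, eval = FALSE}
# Mean EFA trust score per country, sorted from lowest to highest
sort(tapply(df_with_scores$trustinstEFA, df_with_scores$cntry, mean, na.rm = TRUE))
```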
`r if(knitr::is_html_output()){"\\details"}`
### Question 10
The second goal of Kestilä is to show how socio-demographic characteristics affect attitudes towards immigrants and trust in politics in Finland. Select only the Finnish cases using the variable `cntry`.
<details>
<summary>Click for explanation</summary>
To select the Finnish cases only, select rows for which `$cntry` is equal to (`==`) Finland:
```{r}
df_finland <- df_with_scores[df_with_scores$cntry == "Finland", ]
```
`r if(knitr::is_html_output()){"\\details"}`
After selecting the Finnish cases, we have to prepare our variables for analysis.
Thanks to the factor analyses above, our dependent variables are taken care of.
However, we should examine the independent variables to make sure there are no surprises,
and do any recoding necessary before we can run multiple regression. The independent variables are
`c("gndr", "yrbrn", "eduyrs", "polintr", "lrscale")`.
<details>
<summary>Click for explanation</summary>
```{r}
library(tidySEM)
descriptives(df_finland[, c("gndr", "yrbrn", "eduyrs", "polintr", "lrscale")])
```
Note that we still have to do some recoding. For example, the data contain participants' year of birth instead of their age, so we have to compute age ourselves. The data were collected in 2002, so we can simply subtract each participant's year of birth from 2002.
```{r}
df_finland$age <- (2002 - df_finland$yrbrn)
```
Furthermore, the variables `polintr` and `lrscale` are currently coded as character variables. If you analyze them as-is, R will create dummies for every distinct value of these variables. Let's recode those, too! The main challenge is to get the levels of these variables in the correct order and convert them to numbers. We demonstrate two ways to do this, one for each variable.
For political interest, we convert the character variable to an ordered categorical variable, and we specify the correct order of labels. Then, we convert it to a numeric variable.
```{r}
table(df_finland$polintr)
df_finland$polintr <- ordered(df_finland$polintr, levels = c("Not at all interested", "Hardly interested", "Quite interested", "Very interested"))
table(df_finland$polintr)
df_finland$polintr <- as.numeric(df_finland$polintr)
```
For political orientation, only the two extreme values of the scale are labeled as text. Thus, we can replace these labels with numbers, and then convert the entire variable to numeric.
```{r}
table(df_finland$lrscale)
df_finland$lrscale[df_finland$lrscale == "Left"] <- 0
df_finland$lrscale[df_finland$lrscale == "Right"] <- 10
df_finland$lrscale <- as.numeric(df_finland$lrscale)
```
Furthermore, Kestilä recoded `polintr` so that there are only two categories: the lowest two and the highest two categories were combined. Here is how to do this in R, so you can replicate her analysis:
```{r}
df_finland$polintr_dummy <- ifelse(df_finland$polintr <= 2,
"Not or hardly interested",
"Quite or very interested")
```
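It is good practice to check a recode like this against the original variable; a quick sketch:
```{r, eval = FALSE}
# Cross-tabulate the original (numeric) interest variable against the new dummy
table(df_finland$polintr, df_finland$polintr_dummy, useNA = "ifany")
```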
Next, run a number of multiple linear regression analyses with the (sub-)scales of attitudes towards immigrants and political trust as the dependent variables (one model per scale), and the same predictors as Kestilä. Inspect your output.
Compare your results with those of Kestilä: where do they agree, and where do they differ?
```{r}
fit_trust1 <- lm(trustinstEFA ~ gndr + age + eduyrs + polintr_dummy + lrscale, data = df_finland, na.action = na.omit)
summary(fit_trust1)
```
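The same template applies to the other (sub-)scales; a sketch with a placeholder outcome (replace `your_attitude_factor` with one of your own attitude factor-score column names):
```{r, eval = FALSE}
# One regression per (sub-)scale: swap in each dependent variable in turn
fit_att1 <- lm(your_attitude_factor ~ gndr + age + eduyrs + polintr_dummy + lrscale,
               data = df_finland, na.action = na.omit)
summary(fit_att1)
```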
`r if(knitr::is_html_output()){"\\details"}`
### Question 11
Save your syntax and your data; you will need them next week.
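A minimal sketch of how you might save your objects (the file names are just suggestions):
```{r, eval = FALSE}
# Save the data with factor scores and the Finnish subset for next week
saveRDS(df_with_scores, "week2_df_with_scores.RDS")
saveRDS(df_finland, "week2_df_finland.RDS")
# Your syntax is the .R / .Rmd file itself; simply keep (a copy of) it.
```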