sentiment.Rmd

---
title: "Sentiment Hitchhikers Guide trilogy"
output: github_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

> “The ships hung in the sky in much the same way that bricks don't.”   - Hitchhiker's Guide to the Galaxy

These are the beautiful words of 'Hitchhiker's Guide to the Galaxy', book one in the trilogy
of five. Douglas Adams wrote in way that I really really like. Many many of the 
sentences make me laugh, it is a beautiful high paced advanture. Just check these quotes:

> “For a moment, nothing happened. Then, after a second or so, nothing continued to happen.” 

a lovely anticlimax.

![My hardcopy version of the books](data/images/IMG_20180608_183720120.jpg)

There many things we can do with the books. But in this case I will look at sentiment
and common words. 

## Sentiment

```{r}
source("R/01_load_data.R")
library(tidytext)
unnestedHHGTTG<- 
    HHGTTG %>% 
    group_by(book, chapter) %>% 
    unnest_tokens(output = "word",input = content, token = "words") %>% 
    ungroup()

# use afinn and numbers
# NRC surprise, fear, joy, sadness and count
joy <- get_sentiments("nrc") %>% 
    filter(sentiment == "joy")
surprise <- get_sentiments("nrc") %>% 
    filter(sentiment == "surprise")
fear <- get_sentiments("nrc") %>% 
    filter(sentiment == "fear")
sadness <- get_sentiments("nrc") %>% 
    filter(sentiment == "sadness")
```

Using the NRC sentiment corpus we can look at words that signal joy, surprise, fear or sadness. 


```{r}
unnestedHHGTTG %>% 
    inner_join(joy) %>%
    group_by(book) %>% 
    count(word, sort = TRUE) %>% 
    top_n(15, wt = n) %>% 
    ungroup() %>% 
    mutate(word = reorder(word, n)) %>%
    ggplot(aes(word, n, fill = book))+
    geom_col(show.legend = FALSE)+
    facet_wrap(~book,scales = "free")+
    coord_flip()+
    labs(
        title = "Top 25 most frequent joy-words",
        subtitle = "",
        x = "", y = ""
    )

unnestedHHGTTG %>% 
    inner_join(surprise) %>%
    group_by(book) %>% 
    count(word, sort = TRUE) %>% 
    top_n(15, wt = n) %>% 
    ungroup() %>% 
    mutate(word = reorder(word, n)) %>%
    ggplot(aes(word, n, fill = book))+
    geom_col(show.legend = FALSE)+
    facet_wrap(~book,scales = "free")+
    coord_flip()+
    labs(
        title = "Top 25 most frequent surprise-words",
        subtitle = "",
        x = "", y = ""
    )

unnestedHHGTTG %>% 
    inner_join(fear) %>%
    group_by(book) %>% 
    count(word, sort = TRUE) %>% 
    top_n(15, wt = n) %>% 
    ungroup() %>% 
    mutate(word = reorder(word, n)) %>%
    ggplot(aes(word, n, fill = book))+
    geom_col(show.legend = FALSE)+
    facet_wrap(~book,scales = "free")+
    coord_flip()+
    labs(
        title = "Top 25 most frequent fear-words",
        subtitle = "",
        x = "", y = ""
    )

unnestedHHGTTG %>% 
    inner_join(sadness) %>%
    group_by(book) %>% 
    count(word, sort = TRUE) %>% 
    top_n(15, wt = n) %>% 
    ungroup() %>% 
    mutate(word = reorder(word, n)) %>%
    ggplot(aes(word, n, fill = book))+
    geom_col(show.legend = FALSE)+
    facet_wrap(~book,scales = "free")+
    coord_flip()+
    labs(
        title = "Top 25 most frequent sad-words",
        subtitle = "",
        x = "", y = ""
    )
```

But we can also look at the sentiment per chapter, the bing corpus has words
divided in positive and negative words. We can count the number of negative and positive
words and check the result.

```{r}
sentiment_chapter <- 
    unnestedHHGTTG %>% 
    inner_join(get_sentiments("bing")) %>%
    count(book, chapter, sentiment) %>%
    spread(sentiment, n, fill = 0) %>%
    mutate(
        sentiment = positive - negative,
        N_sent_words = positive+negative,
        relative_sentiment = sentiment / N_sent_words
        )

sentiment_chapter %>% 
    ggplot(aes(chapter, sentiment, fill = book)) +
    geom_col(show.legend = FALSE) +
    facet_wrap(~book, ncol = 2, scales = "free_x")+
    labs(
        title = "Sentiment per chapter",
        subtitle = "Total sentiment per chapter"
    )

sentiment_chapter %>% 
    ggplot(aes(chapter, relative_sentiment, fill = book)) +
    geom_col(show.legend = FALSE) +
    facet_wrap(~book, ncol = 2, scales = "free_x")+
    labs(
        title = "Sentiment per chapter",
        subtitle = "relative sentiment per chapter"
    )
```


We can also look at the sentiment per word. 

```{r}
bing_word_counts <- unnestedHHGTTG  %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
bing_word_counts %>% head(8)
```


And then 


```{r}
bing_word_counts %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  coord_flip()
```


### using tf-idf

Find the more typical words per book. 

```{r}
book_words <- unnestedHHGTTG %>%
  count(book, word, sort = TRUE) %>%
  bind_tf_idf(word, book, n) %>% 
    arrange(-tf_idf)

total_words <- book_words %>% 
  group_by(book) %>% 
  summarize(total = sum(n))
book_words <- left_join(book_words, total_words)


ggplot(book_words, aes(n/total, fill = book)) +
  geom_histogram(show.legend = FALSE) +
  xlim(NA, 0.0009) +
  facet_wrap(~book, ncol = 2, scales = "free_y")
book_words
```

```{r}
book_words %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>% 
  group_by(book) %>% 
  top_n(15,wt = tf_idf) %>% 
  ungroup() %>%
  ggplot(aes(word, tf_idf, fill = book)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~book, ncol = 2, scales = "free") +
  coord_flip()
```


.