diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..5b6a065 --- /dev/null +++ b/.gitignore @@ -0,0 +1,4 @@ +.Rproj.user +.Rhistory +.RData +.Ruserdata diff --git a/ch-02.Rmd b/ch-02.Rmd new file mode 100644 index 0000000..df420a5 --- /dev/null +++ b/ch-02.Rmd @@ -0,0 +1,364 @@ +--- +title: "Solutions to Chapter 2 Exercises" +author: "Howard Baek" +date: "Last compiled on `r format(Sys.time(), '%B %d, %Y')`" +output: html_document +editor_options: + chunk_output_type: console +--- + +```{r setup, include=FALSE} +knitr::opts_chunk$set(echo = TRUE) +``` + +```{r, include=FALSE} +library(tidyverse) +``` + + +### 2.2.1 Exercises + +
+ +**1. List five functions that you could use to get more information about the `mpg` dataset.** + +- `help(mpg)`: Documentation of dataset +- `dim(mpg)`: Dimensions of dataset +- `summary(mpg)`: Summary measures of dataset +- `str(mpg)`: Display of the internal structure of dataset +- `glimpse(mpg)`: `dplyr` version of `str(mpg)` + +
+ +**2. How can you find out what other datasets are included with ggplot2?** + +- `data(package = "ggplot2")` loads the available data sets in ggplot2. Alternatively,if you have internet access, go to https://ggplot2.tidyverse.org/reference/index.html#section-data + +**3. Apart from the US, most countries use fuel consumption (fuel consumed over fixed distance) rather than fuel economy (distance travelled with fixed amount of fuel). How could you convert cty and hwy into the European standard of l/100km?** + +- According to [asknumbers](https://www.asknumbers.com/mpg-to-L100km.aspx), you divide 235.214583 by the mpg values in `cty` and `hwy` to convert them into the European standard of l/100km. + +- Function to convert into European standard (Rademaker, 2016): + +```{r} +mpgTol100km <- function(milespergallon){ + + GalloLiter <- 3.785411784 + MileKilometer <- 1.609344 + + l100km <- (100*GalloLiter)/(milespergallon*MileKilometer) + l100km + +} +``` + + +**4. Which manufacturer has the most models in this dataset? Which model has the most variations? Does your answer change if you remove the redundant specification of drive train (e.g. "pathfinder 4wd", "a4 quattro") from the model name?** + +```{r} +# Count manufacturers and sort +mpg %>% count(manufacturer, sort = TRUE) +``` + +- `dodge` has the most models in this dataset. + + +```{r} +unique(mpg$model) +``` + +- (Rademaker, 2016) The `a4` and `camry` both have a second model (the `a4 quattro` and the `camry solar`) + + +```{r} +# Remove redundant information (Rademaker, 2016) +str_trim(str_replace_all(unique(mpg$model), c("quattro" = "", "4wd" = "", + "2wd" = "", "awd" = ""))) +``` + + + + +### 2.3.1 Exercises + +**1. How would you describe the relationship between `cty` and `hwy`? 
Do you have any concerns about drawing conclusions from that plot?** + +```{r} +mpg %>% + ggplot(aes(cty, hwy)) + + geom_point() +``` + +- The plot shows a strongly linear relationship, which tells me that `cty` and `hwy` are highly correlated variables. The only concern I have is that the points seem to be overlapping. +- There is not much insight to be gained except that cars which are fuel efficient on a highway are also fuel efficient in cities. This relationship is probably a function of speed (Rademaker, 2016) + + + +**2. What does `ggplot(mpg, aes(model, manufacturer)) + geom_point()` show? Is it useful? How could you modify the data to make it more informative?** + +```{r} +ggplot(mpg, aes(model, manufacturer)) + + geom_point() +``` + +- The plot shows the manufacturer of each model. Its not very readable since there are too many models and this clutters up the x-axis with too many ticks! I would just plot 20 or so models so that the graph is more readable. See below: + +```{r} +mpg %>% + head(25) %>% + ggplot(aes(model, manufacturer)) + + geom_point() +``` + +- A possible alternative would be to look total number of observations for each manufacturer-model combination using geom_bar(). (Rademaker, 2016) + +```{r} +df <- mpg %>% + transmute("man_mod" = paste(manufacturer, model, sep = " ")) + + +ggplot(df, aes(man_mod)) + + geom_bar() + + coord_flip() +``` + + + + +**3. Describe the data, aesthetic mappings and layers used for each of the following plots. You'll need to guess a little because you haven't seen all the datasets and functions yet, but use your common sense! See if you can predict what the plot will look like before running the code.** + +1. `ggplot(mpg, aes(cty, hwy)) + geom_point()` + +- *Data*: `mpg` + +- *Aesthetic*: highway miles per gallon is mapped to y position and city miles per gallon is mapped to x position. + +- *Layer*: points + +2. 
`ggplot(diamonds, aes(carat, price)) + geom_point()` + +- *Data*: `diamonds` + +- *Aesthetic*: price in US dollars is mapped to y position, weight of the diamond is mapped to x position. + +- *Layer*: points + +3. `ggplot(economics, aes(date, unemploy)) + geom_line()` + +- *Data*: `economics` + +- *Aesthetic*: median duration of unemployment, in weeks, is mapped to y position and month of data collection is mapped to x position. + +- *Layer*: line + +(Rademaker, 2016) Alternatively, you can always access plot info using summary() as in e.g. +```{r} +# summary() +summary(ggplot(economics, aes(date, unemploy)) + geom_line()) +``` + + + +### 2.4.1 Exercises + +**1. Experiment with the colour, shape and size aesthetics. What happens when you map them to continuous values? What about categorical values? What happens when you use more than one aesthetic in a plot?** + +```{r} +# Map color to continuous value +mpg %>% + ggplot(aes(cty, hwy, color = displ)) + + geom_point() + +# Map color to categorical value +mpg %>% + ggplot(aes(cty, hwy, color = trans)) + + geom_point() + +# Use more than one aesthetic in a plot +mpg %>% + ggplot(aes(cty, hwy, color = trans, size = trans)) + + geom_point() +``` + +**2. What happens if you map a continuous variable to shape? Why? What happens if you map trans to shape? Why?** + +```{r, eval = FALSE} +mpg %>% + ggplot(aes(cty, hwy, shape = displ)) + + geom_point() +``` + +- I can not map a continuous variable to shape and I get an error message: `Error: A continuous variable can not be mapped to shape` + +```{r} +mpg %>% + ggplot(aes(cty, hwy, shape = trans)) + + geom_point() +``` + +- I get an warning message that tells me the shape palette can only deal with 6 discrete values. + + +**3. How is drive train related to fuel economy? 
How is drive train related to engine size and class?** + +```{r} +mpg %>% + ggplot(aes(drv, cty)) + + geom_col() + +mpg %>% + ggplot(aes(drv, hwy)) + + geom_col() +``` + +- Front-wheel drive has the best fuel economy, then 4wd, then rear wheel drive. + +```{r} +mpg %>% + ggplot(aes(drv, displ, fill = class)) + + geom_col(position = "dodge") +``` + +- 4wd has biggest engine size, then front-wheel, then rear wheel drive. Out of all 4wd, suvs have biggest engine size. Out of all front-wheel drive, midsize has biggest engine size. Out of all rear wheel drive, 2 seater has biggest engine size. + + +### 2.5.1 Exercises + +**1. What happens if you try to facet by a continuous variable like hwy? What about cyl? What's the key difference?** + +```{r} +mpg %>% + ggplot(aes(drv, displ, fill = class)) + + geom_col(position = "dodge") + + facet_wrap(~hwy) + +mpg %>% + ggplot(aes(drv, displ, fill = class)) + + geom_col(position = "dodge") + + facet_wrap(~cyl) +``` + +- The key difference is `hwy` is a continuous variable that has 27 unique values, so you get 27 different subsets. However, `cly` is a categorical variable and has 4 unique values, so `cyl` only has 4 different subsets. It is less cluttered when you try to facet. +- (Rademaker, 2016) Facetting by a continous variable works but becomes hard to read and interpret when the variable that we facet by has to many levels. + +**2. Use faceting to explore the 3-way relationship between fuel economy, engine size, and number of cylinders. How does faceting by number of cylinders change your assessement of the relationship between engine size and fuel economy?** + +```{r} +mpg %>% + ggplot(aes(displ, cty)) + + geom_point() + + +mpg %>% + ggplot(aes(displ, cty)) + + geom_point() + + facet_wrap(~cyl) +``` + +- When I initially plot engine size and fuel economy, I see an overall decreasing linear relationship. Upon faceting, I see that the decreasing relationship is mostly seen in the 4 cylinder subset. 
In the other cylinder subsets, we see a flat relationship - as engine displacement increases, fuel economy remains constant. + +**3. Read the documentation for `facet_wrap()`. What arguments can you use to control how many rows and columns appear in the output?** + +- I can use the arguments `nrow, ncol` to control how many rows and columns appear in the output. + +**4. What does the `scales` argument to `facet_wrap()` do? When might you use it?** + +- It allows users to decide whether scales should be fixed. I would use it whenever different subsets of the data are on vastly different scales. +- (Rademaker, 2016) If we want to compare across facets, `scales = "fixed"` is more appropriate. If our focus is on individual patterns within each facet, setting `scales = "free"` might be more approriate. + + +### 2.6.6 Exercises + +**1. What's the problem with the plot created by `ggplot(mpg, aes(cty, hwy)) + geom_point()`? Which of the geoms described above is most effective at remedying the problem?** + +```{r} +ggplot(mpg, aes(cty, hwy)) + + geom_point() +``` + +- The problem is overplotting. + +1. Use `geom_jitter` to add random noise to the data and avoid overplotting. + +```{r} +ggplot(mpg, aes(cty, hwy)) + + geom_jitter() +``` + +2. (Rademaker, 2016) Set opacity with `alpha` + +```{r} +ggplot(mpg, aes(cty, hwy)) + + geom_point(alpha = 0.3) +``` + + +**2. One challenge with `ggplot(mpg, aes(class, hwy)) + geom_boxplot()` is that the ordering of `class` is alphabetical, which is not terribly useful. How could you change the factor levels to be more informative?** + + +```{r} +mpg %>% + mutate(class = factor(class), + class = fct_reorder(class, hwy)) %>% + ggplot(aes(class, hwy)) + + geom_boxplot() +``` + +- We could convert `class` to a factor and reorder it by `hwy` + +**3. Explore the distribution of the carat variable in the `diamonds` dataset. 
What binwidth reveals the most interesting patterns?** + +```{r} +diamonds %>% + ggplot(aes(carat)) + + geom_histogram(binwidth = 0.3) +``` + +- This is a subjective answer, but binwidth of 0.2 or 0.3 reveals that the distribution of `carat` is heavily skewed to the right. This means that most diamonds carats are between 0 and 1. + +**4. Explore the distribution of the price variable in the `diamonds` data. How does the distribution vary by cut?** + +```{r} +diamonds %>% + ggplot(aes(price)) + + geom_histogram() + +diamonds %>% + mutate(cut = fct_reorder(cut, price)) %>% + ggplot(aes(cut, price)) + + geom_boxplot() + +ggplot(diamonds, aes(x = price, y =..density.., color = cut)) + + geom_freqpoly(binwidth = 200) +``` + +- (Rademaker, 2016) Fair quality diamonds are more expensive then others. Possible reason is they are bigger. + +**5. You now know (at least) three ways to compare the distributions of subgroups: `geom_violin()`, `geom_freqpoly()` and the colour aesthetic, or `geom_histogram()` and faceting. What are the strengths and weaknesses of each approach? What other approaches could you try?** + +- According to the book, `geom_violin()` shows a compact representation of the "density" of the distribution, highlighting the areas where more points are found. Its weakness is that violin plos rely on the calculation of a density estimate, which is hard to interpret. + +- According to the book, `geom_freqploy()` bins the data, then counts the number of observations in each bin using lines. One possible weakness is that you have to select the width of the bins yourself by experimentation. + +- According to the book, `geom_histogram()` and faceting makes it easier to see the distribution of each group, but makes comparisons between groups a little harder. + +**6. Read the documentation for `geom_bar()`. 
What does the `weight` aesthetic do?** + +```{r, eval = FALSE} +?geom_bar() +``` + +- The `weight` aesthetic converts the number of cases to a weight and makes the height of the bar proportional to the sum of the weights. See below: + +```{r} +g <- ggplot(mpg, aes(class)) + +# Number of cars in each class: +g + geom_bar() +# Total engine displacement of each class +g + geom_bar(aes(weight = displ)) +``` + +**7. Using the techniques already discussed in this chapter, come up with three ways to visualize a 2d categorical distribution. Try them out by visualising the distribution of `model` and `manufacturer`, `trans` and `class`, and `cyl` and `trans`.** + +- NA diff --git a/ch-02.html b/ch-02.html new file mode 100644 index 0000000..b0d704e --- /dev/null +++ b/ch-02.html @@ -0,0 +1,571 @@ + + + + + + + + + + + + + + +Solutions to Chapter 2 Exercises + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + + + + +
+

2.2.1 Exercises

+


+

1. List five functions that you could use to get more information about the mpg dataset.

+
    +
  • help(mpg): Documentation of dataset
  • +
  • dim(mpg): Dimensions of dataset
  • +
  • summary(mpg): Summary measures of dataset
  • +
  • str(mpg): Display of the internal structure of dataset
  • +
  • glimpse(mpg): dplyr version of str(mpg)
  • +
+


+

2. How can you find out what other datasets are included with ggplot2?

+ +

3. Apart from the US, most countries use fuel consumption (fuel consumed over fixed distance) rather than fuel economy (distance travelled with fixed amount of fuel). How could you convert cty and hwy into the European standard of l/100km?

+
    +
  • According to asknumbers, you divide 235.214583 by the mpg values in cty and hwy to convert them into the European standard of l/100km.

  • +
  • Function to convert into European standard (Rademaker, 2016):

  • +
+
mpgTol100km <- function(milespergallon){
+  
+  GalloLiter <- 3.785411784
+  MileKilometer <- 1.609344 
+  
+  l100km <- (100*GalloLiter)/(milespergallon*MileKilometer)
+  l100km
+  
+}
+
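As a quick sanity check, the function above should agree with the single-constant rule quoted from asknumbers. A minimal sketch in base R; the function body is copied from the block above, and the test inputs (10, 20, 30 mpg) are arbitrary values chosen for illustration:

```r
# Conversion function as defined above (Rademaker, 2016)
mpgTol100km <- function(milespergallon) {
  GalloLiter <- 3.785411784     # litres per US gallon
  MileKilometer <- 1.609344     # kilometres per mile
  (100 * GalloLiter) / (milespergallon * MileKilometer)
}

mpg_values <- c(10, 20, 30)     # arbitrary example inputs

# The two routes should give the same l/100km figures
via_function <- mpgTol100km(mpg_values)
via_constant <- 235.214583 / mpg_values
all.equal(via_function, via_constant, tolerance = 1e-6)
```

With the tidyverse loaded, the same function could be applied to a column, for example `mutate(mpg, cty_l100km = mpgTol100km(cty))`, where `cty_l100km` is a hypothetical column name.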

4. Which manufacturer has the most models in this dataset? Which model has the most variations? Does your answer change if you remove the redundant specification of drive train (e.g. “pathfinder 4wd”, “a4 quattro”) from the model name?

+
# Count manufacturers and sort
+mpg %>% count(manufacturer, sort = TRUE)
+
## # A tibble: 15 × 2
+##    manufacturer     n
+##    <chr>        <int>
+##  1 dodge           37
+##  2 toyota          34
+##  3 volkswagen      27
+##  4 ford            25
+##  5 chevrolet       19
+##  6 audi            18
+##  7 hyundai         14
+##  8 subaru          14
+##  9 nissan          13
+## 10 honda            9
+## 11 jeep             8
+## 12 pontiac          5
+## 13 land rover       4
+## 14 mercury          4
+## 15 lincoln          3
+
    +
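  • dodge has the most observations in this dataset (37 rows). Note that count() tallies rows rather than distinct models; going by the unique(mpg$model) listing below, toyota has the most distinct models (6).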
  • +
+
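Counting rows per manufacturer is not the same as counting distinct models. A hedged sketch of the latter, using base R on the `mpg` data that ships with ggplot2 (the variable name `models_per_make` is mine):

```r
library(ggplot2)  # provides the mpg dataset

# Number of distinct model names per manufacturer
models_per_make <- tapply(mpg$model, mpg$manufacturer,
                          function(m) length(unique(m)))

# Largest first
sort(models_per_make, decreasing = TRUE)
```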
unique(mpg$model)
+
##  [1] "a4"                     "a4 quattro"             "a6 quattro"            
+##  [4] "c1500 suburban 2wd"     "corvette"               "k1500 tahoe 4wd"       
+##  [7] "malibu"                 "caravan 2wd"            "dakota pickup 4wd"     
+## [10] "durango 4wd"            "ram 1500 pickup 4wd"    "expedition 2wd"        
+## [13] "explorer 4wd"           "f150 pickup 4wd"        "mustang"               
+## [16] "civic"                  "sonata"                 "tiburon"               
+## [19] "grand cherokee 4wd"     "range rover"            "navigator 2wd"         
+## [22] "mountaineer 4wd"        "altima"                 "maxima"                
+## [25] "pathfinder 4wd"         "grand prix"             "forester awd"          
+## [28] "impreza awd"            "4runner 4wd"            "camry"                 
+## [31] "camry solara"           "corolla"                "land cruiser wagon 4wd"
+## [34] "toyota tacoma 4wd"      "gti"                    "jetta"                 
+## [37] "new beetle"             "passat"
+
    +
  • (Rademaker, 2016) The a4 and camry both have a second model (the a4 quattro and the camry solara).
  • +
+
# Remove redundant information (Rademaker, 2016)
+str_trim(str_replace_all(unique(mpg$model), c("quattro" = "", "4wd" = "", 
+                                     "2wd" = "", "awd" = "")))
+
##  [1] "a4"                 "a4"                 "a6"                
+##  [4] "c1500 suburban"     "corvette"           "k1500 tahoe"       
+##  [7] "malibu"             "caravan"            "dakota pickup"     
+## [10] "durango"            "ram 1500 pickup"    "expedition"        
+## [13] "explorer"           "f150 pickup"        "mustang"           
+## [16] "civic"              "sonata"             "tiburon"           
+## [19] "grand cherokee"     "range rover"        "navigator"         
+## [22] "mountaineer"        "altima"             "maxima"            
+## [25] "pathfinder"         "grand prix"         "forester"          
+## [28] "impreza"            "4runner"            "camry"             
+## [31] "camry solara"       "corolla"            "land cruiser wagon"
+## [34] "toyota tacoma"      "gti"                "jetta"             
+## [37] "new beetle"         "passat"
+
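To see whether the answer changes, the stripped names can be tallied. A minimal base R sketch on a hand-picked subset of the model names printed above; the subset and the single regular expression are my simplifications of the `str_replace_all()` call:

```r
# A few model names taken from the unique(mpg$model) output above
models <- c("a4", "a4 quattro", "a6 quattro",
            "camry", "camry solara", "pathfinder 4wd")

# Drop the drive-train suffixes, then trim leftover whitespace
stripped <- trimws(gsub("quattro|4wd|2wd|awd", "", models))

# How many original names collapse onto each stripped name
table(stripped)
```

On this subset, `a4` and `a4 quattro` merge into one name, while `camry` and `camry solara` stay separate because "solara" is not a drive-train label.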
+
+

2.3.1 Exercises

+

1. How would you describe the relationship between cty and hwy? Do you have any concerns about drawing conclusions from that plot?

+
mpg %>% 
+  ggplot(aes(cty, hwy)) +
+  geom_point()
+

+
    +
  • The plot shows a strongly linear relationship, which tells me that cty and hwy are highly correlated variables. The only concern I have is that the points seem to be overlapping.
  • +
  • There is not much insight to be gained except that cars which are fuel efficient on a highway are also fuel efficient in cities. This relationship is probably a function of speed (Rademaker, 2016)
  • +
+
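Both observations can be put into numbers. A small sketch, assuming ggplot2 is installed so that `mpg` is available: the correlation quantifies the linear relationship, and counting repeated (cty, hwy) pairs shows how many points sit exactly on top of another one.

```r
library(ggplot2)  # provides the mpg dataset

# Strength of the linear relationship between city and highway mileage
cor(mpg$cty, mpg$hwy)

# Rows whose (cty, hwy) pair already occurred earlier in the data;
# these points are hidden behind others in the scatterplot
sum(duplicated(mpg[, c("cty", "hwy")]))
```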

2. What does ggplot(mpg, aes(model, manufacturer)) + geom_point() show? Is it useful? How could you modify the data to make it more informative?

+
ggplot(mpg, aes(model, manufacturer)) +
+  geom_point()
+

+
    +
  • The plot shows the manufacturer of each model. It's not very readable because there are too many models, which clutters the x-axis with too many ticks. I would plot only a subset of the data (here the first 25 rows) so that the graph is more readable. See below:
  • +
+
mpg %>% 
+  head(25) %>% 
+  ggplot(aes(model, manufacturer)) +
+  geom_point()
+

+
    +
  • A possible alternative would be to look at the total number of observations for each manufacturer-model combination using geom_bar(). (Rademaker, 2016)
  • +
+
df <- mpg %>% 
+  transmute("man_mod" = paste(manufacturer, model, sep = " "))
+
+
+ggplot(df, aes(man_mod)) +
+  geom_bar() + 
+  coord_flip()
+

+

3. Describe the data, aesthetic mappings and layers used for each of the following plots. You’ll need to guess a little because you haven’t seen all the datasets and functions yet, but use your common sense! See if you can predict what the plot will look like before running the code.

+
    +
  1. ggplot(mpg, aes(cty, hwy)) + geom_point()
  2. +
+
    +
  • Data: mpg

  • +
  • Aesthetic: highway miles per gallon is mapped to y position and city miles per gallon is mapped to x position.

  • +
  • Layer: points

  • +
+
    +
  1. ggplot(diamonds, aes(carat, price)) + geom_point()
  2. +
+
    +
  • Data: diamonds

  • +
  • Aesthetic: price in US dollars is mapped to y position, weight of the diamond is mapped to x position.

  • +
  • Layer: points

  • +
+
    +
  1. ggplot(economics, aes(date, unemploy)) + geom_line()
  2. +
+
    +
  • Data: economics

  • +
  • Aesthetic: number of unemployed, in thousands, is mapped to y position and month of data collection is mapped to x position.

  • +
  • Layer: line

  • +
+

(Rademaker, 2016) Alternatively, you can always access plot info using summary(), for example:

+
# summary(<plot>)
+summary(ggplot(economics, aes(date, unemploy)) + geom_line())
+
## data: date, pce, pop, psavert, uempmed, unemploy [574x6]
+## mapping:  x = ~date, y = ~unemploy
+## faceting: <ggproto object: Class FacetNull, Facet, gg>
+##     compute_layout: function
+##     draw_back: function
+##     draw_front: function
+##     draw_labels: function
+##     draw_panels: function
+##     finish_data: function
+##     init_scales: function
+##     map_data: function
+##     params: list
+##     setup_data: function
+##     setup_params: function
+##     shrink: TRUE
+##     train_scales: function
+##     vars: function
+##     super:  <ggproto object: Class FacetNull, Facet, gg>
+## -----------------------------------
+## geom_line: na.rm = FALSE, orientation = NA
+## stat_identity: na.rm = FALSE
+## position_identity
+
+
+

2.4.1 Exercises

+

1. Experiment with the colour, shape and size aesthetics. What happens when you map them to continuous values? What about categorical values? What happens when you use more than one aesthetic in a plot?

+
# Map color to continuous value
+mpg %>% 
+  ggplot(aes(cty, hwy, color = displ)) +
+  geom_point()
+

+
# Map color to categorical value
+mpg %>% 
+  ggplot(aes(cty, hwy, color = trans)) +
+  geom_point()
+

+
# Use more than one aesthetic in a plot
+mpg %>% 
+  ggplot(aes(cty, hwy, color = trans, size = trans)) +
+  geom_point()
+
## Warning: Using size for a discrete variable is not advised.
+

+

2. What happens if you map a continuous variable to shape? Why? What happens if you map trans to shape? Why?

+
mpg %>% 
+  ggplot(aes(cty, hwy, shape = displ)) +
+  geom_point()
+
    +
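  • I cannot map a continuous variable to shape; ggplot2 stops with an error: Error: A continuous variable can not be mapped to shape. Shapes have no natural ordering, so there is no sensible way to spread a continuous range across them.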
  • +
+
mpg %>% 
+  ggplot(aes(cty, hwy, shape = trans)) +
+  geom_point()
+
## Warning: The shape palette can deal with a maximum of 6 discrete values because
+## more than 6 becomes difficult to discriminate; you have 10. Consider
+## specifying shapes manually if you must have them.
+
## Warning: Removed 96 rows containing missing values (geom_point).
+

+
    +
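  • I get a warning message telling me that the shape palette can deal with a maximum of 6 discrete values. trans has 10, so the rows belonging to the remaining levels are dropped from the plot (96 rows here).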
  • +
+
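If all ten transmission types really must be shown by shape, the warning's suggestion of specifying shapes manually can be followed. A hedged sketch using `scale_shape_manual()`; the choice of shape codes 1 to 10 is arbitrary:

```r
library(ggplot2)

n_trans <- length(unique(mpg$trans))  # number of transmission types

# Supply one shape code per level so that no rows are dropped
p <- ggplot(mpg, aes(cty, hwy, shape = trans)) +
  geom_point() +
  scale_shape_manual(values = seq_len(n_trans))
p
```

Ten shapes are still hard to tell apart, which is the point of the warning; colour or faceting is usually easier to read.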

3. How is drive train related to fuel economy? How is drive train related to engine size and class?

+
mpg %>% 
+  ggplot(aes(drv, cty)) +
+  geom_col()
+

+
mpg %>% 
+  ggplot(aes(drv, hwy)) +
+  geom_col()
+

+
    +
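  • Front-wheel drive has the best fuel economy; 4wd and rear-wheel drive are clearly lower and close to each other. Because geom_col() stacks one segment per car, the bar heights are totals that also depend on how many cars each drive train has, so rear-wheel drive looks lowest largely because it has the fewest cars.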
  • +
+
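A minimal sketch of comparing group averages rather than summed bar heights, using base R's `aggregate()` on the `mpg` data (the name `mean_by_drv` is mine):

```r
library(ggplot2)  # provides the mpg dataset

# Mean city and highway mileage for each drive train
mean_by_drv <- aggregate(cbind(cty, hwy) ~ drv, data = mpg, FUN = mean)
mean_by_drv

# Group sizes, for context when reading the stacked bars
table(mpg$drv)
```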
mpg %>% 
+  ggplot(aes(drv, displ, fill = class)) +
+  geom_col(position = "dodge")
+

+
    +
  • Judging by typical displacement rather than bar heights, rear-wheel drive has the biggest engines, then 4wd, then front-wheel drive. Out of all 4wd, suvs have the biggest engine size. Out of all front-wheel drive, midsize has the biggest engine size. Out of all rear-wheel drive, 2seater has the biggest engine size.
  • +
+
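The bars above do not show how engine size is distributed within each group. A boxplot is one hedged alternative that shows the whole distribution of displacement per drive train, optionally split by class through faceting (the object names are mine):

```r
library(ggplot2)

# Distribution of engine displacement for each drive train
p_drv <- ggplot(mpg, aes(drv, displ)) +
  geom_boxplot()
p_drv

# Same comparison within each vehicle class
p_drv_class <- p_drv + facet_wrap(~class)
p_drv_class
```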
+
+

2.5.1 Exercises

+

1. What happens if you try to facet by a continuous variable like hwy? What about cyl? What’s the key difference?

+
mpg %>% 
+  ggplot(aes(drv, displ, fill = class)) +
+  geom_col(position = "dodge") +
+  facet_wrap(~hwy)
+

+
mpg %>% 
+  ggplot(aes(drv, displ, fill = class)) +
+  geom_col(position = "dodge") +
+  facet_wrap(~cyl)
+

+
    +
  • The key difference is that hwy is a continuous variable with 27 unique values, so you get 27 facets. cyl, by contrast, has only 4 unique values, so it yields just 4 facets and the plot is far less cluttered.
  • +
  • (Rademaker, 2016) Faceting by a continuous variable works but becomes hard to read and interpret when the variable that we facet by has too many levels.
  • +
+
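One way to keep a continuous variable usable for faceting is to bin it first. A sketch using ggplot2's `cut_number()` helper to split `hwy` into four groups of roughly equal size; the number of bins and the names `mpg_binned` and `hwy_bin` are arbitrary choices of mine:

```r
library(ggplot2)

# Copy the data and add a binned version of hwy
mpg_binned <- as.data.frame(mpg)
mpg_binned$hwy_bin <- cut_number(mpg_binned$hwy, n = 4)

# Four facets instead of one per unique hwy value
p <- ggplot(mpg_binned, aes(displ, cty)) +
  geom_point() +
  facet_wrap(~hwy_bin)
p
```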

2. Use faceting to explore the 3-way relationship between fuel economy, engine size, and number of cylinders. How does faceting by number of cylinders change your assessment of the relationship between engine size and fuel economy?

+
mpg %>% 
+  ggplot(aes(displ, cty)) +
+  geom_point()
+

+
mpg %>% 
+  ggplot(aes(displ, cty)) +
+  geom_point() +
+  facet_wrap(~cyl)
+

+
    +
  • When I initially plot engine size and fuel economy, I see an overall decreasing linear relationship. Upon faceting, I see that the decreasing relationship is mostly seen in the 4 cylinder subset. In the other cylinder subsets, we see a flat relationship - as engine displacement increases, fuel economy remains constant.
  • +
+

3. Read the documentation for facet_wrap(). What arguments can you use to control how many rows and columns appear in the output?

+
    +
  • I can use the arguments nrow, ncol to control how many rows and columns appear in the output.
  • +
+

4. What does the scales argument to facet_wrap() do? When might you use it?

+
    +
  • It allows users to decide whether scales should be fixed. I would use it whenever different subsets of the data are on vastly different scales.
  • +
  • (Rademaker, 2016) If we want to compare across facets, scales = "fixed" is more appropriate. If our focus is on individual patterns within each facet, setting scales = "free" might be more appropriate.
  • +
+
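A brief sketch of the `scales` argument, reusing the `displ` versus `cty` plot from earlier in this section. `"free_y"` lets each panel choose its own y-axis range while keeping a shared x-axis; `"fixed"` is the default. The object names are mine:

```r
library(ggplot2)

base <- ggplot(mpg, aes(displ, cty)) +
  geom_point()

# Shared axes: easy to compare values across panels
p_fixed <- base + facet_wrap(~cyl, scales = "fixed")

# Independent y-axes: easier to see the pattern inside each panel
p_free <- base + facet_wrap(~cyl, scales = "free_y")

p_fixed
p_free
```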
+
+

2.6.6 Exercises

+

1. What’s the problem with the plot created by ggplot(mpg, aes(cty, hwy)) + geom_point()? Which of the geoms described above is most effective at remedying the problem?

+
ggplot(mpg, aes(cty, hwy)) + 
+  geom_point()
+

+
    +
  • The problem is overplotting.
  • +
+
    +
  1. Use geom_jitter to add random noise to the data and avoid overplotting.
  2. +
+
ggplot(mpg, aes(cty, hwy)) + 
+  geom_jitter()
+

+
    +
  1. (Rademaker, 2016) Set opacity with alpha
  2. +
+
ggplot(mpg, aes(cty, hwy)) +
+  geom_point(alpha = 0.3)
+

+

2. One challenge with ggplot(mpg, aes(class, hwy)) + geom_boxplot() is that the ordering of class is alphabetical, which is not terribly useful. How could you change the factor levels to be more informative?

+
mpg %>% 
+  mutate(class = factor(class),
+         class = fct_reorder(class, hwy)) %>% 
+  ggplot(aes(class, hwy)) +
+  geom_boxplot()
+

+
    +
  • We could convert class to a factor and reorder it by hwy
  • +
+

3. Explore the distribution of the carat variable in the diamonds dataset. What binwidth reveals the most interesting patterns?

+
diamonds %>% 
+  ggplot(aes(carat)) +
+  geom_histogram(binwidth = 0.3)
+

+
    +
  • This is a subjective answer, but a binwidth of 0.2 or 0.3 reveals that the distribution of carat is heavily skewed to the right: most diamonds weigh between 0 and 1 carat.
  • +
+
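A much smaller binwidth is worth trying as well. The sketch below uses 0.01, an arbitrary choice of mine, and tabulates the most frequent carat values so that any clustering seen in the histogram can be checked against the raw counts:

```r
library(ggplot2)  # provides the diamonds dataset

# Fine-grained histogram of carat
p <- ggplot(diamonds, aes(carat)) +
  geom_histogram(binwidth = 0.01)
p

# The six most common carat values in the data
head(sort(table(diamonds$carat), decreasing = TRUE))
```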

4. Explore the distribution of the price variable in the diamonds data. How does the distribution vary by cut?

+
diamonds %>% 
+  ggplot(aes(price)) +
+  geom_histogram()
+
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
+

+
diamonds %>% 
+  mutate(cut = fct_reorder(cut, price)) %>% 
+  ggplot(aes(cut, price)) +
+  geom_boxplot()
+

+
ggplot(diamonds, aes(x = price, y =..density.., color = cut)) +
+  geom_freqpoly(binwidth = 200)
+

+
    +
  • (Rademaker, 2016) Fair quality diamonds are more expensive than others. A possible reason is that they are bigger.
  • +
+
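The size explanation can be checked directly. A hedged sketch comparing the median carat and median price per cut with base R's `aggregate()` (the name `med_by_cut` is mine):

```r
library(ggplot2)  # provides the diamonds dataset

# Median weight and price for each cut quality
med_by_cut <- aggregate(cbind(carat, price) ~ cut, data = diamonds,
                        FUN = median)
med_by_cut
```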

5. You now know (at least) three ways to compare the distributions of subgroups: geom_violin(), geom_freqpoly() and the colour aesthetic, or geom_histogram() and faceting. What are the strengths and weaknesses of each approach? What other approaches could you try?

+
    +
  • According to the book, geom_violin() shows a compact representation of the “density” of the distribution, highlighting the areas where more points are found. Its weakness is that violin plots rely on the calculation of a density estimate, which is hard to interpret.

  • +
  • According to the book, geom_freqpoly() bins the data, then counts the number of observations in each bin and displays the counts with lines. One possible weakness is that you have to select the width of the bins yourself by experimentation.

  • +
  • According to the book, geom_histogram() and faceting makes it easier to see the distribution of each group, but makes comparisons between groups a little harder.

  • +
+

6. Read the documentation for geom_bar(). What does the weight aesthetic do?

+
?geom_bar()
+
    +
  • The weight aesthetic converts the number of cases to a weight and makes the height of the bar proportional to the sum of the weights. See below:
  • +
+
g <- ggplot(mpg, aes(class))
+
+# Number of cars in each class:
+g + geom_bar()
+

+
# Total engine displacement of each class
+g + geom_bar(aes(weight = displ))
+

+
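To confirm that the weighted bar heights are sums, the values ggplot2 computes for the layer can be compared with a direct calculation. A minimal sketch using `layer_data()`; the variable names are mine:

```r
library(ggplot2)

p <- ggplot(mpg, aes(class)) +
  geom_bar(aes(weight = displ))

# Bar heights as computed by the stat behind geom_bar()
bar_heights <- layer_data(p)$count

# Total engine displacement per class, computed directly
totals <- tapply(mpg$displ, mpg$class, sum)

all.equal(as.numeric(bar_heights), as.numeric(totals))
```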

7. Using the techniques already discussed in this chapter, come up with three ways to visualize a 2d categorical distribution. Try them out by visualising the distribution of model and manufacturer, trans and class, and cyl and trans.

+
    +
  • NA
  • +
+
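No answer is given above. As a starting point only, here is a hedged sketch of three options for `trans` and `class`: jittered points with transparency, `geom_count()` (which sizes each point by the number of observations at that location), and a bar chart of one variable faceted by the other. `geom_count()` may not have been introduced at this point in the book, so treat it as my suggestion rather than the intended solution.

```r
library(ggplot2)

# 1. Jitter plus transparency to separate overlapping points
p_jitter <- ggplot(mpg, aes(trans, class)) +
  geom_jitter(width = 0.2, height = 0.2, alpha = 0.4)

# 2. One point per combination, sized by its count
p_count <- ggplot(mpg, aes(trans, class)) +
  geom_count()

# 3. Bar chart of one variable, faceted by the other
p_facet <- ggplot(mpg, aes(class)) +
  geom_bar() +
  facet_wrap(~trans) +
  coord_flip()

p_jitter
p_count
p_facet
```

The other variable pairs only require changing the names inside `aes()` and `facet_wrap()`.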
+ + + + +
+ + + + + + + + + + + + + + + diff --git a/ch-03.Rmd b/ch-03.Rmd new file mode 100644 index 0000000..fda1ba9 --- /dev/null +++ b/ch-03.Rmd @@ -0,0 +1,87 @@ +--- +title: "Solutions to Chapter 3 Exercises" +author: "Howard Baek" +date: "Last compiled on `r format(Sys.time(), '%B %d, %Y')`" +output: html_document +editor_options: + chunk_output_type: console +--- + +```{r setup, include=FALSE} +knitr::opts_chunk$set(echo = TRUE) +``` + + +```{r, include=FALSE} +library(tidyverse) +``` + + +### 3.1.1 Exercises + + +1. What geoms would you use to draw each of the following named plots? + +- Scatterplot: `geom_point()` +- Line chart: `geom_line()` +- Histogram: `geom_histogram()` +- Bar chart: `geom_bar()` +- Pie chart: ggplot2 does not have a geom to draw pie charts. One workaround, according to the [R Graph Gallery](https://www.r-graph-gallery.com/piechart-ggplot2.html) is to build a stacked bar chart with one bar only using the `geom_bar()` function and then make it circular with `coord_polar()` + + +2. What's the difference between `geom_path()` and `geom_polygon()`? + +- `geom_polygon` draws the same graph (lines) as `geom_path`, but it fills these lines with color. See below: + +```{r, include = FALSE} +df <- data.frame( + x = c(3, 1, 5), + y = c(2, 4, 6), + label = c("a","b","c") +) +p <- ggplot(df, aes(x, y, label = label)) + + labs(x = NULL, y = NULL) + # Hide axis label + theme(plot.title = element_text(size = 12)) # Shrink plot title +``` + + +```{r, echo = FALSE} +p + + geom_path() + + ggtitle("geom_path()") +``` + +```{r, echo = FALSE} +p + + geom_polygon() + + ggtitle("geom_polygon()") +``` + + +3. What's the difference between `geom_path()` and `geom_line()` + +`geom_line()` connects points from left to right, whereas `geom_path()` connects points in the order they appear in the data. See below: + +```{r, echo = FALSE} +p + + geom_line() + + ggtitle("geom_line()") +``` + +```{r, echo = FALSE} +p + + geom_path() + + ggtitle("geom_path()") +``` + + +4. 
What low-level geoms are used to draw `geom_smooth()`? What about `geom_boxplot()` and `geom_violin()`? + +(kangnade) + +- `geom_point()`, `geom_path()`, and `geom_area()` are used to draw `geom_smooth()`. +- `geom_rect()`, `geom_line()`, `geom_point()` are used for `geom_boxplot()`. +- `geom_area()` and `geom_path()` are used for `geom_violin()` + + + diff --git a/ch-03.html b/ch-03.html new file mode 100644 index 0000000..09e50ff --- /dev/null +++ b/ch-03.html @@ -0,0 +1,257 @@ + + + + + + + + + + + + + + +Solutions to Chapter 3 Exercises + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + + + + +
+

3.1.1 Exercises

+
    +
  1. What geoms would you use to draw each of the following named plots?
  2. +
+
    +
  • Scatterplot: geom_point()
  • +
  • Line chart: geom_line()
  • +
  • Histogram: geom_histogram()
  • +
  • Bar chart: geom_bar()
  • +
  • Pie chart: ggplot2 does not have a geom to draw pie charts. One workaround, according to the R Graph Gallery is to build a stacked bar chart with one bar only using the geom_bar() function and then make it circular with coord_polar()
  • +
+
    +
  1. What’s the difference between geom_path() and geom_polygon()?
  2. +
+
    +
  • geom_polygon() connects the points in the same order as geom_path(), but it also joins the last point back to the first and fills the enclosed area with color. See below:
  • +
+

+

+
    +
  1. What’s the difference between geom_path() and geom_line()?
  2. +
+

geom_line() connects points from left to right, whereas geom_path() connects points in the order they appear in the data. See below:

+

+

+
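The plots for the last two questions come from chunks whose code is hidden in the rendered page, so here is a minimal sketch of the comparison. It reuses the toy data frame from the setup chunk of ch-03.Rmd; the object names (`base`, `p_path`, `path_x`, and so on) are illustrative, and `layer_data()` is used only to inspect the order in which each layer connects the points.

```r
library(ggplot2)

# Same toy data as the setup chunk in ch-03.Rmd
df <- data.frame(
  x = c(3, 1, 5),
  y = c(2, 4, 6),
  label = c("a", "b", "c")
)
base <- ggplot(df, aes(x, y))

p_path    <- base + geom_path()     # connects points in row order
p_line    <- base + geom_line()     # connects points in order of x
p_polygon <- base + geom_polygon()  # closes the path and fills it

# layer_data() returns the data each layer draws, in drawing order
path_x <- layer_data(p_path)$x
line_x <- layer_data(p_line)$x
```

Printing `p_path`, `p_line`, and `p_polygon` draws the three plots; comparing `path_x` with `line_x` shows the difference in ordering without looking at the graphics.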
    +
  1. What low-level geoms are used to draw geom_smooth()? What about geom_boxplot() and geom_violin()?
  2. +
+

(kangnade)

+
    +
  • geom_path() (the fitted line) and geom_area() (the confidence band, a special case of geom_ribbon()) are used to draw geom_smooth(). By default it draws no points.
  • +
  • geom_rect() (the box), geom_line() (the median and whiskers), and geom_point() (the outliers) are used for geom_boxplot().
  • +
  • geom_area() and geom_path() are used for geom_violin()
  • +
+
+ + + + +
+ + + + + + + + + + + + + + + diff --git a/ch-04.Rmd b/ch-04.Rmd new file mode 100644 index 0000000..0cdc71e --- /dev/null +++ b/ch-04.Rmd @@ -0,0 +1,111 @@ +--- +title: "Solutions to Chapter 4 Exercises" +author: "Howard Baek" +date: "Last compiled on `r format(Sys.time(), '%B %d, %Y')`" +output: html_document +editor_options: + chunk_output_type: console +--- + +```{r setup, include=FALSE} +knitr::opts_chunk$set(echo = TRUE) +``` + + +```{r, include=FALSE} +library(tidyverse) +``` + + +### 4.5 Exercises + +1. Draw a boxplot of `hwy` for each value of `cyl`, without turning `cyl` into a factor. What extra aesthetic do you need to set? + +```{r} +mpg %>% + ggplot(aes(cyl, hwy, group = cyl)) + + geom_boxplot() +``` + +- Since the variable `cyl` is an integer, you need to set `group = cyl`. According to the [ggplot2 docs](https://ggplot2.tidyverse.org/reference/aes_group_order.html), "when no discrete variable is used in the plot, you will need to explicitly define the grouping structure by mapping group to a variable that has a different value for each group." + + +2. Modify the following plot so that you get one boxplot per integer value of `displ`. + +```{r} +ggplot(mpg, aes(displ, cty)) + + geom_boxplot() +``` + +- As discussed in the previous question, you need to set `group = displ` because `displ` is not a discrete variaable. + +```{r} +ggplot(mpg, aes(displ, cty, group = displ)) + + geom_boxplot() +``` + + +3. When illustrating the difference between mapping continuous and discrete colours to a line, the discrete example needed `aes(group = 1)`. Why? What happens if that is omitted? What's the difference between `aes(group = 1)` and `aes(group = 2)`? Why? 
+ + +- Let's examine the example in the book: + +```{r} +df <- data.frame(x = 1:3, y = 1:3, colour = c(1,3,5)) + +ggplot(df, aes(x, y, colour = factor(colour))) + + geom_line(aes(group = 2), size = 2) + + geom_point(size = 5) + +ggplot(df, aes(x, y, colour = colour)) + + geom_line(aes(group = 1), size = 2) + + geom_point(size = 5) +``` + +- When omitted, we don't get a line that connects all these points. In fact, we get a message saying "geom_path: Each group consists of only one observation. Do you need to adjust the group aesthetic?" This happens because we included the colour aesthetic and made each color include only one observation. In order to tell ggplot that all these points are in the same group, we need to include `aes(group = 1)`. It doesn't matter what group is equal to. As long as we include all the points in the same group, we should be able to connect the points with a line. + + + +4. How many bars are in each of the following plots? + +```{r} +ggplot(mpg, aes(drv)) + + geom_bar() +# There are 3 bars in this plot. + + +ggplot(mpg, aes(drv, fill = hwy, group = hwy)) + + geom_bar() +# In this plot, the “shaded bars” for each drv have been constructed by stacking many distinct bars on top of each other, each filled with a different shade based on the value of hwy. + + +mpg2 <- mpg %>% + arrange(hwy) %>% + mutate(id = seq_along(hwy)) + +ggplot(mpg2, aes(drv, fill = hwy, group = id)) + + geom_bar() +# In this plot, the “shaded bars” for each drv have been constructed by stacking many distinct bars on top of each other, each filled with a different shade based on the value of hwy. +``` + + + +5. Install the babynames package. It contains data about the popularity of babynames in the US. Run the following code and fix the resulting graph. Why does this graph make me unhappy? 
+ +```{r} +# install.packages("babynames") +library(babynames) +hadley <- dplyr::filter(babynames, name == "Hadley") +ggplot(hadley, aes(year, n)) + + geom_line() +``` + +- To fix this, you need to differentiate the sex- Male and Female. + +```{r} +ggplot(hadley, aes(year, n, group = sex, color = sex)) + + geom_line() +``` + +- The reason this graph makes Hadley unhappy is there are alot more female babies named Hadley than male babies! + diff --git a/ch-04.html b/ch-04.html new file mode 100644 index 0000000..7f21eba --- /dev/null +++ b/ch-04.html @@ -0,0 +1,307 @@ + + + + + + + + + + + + + + +Solutions to Chapter 4 Exercises + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + + + + +
+

4.5 Exercises

+
    +
  1. Draw a boxplot of hwy for each value of cyl, without turning cyl into a factor. What extra aesthetic do you need to set?
  2. +
+
mpg %>% 
+  ggplot(aes(cyl, hwy, group = cyl)) +
+  geom_boxplot()
+

+
    +
  • Since the variable cyl is an integer, you need to set group = cyl. According to the ggplot2 docs, “when no discrete variable is used in the plot, you will need to explicitly define the grouping structure by mapping group to a variable that has a different value for each group.”
  • +
+
    +
  1. Modify the following plot so that you get one boxplot per integer value of displ.
  2. +
+
ggplot(mpg, aes(displ, cty)) + 
+  geom_boxplot()
+
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
+

+
    +
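  • As discussed in the previous question, you need to set group = displ because displ is not a discrete variable. Note that this draws one boxplot per distinct value of displ; grouping by a rounded value such as round(displ) would give one boxplot per integer value.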
  • +
+
ggplot(mpg, aes(displ, cty, group = displ)) + 
+  geom_boxplot()
+

+
    +
  1. When illustrating the difference between mapping continuous and discrete colours to a line, the discrete example needed aes(group = 1). Why? What happens if that is omitted? What’s the difference between aes(group = 1) and aes(group = 2)? Why?
  2. +
+
    +
  • Let’s examine the example in the book:
  • +
+
df <- data.frame(x = 1:3, y = 1:3, colour = c(1,3,5))
+
+ggplot(df, aes(x, y, colour = factor(colour))) + 
+  geom_line(aes(group = 2), size = 2) +
+  geom_point(size = 5)
+

+
ggplot(df, aes(x, y, colour = colour)) + 
+  geom_line(aes(group = 1), size = 2) +
+  geom_point(size = 5)
+

+
    +
  • When aes(group = 1) is omitted, no line connects the points. Instead, we get the message “geom_path: Each group consists of only one observation. Do you need to adjust the group aesthetic?” This happens because mapping colour to a discrete variable, factor(colour), splits the data into one group per colour, and here each colour has only one observation. To tell ggplot that all the points belong to the same group, we need to include aes(group = 1). The value itself doesn’t matter: aes(group = 1) and aes(group = 2) both assign the same constant to every point, so both put all the points in a single group and connect them with a line.
  • +
+
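As a rough check of the grouping argument, the sketch below counts the groups ggplot2 assigns in each case. The helper `n_groups()` and the plot names are made up for illustration; only `layer_data()` and the `group` column it returns come from ggplot2.

```r
library(ggplot2)

df <- data.frame(x = 1:3, y = 1:3, colour = c(1, 3, 5))
base <- ggplot(df, aes(x, y, colour = factor(colour)))

p_none <- base + geom_line()                 # grouping inferred from factor(colour)
p_one  <- base + geom_line(aes(group = 1))   # constant group
p_two  <- base + geom_line(aes(group = 2))   # a different constant, same effect

# Hypothetical helper: number of distinct groups in the first layer
n_groups <- function(p) {
  length(unique(suppressMessages(layer_data(p))$group))
}

groups <- c(
  none = n_groups(p_none),
  one  = n_groups(p_one),
  two  = n_groups(p_two)
)
```

Inspecting `groups` shows how many groups each version ends up with, which is what decides whether the points can be joined by a line.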
    +
  1. How many bars are in each of the following plots?
  2. +
+
ggplot(mpg, aes(drv)) + 
+      geom_bar()
+

+
# There are 3 bars in this plot.
+
+
+ggplot(mpg, aes(drv, fill = hwy, group = hwy)) + 
+  geom_bar()
+

+
# In this plot, each of the 3 drv columns is a stack of smaller bars, one for each distinct value of hwy within that drv, each filled with a shade based on the value of hwy.
+
+
+mpg2 <- mpg %>% 
+  arrange(hwy) %>%
+  mutate(id = seq_along(hwy)) 
+
+ggplot(mpg2, aes(drv, fill = hwy, group = id)) + 
+  geom_bar()
+

+
# In this plot, every observation is its own group, so there are 234 bars (one per row of mpg), stacked within the 3 drv columns and shaded by the value of hwy.
+
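The bars can also be counted rather than eyeballed. This is a sketch that assumes each row returned by `layer_data()` for a `geom_bar()` layer corresponds to one drawn rectangle; `n_bars()` is a hypothetical helper, and the base R reordering stands in for the `arrange()` and `mutate()` calls above.

```r
library(ggplot2)

# Hypothetical helper: one row of layer data per drawn bar
n_bars <- function(p) nrow(layer_data(p))

p1 <- ggplot(mpg, aes(drv)) + geom_bar()
p2 <- ggplot(mpg, aes(drv, fill = hwy, group = hwy)) + geom_bar()

# Base R equivalent of arrange(hwy) %>% mutate(id = seq_along(hwy))
mpg2 <- as.data.frame(mpg[order(mpg$hwy), ])
mpg2$id <- seq_len(nrow(mpg2))
p3 <- ggplot(mpg2, aes(drv, fill = hwy, group = id)) + geom_bar()

bars <- c(n_bars(p1), n_bars(p2), n_bars(p3))

# Number of distinct drv/hwy combinations, for comparison with the second plot
n_combos <- nrow(unique(as.data.frame(mpg)[c("drv", "hwy")]))
```

Comparing `bars` with `n_combos` and `nrow(mpg)` ties each plot back to the grouping that produced it.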
    +
  1. Install the babynames package. It contains data about the popularity of babynames in the US. Run the following code and fix the resulting graph. Why does this graph make me unhappy?
  2. +
+
# install.packages("babynames")
+library(babynames)
+hadley <- dplyr::filter(babynames, name == "Hadley")
+ggplot(hadley, aes(year, n)) + 
+  geom_line()
+

+
    +
  • To fix this, you need to draw a separate line for each sex (male and female).
  • +
+
ggplot(hadley, aes(year, n, group = sex, color = sex)) + 
+  geom_line()
+

+
    +
  • The reason this graph makes Hadley unhappy is that there are a lot more female babies named Hadley than male babies!
  • +
+
+ + + + +
+ + + + + + + + + + + + + + + diff --git a/ch-05.Rmd b/ch-05.Rmd new file mode 100644 index 0000000..f83c01c --- /dev/null +++ b/ch-05.Rmd @@ -0,0 +1,68 @@ +--- +title: "Solutions to Chapter 5 Exercises" +author: "Howard Baek" +date: "Last compiled on `r format(Sys.time(), '%B %d, %Y')`" +output: html_document +editor_options: + chunk_output_type: console +--- + +```{r setup, include=FALSE} +knitr::opts_chunk$set(echo = TRUE) +``` + +```{r, include=FALSE} +library(tidyverse) +``` + + +### 5.4.1 Exercises + + +1. What binwidth tells you the most interesting story about the distribution of `carat`? + +```{r} +diamonds %>% + ggplot(aes(carat)) + + geom_histogram(binwidth = 0.2) +``` + + +- Highly subjective answer, but I would go with 0.2 since it gives you the right amount of information about the distribution of `carat`: right-skewed. + + +2. Draw a histogram of `price`. What interesting patterns do you see? + +```{r} +diamonds %>% + ggplot(aes(price)) + + geom_histogram(binwidth = 500) +``` + +- It's skewed to the right and has a long tail. Also, there is a small peak around 5000 and a huge peak around 0. + + +3. How does the distribution of `price` vary with `clarity`? + +```{r} +diamonds %>% + ggplot(aes(clarity, price)) + + geom_boxplot() +``` + +- The range of prices is similar across clarity and the median and IQR vary greatly with clarity. + +4. Overlay a frequency polygon and density plot of `depth`. What computed variable do you need to map to `y` to make the two plots comparable? (You can either modify `geom_freqpoly()` or `geom_density()`.) + +```{r} +diamonds %>% + count(depth) %>% + mutate(sum = sum(n), + density = n / sum) %>% + ggplot(aes(depth, density)) + + geom_line() +``` + +- Say you start off with the count of values in `depth` and you plot `geom_freqpoly()`. Then, you would want to divide each count by the total number of points to get density. 
This would get you the y variable needed for `geom_density()` + + diff --git a/ch-05.html b/ch-05.html new file mode 100644 index 0000000..e206e86 --- /dev/null +++ b/ch-05.html @@ -0,0 +1,267 @@ + + + + + + + + + + + + + + +Solutions to Chapter 5 Exercises + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + + + + +
+

5.4.1 Exercises

+
    +
  1. What binwidth tells you the most interesting story about the distribution of carat?
  2. +
+
diamonds %>% 
+  ggplot(aes(carat)) +
+  geom_histogram(binwidth = 0.2)
+

+
    +
  • I would go with 0.2, since it shows the overall shape of the distribution of carat clearly: it is right-skewed.
  • +
+
    +
  1. Draw a histogram of price. What interesting patterns do you see?
  2. +
+
diamonds %>% 
+  ggplot(aes(price)) +
+  geom_histogram(binwidth = 500)
+

+
    +
  • It’s skewed to the right with a long tail.
  • +
+
    +
  1. How does the distribution of price vary with clarity?
  2. +
+
diamonds %>% 
+  ggplot(aes(clarity, price)) +
+  geom_boxplot()
+

+
    +
  • Even though the range of prices is similar across clarity, the median price and the IQR vary greatly with clarity.
  • +
+
    +
  1. Overlay a frequency polygon and density plot of depth. What computed variable do you need to map to y to make the two plots comparable? (You can either modify geom_freqpoly() or geom_density().)
  2. +
+
diamonds %>% 
+  count(depth) %>% 
+  mutate(sum = sum(n),
+         density = n / sum) %>% 
+  ggplot(aes(depth, density)) +
+  geom_line()
+

+
    +
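  • geom_freqpoly() plots the count in each bin by default, whereas geom_density() plots a density that integrates to 1. To make the two comparable, map y to the computed variable density in geom_freqpoly(), i.e. aes(y = after_stat(density)), which divides each bin count by the total number of points and by the bin width. The code above only divides each count by the total, which gives a proportion rather than a density.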
  • +
+
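A minimal sketch of the overlay the exercise asks for is given here. The bin width of 0.2 and the red colour are illustrative choices, not taken from the original answer; the final two lines only check that the frequency polygon, once mapped to density, covers an area of about 1 like the density curve.

```r
library(ggplot2)

binwidth <- 0.2  # illustrative choice, not from the original answer

p <- ggplot(diamonds, aes(depth)) +
  # put the frequency polygon on the density scale instead of counts
  geom_freqpoly(aes(y = after_stat(density)), binwidth = binwidth) +
  geom_density(colour = "red")

# Area under the frequency polygon: density times bin width, summed over bins
freq <- layer_data(p, 1)
area <- sum(freq$density * binwidth)
```

Printing `p` draws both layers on a shared y axis. Alternatively, `geom_density(aes(y = after_stat(count)))` moves the density curve onto the count scale, although the two then differ by a factor of the bin width.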
+ + + + +
+ + + + + + + + + + + + + + + diff --git a/ch-10.Rmd b/ch-10.Rmd new file mode 100644 index 0000000..a1673ae --- /dev/null +++ b/ch-10.Rmd @@ -0,0 +1,104 @@ +--- +title: "Solutions to Chapter 10 Exercises" +author: "Howard Baek" +date: "Last compiled on `r format(Sys.time(), '%B %d, %Y')`" +output: html_document +editor_options: + chunk_output_type: console +--- + +```{r setup, include=FALSE} +knitr::opts_chunk$set(echo = TRUE) +``` + +```{r, include=FALSE} +library(tidyverse) +``` + + +1. The following code creates two plots of the mpg dataset. Modify the code so that the legend and axes match, without using faceting! + +```{r} +fwd <- subset(mpg, drv == "f") +rwd <- subset(mpg, drv == "r") + +ggplot(fwd, aes(displ, hwy, colour = class)) + + geom_point() + + scale_color_discrete(limits = c("compact", "midsize", "minivan", "subcompact", + "2seater", "suv")) + + coord_cartesian(xlim = c(1, 8), + ylim = c(15, 50)) + +ggplot(rwd, aes(displ, hwy, colour = class)) + + geom_point() + + scale_color_discrete(limits = c("compact", "midsize", "minivan", "subcompact", + "2seater", "suv")) + + coord_cartesian(xlim = c(1, 8), + ylim = c(15, 50)) +``` + +- We can make the legend and axes match by manually setting the limits of color. + + + +2. What happens if you add two `xlim()` calls to the same plot? Why? + +```{r} +ggplot(fwd, aes(displ, hwy, colour = class)) + + geom_point() + + xlim(2, 5) + + xlim(3, 4) +``` + +- You get a very informative message: `Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.` I'm guessing the reason is ggplot evaluates the code top-down. Since the second `xlim` call happens after the first `xlim`, it replaces the first `xlim`. + + +3. What does `scale_x_continuous(limits = c(NA, NA))` do? 
+ +```{r} +ggplot(fwd, aes(displ, hwy, colour = class)) + + geom_point() + + scale_x_continuous(limits = c(NA, NA)) +``` + +- It doesn't change the limits because, according to the help page, "use NA to refer to the existing minimum or maximum". This means that we are setting the limits to be from the existing minimum to the existing maximum. + + + +4. What does `expand_limits()` do and how does it work? Read the source code. + +```{r, eval = FALSE} +?expand_limits +``` + +- According to the help page, "it ensures limits include a single value, for all panels or all plots". It is a wrapper around `geom_blank()` and makes it easy to add such values. + + + + + + +### 10.1.8 Exercises + +1. Recreate the following graphic: + +```{r} +ggplot(mpg, aes(displ, hwy)) + + geom_point(size = 3) + + scale_x_continuous("Displacement", labels = scales::unit_format(suffix = "L")) + + scale_y_continuous(quote(paste("Highway ", (frac(miles, gallon))))) +``` + + +2. List the three different types of object you can supply to the `breaks` argument. How do `breaks` and `labels` differ? + +According to the help page, you can supply NULL, transformation object (from `waiver()`), a numeric vector, and a function that takes limits as input and returns breaks as output. +- The `labels` argument takes in a character vector instead of a numeric vector, which `breaks` accepts. Also, it accepts a function that takes breaks as input (instead of limits) and returns labels as output (instead of breaks). + + +3. What label function allows you to create mathematical expressions? What label function converts 1 to 1st, 2 to 2nd, and so on? + +- `label_math()` allows you to create mathematical expressions. 
+- `label_ordinal()` allows you to label ordinal numbers (1 to 1st, 2 to 2nd, 3 to 3rd and so on) + + diff --git a/ch-10.html b/ch-10.html new file mode 100644 index 0000000..b4fe869 --- /dev/null +++ b/ch-10.html @@ -0,0 +1,297 @@ + + + + + + + + + + + + + + +Solutions to Chapter 10 Exercises + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + + + + +
    +
  1. The following code creates two plots of the mpg dataset. Modify the code so that the legend and axes match, without using faceting!
  2. +
+
fwd <- subset(mpg, drv == "f")
+rwd <- subset(mpg, drv == "r")
+
+ggplot(fwd, aes(displ, hwy, colour = class)) +
+  geom_point() +
+  scale_color_discrete(limits = c("compact", "midsize", "minivan", "subcompact",
+                                  "2seater", "suv")) +
+  coord_cartesian(xlim = c(1, 8),
+                  ylim = c(15, 50))
+

+
ggplot(rwd, aes(displ, hwy, colour = class)) + 
+  geom_point() +
+  scale_color_discrete(limits = c("compact", "midsize", "minivan", "subcompact",
+                                  "2seater", "suv")) +
+  coord_cartesian(xlim = c(1, 8),
+                  ylim = c(15, 50))
+

+
    +
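  • We can make the legend and axes match by giving both plots the same limits: scale_color_discrete(limits = ...) lists every class that appears in either subset, and coord_cartesian() fixes the same x and y ranges.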
  • +
+
    +
  1. What happens if you add two xlim() calls to the same plot? Why?
  2. +
+
ggplot(fwd, aes(displ, hwy, colour = class)) +
+  geom_point() +
+  xlim(2, 5) +
+  xlim(3, 4)
+
## Scale for 'x' is already present. Adding another scale for 'x', which will
+## replace the existing scale.
+
## Warning: Removed 75 rows containing missing values (geom_point).
+

+
    +
  • You get a very informative message: Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale. A plot can hold only one scale per aesthetic, and xlim() is shorthand for setting the limits of the x scale. Since the second xlim() call is added after the first, it replaces the first, so the plot uses limits of 3 to 4.
  • +
+
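To confirm which limits survive, the x scale of the built plot can be inspected. This sketch assumes `layer_scales()` returns the trained x scale and that its `get_limits()` method reports the limits in use; the variable name `x_limits` is illustrative.

```r
library(ggplot2)

fwd <- subset(mpg, drv == "f")

# Adding the second xlim() emits the "Scale for x is already present" message
p <- ggplot(fwd, aes(displ, hwy, colour = class)) +
  geom_point() +
  xlim(2, 5) +
  xlim(3, 4)

# Limits of the x scale that the plot actually uses
x_limits <- suppressMessages(layer_scales(p)$x$get_limits())
```

The value of `x_limits` shows that only the last `xlim()` call is in effect.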
    +
  1. What does scale_x_continuous(limits = c(NA, NA)) do?
  2. +
+
ggplot(fwd, aes(displ, hwy, colour = class)) +
+  geom_point() +
+  scale_x_continuous(limits = c(NA, NA))
+

+
    +
  • It doesn’t change the limits because, according to the help page, “use NA to refer to the existing minimum or maximum”. We are setting the limits to be from the existing minimum to the existing maximum.
  • +
+
    +
  1. What does expand_limits() do and how does it work? Read the source code.
  2. +
+
?expand_limits
+
    +
  • According to the help page, “it ensures limits include a single value, for all panels or all plots”. It is a wrapper around geom_blank() and makes it easy to add such values.
  • +
+
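As a small illustration of the effect, the sketch below compares the x limits of a plot with and without `expand_limits(x = 0)`. The plot and variable names are made up for this example, and the limits are read with `layer_scales()` under the assumption that `get_limits()` returns the trained range when no limits are set explicitly.

```r
library(ggplot2)

p_default  <- ggplot(mpg, aes(displ, hwy)) + geom_point()
p_expanded <- p_default + expand_limits(x = 0)  # adds a blank layer at x = 0

default_x  <- layer_scales(p_default)$x$get_limits()
expanded_x <- layer_scales(p_expanded)$x$get_limits()
```

Because the extra value arrives through an invisible layer, the scale is trained on it like any other data, which is why the lower limit moves while the upper limit stays put.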
+

10.1.8 Exercises

+
    +
  1. Recreate the following graphic:
  2. +
+
ggplot(mpg, aes(displ, hwy)) + 
+  geom_point(size = 3) +  
+  scale_x_continuous("Displacement", labels = scales::unit_format(suffix = "L")) + 
+  scale_y_continuous(quote(paste("Highway ", (frac(miles, gallon))))) 
+

+
    +
  1. List the three different types of object you can supply to the breaks argument. How do breaks and labels differ?
  2. +
+

According to the help page, breaks accepts NULL (no breaks), waiver() (the default breaks computed by the transformation object), a numeric vector of positions, or a function that takes the limits as input and returns breaks as output. The labels argument differs in that it takes a character vector instead of the numeric vector that breaks accepts. It also accepts a function, but one that takes the breaks as input (instead of the limits) and returns labels as output (instead of breaks).

+
    +
  1. What label function allows you to create mathematical expressions? What label function converts 1 to 1st, 2 to 2nd, and so on?
  2. +
+
    +
  • label_math() allows you to create mathematical expressions.
  • +
  • label_ordinal() allows you to label ordinal numbers (1 to 1st, 2 to 2nd, 3 to 3rd and so on)
  • +
+
+ + + + +
+ + + + + + + + + + + + + + + diff --git a/ch-11.Rmd b/ch-11.Rmd new file mode 100644 index 0000000..98d1a22 --- /dev/null +++ b/ch-11.Rmd @@ -0,0 +1,67 @@ +--- +title: "Solutions to Chapter 11 Exercises" +author: "Howard Baek" +date: "Last compiled on `r format(Sys.time(), '%B %d, %Y')`" +output: html_document +editor_options: + chunk_output_type: console +--- + +```{r setup, include=FALSE} +knitr::opts_chunk$set(echo = TRUE) +``` + +```{r, include=FALSE} +library(tidyverse) +``` + +### 11.3.4 Exercises + + +```{r} +drv_labels <- c("4" = "4wd", "f" = "fwd", "r" = "rwd") + +ggplot(mpg, aes(displ, hwy)) + + geom_point(aes(colour = drv)) + + scale_colour_discrete(labels = drv_labels) +``` + + +- We store the labels inside `drv_labels` and use it in `scale_colour_discrete()` + + +### 11.7.5 Exercises + +1. How do you make legends appear to the left of the plot? +` +- `theme(legend.position = "left")` make legends appear to the left of the plot. +- Other options: `theme(legend.position = "right")`, `theme(legend.position = "bottom")`, and `theme(legend.position = "none")` + + +2. What's gone wrong with this plot? How could you fix it? + +- There are two separate legends for the same variable (`drv`). We need to combine these two legends into one. To do this, both `color` and `shape` need to be given shape specifications. + +```{r} +ggplot(mpg, aes(displ, hwy)) + + geom_point(aes(colour = drv, shape = drv)) + + scale_colour_discrete("Drive train", + breaks = c("4", "f", "r"), + labels = c("4-wheel", "front", "rear")) + + scale_shape_discrete("Drive train", + breaks = c("4", "f", "r"), + labels = c("4-wheel", "front", "rear")) +``` + + +3. 
+ +```{r} +ggplot(mpg, aes(displ, hwy, colour = class)) + + geom_point(show.legend = FALSE) + + geom_smooth(method = "lm", se = FALSE) + + theme(legend.position = "bottom") + + guides(colour = guide_legend(nrow = 1)) +``` + +> Note: The answers to these "recreate the code for this plot" questions are provided in the source code of the book. diff --git a/ch-11.html b/ch-11.html new file mode 100644 index 0000000..b4b8ce3 --- /dev/null +++ b/ch-11.html @@ -0,0 +1,271 @@ + + + + + + + + + + + + + + +Solutions to Chapter 11 Exercises + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + + + + +
+

11.3.4 Exercises

+
drv_labels <- c("4" = "4wd", "f" = "fwd", "r" = "rwd")
+
+ggplot(mpg, aes(displ, hwy)) + 
+  geom_point(aes(colour = drv)) +
+  scale_colour_discrete(labels = drv_labels)
+

+
    +
  • We store the labels inside drv_labels and use it in scale_colour_discrete()
  • +
+
+
+

11.7.5 Exercises

+
    +
  1. How do you make legends appear to the left of the plot?
  2. +
+
    +
  • theme(legend.position = "left") makes legends appear to the left of the plot.
  • +
  • Other options: theme(legend.position = "right"), theme(legend.position = "top"), theme(legend.position = "bottom"), and theme(legend.position = "none")
  • +
+
    +
  1. What’s gone wrong with this plot? How could you fix it?
  2. +
+
    +
  • There are two separate legends for the same variable (drv). We need to combine these two legends into one. To do this, the colour and shape scales need to be given the same specifications (name, breaks, and labels).
  • +
+
ggplot(mpg, aes(displ, hwy)) + 
+  geom_point(aes(colour = drv, shape = drv)) + 
+  scale_colour_discrete("Drive train",
+                        breaks = c("4", "f", "r"),
+                        labels = c("4-wheel", "front", "rear")) +
+  scale_shape_discrete("Drive train",
+                        breaks = c("4", "f", "r"),
+                        labels = c("4-wheel", "front", "rear"))
+

+
    +
  1. +
+
ggplot(mpg, aes(displ, hwy, colour = class)) + 
+      geom_point(show.legend = FALSE) + 
+      geom_smooth(method = "lm", se = FALSE) + 
+      theme(legend.position = "bottom") + 
+      guides(colour = guide_legend(nrow = 1))
+
## `geom_smooth()` using formula 'y ~ x'
+

+
+

Note: The answers to these “recreate the code for this plot” questions are provided in the source code of the book.

+
+
+ + + + +
+ + + + + + + + + + + + + + + diff --git a/ch-14.Rmd b/ch-14.Rmd new file mode 100644 index 0000000..f974a9c --- /dev/null +++ b/ch-14.Rmd @@ -0,0 +1,175 @@ +--- +title: "Solutions to Chapter 14 Exercises" +author: "Howard Baek" +date: "Last compiled on `r format(Sys.time(), '%B %d, %Y')`" +output: html_document +editor_options: + chunk_output_type: console +--- + +```{r setup, include=FALSE} +knitr::opts_chunk$set(echo = TRUE) +``` + + +```{r, include=FALSE} +library(tidyverse) +``` + + +### 14.3.1 Exercises + +1. The first two arguments to ggplot are `data` and `mapping`. The first +two arguments to all layer functions are `mapping` and `data`. Why does the +order of the arguments differ? (Hint: think about what you set most commonly.) + +- Commonly, you first set the data in `ggplot()` and then set aesthetics inside your layer functions, like `geom_point()`, `geom_boxplot()`, or `geom_histogram()`. + + +2. + +```{r} +library(dplyr) +class <- mpg %>% + group_by(class) %>% + summarise(n = n(), hwy = mean(hwy)) +``` + +```{r} +mpg %>% + ggplot(aes(class, hwy)) + + geom_jitter(width = 0.15, height = 0.35) + + geom_point(data = class, aes(class, hwy), + color = "red", + size = 6) + + geom_text(data = class, aes(y = 10, x = class, label = paste0("n = ", n))) +``` + +- I plotted 3 different layers: jittered points, red point for the summary measure, mean, and text for the sample size (n). + + + +### 14.4.3 Exercises + +1. 
Simplify the following plot specifications: + +```{r} +#################################### +#################################### +# ggplot(mpg) + +# geom_point(aes(mpg$displ, mpg$hwy)) + +# The above can be simplified: +# ggplot(mpg) + +# geom_point(aes(displ, hwy)) +#################################### +#################################### + + +#################################### +#################################### +# ggplot() + +# geom_point(mapping = aes(y = hwy, x = cty), +# data = mpg) + +# geom_smooth(data = mpg, +# mapping = aes(cty, hwy)) + +# The above can be simplified: +# ggplot(mpg, aes(cty, hwy)) + +# geom_point() + +# geom_smooth() +#################################### +#################################### + + +#################################### +#################################### +# ggplot(diamonds, aes(carat, price)) + +# geom_point(aes(log(brainwt), log(bodywt)), +# data = msleep) + +# The above can be simplified: +# msleep_processed <- msleep %>% +# mutate(brainwt_log = log(brainwt), +# bodywt_log = log(bodywt)) + +# ggplot(diamonds, aes(carat, price)) + +# geom_point(aes(brainwt_log, bodywt_log), +# data = msleep_processed) +#################################### +#################################### +``` + + +2. What does the following code do? Does it work? Does it make sense? Why/why not? + +```{r} +ggplot(mpg) + + geom_point(aes(class, cty)) + + geom_boxplot(aes(trans, hwy)) +``` + +- It plots points of `class` vs `cty` and then a boxplot of `trans` vs `hwy`. It doesn't make sense to plot layers with different `x` and `y` variables. + + +3. What happens if you try to use a continuous variable on the x axis in one layer, and a categorical variable in another layer? What happens if you do it in the opposite order? + +- Not sure + + +### 14.5.1 Exercises + +1,2,3 omitted. + +4. Starting from top left, clockwise direction: + +- `geom_violin()`, `geom_point()`, `geom_point()`, `geom_path()`, `geom_area()`, `geom_hex()`. 
+ + + + +### 14.6.2 Exercises + +1. +```{r} +mod <- loess(hwy ~ displ, data = mpg) +smoothed <- data.frame(displ = seq(1.6, 7, length = 50)) +pred <- predict(mod, newdata = smoothed, se = TRUE) +smoothed$hwy <- pred$fit +smoothed$hwy_lwr <- pred$fit - 1.96 * pred$se.fit +smoothed$hwy_upr <- pred$fit + 1.96 * pred$se.fit + +smoothed %>% + ggplot(aes(displ, hwy)) + + geom_line(color = "dodgerblue1") + + geom_ribbon(aes(ymin = hwy_lwr, + ymax = hwy_upr), + alpha = 0.4) +``` + + +2. From left to right, + +`stat_ecdf()`, `stat_qq()`, `stat_function()` + + +3. +```{r} +mpg %>% + ggplot(aes(drv, trans)) + + geom_count(aes(size = after_stat(prop), group = 1)) +``` + + + + +### 14.7.1 Exercises + +1. According to the help page, `position_nudge()` is generally useful for adjusting the position of items on discrete scales by a small amount. Nudging is built in to geom_text() because it's so useful for moving labels a small distance from what they're labelling. + +2. Not sure + +3. `geom_jitter()` adds a small amount of random variation to the location of each point. It is useful for looking at all the overplotted points. On the other hand, `geom_count()` counts the number of overlapping observations at each location. It is useful for understanding the number of points in a location. + +4. Stacked area plot seems useful when you want to portray an area whereas a line plot seems useful when you just need a line. 
+ diff --git a/ch-15.Rmd b/ch-15.Rmd new file mode 100644 index 0000000..6657edc --- /dev/null +++ b/ch-15.Rmd @@ -0,0 +1,88 @@ +--- +title: "Solutions to Chapter 15 Exercises" +author: "Howard Baek" +date: "Last compiled on `r format(Sys.time(), '%B %d, %Y')`" +output: html_document +editor_options: + chunk_output_type: console +--- + +```{r setup, include=FALSE} +knitr::opts_chunk$set(echo = TRUE) +``` + + +```{r, include=FALSE} +library(tidyverse) +``` + + +### 15.1.1 Exercises + +```{r} +#################################### +#################################### +# ggplot(mpg, aes(displ)) + +# scale_y_continuous("Highway mpg") + +# scale_x_continuous() + +# geom_point(aes(y = hwy)) + +# The above can be modified to: +# mpg %>% +# ggplot(aes(displ, hwy)) + +# geom_point() + +# labs(y = "Highway mpg") +#################################### +#################################### + + + + +#################################### +#################################### +# ggplot(mpg, aes(y = displ, x = class)) + +# scale_y_continuous("Displacement (l)") + +# scale_x_discrete("Car type") + +# scale_x_discrete("Type of car") + +# scale_colour_discrete() + +# geom_point(aes(colour = drv)) + +# scale_colour_discrete("Drive\ntrain") + +# The above can be modified to +# mpg %>% +# ggplot(aes(class, displ)) + +# geom_point(aes(color = drv)) + +# labs(x = "Type of car", +# y = "Displacement (l)", +# color = "Drive\ntrain") +#################################### +#################################### +``` + + + + +2. What happens if you pair a discrete variable with a continuous scale? What happens if you pair a continuous variable with a discrete scale? 
+ +```{r} +mpg %>% + ggplot(aes(class, displ)) + + geom_point(aes(color = drv)) + + scale_y_discrete() + + labs(x = "Type of car", + y = "Displacement (l)", + color = "Drive\ntrain") +``` + +- When you pair a discrete variable with a continuous scale, you don't see a plot and get this error message: _Discrete value supplied to continuous scale_ + +- When you pair a continuous variable with a discrete scale, as seen above, you get a different looking plot that doesn't contain the proper axis ticks or grid lines. + + +### 15.7 Exercises + +According to the help pages, + +- `name`: specifies the labels for **axis** and the title for **legends**. +- `breaks`: specifies the ticks and grid lines for **axis** and the key for **legends**. +- `labels`: specifies the tick label for **axis** and key label for **legends**. diff --git a/ch-15.html b/ch-15.html new file mode 100644 index 0000000..409b3fa --- /dev/null +++ b/ch-15.html @@ -0,0 +1,284 @@ + + + + + + + + + + + + + + +Solutions to Chapter 15 Exercises + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + + + + +
+

15.1.1 Exercises

+
####################################
+####################################
+# ggplot(mpg, aes(displ)) + 
+#   scale_y_continuous("Highway mpg") + 
+#   scale_x_continuous() +
+#   geom_point(aes(y = hwy))
+
+# The above can be modified to:
+# mpg %>% 
+#   ggplot(aes(displ, hwy)) +
+#   geom_point() +
+#   labs(y = "Highway mpg")
+####################################
+####################################
+
+
+
+
+####################################
+####################################
+# ggplot(mpg, aes(y = displ, x = class)) + 
+#   scale_y_continuous("Displacement (l)") + 
+#   scale_x_discrete("Car type") +
+#   scale_x_discrete("Type of car") + 
+#   scale_colour_discrete() + 
+#   geom_point(aes(colour = drv)) + 
+#   scale_colour_discrete("Drive\ntrain")
+
+# The above can be modified to
+# mpg %>% 
+#   ggplot(aes(class, displ)) +
+#   geom_point(aes(color = drv)) +
+#   labs(x = "Type of car",
+#        y = "Displacement (l)",
+#        color = "Drive\ntrain")
+####################################
+####################################
+
    +
  1. What happens if you pair a discrete variable with a continuous scale? What happens if you pair a continuous variable with a discrete scale?
  2. +
+
mpg %>%
+  ggplot(aes(class, displ)) +
+  geom_point(aes(color = drv)) +
+  scale_y_discrete() +
+  labs(x = "Type of car",
+       y = "Displacement (l)",
+       color = "Drive\ntrain")
+

+
    +
  • When you pair a discrete variable with a continuous scale, you don’t see a plot and get this error message: Discrete value supplied to continuous scale

  • +
  • When you pair a continuous variable with a discrete scale, as seen above, you get a different-looking plot that doesn’t contain the proper axis ticks or grid lines.

  • +
+
+
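The first case (a discrete variable on a continuous scale) can be checked with a small sketch. This is my own illustration, not code from the book: it pairs the discrete `class` variable with `scale_x_continuous()` and captures the error with `tryCatch()`. The names `p` and `msg` are assumptions made for the example, and the exact wording of the message may differ between ggplot2 versions.

```r
library(ggplot2)

# Discrete variable (class) paired with a continuous x scale.
p <- ggplot(mpg, aes(class, displ)) +
  geom_point() +
  scale_x_continuous()

# Defining the plot does not fail; the error appears when the plot is built.
msg <- tryCatch(
  {
    ggplot_build(p)
    "no error"
  },
  error = function(e) conditionMessage(e)
)

# Show whatever message this ggplot2 version produced.
msg
```

Building the plot explicitly with `ggplot_build()` makes the failure easy to capture, since printing `p` at the console would raise the same error.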
+

15.7 Exercises

+
    +
  • name: specifies the label for axes and the title for legends.
  • +
  • breaks: specifies the ticks and grid lines for axes and the keys for legends.
  • +
  • labels: specifies the tick labels for axes and the key labels for legends.
  • +
+
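A minimal sketch of how the three arguments behave on an axis and on a legend. The break positions and label strings are values I picked for illustration; they do not come from the book.

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy, colour = drv)) +
  geom_point() +
  # On a position scale: name is the axis title, breaks place the ticks
  # and grid lines, labels are the text shown at each tick.
  scale_x_continuous(
    name   = "Engine displacement (l)",
    breaks = c(2, 4, 6),
    labels = c("2 l", "4 l", "6 l")
  ) +
  # On a colour scale: name is the legend title, breaks choose the keys,
  # labels are the text shown next to each key.
  scale_colour_discrete(
    name   = "Drive\ntrain",
    breaks = c("4", "f", "r"),
    labels = c("4-wheel", "front", "rear")
  )

p
```

Each `labels` vector has to be the same length as its `breaks` vector; otherwise ggplot2 raises an error.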
+ + + + +
+ + + + + + + + + + + + + + + diff --git a/ch-17.Rmd b/ch-17.Rmd new file mode 100644 index 0000000..a0db00e --- /dev/null +++ b/ch-17.Rmd @@ -0,0 +1,74 @@ +--- +title: "Solutions to Chapter 17 Exercises" +author: "Howard Baek" +date: "Last compiled on `r format(Sys.time(), '%B %d, %Y')`" +output: html_document +editor_options: + chunk_output_type: console +--- + +```{r setup, include=FALSE} +knitr::opts_chunk$set(echo = TRUE) +``` + + +```{r, include=FALSE} +library(tidyverse) +``` + + +### 17.7 Exercises + +1. +```{r} +# faceting by cut and grouping by carat. +diamonds %>% + ggplot(aes(price)) + + geom_histogram(aes(color = carat)) + + facet_wrap(~cut, scales = "free_y") + +# faceting by carat and grouping by cut. +diamonds %>% + ggplot(aes(price)) + + geom_histogram(aes(color = cut)) + + facet_wrap(~carat, scales = "free_y") +``` + +- It makes more sense to facet by cut because it's a discrete variable. Faceting by carat, a continuous variable, makes too many facets and renders the plot unreadable! + + +2. +```{r} +diamonds %>% + ggplot(aes(carat, price)) + + geom_point(aes(color = color)) +``` + + +```{r} +diamonds %>% + ggplot(aes(carat, price)) + + geom_point(aes(color = color)) + + facet_wrap(~color) +``` + +- I think it's better to use grouping to compare the different colors. The panels all have the same shape, so it's hard to compare the groups across facets. If I use faceting, I'd add that the plot is faceted by diamond colour, from D (best) to J (worst). + + +3. I think `facet_wrap()` is more useful than `facet_grid()` because the former is useful if you have a single variable with many levels and want to arrange the plots in a more space-efficient manner. In data analysis, it's extremely common to have a single variable with many levels that the analyst wants to arrange for easy comparison. Although `facet_grid()` works on single variables, `facet_wrap()` involves less typing when you have a single variable. + + +4. 
+```{r} +mpg2 <- subset(mpg, cyl != 5 & drv %in% c("4", "f") & class != "2seater") + +mpg2 %>% + ggplot(aes(displ, hwy)) + + geom_point() + + geom_smooth(data = mpg2 %>% select(-class), + se = FALSE, + method = "loess") + + + facet_wrap(~class) +``` + diff --git a/ch-19.Rmd b/ch-19.Rmd new file mode 100644 index 0000000..321244e --- /dev/null +++ b/ch-19.Rmd @@ -0,0 +1,87 @@ +--- +title: "Solutions to Chapter 19 Exercises" +author: "Howard Baek" +date: "Last compiled on `r format(Sys.time(), '%B %d, %Y')`" +output: html_document +editor_options: + chunk_output_type: console +--- + +```{r setup, include=FALSE} +knitr::opts_chunk$set(echo = TRUE) +``` + + +```{r, include=FALSE} +library(tidyverse) +``` + + +### 19.2.1 Exercises + +1. +```{r} +pink_hist <- geom_histogram( + color = "pink", + bins = 100 +) +``` + +2. +```{r} +fill_blues <- scale_fill_distiller( + palette = "Blues" +) +``` + + +3. +```{r, eval=FALSE} +?theme_gray() +``` + +- Its arguments include `base_size`, `base_family`, `base_line_size`, and `base_rect_size` +- According to the help file, `theme_gray()` is the signature ggplot2 theme with a grey background and white gridlines and is designed to put the data forward yet make comparisons easy. + + + +4. +```{r} +scale_colour_wesanderson <- function(palette = "BottleRocket1", ...) { + scale_color_manual(values = wesanderson::wes_palette(palette), ...) +} + +# Working example +ggplot(mtcars, aes(wt, disp, color = factor(gear))) + + geom_point() + + scale_colour_wesanderson() +``` + + + + +### 19.3.4 Exercises + +1. +```{r} +remove_labels <- theme(legend.position = "none", + axis.title.x = element_blank(), + axis.title.y = element_blank()) + +# Working Example +ggplot(mtcars, aes(wt, disp, color = factor(gear))) + + geom_point() + + remove_labels +``` + + +2. Not sure + + +### 19.4.3. Exercises + +These questions are way above my head! + +### 19.5.1 Exercises + +These questions are way above my head! 
\ No newline at end of file diff --git a/data-visualization-2.1.pdf b/data-visualization-2.1.pdf new file mode 100644 index 0000000..2193b8b Binary files /dev/null and b/data-visualization-2.1.pdf differ diff --git a/ggplot2-book-exercises-solutions.Rproj b/ggplot2-book-exercises-solutions.Rproj new file mode 100644 index 0000000..8e3c2eb --- /dev/null +++ b/ggplot2-book-exercises-solutions.Rproj @@ -0,0 +1,13 @@ +Version: 1.0 + +RestoreWorkspace: Default +SaveWorkspace: Default +AlwaysSaveHistory: Default + +EnableCodeIndexing: Yes +UseSpacesForTab: Yes +NumSpacesForTab: 2 +Encoding: UTF-8 + +RnwWeave: Sweave +LaTeX: pdfLaTeX