---
title: "Lab 5"
author: "Eva, Chloe, and Kai"
format:
  html:
    self-contained: true
    code-tools: true
    code-fold: true
editor: visual
execute:
  echo: true
  include: true
  message: false
  warning: false
embed-resources: true
theme: cerulean
---
> **Goal:** Scrape information from <https://www.cheese.com> to obtain a dataset of characteristics about different cheeses, and gain deeper insight into your coding process. 🪤
```{r}
#| label: libraries
library(rvest)
library(tidyverse) # includes dplyr, stringr, and purrr, which are used below
```
**Part 1:** Locate and examine the `robots.txt` file for this website. Summarize what you learn from it.
The `robots.txt` file is located at <https://www.cheese.com/robots.txt>. It points to the sitemap at <https://www.cheese.com/sitemap.xml> and specifies `User-agent: *`, meaning its rules apply to all crawlers; since it lists no disallowed paths, anyone may scrape the site. The sitemap includes the URL for every type of cheese, listed in alphabetical order.
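As a quick check, the file can be read directly from R with base `readLines()` (a minimal sketch, left unevaluated here):

```{r}
#| eval: false
#| label: inspect-robots-txt
# Print the raw contents of the robots.txt file
readLines("https://www.cheese.com/robots.txt")
```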
**Part 2:** Learn about the `html_attr()` function from `rvest`. Describe how this function works with a small example.
The `html_attr()` function extracts the value of a single named attribute from each selected element. In the example below, `html_elements("a")` selects every anchor (hyperlink) element on the page, and `html_attr("class")` returns the `class` attribute of each one.
```{r}
#| message: false
#| label: function-use-example
url <- "https://www.cheese.com/alphabetical"
webpage <- read_html(url)

webpage |>
  html_elements("a") |>
  html_attr("class")
```
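To see `html_attr()` in isolation, here is a small self-contained sketch using rvest's `minimal_html()` helper; the HTML snippet is invented purely for illustration:

```{r}
#| label: html-attr-minimal-sketch
# A made-up one-link page, just to show what html_attr() returns
snippet <- minimal_html('<a href="https://example.com" class="nav-link">Example</a>')

snippet |>
  html_elements("a") |>
  html_attr("href")
```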
**Part 3:** (Do this alongside Part 4 below.) I used [ChatGPT](https://chat.openai.com/chat) to start the process of scraping cheese information with the following prompt:
> Write R code using the rvest package that allows me to scrape cheese information from cheese.com.
Fully document your process of checking this code. Record any observations you make about where ChatGPT is useful / not useful.
- ChatGPT's code is useful for getting the cheese names on the first page; however, it does not retrieve the cheese URLs accurately. Some URLs are scraped twice, which offsets the cheese names from their correct URLs (see the check after the code block below).
```{r}
#| eval: false
#| label: small-example-of-getting-cheese-info
# Define the URL for the alphabetical list of all cheeses on the site
url <- "https://www.cheese.com/alphabetical"

# Read the HTML content from the webpage
webpage <- read_html(url)

# Extract the cheese URLs
cheese_data <- webpage %>%
  html_nodes(".cheese-item h3") %>%  # We added h3 to fix the duplicates problem
  html_nodes("a") %>%
  html_attr("href") %>%
  paste0("https://cheese.com", .)    # Build the full URL for each cheese

# Extract the cheese names
cheese_names <- webpage %>%
  html_nodes(".cheese-item h3") %>%
  html_text()

# Create a data frame to store the results
cheese_df <- data.frame(Name = cheese_names,
                        URL = cheese_data,
                        stringsAsFactors = FALSE)

# Print the data frame (the explicit print() call is unnecessary)
print(cheese_df)
```
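One way to confirm the misalignment is to compare how many links come back with and without the `h3` restriction; this sketch reflects our assumption about where the duplicates come from and is not evaluated here:

```{r}
#| eval: false
#| label: checking-duplicate-urls
# Without the h3 restriction, more links than cheese names come back,
# so the Name and URL columns get offset from each other
length(webpage %>% html_nodes(".cheese-item a") %>% html_attr("href"))
length(webpage %>% html_nodes(".cheese-item h3 a") %>% html_attr("href"))
length(cheese_names)
```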
**Part 4:** Obtain the following information for **all** cheeses in the database:
- Cheese name
- URL for the cheese's webpage (e.g., <https://www.cheese.com/gouda/>)
- Whether or not the cheese has a picture (e.g., [gouda](https://www.cheese.com/gouda/) has a picture, but [bianco](https://www.cheese.com/bianco/) does not)
To be kind to the website owners, please add a 1 second pause between page queries. (Note that you can view 100 cheeses at a time.)
```{r}
#| label: function-to-extract-cheese-name-url-image
# Source for stringr cheat sheet: https://github.com/rstudio/cheatsheets/blob/main/strings.pdf
url <- "https://www.cheese.com/alphabetical?per_page=100"

get_cheese <- function(url) {
  # Read the HTML content from the webpage
  webpage <- read_html(url)

  # Extract the cheese URLs
  cheese_data <- webpage %>%
    html_nodes(".cheese-item h3") %>%
    html_nodes("a") %>%
    html_attr("href") %>%
    paste0("https://cheese.com", .)

  # Extract the cheese names
  cheese_names <- webpage %>%
    html_nodes(".cheese-item h3") %>%
    html_text()

  # Flag whether each cheese has a picture: the site serves placeholder
  # images from /static/ and real photos from /media/
  cheese_img <- webpage |>
    html_nodes(".cheese-item img") |>
    html_attr("src") |>
    str_replace("^/static/.*", "No image") |>
    str_replace("^/media/.*", "Yes")

  # Create a data frame to store the results
  cheese_df <- data.frame(Name = cheese_names,
                          URL = cheese_data,
                          Img = cheese_img,
                          stringsAsFactors = FALSE)

  # Return the data frame
  cheese_df
}
```
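For example, `get_cheese()` can be applied to a single page of 100 results (left unevaluated to avoid extra requests to the site):

```{r}
#| eval: false
#| label: get-cheese-single-page-example
# Scrape the first page of 100 cheeses and peek at the result
get_cheese("https://www.cheese.com/alphabetical?per_page=100&page=1") |>
  head()
```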
```{r}
#| label: fetching-all-cheese-data-from-website
# Source for the loop pattern: https://stackoverflow.com/questions/27153263/adding-elements-to-a-list-in-for-loop-in-r
# Start with an empty object to accumulate results
all_cheeses <- NULL

# Iterate across each of the 20 pages of 100 cheeses on the website
for (i in 1:20) {
  # Construct the URL for each page
  url <- paste0("https://www.cheese.com/alphabetical?per_page=100&page=", i)

  # Pause execution for one second between queries
  Sys.sleep(1)

  # Get the URL, name, and image status for every cheese on this page
  cheese <- get_cheese(url)

  # Combine the data from all of the pages
  all_cheeses <- bind_rows(all_cheeses, cheese)
}
```
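As a sketch of an alternative, the same scrape can be written with purrr's `map()` rather than growing the data frame inside a `for` loop; this assumes the site still exposes 20 pages of 100 cheeses:

```{r}
#| eval: false
#| label: fetching-all-cheeses-purrr-sketch
# Equivalent purrr version: build one data frame per page, then bind them
all_cheeses <- map(1:20, function(i) {
  Sys.sleep(1) # be kind to the website owners
  get_cheese(paste0("https://www.cheese.com/alphabetical?per_page=100&page=", i))
}) |>
  bind_rows()
```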
**Part 5:** When you go to a particular cheese's page (like [gouda](https://www.cheese.com/gouda/)), you'll see more detailed information about the cheese. For [**just 10**]{.underline} of the cheeses in the database, obtain the following detailed information:
- milk information
- country of origin
- family
- type
- flavour
(Just 10 to avoid overtaxing the website! Continue adding a 1 second pause between page queries.)
### Fetching Cheeses with URL
```{r}
#| label: fetching-cheeses-with-url
# Source for str_extract: https://stackoverflow.com/questions/57438472/using-str-extract-in-r-to-extract-a-number-before-a-substring-with-regex
# Source for summarise_all: https://stackoverflow.com/questions/64062261/collapsing-strings-with-summarise-all
# Source for sub: https://stackoverflow.com/questions/25307899/r-remove-anything-after-comma-from-column
get_cheese_info_url <- function(cheese_url) {
  # Read in the URL
  webpage <- read_html(cheese_url)

  # Fetch the cheese name
  cheese_type <- webpage |>
    html_node("h1") |>
    html_text() |>
    str_trim()

  # Fetch all of the cheese info
  cheese_info <- webpage |>
    html_nodes("li p") |>
    html_text() |>
    unique()

  # Extract each piece of cheese info, dropping the label text
  made_from <- str_extract(cheese_info, "(?<=Made from ).*")
  country <- str_extract(cheese_info, "(?<=Country of origin: ).*")
  family <- str_extract(cheese_info, "(?<=Family: ).*")
  type <- str_extract(cheese_info, "(?<=Type: ).*")
  flavour <- str_extract(cheese_info, "(?<=Flavour: ).*")

  # Convert to a data frame
  cheese_info_df <- data.frame(
    cheese_type = cheese_type,
    made_from = made_from,
    Country = country,
    Family = family,
    Type = type,
    Flavour = flavour
  ) |>
    # Collapse each column to a single string, dropping the NAs
    summarise_all(~ toString(na.omit(.))) |>
    # Keep only one copy of the cheese name
    mutate(cheese_type = sub(",.*", "", cheese_type))

  # Return the data frame
  cheese_info_df
}
```
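As a quick sanity check, the function can be run on a single cheese page such as gouda (not evaluated here, to limit requests):

```{r}
#| eval: false
#| label: get-cheese-info-single-example
# Detailed info for one cheese
get_cheese_info_url("https://www.cheese.com/gouda/")
```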
```{r}
#| label: map-function-for-cheese-info
# Select the first 10 cheeses
cheeses_top10 <- all_cheeses |>
  slice_head(n = 10)

# Iterate through the function for each of the 10 cheeses,
# with slowly() adding a 1 second pause between page queries
cheese_info <- map(cheeses_top10$URL,
                   slowly(get_cheese_info_url,
                          rate = rate_delay(pause = 1))) |>
  bind_rows()

cheese_info
```
**Part 6:** Evaluate the code that you wrote in terms of the [core principle of good function writing](function-strategies.qmd). To what extent does your implementation follow these principles? What are you learning about what is easy / challenging for you about approaching complex tasks requiring functions and loops?
In this lab, we wrote functions that demonstrate reusability and efficiency. For example, `get_cheese()` takes a single argument, `url`, which is also defined above the function. It reads the page at that URL and uses CSS selectors to assemble the URL for each individual cheese, the cheese's name, and whether the cheese has an image; we then called it in a loop to build the `all_cheeses` data frame. Similarly, `get_cheese_info_url()` takes a single argument, `cheese_url`, reads in that page, and uses CSS selectors to obtain the cheese's name and detailed information. Calling these functions to create the final `cheese_info` data frame demonstrates their reproducibility, reusability, and efficiency. We found that breaking the task into one function per page type made the scraping straightforward, while the hardest parts were diagnosing mismatched selectors and keeping the loop and map code consistent.