---
title: "Lab 5"
author: "Eva, Chloe, and Kai"
format:
  html:
    self-contained: true
    code-tools: true
    code-fold: true
editor: visual
execute:
  echo: true
  include: true
  message: false
  warning: false
embed-resources: true
theme: cerulean
---
> **Goal:** Scrape information from <https://www.cheese.com> to obtain a dataset of characteristics about different cheeses, and gain deeper insight into your coding process. 🪤
```{r}
#| label: libraries
library(rvest)
library(tidyverse) # includes dplyr, stringr, and purrr, which are used below
```
**Part 1:** Locate and examine the `robots.txt` file for this website. Summarize what you learn from it.
The `robots.txt` file is located at <https://www.cheese.com/robots.txt>. It points to the sitemap at <https://www.cheese.com/sitemap.xml> and specifies `User-agent: *`, meaning its rules apply to all crawlers; since it lists no disallowed paths, anyone may scrape the site. The sitemap includes the URL for every type of cheese, listed in alphabetical order.
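As a quick check, the file can be read directly from R with base `readLines()` (a minimal sketch, left unevaluated here):

```{r}
#| eval: false
#| label: inspect-robots-txt
# Print the raw contents of the robots.txt file
readLines("https://www.cheese.com/robots.txt")
```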
**Part 2:** Learn about the `html_attr()` function from `rvest`. Describe how this function works with a small example.
The `html_attr()` function extracts the value of a single named attribute from each selected element. In the example below, `html_elements("a")` selects every anchor (hyperlink) element on the page, and `html_attr("class")` returns the `class` attribute of each one.
```{r}
#| message: false
#| label: function-use-example
url <- "https://www.cheese.com/alphabetical"
webpage <- read_html(url)

webpage |>
  html_elements("a") |>
  html_attr("class")
```
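To see `html_attr()` in isolation, here is a small self-contained sketch using rvest's `minimal_html()` helper; the HTML snippet is invented purely for illustration:

```{r}
#| label: html-attr-minimal-sketch
# A made-up one-link page, just to show what html_attr() returns
snippet <- minimal_html('<a href="https://example.com" class="nav-link">Example</a>')

snippet |>
  html_elements("a") |>
  html_attr("href")
```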
**Part 3:** (Do this alongside Part 4 below.) I used [ChatGPT](https://chat.openai.com/chat) to start the process of scraping cheese information with the following prompt:
> Write R code using the rvest package that allows me to scrape cheese information from cheese.com.
Fully document your process of checking this code. Record any observations you make about where ChatGPT is useful / not useful.
- ChatGPT's code is useful for getting the cheese names on the first page; however, it does not retrieve the cheese URLs accurately. Some URLs are scraped twice, which offsets the cheese names from their correct URLs (see the check after the code block below).
```{r}
#| eval: false
#| label: small-example-of-getting-cheese-info
# Define the URL for the alphabetical list of all cheeses on the site
url <- "https://www.cheese.com/alphabetical"

# Read the HTML content from the webpage
webpage <- read_html(url)

# Extract the cheese URLs
cheese_data <- webpage %>%
  html_nodes(".cheese-item h3") %>%  # We added h3 to fix the duplicates problem
  html_nodes("a") %>%
  html_attr("href") %>%
  paste0("https://cheese.com", .)    # Build the full URL for each cheese

# Extract the cheese names
cheese_names <- webpage %>%
  html_nodes(".cheese-item h3") %>%
  html_text()

# Create a data frame to store the results
cheese_df <- data.frame(Name = cheese_names,
                        URL = cheese_data,
                        stringsAsFactors = FALSE)

# Print the data frame (the explicit print() call is unnecessary)
print(cheese_df)
```
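One way to confirm the misalignment is to compare how many links come back with and without the `h3` restriction; this sketch reflects our assumption about where the duplicates come from and is not evaluated here:

```{r}
#| eval: false
#| label: checking-duplicate-urls
# Without the h3 restriction, more links than cheese names come back,
# so the Name and URL columns get offset from each other
length(webpage %>% html_nodes(".cheese-item a") %>% html_attr("href"))
length(webpage %>% html_nodes(".cheese-item h3 a") %>% html_attr("href"))
length(cheese_names)
```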
**Part 4:** Obtain the following information for **all** cheeses in the database:
- Cheese name
- URL for the cheese's webpage (e.g., <https://www.cheese.com/gouda/>)
- Whether or not the cheese has a picture (e.g., [gouda](https://www.cheese.com/gouda/) has a picture, but [bianco](https://www.cheese.com/bianco/) does not)
To be kind to the website owners, please add a 1 second pause between page queries. (Note that you can view 100 cheeses at a time.)
```{r}
#| label: function-to-extract-cheese-name-url-image
# Source for stringr cheat sheet: https://github.com/rstudio/cheatsheets/blob/main/strings.pdf
url <- "https://www.cheese.com/alphabetical?per_page=100"

get_cheese <- function(url) {
  # Read the HTML content from the webpage
  webpage <- read_html(url)

  # Extract the cheese URLs
  cheese_data <- webpage %>%
    html_nodes(".cheese-item h3") %>%
    html_nodes("a") %>%
    html_attr("href") %>%
    paste0("https://cheese.com", .)

  # Extract the cheese names
  cheese_names <- webpage %>%
    html_nodes(".cheese-item h3") %>%
    html_text()

  # Flag whether each cheese has a picture: the site serves placeholder
  # images from /static/ and real photos from /media/
  cheese_img <- webpage |>
    html_nodes(".cheese-item img") |>
    html_attr("src") |>
    str_replace("^/static/.*", "No image") |>
    str_replace("^/media/.*", "Yes")

  # Create a data frame to store the results
  cheese_df <- data.frame(Name = cheese_names,
                          URL = cheese_data,
                          Img = cheese_img,
                          stringsAsFactors = FALSE)

  # Return the data frame
  cheese_df
}
```
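For example, `get_cheese()` can be applied to a single page of 100 results (left unevaluated to avoid extra requests to the site):

```{r}
#| eval: false
#| label: get-cheese-single-page-example
# Scrape the first page of 100 cheeses and peek at the result
get_cheese("https://www.cheese.com/alphabetical?per_page=100&page=1") |>
  head()
```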
```{r}
#| label: fetching-all-cheese-data-from-website
# Source for the loop pattern: https://stackoverflow.com/questions/27153263/adding-elements-to-a-list-in-for-loop-in-r
# Start with an empty object to accumulate results
all_cheeses <- NULL

# Iterate across each of the 20 pages of 100 cheeses on the website
for (i in 1:20) {
  # Construct the URL for each page
  url <- paste0("https://www.cheese.com/alphabetical?per_page=100&page=", i)

  # Pause execution for one second between queries
  Sys.sleep(1)

  # Get the URL, name, and image status for every cheese on this page
  cheese <- get_cheese(url)

  # Combine the data from all of the pages
  all_cheeses <- bind_rows(all_cheeses, cheese)
}
```
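As a sketch of an alternative, the same scrape can be written with purrr's `map()` rather than growing the data frame inside a `for` loop; this assumes the site still exposes 20 pages of 100 cheeses:

```{r}
#| eval: false
#| label: fetching-all-cheeses-purrr-sketch
# Equivalent purrr version: build one data frame per page, then bind them
all_cheeses <- map(1:20, function(i) {
  Sys.sleep(1) # be kind to the website owners
  get_cheese(paste0("https://www.cheese.com/alphabetical?per_page=100&page=", i))
}) |>
  bind_rows()
```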
**Part 5:** When you go to a particular cheese's page (like [gouda](https://www.cheese.com/gouda/)), you'll see more detailed information about the cheese. For [**just 10**]{.underline} of the cheeses in the database, obtain the following detailed information:
- milk information
- country of origin
- family
- type
- flavour
(Just 10 to avoid overtaxing the website! Continue adding a 1 second pause between page queries.)
### Fetching Cheeses with URL
```{r}
#| label: fetching-cheeses-with-url
# Source for str_extract: https://stackoverflow.com/questions/57438472/using-str-extract-in-r-to-extract-a-number-before-a-substring-with-regex
# Source for summarise_all: https://stackoverflow.com/questions/64062261/collapsing-strings-with-summarise-all
# Source for sub: https://stackoverflow.com/questions/25307899/r-remove-anything-after-comma-from-column
get_cheese_info_url <- function(cheese_url) {
  # Read in the URL
  webpage <- read_html(cheese_url)

  # Fetch the cheese name
  cheese_type <- webpage |>
    html_node("h1") |>
    html_text() |>
    str_trim()

  # Fetch all of the cheese info
  cheese_info <- webpage |>
    html_nodes("li p") |>
    html_text() |>
    unique()

  # Extract each piece of cheese info, dropping the label text
  made_from <- str_extract(cheese_info, "(?<=Made from ).*")
  country <- str_extract(cheese_info, "(?<=Country of origin: ).*")
  family <- str_extract(cheese_info, "(?<=Family: ).*")
  type <- str_extract(cheese_info, "(?<=Type: ).*")
  flavour <- str_extract(cheese_info, "(?<=Flavour: ).*")

  # Convert to a data frame
  cheese_info_df <- data.frame(
    cheese_type = cheese_type,
    made_from = made_from,
    Country = country,
    Family = family,
    Type = type,
    Flavour = flavour
  ) |>
    # Collapse each column to a single string, dropping the NAs
    summarise_all(~ toString(na.omit(.))) |>
    # Keep only one copy of the cheese name
    mutate(cheese_type = sub(",.*", "", cheese_type))

  # Return the data frame
  cheese_info_df
}
```
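As a quick sanity check, the function can be run on a single cheese page such as gouda (not evaluated here, to limit requests):

```{r}
#| eval: false
#| label: get-cheese-info-single-example
# Detailed info for one cheese
get_cheese_info_url("https://www.cheese.com/gouda/")
```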
```{r}
#| label: map-function-for-cheese-info
# Select the first 10 cheeses
cheeses_top10 <- all_cheeses |>
  slice_head(n = 10)

# Iterate through the function for each of the 10 cheeses,
# with slowly() adding a 1 second pause between page queries
cheese_info <- map(cheeses_top10$URL,
                   slowly(get_cheese_info_url,
                          rate = rate_delay(pause = 1))) |>
  bind_rows()

cheese_info
```
**Part 6:** Evaluate the code that you wrote in terms of the [core principle of good function writing](function-strategies.qmd). To what extent does your implementation follow these principles? What are you learning about what is easy / challenging for you about approaching complex tasks requiring functions and loops?
In this lab, we wrote functions that demonstrate reusability and efficiency. For example, `get_cheese()` takes a single argument, `url`, which is also defined above the function. It reads the page at that URL and uses CSS selectors to assemble the URL for each individual cheese, the cheese's name, and whether the cheese has an image; we then called it in a loop to build the `all_cheeses` data frame. Similarly, `get_cheese_info_url()` takes a single argument, `cheese_url`, reads in that page, and uses CSS selectors to obtain the cheese's name and detailed information. Calling these functions to create the final `cheese_info` data frame demonstrates their reproducibility, reusability, and efficiency. We found that breaking the task into one function per page type made the scraping straightforward, while the hardest parts were diagnosing mismatched selectors and keeping the loop and map code consistent.