---
title: "Web Scraping Craft Beer Ratings From Beer Advocate with import.io & R"
author: "Jasmine Dumas"
date: 'Created: December 14, 2015; Updated March 4, 2016'
output: html_document
---
# Introduction to Import.io and Beer Advocate
Getting information on craft breweries can be a difficult process, with the data dispersed over multiple websites or stored in formats unusable for analysis. [Import.io](https://import.io/) _instantly_ turns web pages into analysis-ready data with minimal or no set-up involved. A previous [tutorial](http://trendct.org/2016/01/08/how-to-scrape-airbnb-data-without-programming-using-import-io/) covers the process in detail. This method works great for extracting metadata on craft breweries and overall beer ratings from Beer Advocate, a popular online community that supports beer education, events, and a forum for rating beers.
# Getting the websites that contain our data of interest
1. Go to [BeerAdvocate.com](http://www.beeradvocate.com/), navigate to the place directory for the state of [Connecticut](http://www.beeradvocate.com/place/directory/9/US/CT/), then on to the [breweries list](http://www.beeradvocate.com/place/list/?c_id=US&s_id=CT&brewery=Y).
[Screenshot 1](https://www.dropbox.com/s/j4djidmmszewc5m/BA-1.jpeg?dl=0)
[Screenshot 2](https://www.dropbox.com/s/jdteuq03cb1stqv/BA-2.jpeg?dl=0)
[Screenshot 3](https://www.dropbox.com/s/y5r1q8e5dq67hdy/BA-3.jpeg?dl=0)
2. We need to determine the URL structure created by the pagination on Beer Advocate, but luckily this is fairly simple: click each of the results links (i.e. 1-20, 21-40, 41-60) and note the `start=` parameter. Here are the necessary links (a short R sketch for generating them follows the screenshots):
[http://www.beeradvocate.com/place/list/?start=0&c_id=US&s_id=CT&brewery=Y&sort=name](http://www.beeradvocate.com/place/list/?start=0&c_id=US&s_id=CT&brewery=Y&sort=name),
[http://www.beeradvocate.com/place/list/?start=20&c_id=US&s_id=CT&brewery=Y&sort=name](http://www.beeradvocate.com/place/list/?start=20&c_id=US&s_id=CT&brewery=Y&sort=name),
[http://www.beeradvocate.com/place/list/?start=40&c_id=US&s_id=CT&brewery=Y&sort=name](http://www.beeradvocate.com/place/list/?start=40&c_id=US&s_id=CT&brewery=Y&sort=name).
[Screenshot 4](https://www.dropbox.com/s/yus6li8186qlfg7/BA-4.jpeg?dl=0)
[Screenshot 5](https://www.dropbox.com/s/vgqahcnceoopmnk/BA-5.jpeg?dl=0)
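If you would rather generate these paginated links than copy them by hand, here is a minimal R sketch; the query-string pattern comes straight from the URLs above:
```{r}
# build the three paginated search URLs by varying the start= offset (0, 20, 40)
base_url <- "http://www.beeradvocate.com/place/list/?start=%d&c_id=US&s_id=CT&brewery=Y&sort=name"
page_urls <- sprintf(base_url, seq(0, 40, by = 20))
page_urls
```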
3. After setting up an account on import.io (which can be done by linking your GitHub account), navigate to your [my data](https://import.io/data/mine/) page, paste the links above into the bulk extractor (located in the "How would you like to use this API?" dropdown for your _Magic API_), and press the button to run the queries.
[Screenshot 6](https://www.dropbox.com/s/cessjc3yhzhf0ov/BA-6.jpeg?dl=0)
4. The resulting output page is a tabular view of all the extracted data, ready for export in multiple formats: Spreadsheet (CSV), HTML, and JSON. We will download the Spreadsheet format for this tutorial.
[Screenshot 7](https://www.dropbox.com/s/cwi2o8acq0fnr85/BA-7.jpeg?dl=0)
# Data pre-processing
5. Read the data into R:
```{r, message=FALSE}
# read the CSV exported from import.io (the filename is the default download name)
beer_advocate <- read.csv("beeradvocate.com API 14th Dec 10-37.csv")
# peek at the first few rows
beer_head <- head(beer_advocate)
```
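Before diving into the cleanup, a quick structural look at the raw export is worthwhile; this is just a sanity sketch and assumes nothing beyond the CSV read in above:
```{r}
# dimensions and a truncated structure listing of the raw import.io export
dim(beer_advocate)
str(beer_advocate, list.len = 5)
```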
6. So we have some links and columns filled with more than one piece of information, but that's easy to fix: remove duplicate columns, create coherent column headers, specify and unify missing data, and extract the necessary information from other columns.
```{r}
# remove duplicate columns with unimportant info
beer_advocate$bottomdark_links._text <- NULL
beer_advocate$bottomdark_links._source <- NULL
beer_advocate$pageUrl <- NULL
beer_advocate$image <- NULL
beer_advocate$bottomdark_content_numbers._source <- NULL
beer_advocate$bottomlight_number <- NULL
beer_advocate$link._source <- NULL
# change the column names
colnames(beer_advocate) <- c("address", "brewery_website", "beer_advocate_website", "number_of_reviews","beer_avg", "number_of_beers", "place_type", "brewery_rating", "brewery_name", "phone_number")
# get our missing values as NA
beer_advocate[beer_advocate == "-"] = NA
# clean up some of the columns and remove extra data
beer_advocate$address <- as.character(beer_advocate$address) # change into character data type
beer_advocate$address <- gsub(" official website", "", beer_advocate$address)
# extract the phone number from the address column
# http://stackoverflow.com/questions/21007607/extract-phone-number-regex
library(stringr)
# str_extract() returns the first match per address (and NA when there is none),
# which keeps phone_number a plain character column
beer_advocate$phone_number <- str_extract(beer_advocate$address, "\\(?\\d{3}\\)?[.-]? *\\d{3}[.-]? *[.-]?\\d{4}")
# remove the phone number tag from the address field for geocoding
beer_advocate$address <- gsub(" phone: \\(?\\d{3}\\)?[.-]? *\\d{3}[.-]? *[.-]?\\d{4}", "", beer_advocate$address)
# change the factor for average beer rating into a numeric
# http://stackoverflow.com/questions/3418128/how-to-convert-a-factor-to-an-integer-numeric-without-a-loss-of-information
beer_advocate$beer_avg <- as.numeric(levels(beer_advocate$beer_avg))[beer_advocate$beer_avg]
```
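A quick check that the cleanup behaved as expected; a small sanity sketch:
```{r}
# beer_avg should now be numeric, with the "-" placeholders converted to NA
summary(beer_advocate$beer_avg)
# addresses that are NA here will fail to geocode in the next step
sum(is.na(beer_advocate$address))
```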
# Geolocation of addresses for plotting on a map
7. Grab latitudes and longitudes for those craft brewery addresses!
```{r, message=FALSE}
library(htmlwidgets)
library(leaflet)
library(ggmap)
# This function geocodes a location (find latitude and longitude) using the Google Maps API
geo <- geocode(location = beer_advocate$address, output="latlon", source="google")
# add those coordinates to our dataset
beer_advocate$lon <- geo$lon
beer_advocate$lat <- geo$lat
```
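The Google geocoder occasionally fails on an address (and the free API is rate-limited), so a defensive step that drops rows without coordinates keeps the map clean; treat this sketch as optional:
```{r}
# drop any breweries the geocoder could not resolve
beer_advocate <- beer_advocate[!is.na(beer_advocate$lat) & !is.na(beer_advocate$lon), ]
```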
# Data Visualization using Leaflet
8. Plot the coordinates and brewery details on a JavaScript-based map with Leaflet.
```{r}
# create the pop-up info (paste0 keeps the link HTML intact; <br/> separates the lines)
beer_advocate$pop_up_content <- paste(
  paste0("<b><a href='", beer_advocate$beer_advocate_website, "'>",
         beer_advocate$brewery_name, "</a></b>"),
  beer_advocate$address,
  beer_advocate$phone_number,
  paste0("Average beer rating: ", beer_advocate$beer_avg, "/5.0"),
  sep = "<br/>")
# add colors corresponding to the average beer ratings
# (colorBin() gives six discrete bins; leaflet's colorNumeric() has no n argument)
pal <- colorBin("YlOrRd", beer_advocate$beer_avg, bins = 6)
m <- leaflet(data = beer_advocate) %>%
addTiles() %>%
setView(lng =-72.70190, lat=41.75798, zoom = 8) %>%
addCircleMarkers(~lon, ~lat, popup = ~as.character(pop_up_content), color = ~pal(beer_avg)) %>%
addLegend("topright", pal = pal, values = ~beer_avg,
title = "Average Beer Ratings at Brewery",
opacity = 1
)
m
```
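Since `htmlwidgets` is already loaded, the finished map can also be written out as a standalone page; the filename here is arbitrary:
```{r, eval=FALSE}
# save the interactive map as a self-contained HTML file
saveWidget(m, file = "ct-breweries-map.html")
```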