- download R, install
- download RStudio, install
- run script:
  source('https://raw.githubusercontent.com/HoustonUseRs/intro-to-web-scraping/master/scripts/setup.r')
- Pick a page
- Load a page
- Look at page
- Determine where info is
- Select info
- Copy info
- New file
- Paste into new file
- Save new file
- R
- RStudio
- Chrome Dev Tools
- Chrome Selector Gadget Plugin
- Text Editor
- Go to the website (www.houstontx.gov/departments.html)
- Using Selector Gadget, click on the type of data you want. In this case, we want the names of all the people, so click on one of the names; the chosen name turns green and all other items matching the selector turn yellow. Click any unwanted yellow items to turn them red. Selector Gadget then works out the most succinct selector that captures the desired content. In this case, it is:
  .table150 a:nth-child(1)
- In RStudio, make a new R script (Shift+Cmd+N)
- Recreate the manual steps (above) in R.
- Load the rvest package:
  library(rvest)
- to run code: Cmd + Enter
- copy the url (http://www.houstontx.gov/departments.html)
- save the url as a variable:
  depts_url <- 'http://www.houstontx.gov/departments.html'
- create a variable for the parsed html:
  depts_html <- read_html(depts_url)
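With the page parsed, the selector found earlier with Selector Gadget can be used to pull out the names. A minimal sketch, assuming rvest is loaded and depts_html was created as above:

```
# Extract the contact names using the Selector Gadget selector from earlier
depts_names <- depts_html %>%
  html_nodes('.table150 a:nth-child(1)') %>%
  html_text()

head(depts_names)  # quick sanity check of the first few results
```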
- open a file connection for the output:
  fileConn <- file("output.txt")
- write the scraped email addresses to the file:
  writeLines(depts_emails, fileConn)
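One detail worth adding: the connection created with file() should be closed once the file is written. A minimal sketch of this step, assuming depts_emails already holds a character vector of scraped addresses:

```
fileConn <- file("output.txt")      # open a connection to the output file
writeLines(depts_emails, fileConn)  # write one address per line
close(fileConn)                     # release the connection when done
```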
Putting the steps together, the script so far looks like this:

```
depts_url <- 'http://www.houstontx.gov/departments.html'
depts_html <- read_html(depts_url)
depts_html %>%
  html_nodes('.table150 a') %>%
  html_attr('href')
```
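The hrefs returned above presumably feed depts_emails. If, as the variable name suggests, the relevant links are mailto: addresses, one way to build that vector looks like this (the mailto: filtering is an assumption, not part of the original notes):

```
# Hypothetical construction of depts_emails, assuming the links are mailto: addresses
depts_links <- depts_html %>%
  html_nodes('.table150 a') %>%
  html_attr('href')

depts_emails <- depts_links[grepl('^mailto:', depts_links)]  # keep only mailto: links
depts_emails <- sub('^mailto:', '', depts_emails)            # strip the mailto: prefix
```

The resulting vector is what the fileConn / writeLines() / close() step above writes to output.txt.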