Troubleshooting webscraping

mespe edited this page Oct 29, 2016 · 16 revisions

Problem

Matt Espe's R code stopped working. From R, it sends a query to a Web site:

https://power.larc.nasa.gov/cgi-bin/agro.cgi

People typically use this interactively in a Web browser. The page allows one to specify a start and end date, latitude and longitude and which data/columns you want returned.

Matt's code worked for several years. Then a month or so ago it broke. One thing that changed is that the site switched from HTTP to HTTPS (not sure of the difference? See https://en.wikipedia.org/wiki/HTTPS).

Strategy for solving this.

We took the quick, direct "try and see" approach to see whether we could resolve this rapidly, rather than discussing the big picture of "what are you trying to do?" We went straight to what was failing and asked whether changing one or two things would get it working. And it worked.

This isn't always a very good approach. But it can be worth trying for 10 minutes or so to see if we can get things working again.

Steps

  • In the Web browser, check the service still works. (This is the equivalent of checking that the computer is plugged in. No use troubleshooting code if the web service is down.)

  • When it does, open the Developer tools in the Web browser, go to the Network panel specifically, and rerun the query. Most web browsers have this tool. In Chrome, you can right-click and select "Inspect" (or press Ctrl+Shift+I).

  • Examine the details of the top-level request. Is it a GET or a POST? (What does this mean? See http://www.w3schools.com/TAGS/ref_httpmethods.asp.) What are the parameters being sent? Are there cookies in the request? What is the user-agent? The referer page?

  • In R, issue the query, but with verbose = TRUE in the HTTP request options. This allows us to see more details about the communication between R and the server.

  • This showed that the SSL certificates were found and that the client connected to the server. But something was still not working from R.

  • We added the cookie (cut-and-paste from the browser), the user agent and the referer page to the request from R.

Still no success.

  • We looked at the parameters being sent across. These were identical, except

    • the R request had an additional parameter p="..."

    • the Web browser request included the parameter submit=Yes

  • We added the submit=Yes, and removed the p="...".

That worked.
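
The parameter comparison in the steps above can be sketched in base R. This is only an illustration: the parameter values are made up, the value of p stays elided as in the notes, and query_power() is a hypothetical helper (not Matt's actual function) that assumes the form is submitted with RCurl's getForm(). The cookie, user agent and referer arguments are placeholders for whatever the browser's Network panel shows.

```r
# Parameters as seen in the browser's Network panel vs. those sent from R.
# The values are illustrative; the value of p is elided in the notes above.
browser_params <- c(ys = "1995", ye = "1995", lat = "39.5", lon = "-121.5",
                    submit = "Yes")
r_params <- c(ys = "1995", ye = "1995", lat = "39.5", lon = "-121.5",
              p = "...")

# Names sent only from R, and names sent only from the browser.
only_in_r <- setdiff(names(r_params), names(browser_params))
only_in_browser <- setdiff(names(browser_params), names(r_params))
print(only_in_r)        # "p"
print(only_in_browser)  # "submit"

# Align the R request with the browser: drop the extra, add what is missing.
fixed_params <- c(r_params[!names(r_params) %in% only_in_r],
                  browser_params[only_in_browser])

# Hypothetical helper: send the query with verbose output and browser-like
# headers. It is only defined here, not called, because it needs the RCurl
# package and network access. GET via getForm() is an assumption.
query_power <- function(params, cookie = "", user_agent = "Mozilla/5.0",
                        referer = "https://power.larc.nasa.gov/cgi-bin/agro.cgi") {
  RCurl::getForm("https://power.larc.nasa.gov/cgi-bin/agro.cgi",
                 .params = params,
                 .opts = list(verbose = TRUE, cookie = cookie,
                              useragent = user_agent, referer = referer))
}
```

Calling query_power(fixed_params) would then issue the request with the same parameter names the browser sends.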


Different approach

Matt had written the function by hand. However, he also used the RHTMLForms package to programmatically generate a function from the HTML form.

The RHTMLForms package can parse an online form for input fields and valid inputs, and can be a useful way to create a function that will submit an online form. However, it cannot parse HTTPS forms without help from functions in the RCurl package.

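library(RHTMLForms)
library(RCurl)
library(XML)

u = "https://power.larc.nasa.gov/cgi-bin/agro.cgi"
# Fetch the page with RCurl (which handles HTTPS), then parse the HTML.
doc = htmlParse(getURLContent(u))
# Record the source URL on the parsed document.
docName(doc) = u
# Extract descriptions of the forms on the page.
desc = getHTMLFormDescription(doc)

# Generate an R function from the first form.
my_fun = createFunction(desc[[1]])

# Submit the form, showing the details of the HTTP exchange.
my_fun(ys = 1995, ye = 1995, lat = 39.5, lon = -121.5, .opts = list(verbose = TRUE))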

This didn't work either.

So we attempted to have it add submit=Yes. This comes from the buttons in the original HTML form. So we passed addSubmit = TRUE in the call to createFunction().

That also didn't work.

Again, we used verbose = TRUE to show the details of the request. The submit=Yes was not included. Why not?

Look at the form description desc[[1]]. It doesn't have a button in there. That is because getHTMLFormDescription() discards these. We have to tell it not to with:

desc = getHTMLFormDescription(doc, dropButtons = FALSE)

This keeps the buttons and submit=Yes gets added. However, the request still fails to give the results we want.

Again, we looked at the details of the request and compared them to those in the Network panel of the Web browser. We see de=31%0A&.... The "de" is the end day of the month, which is 31 by default. What is the %0A? It is the URL-encoding of \n, the newline character. Where did this come from?
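
A quick way to confirm what %0A stands for is to decode it with base R's URLdecode() and URLencode(). This is just a check of the encoding, not part of the original debugging session.

```r
# Decode the percent-encoded value seen in the verbose output.
decoded <- URLdecode("31%0A")
print(decoded == "31\n")   # TRUE

# Encoding the newline-terminated value reproduces what was sent.
encoded <- URLencode("31\n", reserved = TRUE)
print(encoded)             # "31%0A"

# The clean value has nothing to encode.
clean <- URLencode("31", reserved = TRUE)
print(clean)               # "31"
```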

We can specify the value "31" ourselves. However, that doesn't change things, which is odd. We can debug how this all works and see where the \n is added.

The short answer is that the formQuery() function checks that the values provided by the user, along with the default values, are consistent with the valid inputs specified in the form. For good reason, it expands each value provided by the user to the complete matching value specified in the form. In this case, the HTML has:

 <option>31   
 </select>

So the HTML parser includes the \n between the 31 and the </select> in the option's text. This is because there is no closing </option>, as there should be. So "31\n" is the last of the permissible values.

We fixed this in the RHTMLForms package by trimming the white space from <option> values when they are taken from the content of the node rather than from the value=".." attribute of the <option> node.
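
The effect can be reproduced in base R. The sketch below is not the RHTMLForms code: expand_value() is a hypothetical stand-in for the validation step, and using pmatch() for the matching is an assumption made only to show how a clean "31" can turn back into "31\n", and why trimming the option text fixes it.

```r
# Permissible values as the parser reported them: the last <option> has no
# closing tag, so its text picks up the newline that follows it.
permitted <- c("29", "30", "31\n")

# Stand-in for the validation step (an assumption, not RHTMLForms code):
# expand a user-supplied value to the permitted value it partially matches.
expand_value <- function(value, choices) {
  idx <- pmatch(value, choices)
  if (is.na(idx)) stop("not a permitted value: ", value)
  choices[idx]
}

# Even when the user passes a clean "31", the newline comes back.
sent_before <- expand_value("31", permitted)
print(sent_before == "31\n")  # TRUE

# The fix: trim white space from the option text before using it.
sent_after <- expand_value("31", trimws(permitted))
print(sent_after)             # "31"
```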

Then things work fine!

Since Duncan owns the RHTMLForms package, the fix was made directly in the source. If it were not a package under our control, we could have submitted a pull request to the package's repository.