Finalised submission for second review ropensci

ropensci-archive · Aug 25, 2016 · 608c25f · 608c25f
1 parent fc615cc
commit 608c25f
Show file tree

Hide file tree

Showing 10 changed files with 221 additions and 126 deletions.
diff --git a/README.Rmd b/README.Rmd
@@ -35,20 +35,24 @@ The package is designed to collect data efficiently. It allows to download multi
 
 For similar functionalities see also the [openair](https://cran.r-project.org/package=openair) package, which relies on a local copy of the data on servers at King's College (UK), and the [ropenaq](https://CRAN.R-project.org/package=ropenaq) which provides UK-AIR latest measured levels (see https://uk-air.defra.gov.uk/latest/currentlevels) as well as data from other countries.
 
-### Dependencies
+# Dependencies & Installation
+
+## Dependencies
+
 The rdefra package depends on two things: 
 
 * The Geospatial Data Abstraction Library (gdal). If you use linux/ubuntu, this can be installed with the following command: `sudo apt-get install -y r-cran-rgdal`. 
 
 * Some additional CRAN packages. Check for missing dependencies and install them using the commands below:
 
 ```{r}
-packs <- c('httr', 'xml2', 'lubridate', 'tibble', 'dplyr', 'sp', 'devtools')
+packs <- c('httr', 'xml2', 'lubridate', 'tibble', 'dplyr', 'sp', 'devtools',
+           'leaflet', 'zoo', 'testthat')
 new.packages <- packs[!(packs %in% installed.packages()[,"Package"])]
 if(length(new.packages)) install.packages(new.packages)
 ```
 
-### Installation
+## Installation
 
 You can install this package from CRAN:
 
@@ -69,14 +73,14 @@ Load the rdefra package:
 library(rdefra)
 ```
 
-### Functions
+# Functions
 
 The package logic assumes that the user access the UK-AIR database in two steps:
 
 1. Browse the catalogue of available stations and selects some stations of interest.
 2. Retrieves data for the selected stations.
 
-#### Get metadata catalogue
+## Get metadata catalogue
 DEFRA monitoring stations can be downloaded and filtered using the function `ukair_catalogue()` with no input parameters, as in the example below. 
 
 ```{r, eval= TRUE}
@@ -99,77 +103,96 @@ stations_EnglandOzone <- ukair_catalogue(pollutant = 1, country_id = 1)
 
 The example above shows how to retrieve the `r dim(stations_EnglandOzone)[1]` stations in England in which ozone is measured.
 
-#### Get missing coordinates
+## Get missing coordinates
 Locating a station is extremely important to be able to carry out any spatial analysis. If coordinates are missing, for some stations in the catalogue, it might be possible to retrieve Easting and Northing (coordinates in the British National Grid) from DEFRA's web pages. Get E and N, transform them to latitude and longitude and populate the missing coordinates using the code below.
 
 ```{r}
 # Scrape DEFRA website to get Easting/Northing
 stations <- ukair_get_coordinates(stations_raw)
 ```
 
-#### Check hourly data availability
+## Check hourly data availability
 Pollution data started to be collected in 1972 and consists of hourly concentration of various species (in &#956;g/m<sup>3</sup>), such as ozone (O<sub>3</sub>), particulate matters (PM<sub>2.5</sub> and PM<sub>10</sub>), nitrogen dioxide (NO<sub>2</sub>), sulphur dioxide (SO<sub>2</sub>), and so on.
 
 The ID under which they are available differs from the UK.AIR.ID. The catalogue does not contain this additional station ID (called SiteID hereafter) but DEFRA's web pages contain references to both the UK.AIR.ID and the SiteID. The function below uses as input the UK.AIR.ID and outputs the SiteID, if available. 
 
 ```{r}
-stations$SiteID <- ukair_get_site_id(as.character(stations$UK.AIR.ID))
+stations$SiteID <- ukair_get_site_id(stations$UK.AIR.ID)
 ```
 
-### Get hourly data
+## Get hourly data
 
 The time series for a given station can be retrieved in one line of code:
 
-```{r}
-df <- ukair_get_hourly_data("ABD", years=2012:2016)
+```{r, eval = TRUE}
+# Get 1 year of hourly ozone data from London Marylebone Road monitoring station
+df <- ukair_get_hourly_data("MY1", years=2015)
+
+library(zoo)
+plot(zoo(x = df$Ozone, order.by = as.POSIXct(df$datetime)), 
+     main = "", xlab = "", ylab = expression(paste("Ozone concentration [",
+                                                    mu, "g/m^3]")))
 ```
 
+Highest concentrations seem to happen in late spring and at the beginning of summer. In order to check whether this happens every year, we can download multiple years of data and then compare them.
 
-### Cached catalogue
+```{r}
+# Get 3 years of hourly ozone data from the same monitoring station
+years <- 2013:2015
+df <- ukair_get_hourly_data("MY1", years)
+df$year <- lubridate::year(df$datetime)
+
+par(mfrow=c(3,1)) 
+for(yearDF in years){
+  df1 <- df[which(df$year == yearDF),]
+  plot(zoo(x = df1$Ozone, order.by = as.POSIXct(df1$datetime)), 
+     main = "", xlab = "", ylab = yearDF)
+}
+```
+
+# Cached catalogue
 
 For convenience, a cached version of the catalogue (last updated in August 2016) is included in the package and can be loaded using the following command:
 
-```{r}
+```{r, eval = TRUE}
 data('stations')
 ```
 
-### Applications
-
 The cached catalogue contains all the available site IDs and coordinates and can be quickly used as lookup table to find out the correspondence between the UK.AIR.ID and SiteID, as well as to investigate station characteristics. 
 
-In the raw catalogue, `r length(which(complete.cases(stations_raw[, c("Latitude", "Longitude")])))` stations contain valid coordinates. After scraping DEFRA's web pages, the number of stations with valid coordinates rises to `r length(which(complete.cases(stations[, c("Latitude", "Longitude")])))`. In the figure below, blue circles show stations with valid coordinates within the UK-AIR database (Air Information Resource) while red circles show locations for which hourly data is available.
+# Applications
+
+## Plotting stations' locations 
+
+In the raw catalogue, `r length(which(complete.cases(stations_raw[, c("Latitude", "Longitude")])))` stations contain valid coordinates. After scraping DEFRA's web pages, the number of stations with valid coordinates rises to `r length(which(complete.cases(stations[, c("Latitude", "Longitude")])))`. In the figure below, red circles show stations with available hourly data while blue circles show the remaining locations.
 
 ```{r, eval = TRUE}
 validStations <- which(!is.na(stations$SiteID))
 
 library(leaflet)
 leaflet(data = stations) %>% addTiles() %>% 
-  addCircleMarkers(lng = ~Longitude, lat = ~Latitude, radius = 0.5) %>% 
   addCircleMarkers(lng = ~Longitude[validStations], 
                    lat = ~Latitude[validStations], 
-                   radius = 0.5, color="red", popup = ~SiteID[validStations])
+                   radius = 0.5, color="red", popup = ~SiteID[validStations]) %>% 
+  addCircleMarkers(lng = ~Longitude, lat = ~Latitude, radius = 0.5) 
 ```
 
-Below, a basic analysis is carried out to assess the spatial distribution of the monitoring stations. These are concentrated largely in urban areas and used to estimate the background urban (environmental type) level of concentration of pollutants.
+## Analyse the spatial distribution of the monitoring stations
 
-```{r, echo=FALSE, eval=TRUE}
-# Environment.Type
-y <- as.data.frame(table(stations$Environment.Type[stations$Environment.Type != 'Unknown Unknown']))
-pctY <- round(y$Freq/sum(y$Freq)*100)
+Below are two plots showing the spatial distribution of the monitoring stations. These are concentrated largely in urban areas and used to estimate the background level of concentration of pollutants.
 
-# chart Regions
-barplot(height = pctY, names.arg = y$Var1, main="Types")
+```{r, echo=FALSE, eval=TRUE, fig.height = 10}
+# Zone
+dotchart(table(stations$Zone))
 ```
 
 ```{r, echo=FALSE, eval=TRUE}
-# Region
-x <- as.data.frame(table(stations$Zone))
-pctX <- round(x$Freq/sum(x$Freq)*100)
-
-# chart Regions
-barplot(height = pctX, names.arg = x$Var1, main="Regions")
+# Environment.Type
+dotchart(table(stations$Environment.Type))
 ```
 
+## Use multiple cores to speed up data retrieval from numerous sites
+
 Using parallel processing, the acquisition of data from hundreds of sites takes only few minutes:
 
 ```{r}
@@ -182,15 +205,15 @@ no_cores <- detectCores() - 1
 # Initiate cluster
 cl <- makeCluster(no_cores)
 
-system.time(myList <- parLapply(cl, IDstationHdata, 
+system.time(myList <- parLapply(cl, stations$SiteID[validStations], 
 ukair_get_hourly_data, years=1999:2016))
 
 stopCluster(cl)
 
 df <- bind_rows(myList)
 ```
 
-### Meta
+# Meta
 
 * Please [report any issues or bugs](https://github.com/kehraProject/r_rdefra/issues).
 * License: [GPL-3](https://opensource.org/licenses/GPL-3.0)

diff --git a/README.md b/README.md
diff --git a/preparePackage.R b/preparePackage.R
@@ -28,12 +28,16 @@ load("~/Dropbox/Repos/r_rdefra/rdefra/data/stations.rda")
 tools::checkRdaFiles("~/Dropbox/Repos/r_rdefra/rdefra/data/stations.rda")
 
 stations_raw <- ukair_catalogue()
-stations <- ukair_get_coordinates(stations_raw)
-stations$SiteID <- ukair_get_site_id(as.character(stations$UK.AIR.ID))
+
+stations <- ukair_get_coordinates(stations_raw, en = TRUE, force_coords = TRUE)
 length(which(!is.na(stations$Latitude)))
+length(which(!is.na(stations$Longitude)))
+
+stations$SiteID <- ukair_get_site_id(stations$UK.AIR.ID)
+length(which(!is.na(stations$SiteID)))
 
 save(stations,
-     file='~/Dropbox/Repos/r_rdefra/rdefra/data/stations.rda',
+     file='~/r_rdefra/rdefra/data/stations.rda',
      compress='gzip')
 
 # Build README

diff --git a/rdefra/R/stations.R b/rdefra/R/stations.R
@@ -20,6 +20,8 @@
 #'   \item{\code{Networks}}{Monitoring Networks}
 #'   \item{\code{AURN.Pollutants.Measured}}{Pollutant measured}
 #'   \item{\code{Site.Description}}{Site.Description}
+#'   \item{\code{Easting}}{Easting coordinate (British National Grid)}
+#'   \item{\code{Northing}}{Northing coordinate (British National Grid)}
 #'   \item{\code{SiteID}}{Site ID}
 #' }
 #'

diff --git a/rdefra/R/ukair_get_coordinates.R b/rdefra/R/ukair_get_coordinates.R
@@ -2,11 +2,11 @@
 #'
 #' @description This function takes as input the UK AIR ID and returns Easting and Northing coordinates (British National Grid, EPSG:27700).
 #'
-#' @param UK.AIR.ID contains the station identification code defined by DEFRA. It can be: a) an alphanumeric string, b) a vector of strings or c) a data frame. In the latter case, the column containing the codes should be named "UK.AIR.ID", all the other columns will be ignored.
+#' @param ids contains the station identification code defined by DEFRA. It can be: a) an alphanumeric string, b) a vector of strings or c) a data frame. In the latter case, the column containing the codes should be named "UK.AIR.ID", all the other columns will be ignored.
 #' @param en logical set to FALSE by default. If set to TRUE, it adds two columns to the output dataframe containing "Easting" and "Northing" coordinates, wherever available.
-#' @param force_coords logical set to FALSE by default. If set to TRUE forces the extraction of coordinates for all the IDs (not only those with missing coordinates).
+#' @param all_coords logical set to FALSE by default. If set to TRUE forces the extraction of coordinates for all the IDs (not only those with missing coordinates).
 #'
-#' @details If the input is a data frame with some of the columns named "UK.AIR.ID", "Latitude" and "Longitude", the function only infills missing Latitude/Longitude values. If you want to get the coordinates for all the IDs, set force_coords = TRUE.
+#' @details If the input is a data frame with some of the columns named "UK.AIR.ID", "Latitude" and "Longitude", the function only infills missing Latitude/Longitude values. If you want to get the coordinates for all the IDs, set all_coords = TRUE.
 #'
 #' @return A data.frame containing at least three columns named "UK.AIR.ID", "Latitude", and "Longitude". If en is set to TRUE, there are other two columns containing the "Easting" and "Northing" coordinates.
 #'
@@ -25,63 +25,73 @@
 #'  }
 #'
 
-ukair_get_coordinates <- function(UK.AIR.ID, en = FALSE, force_coords = FALSE){
+ukair_get_coordinates <- function(ids, en = FALSE, all_coords = FALSE){
 
-  # Check if UK.AIR.ID is a data.frame
-  if ("data.frame" %in% class(UK.AIR.ID)){
+  # Check if ids is a data.frame
+  if ("data.frame" %in% class(ids)){
 
-    nrows <- seq(1,dim(UK.AIR.ID)[1])
+    nrows <- seq(1,dim(ids)[1])
 
     # By default we are expected to just infill missing coordinates
-    if ("Latitude" %in% names(UK.AIR.ID) &
-        "Longitude" %in% names(UK.AIR.ID) & force_coords == FALSE){
-      nrows <- which(is.na(UK.AIR.ID$Latitude) | is.na(UK.AIR.ID$Longitude))
+    if ("Latitude" %in% names(ids) &
+        "Longitude" %in% names(ids) & all_coords == FALSE){
+      nrows <- which(is.na(ids$Latitude) | is.na(ids$Longitude))
     }
 
     # otherwise, we force to extract coordinates for all the given IDs
-    IDs <- UK.AIR.ID$UK.AIR.ID[nrows]
+    IDs <- ids$UK.AIR.ID[nrows]
 
   }else{
 
-    IDs <- UK.AIR.ID
+    IDs <- ids
 
   }
 
-  # If UK.AIR.ID is not a data.frame, it must be a string or vector of strings
+  # If ids is not a data.frame, it must be a string or vector of strings
   IDs <- as.character(IDs)
 
   # Get Easting and Northing
   enDF <- data.frame(t(sapply(IDs, ukair_get_coordinates_internal)))
+
+  # Remove NAs
+  rowsNoNAs <- which(!is.na(enDF$Easting) & !is.na(enDF$Northing))
+  enDFnoNAs <- enDF[rowsNoNAs,]
 
   # Transform Easting and Northing to Latitude and Longitude
   # first, define spatial points
-  sp::coordinates(enDF) <- ~Easting+Northing
-  sp::proj4string(enDF) <- sp::CRS("+init=epsg:27700")
+  sp::coordinates(enDFnoNAs) <- ~Easting+Northing
+  sp::proj4string(enDFnoNAs) <- sp::CRS("+init=epsg:27700")
   # then, convert coordinates from British National Grid to WGS84
-  latlon <- round(sp::spTransform(enDF, sp::CRS("+init=epsg:4326"))@coords, 6)
+  latlon <- round(sp::spTransform(enDFnoNAs, sp::CRS("+init=epsg:4326"))@coords, 6)
   pt <- data.frame(latlon)
   names(pt) <- c("Longitude", "Latitude")
 
-  dfExtended <- cbind(IDs, enDF@coords, pt)
+  dfExtended <- cbind(IDs[rowsNoNAs], enDFnoNAs@coords, pt)
   names(dfExtended)[1] <- "UK.AIR.ID"
 
-  if ("data.frame" %in% class(UK.AIR.ID)){
+  if ("data.frame" %in% class(ids)){
 
     # if the input was a data.frame
     # return the new data.frame with infilled coordinates
-    UK.AIR.ID[nrows, c("Latitude", "Longitude")] <- dfExtended[, c("Latitude",
-                                                                   "Longitude")]
+    rows2fill <- which(ids$UK.AIR.ID %in% dfExtended$UK.AIR.ID)
+    if(all(ids$UK.AIR.ID[rows2fill] == dfExtended$UK.AIR.ID)){
+      ids$Latitude[rows2fill] <- dfExtended$Latitude
+      ids$Longitude[rows2fill] <- dfExtended$Longitude
+    }else{
+      message("Check the order!")
+    }
 
     # Do we need Easting and Northing?
     # If not, we just keep Latitude and Longitude
     if (en == TRUE) {
-      UK.AIR.ID$Easting <- NA
-      UK.AIR.ID$Northing <- NA
-      UK.AIR.ID[nrows, c("Easting", "Northing")] <- dfExtended[, c("Easting",
-                                                                   "Northing")]
+
+      suppressWarnings(output <- dplyr::left_join(ids, dfExtended, 
+                                                  by = c("UK.AIR.ID", 
+                                                         "Latitude",
+                                                         "Longitude")))
     }
 
-    return(tibble::as_tibble(UK.AIR.ID))
+    return(tibble::as_tibble(output))
 
   }else{
 
@@ -113,9 +123,6 @@ ukair_get_coordinates_internal <- function(uka_id){
   # Extract tab row containing Easting and Northing
   page_tab <- xml2::xml_find_all(page_content,
                                  "//*[contains(@id,'tab_info')]")[[2]]
-  # @haozhu suggested:
-  # page_tab <- xml2::xml_find_first(page_content,
-  #                                 "//a[text()='Pre-Formatted Data Files']")
 
   # extract and clean all the columns
   vals <- trimws(xml2::xml_text(page_tab))

diff --git a/rdefra/data/stations.rda b/rdefra/data/stations.rda
diff --git a/rdefra/man/stations.Rd b/rdefra/man/stations.Rd
diff --git a/rdefra/man/ukair_get_coordinates.Rd b/rdefra/man/ukair_get_coordinates.Rd