Skip to content
This repository was archived by the owner on Dec 19, 2024. It is now read-only.

Commit

Permalink
Finalised submission for second review ropensci
Browse files Browse the repository at this point in the history
  • Loading branch information
cvitolo committed Aug 25, 2016
1 parent fc615cc commit 608c25f
Show file tree
Hide file tree
Showing 10 changed files with 221 additions and 126 deletions.
89 changes: 56 additions & 33 deletions README.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -35,20 +35,24 @@ The package is designed to collect data efficiently. It allows to download multi

For similar functionalities see also the [openair](https://cran.r-project.org/package=openair) package, which relies on a local copy of the data on servers at King's College (UK), and the [ropenaq](https://CRAN.R-project.org/package=ropenaq) which provides UK-AIR latest measured levels (see https://uk-air.defra.gov.uk/latest/currentlevels) as well as data from other countries.

### Dependencies
# Dependencies & Installation

## Dependencies

The rdefra package depends on two things:

* The Geospatial Data Abstraction Library (gdal). If you use linux/ubuntu, this can be installed with the following command: `sudo apt-get install -y r-cran-rgdal`.

* Some additional CRAN packages. Check for missing dependencies and install them using the commands below:

```{r}
packs <- c('httr', 'xml2', 'lubridate', 'tibble', 'dplyr', 'sp', 'devtools')
packs <- c('httr', 'xml2', 'lubridate', 'tibble', 'dplyr', 'sp', 'devtools',
'leaflet', 'zoo', 'testthat')
new.packages <- packs[!(packs %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
```

### Installation
## Installation

You can install this package from CRAN:

Expand All @@ -69,14 +73,14 @@ Load the rdefra package:
library(rdefra)
```

### Functions
# Functions

The package logic assumes that the user access the UK-AIR database in two steps:

1. Browse the catalogue of available stations and selects some stations of interest.
2. Retrieves data for the selected stations.

#### Get metadata catalogue
## Get metadata catalogue
DEFRA monitoring stations can be downloaded and filtered using the function `ukair_catalogue()` with no input parameters, as in the example below.

```{r, eval= TRUE}
Expand All @@ -99,77 +103,96 @@ stations_EnglandOzone <- ukair_catalogue(pollutant = 1, country_id = 1)

The example above shows how to retrieve the `r dim(stations_EnglandOzone)[1]` stations in England in which ozone is measured.

#### Get missing coordinates
## Get missing coordinates
Locating a station is extremely important to be able to carry out any spatial analysis. If coordinates are missing, for some stations in the catalogue, it might be possible to retrieve Easting and Northing (coordinates in the British National Grid) from DEFRA's web pages. Get E and N, transform them to latitude and longitude and populate the missing coordinates using the code below.

```{r}
# Scrape DEFRA website to get Easting/Northing
stations <- ukair_get_coordinates(stations_raw)
```

#### Check hourly data availability
## Check hourly data availability
Pollution data started to be collected in 1972 and consists of hourly concentration of various species (in &#956;g/m<sup>3</sup>), such as ozone (O<sub>3</sub>), particulate matters (PM<sub>2.5</sub> and PM<sub>10</sub>), nitrogen dioxide (NO<sub>2</sub>), sulphur dioxide (SO<sub>2</sub>), and so on.

The ID under which they are available differs from the UK.AIR.ID. The catalogue does not contain this additional station ID (called SiteID hereafter) but DEFRA's web pages contain references to both the UK.AIR.ID and the SiteID. The function below uses as input the UK.AIR.ID and outputs the SiteID, if available.

```{r}
stations$SiteID <- ukair_get_site_id(as.character(stations$UK.AIR.ID))
stations$SiteID <- ukair_get_site_id(stations$UK.AIR.ID)
```

### Get hourly data
## Get hourly data

The time series for a given station can be retrieved in one line of code:

```{r}
df <- ukair_get_hourly_data("ABD", years=2012:2016)
```{r, eval = TRUE}
# Get 1 year of hourly ozone data from London Marylebone Road monitoring station
df <- ukair_get_hourly_data("MY1", years=2015)
library(zoo)
plot(zoo(x = df$Ozone, order.by = as.POSIXct(df$datetime)),
main = "", xlab = "", ylab = expression(paste("Ozone concentration [",
mu, "g/m^3]")))
```

Highest concentrations seem to happen in late spring and at the beginning of summer. In order to check whether this happens every year, we can download multiple years of data and then compare them.

### Cached catalogue
```{r}
# Get 3 years of hourly ozone data from the same monitoring station
years <- 2013:2015
df <- ukair_get_hourly_data("MY1", years)
df$year <- lubridate::year(df$datetime)
par(mfrow=c(3,1))
for(yearDF in years){
df1 <- df[which(df$year == yearDF),]
plot(zoo(x = df1$Ozone, order.by = as.POSIXct(df1$datetime)),
main = "", xlab = "", ylab = yearDF)
}
```

# Cached catalogue

For convenience, a cached version of the catalogue (last updated in August 2016) is included in the package and can be loaded using the following command:

```{r}
```{r, eval = TRUE}
data('stations')
```

### Applications

The cached catalogue contains all the available site IDs and coordinates and can be quickly used as lookup table to find out the correspondence between the UK.AIR.ID and SiteID, as well as to investigate station characteristics.

In the raw catalogue, `r length(which(complete.cases(stations_raw[, c("Latitude", "Longitude")])))` stations contain valid coordinates. After scraping DEFRA's web pages, the number of stations with valid coordinates rises to `r length(which(complete.cases(stations[, c("Latitude", "Longitude")])))`. In the figure below, blue circles show stations with valid coordinates within the UK-AIR database (Air Information Resource) while red circles show locations for which hourly data is available.
# Applications

## Plotting stations' locations

In the raw catalogue, `r length(which(complete.cases(stations_raw[, c("Latitude", "Longitude")])))` stations contain valid coordinates. After scraping DEFRA's web pages, the number of stations with valid coordinates rises to `r length(which(complete.cases(stations[, c("Latitude", "Longitude")])))`. In the figure below, red circles show stations with available hourly data while blue circles show the remaining locations.

```{r, eval = TRUE}
validStations <- which(!is.na(stations$SiteID))
library(leaflet)
leaflet(data = stations) %>% addTiles() %>%
addCircleMarkers(lng = ~Longitude, lat = ~Latitude, radius = 0.5) %>%
addCircleMarkers(lng = ~Longitude[validStations],
lat = ~Latitude[validStations],
radius = 0.5, color="red", popup = ~SiteID[validStations])
radius = 0.5, color="red", popup = ~SiteID[validStations]) %>%
addCircleMarkers(lng = ~Longitude, lat = ~Latitude, radius = 0.5)
```

Below, a basic analysis is carried out to assess the spatial distribution of the monitoring stations. These are concentrated largely in urban areas and used to estimate the background urban (environmental type) level of concentration of pollutants.
## Analyse the spatial distribution of the monitoring stations

```{r, echo=FALSE, eval=TRUE}
# Environment.Type
y <- as.data.frame(table(stations$Environment.Type[stations$Environment.Type != 'Unknown Unknown']))
pctY <- round(y$Freq/sum(y$Freq)*100)
Below are two plots showing the spatial distribution of the monitoring stations. These are concentrated largely in urban areas and used to estimate the background level of concentration of pollutants.

# chart Regions
barplot(height = pctY, names.arg = y$Var1, main="Types")
```{r, echo=FALSE, eval=TRUE, fig.height = 10}
# Zone
dotchart(table(stations$Zone))
```

```{r, echo=FALSE, eval=TRUE}
# Region
x <- as.data.frame(table(stations$Zone))
pctX <- round(x$Freq/sum(x$Freq)*100)
# chart Regions
barplot(height = pctX, names.arg = x$Var1, main="Regions")
# Environment.Type
dotchart(table(stations$Environment.Type))
```

## Use multiple cores to speed up data retrieval from numerous sites

Using parallel processing, the acquisition of data from hundreds of sites takes only few minutes:

```{r}
Expand All @@ -182,15 +205,15 @@ no_cores <- detectCores() - 1
# Initiate cluster
cl <- makeCluster(no_cores)
system.time(myList <- parLapply(cl, IDstationHdata,
system.time(myList <- parLapply(cl, stations$SiteID[validStations],
ukair_get_hourly_data, years=1999:2016))
stopCluster(cl)
df <- bind_rows(myList)
```

### Meta
# Meta

* Please [report any issues or bugs](https://github.com/kehraProject/r_rdefra/issues).
* License: [GPL-3](https://opensource.org/licenses/GPL-3.0)
Expand Down
78 changes: 56 additions & 22 deletions README.md

Large diffs are not rendered by default.

10 changes: 7 additions & 3 deletions preparePackage.R
Original file line number Diff line number Diff line change
Expand Up @@ -28,12 +28,16 @@ load("~/Dropbox/Repos/r_rdefra/rdefra/data/stations.rda")
tools::checkRdaFiles("~/Dropbox/Repos/r_rdefra/rdefra/data/stations.rda")

stations_raw <- ukair_catalogue()
stations <- ukair_get_coordinates(stations_raw)
stations$SiteID <- ukair_get_site_id(as.character(stations$UK.AIR.ID))

stations <- ukair_get_coordinates(stations_raw, en = TRUE, force_coords = TRUE)
length(which(!is.na(stations$Latitude)))
length(which(!is.na(stations$Longitude)))

stations$SiteID <- ukair_get_site_id(stations$UK.AIR.ID)
length(which(!is.na(stations$SiteID)))

save(stations,
file='~/Dropbox/Repos/r_rdefra/rdefra/data/stations.rda',
file='~/r_rdefra/rdefra/data/stations.rda',
compress='gzip')

# Build README
Expand Down
2 changes: 2 additions & 0 deletions rdefra/R/stations.R
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,8 @@
#' \item{\code{Networks}}{Monitoring Networks}
#' \item{\code{AURN.Pollutants.Measured}}{Pollutant measured}
#' \item{\code{Site.Description}}{Site.Description}
#' \item{\code{Easting}}{Easting coordinate (British National Grid)}
#' \item{\code{Northing}}{Northing coordinate (British National Grid)}
#' \item{\code{SiteID}}{Site ID}
#' }
#'
Expand Down
63 changes: 35 additions & 28 deletions rdefra/R/ukair_get_coordinates.R
Original file line number Diff line number Diff line change
Expand Up @@ -2,11 +2,11 @@
#'
#' @description This function takes as input the UK AIR ID and returns Easting and Northing coordinates (British National Grid, EPSG:27700).
#'
#' @param UK.AIR.ID contains the station identification code defined by DEFRA. It can be: a) an alphanumeric string, b) a vector of strings or c) a data frame. In the latter case, the column containing the codes should be named "UK.AIR.ID", all the other columns will be ignored.
#' @param ids contains the station identification code defined by DEFRA. It can be: a) an alphanumeric string, b) a vector of strings or c) a data frame. In the latter case, the column containing the codes should be named "UK.AIR.ID", all the other columns will be ignored.
#' @param en logical set to FALSE by default. If set to TRUE, it adds two columns to the output dataframe containing "Easting" and "Northing" coordinates, wherever available.
#' @param force_coords logical set to FALSE by default. If set to TRUE forces the extraction of coordinates for all the IDs (not only those with missing coordinates).
#' @param all_coords logical set to FALSE by default. If set to TRUE forces the extraction of coordinates for all the IDs (not only those with missing coordinates).
#'
#' @details If the input is a data frame with some of the columns named "UK.AIR.ID", "Latitude" and "Longitude", the function only infills missing Latitude/Longitude values. If you want to get the coordinates for all the IDs, set force_coords = TRUE.
#' @details If the input is a data frame with some of the columns named "UK.AIR.ID", "Latitude" and "Longitude", the function only infills missing Latitude/Longitude values. If you want to get the coordinates for all the IDs, set all_coords = TRUE.
#'
#' @return A data.frame containing at least three columns named "UK.AIR.ID", "Latitude", and "Longitude". If en is set to TRUE, there are other two columns containing the "Easting" and "Northing" coordinates.
#'
Expand All @@ -25,63 +25,73 @@
#' }
#'

ukair_get_coordinates <- function(UK.AIR.ID, en = FALSE, force_coords = FALSE){
ukair_get_coordinates <- function(ids, en = FALSE, all_coords = FALSE){

# Check if UK.AIR.ID is a data.frame
if ("data.frame" %in% class(UK.AIR.ID)){
# Check if ids is a data.frame
if ("data.frame" %in% class(ids)){

nrows <- seq(1,dim(UK.AIR.ID)[1])
nrows <- seq(1,dim(ids)[1])

# By default we are expected to just infill missing coordinates
if ("Latitude" %in% names(UK.AIR.ID) &
"Longitude" %in% names(UK.AIR.ID) & force_coords == FALSE){
nrows <- which(is.na(UK.AIR.ID$Latitude) | is.na(UK.AIR.ID$Longitude))
if ("Latitude" %in% names(ids) &
"Longitude" %in% names(ids) & all_coords == FALSE){
nrows <- which(is.na(ids$Latitude) | is.na(ids$Longitude))
}

# otherwise, we force to extract coordinates for all the given IDs
IDs <- UK.AIR.ID$UK.AIR.ID[nrows]
IDs <- ids$UK.AIR.ID[nrows]

}else{

IDs <- UK.AIR.ID
IDs <- ids

}

# If UK.AIR.ID is not a data.frame, it must be a string or vector of strings
# If ids is not a data.frame, it must be a string or vector of strings
IDs <- as.character(IDs)

# Get Easting and Northing
enDF <- data.frame(t(sapply(IDs, ukair_get_coordinates_internal)))

# Remove NAs
rowsNoNAs <- which(!is.na(enDF$Easting) & !is.na(enDF$Northing))
enDFnoNAs <- enDF[rowsNoNAs,]

# Transform Easting and Northing to Latitude and Longitude
# first, define spatial points
sp::coordinates(enDF) <- ~Easting+Northing
sp::proj4string(enDF) <- sp::CRS("+init=epsg:27700")
sp::coordinates(enDFnoNAs) <- ~Easting+Northing
sp::proj4string(enDFnoNAs) <- sp::CRS("+init=epsg:27700")
# then, convert coordinates from British National Grid to WGS84
latlon <- round(sp::spTransform(enDF, sp::CRS("+init=epsg:4326"))@coords, 6)
latlon <- round(sp::spTransform(enDFnoNAs, sp::CRS("+init=epsg:4326"))@coords, 6)
pt <- data.frame(latlon)
names(pt) <- c("Longitude", "Latitude")

dfExtended <- cbind(IDs, enDF@coords, pt)
dfExtended <- cbind(IDs[rowsNoNAs], enDFnoNAs@coords, pt)
names(dfExtended)[1] <- "UK.AIR.ID"

if ("data.frame" %in% class(UK.AIR.ID)){
if ("data.frame" %in% class(ids)){

# if the input was a data.frame
# return the new data.frame with infilled coordinates
UK.AIR.ID[nrows, c("Latitude", "Longitude")] <- dfExtended[, c("Latitude",
"Longitude")]
rows2fill <- which(ids$UK.AIR.ID %in% dfExtended$UK.AIR.ID)
if(all(ids$UK.AIR.ID[rows2fill] == dfExtended$UK.AIR.ID)){
ids$Latitude[rows2fill] <- dfExtended$Latitude
ids$Longitude[rows2fill] <- dfExtended$Longitude
}else{
message("Check the order!")
}

# Do we need Easting and Northing?
# If not, we just keep Latitude and Longitude
if (en == TRUE) {
UK.AIR.ID$Easting <- NA
UK.AIR.ID$Northing <- NA
UK.AIR.ID[nrows, c("Easting", "Northing")] <- dfExtended[, c("Easting",
"Northing")]

suppressWarnings(output <- dplyr::left_join(ids, dfExtended,
by = c("UK.AIR.ID",
"Latitude",
"Longitude")))
}

return(tibble::as_tibble(UK.AIR.ID))
return(tibble::as_tibble(output))

}else{

Expand Down Expand Up @@ -113,9 +123,6 @@ ukair_get_coordinates_internal <- function(uka_id){
# Extract tab row containing Easting and Northing
page_tab <- xml2::xml_find_all(page_content,
"//*[contains(@id,'tab_info')]")[[2]]
# @haozhu suggested:
# page_tab <- xml2::xml_find_first(page_content,
# "//a[text()='Pre-Formatted Data Files']")

# extract and clean all the columns
vals <- trimws(xml2::xml_text(page_tab))
Expand Down
Binary file modified rdefra/data/stations.rda
Binary file not shown.
2 changes: 2 additions & 0 deletions rdefra/man/stations.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

8 changes: 4 additions & 4 deletions rdefra/man/ukair_get_coordinates.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

Loading

0 comments on commit 608c25f

Please sign in to comment.