
Fetching and timestamps

Alison Appling edited this page Feb 8, 2018 · 1 revision

Overview

Every fetch item has a fetch method, whether you write it yourself or use a built-in fetcher from vizlab. The fetch method is specified by the fetcher argument in the viz.yaml, and it does the work of actually downloading or creating a dataset.

Because it's time-consuming to fetch data during every build, vizmake() skips the fetch when possible. A file is always fetched the first time (to create it) and whenever its dependencies or its arguments in the viz.yaml change. Beyond those basic remake-type controls, we use two further controls to limit unnecessary fetching:

  • fetchTimestamp - The fetchTimestamp concept was motivated by datasets that must be fetched from remote sources, though we now use it for all fetch items. We usually want to download a data file only as often as the remote data change, so we represent the status of the remote data with a local timestamp file. That file contains a single datetime in a standard text format, as produced by the writeTimestamp helper function. The datetime should describe the most recent update to the remote data: for example, the 'last-modified' time of a remote file, or the timestamp of the most recently added observation in a stream of continuous monitoring data. The fetchTimestamp method queries the remote source and updates the timestamp file if the remote timestamp has changed. If the timestamp hasn't changed, we don't refetch the data; if it has, we refetch. See the fetchTimestamp methods section below.

  • timetolive - We can further avoid querying a remote source too often by setting the fetch item's timetolive to some positive time interval. vizmake then waits until that interval has passed before fetching even the timestamp, let alone the file. See the timetolive and preferences.yaml section below.
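The controls described above can be sketched in a few lines of base R. This is only an illustration of the order of the checks as the text describes them, not vizlab's actual implementation; the function should_fetch and all of its arguments are hypothetical names invented for this sketch.

```r
# Hypothetical helper (not part of vizlab): returns TRUE when a fetch item
# should be (re)fetched. remote_changed is passed as a function so that the
# remote source is only queried once the timetolive has expired.
should_fetch <- function(file_exists, deps_changed, last_check, ttl_secs,
                         remote_changed, now = Sys.time()) {
  # 1. Always fetch a missing file, or one whose dependencies/arguments changed
  if (!file_exists || deps_changed) return(TRUE)

  # 2. Still within timetolive: skip without even querying the remote timestamp
  age_secs <- as.numeric(difftime(now, last_check, units = "secs"))
  if (age_secs < ttl_secs) return(FALSE)

  # 3. timetolive has expired: refetch only if the timestamp has changed
  isTRUE(remote_changed())
}

now <- as.POSIXct("2018-02-08 12:00:00", tz = "UTC")
one_hour_ago <- now - 3600

# Timestamp always looks new, but a 3-hour timetolive has not yet elapsed
within_ttl <- should_fetch(TRUE, FALSE, one_hour_ago, 3 * 3600,
                           function() TRUE, now = now)   # FALSE

# Same item with a timetolive of 0 secs: the timestamp is checked and has changed
ttl_zero <- should_fetch(TRUE, FALSE, one_hour_ago, 0,
                         function() TRUE, now = now)     # TRUE

# Timestamp never changes, so the existing file is never refetched
unchanged <- should_fetch(TRUE, FALSE, one_hour_ago, 0,
                          function() FALSE, now = now)   # FALSE
```

Passing remote_changed as a function rather than a value mirrors the point of timetolive: while the interval has not elapsed, the remote source is never contacted at all.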

vizmake()'s decision tree

Here's how vizmake decides whether to [re]fetch a fetch item:

[Figure: vizmake() fetch decision tree]

Common needs and solutions

Here's how to set up your fetcher and timetolive preference to achieve some common goals:

| Desired behavior when vizmake is called | Example use case | Set fetchTimestamp as | Set timetolive to |
|---|---|---|---|
| Only fetch/create once ever (option 1) | Local function result, no remote data | fetchTimestamp.myfetcher <- alwaysCurrent (exactly this code line) | Ignored |
| Only fetch/create once ever (option 2) | Remote file that will never change | fetchTimestamp.myfetcher <- [any function] | Inf days |
| Fetch every time | Small dataset updated remotely every 2 hours | fetchTimestamp.myfetcher <- neverCurrent | 0 secs (or omit) |
| Fetch if >=3 hours have elapsed since last fetch | Big dataset updated remotely every 5 minutes | fetchTimestamp.myfetcher <- neverCurrent | 3 hours |
| Fetch from ScienceBase only if the file on SB changed | SB file that colleagues will update periodically | fetcher: sciencebase in viz.yaml | 0 secs (or omit) |
| Fetch from URL only if the remote header's "last-modified" value has changed | Single file at URL | fetcher: url in viz.yaml | 0 secs (or omit) |
| Fetch from URL with complex structure or no "last-modified" | Complex precip shapefiles, dataRetrieval query | fetchTimestamp.myfetcher <- [custom method] | Up to you |

fetchTimestamp methods

The fetchTimestamp method for a viz item is specified by the same viz.yaml argument, fetcher, that determines the item's fetch method. In other words, every fetcher must have both a fetch method and a fetchTimestamp method. For common and simple use cases, you can set your custom fetchTimestamp method equal to a built-in function such as alwaysCurrent or neverCurrent; the names describe what each one assumes about the currency of the local timestamp relative to the remote timestamp. Three ways you might define your fetchTimestamp method are below. The third is a simplified version of fetchTimestamp.url; note the use of the readTimestamp and writeTimestamp helpers.

# Option 1: assume the local timestamp is always current
fetchTimestamp.myfetcher <- alwaysCurrent

# Option 2: assume the local timestamp is never current
fetchTimestamp.myfetcher <- neverCurrent

# Option 3: compare against the remote file's 'last-modified' header
library(httr) # provides HEAD, headers, and parse_http_date
fetchTimestamp.myfetcher <- function(viz) {
  # URL will be specified in viz.yaml as remoteURL
  checkRequired(viz, "remoteURL")
  url <- viz$remoteURL

  # request only the headers of the remote file and extract 'last-modified'
  new.timestamp <- headers(HEAD(url))[['last-modified']]

  # parse new.timestamp into POSIXct (in UTC) for passing to writeTimestamp
  new.timestamp <- parse_http_date(new.timestamp)
  attr(new.timestamp, "tzone") <- "UTC"

  # read the current timestamp file; rewrite it only if the remote timestamp
  # is known and differs from (or there is no) local timestamp
  old.timestamp <- readTimestamp(viz)
  if(!is.na(new.timestamp) && (is.na(old.timestamp) || (new.timestamp != old.timestamp))) {
    writeTimestamp(new.timestamp, viz)
  }

  invisible() # return nothing
}
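The write-only-on-change pattern in the third definition can be tried without vizlab or a network connection. The sketch below is a minimal stand-in written in base R: read_ts, write_ts, and update_ts are hypothetical helpers, not vizlab's readTimestamp and writeTimestamp, and the ISO 8601 UTC text format is an assumption made for the example.

```r
# Hypothetical stand-ins for vizlab's timestamp helpers; the text format
# used here is an assumption, not necessarily the one vizlab writes.
write_ts <- function(ts, path) {
  writeLines(format(ts, "%Y-%m-%dT%H:%M:%SZ", tz = "UTC"), path)
}
read_ts <- function(path) {
  # A missing timestamp file is reported as NA
  if (!file.exists(path)) return(as.POSIXct(NA, tz = "UTC"))
  as.POSIXct(readLines(path, n = 1), format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
}

# Rewrite the timestamp file only when the remote time is known and differs
# from the local one; return whether a write happened.
update_ts <- function(new_ts, path) {
  old_ts <- read_ts(path)
  changed <- !is.na(new_ts) && (is.na(old_ts) || new_ts != old_ts)
  if (changed) write_ts(new_ts, path)
  changed
}

ts_file <- tempfile(fileext = ".txt")
remote <- as.POSIXct("2018-02-08 12:00:00", tz = "UTC")

first  <- update_ts(remote, ts_file)       # no file yet, so it is written
second <- update_ts(remote, ts_file)       # unchanged, so nothing is written
third  <- update_ts(remote + 60, ts_file)  # remote moved on, so it is rewritten
```

Leaving the file untouched when nothing has changed is what lets the timestamp file act as a signal that the remote data have not changed.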

Or you can use built-in pairs of fetch and fetchTimestamp methods by setting a known fetcher in the viz.yaml in one of these three ways:

fetcher: sciencebase
fetcher: url
fetcher: file

timetolive and preferences.yaml

Your timetolive settings live in an optional, local file named preferences.yaml. If you create this file, format it like this (where cuyahoga and iris_data are IDs of example fetch items from the viz.yaml):

timetolive:
  cuyahoga: 1 secs
  iris_data: 2 days

Don't git commit this file - it may need to be different on other computers or the Jenkins machine. If a fetch item listed in the viz.yaml is missing from this preferences.yaml file (or if preferences.yaml doesn't exist), the timetolive for that item will be 0 secs.
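As a rough illustration of the default described above, the following base-R sketch looks up an item's timetolive and converts it to seconds. It assumes the preferences have already been read into a nested list (reading the YAML file itself is omitted), and ttl_secs is a hypothetical helper; vizlab's own parsing of these values may differ.

```r
# Hypothetical in-memory copy of the timetolive block from preferences.yaml
prefs <- list(timetolive = list(cuyahoga = "1 secs", iris_data = "2 days"))

# Hypothetical helper: an item's timetolive in seconds, defaulting to
# "0 secs" when the item (or the whole preferences list) is missing
ttl_secs <- function(id, prefs = NULL) {
  ttl <- prefs$timetolive[[id]]
  if (is.null(ttl)) ttl <- "0 secs"
  # Split "<number> <units>", e.g. "2 days" into "2" and "days"
  parts <- strsplit(trimws(ttl), "\\s+")[[1]]
  interval <- as.difftime(as.numeric(parts[1]), units = parts[2])
  as.numeric(interval, units = "secs")
}

ttl_secs("iris_data", prefs)   # 172800
ttl_secs("cuyahoga", prefs)    # 1
ttl_secs("not_listed", prefs)  # 0
ttl_secs("iris_data")          # 0, as if preferences.yaml did not exist
```

An "Inf days" entry converts to Inf seconds under this sketch, which matches the "only fetch once ever" behavior: no finite interval ever exceeds it.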