
Package Design #1

Open · taltstidl opened this issue Feb 5, 2019 · 15 comments

@taltstidl commented Feb 5, 2019

There are two main sources for obtaining the GeyserTimes data in a machine-readable format:

  • The GeyserTimes Archive, which contains a single file with all the eruptions in the database as well as smaller per-geyser versions. Each is a gzip-compressed, tab-separated file.
  • The GeyserTimes API which can be used to obtain relatively small amounts of eruption data. It supports the retrieval of a list of all geysers as well as a list of eruptions for a given geyser between two dates.

The choice of source largely depends on whether you want to do a more detailed study on a larger chunk of data or whether you want to analyze the most recent behavior, but for a shorter time frame. Please keep in mind that GeyserTimes is running on donated resources and bombarding the server with requests should be avoided at all costs.
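
For orientation, a minimal sketch of pulling the full archive into R might look like the following. The file name in the URL is a placeholder, not an actual archive name; check https://geysertimes.org/archive/ for the real nightly files.

library(readr)

# Placeholder archive name; see https://geysertimes.org/archive/ for actual files.
url <- "https://geysertimes.org/archive/eruptions_all.tsv.gz"
dest <- file.path(tempdir(), basename(url))

download.file(url, dest, mode = "wb")  # download once; be gentle on the server
eruptions <- read_tsv(dest)            # read_tsv decompresses .gz transparently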

Before starting with the development itself, here are a few questions that we'd like to ask in order to improve the tooling that GeyserTimes provides as well as help guide the package design:

  • What GeyserTimes data do you want to access (e.g. Old Faithful eruptions between 2018-01-01 and 2018-12-31)?
  • What kind of analysis are you interested in doing on this data (e.g. calculate the average interval and the standard deviation)?
  • What type of charts are you planning on creating for visualizing the data (e.g. histogram, scatter plot, box plot)?

Thanks again for your interest! We're already looking forward to the day this package gets published on CRAN for everyone to use.


Side note: At my university it's exam time and thus my availability will be very limited over the next two weeks as I'm busy preparing for exams. I will still check this thread from time to time and do my best to answer any questions you might have.

@spkaluzny (Collaborator)

Data Content

I suggest that the single file with all the eruptions in the database not be shipped in the package itself. That file, currently at about 16 MB, is too large for CRAN, whose policy states: "As a general rule, neither data nor documentation should exceed 5MB."

We could provide a function to download and create the full dataset. This would also allow the user to update the data, ensuring they have the latest observations. We do not want to do this download every time the package is loaded, so we would have to work out a way to store the data locally (the user may not have write access to where the package is installed, so we cannot assume we can store it in the package directory).

Another approach would be to create a smaller package, which can be submitted to CRAN, and a separate data package with the full dataset, hosted on another (non-CRAN) site. The functions in the small package would include code to make sure the full data package is available. This approach, using the drat package, is described in the paper Hosting Data Packages via drat: A Case Study with Hurricane Exposure Data.

We could also provide a function, using the GeyserTimes API, to access an individual geyser's data for a range of dates, but this would likely result in more requests to the server.

A dataset with the names and locations (both latitude/longitude coordinates and a text description) of all the geysers in the database should be included.

Analysis

I have no particular analysis in mind for the geyser data. I hope the members of the GeyserTimes team who know more about geysers will give suggestions, and we will make sure the data in the package is appropriate for those analyses.

A vignette that illustrates some simple visualizations, data manipulation, and analyses should be included. Analysis of inter-arrival times for eruptions of a geyser with a long history would be one simple example. Cross-correlation of eruption times at some of the geysers is another analysis that could illustrate working with the data and visualizing it.
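
As a rough sketch of the inter-arrival idea, assuming the archive has been read into a data frame eruptions with columns named geyser and time (with time as POSIXct); these column names are assumptions, not the archive's actual layout:

of <- eruptions[eruptions$geyser == "Old Faithful", ]
of <- of[order(of$time), ]                               # sort chronologically
intervals <- as.numeric(diff(of$time), units = "hours")  # hours between eruptions
hist(intervals, breaks = 50,
     main = "Old Faithful inter-arrival times", xlab = "Interval (hours)")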

@hathawayj

I like what @spkaluzny put forth for the data storage. I think there are a few other data points that may be worth building out access to in R. I will describe them below.

  1. Eruption Times: Tools/Functions to parse, store, and process your archive as @spkaluzny explained.
  2. Webcam video retrieval: Tools/Functions to retrieve webcam clips, digest webcam video into data artifacts.
  3. Spatial Mapping: Spatial data for each geyser and pertinent mapping artifacts to display maps.
  4. Data Visualization: We could have a few canned visualizations as functions in the package that allow the user quick access to different standard visuals.
  5. Additional geyser variables: Height of geyser at each minute or second
  6. Additional External Variables: Tools/Functions to get other data from other sources. For example, weather, earthquakes, number of tourists in the park.

As to analysis, once we start processing the data some ideas may come to the team. If we start building out something for geyser analysis it might make sense to have two different packages.

@codemasta14 (Collaborator)

This all looks great. Just as a reminder, I won't have much time to work on this until April.

@taltstidl (Author)

Sorry for the long silence; I've been a bit busy lately.

I'm also guessing the data storage approach suggested by @spkaluzny is the way to go. There are just a few open questions left in my mind:

  1. Where do we download the file to? I'm leaning towards user choice with a sane default location.
  2. How do we make sure that the user has a sufficiently recent version of the file available? Two options: either notify the user that a more recent version is available and let it be downloaded on demand (fewer downloads and thus lighter on the server), or automatically download the newer version. The first is probably better for interactive usage, while the second is preferable for scripts. Do we let the user choose?
  3. What should the return value be? A normal data frame? Or something more modern from the tidyverse packages (which also has a read_tsv function)? Note: see https://www.tidyverse.org/ for more information on the tidyverse packages.

Tentatively, the data function might look like this: geysertimes::load_data(geyser), which takes an optional parameter geyser and returns a tibble with all eruptions in the database.

An example script using this might look like the following (just a quick sketch, so don't expect it to run; also, I'm not exactly an R expert, so excuse any mistakes):

library(geysertimes)
library(dplyr)  # for filter() and arrange()

eruptions <- geysertimes::load_data("Old Faithful")
electronic <- filter(eruptions, E == 1)        # keep electronically logged eruptions
electronic <- arrange(electronic, desc(time))  # newest first
hist(electronic$interval)                      # interval is hypothetical, see below

This is a completely hypothetical usage for now, as the electronic data frame currently does not contain the field interval, which would need to be manually calculated. This is needed quite often and a suitable function should probably be distributed as part of the package.
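
One possible shape for such a helper, purely as a sketch (the function name and the time column it assumes are hypothetical):

library(dplyr)

# Hypothetical helper: adds an interval column (hours since the previous eruption).
add_intervals <- function(eruptions) {
  eruptions %>%
    arrange(time) %>%
    mutate(interval = as.numeric(time - lag(time), units = "hours"))
}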

What are your thoughts? Anyone who has the time is more than welcome to proceed with the implementation 😃.
@codemasta14 No worries; unlike most of your university work, we don't have any deadlines here, so you are free to jump in whenever you can 😉.

@spkaluzny (Collaborator)

Thinking more about how to download the data for the package, since the current full dataset, compressed, is about 16 MB, which far exceeds CRAN's 5 MB limit.

I don't think hosting a separate data package on another site using drat, as I previously suggested, would work. Someone would need to create a new version of the package every time we wanted to make new data available, and we would need a regularly accessible site that mimics a package repository structure where we could publish new versions.

I think having functions to download the data and optionally save it in an appropriate location is the way to go. @TR4Android listed some of the details we would have to work out for this.

  1. Where to download the data to, since the package is not allowed to write to the package directory? Per CRAN rules, we would write to a tempfile() by default but suggest that the user supply a location so that the data is retained beyond the current session. I have recently discovered the rappdirs package, which suggests sensible default locations for various types of files (data, configuration, logging, etc.) on all R platforms; see the sketch after this list. The get_data function would download the data and the load_data function would make previously downloaded data available for the current R session. The load_data function would inform the user if no data had been previously downloaded.
  2. How do we make sure that the user has a sufficiently recent version of the file available?
    We would include a version number (perhaps just the date?) with every data download. The user could check geysertimes::version and decide whether more recent data is needed.
  3. I suggest we use a tibble as the data object. A tibble is created if we use the readr::read_tsv function to read the downloaded data.
  4. While the full dataset is bigger than what CRAN allows, it is not very large for R to handle. Do we need to have individual geyser datasets or can we just let the user extract them from the full dataset? We could provide an extractor function that easily gets particular geysers' data from the full dataset.
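
To make point 1 concrete, here is a sketch of the download logic using rappdirs; the function names and the archive file name are illustrative, not a final API:

library(rappdirs)

# OS-appropriate per-user data directory, e.g. ~/.local/share/geysertimes on Linux.
gt_default_path <- function() {
  user_data_dir("geysertimes")
}

# CRAN-friendly default is tempdir(); pass gt_default_path() to keep the data.
get_data <- function(dest_dir = tempdir()) {
  dir.create(dest_dir, recursive = TRUE, showWarnings = FALSE)
  url <- "https://geysertimes.org/archive/eruptions_all.tsv.gz"  # placeholder name
  dest <- file.path(dest_dir, basename(url))
  download.file(url, dest, mode = "wb")
  invisible(dest)
}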

The LAGOSNE (Interface to the Lake Multi-Scaled Geospatial and Temporal Database) package downloads data on U.S. lakes. Many of the ideas I described above were developed after looking at how LAGOSNE handles its data.

@taltstidl (Author)

I agree with your analysis @spkaluzny, a drat package is not necessary in our case and would just complicate matters. Let me reply to some of your comments.

  1. A temporary file with a hint to supply a permanent location should be fine. I'm not sure about splitting the process into get_data and load_data functions; I think a single function should suffice and would also be less error-prone. Thoughts?
  2. Currently the downloaded file and link contain the date of the archive. That should make it easy to version the file, as archives are created nightly. However, we still need a direct link to download the latest version of the archive, without having to figure out the backup date first. That's something we'll take care of.
  3. A tibble should work well here, so let's go with this 👍
  4. You are right, the file is still small enough to handle in memory, and a filter function should be enough. The geyser parameter would allow users finer control over the downloaded data, but that's seldom needed and can wait until the need arises (if it ever does).

Feel free to start with the implementation whenever you're ready. I think the download/load process should be good to go. Further down the road we should think about filtering functions (geyser, electronic, primary, etc.) and other helpers.

@codemasta14 (Collaborator)

Alright everyone, I'm free from school for a couple of months and will have time to work on this. I'm going to dive into what you've said previously and get to work on developing things. But I do want to ask: has any implementation already been done?

@codemasta14 (Collaborator)

Also, as far as tracking work as we go along, would you like to use a project board to keep track of tasks? And would you prefer me to fork the repository and make pull requests, or to give me push permission and have me clone the repository? I'm the junior here, so I want to do things the way you want and not step on any toes, but I'm excited and ready to dive in.

@spkaluzny (Collaborator)

I have started on the package and had a similar question about posting/updating the repository. I think we want to post to geysertimes, not geysertimes-r-package.

@taltstidl (Author)

@codemasta14 @spkaluzny Good to hear that work has started on this project! The suggested way to contribute is to create a new branch on this repository, do the work there (e.g. add the function for loading the database), then open a pull request against the master branch so that we can all discuss and iterate on it. Once we are all on the same page regarding functionality and implementation, we'll merge it. We'll only be taking a moderating role here, so feel free to organize yourselves. You all have write access to this repository, so you can clone and push.

tl;dr proposed Development Pipeline (for v1 of this R package)

  1. Data downloading/loading
    • Create a new branch (e.g. data-load)
    • Commit and push changes
    • Create a new pull request based on the master branch
    • Discuss and iterate until all desired functionality is implemented
    • Merge into master branch
  2. Functions for filtering (by geyser, by dates, by electronic, by primary, etc.), sorting (mostly by date) as well as post-processing (e.g. adding intervals)
    • Similar to above, but can be more easily split among people
  3. Functions for generating charts and graphs (e.g. interval and duration distributions, up for discussion)
    • Similar to above, but can be more easily split among people

If you deviate from this, that's okay. As long as the end result is an organized repository with readable and functional code, we're fine with it. This is just meant to help you get started 😉

@spkaluzny (Collaborator)

I have created a branch, data-load, that consists of an initial full package with functions and help files for getting the data from https://geysertimes.org/archive/ and storing an R binary object of the data for subsequent use (in a tibble).

Some notes about the package:

  • The package name is geysertimes, all lower case. That seems to be the preferred package naming scheme these days.
  • I use a gt_ prefix for the functions since, without the prefix, the function names would be quite common (path, version, etc.).
  • There are two main functions: one for downloading the data from geysertimes.org, gt_get_data (typically only done once), and one for loading that downloaded data in subsequent R sessions, gt_load_data. A short usage sketch follows this list.
  • The default location for the downloaded data is under the tempdir() to meet the CRAN requirements that no files be created by default other than under tempdir(). The gt_get_data does suggest that the user use the value returned by the gt_path function as the download location. The gt_path function sets an appropriate location based on the user's OS.
  • The version associated with a download of the data is just the date from the download archive file as a character string in the form yyyy-mm-dd.
  • The DESCRIPTION file currently lists me as author, maintainer. We need to decide who to add and in what role.
  • The License is MIT. We need to decide what we want.
  • I started a vignette that shows the basic workflow for getting the data and loading it into an R session.
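
In short, the workflow looks like this (schematic only; see the vignette and the help files in the branch for the exact argument names):

library(geysertimes)

data_dir <- gt_path()          # OS-appropriate permanent location
gt_get_data(path = data_dir)   # one-time download; argument name may differ

# In subsequent sessions:
eruptions <- gt_load_data(path = data_dir)  # returns the tibble of eruptions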

@codemasta14 (Collaborator)

Great. What can I do to add onto this and help?

@spkaluzny (Collaborator)

It would be good for the team to try out the package. Feedback on the design and its usage is welcome.

@codemasta14 (Collaborator)

Hey everyone, I'm sorry I've been out of touch for the last couple of weeks. I've been in the process of moving across the country for my summer internship. I'll still be working on this, but will have a little less time to do so until August, when I finish up here.

@taltstidl (Author)

@codemasta14 No worries here, this package is done when it's done. We're not on any schedule here, so there's absolutely no rush. Have fun with your internship!
