Package Design #1
Data Content

I suggest that the single file with all the eruptions in the database be available in the package. That file, currently about 16 MB, is too large for CRAN. CRAN policy states: "As a general rule, neither data nor documentation should exceed 5MB."

We could provide a function to download and create the full dataset. This would also allow the user to update the data, ensuring they have the latest observations. We do not want to do this download every time the package is loaded, so we would have to work out a way to store the data locally. The user may not have write access to the location where the package is installed, so we cannot assume the data can be stored in the package directory.

Another approach would be to create a smaller package, which can be submitted to CRAN, and a separate data package with the full dataset, which can be hosted on another (non-CRAN) site. The functions in the small package would include code to make sure the full data package is available. This approach, using the drat package, is described in this paper: Hosting Data Packages via drat: A Case Study with Hurricane Exposure Data

We could also provide a function, using the GeyserTimes API, to access the individual geyser data for a range of dates. But this would likely result in more requests on the server.

A data set with the names and locations (both lat, long and a text description) of all the geysers in the database should be included.

Analysis

I have no particular analysis in mind for the geyser data. I hope the members of the GeyserTimes team who know more about geysers will give suggestions, and we will make sure the data in the package is appropriate for these analyses.

A vignette for the package that illustrates some simple visualizations, data manipulation and analyses should be included. Analysis of inter-arrival times for eruptions of a geyser with a long history would be one simple analysis. Cross-correlation of eruption times at some of the geysers is another analysis that could illustrate working with the data and visualizing it. |
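The two analyses mentioned above can be sketched in base R. This is only an illustration under stated assumptions: the eruption times are simulated (not GeyserTimes data), the geyser names `geyser_a` and `geyser_b` are placeholders, and binning eruptions into 6-hour counts before calling `ccf()` is one possible way to compare two event series, not a recommendation from the team.

```r
# Simulated eruption times for two hypothetical geysers (NOT real data).
set.seed(1)
start <- as.POSIXct("2019-01-01 00:00:00", tz = "UTC")
geyser_a <- start + cumsum(rnorm(200, mean = 94 * 60, sd = 8 * 60))
geyser_b <- start + cumsum(rnorm(150, mean = 125 * 60, sd = 15 * 60))

# Inter-arrival times: intervals between consecutive eruptions, in minutes.
intervals_a <- as.numeric(diff(geyser_a), units = "mins")
print(summary(intervals_a))

# Cross-correlation: count eruptions per 6-hour bin over the shared time
# window, then correlate the two count series at a range of lags.
end <- min(max(geyser_a), max(geyser_b))
breaks <- seq(start, end, by = "6 hours")
bin_counts <- function(times, breaks) {
  # Times outside the bins become NA and are dropped by table().
  as.numeric(table(cut(times, breaks = breaks)))
}
counts_a <- bin_counts(geyser_a, breaks)
counts_b <- bin_counts(geyser_b, breaks)
cc <- ccf(counts_a, counts_b, lag.max = 10, plot = FALSE)
```

Calling `hist(intervals_a)` or `plot(cc)` afterwards would give the kind of simple visualization a vignette could show.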
I like what @spkaluzny put forth for the data storage. I think there are a few other data points that may be worth building out access in R. I will describe all of them below.
As to analysis, once we start processing the data some ideas may come to the team. If we start building out something for geyser analysis it might make sense to have two different packages. |
This all looks great. Just as a reminder, I won't have much time to work on this until April. |
Sorry for the long silence; I've been a bit busy lately. I also think the data storage approach suggested by @spkaluzny is the way to go. There are just a few open questions left in my mind:
Tentatively, the data function might look like this:

An example script using this might look like this (just a quick sketch, so don't expect this to run; also, I'm not exactly an R expert so excuse any mistakes):

    library(geysertimes)
    library(dplyr)  # provides filter(), arrange() and desc()

    # load_data() is the proposed loading function; it is not implemented yet
    eruptions <- geysertimes::load_data("Old Faithful")
    # keep only electronically recorded eruptions, newest first
    electronic <- filter(eruptions, E == 1)
    electronic <- arrange(electronic, desc(time))
    hist(electronic$interval)

This is a completely hypothetical usage for now, as the

What are your thoughts? Anyone who has the time is more than welcome to proceed with the implementation 😃. |
I have been thinking more about how to download the data for the package, since the current full data set, compressed, is 16 MB, which far exceeds CRAN's 5 MB limit. I don't think hosting a separate data package on another site using I think having functions to download the data and optionally save it in an appropriate location is the way to go. @TR4Android listed some of the details we would have to work out for this.
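A minimal sketch of the "download once, reuse afterwards" idea, to make the discussion concrete. The function name `gt_fetch` and its arguments are hypothetical and not part of any published package; the archive file name has to be supplied by the caller because it is not specified in this thread. The default location uses `tools::R_user_dir()`, which assumes R 4.0.0 or later.

```r
# Hypothetical helper: download a file into a user-writable directory,
# skipping the download when a cached copy already exists.
gt_fetch <- function(url,
                     dest_dir = tools::R_user_dir("geysertimes", which = "data"),
                     overwrite = FALSE) {
  if (!dir.exists(dest_dir)) {
    dir.create(dest_dir, recursive = TRUE)
  }
  dest <- file.path(dest_dir, basename(url))
  if (file.exists(dest) && !overwrite) {
    # Cached copy found: no request is sent to the server.
    return(dest)
  }
  utils::download.file(url, destfile = dest, mode = "wb")
  dest
}

# Usage (not run here; <archive-file> is a placeholder, not a real file name):
# path <- gt_fetch("https://geysertimes.org/archive/<archive-file>")
# path <- gt_fetch("https://geysertimes.org/archive/<archive-file>", overwrite = TRUE)
```

Defaulting to a per-user data directory avoids writing into the package installation directory, and the `overwrite` flag gives users an explicit way to refresh the data without a download on every load.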
The LAGOSNE (Interface to the Lake Multi-Scaled Geospatial and Temporal Database) package downloads data on U.S. lakes. Many of the ideas I described above were developed after looking at how |
I agree with your analysis @spkaluzny, a
Feel free to start with the implementation whenever you're ready. I think the download/load process should be good to go. Further down the road we should think about filtering functions (geyser, electronic, primary, etc.) and other helpers. |
Alright everyone, I'm freed up with school for a couple months, and will have time to work on this. I'm going to dive in on what you've said previously, and get to work on developing things. But I do want to ask, has any implementation already been done? |
Also, as far as updating work as we go along, would you like to use a project to keep track of tasks? Would you prefer that I fork the repository and make pull requests, or that you give me push permission so I can clone the repository and push directly? I'm the junior here, so I want to do things the way you prefer and not step on any toes, but I'm excited and ready to dive in. |
I have started on the package and had a similar question about posting / updating the repository. I think we want to post to geysertimes not geysertimes-r-package. |
@codemasta14 @spkaluzny Good to hear that work has started on this project! The suggested way for contributing is to create a new branch on this repository, do the work there (e.g. add the function for loading the database), then open a pull request on the

tl;dr proposed Development Pipeline (for v1 of this R package)
If you deviate from this, that's okay. As long as the end result is an organized repository with readable and functional code, we're fine with it. This is just meant to help you get started 😉 |
I have created a branch, data-load, that consists of an initial full package with functions and help files for getting the data from https://geysertimes.org/archive/ and storing an R binary object of the data for subsequent use (in a tibble). Some notes about the package:
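As a rough illustration of storing an R binary object for later use, the sketch below parses a tiny tab-separated sample, converts epoch seconds to date-times, and round-trips the result through `saveRDS()` and `readRDS()`. The column names and the sample rows are invented for this example and do not describe the actual archive format or the code on the branch; a plain data frame is used, which `tibble::as_tibble()` could convert if the tibble package is installed.

```r
# Invented two-row sample; the real archive layout may differ.
raw <- "geyser\teruption_time_epoch\nOld Faithful\t1546300800\nOld Faithful\t1546306440\n"
eruptions <- read.delim(text = raw, stringsAsFactors = FALSE)

# Convert epoch seconds to POSIXct so that time arithmetic works directly.
eruptions$time <- as.POSIXct(eruptions$eruption_time_epoch,
                             origin = "1970-01-01", tz = "UTC")

# Store the parsed object once; later sessions read the binary file
# instead of parsing the text archive again.
rds_path <- file.path(tempdir(), "eruptions.rds")
saveRDS(eruptions, rds_path)
restored <- readRDS(rds_path)
```

In a package, `rds_path` would point to a persistent user-writable location rather than `tempdir()`.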
|
Great. What can I do to add onto this and help? |
It would be good for the team to try out the package. Feedback on the design and its usage is welcome. |
Hey everyone, I'm sorry I've been out of touch for the last couple of weeks. I've been in the process of moving across the country for my summer internship. I'll still be working on this, but will have a little less time to do so until August, when I finish up here. |
@codemasta14 No worries here, this package is done when it's done. We're not on any schedule here, so there's absolutely no rush. Have fun with your internship! |
There are two main sources for obtaining the GeyserTimes data in a machine-readable format:
The choice of source largely depends on whether you want to do a more detailed study on a larger chunk of data or whether you want to analyze the most recent behavior, but for a shorter time frame. Please keep in mind that GeyserTimes is running on donated resources and bombarding the server with requests should be avoided at all costs.
Before starting with the development itself, here are a few questions that we'd like to ask in order to improve the tooling that GeyserTimes provides as well as help guide the package design:
Thanks again for your interest! We're already looking forward to the day this package gets published on CRAN for everyone to use.
Side note: At my university it's exam time and thus my availability will be very limited over the next two weeks as I'm busy preparing for exams. I will still check this thread from time to time and do my best to answer any questions you might have.