Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial package for getting and loading the data. #3

Merged
merged 3 commits into from
May 12, 2019
Merged

Conversation

spkaluzny
Copy link
Collaborator

A complete package for getting the data from the website and loading the data into an R session. See my note under the Package Design issue.

@codemasta14
Copy link
Collaborator

@spkaluzny @TR4Android, @glennon, I've looked through all of it, and it looks good to me. It implements all of the usage I built in my push. I like the standardized naming process as well. I like having gt_ to start the functions. Didn't we discuss having gt_load_data as an internal function and gt_read_data was available to the user? Maybe we could have a wrapper function of the two that is gt_data? I can start on that after we accept this push. Let's go ahead and accept this pull request.

Along the lines of developing and package attribution, @spkaluzny has entered his author information. I can update author information for me and J. Hathaway. @TR4Android and @glennon do you want to provide your name and email so we can update the R description file?

I would like to start working on the help functions. I'll document out the functions and provide a glossary to the variables in the help file.

@taltstidl
Copy link

Sorry for the late reply, I just gave this a spin and this looks like an awesome starting point. The way the latest archive version is determined is a bit finicky currently, but that's the best we can currently do. As to the column names, this is something that should definitely be more standardized (@codemasta14 you can find a short glossary on our website, not all fields are contained in the archive though). Discussion on that can continue in issue #4 and I'll merge any needed code changes in a separate pull request.

I love the added documentation, as I see this being an important aspect if we want people to actually use the package 😃. It would be nice if we could keep the vignettes up to date while we develop this, function references can be added at a later point, I guess. I'll try to get a pkgdown website up and running when the package is finished.

@spkaluzny One minor nitpick: can we move everything up one folder (i.e. move the contents of the geysertimes folder to the top level directory)? I feel like this is just an unnecessary indirection here and most packages I know of on GitHub include their R folder at the top level.

Thanks for taking the time to develop and review this!

@spkaluzny
Copy link
Collaborator Author

@TR4Android we can move the package code in my branch up one folder but that folder should be called geysertimes, the name of the package - currently it is called geysertimes-r-package. Is this something I have to do (not that experiences working with Github).

@spkaluzny
Copy link
Collaborator Author

spkaluzny commented May 9, 2019 via email

@taltstidl
Copy link

@spkaluzny Two things:

  • The repository name and folder name need not match. Just because the repository is called geysertimes-r-package on GitHub that does not mean that your local folder containing these files also needs to be named geysertimes-r-package. It's perfectly fine to have a folder geysertimes pointing to a geysertimes-r-package remote repository (thus eliminating the middle geysertimes folder currently in this repository). I'm guessing the folder name is a requirement of CRAN?
  • I share some of @codemasta14 sentiments. Especially for scripted usage, I feel that splitting the process into gt_get_data and gt_load_data might cause some confusion (especially when run at irregular intervals and always wanting the most up-to-date data). Personally, I manually adjusted the path to point to a folder in my workspace anyway (thus disregarding the default set by gt_path). In that case, the same path needs to be given to both functions, and thus a single function could be advantageous. I'm fine with both options though.

@spkaluzny
Copy link
Collaborator Author

The archive that we submit to CRAN will have to be named geysertimes_1.0.tar.gz (for version 1.0) and it will have to unpack into the folder called geysertimes. On Github the files could be in geysertimes-r-package instead of geysertimes-r-package/geysertimes.

However, most R packages on Github are stored under a folder that is the named the same as the package itself. The function remotes::install_github can be used to install a package directly from Github. If we put the package directly in geysertimes-r-package then the install command would be:
remotes::install_github("geysertimes/geysertimes-r-package")
but the resulting package that gets installed would be called geysertimes, which could be confusing.

Should I delete my current pull request and update the data-load branch to put the files directly under geysertimes-r-package?

@taltstidl
Copy link

I'm guessing that is an issue regardless of whether the subfolder geysertimes exists or not. I don't know the details of the devtools::install_github function, but I'm guessing it will still create a folder called geysertimes-r-package for our package and then download the geysertimes subfolder within that folder (don't know too much about R package development but I can see a possible confusion for the tools here, as it might expect a R folder in the top-level directory?). So if we really wanted to prevent this we'd probably need to rename this repository here on GitHub (which I'm reluctant to do as we might want to develop packages for other programming languages in the future. Python being a prime candidate here as lots of people use it for data analysis and machine learning.). The only option that would come to mind here is to rename both the repository and the package to something like geysertimes-r. Thoughts?

You can simply add any changes to the data-load branch and this pull request will automatically update. No need to close it and then open a new pull request.

@spkaluzny
Copy link
Collaborator Author

I have updated my data-load branch to have the top level package directory be geysertimes-r-package instead of geysertimes-r-package/geysertimes.

With my original structure, you would have to install from Github with:
remotes::install_github("geysertimes/geysertimes-r-package/geysertimes")
now you should just need
remotes::install_github("geysertimes/geysertimes-r-package)
but the package that is installed is called geysertimes which is what we want. The R package utilities get the name of the package from the Package: entry in the DESCRIPTION file, not from the directory name or name of the tar.gz archive containing the package source.

@taltstidl
Copy link

@spkaluzny Nice, thanks for doing the research 🎉! That should be more aligned with the way most people download packages from GitHub and the package name is also correct 👍. I'm merging this pull request, we can iterate on the data loading process later on if need be. Thanks for your implementation and documentation!

@taltstidl taltstidl merged commit c24ac9f into master May 12, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants