Cache data locally after download #8
Comments
This is a great idea. Are you thinking an R.cache dependency, or something else?
+1
No need for a dependency. Just let the user specify a local path to save things to.
Note that CRAN has reservations about packages writing local files: "Packages should not write in the users’ home filespace, nor anywhere else on the file system apart from the R session’s temporary directory (or during installation in the location pointed to by TMPDIR: and such usage should be cleaned up). Installing into the system’s R installation (e.g., scripts to its bin directory) is not allowed." http://cran.r-project.org/web/packages/policies.html But perhaps this is not a problem if the package does not do this by default, only after the user's active choice. Of course there should be room for packages that are designed for data access, but we need to be careful about how to implement that. My suggestion is to cache to the local tmpdir.
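A minimal sketch of the tmpdir suggestion, with hypothetical helper names (cache_locally, read_cache) that are not part of psData: write under tempdir() by default, which is cleaned up with the R session and so stays within the CRAN policy quoted above, and make any other location an explicit user choice.

```r
# Hypothetical helpers (not part of psData): save a downloaded data frame under
# the R session's temporary directory by default, the CRAN-safe location; any
# other directory must be chosen explicitly by the user.
cache_locally <- function(data, name, cache_dir = tempdir()) {
  path <- file.path(cache_dir, paste0(name, ".rds"))
  saveRDS(data, path)   # base R serialisation, no extra dependency
  invisible(path)
}

read_cache <- function(name, cache_dir = tempdir()) {
  path <- file.path(cache_dir, paste0(name, ".rds"))
  if (file.exists(path)) readRDS(path) else NULL
}
```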
Just added an option for this. (I don't think this is a problem for CRAN, as I have a number of packages that do this; the default is always set to not write the file.) I do like the cache idea though. It allows the user to not have to think about their file paths, which can differ when they run the function on different computers, for example.
Cool. I think there should also be a way to replicate the file processing (e.g. creating unique country names, dropping duplicate country-years, etc.), so we'd need to save an R script as well.
@vincentarelbundock currently it saves the downloaded and processed data frame as the last step before returning it.
That's cool, but in my view it's not enough. If you're given only the raw and the processed data, you still might have to reverse engineer things to find out what the transformations were. Was that log(variable) or log(variable + 1)? Which duplicate countries did we drop, exactly? I think what makes the most sense is related to my issue here: #10. You would have two files per dataset, plus a standardized function to load and process them.
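As an illustration of what such a standardized, self-contained cleaning step could look like (the function and column names here are made up, not taken from psData or issue #10), a per-dataset script makes every transformation inspectable:

```r
# Hypothetical per-dataset cleaning script: every transformation is written
# out explicitly, so nobody has to reverse engineer log(x) vs log(x + 1).
clean_dpi <- function(raw) {
  out <- raw
  # Drop duplicate country-year rows, keeping the first occurrence
  out <- out[!duplicated(out[, c("country", "year")]), ]
  # Documented transformation: log(checks + 1), not log(checks)
  out$log_checks <- log(out$checks + 1)
  out
}
```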
I agree that in principle it would be useful to have both the raw data and the full script available in a reproducible format. If this gets too messy, however, it would be useful to at least store the package version and function name. That alone is already rather useful, and virtually sufficient for full reproducibility.
I think your current setup is going to be much messier in the long run. "Silos" for each dataset are the only way I can think of to cleanly separate the cleaning scripts. That's a really simple way to scale the project...
The function calls + package version is effectively a recipe. So for example, having an R script that went:

## Using psData 1.0
library(psData)
# Download data
Main <- get_dpi(vars = 'checks', sha1 = 'foo', file = 'Checks.rda')
# Set as psData
MainP <- as.panel(Main, 'Country', 'Year')
# Transform (transformation functions not yet created)
# Drop observations, etc.

It would definitely be helpful to add the package version and call to the metadata. Note also, something that hasn't been started yet: @briatte did a good job of creating a way of running basic data sets through a common set of functions. Thoughts?
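One hedged way to make that recipe automatic (the add_recipe helper and attribute names are invented for illustration, not existing psData API) is to stamp the returned data frame with the generating call and the installed package version:

```r
# Sketch only: attach the call and package version as attributes so the
# "recipe" travels with the returned object.
add_recipe <- function(data, call, pkg = "psData") {
  attr(data, "source_call")     <- paste(deparse(call), collapse = " ")
  attr(data, "package_version") <- as.character(utils::packageVersion(pkg))
  data
}

# Inside a getter one might then do, e.g.:
# Main <- add_recipe(Main, match.call())
```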
Here's a repo I quickly put together as a proof of concept to show what I mean more concretely: https://github.com/vincentarelbundock/dpi I think you may be right. I guess I'm just suggesting that a slightly different repo structure with explicit silos would be cleaner, more scalable, and much easier to contribute to. In my experience, generic functions called things like "get_data" tend to try to do too much and end up being unwieldy. Better to provide some basic functionality and cleanly delineate which transforms were applied to each dataset. Even if there is some code duplication, it is better for transparency to have a file called "database_political_institutions.R" that is more or less self-contained and shows the user exactly what is done in the pre-processing stage.
I agree. Unfortunately, I am personally too busy building the overall rOpenGov infrastructure right now to contribute much code here. But based on our experiences, I also think that Vincent's suggestion is the optimal way to go, if time allows others to advance this.
I think there is some confusion among the various people working on the package. In the original version, each data set was hard-siloed into its own function. Then it was recommended that these be spun off into separate packages that would be called by psData. However, subsequent collaborator updates moved back towards generic functions; I think this is the change that @vincentarelbundock is arguing against. I largely agree. However, I see the logic in having a core set of functionalities in psData itself. I would also recommend against baking transformations and recoding of the data into the getter functions. Finally, it would be great to work on a way to dynamically generate metadata like BibTeX entries and to note psData versions. It will be easier over the long term than updating these manually.
Thanks @christophergandrud, that's useful background info. I don't want to insist too much because I don't have a good sense of the 10,000 ft view, but I agree that a common entry point would be useful. One simple approach would be to have a self-contained script for each dataset, with an accompanying yaml file of metadata. Plus there is a single point of entry for all datasets: a get_data function. The info to include in the accompanying yaml files can be agreed on, and enforced at the Pull Request level.
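A rough sketch of that layout, under assumptions that are mine rather than the thread's: each silo is a directory holding a clean.R script and a yaml metadata file, the yaml package is available, and every silo defines a build() entry point; get_data() only dispatches.

```r
# Hypothetical repo layout: datasets/<name>/<name>.yaml and datasets/<name>/clean.R
# get_data() reads the metadata, sources the silo's script, and runs its
# agreed-upon entry point. Requires the 'yaml' package.
get_data <- function(name, base_dir = "datasets") {
  silo <- file.path(base_dir, name)
  meta <- yaml::read_yaml(file.path(silo, paste0(name, ".yaml")))
  env  <- new.env()
  source(file.path(silo, "clean.R"), local = env)
  data <- env$build()                 # assumed convention: each silo defines build()
  attr(data, "metadata") <- meta
  data
}

# e.g. dpi <- get_data("dpi")
```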
I agree about minimizing data transformations. My log example was a bad one. Think more about dropping YUG after the break-up.
@vincentarelbundock I really like your suggestion for the user-facing get_data function. We should have a separate discussion on how to deal with the 'problem countries' like Yugoslavia and East/West Germany. These are important things to consider because they affect merging, which is going to be a core psData functionality. Right now there is a really crude method of notifying the user about/dropping duplicates, but it is kind of unsatisfactory.
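For the merging concern, a small sketch (the check_panel name and the country-code list are illustrative only) that flags duplicate country-years and known splits/reunifications instead of silently dropping them:

```r
# Flag duplicate country-year rows and known 'problem countries' so the user
# decides what to drop before merging; the code list below is illustrative.
check_panel <- function(data, country = "country", year = "year",
                        problem = c("YUG", "GDR", "FRG")) {
  dup <- duplicated(data[, c(country, year)])
  if (any(dup)) {
    warning(sum(dup), " duplicate country-year observations found")
  }
  hits <- unique(data[[country]][data[[country]] %in% problem])
  if (length(hits) > 0) {
    message("Problem countries present: ", paste(hits, collapse = ", "))
  }
  invisible(data[!dup, ])
}
```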
Yes, get_data as a wrapper for the full versions should be the way to go. We recently had a related discussion on the rOpenGov mailing list. It would be good to harmonize these conventions across the different packages, and the model discussed here is the best candidate so far in my view. We still need to think about how to implement caching.
Allow users to pass a flag that will cache the data locally after it is downloaded (see the sketch after the list). Goals:
1- Reproducibility. Even if the online version disappears, the user will still have a local copy.
2- Quicker download on second request.
Should be accompanied by a standalone cleaning script if applicable.
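A rough sketch of the flag itself (the wrapper name, arguments, and CSV reader are placeholders, not psData's actual API): the first call downloads the data and optionally writes a local copy; later calls read the cached copy back, covering both goals above.

```r
# Hypothetical download wrapper with an opt-in cache (off by default, so
# nothing is written without the user's explicit choice).
get_cached <- function(url, cache = FALSE, cache_dir = tempdir()) {
  path <- file.path(cache_dir, paste0(basename(url), ".rds"))
  if (cache && file.exists(path)) {
    return(readRDS(path))            # goal 2: quicker second request
  }
  data <- read.csv(url, stringsAsFactors = FALSE)
  if (cache) {
    saveRDS(data, path)              # goal 1: a durable local copy
  }
  data
}
```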