Suggestions for Data sets/Variables to Add #1
Comments
Note: "sign in required" indicates that the data source requires some sort of login to access the data. This increases the difficulty of creating a function to download the data.
Some big datasets from the American politics field:
I have scripts that process all of these except HIIK's Conflict Barometer. If those might be helpful, please send me an email at ulfelder gmail and I will pass them along.
@ulfelder If you already have scripts for processing these data sets, that would be really helpful. Would it be possible for you to fork the psData dev branch and just place the scripts in a new folder called misc? We can build from there. This way you'll be logged as a contributor and can more easily take credit. I can email you separately, if you'd prefer that.
Will do. I've only used GitHub in rudimentary ways so far, so please
On Tue, Feb 25, 2014 at 5:13 AM, Christopher Gandrud wrote:
Jay Ulfelder, Ph.D.
No worries, just let me know if you have any questions.
I've built draft methods for the Quality of Government, World Development Indicators, Gleditsch and Ward, and Powell and Thyne data. Let me know if I should merge them into your work. My design uses a data frame attribute to store the equivalent of the information that Stata uses to parameterize panel data. Last, there have been recent updates on SDMX in R, so I can also work on trying to get Eurostat and/or OECD data in there.
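The idea of carrying the panel identifiers along with the data can be sketched as follows. This is only an illustration in Python, not the R design described in the comment above: the class `PanelTable` and the names `unit_var` and `time_var` are hypothetical, and the check for duplicate unit-time pairs is an assumption about what such metadata would be used for.

```python
from dataclasses import dataclass


@dataclass
class PanelTable:
    """Rows of data plus metadata naming the panel identifiers (hypothetical)."""

    rows: list      # list of dicts, one per observation
    unit_var: str   # column holding the cross-sectional unit, e.g. a country code
    time_var: str   # column holding the time period, e.g. a year

    def __post_init__(self):
        # A panel needs each (unit, time) pair to identify at most one row.
        seen = set()
        for row in self.rows:
            key = (row[self.unit_var], row[self.time_var])
            if key in seen:
                raise ValueError(f"duplicate panel observation: {key}")
            seen.add(key)

    def panel_keys(self):
        """Return the sorted (unit, time) pairs present in the table."""
        return sorted((r[self.unit_var], r[self.time_var]) for r in self.rows)
```

Because the identifiers travel with the table, downstream functions can look them up instead of asking the user to name the unit and time columns again.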
It seems to me that the data sources described above would make several useful packages. Food for thought: it has often turned out to be more useful to create multiple smaller, compact packages than a single package that contains everything: (i) handling dependencies is considerably easier with smaller packages, (ii) tutorials remain more compact and readable, (iii) packages tend to remain more stable, and (iv) responsibilities can be more clearly allocated between developers. It is good to start with a single package, but also good to consider splitting it into more compact pieces at an early stage. In our experience, splitting does not add much maintenance overhead; rather the contrary. This is based on our work with Finnish open government data packages since 2009. The rOpenSci folks also seem to prefer minimal, compact packages, mostly one package per data source (API). I believe they reached the same conclusion.
@antagomir I think that is a really good idea. I see at least two main characteristics that could divide these data sets into multiple packages that would make sense from a user/package perspective:
The data currently in psData, as well as QoG, are country-year data downloaded from URLs, not APIs. Users are likely to want to merge the different data sets into one data frame, so it makes sense to have a package that gathers and cleans them in a consistent way such that they can be merged together easily. Conversely, users probably don't want to merge, for example, survey data with country-year data, so it makes less sense to include such data in one package. Maybe the package should be renamed something like psCountryData?
I agree, it is difficult to draw the line. One option to consider is to have separate packages for distinct data sources, and then have the merging functions either in a generalist package that depends on these individual data crawling packages, or in one of the data packages. This way one could still isolate some parts into their own packages and retain most of the advantages of such a split.
Yeah, I just posted a similar thought over on #3. The focus would be on creating a common core syntax and set of capabilities that could be applied across packages that gather country-year data. Each individual data set package would use a similar syntax to return data frames from multiple sources that could easily be merged together. Does this make sense?
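A minimal sketch of what "easily merged together" could mean in practice, assuming every source package returns rows that share the same country and year key columns. It is written in Python for illustration only; the function `merge_country_year`, the key names `iso2c` and `year`, and the placeholder values are all hypothetical and not part of psData.

```python
def merge_country_year(*tables, keys=("iso2c", "year")):
    """Full outer join of several lists of row dicts on shared key columns.

    Columns missing for a given country-year are simply absent from that row.
    If two tables share a non-key column name, the later table's value wins.
    """
    merged = {}
    for table in tables:
        for row in table:
            key = tuple(row[k] for k in keys)
            # Start each merged row with its key columns, then add the rest.
            merged.setdefault(key, dict(zip(keys, key))).update(row)
    return [merged[key] for key in sorted(merged)]


# Placeholder rows standing in for two separately downloaded sources.
source_a = [{"iso2c": "FI", "year": 2000, "x": 1}]
source_b = [
    {"iso2c": "FI", "year": 2000, "y": 2},
    {"iso2c": "SE", "year": 2000, "y": 3},
]
combined = merge_country_year(source_a, source_b)
```

The merge only works this simply because both sources already agree on the key columns, which is the point of having a common core syntax across the data-gathering packages.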
Absolutely.
I agree. I think that is exactly the right framework.
Thomas J. Leeper
On Fri, Feb 28, 2014 at 3:02 PM, Leo Lahti wrote:
Great, I'm directing this conversation over to #5.
Feel free to add data sets and variables that would be useful to include in psData. Code contributions are also always very helpful.