Suggestions for Data sets/Variables to Add #1

Closed
15 tasks
christophergandrud opened this issue Jan 31, 2014 · 17 comments

Comments

@christophergandrud
Contributor

Feel free to suggest data sets and variables that would be useful to include in psData. Code contributions are also always very helpful.

@christophergandrud
Contributor Author

Note: "sign in required" indicates that the data source requires some sort of login to access the data. This makes it harder to create a function that downloads the data.

@jknowles

Some big datasets from the American politics field:

@steffenzi

@ulfelder

I have scripts that process all of these except HIIK's Conflict Barometer. If those might be helpful, please send me an email at ulfelder gmail and I will pass them along.

@christophergandrud
Contributor Author

@ulfelder If you already have scripts for processing these data sets, that would be really helpful. Would it be possible for you to fork the psData dev branch and just place the scripts in a new folder called misc? We can build from there. This way you'll be logged as a contributor and can more easily take credit.

I can email you separately, if you'd prefer that.

@ulfelder

Will do. I've only used GitHub in rudimentary ways so far, so please
forgive me if it takes me a bit to figure out how to get that done.


Jay Ulfelder, Ph.D.
Twitter: @jay_ulfelder http://twitter.com/#!/jay_ulfelder
Long-form blog: Dart-Throwing Chimp http://dartthrowingchimp.wordpress.com/
Short-form blog: Tumbling Chimp http://dartthrowingchimp.tumblr.com/

@christophergandrud
Contributor Author

No worries, just let me know if you have any questions.

@briatte
Contributor

briatte commented Feb 28, 2014

I've built draft methods for Quality of Government, World Development Indicators, Gleditsch and Ward and Powell and Thyne data. Let me know if I should merge them to your work.

My design uses a data frame attribute to store the same information that Stata uses to declare panel data for the xt commands, plus more (e.g. the format of the country variable). I then pass these settings to functions that use them to manipulate CSTS data, e.g. merging, lagging, etc. I can also try translating these functions.

Last, there have been recent updates on SDMX in R, so I can also work on trying to get Eurostat and/or OECD data in there.
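
For readers following along, here is a minimal base R sketch of the attribute-based design described in this comment. It is not the code referred to above: the function names `set_panel()` and `lag_panel()`, the attribute name `panel`, and the example columns are all hypothetical, and the lag is row-based, so it assumes each unit has one row per consecutive period.

```r
# Hypothetical helper: record which columns identify the unit and the time
# period, plus the format of the unit identifier, as a data frame attribute.
set_panel <- function(df, unit, time, unit_format = NA_character_) {
  stopifnot(unit %in% names(df), time %in% names(df))
  attr(df, "panel") <- list(unit = unit, time = time, unit_format = unit_format)
  df
}

# Hypothetical helper: lag a variable by k periods within each unit,
# reading the unit and time columns from the stored settings.
lag_panel <- function(df, var, k = 1) {
  p <- attr(df, "panel")
  stopifnot(!is.null(p), var %in% names(df))

  # Sort by unit, then time, so the previous row is the previous period.
  df <- df[order(df[[p$unit]], df[[p$time]]), , drop = FALSE]

  # Shift values down by k rows within each unit, padding with NA.
  shift <- function(x) c(rep(NA, k), x)[seq_along(x)]
  df[[paste0(var, "_lag", k)]] <- ave(df[[var]], df[[p$unit]], FUN = shift)

  # Re-attach the settings explicitly rather than relying on them
  # surviving the manipulation above.
  attr(df, "panel") <- p
  df
}

# Example with made-up values.
toy <- data.frame(
  iso2c = c("FI", "FI", "DE", "DE"),
  year  = c(2001, 2000, 2000, 2001),
  gdp   = c(2, 1, 10, 11),
  stringsAsFactors = FALSE
)
toy <- set_panel(toy, unit = "iso2c", time = "year", unit_format = "iso2c")
toy_lagged <- lag_panel(toy, "gdp")
```

Keeping the settings on the data frame means later functions do not need the unit and time columns passed in again, at the cost of having to re-attach the attribute after operations such as `merge()` that drop it.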

@briatte
Contributor

briatte commented Feb 28, 2014

@christophergandrud
Contributor Author

@briatte This is a great suggestion. I have a number of thoughts on what we might do that I've put in a new issue #3. I'd really like to know what you think of them.

@antagomir
Member

It seems to me that the data sources described above would make several useful packages.

Food for thought: it has often turned out to be more useful to create multiple smaller, compact packages than a single package that contains everything: (i) handling dependencies is considerably easier with smaller packages, (ii) tutorials remain more compact and readable, (iii) packages tend to remain more stable, and (iv) responsibilities can be more clearly allocated between developers. It is good to start with a single package, but also good to consider splitting it into more compact pieces at an early stage. In our experience, splitting does not add much maintenance overhead; rather the contrary.

This is our experience after working with Finnish open government data packages since 2009. The rOpenSci folks also seem to prefer minimal, compact packages, mostly one package per data source (API). I believe they reached the same conclusion.

@christophergandrud
Contributor Author

@antagomir I think that is a really good idea. I see at least two main characteristics that could divide these data sets into multiple packages that would make sense from a user/package perspective:

  • Country-year data / other data
  • Data downloaded from website URLs / data downloaded from APIs

The data currently in psData, as well as QoG, are country-year and downloaded from URLs, not APIs. Users are likely to want to merge the different data sets into one data frame. So it makes sense to have a package that would gather and clean them in a consistent way such that they could be merged together easily.

Conversely, for example, users probably don't want to merge survey data with country-year data. It makes less sense to include these data in one package.

Maybe the package should be renamed something like psCountryData?

@antagomir
Member

I agree, it is difficult to draw the line. One option to consider is to have separate packages for distinct data sources, and then have the merging functions either in a generalist package that depends on these individual data crawling packages, or in one of the data packages. This way one could still isolate some parts into their own packages and keep most of the advantages of such a split.

@christophergandrud
Contributor Author

Yeah, I just posted a similar thought over on #3.

The focus would be on creating a common core syntax/capabilities that could be applied across packages that gather country-year data.

Each individual data set-package would use a similar syntax to return data frames from multiple sources that could easily be merged together.

Does this make sense?
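
As an illustration only, the "common core syntax" idea in this comment could look roughly like the base R sketch below. Everything here is an assumption made for the example: the helper names `standardise_cy()` and `merge_cy()`, the standard key columns `country` and `year`, and the two toy data frames standing in for the output of separate data set packages. No real psData function is shown.

```r
# Hypothetical shared helper: rename a source's own country and year columns
# to a common pair of key columns and put the keys first.
standardise_cy <- function(df, country, year) {
  stopifnot(country %in% names(df), year %in% names(df))
  names(df)[names(df) == country] <- "country"
  names(df)[names(df) == year] <- "year"
  df$year <- as.integer(df$year)
  key <- c("country", "year")
  df[, c(key, setdiff(names(df), key)), drop = FALSE]
}

# Hypothetical shared helper: full outer merge of any number of
# standardised country-year data frames.
merge_cy <- function(...) {
  Reduce(
    function(x, y) merge(x, y, by = c("country", "year"), all = TRUE),
    list(...)
  )
}

# Toy stand-ins for what two separate data set packages might return
# (made-up values, made-up column names).
source_a <- data.frame(ccode = c("FI", "DE"), yr = c("2000", "2000"),
                       score = c(10, 9), stringsAsFactors = FALSE)
source_b <- data.frame(iso = c("FI", "SE"), year = c(2000, 2000),
                       gdp = c(1, 2), stringsAsFactors = FALSE)

combined <- merge_cy(
  standardise_cy(source_a, country = "ccode", year = "yr"),
  standardise_cy(source_b, country = "iso", year = "year")
)
```

The point of the sketch is the division of labour: each data set package only has to return a data frame with the agreed key columns, and the merging logic lives in one place.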

@antagomir
Member

Absolutely.

@leeper
Member

leeper commented Feb 28, 2014

I agree. I think that is exactly the right framework.

Thomas J. Leeper
http://www.thomasleeper.com


@christophergandrud
Contributor Author

Great, I'm directing this conversation over to #5.
