Overhaul of psData #5

Closed
christophergandrud opened this issue Feb 28, 2014 · 27 comments
Comments

@christophergandrud
Contributor

Following discussions in #1 and #3, there seems to be a consensus emerging that we should focus on:

  1. packages for downloading/cleaning individual data sets
  2. at the same time, at least for data in country-year format, a common syntax/capabilities should be developed that makes it easy for users to download data sets with these individual packages and returns data frames that can be easily merged with one another.

As such, I think what we might want to do is

(a) Create a new text document repo that would be used to collaborate on a common psCountryData framework. (Probably starting with an .md document that would develop a framework checklist for individual country-year data packages to follow. Also, because my main professional incentive right now is to publish papers, this could be developed into a JSS style article laying out the framework and giving examples from packages that implement it. Any interested person could of course co-author.).

(b) Create a new package called psCountryData that would contain core functions shared by the individual data Getter/Variable Builder packages. For example, psData currently includes a CountryID function. This is a modified version of countrycode that is handy for creating merge-ready country identifiers. It looks like there is some good stuff in @briatte's QoG package that could go in there too.

(c) Break up psData into individual Getter and Variable Builder packages that implement this framework. Similarly, the two QoG packages could (depending on the authors' preferences) implement the unified syntax.

Any thoughts?

@leeper
Member

leeper commented Feb 28, 2014

Yes, open a new repo and then let's brainstorm there.

Thomas J. Leeper
http://www.thomasleeper.com

@antagomir
Member

We may also wish to develop general guidelines for parliamentary data and some other data types that are seen in many countries. Perhaps we should start a general guidelines section or repository under rOpenGov? In the longer run we could accumulate there multiple recommendation documents for different general data types. If the guidelines are general rather than tied to an individual package, perhaps people will be more likely to implement them also in their own work?

@christophergandrud
Contributor Author

I like that idea a lot.

Maybe what we need are three repos:

  • A general rOpenGov syntax guideline document (not sure if this needs its own repo, could just be a section on rOpenGov)
  • A repo for the psCountryDataFramework write up.
  • A repo for the psCountryData R package

?

@briatte
Contributor

briatte commented Feb 28, 2014

There might be two projects here:

  1. if psCountryData is going to use a unified method, it will also work for non-country panel designs, as long as the non-country codes are unique, as expected when using COW or ISO or whatever country codes. The same applies to the time variable, which might be year, quarter (thinking of Quandl or Eurostat data here), or else. Developer-side, we can keep that flexibility when thinking about the unified method. User-side, it should be relatively easy to offer a way to auto-detect whether the panel variable is a list of country codes, and detect its classification (users might like it).
  2. I have only limited experience with parliamentary data, and it's mostly with French legislative data, oriented towards cosponsorship networks. If that's the kind of data you are thinking of, I have experience with manipulating amendments, bills and resolutions, and can help with that, but it's a separate project, with very different questions (e.g. how to handle parliamentary sessions, unique identifiers for amendments, bills, resolutions etc.).
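The auto-detection idea in point 1 could be sketched roughly as follows. This is only an illustration in base R: `guess_panel_format` is a hypothetical name, and the patterns are crude heuristics rather than a real classification test (a production version would presumably check the values against the countrycode tables).

```r
# Hypothetical helper: guess which coding scheme a panel identifier uses.
# The regular expressions are rough heuristics, not an authoritative test.
guess_panel_format <- function(x) {
  x <- as.character(x[!is.na(x)])
  if (length(x) == 0) return("unknown")
  if (all(grepl("^[A-Z]{2}$", x))) return("iso2c")
  if (all(grepl("^[A-Z]{3}$", x))) return("iso3c")
  # Numeric codes are ambiguous (ISO-3N and COW both qualify),
  # so this sketch only reports that they are numeric.
  if (all(grepl("^[0-9]{1,3}$", x))) return("numeric")
  "unknown"
}

guess_panel_format(c("FR", "DE", "FI"))
guess_panel_format(c("FRA", "DEU", "FIN"))
guess_panel_format(c(250, 276, 246))
guess_panel_format(c("Paris", "Berlin"))
```

Anything that falls through to "unknown" would be treated as a generic, non-country panel identifier, which keeps the developer-side flexibility mentioned above.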

@antagomir
Member

Yes,

Each R package needs its own repository (psCountryData R package).

I would avoid creating several new repositories for tutorials/guidelines; in the long term this will obscure the structure and scope of the rOpenGov account: the rOpenGov organization is aimed primarily at R packages. This helps to keep all automated procedures clear. We can have a few repositories for supporting material, such as the website infrastructure, guidelines etc., but should try to keep these to a minimum. But this is a community project, so I am ready to reconsider my views if there are other good suggestions. Ping @ouzor @jlehtoma @muuankarski

I would prefer either of these two options:

  1. Create a generic guidelines repository, and add the psCountryDataFramework materials under a subdirectory of it; later we may want to create other subdirectories for other themes, such as ParliamentaryDataFramework etc.

  2. Think of writing the paper as a separate project, and create the repository elsewhere. We can then point to these documents from the general rOpenGov syntax guideline document. The general rOpenGov syntax guideline document could be in rOpenGov wiki, or in its own repository.

@antagomir
Member

@briatte any plans to extend your work on French parliamentary data into an R package? Seems you are already getting close to that with all functions and examples.

@briatte
Contributor

briatte commented Feb 28, 2014

More on 1.

A possible way to implement the unified method would be to think of CSTS data the way igraph thinks of network objects (as always having vertices available to the V function and edges to the E function). The CountryID function is the equivalent of the V function, and instead of edges, you have a similar TimeID function for handling time series cross sections.

Since the country identifier is really a column name, storing the panel type only requires one character string. The panel type format (country, region, firm, etc.) might be another interesting piece of information to store, in order to auto-convert country codes before merging datasets (I think ISO-3N is the best baseline here). The same line of thinking applies to the time variable.

The metadata you want for the unified method is therefore something like

panel = "cid", format = "iso2c", time = "year", date = "%Y"

where format does extra things with country codes using countrycode, and date extra things with timestamps using lubridate.

If you want to split that in two components, then your method is set_panel(get_data("qog"), panel = "cid", format = "iso2c", time = "year", date = "%Y")

where get_data("qog") is the getter function and set_panel the workhorse to get the panel attributes attached to the object.
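A minimal sketch of how that split might look in base R. Neither function exists in psData in this form: `get_data` here is a toy stand-in that returns made-up rows instead of downloading anything, and storing the metadata in a single `"panel"` attribute is an assumption.

```r
# Toy stand-in for a getter; a real one would download and clean the data.
# The values below are invented for illustration only.
get_data <- function(source) {
  data.frame(cid  = c("FR", "FR", "DE", "DE"),
             year = c(2000, 2001, 2000, 2001),
             x    = c(1.2, 1.5, 0.7, 0.9),
             stringsAsFactors = FALSE)
}

# Attach the panel metadata as an attribute; the data frame itself is unchanged.
set_panel <- function(data, panel, format, time, date) {
  stopifnot(is.data.frame(data),
            panel %in% names(data),
            time %in% names(data))
  # A panel needs exactly one row per unit-period combination.
  if (anyDuplicated(data[, c(panel, time)]) > 0) {
    stop("duplicated panel-time observations")
  }
  attr(data, "panel") <- list(panel = panel, format = format,
                              time = time, date = date)
  data
}

qog <- set_panel(get_data("qog"), panel = "cid", format = "iso2c",
                 time = "year", date = "%Y")
str(attr(qog, "panel"))
```

Because the result is still an ordinary data frame, it can be passed to any function that expects one; the metadata simply rides along until a panel-aware function asks for it.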

psdata thus sounds like a fine name for the project, because it means either "political science" or "panel/series" data, which is what the package really offers to get: data, plus a method to manipulate country-year, or election-year, or region-decade, or firm-quarter data. It's all possible behind an end result that works mostly with CSTS data.

I'm not sure I'll have enough functions left to have a separate qogdata package. The only QOG-specific function I coded apart from get_qog is one that collapses old and new states (e.g. Sudan pre/post split, independences) into single-state country codes.

@briatte
Contributor

briatte commented Feb 28, 2014

@antagomir More on 2 (off topic).

I'm currently updating the repo to have a single method to process cosponsorships in amendments, bills and resolutions, for both the French National Assembly and the Senate. The data (scraper) functions are idiosyncratic, but the cosponsorship network functions might or might not make up package material.

The only other legislature I know with similar data is the U.S. Congress (the U.S. experts are Fowler, Waugh, Kirkland). Do you know of any other legislature with similar data? A package would become interesting if it could deal with data from several countries.

@christophergandrud
Contributor Author

@antagomir I like the idea of a generic rOpenGov framework repository where the psData framework material is a subdirectory. This will highlight the collaborative nature of the project and centralise the different frameworks in one place so that they are easy to follow and cross-reference, while not clogging up rOpenGov with lots of non-R repos.

@antagomir
Member

@christophergandrud Ok, sounds good. I will cross-check with other core team members and hopefully we could launch the repository shortly.

@briatte Not aware of any at the moment, but this is certainly interesting; I'll keep it in mind.

@christophergandrud
Contributor Author

@briatte I like the ideas you propose in 1. I want to think about this some more, but these are some initial thoughts:

  • I still like the idea of breaking up the data getters into different packages. Maybe, like Zelig does with non-core model types, the main psData package could draw on these other packages when needed. It could prompt the user to download a specific package if they don't already have it. My main thought is that some of the data sets that psData gets now actually require a lot of idiosyncratic post-processing (especially the ones for Reinhart and Rogoff data and Axel Dreher's IMF program data). These would be easier to maintain as standalone packages.
  • In the get_data method, it would be good if the user could specify which URL or data version to gather. Related to this, I've been meaning to add sha1 tags (like in source_data) so that the user can make sure that they are getting the version of the data they think they are getting.
  • I'm fine having the set_panel function return a non-data frame class object, though I think this might trip up novice users who try to stick the returned data into a function that only takes data frames. At the least we could have a simple function to strip the object down to a data frame (or clearly document how to do this).
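On the last point, one way to keep novice users out of trouble is sketched below, assuming a hypothetical `psdata` S3 class that extends `data.frame` (no such class exists in the package yet, and `new_psdata` is an invented constructor).

```r
# Hypothetical constructor: a data frame with an extra class and panel metadata.
new_psdata <- function(data, panel, time) {
  attr(data, "panel") <- list(panel = panel, time = time)
  class(data) <- c("psdata", "data.frame")
  data
}

# Strip the object back down to a plain data frame.
as.data.frame.psdata <- function(x, ...) {
  attr(x, "panel") <- NULL
  class(x) <- "data.frame"
  x
}

ps <- new_psdata(
  data.frame(cid = c("FR", "DE"), year = c(2000, 2000), x = 1:2),
  panel = "cid", time = "year"
)

inherits(ps, "data.frame")   # still usable wherever a data frame is expected
plain <- as.data.frame(ps)
class(plain)
```

Since the class inherits from `data.frame`, most functions should accept the object as is; the explicit `as.data.frame` method is there for the cases where a function checks the class strictly.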

Though if we do create a psData class it would definitely be good to come up with a merging method.

Otherwise, I think this is a really nice direction to head.

@antagomir Great.

@christophergandrud
Contributor Author

Oh, @briatte I really like the idea of psData as both 'political science data' and 'panel series data' and the general framework you sketched out that makes it useful for different types.

@briatte
Contributor

briatte commented Feb 28, 2014

  • I'm on board for separating get_data from set_panel; the amount of processing justifies it (although not always: there is almost none on Gleditsch and Ward, for instance, so we can include these as vignette/examples in the psdata package).

As I understand it, the code would look like

get.r
- get_wdi -> WDI::WDI(...)
- get_qog -> rQog::read.qog(...)
- get_etc -> etc.
- get_etc -> etc.
- get_etc -> etc.

panel.r
- panel_set   (perform all sorts of panel data checks)
- panel_shift (lag/lead)
- panel_decay (by @zmjones; for survival functions)
- panel_tse   (by @zmjones; time since event)

methods.r
- merge
- print     (equivalent to -xtdes- in Stata)
- summarize (equivalent to -xtsum- in Stata)
- lag       (see also dplyr::lag and stats::lag)
- lead      (see also dplyr::lead)
- plot      (simple ggplot2 code for time series)
- sample
- etc.
  • The get_data functions should be able to follow some parameter guidelines like those you suggest. Those already do: they use start and end parameters for downloading specific time ranges.
  • In my draft, the set_panel function is called xtset, as in Stata, and the other xt functions work with a data frame attribute, which is inoffensive for users who just want the data and nothing else. But this attribute is not a real S3 class like panel.data.frame could be.
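To make the panel_shift entry in the listing above more concrete, here is a rough base R sketch of a within-unit lag/lead. It is not the draft code referred to in this comment; the function signature, the `_shift` suffix and the default column names are all assumptions, and it presumes one row per unit-period with no gaps in time.

```r
# Hypothetical panel_shift: lag (k > 0) or lead (k < 0) a variable within each unit.
# Assumes one row per unit-period and no gaps in the time variable.
panel_shift <- function(data, var, k = 1, panel = "cid", time = "year") {
  # Sort so that shifting by position equals shifting by period.
  data <- data[order(data[[panel]], data[[time]]), ]
  shift <- function(v) {
    n <- length(v)
    if (abs(k) >= n) return(rep(NA, n))
    if (k >= 0) {
      c(rep(NA, k), v[seq_len(n - k)])
    } else {
      c(v[(1 - k):n], rep(NA, -k))
    }
  }
  # ave() applies shift() separately within each panel unit.
  data[[paste0(var, "_shift")]] <- ave(data[[var]], data[[panel]], FUN = shift)
  data
}

d <- data.frame(cid  = c("DE", "DE", "FR", "FR"),
                year = c(2001, 2000, 2000, 2001),
                x    = c(2, 1, 3, 4))
lagged <- panel_shift(d, "x", k = 1)
led    <- panel_shift(d, "x", k = -1)
```

The point of shifting within units is that the first observation of each country gets a missing value instead of inheriting the last value of the previous country.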

Re: conventions, I don't have strong preferences for repo organization. I have a tendency to use only lowercase, underscores and periods in names.

@zmjones

zmjones commented Feb 28, 2014

is there any interest in adding the ability to deal with k-adic data in here? i was sort of working on a package for dealing with this sort of data, but it would probably be better to integrate this with what you all are doing.

@antagomir
Member

Tools for k-adic data, what purpose? Open government data or more generic? I think this should go to a separate issue.

@zmjones

zmjones commented Feb 28, 2014

Country data mostly.

@antagomir
Member

Any examples, might help to connect?

@zmjones

zmjones commented Feb 28, 2014

So lots of people in political science and econ look at things that occur between pairs of states (or triads, etc.) and how features of those states condition what the outcome is and such. Trade, conflict, etc. The data manipulations are really similar to what you have to do with monadic country year data. Constructing the data sets is another issue that is a pain point for many researchers.
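As an illustration of how close the dyadic case is to monadic country-year data, the sketch below builds undirected dyad-years with a self-merge in base R. `make_dyads` is a hypothetical name and the rows are invented; real k-adic tooling would need to handle directed dyads, triads and larger data more carefully.

```r
# Hypothetical helper: turn monadic country-year rows into undirected dyad-years.
make_dyads <- function(data, panel = "cid", time = "year") {
  # Pair every row with every other row observed in the same period.
  pairs <- merge(data, data, by = time, suffixes = c("_a", "_b"))
  a <- pairs[[paste0(panel, "_a")]]
  b <- pairs[[paste0(panel, "_b")]]
  # Keep each unordered pair once and drop self-pairs.
  pairs[a < b, ]
}

# Invented monadic data: three countries in 2000, two in 2001.
monadic <- data.frame(cid  = c("DE", "FR", "FI", "DE", "FR"),
                      year = c(2000, 2000, 2000, 2001, 2001),
                      gdp  = c(10, 9, 2, 11, 9.5),
                      stringsAsFactors = FALSE)

dyads <- make_dyads(monadic)
names(dyads)
```

Each dyad row carries both states' attributes side by side (`gdp_a`, `gdp_b`), which is the shape needed to model how features of the two states condition an outcome between them.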

@briatte
Contributor

briatte commented Feb 28, 2014

Merging multilevel data is a related issue, but at that stage, the solution must be some clever dplyr tricks rather than the tricks I coded for CSTS.

@antagomir
Member

Might be good to make dyadic things into their own package but try to coordinate with the other packages? Very welcome to join rOpenGov with the dyadic package as well.

@jlehtoma

jlehtoma commented Mar 1, 2014

+1 for creating a separate "framework" repo to host documents and guidelines. As @antagomir pointed out, for individual packages/frameworks a GitHub wiki (which is a git repo in itself, of course) is a viable option as well.

@jlehtoma

jlehtoma commented Mar 1, 2014

Related to @briatte's comment and still off topic, I recently started the finpar package for various data on the Finnish Parliament using an unofficial API. The API and the underlying database + website are still being developed, but data is already available for individual MP activities (thus potentially also "cosponsorships in amendments, bills and resolutions"). Any co-development in dealing with parliamentary data would therefore be interesting.

@christophergandrud
Contributor Author

Going way back up. I agree that the ability to create dyads, lags, leads, and this sort of thing would be really useful. I also +1 @antagomir's suggestion that they should be separated into their own package. This package would require psData objects, so the two would be closely linked.

On that note: I think a good goal of psData could be to get data from multiple sources (by linking into specific packages), similarly build variables suggested in the literature (e.g. the winset variable in psData now), and merge it into one panel-series object.

Maybe an example work flow could go something like:

# Download and clean QoG and WDI data
qog.data <- set_panel(get_data('qog', vars = 'WHATEVER VARIABLES'))
wdi.data <- set_panel(get_data('wdi', vars = 'WHATEVER VARIABLES'))

# Merge together
combined <- ps_merge(qog.data, wdi.data, all.y = TRUE)

Then combined could be turned into dyadic data, lags created, etc., using functions from another package called something like psManipulate?
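For what it's worth, the merge step in that workflow could be sketched as below. This is only a guess at what ps_merge might do: it assumes each input carries a `"panel"` attribute naming its unit and time columns, that the country codes are already in the same format, and `with_panel` plus the toy values are invented for the example.

```r
# Hypothetical ps_merge: merge two panel data frames on their unit and time columns.
# Assumes each input carries a "panel" attribute naming those columns.
ps_merge <- function(x, y, ...) {
  px <- attr(x, "panel")
  py <- attr(y, "panel")
  stopifnot(!is.null(px), !is.null(py))
  out <- merge(x, y,
               by.x = c(px$panel, px$time),
               by.y = c(py$panel, py$time), ...)
  # The merged result is a new data frame, so re-attach the metadata of x.
  attr(out, "panel") <- px
  out
}

# Invented helper and toy data standing in for set_panel(get_data(...)).
with_panel <- function(data, panel, time) {
  attr(data, "panel") <- list(panel = panel, time = time)
  data
}

qog.data <- with_panel(
  data.frame(cid = c("DE", "FR"), year = c(2000, 2000),
             corruption = c(0.2, 0.3)),
  panel = "cid", time = "year")
wdi.data <- with_panel(
  data.frame(iso2c = c("DE", "FR", "FI"), year = c(2000, 2000, 2000),
             gdp = c(10, 9, 2)),
  panel = "iso2c", time = "year")

combined <- ps_merge(qog.data, wdi.data, all.y = TRUE)
```

Reading the key columns from the metadata means the two data sets do not need identically named identifier columns, which is the main convenience over calling `merge` directly.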

Re syntax style: I don't really have a preference either. In the current version of psData I've tended to use CamelCase. But we could definitely do all lowercase with _ separating words. What about (modified) CamelCase for package names, lowercase/_ for function names and arguments and lowercase/. for object names in the documentation?

A new issue for the country-year data we should eventually discuss: how to deal with the divided countries, e.g. East/West Germany.

@antagomir
Member

@christophergandrud Your latest suggestion sounds feasible to me. We need something along these lines. The best way to move forward is to just get something functional implemented. Then we can learn by experience.

I have now created a new repository for document guidelines: https://github.com/rOpenGov/guidelines-docs (is guidelines-docs a good name? We can still change it if you have better suggestions.) Also let me know if you are missing permissions and I will add them. Ping @christophergandrud @leeper @jlehtoma @muuankarski @briatte

@christophergandrud
Contributor Author

@antagomir Perfect. Later today I'll create a sub directory in https://github.com/rOpenGov/guidelines-docs for psData and put in an .md file to begin working on the first draft of the guidelines.

I'm going to close this thread and direct the conversation over to that repo. It's been great working on these issues here.

@christophergandrud
Contributor Author

Very early/incomplete first version of the guidelines https://github.com/rOpenGov/guidelines-docs/tree/master/psData-guidelines.

Please help edit, improve.

@briatte
Contributor

briatte commented Mar 2, 2014

@jlehtoma just to add: finpar looks good! Let me know if you want me to try making my cosponsorship network functions also work with your data. I'm almost done coding a unified method for bills, resolutions and amendments from both the upper and lower chambers in France, and plan to test it with the U.S. data (initially produced in Matlab) too.

If you plan to develop legislative data as packages, I think @alexstorer has some work on the U.S. Congress, in Python, and Dimiter Toshkov has some data for the European Parliament. It would be great to offer a comparative framework for all of this, but it's a different project that requires a different forum.

@christophergandrud the guidelines look great so far. I'll add stuff inspired by my draft code. And I like the possibility to stick mostly with lowercase, I confess not liking CamelCase very much ;)
