-
Notifications
You must be signed in to change notification settings - Fork 4
/
Copy pathdata_summary.html
318 lines (317 loc) · 12.6 KB
/
data_summary.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
---
layout: docs
---
<h1>What’s in EveryPolitician’s data?</h1>
<p class="section-intro">
Short answer: names. And dates of birth, twitter handles, political
group memberships, email addresses, honorific titles, image URLs, and
all sorts of useful things — all provided we can find them. How
complete this data is depends on the sources that people around the
world have found for us for each country.
</p>
<p>
This is an overview of what you can find in our data. You might also
want to read about the <a href="technical.html">two data formats
available</a> (CSV or JSON), and how the <a
href="data_structure.html">data structure</a> encapsulates
<strong>entities</strong> and <strong>relationships</strong>.
</p>
<p>
The CSV files, which are often easier to work with, contain a useful
distillation of what’s in the JSON. The fields that are used below to
describe the data appear as column headings in the CSV; the same data
will always be present in the JSON too, but in a more structured and
often richer form. For example, where the CSV file may contain one
<code>name</code> for a politician, the JSON data might also include
variations under <code>other_names</code>, mapped by language code.
The JSON often includes more detailed membership information too,
such as start and end dates, and so on. It will usually contain URLs
for the sources of our data.
</p>
<p>
Determine the <em>actual</em> names of the folders and files by
looking in the <code>data/</code> folder, or inside the
<a href="repo_structure.html">index file, <code>countries.json</code></a>.
</p>
<div class="nested-boxes">
<div class="odd-nest">
<ul>
<li>
folders for each <strong>country</strong>
<br>
e.g., <code>Algeria/</code>, <code>American_Samoa/</code>, <code>Albania/</code>
</li>
</ul>
<div>
<ul>
<li>
folders for each <strong>legislature</strong>
<br>
e.g., <code>Assembly/</code>, <code>House/</code>, <code>Senate/</code>
(many countries only have one)
</li>
</ul>
<div class="odd-nest">
<ul>
<li>
JSON file (Popolo format) of all data
(<strong>terms</strong>, <strong>groups</strong>, <strong>persons</strong>)
<br>
e.g., <code>ep-popolo-v1.0.json</code>
</li>
</ul>
<ul>
<li>
CSV file for each <strong>term</strong> containing <strong>persons</strong>:
<br>
e.g., <code>term-19.csv</code>, <code>term-2011.csv</code>, <code>term-1999-02-16.csv</code>
<br>
(in practice, if there’s only one term available, it’ll typically be the current one)
<br>
<div>
<table class="definition">
<tr>
<td><strong>CSV field</strong></td>
<td><strong>use & example value</strong></td>
</tr>
<tr>
<th>id</th>
<td>
unique within EveryPolitician
<em>for this politician</em> (so you can use it to track
people across terms, for example)
<br>
It’s a <a href="https://en.wikipedia.org/wiki/Universally_unique_identifier">UUID</a>,
and looks like
<br><code>2bc9cc09-a33a-42d9-89c3-14effb20b8b0</code>
</td>
</tr>
<tr>
<th>name</th>
<td>
common name for this politician: <code>Anne Example</code>
<br>See <a href="#about-names">more about names</a>
</td>
</tr>
<tr>
<th>sort_name</th>
<td>name for sort ordering: <code>Example, Anne</code></td>
</tr>
<tr>
<th>email</th>
<td>email address: <code>[email protected]</code></td>
</tr>
<tr>
<th>twitter</th>
<td>twitter handle: <code>example</code> (for <code>@example</code>)</td>
</tr>
<tr>
<th>facebook</th>
<td>Facebook name: <code>AnneExample</code> (for <code>facebook.com/AnneExample)</code>
</tr>
<tr>
<th>group</th>
<td>name of political group <code>Example Faction</code></td>
</tr>
<tr>
<th>group_id<span class="superscript">†</span></th>
<td>id for group</td>
</tr>
<tr>
<th>area</th>
<td>name of area represented <code>Example County</code></td>
</tr>
<tr>
<th>area_id<span class="superscript">†</span></th>
<td>id for area</td>
</tr>
<tr>
<th>term<span class="superscript">†</span></th>
<td>id or index for term</td>
</tr>
<tr>
<th>start_date</th>
<td>start of membership (if needed) <code>2010-11-29</code></td>
</tr>
<tr>
<th>end_date</th>
<td>end of membership (if needed) <code>2012-07-14</code></td>
</tr>
<tr>
<th>image</th>
<td>URL to image file <code>http://example.com/example.jpg</code></td>
</tr>
<tr>
<th>gender</th>
<td>gender if known: currently <code>male</code> or <code>female</code></td>
</tr>
</table>
</div>
</li>
</ul>
</div>
</div>
</div>
</div>
<p>
<span class="superscript">†</span>
You can consider fields marked † as identifiers that are
unique within this legislature (unlike <code>id</code> itself, which
identifies the politician within <em>all</em> of EveryPolitician’s data). For
example, all politicians in this legislature with the same
<code>area_id</code> are representing the same area (which is what
you’d expect). Where appropriate, the actual value might be a useful
value. For example, for <code>area_id</code> we like to use <a
href="http://opencivicdata.readthedocs.org/en/latest/ocdids.html">Open
Civic Data IDs</a> like this:
<code>ocd-division/country:us/state:ca/cd:47</code>, but if no such
mapping is available from our sources, we’ll probably use a slug like
<code>area-foo-county</code>. If you can suggest better values
whenever you find us using our own ones, please get in touch!
</p>
<h2>
Legislatures and terms
</h2>
<p>
For every country, we break the data down into
<strong>legislatures</strong> (for example, the “parliament” or
“congress” of the country). Currently, we’re only working with
national, that is, top-level, legislatures. In the future we might
include local or state-level legislatures too.
</p>
<p>
Many countries have a single legislature. Some have two. For example,
the <a
href="http://everypolitician.org/united-states-of-america/">United
States of America</a> has the <em>House of Representatives</em> and
the <em>Senate</em>.
</p>
<p>
Furthermore, we split the data for those down into
<strong>terms</strong>. Like legislatures themselves, the way this
works may vary from country to country. Terms have a start and end
date, and the data therefore describes the membership of the
legislation during that period. In most democracies, new terms begin
after every national election.
</p>
<p>
This is pragmatic because we know online services and sites are
likely to be concerned with the country’s <em>current</em>
legislature, which means the data for politicians in the current
term. Often — but not always — this is also the data we’re most
likely to be able to source online, so if you see a country only has
one term in its data, it will probably be the most recent one. We’re
always interested in adding historic data too (that is, data for
previous terms), so if you know where we can get it for your country,
please let us know. Similarly, if we are lacking the current term
(maybe there’s just been an election?), please let us know — it might
be that our source has been updated and we haven’t pulled down the
latest changes.
</p>
<h2>
Politicians
</h2>
<p>
<strong>The richness of the data we have will depend on the source or
sources from which we’re getting it.</strong> We use a wide variety
of sources, some official and many more that are not. Indeed, the
EveryPolitician project exists partly because in many places it’s
still depressingly hard to get even basic information, such as who all
the current legislators are.
</p>
<p>
So the absolute minimum data we need for a politician to be in
EveryPolitician is their <strong>name</strong> (by implication, we
also have their membership of the legislature and its term). Unless
the system they are in does not use such mechanisms, we also try to
have the <strong>area</strong> they represent, and the
<strong>group</strong> (perhaps a party or faction) they belong to.
There may also be start and end dates if we know they did not serve
the full term.
</p>
<a name="about-names"></a>
<h3>
About names
</h3>
<p>
Names are a key part of any data set based around people. In the
basic CSV, the <code>name</code> will be the common name that we’ve
got from the source, but we’re aware that names are not quite that
simple. So we sometimes have separate fields for:
</p>
<p>
<code>name</code>
<code>given_name</code>
<code>family_name</code>
<code>honorific_prefix</code>
<code>honorific_suffix</code>
<code>patronymic_name</code>
<code>sort_name</code>
</p>
<p>
The JSON data includes <code>other_names</code>, which may contain
aliases as well as names for the politician in different languages.
</p>
<h3>
If you <em>only</em> want names
</h3>
<p>
For convenience, we isolate all the names of all the politicians
within each legislature and make them available in their own file.
(Like other files, look inside <code>countries.json</code> to find
its path and filename; in this case under <code>names</code>). This
file contains <code>name</code> and <code>id</code> fields. Note that
there may be more than one name for each politician (for example, we
might have the name of a Welsh politician rendered in English and
Chinese). If you need to consolidate them, use the <code>id</code> to
match them (or dive into the JSON data instead).
</p>
<p>
In fact, we’ve even built an external service that puts all these
names (together with identifier fields, such as <code>id</code>) into
a single CSV file (caution: that currently contains well over 80,000
names and growing): see
<a href="https://github.com/everypolitician/everypolitician-names/">everypolitician-names</a>.
</p>
<h3>
IDs to other data sets, including Wikidata
</h3>
<p>
We give each politician in EveryPolitician a unique ID (actually a
UUID). It’s keyed as <code>id</code> in the CSV. But often we’ll know
useful IDs for the same politician in other data sets too. You can
find these in the JSON data under <code>identifiers</code>.
</p>
<p>
Each identifier consists of an (<code>identifier</code>,
<code>scheme</code>) pair. For example, here’s an entry for the
identifier in Wikidata (the database behind Wikipedia):
</p>
<pre>
"identifiers": [
{
"identifier": "Q3785077",
"scheme": "wikidata"
}
]
</pre>
<p>
(Incidentally, you can view Wikidata entries like this:
<a href="https://www.wikidata.org/wiki/Q3785077">wikidata.org/wiki/Q3785077</a>,
or using <a href="https://tools.wmflabs.org/reasonator/?q=Q3785077">the Reasonator</a>.
This example is Estonian politician Taavi Rõivas.)
<p>
We do this for other entities (such as legislatures themselves; and
we’re adding external identifiers for political parties or factions
too) as well as politicians. Remember that external data sets might
not be complete, so it’s possible to have a useful identifier which
isn’t populated for <em>all</em> records. For example, Wikidata only
holds data for entries that satisfy their “notability” criteria.
</p>
<p>
Some legislatures have adopted the good practice of assigning unique
IDs to their politicians, which can be useful when using their
official APIs. If you know that your country’s legistature provides
such IDs please tell us, and we’ll add them as identifiers in this way.
</p>