📊 VDEM: new release v14 #2489

Closed
wants to merge 111 commits

Commits (111)

3f4ab15
minor fix in bearing estimation
lucasrodes Mar 5, 2024
cadf714
Merge branch 'master' of https://github.com/owid/etl
lucasrodes Mar 5, 2024
d101cb1
Merge branch 'master' of https://github.com/owid/etl
lucasrodes Mar 5, 2024
d0f2cd1
Merge branch 'master' of https://github.com/owid/etl
lucasrodes Mar 5, 2024
1540a37
Merge branch 'master' of https://github.com/owid/etl
lucasrodes Mar 6, 2024
b296f02
Merge branch 'master' of https://github.com/owid/etl
lucasrodes Mar 6, 2024
408b8f3
Merge branch 'master' of https://github.com/owid/etl
lucasrodes Mar 7, 2024
45b81c1
Merge branch 'master' of https://github.com/owid/etl
lucasrodes Mar 7, 2024
a33001c
Merge branch 'master' of https://github.com/owid/etl
lucasrodes Mar 7, 2024
6f4ffc5
Merge branch 'master' of https://github.com/owid/etl
lucasrodes Mar 8, 2024
f44abe0
git checkoutMerge branch 'master' of https://github.com/owid/etl
lucasrodes Mar 8, 2024
2b2b475
Merge branch 'master' of https://github.com/owid/etl
lucasrodes Mar 11, 2024
37df27e
Merge branch 'master' of https://github.com/owid/etl
lucasrodes Mar 14, 2024
2480846
Merge branch 'master' of https://github.com/owid/etl
lucasrodes Mar 15, 2024
9277d98
Merge branch 'master' of https://github.com/owid/etl
lucasrodes Mar 15, 2024
d6daad3
Merge branch 'master' of https://github.com/owid/etl
lucasrodes Mar 15, 2024
7ae1c94
Merge branch 'master' of https://github.com/owid/etl
lucasrodes Mar 18, 2024
94fb55d
Merge branch 'master' of https://github.com/owid/etl
lucasrodes Mar 18, 2024
72726d1
snapshot of VDEM 14
lucasrodes Mar 18, 2024
361cbf3
read file in zip folder
lucasrodes Mar 18, 2024
6b02f8b
Merge branch 'feat/snap-read-in-zip' into data/vdem
lucasrodes Mar 18, 2024
4693871
read file in zip/tar folder
lucasrodes Mar 18, 2024
6e3804d
minor tweaks
lucasrodes Mar 18, 2024
acc2262
Merge branch 'feat/snap-read-in-zip' into data/vdem
lucasrodes Mar 18, 2024
24df220
meadow
lucasrodes Mar 18, 2024
5e4947e
only load relevant columns
lucasrodes Mar 19, 2024
97c3f44
wip
lucasrodes Mar 19, 2024
71a9f0d
Merge branch 'master' of https://github.com/owid/etl
lucasrodes Mar 19, 2024
cb045e7
Merge branch 'master' into data/vdem
lucasrodes Mar 19, 2024
cb4a45a
enhance: generalise fillna table method
lucasrodes Mar 19, 2024
9fed765
wip
lucasrodes Mar 19, 2024
41c3ac2
restructure argument
lucasrodes Mar 19, 2024
f324c05
Merge branch 'enhance/table-fillna' into data/vdem
lucasrodes Mar 19, 2024
94c0710
clean step
lucasrodes Mar 20, 2024
d2e6872
Merge branch 'master' of https://github.com/owid/etl
lucasrodes Mar 20, 2024
17e49fc
Merge branch 'master' of https://github.com/owid/etl
lucasrodes Mar 20, 2024
f8ef2be
Merge branch 'master' into data/vdem
lucasrodes Mar 20, 2024
91c6ae8
Merge branch 'master' of https://github.com/owid/etl
lucasrodes Mar 20, 2024
3785a94
Merge branch 'master' into data/vdem
lucasrodes Mar 20, 2024
0967d6e
wip
lucasrodes Mar 20, 2024
94c288c
Merge branch 'master' of https://github.com/owid/etl
lucasrodes Mar 20, 2024
3c5c669
Merge branch 'master' into data/vdem
lucasrodes Mar 20, 2024
f9ba671
wip: impute
lucasrodes Mar 20, 2024
1a2f721
Merge branch 'master' of https://github.com/owid/etl
lucasrodes Mar 20, 2024
7d4c1cf
git checMerge branch 'master' of https://github.com/owid/etl
lucasrodes Mar 21, 2024
8dc430b
Merge branch 'master' of https://github.com/owid/etl
lucasrodes Mar 21, 2024
c36035e
Merge branch 'master' of https://github.com/owid/etl
lucasrodes Mar 21, 2024
e7785ba
Merge branch 'master' into data/vdem
lucasrodes Mar 21, 2024
bd7181e
wip
lucasrodes Mar 21, 2024
e801908
wip
lucasrodes Mar 21, 2024
353548c
wip
lucasrodes Mar 21, 2024
9d94240
impute country list
lucasrodes Mar 21, 2024
1f0f650
wip
lucasrodes Mar 22, 2024
c41398f
Merge branch 'master' into data/vdem
lucasrodes Mar 22, 2024
fa73ac1
hotfix
lucasrodes Mar 22, 2024
7cde990
wip: vdem garden
lucasrodes Mar 22, 2024
40b4193
wip: grapher
lucasrodes Mar 22, 2024
9fa4386
wip
lucasrodes Mar 22, 2024
15d95f0
Merge branch 'master' into data/vdem
lucasrodes Mar 26, 2024
705dae8
wip: metadata
lucasrodes Mar 26, 2024
35a2882
wip: metadata descriptions
lucasrodes Mar 27, 2024
83ec5d2
wip: metadata
lucasrodes Mar 27, 2024
6ef92fa
wip: add missing metadata
lucasrodes Mar 28, 2024
1d88e9b
wip
lucasrodes Mar 28, 2024
9d0e28f
wip
lucasrodes Mar 28, 2024
9917f12
Merge branch 'master' into data/vdem
lucasrodes Mar 28, 2024
58003c4
Merge branch 'master' into data/vdem
lucasrodes Mar 29, 2024
beef8ec
wip
lucasrodes Mar 29, 2024
a873af6
enhance: use former country names
lucasrodes Apr 1, 2024
d3d5974
wip: aggregate
lucasrodes Apr 1, 2024
1363379
Merge branch 'master' into data/vdem
lucasrodes Apr 1, 2024
e871162
wip
lucasrodes Apr 2, 2024
2b24aa6
hotfix
lucasrodes Apr 2, 2024
06f59ea
ci/cd
lucasrodes Apr 2, 2024
bcf9318
hotfix
lucasrodes Apr 2, 2024
db3a6ce
Merge branch 'master' into data/vdem
lucasrodes Apr 2, 2024
31a7ada
hotfix: population-weight
lucasrodes Apr 2, 2024
702b436
wip
lucasrodes Apr 2, 2024
d27ef5e
ci/cd
lucasrodes Apr 2, 2024
77b35df
hotfix bmr
lucasrodes Apr 2, 2024
0711d12
fix: widget key was duplicate
lucasrodes Apr 3, 2024
435adfc
Merge branch 'fix/chart-upgrader' into data/vdem
lucasrodes Apr 3, 2024
dc63eee
enhance regime indicator's metadata
lucasrodes Apr 3, 2024
36e751c
giMerge branch 'master' into data/vdem-2
lucasrodes Apr 3, 2024
8a2f461
fix imputes
lucasrodes Apr 3, 2024
b8d8d2f
fix unstandardised name
lucasrodes Apr 3, 2024
76e4b85
fix world aggregate
lucasrodes Apr 4, 2024
97a114a
title fix
lucasrodes Apr 5, 2024
92e5dad
fix World estimates
lucasrodes Apr 5, 2024
79e2276
show number of suggestions if limit is exceeded
lucasrodes Apr 5, 2024
46ed54d
fix count of countries (use age not experience)
lucasrodes Apr 6, 2024
52eb861
update number of decimals
lucasrodes Apr 6, 2024
45584f9
decimals: 2
lucasrodes Apr 6, 2024
8af4210
edit number of decimal places
lucasrodes Apr 9, 2024
85a15d4
Merge branch 'master' into data/vdem-2
lucasrodes Apr 9, 2024
42b9f88
Merge branch 'master' into data/vdem-2
lucasrodes Apr 9, 2024
6c40d49
chart-upgrader: drop rows with old=new
lucasrodes Apr 9, 2024
0703c04
wizard: chart-upgrader improve explore mode
lucasrodes Apr 9, 2024
d65cab1
fix infinity %
lucasrodes Apr 9, 2024
83e93b2
rollback to show number of variables
lucasrodes Apr 9, 2024
59175c2
Merge branch 'master' into data/vdem-2
lucasrodes Apr 18, 2024
d233d8a
ensure variable is boolean
lucasrodes Apr 18, 2024
284c2d3
improve explore mode
lucasrodes Apr 18, 2024
de8deea
Merge branch 'master' into data/vdem-2
lucasrodes Apr 18, 2024
62ba9da
fix origins for table
lucasrodes Apr 18, 2024
1095d8b
remove countries in population-weighted indicators
lucasrodes Apr 19, 2024
41d45df
Merge branch 'master' into data/vdem-2
lucasrodes Apr 19, 2024
ab1774b
set aggregate to NaN in pop-weighted indicators
lucasrodes Apr 19, 2024
7e0b045
fix: confusion with egal_vdem and egaldem_vdem
lucasrodes Apr 19, 2024
2747540
add default entities with colours
lucasrodes Apr 19, 2024
d3cda20
Merge branch 'master' into data/vdem-2
lucasrodes Apr 19, 2024
23 changes: 16 additions & 7 deletions apps/wizard/pages/charts/variable_config.py
@@ -218,7 +218,7 @@ def ask_and_get_variable_mapping(search_form) -> "VariableConfig":
# Show only first 100 variables to map (otherwise app crashes)
if len(suggestions) > LIMIT_VARS_TO_MAP:
st.warning(
f"Too many variables to map! Showing only the first {LIMIT_VARS_TO_MAP}. If you want to map more variables, do it in batches. That is, first map this batch and approve the generated chart revisions in admin. Once you are done, run this app again. Make sure you have approved the previously generated revisions!"
f"Too many variables to map ({len(suggestions)})! Showing only the first {LIMIT_VARS_TO_MAP}. If you want to map more variables, do it in batches. That is, first map this batch and approve the generated chart revisions in admin. Once you are done, run this app again. Make sure you have approved the previously generated revisions!"
)
suggestions = suggestions[:LIMIT_VARS_TO_MAP]

@@ -313,6 +313,7 @@ def ask_and_get_variable_mapping(search_form) -> "VariableConfig":
def show_explore_df(df_data, variable_old, variable_new, variable_id_to_display, element_check) -> None:
if element_check: # type: ignore
with st.container(border=True):
# plot_comparison_two_variables(df_data, variable_old, variable_new, variable_id_to_display) # type: ignore
try:
plot_comparison_two_variables(df_data, variable_old, variable_new, variable_id_to_display) # type: ignore
except Exception:
@@ -363,9 +364,15 @@ def build_df_comparison_two_variables_cached(df, variable_old, variable_new, var
df_variables.loc[:, "value"] = df_variables.value.astype(float)
# Reshape dataframe
df_variables = df_variables.pivot(index=["entityName", "year"], columns="variableId", values="value").reset_index()
df_variables["Relative difference (abs, %)"] = (
(100 * (df_variables[variable_old] - df_variables[variable_new]) / df_variables[variable_old]).round(2)
mask = df_variables[variable_old] == 0
df_variables.loc[~mask, "Relative difference (abs, %)"] = (
(
100
* (df_variables.loc[~mask, variable_old] - df_variables.loc[~mask, variable_new])
/ df_variables.loc[~mask, variable_old]
).round(2)
).abs()
df_variables.loc[mask, "Relative difference (abs, %)"] = float("inf")
df_variables = df_variables.rename(columns=var_id_to_display).sort_values(
"Relative difference (abs, %)", ascending=False
)
@@ -386,13 +393,15 @@ def plot_comparison_two_variables(df, variable_old, variable_new, var_id_to_disp
# st.write(countries)
# if countries:
# df_variables = df_variables[df_variables["entityName"].isin(countries)]
score = round(100 - df_variables["Relative difference (abs, %)"].mean(), 1)
relative_diff = df_variables["Relative difference (abs, %)"]
relative_diff.loc[relative_diff == float("inf")] = float("nan")
score = round(100 - relative_diff.mean(), 1)
if score == 100:
score = round(100 - df_variables["Relative difference (abs, %)"].mean(), 2)
score = round(100 - relative_diff.mean(), 2)
if score == 100:
score = round(100 - df_variables["Relative difference (abs, %)"].mean(), 3)
score = round(100 - relative_diff.mean(), 3)
if score == 100:
score = round(100 - df_variables["Relative difference (abs, %)"].mean(), 4)
score = round(100 - relative_diff.mean(), 4)
num_nan_score = df_variables["Relative difference (abs, %)"].isna().sum()

nrows_0 = df_variables.shape[0]
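The last two hunks above guard the relative-difference computation against division by zero and keep the resulting infinities out of the similarity score. A standalone pandas sketch of the same idea (the "old"/"new" column names and the toy values are illustrative, not the app's actual schema or Table/caching layers):

import numpy as np
import pandas as pd

# Toy data: paired values of an old and a new variable (names are illustrative).
df = pd.DataFrame({"old": [10.0, 0.0, 5.0], "new": [9.0, 1.0, 5.0]})

# Only divide where the old value is non-zero...
mask = df["old"] == 0
df.loc[~mask, "Relative difference (abs, %)"] = (
    (100 * (df.loc[~mask, "old"] - df.loc[~mask, "new"]) / df.loc[~mask, "old"]).round(2)
).abs()
# ...and mark zero-denominator rows explicitly as infinite.
df.loc[mask, "Relative difference (abs, %)"] = np.inf

# For the aggregate score, drop the infinities so the mean stays finite.
relative_diff = df["Relative difference (abs, %)"].replace(np.inf, np.nan)
score = round(100 - relative_diff.mean(), 1)  # -> 95.0 for this toy data
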
10 changes: 10 additions & 0 deletions dag/democracy.yml
@@ -8,3 +8,13 @@ steps:
- data://garden/demography/2023-03-31/population
data://grapher/democracy/2024-03-07/bmr:
- data://garden/democracy/2024-03-07/bmr

# Varieties of Democracy (2024)
data://meadow/democracy/2024-03-07/vdem:
- snapshot://democracy/2024-03-18/vdem.zip
data://garden/democracy/2024-03-07/vdem:
- data://meadow/democracy/2024-03-07/vdem
- data://garden/regions/2023-01-01/regions
- data://garden/demography/2023-03-31/population
data://grapher/democracy/2024-03-07/vdem:
- data://garden/democracy/2024-03-07/vdem
122 changes: 23 additions & 99 deletions etl/steps/data/garden/democracy/2024-03-07/bmr.py
@@ -41,6 +41,7 @@
import yaml
from owid.catalog import Dataset, Variable
from owid.catalog.tables import Table, concat
from shared import add_population_in_dummies, expand_observations, from_wide_to_long

from etl.data_helpers import geo
from etl.helpers import PathFinder, create_dataset
@@ -443,7 +444,28 @@ def make_tables_population_counters(tb: Table, ds_regions: Dataset, ds_populatio
tb_ = make_table_with_dummies(tb_, ds_regions)

# Add population in dummies (population value replaces 1, 0 otherwise)
tb_ = add_population_in_dummies(tb_, ds_population)
tb_ = add_population_in_dummies(
tb_,
ds_population,
expected_countries_without_population=[
"Pakistan (former)",
"Korea (former)",
"Duchy of Parma and Piacenza",
"Orange Free State",
"Federal Republic of Central America",
"Grand Duchy of Tuscany",
"Democratic Republic of Vietnam",
"Kingdom of Saxony",
"Duchy of Modena and Reggio",
"Kingdom of the Two Sicilies",
"Kingdom of Sardinia",
"Great Colombia",
"Grand Duchy of Baden",
"Kingdom of Wurttemberg",
"Republic of Vietnam",
"Kingdom of Bavaria",
],
)

# Get region aggregates
tb_ = geo.add_regions_to_table(
@@ -569,42 +591,6 @@ def make_table_with_dummies(
return tb_


def from_wide_to_long(tb: Table) -> Table:
"""Format a particular shape of table from wide to long format.

The expected input table format is:

| year | country | indicator_a_1 | indicator_a_2 | indicator_b_1 | indicator_b_2 |
|------|---------|---------------|---------------|---------------|---------------|
| 2000 | USA | 1 | 2 | 3 | 4 |
| 2000 | CAN | 5 | 6 | 7 | 8 |

The generated output is:

| year | country | category | indicator_a | indicator_b |
|------|---------|------------|-------------|-------------|
| 2000 | USA | category_1 | 1 | 3 |
| 2000 | USA | category_2 | 2 | 4 |
"""
# Melt the DataFrame to long format
tb = tb.melt(id_vars=["year", "country"], var_name="indicator_type", value_name="value")

# Extract indicator names and types
tb["indicator"] = tb["indicator_type"].apply(lambda x: "_".join(x.split("_")[:-1]))
tb["category"] = tb["indicator_type"].apply(lambda x: x.split("_")[-1])

# Drop the original 'indicator_type' column as it's no longer needed
tb.drop("indicator_type", axis=1, inplace=True)

# Pivot the table to get 'indicator_a' and 'indicator_b' as separate columns
tb = tb.pivot(index=["year", "country", "category"], columns="indicator", values="value").reset_index()

# Rename the columns to match your requirements
tb.columns.name = None # Remove the hierarchy

return tb


def expand_observations_without_leading_to_duplicates(tb: Table, ds_regions: Dataset) -> Table:
"""Expand observations (accounting for overlaps between former and current countries).

@@ -636,68 +622,6 @@ def expand_observations_without_leading_to_duplicates(tb: Table, ds_regions: Dat
return tb


def expand_observations(tb: Table) -> Table:
"""Expand to have a row per (year, country)."""
# Add missing years for each triplet ("warcode", "campcode", "ccode")

# List of countries
regions = set(tb["country"])

# List of possible years
years = np.arange(tb["year"].min(), tb["year"].max() + 1)

# New index
new_idx = pd.MultiIndex.from_product([years, regions], names=["year", "country"])

# Reset index
tb = tb.set_index(["year", "country"]).reindex(new_idx).reset_index()

# Type of `year`
tb["year"] = tb["year"].astype("int")
return tb


def add_population_in_dummies(tb: Table, ds_population: Dataset):
# Add population column
tb = geo.add_population_to_table(
tb,
ds_population,
interpolate_missing_population=True,
expected_countries_without_population=[
"Pakistan (former)",
"Korea (former)",
"Duchy of Parma and Piacenza",
"Orange Free State",
"Federal Republic of Central America",
"Grand Duchy of Tuscany",
"Democratic Republic of Vietnam",
"Kingdom of Saxony",
"Duchy of Modena and Reggio",
"Kingdom of the Two Sicilies",
"Kingdom of Sardinia",
"Great Colombia",
"Grand Duchy of Baden",
"Kingdom of Wurttemberg",
"Republic of Vietnam",
"Kingdom of Bavaria",
],
)
tb = cast(Table, tb.dropna(subset="population"))
# Add metadata (origins combined indicator+population)
cols = [col for col in tb.columns if col not in ["year", "country", "population"]]
meta = {col: tb[col].metadata for col in cols} | {"population": tb["population"].metadata}
## Encode population in indicators: Population if 1, 0 otherwise
tb[cols] = tb[cols].multiply(tb["population"], axis=0)
tb = tb.drop(columns="population")
## Add metadata back (combine origins from population)
for col in cols:
metadata = meta[col]
metadata.origins += meta["population"].origins
tb[col].metadata = meta[col]

return tb


def _get_countries_to_ignore_population(ds_regions: Dataset) -> Set[str]:
"""List of countries to ignore when working with population.

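The bulk of this file's change is moving `from_wide_to_long`, `expand_observations`, and `add_population_in_dummies` into the new `shared.py` module (below), with the list of expected countries without population data now passed explicitly at the call site. A toy illustration of what the population-weighting helper effectively does to dummy columns (country names, column names, and numbers are made up; the real helper reads population from the population dataset and also merges origins metadata):

import pandas as pd

# Dummy columns are 0/1 flags per (year, country).
tb = pd.DataFrame({
    "year": [2000, 2000],
    "country": ["A", "B"],
    "regime_democracy": [1, 0],
    "regime_autocracy": [0, 1],
})
population = pd.Series({"A": 10_000, "B": 5_000})

# Replace each 1 with the country's population (0 stays 0), so that summing over a
# region later yields "people living under <category>".
cols = ["regime_democracy", "regime_autocracy"]
tb[cols] = tb[cols].multiply(tb["country"].map(population), axis=0)
# -> regime_democracy: [10000, 0], regime_autocracy: [0, 5000]
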
118 changes: 118 additions & 0 deletions etl/steps/data/garden/democracy/2024-03-07/shared.py
@@ -0,0 +1,118 @@
from typing import Callable, List, Optional, cast

import numpy as np
import pandas as pd
from owid.catalog import Dataset, Table

from etl.data_helpers import geo


def from_wide_to_long(
tb: Table,
indicator_name_callback: Optional[Callable] = None,
indicator_category_callback: Optional[Callable] = None,
column_dimension_name: str = "category",
) -> Table:
"""Format a particular shape of table from wide to long format.

tb: Table with wide format.
indicator_name_callback: Function to extract the indicator name from the column name.
indicator_category_callback: Function to extract the indicator category from the column name.

If no `indicator_name_callback` and `indicator_category_callback` are provided, it expects the following input:

| year | country | indicator_a_1 | indicator_a_2 | indicator_b_1 | indicator_b_2 |
|------|---------|---------------|---------------|---------------|---------------|
| 2000 | USA | 1 | 2 | 3 | 4 |
| 2000 | CAN | 5 | 6 | 7 | 8 |

and then generates the output:

| year | country | category | indicator_a | indicator_b |
|------|---------|------------|-------------|-------------|
| 2000 | USA | category_1 | 1 | 3 |
| 2000 | USA | category_2 | 2 | 4 |
"""
# Melt the DataFrame to long format
tb = tb.melt(id_vars=["year", "country"], var_name="indicator_type", value_name="value")

# Get callables
if indicator_name_callback is None:

def default_indicator_name(x):
return "_".join(x.split("_")[:-1])

indicator_name_callback = default_indicator_name

if indicator_category_callback is None:

def default_indicator_category(x):
return x.split("_")[-1]

indicator_category_callback = default_indicator_category

# Extract indicator names and types
tb["indicator"] = tb["indicator_type"].apply(indicator_name_callback)
tb[column_dimension_name] = tb["indicator_type"].apply(indicator_category_callback)

# Drop the original 'indicator_type' column as it's no longer needed
tb.drop("indicator_type", axis=1, inplace=True)

# Pivot the table to get 'indicator_a' and 'indicator_b' as separate columns
tb = tb.pivot(index=["year", "country", column_dimension_name], columns="indicator", values="value").reset_index()

# Rename the columns to match your requirements
tb.columns.name = None # Remove the hierarchy

return tb


def expand_observations(tb: Table) -> Table:
"""Expand to have a row per (year, country)."""
# Add missing years for each triplet ("warcode", "campcode", "ccode")

# List of countries
regions = set(tb["country"])

# List of possible years
years = np.arange(tb["year"].min(), tb["year"].max() + 1)

# New index
new_idx = pd.MultiIndex.from_product([years, regions], names=["year", "country"])

# Reset index
tb = tb.set_index(["year", "country"]).reindex(new_idx).reset_index()

# Type of `year`
tb["year"] = tb["year"].astype("int")
return tb


def add_population_in_dummies(
tb: Table,
ds_population: Dataset,
expected_countries_without_population: Optional[List[str]] = None,
drop_population: bool = True,
):
# Add population column
tb = geo.add_population_to_table(
tb,
ds_population,
interpolate_missing_population=True,
expected_countries_without_population=expected_countries_without_population,
)
tb = cast(Table, tb.dropna(subset="population"))
# Add metadata (origins combined indicator+population)
cols = [col for col in tb.columns if col not in ["year", "country", "population"]]
meta = {col: tb[col].metadata for col in cols} | {"population": tb["population"].metadata}
## Encode population in indicators: Population if 1, 0 otherwise
tb[cols] = tb[cols].multiply(tb["population"], axis=0)
if drop_population:
tb = tb.drop(columns="population")
## Add metadata back (combine origins from population)
for col in cols:
metadata = meta[col]
metadata.origins += meta["population"].origins
tb[col].metadata = meta[col]

return tb
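
For reference, a standalone pandas sketch of the reshaping that `from_wide_to_long` performs with its default callbacks (the real helper operates on owid.catalog Tables and preserves metadata; the toy frame mirrors the docstring's example):

import pandas as pd

tb = pd.DataFrame({
    "year": [2000, 2000],
    "country": ["USA", "CAN"],
    "indicator_a_1": [1, 5],
    "indicator_a_2": [2, 6],
    "indicator_b_1": [3, 7],
    "indicator_b_2": [4, 8],
})

# Melt to long, split "<indicator>_<category>" on the last underscore, then pivot back
# so each indicator becomes a column and the category becomes a dimension.
long = tb.melt(id_vars=["year", "country"], var_name="indicator_type", value_name="value")
long["indicator"] = long["indicator_type"].apply(lambda x: "_".join(x.split("_")[:-1]))
long["category"] = long["indicator_type"].apply(lambda x: x.split("_")[-1])
long = long.drop(columns="indicator_type")
long = long.pivot(index=["year", "country", "category"], columns="indicator", values="value").reset_index()
long.columns.name = None
# -> one row per (year, country, category), with columns indicator_a and indicator_b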